Wednesday, August 1, 2007

On BizTalk, Assemblies and Deployment, pt. 1

This is the first article in a multi-part series about assemblies and how they relate to BizTalk. This part covers what you need to know about assemblies in general in order to understand what they do and how to deploy them properly. I don't endeavor to cover every detail of .NET assemblies and how they are put together, but this hits most of the big topics. It's a bit long, so feel free to just skim.

The concept of assemblies and the GAC can be very confusing to a BizTalk developer who hasn't interacted with these ideas in .NET before. Fortunately, if you don't want to spend the time, there's no need to have a complete, detailed understanding of what assemblies are and how they work. By grasping a few simple ideas, working with assemblies in BizTalk becomes very easy, and gives you great insight into how BizTalk and .NET actually work with assemblies.

To get a good overall idea of what an assembly is, Wikipedia has a fairly descriptive and high-level article about it: http://en.wikipedia.org/wiki/.net_assembly.

Here's the short-short version: an assembly contains your compiled code, resources, and some metadata about that code and those resources. That's about it. Typically, a "project" in Visual Studio (which can consist of multiple code files, resources, references to other assemblies, etc.) equates to an assembly - a single .dll or .exe file. When you build that project it compiles your code and puts all of those other resources together in a big pile and stuffs it into a .dll or .exe. Additionally, it creates an "index" of machine-readable metadata that is incredibly useful - it contains all the information about what callable methods are in that code, it powers Visual Studio's IntelliSense functionality, and generally provides enough information to let everything outside of the assembly know what it does - just not how exactly it does it. Whenever you create any kind of application, all of the code and resources used to run it are in one or more .dll or .exe files. If you make a lot of little one-off "toolbelt" applications that are contained in a single Visual Studio project, that .exe that you get after you build has all of the code in it. If you were to spread the code over multiple projects and add the appropriate references within those projects, then when you ran the .exe and did something in your application that required code in one of those assemblies, the assembly it needs must be in the right place or the application will throw an exception. What is "the right place" for an assembly? I'll discuss that momentarily.

The next important bit about assemblies is how they are named. A name might seem trivial, but it contains a lot of information and is a guaranteed unique identifier for the assembly. The full name of an assembly has four parts: The short name, the version, the culture, and the public key token.
  • The short name is typically the name of the assembly without the file extension, e.g. NW.Applications.MyAssembly. This short name is also typically used as the namespace of all of the modules in the assembly - a namespace is simply something that helps to uniquely identify a module. There's lots of code in the universe, and it's highly likely that for every module, there's another module out there with the same name - let's take a hypothetical module called NumberCruncher. The namespace is like a surname that gets tacked onto it, usually containing a company name or a product name, that guarantees that those two modules with the same name are still unique - NumberCruncher in the NW.Applications.MyAssembly namespace is different from the NumberCruncher module in the ABC.Software.EnterpriseApps namespace. For this reason, many assembly names are like my sample one above: multiple tokens separated with dots, each token representing some kind of arbitrary hierarchy that's unique to my company or organization.
  • The version number of the assembly, presented like so: 1.0.4.0. The names for each of those values are "major version," "minor version", "build" and "revision." Major version is typically used to signify the product version: Is this SuperToolbox 2 or 3? Minor version represents things like service packs or patches. Build number is typically incremented by the developer every time the assembly is built - this number is often represented by four digits (filling in zeroes if necessary) because builds happen thousands of times. The last number can be used for things like hot fixes. Some developers may increment the last two numbers based on the date and time the build was completed at. Here's the important bit about versions: Two versions of the same assembly can co-exist within the GAC (keep reading), and are unique! Note that an assembly has an "assembly version" and a "file version." The one that really counts here is the assembly version. These two version numbers don't necessarily have to always be the same: one way to manage version numbers is to only increment the file version while you are developing. This helps avoid confusion with version numbers when you keep redeploying your code for testing - the assembly version always stays the same, but the file version (which does not make an assembly unique, but can still be viewed) can be used to determine exactly what build you are using.
  • Culture provides information about the language that the assembly is presented in (human language, not programming language). This will typically be "neutral."
  • Public key token: a 16-character hexadecimal string, the public half of a public-key cryptography pair. I won't discuss the details of how public-key cryptography works here, but the short and long of it is that only people who have the private half of the key can generate assemblies that have that public half of the key. This token essentially ensures that the assembly has come from a certain author and is guaranteed authentic. An assembly does not have to be given a public key token. If it has one, it can be called a "strongly-named assembly." Assigning a strong name is done in Visual Studio using a .snk file, which can be generated by using a Visual Studio command-line tool. An assembly must be strongly-named to be placed in the GAC.
So... what's the GAC? The GAC is the Global Assembly Cache, a universal repository of strong-named assemblies on your machine. Any assembly placed here is globally accessible and can be referenced easily and shared by multiple applications. An assembly can co-exist here with other assemblies that have the same short name, so long as they have different versions. Try browsing to C:\Windows\Assembly (the Assembly folder in your Windows install directory, wherever that might be): what you will see isn't actually a folder on your disk, but a specially-crafted view of all the assemblies in the GAC. There's a folder structure under there, but it's generally irrelevant to human beings. Try right-clicking on a few assemblies and click Properties to see interesting info about them.

This is where I get to the part about where assemblies need to be placed so they can be used. The part of .NET that finds assemblies when it needs them is called Fusion. If you are running an application that has references to other assemblies and it needs one of those other assemblies, Fusion kicks in and looks for assemblies in the following places in this order (this is direct from the .NET Assembly article on Wikipedia):
  1. If the assembly is strongly named it will first look in the GAC (your app knows if the assembly is strongly named because it captures this information from the assembly when you add a reference to it in your project).
  2. Fusion will then look for redirection information in the application's configuration file. If the library is strongly named then this can specify that another version should be loaded, or it can specify an absolute address of a folder on the local hard disk, or the URL of a file on a web server. If the library is not strongly named, then the configuration file can specify a subfolder beneath the application folder to be used in the search path.
  3. Fusion will then look for the assembly in the application folder with either the extension .exe or .dll.
  4. Fusion will look for a subfolder with the same name as the short name (PE file name) of the assembly and then looks for the assembly in that folder with either the extension .exe or .dll.
So, as you can see, you can essentially put an assembly anywhere as long as you configure your application properly. However, for ease of use and universal understanding, most people will either GAC their assemblies or put them in the application folder. Ah, I almost forgot to mention how to GAC an assembly: use gacutil.exe (the path should already exist in a Visual Studio Tools command prompt; on my machine it's located at C:\Program Files\Microsoft Visual Studio 8\SDK\v2.0\Bin\gacutil.exe) with the /i switch and the path to the assembly to GAC. You can use /if to "force" the installation, for example if the assembly already exists and you are reinstalling.

That's it for the lecture on assemblies. Next post I'll talk about how assemblies and the GAC work with BizTalk - how to deploy and redeploy assemblies, what needs to be GACed and what doesn't, where to put things, etc.