By now, all the hype about Microsoft® .NET has probably gotten you to at least install Visual Studio.NET Beta 1 and take a look at this new beast. Your first thought is that it does look interesting, and judging from the samples there's quite a bit you can do with the beta. However, if you are anything like me, once you got past Hello World! you were in for a shock. This .NET thing isn't just another language and it's certainly not just another class library; it's a whole new development environment! Consequently, .NET can sometimes seem a little too daunting to comprehend.
When I made my transition from MS-DOS® to Windows® 3.0 and I became confused about what was going on (I think I just dated myself a little bit there!), I stopped and got to the assembly language level so I could get a clue. One beautiful thing about assembly language (also known as unambiguous mode) is that it never lies. When working on my transition from Win32® to .NET, my world became a little topsy-turvy. I was lost without my assembly language crutch. While I could look at the Intel assembly language in the debuggers, that didn't help much because it didn't relate back to anything in my source code. All of a sudden, I was in a world of hurt, and I couldn't see what was going on.
How did I decide to digest this elephant-size mound of stuff known as .NET? By taking a single bite at a time. I wanted to start at the most atomic level so I could see a single operation at a time and build up the rest. That's when I found my new best friend: Intermediate Language Disassembler (ILDASM). ILDASM allows you to see the pseudo assembly language for .NET and it's the only way you can see the who, what, when, where, and why of .NET. While I will probably never write major programs in Microsoft intermediate language (MSIL), knowing your way around the assembly language certainly helps. Additionally, while the Visual Studio.NET documentation is excellent for a Beta 1 release, there are still plenty of holes marked by "[To be supplied]." When trying to figure out how to use some of the .NET runtime classes, I've had to resort to looking at the disassembly to see how it works. In this edition of Bugslayer, I want to introduce some of the core MSIL instructions and show what various constructs look like so you can get up to speed with .NET. Before I jump into the instructions, I will cover a little bit about what you will see in some of those text windows that pop up all over the place in ILDASM.
ILDASM Basics
The Beta 1 SDK documentation barely covers ILDASM, so I thought I would discuss a few topics that will help you out when using it. The first thing that's interesting about ILDASM is that it is a complete "round-trip" disassembler. In other words, pump the output of the disassembly through ILASM, the MSIL assembler, and it will produce a good binary file. While most of you will never need to program in MSIL directly, some of you will be interested in moving your special language compilers over to the common language runtime (CLR). The easy way to do that is to generate .IL files and run them through ILASM, so the disassembly output will be important to you. Since there are so few examples of programming directly in MSIL at this point, that output file is all the documentation you will have.
Before you jump into ILDASM, the first thing you should do is run ILDASM with the /? option to see some help output on all the options. If you are interested in the text file, the /OUT=<file name> option will send all output to the specified file. One command-line option that is not listed is the /ADV option. Turning on /ADV will allow you to dump additional information about the file. This info mostly concerns metadata information and other file statistics, but if you need this information, this is the only way to get it. If you use /ADV with the GUI, it will add three new menu items to the View menu:
- COR Header lets you view the file header information.
- Statistics lets you view various file statistics.
- Metainfo is a pop-out menu where you select the items to see, then choose the Show! item, or press Ctrl+M, to see the specific information. If none of the pop-out menu items are selected, you will see all the metadata information.
Figure 1 ILDASM
When you first fire up ILDASM, the GUI shows something like Figure 1. There are all sorts of symbols and different types shown in the tree. Additionally, if you choose to dump the tree to a file, there are three-letter acronyms associated with each node. Since it's a little confusing to see what each glyph and the text describes, I've created a chart (see Figure 2) that defines all of them for you.
Figure 3 ILDASM User Interface
In the ILDASM GUI, it's simple to get more information about an item: simply double-click on it. Parent nodes will expand and child nodes will pop up a new window showing the disassembly, declaration, or information, depending on the item. If you are looking at something like Figure 3, you are ready to learn MSIL assembly language! The last tip I will mention about ILDASM is that it fully supports drag and drop, so you can easily move from file to file to hunt down exactly which module holds which class and method for an assembly.
CLR Basics
Before you start grinding through MSIL instructions, I need to introduce a little bit about how the CLR works because it is essentially the CPU for the instructions. Where traditional CPUs rely on registers and stacks to do everything, the CLR uses only a stack. That means that to add two numbers, you load both numbers onto the stack and execute an instruction to add them. The instruction removes the two numbers from the stack and puts the result on top of the stack. If you are like me, it sometimes helps to see the actual implementation. To see a system similar to the CLR that's small enough to digest, see Brian Kernighan and Rob Pike's book, The Unix Programming Environment (Prentice Hall, 1984). In it they implement a higher order calculator (hoc), a nontrivial C example of a stack-based machine.
The CLR evaluation stack can hold any type of value in the stack slots. Copying values from memory to the stack is referred to as loading, while copying items from the stack to memory is referred to as storing. Unlike the stack on an Intel CPU, the CLR evaluation stack does not hold the locals; they live in memory. Each stack is local to the method doing the work, and the CLR saves it across method invocations. Finally, the stack is also where method return values are placed. Now that I've covered just enough about how the CLR works, I'll move to the instructions.
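To make the load, operate, store pattern concrete, here is a toy evaluation stack in Python. It is only a sketch under simplifying assumptions: the names `EvalStack`, `load`, `store`, `add`, and `add_locals` are invented for this example and are not CLR APIs, and the model ignores the type tracking the real runtime performs.

```python
class EvalStack:
    """Toy model of a per-method evaluation stack (hypothetical, not a CLR API)."""

    def __init__(self, local_count):
        self.stack = []                    # the evaluation stack
        self.locals = [0] * local_count    # locals live in memory, not on the stack

    def load(self, index):
        """Copy a local from memory onto the stack (an LD-style operation)."""
        self.stack.append(self.locals[index])

    def store(self, index):
        """Pop the top of the stack into a local (an ST-style operation)."""
        self.locals[index] = self.stack.pop()

    def add(self):
        """Pop two operands and push their sum."""
        right = self.stack.pop()
        left = self.stack.pop()
        self.stack.append(left + right)


def add_locals(a, b):
    """Compute a + b the way a stack machine would: load, load, add, store."""
    machine = EvalStack(local_count=3)
    machine.locals[0], machine.locals[1] = a, b
    machine.load(0)
    machine.load(1)
    machine.add()
    machine.store(2)
    return machine.locals[2], machine.stack
```

Calling `add_locals(2, 3)` leaves the sum in the third local and the evaluation stack empty, which is the state a well-formed method body reaches after each statement.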
MSIL, Locals, and Parameters
Since I am your average developer, the first thing I write is Hello World! so I can see what's going on. Figure 4 shows the smallest MSIL program I could write to produce the required output. Even if this is the first time you have ever looked at MSIL (see Figure 5 for a longer example), you can easily see what's going on. Anything that starts with a period is a directive for the assembler, ILASM.EXE, and comments are delimited with the standard C++ double slashes.
The important parts of the code in Figure 4 are the last three lines. The LDSTR instruction takes care of getting the string onto the stack. Getting items on the stack is loading; so all instructions that start with LD are getting items from memory and putting them on the stack. Even though I didn't use it in the Hello World! program, getting items from the stack and putting them into memory is storing, and all those instructions begin with ST. Armed with those two little facts and the help ILDASM gives you by placing the hardcoded strings inline with the disassembly, you can perform a good portion of your reverse-engineering.
Now that I've shown you a little bite of MSIL assembly language, it's time to turn to what ILDASM shows you so you can start seeing how the various constructs fit together.
Getting the parameters and return types in ILDASM is trivial because the disassembly gives them to you when you double-click on a method to view it. The best part is that the disassembly shows the actual parameter names. Class values are shown in [module]namespace.class format. The core System natural types, int, char, and so on, are shown as their specific class type. For example, ints are shown as Int32.
Local variable display is very easy to decipher as well. If you have debugging symbols available, the locals display will show the actual names. However, disassembling the system classes will look like the following:
.locals (class [mscorlib]Microsoft.Win32.RegistryKey V_0,
class System.Object V_1,
The .locals directive and the parentheses delineate the complete list of local variables, and commas separate the individual entries. The type is given followed by a V_# name, where the # is the local's number. As you will see later, the number is used in quite a few instructions. In the previous snippet, [mscorlib] indicates the particular DLL where the class comes from.
The Important Instructions
Instead of providing a huge table of instructions, I want to show the most important instructions you will run into and examples of their use. I will start with the loading instructions and explain all their options. As I get to the other types of instructions, I will skip parts that are common with the load instructions and just show their usage. The instructions I don't cover are quite easy to figure out based on their names. For example, add and sub perform addition and subtraction, respectively.
LDC (load numeric constant). This instruction pushes a hardcoded number on the stack. The instruction format is LDC.size[.num], where size is the type of the value and the optional suffix selects a compact encoding for small 4-byte integers (when size is I4): ldc.i4.m1 and ldc.i4.0 through ldc.i4.8 fold the constant into the opcode, and ldc.i4.s takes a 1-byte operand for values from -128 to 127. The size is either I4 (4-byte integer), I8 (8-byte integer), R4 (4-byte floating point), or R8 (8-byte floating point). The compact forms exist to keep the encoded instruction stream small.
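The choice among the ldc.i4 encodings can be sketched as a small Python function. This is an illustration rather than how any particular compiler is written: `ldc_i4_form` is a hypothetical helper, and the 0 through 8 range for the single-opcode forms comes from the published instruction set.

```python
def ldc_i4_form(value):
    """Return the most compact ldc.i4 mnemonic for a 4-byte signed constant."""
    if not -2**31 <= value < 2**31:
        raise ValueError("constant does not fit in a 4-byte integer")
    if value == -1:
        return "ldc.i4.m1"              # -1 has its own opcode
    if 0 <= value <= 8:
        return f"ldc.i4.{value}"        # constant folded into the opcode
    if -128 <= value <= 127:
        return f"ldc.i4.s {value}"      # short form with a 1-byte operand
    return f"ldc.i4 {value}"            # general form with a 4-byte operand
```

For example, `ldc_i4_form(-9)` yields `ldc.i4.s -9`, while `ldc_i4_form(1000)` falls through to the general form.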
ldc.i4.0 // Load 0 onto the stack using the
// special form.
ldc.r8 2.1000000000000001 // Load 2.1000000000000001.
ldc.i4.m1 // Load -1 onto the stack. This
// is the special form.
ldc.i4.s -9 // Load -9 onto the stack
// using the short form.
LDARG and LDARGA (load argument and load argument address, respectively). These instructions load the specified argument, or its address, onto the stack. The argument numbers start at 0. For instance methods, argument 0 is the this pointer, so the first declared argument is number 1 rather than 0.
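The argument numbering rule for ldarg (slot 0 is the this pointer in instance methods) can be modeled with a short Python sketch. The helper name `ldarg_for_parameter` is invented for this illustration. The split between the single-opcode forms (slots 0 through 3) and the .s form follows the comments in the article's own listing, and the 255 limit on the .s form is taken from the published instruction set.

```python
def ldarg_for_parameter(position, is_instance):
    """Return the ldarg mnemonic for the declared parameter at `position` (0-based)."""
    if position < 0:
        raise ValueError("parameter position must be non-negative")
    # Instance methods receive `this` in slot 0, so declared parameters shift by one.
    slot = position + 1 if is_instance else position
    if slot <= 3:
        return f"ldarg.{slot}"       # single-opcode forms cover slots 0 through 3
    if slot <= 255:
        return f"ldarg.s {slot}"     # short form with a 1-byte slot number
    return f"ldarg {slot}"           # general form for higher slot numbers
```

So the first declared parameter of a static method loads with `ldarg.0`, but the same parameter on an instance method loads with `ldarg.1`.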
ldarg.2 // Load argument 2 onto the stack. 3 is the
// highest number using this form.
ldarg.s 6 // Load argument 6 onto the stack. All argument
// numbers from 4 up use this form.
ldarga.s newSample // Load newSample's address
LDLOC and LDLOCA (load local variable and load local variable address, respectively). Loads the specified local variable onto the stack. All local variables are specified by the order in which they appear in the locals declaration. The instruction ldloca loads the local variable's address.
ldloc.0 // Load local 0 onto the stack. 3 is the
// highest number using this form.
ldloc.s V_6 // Load local variable 6 onto the stack. All
// variables from number 4 up use this form.
ldloca.s V_5 // Load local variable 5's address onto the stack.
LDFLD and LDSFLD (load object field and load static field of a class, respectively). These instructions load the normal or static field from an object onto the stack. Reading the MSIL disassembly is very easy because the complete field reference, including its type, is specified. The instruction ldflda loads the field's address.
// Load the _Originator field from System.Reflection.AssemblyName.
// Notice the type of the field is given as well.
ldfld unsigned int8 System.Reflection.AssemblyName::_Originator
// Load the empty string from System.String.
ldsfld class System.String [mscorlib]System.String::Empty
LDELEM (load an element of an array). This instruction loads the specified element onto the stack for single-dimensional, zero-based arrays. The previous two instructions put the array and the index onto the stack (in that order). The ldelem instruction removes the array and index from the stack and puts the specified element on the top of the stack. A type suffix follows the ldelem instruction. The most common form in the compiled base class library is ldelem.ref, which gets the element as an object. Other common forms are ldelem.i4, which gets the element as a signed 4-byte integer, and ldelem.i8, which gets a 64-bit integer.
.locals (System.String[] V_0, // The [] indicate an array declaration.
int32 V_1 ) // The index.
... // Do work to fill V_0.
ldloc.0 // Load the array.
ldc.i4.0 // Load the zero index.
ldelem.ref // Get the object at index zero.
LDLEN (load the length of an array). This instruction removes the zero-based, single-dimensional array from the stack and pushes the length of the array onto the stack.
// Load the attribute field, which is an array.
ldfld class System.ComponentModel.MemberAttribute
stloc.1 // Store the value into the first
// local (an array).
ldloc.1 // Load the first local onto the stack.
ldlen // Get the array length.
STARG (store a value in an argument slot). Takes the value off the top of the stack and places it into the specified argument.
starg.s categoryHelp // Store the top of the stack into
// categoryHelp. All starg
// instructions use the .s form.
STELEM (store an element of an array). The previous three instructions place the zero-based, single-dimensional array, the index, and the value onto the stack (in that order). The stelem instruction casts the value to the appropriate array type before moving it into the array, and it removes all three items from the stack. Like the ldelem instruction, the type suffix specifies the conversion. The most common form is stelem.ref, which converts to an object.
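The stack traffic for the array instructions (ldelem, ldlen, and stelem) is easy to lose track of, so here is a toy Python model of what each one pops and pushes. It is a sketch with invented function names that mirror the mnemonics. Python lists stand in for CLR arrays and no type conversion is modeled.

```python
def _check_index(array, index):
    # The CLR rejects out-of-range indexes; Python would silently accept
    # negative ones, so the model checks explicitly.
    if not 0 <= index < len(array):
        raise IndexError(f"index {index} is outside the array bounds")


def ldelem(stack):
    """Pop the index, then the array; push the element."""
    index = stack.pop()
    array = stack.pop()
    _check_index(array, index)
    stack.append(array[index])


def ldlen(stack):
    """Pop the array; push its length."""
    array = stack.pop()
    stack.append(len(array))


def stelem(stack):
    """Pop the value, the index, and the array; store the value; push nothing."""
    value = stack.pop()
    index = stack.pop()
    array = stack.pop()
    _check_index(array, index)
    array[index] = value


def demo_array_ops():
    names = ["alpha", "beta"]
    stack = [names, 1, "gamma"]   # pushed in order: array, index, value
    stelem(stack)                 # names[1] = "gamma"; the stack is empty again
    stack += [names, 1]           # pushed in order: array, index
    ldelem(stack)                 # pushes names[1]
    element = stack.pop()
    stack.append(names)
    ldlen(stack)                  # pushes the array length
    length = stack.pop()
    return element, length, stack
```

The point to notice is the push order: the array always goes on first, so the instruction finds the value (for stelem) or the index (for ldelem) on top.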
.method public hidebysig specialname
instance void set_MachineName(class System.String 'value') il managed
.locals (class System.String V_0)
ldloc.0 // Load the array on the stack.
ldc.i4.1 // Load the index, the constant 1.
ldarg.1 // Load the argument, the string.
stelem.ref // Store the element.
STFLD (store into a field of an object). Takes the value off the top of the stack and places it into the object field. Like loading a field, the complete reference is given.
stfld int32 System.Diagnostics.CategoryEntry::HelpIndexes
CEQ (compare equal). This instruction compares the top two values on the stack. The two items are removed from the stack, and if the values are equal, a 1 is pushed onto the stack; otherwise, a 0 is pushed onto the stack.
ldloc.1 // Load the first local.
ldc.i4.0 // Load the constant zero.
ceq // Compare the items for equality.
CGT (compare greater than). This instruction also compares the top two values on the stack. The two items are removed, and if the first value pushed is greater than the second value, a 1 is pushed on the stack; otherwise, a 0 is pushed. The cgt instruction can also have the .un modifier applied to indicate the comparison is unsigned or unordered.
// Get the collection count.
call instance int32 System.Diagnostics.
ldc.i4.0 // Load the constant zero.
cgt // Compare if the count is
// greater than zero.
CLT (compare less than). This instruction performs identically to cgt except 1 is pushed if the first value is less than the second value.
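Because the operand order of the comparison instructions (first value pushed versus second value pushed) is the usual source of confusion, here is a toy Python model of ceq, cgt, and clt. The function names mirror the mnemonics but are otherwise invented for this sketch, and only the signed integer case is modeled, not the .un variants.

```python
import operator


def _compare(stack, predicate):
    second = stack.pop()    # pushed last, so it sits on top
    first = stack.pop()     # pushed first
    stack.append(1 if predicate(first, second) else 0)


def ceq(stack):
    _compare(stack, operator.eq)


def cgt(stack):
    _compare(stack, operator.gt)


def clt(stack):
    _compare(stack, operator.lt)


def run(first, second, instruction):
    """Push two values in order, execute one comparison, return the stack."""
    stack = [first, second]
    instruction(stack)
    return stack
```

With this model, `run(5, 0, cgt)` leaves a 1 on the stack because the first value pushed (5) is greater than the second (0). Swapping the push order flips the result.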
// Get the trace switch level.
call instance value class System.Diagnostics.TraceLevel
ldc.i4.1 // Load the constant one.
clt // Compare if the trace level is
// less than one.
BR (unconditional branch). This instruction is the goto of MSIL.
br.s IL_008d // Goto offset into the method.
BRFALSE and BRTRUE (branch on false and branch on true, respectively). Both instructions look at the value on the top of the stack and branch accordingly. The brtrue instruction branches only if the value is nonzero (or a non-null reference), while brfalse branches only if it is 0 (or null). Both instructions remove the value from the top of the stack.
ldloc.1 // Load the first local.
brfalse.s IL_006a // If zero, branch.
ldloc.2 // Load the second local.
brtrue.s IL_006c // Branch if nonzero.
The rest of the branching instructions are listed in Figure 6. In each case, the instruction takes the two values at the top of the stack and compares the top value with the next value. In all cases, the branch takes the place of a comparison followed by one of the Boolean branches. For example, BGT is equivalent to a cgt instruction followed by a brtrue instruction.
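The equivalence between a fused compare-and-branch such as bgt and the two-step cgt plus brtrue sequence can be checked with a small Python model. The function names are invented for this sketch. Each function returns whether the branch would be taken and consumes its operands from the stack, as the real instructions do.

```python
def bgt_taken(stack):
    """Fused form: pop two values and report whether the branch is taken."""
    second = stack.pop()
    first = stack.pop()
    return first > second


def cgt_then_brtrue_taken(stack):
    """Two-step form: cgt pushes 1 or 0, then brtrue pops it and decides."""
    second = stack.pop()
    first = stack.pop()
    stack.append(1 if first > second else 0)   # cgt
    return stack.pop() != 0                    # brtrue branches on nonzero


def same_decision(first, second):
    """Run both forms on identical stacks and compare outcome and leftovers."""
    fused, split = [first, second], [first, second]
    agree = bgt_taken(fused) == cgt_then_brtrue_taken(split)
    return agree and fused == [] and split == []
```

Both forms leave the stack empty and make the same decision. The fused form simply saves an instruction.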
CONV (data conversion). This instruction converts the data on the top of the stack to a new type and leaves the converted value on the top of the stack. The final conversion type follows the conv instruction. For example, conv.u4 converts to an unsigned 4-byte integer. The conv instruction with just the type does not throw any exceptions if there is any sort of overflow. If the instruction has .ovf between the conv and the type (for example, conv.ovf.u8), an overflow generates an exception.
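The difference between the plain and .ovf conversions can be illustrated for integer inputs with two small Python functions. This is a sketch only: the function names are invented, Python's OverflowError stands in for the exception the runtime raises, and floating-point inputs are deliberately left out of the model.

```python
U4_MAX = 0xFFFFFFFF


def conv_u4(value):
    """Model conv.u4 on an integer: keep the low 32 bits, silently discarding the rest."""
    return value & U4_MAX


def conv_ovf_u4(value):
    """Model conv.ovf.u4 on an integer: raise instead of truncating when it does not fit."""
    if not 0 <= value <= U4_MAX:
        raise OverflowError(f"{value} does not fit in an unsigned 4-byte integer")
    return value
```

Here `conv_u4(2**32 + 5)` quietly produces 5, whereas `conv_ovf_u4(2**32 + 5)` raises. That is why the .ovf forms show up in code compiled with overflow checking turned on.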
ldloc.0 // Load local zero (an array).
ldlen // Get the array length.
conv.i4 // Convert the array length to a
// four byte value.
NEWARR (create a zero-based, one-dimensional array). This instruction creates a new array of the specified type with the number of elements indicated by the value on the top of the stack. The element count is removed from the stack and the new array is placed on the top of the stack.
ldc.i4.5 // Set the number of elements to
// create to five.
// Create a new array.
NEWOBJ (create a new object). Creates a new object and calls the object's constructor. All constructor arguments are passed on the stack. If the creation succeeds, the arguments are removed from the stack, and the object reference is left on the stack.
.method public hidebysig specialname rtspecialname
instance void .ctor(class [mscorlib]System.IO.Stream 'stream',
class System.String name) il managed
ldarg.1 // Load the stream argument.
// Create the new class.
newobj instance void [mscorlib]
BOX (convert value type to object reference). This instruction forces a value into an object and leaves the object on the stack when the conversion is done. When boxing, this instruction does the work. You will see the code in Figure 7 a lot when passing parameters.
UNBOX (convert boxed value type to its raw form). This instruction returns a managed reference to the value type in the boxed form. The returned reference is not a copy, but the actual object state. With C# and Visual Basic.NET compiled code, after an unbox instruction comes a ldind (load value indirect onto the stack) or ldobj (copy value type to the stack).
// Convert the value into a System.Reflection.Emit.LocalToken
// Get the value onto the stack
unbox [mscorlib]System.Int16 // Convert the value to a Int16
ldind.i2 // Put the object's value onto the
// stack.
CALL and CALLVIRT (call a method and call a method associated at runtime with an object, respectively). The call instruction calls static and nonvirtual normal methods. Virtual methods and interface methods use the callvirt instruction. Arguments are placed on the stack in left-to-right order. Note that this order is the opposite of most calling conventions in the IA32 world. Figure 8 shows an example of using callvirt.
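The left-to-right argument order for method calls can be modeled in a few lines of Python. This is a toy sketch with invented names (`call`, `subtract`, `demo_call`). It ignores the this pointer and virtual dispatch and only shows how the arguments come off the evaluation stack and where the return value goes.

```python
def call(stack, function, arg_count):
    """Pop `arg_count` arguments, invoke `function`, and push its return value."""
    split = len(stack) - arg_count
    args = stack[split:]        # the oldest entry is the first argument
    del stack[split:]
    stack.append(function(*args))


def subtract(minuend, subtrahend):
    return minuend - subtrahend


def demo_call():
    stack = []
    stack.append(10)            # first argument is pushed first
    stack.append(3)             # last argument ends up on top
    call(stack, subtract, 2)
    return stack
```

Running `demo_call()` leaves only the return value, 7, on the stack. A right-to-left convention would have pushed 3 before 10.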
Wrap-up
To bring everything together, the code shown earlier in Figure 5 is a partial listing of an MSIL program I wrote to calculate some mathematical formulas as an exercise to learn MSIL. The full program is included with this month's source code distribution (see the link at the top of this article). Getting a handle on the MSIL you are looking at with ILDASM can make your life much easier when wandering around the beta landscape. Additionally, knowing how things work at the lowest level does make it easier to see the big picture. If you are motivated to learn more about MSIL, make sure to expand the extra documentation in ...\Program Files\Microsoft.Net\FrameworkSDK\Tool Developers Guide. The two files of greatest interest are ILINSTRSET.DOC and ILAssemblyLanguageProgrammersReference.DOC.
As you have seen, it looks quite easy to reverse-engineer .NET-compiled applications. In order to give you the cool metadata and xcopy deployment, quite a bit of information does have to go with the binary. Consequently, it's easier to figure out what's going on. The Java language has the same problems and there are even decompilers that will turn byte codes back into full Java-language source code for you. However, that has not stopped Java development, and the ease of disassembly should have no effect on .NET-compiled code either.
Most readers will be doing some ASP.NET deployment because it makes Web development so incredibly easy and powerful. Since everything runs on the server, there is no way for users or other developers to figure out your secret algorithms. While disassembly is possible on client applications in .NET, I feel the extraordinary positive aspects of .NET will far outweigh the ease of disassembly.
The Tips!
In colleges around the world, students are thinking hard about graduation this time of year, so you better send those tips to me by e-mailing them to email@example.com!
Tip 43 Pavel Lebedinsky found an extremely cool trick for the Visual C++® 6.0 debugger buried deep in the Microsoft Knowledge Base: the debugger can read crash dump files! Knowledge Base article Q248115 lists the secret registry key to get crash dumps loaded. Setting the CrashDumpEnabled REG_DWORD value to 1 in HKEY_CURRENT_USER\Software\Microsoft\DevStudio\6.0\Debug adds a *.DMP option when opening workspaces. It looks like the feature is partially done, but it does work. To get the best results, copy all the PDB files necessary for the process that crashed into the same directory as the .DMP file.
Tip 44 Mike Morearty thought it would be very nice if you could have true hardware read and write breakpoints from the Visual C++ 6.0 debugger. He wrote a very cool class, CBreakpoint, which lets you specify an address to stop on each time your program truly reads or writes it. Mike's class goes way beyond the data access breakpoints offered in the debugger today. It's a fantastic class and one that I have used to track down some very difficult bugs already! You can download the complete code at http://www.morearty.com/code/breakpoint. Mike also has a nice document set to show you exactly how to use it.
Send questions and comments for John to firstname.lastname@example.org.