PSPCLR: Implementing newobj

PSPCLR just recently turned five bits old. The commit that brought the repository revision to 16 was pretty large in terms of files touched, but most of the changes are the result of some fairly heavy refactoring I’ve been doing slowly over the last few weeks. In particular,

  • I got rid of the TableTool and replaced it with a real MSBuild task. I also added a task for generating opcode-related boilerplate and dispatch functions, to further decrease the amount of effort involved in implementing a new opcode in the interpreter.
  • I reorganized the project files somewhat, in particular splitting the build process projects into a PspClr.Prerequisite.sln and leaving the actual loader, corelib implementation, et cetera in PspClr.Product.sln. The motivation behind this change was to work around the assembly unload limitation that makes developing MSBuild tasks in the same solution file you call them from a painful experience.
  • I streamlined a lot of the metadata handling, especially the runtime representations of the metadata: there is now an actual object model that wraps the metadata rows, making them much easier to work with in code.

After all of that, however, the basic functionality remains about the same as last time, with one small addition: newobj.

“4.21 newobj - create a new object”

newobj causes the CLR to create a new instance of either a reference or value type. Since nearly everything you’d ever want to work with is such an instance, it’s a fairly critical opcode to support. The only reason I was able to get away without it so far is that the assembly entry point method is static, as is System.Console.WriteLine.

newobj is very straightforward: it’s a single-byte instruction followed by a four-byte argument, which is a metadata token (essentially an encoded index) that refers to the constructor to invoke on the created object. The reference can be to the method definition metadata or the member reference metadata table — the former is for methods that exist in the same assembly, the latter for references outside the assembly. I’m only going to discuss the former, as the process for resolving the latter is essentially the same but with the extra layer of indirection of finding the assembly in your dependency list first.
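To make the encoding concrete, here is a minimal Python sketch of decoding the instruction. It is only an illustration, not PSPCLR's code: the function name decode_newobj is hypothetical. The opcode value (0x73), the little-endian operand, and the table ids in the token's high byte (0x06 for method definitions, 0x0A for member references) follow the CLI specification.

```python
import struct

NEWOBJ = 0x73            # single-byte opcode for newobj
TABLE_METHOD_DEF = 0x06  # token table id: method definition (same assembly)
TABLE_MEMBER_REF = 0x0A  # token table id: member reference (other assembly)

def decode_newobj(code: bytes, pc: int):
    """Decode a newobj at code[pc]; return (table_id, row_number, next_pc)."""
    if code[pc] != NEWOBJ:
        raise ValueError("not a newobj instruction")
    # The operand is a little-endian 32-bit metadata token.
    (token,) = struct.unpack_from("<I", code, pc + 1)
    table_id = token >> 24           # high byte selects the metadata table
    row_number = token & 0x00FFFFFF  # low three bytes are the 1-based row number
    return table_id, row_number, pc + 5

# Token 0x06000002: row 2 of the method definition table.
code = bytes([0x73, 0x02, 0x00, 0x00, 0x06])
table_id, row_number, next_pc = decode_newobj(code, 0)
```

Here table_id is TABLE_METHOD_DEF, so the constructor lives in the same assembly and can be looked up directly by row number.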

When you encounter a newobj call, that constructor method definition is all you have available. It looks like this:

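  • A four-byte relative virtual address (RVA) pointing to the method body, which holds the method’s CIL opcode stream.
  • Four bytes of flags (two 16-bit fields: implementation flags and method attributes).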
  • An index into the string heap, which provides the method’s name (“.ctor” for constructors).
  • An index into the blob heap, which provides the method’s signature.
  • An index into another metadata table which defines the parameters of the method.
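The row layout above could be read with something like the following Python sketch. It is illustrative only: the names MethodDefRow and read_method_def_row are made up, and it assumes every heap and table index is two bytes wide (in a real assembly each index is either two or four bytes, depending on the size of the heap or table it points into). The four bytes of flags are read as two 16-bit fields, as the CLI specification defines them.

```python
import struct
from dataclasses import dataclass

@dataclass
class MethodDefRow:
    rva: int              # where the method body (and its CIL) starts
    impl_flags: int       # 16-bit implementation flags
    flags: int            # 16-bit method attribute flags
    name_index: int       # index into the string heap
    signature_index: int  # index into the blob heap
    param_list: int       # index of the first row in the parameter table

# Assumption: all heap and table indices are 2 bytes wide.
ROW_FORMAT = "<IHHHHH"
ROW_SIZE = struct.calcsize(ROW_FORMAT)  # 14 bytes under that assumption

def read_method_def_row(table: bytes, row_number: int) -> MethodDefRow:
    """Read the row with the given 1-based row number from the raw table bytes."""
    offset = (row_number - 1) * ROW_SIZE
    return MethodDefRow(*struct.unpack_from(ROW_FORMAT, table, offset))

# Two made-up rows, packed back to back, standing in for real table bytes.
raw = struct.pack(ROW_FORMAT, 0x2050, 0, 0x0096, 1, 10, 1)
raw += struct.pack(ROW_FORMAT, 0x2060, 0, 0x1886, 7, 14, 1)
ctor_row = read_method_def_row(raw, 2)
```

Nothing in the decoded row names a type, which is the problem the next paragraphs deal with.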

Does it look like there’s anything missing? At first glance, there doesn’t appear to be any information about the type that owns the method — without knowing the type, you can’t know the size of an instance of that type, which means you can’t create that instance, which means you cannot call the constructor since it wouldn’t have anything to initialize.

You might try looking in the parameter list or in the signature in the blob heap to extract information about the owning type, but that would not be a fruitful search. The existence of the this parameter is encoded with a single flag in the signature, not a full parameter entry (in the interests of saving space). No, to find out who owns a method, you have to look through the type list until you find a type that has the desired method.

This isn’t as bad as it might seem at first glance: the metadata that defines a type contains an index to the first method definition for that type, and you are guaranteed that every method definition is owned by at most a single type and that all definitions belonging to a given type are contiguous in the method definition table. This makes it pretty easy to associate methods with a type. The fields of a type are stored in the same fashion, and you need to process them to create the in-memory layout for the type anyway (since you need to know how many bytes an instance of the type occupies). This is what spurred me to create a more robust object model for the runtime representations of all this metadata that the execution engine consumes.
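Because each type's methods form a contiguous run, the owner of a method can be found from the "first method" column alone. The Python sketch below shows one way to do that with a binary search. The function name find_owner and the sample data are hypothetical. It relies on the first-method indices being non-decreasing in type order, which follows from the contiguity guarantee.

```python
from bisect import bisect_right

def find_owner(method_list_starts, method_count, method_row):
    """Return the 0-based index of the type that owns a 1-based method row.

    method_list_starts[i] is the first method row owned by type i; that
    type's run ends where the next type's run begins (or at the table end).
    """
    if not 1 <= method_row <= method_count:
        raise LookupError("method row out of range")
    # The owner is the last type whose run starts at or before method_row.
    index = bisect_right(method_list_starts, method_row) - 1
    if index < 0:
        raise LookupError("no type owns this method")
    return index

# Type 0 owns no methods, type 1 owns rows 1-3, type 2 owns 4-5, type 3 owns 6-7.
starts = [1, 1, 4, 6]
owner_of_row_5 = find_owner(starts, 7, 5)
```

A type with no methods shares its start index with the type after it, and the search skips over it because it picks the last matching entry.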

At assembly load time, I iterate the type definition metadata and produce a TypeDef object for each entry, which I store in the assembly. Creating a TypeDef object iterates the subsections of the method and field metadata tables that are owned by that type and creates MethodDef and FieldDef objects, which are stored in the TypeDef. Methods and fields are imbued with a reference back to their owning TypeDef, which allows me to make the mapping between method and owning type O(1) at the expense of a small amount of storage.
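The shape of that object model might look roughly like the Python sketch below. The class names TypeDef, MethodDef, and FieldDef come from the description above, but every attribute and the build_type helper are assumptions for illustration. Fields are simply packed one after another with no alignment or padding, which a real runtime would have to handle.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass(eq=False)
class FieldDef:
    name: str
    size: int                          # bytes occupied by the field
    owner: Optional["TypeDef"] = None  # back-reference to the owning type
    offset: int = 0                    # byte offset inside an instance

@dataclass(eq=False)
class MethodDef:
    name: str
    owner: Optional["TypeDef"] = None  # back-reference to the owning type

@dataclass(eq=False)
class TypeDef:
    name: str
    fields: List[FieldDef] = field(default_factory=list)
    methods: List[MethodDef] = field(default_factory=list)
    instance_size: int = 0

def build_type(name, field_rows, method_names):
    """Build a TypeDef from (field name, size) pairs and method names."""
    type_def = TypeDef(name)
    for field_name, size in field_rows:
        # Assumption: fields are packed back to back, with no padding.
        type_def.fields.append(
            FieldDef(field_name, size, owner=type_def, offset=type_def.instance_size))
        type_def.instance_size += size
    for method_name in method_names:
        type_def.methods.append(MethodDef(method_name, owner=type_def))
    return type_def

point = build_type("Point", [("x", 4), ("y", 4)], [".ctor", "Length"])
```

With the back-references in place, point.methods[0].owner is point, so going from a method to its owning type costs a single attribute read.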

With that object model in place, it becomes possible to implement newobj. Read the method reference argument, and find the appropriate MethodDef object (this is an O(1) lookup in an array stored in the assembly). Get the TypeDef for the MethodDef (another constant time lookup), and allocate enough space for that type on the heap. Then call the constructor.
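Those steps can be sketched in a few lines of Python. This is a toy model under several assumptions, not the real interpreter: the heap is stood in for by a zeroed bytearray, a constructor body is a plain Python callable rather than interpreted CIL, only method definition tokens are handled, and all attribute and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass(eq=False)
class TypeDef:
    name: str
    instance_size: int

@dataclass(eq=False)
class MethodDef:
    name: str
    owner: TypeDef
    # Stand-in for the CIL body: receives the new instance and the stack.
    body: Callable[[bytearray, list], None]

@dataclass(eq=False)
class Assembly:
    method_defs: List[MethodDef]  # position i holds method row i + 1

def execute_newobj(assembly: Assembly, token: int, stack: list) -> None:
    table_id, row_number = token >> 24, token & 0x00FFFFFF
    if table_id != 0x06:
        raise NotImplementedError("only method definition tokens in this sketch")
    ctor = assembly.method_defs[row_number - 1]      # O(1) array lookup
    instance = bytearray(ctor.owner.instance_size)   # zero-initialised storage
    ctor.body(instance, stack)                       # constructor pops its arguments
    stack.append(instance)                           # newobj leaves the new object on the stack

def point_ctor(this: bytearray, stack: list) -> None:
    y = stack.pop()
    x = stack.pop()
    this[0:4] = x.to_bytes(4, "little")
    this[4:8] = y.to_bytes(4, "little")

point_type = TypeDef("Point", instance_size=8)
assembly = Assembly([MethodDef(".ctor", point_type, point_ctor)])
stack = [3, 4]
execute_newobj(assembly, 0x06000001, stack)
```

After the call, the two constructor arguments are gone from the stack and the only entry left is the newly created 8-byte instance.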

The Horizon: ldfld and stfld

newobj and its supporting code changes are a pretty decent step forward for PSPCLR, but I’m still a long way from something useful. Next on my plate are the opcodes for loading and storing object fields: these are used when, for example, you read from or write to members. Since constructors do this often, they make for a logical next step.