Optimization – TURBU Tech

The next RTTI bottleneck

Mason Wheeler — Fri, 01 Mar 2013 06:01:26 +0000

A few years back, when I posted an analysis of how TValue is very slow, it prompted a lot of response from the community. Various people ran their own benchmarks, and started working on building or optimizing their own meta-value types. Some people are even still working on that today. But one of the most interesting things was Robert Love’s response. He looked at the TValue code and found a way that it could be optimized for the common case to speed things up.

I built on his foundation and sent a suggested patch to the Delphi team. They made their own tweaks, and newer versions of Delphi have had a much faster TValue because of that. But one of the most interesting things I heard while I was working on that improvement came from Barry Kelly. He said that he wasn’t sure how much those speedups would actually help, because the bulk of your CPU time in RTTI work was going to be spent in invocation (using the RTTI system to call methods) and not in moving data in and out of TValue variables.

And he was right. If you wanna really break your brain sometime, trace into a call to TRttiMethod.Invoke and take a look at what all is taking place under the hood. That’s a huge amount of work going on, and the interesting thing is how much of it will be exactly the same every time you invoke the same method, assuming you have a valid parameter list.

To do it quickly, you have to recompute as little of that setup code as possible every time. The fastest way would be to write your own invocation routines, something like this:

[code lang="delphi"]
function InvokeFuncA(var params: TArray<TValue>; const self: TValue): TValue;
begin
   result := (self.AsObject as TMyClass).FuncA(params[0].AsInteger);
end;
[/code]

Of course, that’s a static invocation routine which only works for one method, which is the exact opposite of what TRttiMethod.Invoke does: provides you with a generic invocation routine that works for any method. But it’s fast.
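For comparison, here is a minimal sketch that puts the generic TRttiMethod.Invoke path next to a hand-written shim. TMyClass and its FuncA body are hypothetical stand-ins for illustration, and the unit names assume an XE2-or-later compiler with the default RTTI settings for public methods.

[code lang="delphi"]
program RttiInvokeDemo;
{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Rtti;

type
  // Hypothetical class standing in for TMyClass from the snippet above.
  TMyClass = class
  public
    function FuncA(value: Integer): Integer;
  end;

function TMyClass.FuncA(value: Integer): Integer;
begin
  Result := value * 2;
end;

// Static shim: works only for TMyClass.FuncA, but does no per-call lookup.
function InvokeFuncA(const params: TArray<TValue>; const instance: TValue): TValue;
begin
  Result := (instance.AsObject as TMyClass).FuncA(params[0].AsInteger);
end;

var
  ctx: TRttiContext;
  method: TRttiMethod;
  obj: TMyClass;
  viaRtti, viaShim: TValue;
begin
  obj := TMyClass.Create;
  try
    ctx := TRttiContext.Create;
    // Generic path: look the method up by name, then let RTTI marshal the arguments.
    method := ctx.GetType(TMyClass).GetMethod('FuncA');
    viaRtti := method.Invoke(obj, [TValue.From<Integer>(21)]);

    // Static path: same arguments, but the call is compiled directly.
    viaShim := InvokeFuncA(TArray<TValue>.Create(TValue.From<Integer>(21)),
                           TValue.From<TMyClass>(obj));

    Writeln(viaRtti.AsInteger, ' ', viaShim.AsInteger);
  finally
    obj.Free;
  end;
end.
[/code]

Both calls should return the same value. The difference is that the shim skips all of the argument-mapping work that Invoke repeats on every call.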

So… what if we could find a way to get TRttiMethod to create something like that for us, at runtime? Instead of doing a bunch of work every time you call it to figure out what goes into memory where in order to convert your list of TValues into a native parameter list for a method, maybe it could work out how to perform that mapping, and express it in machine code.

It’s not as strange as it sounds. TMethodImplementation already does a certain amount of machine code generation at runtime to make some of the fancier RTTI tricks possible. And with RTTI becoming used in more places in Delphi (it’s all over the place in LiveBindings, for example), the last thing anyone wants is for it to be slow.

So, I’d like to challenge the community to step up to the plate again. Is there anyone out there who knows enough about low-level code to build what would essentially be a JIT compiler that takes a TRttiMethod as input and outputs a shim for fast invocation? I’ll be poking at things from one angle, but I’d like to invite anyone else who’s interested to help out. Let’s see how fast we can get RTTI invocation.

How default settings can slow down FastMM

Mason Wheeler — Sat, 21 May 2011 04:48:22 +0000

One of the biggest challenges in working on the TURBU engine has been minimizing load times. Some large projects have a whole lot of data to work with, which could take the better part of a minute to load if I tried to load it all up front. No one wants to sit and wait for that, so I’ve pared down the loading so that only the stuff that’s needed right away gets loaded from the project database right at startup.

And yet, on one of my larger test projects, that wasn’t enough. One of the things that has to be loaded up front is map tile data, so that the maps can draw. Unfortunately, this project has over 200 different tilesets, and it was taking quite a while to load that much data. I’ve got an RTTI-based deserializer that can turn dataset records into objects, but it was taking a completely unreasonable 3.3 seconds to read the tile data.

Profiling said that most of the delay–close to 60%–was coming from FastMM’s FastFreeMem calling something in ntdll.dll. It didn’t say what, and I didn’t figure I needed to poke around inside the memory manager. I’d be better off making sure there weren’t so many calls into FastFreeMem, right?

So I poked around in the deserializer code and found several places in inner loops where strings were being created and disposed of quite unnecessarily. I fixed the code so that that wouldn’t happen, optimizing out all the unnecessary FreeMem calls. That should have fixed things up, I figured. My reward was a measly 0.4 seconds, down from 3.3 to 2.9, with the bulk of the time still taking place in ntdll.

So I poked around in the FastFreeMem code a little, and was surprised to run across this:

[code lang="delphi"]
@LockBlockTypeLoop:
mov eax, $100
{Attempt to grab the block type}
lock cmpxchg TSmallBlockType([ebx]).BlockTypeLocked, ah
je @GotLockOnSmallBlockType
{$ifndef NeverSleepOnThreadContention}
{Couldn’t grab the block type – sleep and try again}
push ecx
push edx
push InitialSleepTime
call Sleep
pop edx
pop ecx
{Try again}
mov eax, $100
{Attempt to grab the block type}
lock cmpxchg TSmallBlockType([ebx]).BlockTypeLocked, ah
je @GotLockOnSmallBlockType
{Couldn’t grab the block type – sleep and try again}
push ecx
push edx
push AdditionalSleepTime
call Sleep
pop edx
pop ecx
{Try again}
jmp @LockBlockTypeLoop
{Align branch target}
nop
nop
{$else}
{Pause instruction (improves performance on P4)}
rep nop
{Try again}
jmp @LockBlockTypeLoop
{Align branch target}
nop
{$endif}
[/code]

So when it tries to lock the small block type in order to free some memory, unless a special “NeverSleepOnThreadContention” conditional define is set, it’ll call the Windows API Sleep function, giving up the rest of its timeslice (several milliseconds) because it’s blocked by an operation that will take a few dozen lines of ASM to complete.

I looked for this option in FastMM4Options.inc, and found the following explanation:

{Enable this option to never put a thread to sleep if a thread contention occurs. This option will improve performance if the ratio of the number of active threads to the number of CPU cores is low (typically < 2). With this option set a thread will enter a “busy waiting” loop instead of relinquishing its timeslice when a thread contention occurs.}

So sleeping the thread instead of spinlocking can be helpful when there are a high number of threads running. But there’s no code to detect this. It’s either never call Sleep or always call Sleep, with the decision hardcoded in at compile time. I wonder if it would be possible to always spinlock for a certain number of cycles first and see if that helps, before calling Sleep?
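To make the spin-first idea concrete, here is a rough sketch of a lock that spins for a bounded number of attempts and only then falls back to Sleep. This is not FastMM code: MAX_SPINS, TryAcquire, Acquire and Release are hypothetical names, the spin limit is an arbitrary value that would need tuning, and it uses TInterlocked from System.SyncObjects rather than hand-written assembly.

[code lang="delphi"]
program SpinThenSleepDemo;
{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.SyncObjects;

const
  MAX_SPINS = 1000; // hypothetical tuning value, not taken from FastMM

var
  LockFlag: Integer = 0; // 0 = free, 1 = held

// One attempt to take the lock; True if we got it.
function TryAcquire(var flag: Integer): Boolean;
begin
  Result := TInterlocked.CompareExchange(flag, 1, 0) = 0;
end;

// Busy-wait for up to MAX_SPINS failed attempts, then yield and start over.
procedure Acquire(var flag: Integer);
var
  spins: Integer;
begin
  spins := 0;
  while not TryAcquire(flag) do
  begin
    Inc(spins);
    if spins >= MAX_SPINS then
    begin
      Sleep(1);   // contention is lasting a while: stop burning CPU
      spins := 0;
    end;
  end;
end;

procedure Release(var flag: Integer);
begin
  TInterlocked.Exchange(flag, 0);
end;

begin
  Acquire(LockFlag);
  Writeln('Lock held: ', LockFlag = 1);
  Release(LockFlag);
  Writeln('Lock held: ', LockFlag = 1);
end.
[/code]

The design trade-off is the one the FastMM comment describes: short waits are resolved without giving up a timeslice, while long waits still yield the CPU instead of spinning indefinitely.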

Anyway, it turns out that that was my problem. I was doing some other data-intensive loading in a background thread, and the memory allocations were clashing with each other. When I set the NeverSleepOnThreadContention flag and rebuilt, the load time for tile data dropped to a far more acceptable 1.1 seconds.
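For anyone who wants to try the same change, this is roughly what it looks like, assuming the stock layout of FastMM4Options.inc, where disabled options are written with a leading dot. Check your own copy of the file, since this may differ between FastMM versions.

[code lang="delphi"]
// Disabled (assumed default): the dot turns the directive into a plain comment.
{.$define NeverSleepOnThreadContention}

// Enabled: remove the dot, then rebuild the whole project.
{$define NeverSleepOnThreadContention}
[/code]

Per the option’s own documentation quoted above, this is only likely to help when the ratio of active threads to CPU cores is low.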

TStringList updating pitfalls

Mason Wheeler — Tue, 19 Oct 2010 05:55:47 +0000

What’s wrong with this code?

[code lang="Delphi"]
procedure TMyCustomChecklistPopupControl.ClosePopup;
var
  i: integer;
begin
  inherited ClosePopup;
  FInternalItemStringList.Clear;
  for i := 0 to Self.CheckedCount - 1 do
    FInternalItemStringList.Add(Self.CheckedItems[i].Name);
end;
[/code]

At first glance, it looks just fine. It’s semantically correct–it will do what you want it to. If you happen to have seen a certain issue before, something might jump out at you, but if not, you probably think this is OK. And most of the time, it is.

This is a simplified version of something I ran into at work today, in one of our custom controls. I ran into it in the debugger, but not because it was raising exceptions or corrupting data. No, the problem was that when I hit the Check All button, selecting all 200 or so items, and then closed the popup, it left the UI unresponsive for a good 15 seconds or so.

Turns out the problem isn’t in what this code was written to do, but in what else it does. You see, there’s an OnChange event handler attached to the internal TSwissArmyKnife, er, TStringList, which goes over the data in the list, calculates a few things, and updates some UI elements. And yeah, you want that to happen when you make a change. But you want it to happen once per change, from the user’s perspective. This was happening once per change from the TStringList’s perspective, or in other words, 200+ times in total for a single user action. And it took forever to finish.

You can be a really good programmer and still not know all the ins and outs of the framework you’re working with. I’m always discovering new little details about how things work. Turns out I’ve seen this one before, so when I hit Pause a few seconds in and dropped to the debugger, and saw the following right in the middle of the call stack, I knew what was going on right away.

TStringList.Changed
TStringList.InsertItem
TStringList.AddObject
TStringList.Add

What whoever coded this control apparently didn’t know, probably because they’d just never run across it before, was that Borland anticipated this very problem–or more likely, because so many VCL classes use TStrings descendants internally, they ran into it themselves at one point–and put a little switch into TStrings to turn off the OnChange event handler temporarily.

Once I surrounded this code with a BeginUpdate and EndUpdate pair, the delay on closing up the box went from an agonizing 15 seconds to a tiny fraction of a second that I wouldn’t have noticed at all if I wasn’t watching for it.
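Here is a small standalone sketch of the pattern. TChangeCounter and Fill are hypothetical names used only for this demo; Fill plays the role of the loop in ClosePopup, and the counter stands in for the expensive change handler.

[code lang="delphi"]
program BeginUpdateDemo;
{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Classes;

type
  // Hypothetical stand-in for the expensive handler: it just counts calls.
  TChangeCounter = class
  public
    Count: Integer;
    procedure HandleChange(Sender: TObject);
  end;

procedure TChangeCounter.HandleChange(Sender: TObject);
begin
  Inc(Count);
end;

// Refill the list inside a BeginUpdate/EndUpdate pair.
procedure Fill(list: TStrings; itemCount: Integer);
var
  i: Integer;
begin
  list.BeginUpdate;
  try
    list.Clear;
    for i := 0 to itemCount - 1 do
      list.Add('Item ' + IntToStr(i));
  finally
    // The try/finally guarantees the update count is restored
    // even if something in the loop raises.
    list.EndUpdate;
  end;
end;

var
  counter: TChangeCounter;
  list: TStringList;
begin
  counter := TChangeCounter.Create;
  list := TStringList.Create;
  try
    list.OnChange := counter.HandleChange;
    Fill(list, 200);
    Writeln('Items: ', list.Count, ', OnChange calls: ', counter.Count);
  finally
    list.Free;
    counter.Free;
  end;
end.
[/code]

With the update pair in place, the handler runs when EndUpdate is called rather than once for every Add. Removing the BeginUpdate and EndUpdate calls from Fill is an easy way to see the difference in the counter.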

Hopefully most of the people reading this are familiar with BeginUpdate and EndUpdate. But if anyone who hasn’t seen it runs across this, now you have a new trick. Please make sure to use it, to spare your end-users some pain. Even if you don’t think it’s likely to be necessary, please use it anyway. When this special checklist control was originally written, years ago, it was intended to hold a dozen or so items at most, not hundreds, and it probably performed fine at that scale. But growing client demand means the app’s working with more data than it used to, and eventually you hit something like this unless you’re careful in your design.

Inheritance baggage

Mason Wheeler — Mon, 21 Jun 2010 15:03:25 +0000

A couple posts ago, I mentioned that I’ve been working with code generation lately. This is for a part of the TURBU project. An RPG relies pretty heavily on scripting, and RPG Maker, the system I created TURBU to replace, has a fairly extensive, if limited, scripting system. The limitations were one of the things that made me say “I could do better than this,” in fact: no functions; no local variables; callable procedures exist but parameters don’t, so any “passing” has to be done in global variables; only two data types, integer and boolean; no event handlers; minimal looping support; etc.

The upside of all this, though, is a very simple scripting system that doesn’t look much like a programming language, with a simple interface that almost anyone can pick up. I wanted to keep that simplicity as much as possible, while adding the full flexibility and power of a real scripting language. So I dreamed up EventBuilder, a set of objects which represent a high-level scripting interface and can also express themselves as PascalScript code.

I needed some way to create EventBuilder objects that could form a hierarchical tree that can represent blocks of code. They needed to be easily serializable to some human-readable format so people can copy and paste blocks of EventBuilder script in order to share scripts, ask for help with debugging, etc. And it needed to be ready quickly, since I want to be able to present as much of this as possible at Delphi Live! in August.

So is there any pre-existing system that supports hierarchical trees of objects and easy serialization to a simple text-based format? The answer should be obvious to any experienced Delphi user: descend from TComponent and use its built-in serialization to “DFM format.” I tried that and, once I’d figured out how to handle a few quirks related to object ownership, it worked great! All the infrastructure was there for me, tested and tried and proven over the last 15 years, and I could focus on the actual Event Builder logic. It’s taken me about a month to get the system to a workable state, and now it’s more or less all ready.

Then I tried running a very, very large RPG Maker project through my project importer, and it took a long time on converting the global script block. That’s sort of to be expected, since there are almost 2000 event scripts in there, but even so it felt like it was taking far too long for the amount of work involved. I looked at my code and couldn’t find any obvious issues, so I ran it through Sampling Profiler.

It’s a good thing I did, too. It found a very clear bottleneck in a place I’d have never thought to look. Apparently I was spending 77% of my time in TComponent.Notification. And why would I have never thought to look there? Because I’ve never heard of it! But apparently every time I added a component, it would recursively call this on the entire subtree, turning what ought to have been an O(n) conversion into O(n^2).

With a bit of research, it turns out that TComponent.Notification is for dealing with linked components. For example, when you link a TDataset to a TDatasource, it needs a notification mechanism so it can clean up references if you free one of them. Since EventBuilder doesn’t use linked components, I didn’t really need this functionality. Good thing TComponent.Notification is virtual! I overrode it with a blank method, and suddenly the conversion time dropped from about 12 seconds to about 3 seconds, and everything’s running smoothly again.
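A minimal sketch of that override follows. TScriptNode and BuildTree are hypothetical names for illustration, not the actual EventBuilder classes.

[code lang="delphi"]
program NotificationDemo;
{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Classes;

type
  // Hypothetical base class for script-tree nodes.
  TScriptNode = class(TComponent)
  protected
    procedure Notification(AComponent: TComponent;
                           Operation: TOperation); override;
  end;

procedure TScriptNode.Notification(AComponent: TComponent;
                                   Operation: TOperation);
begin
  // Deliberately empty: inherited is not called, so insert and remove
  // notifications are no longer forwarded through the owned components.
end;

// Build a root node that owns (nodeCount - 1) children.
function BuildTree(nodeCount: Integer): TScriptNode;
var
  i: Integer;
begin
  Result := TScriptNode.Create(nil);
  for i := 1 to nodeCount - 1 do
    TScriptNode.Create(Result); // owned by the root, freed along with it
end;

var
  root: TScriptNode;
begin
  root := BuildTree(2000);
  try
    Writeln('Owned components: ', root.ComponentCount);
  finally
    root.Free;
  end;
end.
[/code]

Ownership and streaming still work, because the component list is maintained outside of Notification. The cost is that anything relying on free notifications reaching these nodes will no longer be told, so this is only appropriate for classes that, like EventBuilder, never link to other components.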

Moral of the story? Be careful that you understand what you’re inheriting from, otherwise you might end up with killer kangaroos or other unwanted features.

Real-world optimization

Mason Wheeler — Tue, 20 Oct 2009 15:22:27 +0000

Last week at work, I was asked to look at one of our verification modules that was taking about three times longer to run than it had in an earlier version. This module takes a set of result files, compares them against another file showing expected results, and reports any discrepancies that are outside the defined margin of error. It’s some pretty heavy work involving hundreds of thousands of data points, and the old version already took more than ten minutes. Increasing the running time by a factor of three just wasn’t acceptable. So I started to look at what was going on.

Verification takes place in four steps:

1. Load the data from the files
2. Process the data
3. Process the data some more
4. Retrieve the results

Steps 3 and 4 only take a few seconds each. Step 2 takes a couple minutes, but the bulk of the time is spent in step 1. So I decided to focus there to see if I could find what was making it take so long. First thing to do is establish a baseline. I built the old version and turned on SQL Server Profiler for the database and Sampling Profiler, an excellent tool written in Delphi that helps you profile Delphi apps without slowing them down the way AQTime does. I ran the entire verification process and found that yes, not only was step 1 taking most of the time, over 90% of the time was spent on one single line that matches the data from the files against the data in the database.

The data-loading system looks something like this. Names and a few details have been changed to protect the innocent, er, the corporate intellectual property, of course, but this is the general idea of what was going on. See how many problems you can spot in this code. (Bear in mind, this was the original, faster version.)

[code lang="delphi"]
procedure TVerificationDataModule.LoadFile(const filename: string;
                                           otherParams: TOtherData);
var
   lines: TStringList;
   fileData: TObjectList;
   dbData: TOrmObjectList;
   i: integer;
begin
   lines := TStringList.Create;
   fileData := TObjectList.Create;
   try
      lines.LoadFromFile(filename);
      for I := 0 to lines.Count - 1 do
         fileData.Add(parseLine(lines[i]));
   finally
      lines.Free;
   end;

   dbData := GetRelevantDBData(otherParams);
   try
      for i := 0 to fileData.Count - 1 do
         MatchFileDataAgainstDB(fileData[i] as TFileData, dbData);
   finally
      dbData.Free;
   end;
end;

procedure TVerificationDataModule.MatchFileDataAgainstDB(fileData: TFileData;
                                                         dbData: TOrmObjectList);
var
   i: integer;
   dbItem: TOrmVerificationObject;
   updateProcedure: IStoredProcedureRecord;
begin
   for i := 0 to dbData.Count - 1 do
   begin
      dbItem := dbData[i] as TOrmVerificationObject;

      //90% of time is spent on this next line:
      // ("and" binds tighter than "=" in Delphi, so each comparison needs parentheses)
      if (fileData.param1 = dbItem.param1) and
         (fileData.param2 = dbItem.param2) and
         (fileData.param3 = dbItem.param3) and
         (fileData.param4 = dbItem.param4) then
      begin
         updateProcedure := CreateStoredProc('VERIFICATION_DATA_LOADER');
         updateProcedure.param1 := fileData.param3;
         updateProcedure.param2 := fileData.param4;
         updateProcedure.param3 := fileData.param5;
         updateProcedure.param4 := fileData.param6;
         updateProcedure.param5 := fileData.param7;
         updateProcedure.Execute;
         if updateProcedure.ResultCode <> GOOD_RESULT then
            raise Exception.Create('Something went wrong');

         dbData.Delete(i);
         Break; // the list just shrank, so stop before the loop runs past its end
      end;
   end;
end;
[/code]

Then I profiled the newer version and got very similar results, except that it was spending even more time in the if statement to match the objects against each other. Close to 99% now. So what had changed? I looked back through version control and found that the SQL that generates the result set that goes into dbData had been changed between versions. A new table was added to simplify the big mess of joins, but they forgot one of the ON criteria, so it was returning three times as many results as it should have. There’s your factor of three right there. Easy enough to fix. But that still doesn’t address the quality of the original code. A couple things jumped right out at me, and I wondered if I could bring the time down below the original mark.

The first thing came out of the SQL profiler. I kept seeing a call to sp_procedure_params_rowset, an undocumented procedure in SQL Server that the connection object uses internally to get information about the expected parameters for a stored procedure, immediately followed by a call to the VERIFICATION_DATA_LOADER proc. This seemed a bit silly to me. The signature of the stored procedure isn’t going to change! Turns out that was called internally by the CreateStoredProc function, which was being called every time it went to save some data to the database, in order to create the proper object.

So I moved the call to CreateStoredProc out to the main procedure and set it up as an extra parameter to pass into MatchFileDataAgainstDB. It would reuse the same basic stored procedure object and reassign its parameters for each call, so you get the same net effect, but with 50% fewer database hits. Unfortunately, this didn’t yield a 50% increase in performance. SQL Server can cache the results of redundant queries, so this call wasn’t taking much time at all to process repeatedly, but the transport layer overhead was still a factor, and removing this redundant call sped the overall process up by about 20%.

But the big one was in the matching, where the profiler said the system was spending the majority of its time. It doesn’t exactly look like a speed bottleneck, because it’s hidden inside a method call, but what it is is a linear search inside of a loop, with both lists containing a few thousand elements each. But how do you make something like this run faster? I could try sorting the second list and using a binary search, but have you ever written a binary search? It’s a bunch of extra code, and it’s often confusing and hard to read. I couldn’t use a TDictionary to index the second list, because I need to match against 4 items, not just 1. So instead I used a very simple trick that’s been around for decades but that I don’t tend to see very often these days: list comparison.

The general algorithm goes like this:

1. Sort both lists by the same criteria. This must also be the same as the matching criteria.
2. Start at the top of both lists. Pick the first item from each and compare them.
3. If they match, handle the case and advance the index for both lists.
4. If they don’t match, loop through, advancing the index for the list with the “lesser” value each time, until a match is found.
5. When you reach the end of either list, you’re done. (Unless you want to handle any leftovers from the other list.)
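The steps above can be sketched as follows. This is an illustration rather than the code from the verification module: TKey, MakeKey, CompareKeys and CountMatches are hypothetical names, the four integer fields stand in for the four matching parameters, and “handling” a match is reduced to counting it.

[code lang="delphi"]
program ListCompareDemo;
{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Math,
  System.Generics.Defaults, System.Generics.Collections;

type
  // Hypothetical stand-in for the four fields used for matching.
  TKey = record
    p1, p2, p3, p4: Integer;
  end;

function MakeKey(a, b, c, d: Integer): TKey;
begin
  Result.p1 := a;
  Result.p2 := b;
  Result.p3 := c;
  Result.p4 := d;
end;

// Orders keys field by field; used for both sorting and matching.
function CompareKeys(const a, b: TKey): Integer;
begin
  Result := CompareValue(a.p1, b.p1);
  if Result = 0 then Result := CompareValue(a.p2, b.p2);
  if Result = 0 then Result := CompareValue(a.p3, b.p3);
  if Result = 0 then Result := CompareValue(a.p4, b.p4);
end;

// Sorts both lists, then walks them in step and counts the matching keys.
function CountMatches(left, right: TList<TKey>): Integer;
var
  comparer: IComparer<TKey>;
  i, j, c: Integer;
begin
  comparer := TComparer<TKey>.Construct(
    function(const a, b: TKey): Integer
    begin
      Result := CompareKeys(a, b);
    end);
  left.Sort(comparer);
  right.Sort(comparer);

  Result := 0;
  i := 0;
  j := 0;
  while (i < left.Count) and (j < right.Count) do
  begin
    c := CompareKeys(left[i], right[j]);
    if c = 0 then
    begin
      Inc(Result); // a real version would handle the matched pair here
      Inc(i);
      Inc(j);
    end
    else if c < 0 then
      Inc(i)       // left item is "lesser": advance the left list
    else
      Inc(j);      // right item is "lesser": advance the right list
  end;
end;

var
  fileKeys, dbKeys: TList<TKey>;
begin
  fileKeys := TList<TKey>.Create;
  dbKeys := TList<TKey>.Create;
  try
    fileKeys.AddRange([MakeKey(1, 1, 1, 1), MakeKey(2, 0, 0, 0), MakeKey(3, 5, 5, 5)]);
    dbKeys.AddRange([MakeKey(3, 5, 5, 5), MakeKey(1, 1, 1, 1), MakeKey(4, 0, 0, 0)]);
    Writeln('Matches: ', CountMatches(fileKeys, dbKeys)); // prints "Matches: 2"
  finally
    dbKeys.Free;
    fileKeys.Free;
  end;
end.
[/code]

Each index only ever moves forward, so the comparison pass visits every item at most once. The up-front sort is the price of admission, and it is still far cheaper than rescanning the second list for every item in the first.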

This is a very simple and very useful algorithm for reconciling two sets of data, and I’ve managed to find all sorts of uses for it. Unlike a double-nested loop, which basically runs in quadratic time, the comparison pass is guaranteed to run in linear time once the lists are sorted, and it never walks either list more than once. I managed to adapt this algorithm to the existing code, and suddenly processing the input files, which had previously taken at least a minute each, takes between 2 and 6 seconds per file. Now loading the data takes about the same amount of time as performing the calculations, instead of an order of magnitude longer.

Lessons learned:

Profilers, especially non-invasive ones, are invaluable for finding what’s going on in your app. I’d have probably noticed that double-nested loop soon enough, but I would never have found the stored procedure issue without SQL Server Profiler to point it out.
Pulling things out of loops—especially other loops!—is a great way to increase performance.
Reducing algorithmic time complexity is by far the best optimization for large data sets.
Linear, single-threaded techniques are still relevant. A lot of people are talking these days about parallel programming and how the meaning of optimization has changed in today’s world. They’re right, to a certain extent, but as hard as I try I can’t think of any way to parallelize this check that would make it faster than a simple list comparison. The only thing I know of with the potential to be faster than this is a hash table lookup, which could be parallelized, but it won’t work particularly well when you need to look up your values based on more than one index value.