The case of the one-thread race condition

You know what’s even worse than a race condition between two threads in your code?

A race condition in one thread in your code, because there are good solutions and debugging techniques for tracking down multi-threading conflicts, but they don’t work when there’s only one thread involved.

That’s right.  I just spent the last few hours tracking down what turned out to be a reentrancy problem.

At first, it looked like simple memory corruption.  A certain routine in a third-party DLL (that I happen to have the source to) was crashing under certain conditions, and it was reproducible about 80% of the time, but not all the time. This pointed to nondeterministic behavior, probably threading-related.  This DLL makes heavy use of multithreading, so that didn’t surprise me.

So I rebuilt the DLL with FastMM4 FullDebugMode enabled, which is incredibly helpful in tracking down corruption, and soon I started to get a clearer picture of what was going on, because I was seeing access violations with the telltale address $80808084, which meant something was trying to access a member of an object that had already been freed.  (FullDebugMode sets all bytes of the object’s memory, except the VMT pointer, to $80 when you free it to make it highly visible under the debugger.  It sets the VMT pointer to a special class called TFreedObject that has some extra debugging features.)

So I figured out what object was the problem, then changed the code a little, adding in a check that said “if thisObject.ClassType <> TExpectedClassType then asm int 3 end;”  This would cause a manual breakpoint if it came across a TFreedObject where it expected a live object.  I rebuilt the DLL, and sure enough it hit the breakpoint on the first run.  Now to figure out why.

This turned out to be not as easy as it sounds, not because the code was too complicated, but because the code was too simple.  The object in question was only owned by one other object, which held one instance of it.  It got created in one place, and destroyed in one place, with nothing that looked like it could screw up in some strange and unexpected way.

Then I looked into the code of the problematic object, and found that it was caching instances inside a global TList. Aha! Now we’re getting somewhere.  Obviously there’s a race condition corrupting the list somehow.  So I added a bunch of special breakpoints to output data to the event log instead of breaking, (if you don’t know about these things, check out the Advanced section of the Breakpoint Properties window sometime; there’s a bunch of useful things you can do with them,) but the more data I logged, the more frustrated I got.  Every object went into the cache and came back out just as it should, and at no point was the program retrieving an object from the cache that was actually a TFreedObject.

I did see one interesting thing, though.  There was never more than one object in the cache, and when I logged constructor and destructor calls, I could see it taking a (still valid) object out of the cache after a destructor had run!  I tried putting  TCriticalSection in and locking it before accessing the cache list in any way.  Still nothing.  The problem continued, and it was getting ridiculous by this point.

So I put in a global counter called LiveInstances and made the constructor increment it, and the destructor decrement it, and I put in a line of code where the object was retrieved from the cache: “if LiveInstances = 0 then asm int 3 end;”  Still nothing.  I used custom breakpoints to log the thread IDs involved, and found that everything that was being problematic was actually all running on the main thread!

Then I decided that, if the destructor was running and things were getting logged to the event log, I needed to check at that point. So I moved the “dec(LiveInstances);” call to the top of the destructor, instead of where it had been near the bottom, and finally hit an int 3 breakpoint: it was retrieving an object while LiveInstances was 0.  Somewhere in between the top of the destructor and the line I had originally had the “dec(LiveInstances);” call on, something was getting screwed up.

Then I looked at the code, and I wanted to scream and tear my hair out.  There it was, mocking me.  The original coder had put not one but two calls to Application.ProcessMessages inside the destructor!  And it just so happened that, in certain circumstances, it was possible for this message processing to result in some other code coming along and trying to retrive an item from the cache and use it.

And the bizarre thing is, I don’t even know what those calls were there for.  I figured they had to be there for a reason, so I moved things around so they wouldn’t be called until after the destructor had removed Self from the cache, but that caused other things to break.  So I put it back the way it was, and tried just removing the calls to Application.ProcessMessages entirely… and everything worked fine!

Excuse me while I go scream.

YEEEEEEEEEEAAAAAAARRRRRRRRRRRRRGGGGGGGGGGGHHHHHHHHHHHHHH!!!!!!!!!!!

OK, back.  I feel better now.

Remember, folks, Application.ProcessMessages was put in as a crutch for the benefit of n00bish VB coders who didn’t know any better, to make it easier for them to convert over to Delphi.  It should not actually be used in your code.  OK, OK, the unit in question has a header indicating it was written in 1996 for Delphi 3, so I suppose I can cut the original author some slack.  But if you ever feel like you need to use it, there’s almost certainly a better way.

For example, I actually had a point last year where I thought I needed it.  My code in the main thread was getting into a deadlock waiting on something to complete in another thread, which was waiting on a call to TThread.Synchronize to return.  I knew that Synchronize uses the message pump, so I considered calling ProcessMessages, but I figured there had to be a better way… and there was.  Turns out there’s a routine in Classes.pas designed for exactly this purpose: CheckSynchronize.  That solved my problem without having to worry about ProcessMessages and possible reentrancy headaches.

Has anyone out there found a place where they legitimately need to call ProcessMessages and there’s no better way to handle it?  Let me know in the comments. 🙂

12 Comments

  1. A. Bouchez says:

    The only way I call ProcessMessages by hand is when I do regression tests involving messages (e.g. when we communicate over GDI messages with our mORMot framework).
    GDI messages are pretty fast locally, but you have to call ProcessMessages in a test environment without any GUI (which is the case for mORMot regression tests) if you want the messaging process to take place.

    I remember also having called ProcessMessages in some places of GUI implementation, e.g. to ensure the screen is refreshed as expected.

    Using CheckSynchronize instead of ProcessMessages is indeed a clean way.
    But I do not like Synchronize() itself, and rather send a notification message (WM_USER####) to my main form instead, or use TEvent / WaitForMultiObject() instead.

    I always try to let the main UI thread not involved by any background process.
    I would rather creates a dedicated “main” thread for a given bounded context, if I need some central place.
    In fact, there may be cases where UI is not available nor wished at all. Especially if you want your code to be cross-platform ready.
    Separation of concern helps here.

    • A. Bouchez says:

      In fact, about separation of concern, my main rule is to avoid putting the Forms.pas in any uses clause outside the UI.

      Application.ProcessMessages is VCL bound, whereas any business logic code (even multi-threaded) should never have any dependency on the VCL, e.g. if you want to switch to FireMonkey. 🙂

      If I’ve a “uses … Forms, …” clause in any business-related logic, it just smells… 🙁

  2. Ondrej Kelle says:

    Agreed, I try to avoid Application.ProcessMessages if possible, exactly for this reason.

  3. Richard Marsden says:

    Hi All

    does anyone know which version of Delphi CheckSynchronize was introduced in, I am maintaining a legacy application written in Delphi 5 and it definitely isn’t in that version – so I guess I’m stuck with Application processmessages at the moment. Unless I get the chance to implement the custom message as Arnaud suggests.

    Many thanks Arnaud for the suggestion.

    regards

    Richard

    • A. Bouchez says:

      AFAIK TThread.Synchronize() in Delphi 5 uses a regular GDI message (named CM_EXECPROC), and not a TEvent, like in Delphi 6+.
      See Classes.pas unit.

      So you MUST call Application.ProcessMessages in Delphi 5.
      There is no CheckSynchronize method available in this version!

      • Richard Marsden says:

        Many thanks Arnaud

        Work are currently re-implementing the legacy product in C# – I have been maintaining the old system for four years so they will never upgrade from Delphi 5. I have Delphi 7 at home so I can at least practice using CheckSynchronize there and improve my skills at home.

    • Remy Lebeau says:

      CheckSynchronize() was introduced in Delphi 6 when TThread.Synchronize() was re-written to support Kylix for cross-platform support with Linux.

  4. A.J. van de Ven says:

    We (unfortunately) have a database upgrade application which isn’t multi-threaded, though it absolutely should be, and we use Application.ProcessMessages during long database operations to update the UI to ensure it doesn’t appear to go unresponsive. Prior to adding the calls, users would often wonder whether anything was happening, click on the window, see the “not responding” Windows message and terminate the application. Since we also weren’t using transactions then, I can’t tell you how many hours were spent trying to repair those databases.

  5. I once found one really awkward place that Delphi’s own code has an Application.ProcessMessages call: Printers.pas in the default procedure AbortProc (it’s right below procedure RaiseError that’s used to throw EPrinter exceptions). I found out in an application that was printing forms started by input of a barcode scanner… When a following scan came soon after, and got handled in this ProcessMessages call, it would corrupt data. I solved it with a custom SetAbortProc after each BeginDoc

  6. Darkhog says:

    I’m usually do this when doing complicated processing in loop (such as loading big file) so GUI won’t get locked.

Leave a Reply