You know what’s even worse than a race condition between two threads in your code?
A race condition in one thread in your code. There are good solutions and debugging techniques for tracking down multithreading conflicts, but they don’t work when there’s only one thread involved.
That’s right. I just spent the last few hours tracking down what turned out to be a reentrancy problem.
At first, it looked like simple memory corruption. A certain routine in a third-party DLL (that I happen to have the source to) was crashing under certain conditions, and it was reproducible about 80% of the time, but not all the time. This pointed to nondeterministic behavior, probably threading-related. This DLL makes heavy use of multithreading, so that didn’t surprise me.
So I rebuilt the DLL with FastMM4 FullDebugMode enabled, which is incredibly helpful in tracking down corruption, and soon I started to get a clearer picture of what was going on. I was seeing access violations with the telltale address $80808084, which meant something was trying to access a member of an object that had already been freed. (When you free an object, FullDebugMode sets all bytes of its memory, except the VMT pointer, to $80 to make it highly visible under the debugger. It points the VMT pointer at a special class called TFreedObject that has some extra debugging features.)
So I figured out what object was the problem, then changed the code a little, adding in a check that said “if thisObject.ClassType <> TExpectedClassType then asm int 3 end;” This would cause a manual breakpoint if it came across a TFreedObject where it expected a live object. I rebuilt the DLL, and sure enough it hit the breakpoint on the first run. Now to figure out why.
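Here’s a minimal sketch of that kind of class check as a console program. It doesn’t need FastMM4 to run: TStandInFreedObject is a hypothetical stand-in for TFreedObject, and IsExactly is a helper name made up for illustration. In the real thing, the False branch is where the int 3 would go.

```pascal
program ClassTypeCheckSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

type
  // The class we expect to find (name borrowed from the check above).
  TExpectedClassType = class(TObject)
  end;

  // Hypothetical stand-in for FastMM4's TFreedObject, so the sketch
  // never has to touch memory that was really freed.
  TStandInFreedObject = class(TObject)
  end;

// True only when Obj is exactly AClass. Descendants do not match,
// which is also how the ClassType comparison behaves.
function IsExactly(Obj: TObject; AClass: TClass): Boolean;
begin
  Result := Obj.ClassType = AClass;
end;

var
  Live, Stale: TObject;
begin
  Live := TExpectedClassType.Create;
  Stale := TStandInFreedObject.Create;
  try
    Writeln(IsExactly(Live, TExpectedClassType));   // the live object passes
    Writeln(IsExactly(Stale, TExpectedClassType));  // the stand-in fails; break here
  finally
    Stale.Free;
    Live.Free;
  end;
end.
```

One thing to keep in mind: comparing ClassType is an exact match, so a legitimate descendant of the expected class would trip the check too. If descendants are possible, "is" is the more forgiving test, but for spotting a TFreedObject the exact comparison is what you want.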
This turned out to be not as easy as it sounds, not because the code was too complicated, but because the code was too simple. The object in question was only owned by one other object, which held one instance of it. It got created in one place, and destroyed in one place, with nothing that looked like it could screw up in some strange and unexpected way.
Then I looked into the code of the problematic object, and found that it was caching instances inside a global TList. Aha! Now we’re getting somewhere. Obviously there’s a race condition corrupting the list somehow. So I added a bunch of special breakpoints to output data to the event log instead of breaking. (If you don’t know about these things, check out the Advanced section of the Breakpoint Properties window sometime; there’s a bunch of useful things you can do with them.) But the more data I logged, the more frustrated I got. Every object went into the cache and came back out just as it should, and at no point was the program retrieving an object from the cache that was actually a TFreedObject.
I did see one interesting thing, though. There was never more than one object in the cache, and when I logged constructor and destructor calls, I could see it taking a (still valid) object out of the cache after a destructor had run! I tried putting a TCriticalSection in and locking it before accessing the cache list in any way. Still nothing. The problem continued, and it was getting ridiculous by this point.
So I put in a global counter called LiveInstances and made the constructor increment it, and the destructor decrement it, and I put in a line of code where the object was retrieved from the cache: “if LiveInstances = 0 then asm int 3 end;” Still nothing. I used custom breakpoints to log the thread IDs involved, and found that everything involved in the problem was actually running on the main thread!
Then I decided that, if the destructor was running and things were getting logged to the event log, I needed to check at that point. So I moved the “dec(LiveInstances);” call to the top of the destructor, instead of where it had been near the bottom, and finally hit an int 3 breakpoint: it was retrieving an object while LiveInstances was 0. Somewhere in between the top of the destructor and the line I had originally had the “dec(LiveInstances);” call on, something was getting screwed up.
Then I looked at the code, and I wanted to scream and tear my hair out. There it was, mocking me. The original coder had put not one but two calls to Application.ProcessMessages inside the destructor! And it just so happened that, in certain circumstances, it was possible for this message processing to result in some other code coming along and trying to retrieve an item from the cache and use it.
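To make the failure mode concrete, here’s a small console simulation of it. This is a sketch under assumptions, not the DLL’s actual code: TCachedThing, PumpMessages, PendingWork and SawDyingObject are all made-up names, and PumpMessages is only a stand-in for Application.ProcessMessages that runs one queued piece of "other code". Everything happens on a single thread.

```pascal
program ReentrancySketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  Classes;

type
  // Hypothetical cached object: adds itself to a global cache on creation
  // and removes itself on destruction.
  TCachedThing = class
  public
    constructor Create;
    destructor Destroy; override;
  end;

var
  Cache: TList;             // global cache of instances
  LiveInstances: Integer;   // debugging counter
  PendingWork: Boolean;     // stand-in for a message waiting in the queue
  SawDyingObject: Boolean;  // set if a handler finds a half-destroyed object

// Stand-in for Application.ProcessMessages: runs the queued handler, if any.
procedure PumpMessages;
begin
  if PendingWork then
  begin
    PendingWork := False;
    // The "other code" that looks in the cache. The object is still listed
    // even though its destructor has already started running.
    if (Cache.Count > 0) and (LiveInstances = 0) then
      SawDyingObject := True;
  end;
end;

constructor TCachedThing.Create;
begin
  inherited Create;
  Inc(LiveInstances);
  Cache.Add(Self);
end;

destructor TCachedThing.Destroy;
begin
  Dec(LiveInstances);  // counted as dead from the very top of the destructor
  PumpMessages;        // reentrancy window: handlers run while Self is still cached
  Cache.Remove(Self);
  inherited Destroy;
end;

var
  Thing: TCachedThing;
begin
  Cache := TList.Create;
  try
    Thing := TCachedThing.Create;
    PendingWork := True;  // a message arrives before the object is freed
    Thing.Free;
    if SawDyingObject then
      Writeln('handler saw an object that was already being destroyed')
    else
      Writeln('no reentrancy observed');
  finally
    Cache.Free;
  end;
end.
```

Run as written, the handler fires inside the destructor and the first message is printed. No lock can help here, because the thread that owns the lock is the same one that reenters. Taking the PumpMessages call out of the destructor closes the window.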
And the bizarre thing is, I don’t even know what those calls were there for. I figured they had to be there for a reason, so I moved things around so they wouldn’t be called until after the destructor had removed Self from the cache, but that caused other things to break. So I put it back the way it was, and tried just removing the calls to Application.ProcessMessages entirely… and everything worked fine!
Excuse me while I go scream.
OK, back. I feel better now.
Remember, folks, Application.ProcessMessages was put in as a crutch for the benefit of n00bish VB coders who didn’t know any better, to make it easier for them to convert over to Delphi. It should not actually be used in your code. OK, OK, the unit in question has a header indicating it was written in 1996 for Delphi 3, so I suppose I can cut the original author some slack. But if you ever feel like you need to use it, there’s almost certainly a better way.
For example, I actually had a point last year where I thought I needed it. My code in the main thread was getting into a deadlock waiting on something to complete in another thread, which was waiting on a call to TThread.Synchronize to return. I knew that Synchronize uses the message pump, so I considered calling ProcessMessages, but I figured there had to be a better way… and there was. Turns out there’s a routine in Classes.pas designed for exactly this purpose: CheckSynchronize. That solved my problem without having to worry about ProcessMessages and possible reentrancy headaches.
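Here’s a rough sketch of that pattern as a console program, assuming a Delphi version recent enough to have CheckSynchronize in Classes. TWorker, Report, DoneEvent and Reported are names invented for the example. The main thread waits on an event in short slices and calls CheckSynchronize in between, so the worker’s Synchronize call can complete without any message pumping.

```pascal
program CheckSynchronizeSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  {$IFDEF UNIX}cthreads,{$ENDIF}  // Free Pascal on Unix needs a thread manager
  Classes, SyncObjs;

type
  // Hypothetical worker that has to run one method on the main thread.
  TWorker = class(TThread)
  private
    procedure Report;
  protected
    procedure Execute; override;
  end;

var
  DoneEvent: TEvent;  // signalled by the worker when its work is finished
  Reported: Boolean;  // set on the main thread by Report

procedure TWorker.Report;
begin
  Reported := True;  // executes in the context of the main thread
end;

procedure TWorker.Execute;
begin
  Synchronize(Report);  // blocks until the main thread services the request
  DoneEvent.SetEvent;
end;

var
  Worker: TWorker;
begin
  DoneEvent := TEvent.Create(nil, True, False, '');
  try
    Worker := TWorker.Create(False);
    try
      // Blocking on DoneEvent alone could deadlock: the worker is stuck in
      // Synchronize until the main thread handles it. So wait in short
      // slices and service pending Synchronize calls in between.
      while DoneEvent.WaitFor(10) <> wrSignaled do
        CheckSynchronize;
      Worker.WaitFor;
    finally
      Worker.Free;
    end;
    Writeln(Reported);
  finally
    DoneEvent.Free;
  end;
end.
```

The difference from ProcessMessages is scope: CheckSynchronize only runs methods that threads have queued through Synchronize, so arbitrary window messages don’t get dispatched in the middle of whatever the main thread was doing.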
Has anyone out there found a place where they legitimately need to call ProcessMessages and there’s no better way to handle it? Let me know in the comments. 🙂