How default settings can slow down FastMM
One of the biggest challenges in working on the TURBU engine has been minimizing load times. Some large projects have a whole lot of data to work with, which could take the better part of a minute to load if I tried to load it all up front. No one wants to sit and wait for that, so I’ve pared down the loading so that only the stuff that’s needed immediately gets loaded from the project database at startup.
And yet, on one of my larger test projects, that wasn’t enough. One of the things that had to be loaded up front was map tile data, so that the maps can draw. Unfortunately, this project has over 200 different tilesets, and it was taking quite a while to load that much data. I’ve got an RTTI-based deserializer that can turn dataset records into objects, but it was taking a completely unreasonable 3.3 seconds to read the tile data.
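For context, the core of such a deserializer looks roughly like this. This is a simplified sketch of the general technique using the extended RTTI, not the actual TURBU code; LoadFromRecord is a made-up name, and it naively assumes the object’s field names match the dataset’s column names:
[code lang="Delphi"]
uses
  DB, Rtti;

{Simplified sketch: copy the dataset's current record into an object,
 matching the object's fields to dataset columns by name.}
procedure LoadFromRecord(obj: TObject; dataset: TDataSet);
var
  context: TRttiContext;
  objType: TRttiType;
  field: TRttiField;
  column: TField;
begin
  context := TRttiContext.Create;
  objType := context.GetType(obj.ClassType);
  for field in objType.GetFields do
  begin
    column := dataset.FindField(field.Name);
    if Assigned(column) then
      field.SetValue(obj, TValue.FromVariant(column.Value));
  end;
end;
[/code]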
Profiling said that most of the delay (close to 60%) was coming from FastMM’s FastFreeMem calling something in ntdll.dll. It didn’t say what, and I didn’t figure I needed to poke around inside the memory manager. I’d be better off making sure there weren’t so many calls into FastFreeMem, right?
So I poked around in the deserializer code and found several places in inner loops where strings were being created and disposed of quite unnecessarily. I fixed the code so that wouldn’t happen, optimizing away the unnecessary FreeMem calls. That should have fixed things, I figured. My reward was a measly 0.4 seconds: down from 3.3 to 2.9, with the bulk of the time still spent in ntdll.
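To give an idea of the kind of fix involved (a hypothetical example, not the actual deserializer code): a string-based field lookup buried inside a record loop churns through temporary strings on every iteration, and hoisting it out makes the matching FreeMem calls disappear.
[code lang="Delphi"]
uses
  Classes, DB;

{Hypothetical example of hoisting per-iteration string work out of a
 record loop; not the actual TURBU deserializer code.}
procedure CollectNames(dataset: TDataSet; names: TStrings);
var
  nameField: TField;
begin
  {Before the fix, this FieldByName call sat inside the loop below, so
   every record paid for the string-based lookup and the temporary
   strings it creates, each one ending in a call to FastFreeMem.}
  nameField := dataset.FieldByName('name');
  dataset.First;
  while not dataset.Eof do
  begin
    names.Add(nameField.AsString);
    dataset.Next;
  end;
end;
[/code]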
So I poked around in the FastFreeMem code a little, and was surprised to run across this:
[code lang="Delphi"]
@LockBlockTypeLoop:
  mov eax, $100
  {Attempt to grab the block type}
  lock cmpxchg TSmallBlockType([ebx]).BlockTypeLocked, ah
  je @GotLockOnSmallBlockType
{$ifndef NeverSleepOnThreadContention}
  {Couldn't grab the block type - sleep and try again}
  push ecx
  push edx
  push InitialSleepTime
  call Sleep
  pop edx
  pop ecx
  {Try again}
  mov eax, $100
  {Attempt to grab the block type}
  lock cmpxchg TSmallBlockType([ebx]).BlockTypeLocked, ah
  je @GotLockOnSmallBlockType
  {Couldn't grab the block type - sleep and try again}
  push ecx
  push edx
  push AdditionalSleepTime
  call Sleep
  pop edx
  pop ecx
  {Try again}
  jmp @LockBlockTypeLoop
  {Align branch target}
  nop
  nop
{$else}
  {Pause instruction (improves performance on P4)}
  rep nop
  {Try again}
  jmp @LockBlockTypeLoop
  {Align branch target}
  nop
{$endif}
[/code]
So when it tries to lock the small block type to free some memory, unless a special “NeverSleepOnThreadContention” compiler flag is set, it calls the WinAPI Sleep function, giving up its entire timeslice (several milliseconds) to wait out an operation that takes only a few dozen lines of ASM to complete.
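To put a rough number on that, here’s a quick throwaway program (mine, not FastMM’s) that measures what those Sleep calls cost. Each Sleep(1) burns at least a full scheduler tick, so a loop that should nominally take 100 ms takes far longer:
[code lang="Delphi"]
program SleepCost;
{$APPTYPE CONSOLE}
{Throwaway measurement: each Sleep(1) gives up the rest of the
 timeslice, so it costs a full scheduler tick, not one millisecond.}
uses
  Windows;

var
  start: Cardinal;
  i: Integer;
begin
  start := GetTickCount;
  for i := 1 to 100 do
    Sleep(1);
  Writeln('100 x Sleep(1) took ', GetTickCount - start,
    ' ms (nominally 100 ms)');
end.
[/code]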
I looked for this option in FastMM4Options.inc, and found the following explanation:
{Enable this option to never put a thread to sleep if a thread contention occurs. This option will improve performance if the ratio of the number of active threads to the number of CPU cores is low (typically < 2). With this option set a thread will enter a "busy waiting" loop instead of relinquishing its timeslice when a thread contention occurs.}
So sleeping the thread instead of spinlocking can be helpful when a large number of threads is running. But there’s no code to detect this: it either never calls Sleep or always calls Sleep, with the decision hardcoded at compile time. I wonder if it would be possible to spin for a certain number of cycles first, and only call Sleep if that doesn’t work?
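Something like this hypothetical spin-then-sleep loop, written in plain Pascal rather than FastMM’s assembler (SpinCount and the AcquireLock/ReleaseLock names are my own inventions), is what I have in mind:
[code lang="Delphi"]
uses
  Windows;

const
  SpinCount = 4000; {tuning knob, like InitializeCriticalSectionAndSpinCount}

{Hypothetical spin-then-sleep acquire for a lock flag. The
 InterlockedCompareExchange call is the Pascal-level equivalent of the
 "lock cmpxchg" in the FastMM code above.}
procedure AcquireLock(var lockFlag: Integer);
var
  spins: Integer;
begin
  repeat
    {Spin first: on a multi-core machine the current owner usually
     finishes within a few dozen instructions, so a short busy-wait
     beats giving up the whole timeslice.}
    for spins := 1 to SpinCount do
      if InterlockedCompareExchange(lockFlag, 1, 0) = 0 then
        Exit;
    {Still contended after spinning; now Sleep is actually worth it.}
    Sleep(1);
  until False;
end;

procedure ReleaseLock(var lockFlag: Integer);
begin
  InterlockedExchange(lockFlag, 0);
end;
[/code]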
Anyway, it turns out that was exactly my problem. I was doing some other data-intensive loading in a background thread, and the two threads’ memory allocations were clashing with each other. When I set the NeverSleepOnThreadContention flag and rebuilt, the load time for tile data dropped to a far more acceptable 1.1 seconds.
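For the record, setting the flag just means activating the dormant define in FastMM4Options.inc, which (if I recall the file’s layout correctly) marks disabled options with a dot before the $:
[code lang="Delphi"]
{Before: the dot keeps the define inactive}
{.$define NeverSleepOnThreadContention}

{After: remove the dot to enable it}
{$define NeverSleepOnThreadContention}
[/code]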
I predict that you’ll get even more benefit from a scalable memory manager. Try the suggestions I made on yesterday’s Stack Overflow question on the subject. Enabling spinning rather than sleeping in FastMM is just a band-aid.
@David, I’d say a band-aid that gets the load time from 2.9s down to 1.1s looks very much like an acceptable solution to the problem at hand. Still, I would like to see what gain your solution would provide in this situation; if it goes down to 0.5s or less, it could be worth taking the *risk* of bringing in a new MM.
@Mason, thanks for sharing and showcasing these fundamental principles: profile, get numbers, read the source…
When the number of processors increases, this spin-when-busy band-aid shows its true colours! A scalable allocator is what is needed.
There are APIs like InitializeCriticalSectionAndSpinCount() (http://msdn.microsoft.com/en-us/library/ms683476(VS.85).aspx) that could help in situations like this, but it is not available on Win9x. Maybe FastMM could copy the idea. It could be difficult for the MM to know how many threads are running, and as far as I know that code can deadlock (see QC #76832) in some situations if a memory operation fails.