TURBU Tech

The next RTTI bottleneck

March 1, 2013, 1:01 am

A few years back, when I posted an analysis of how TValue is very slow, it prompted a lot of response from the community. Various people ran their own benchmarks, and started working on building or optimizing their own meta-value types. Some people are even still working on that today. But one of the most interesting things was Robert Love’s response. He looked at the TValue code and found a way that it could be optimized for the common case to speed things up.

I built on his foundation and sent a suggested patch to the Delphi team. They made their own tweaks, and newer versions of Delphi have had a much faster TValue because of that. But one of the most interesting things I heard while I was working on that improvement came from Barry Kelly. He said that he wasn’t sure how much those speedups would actually help, because the bulk of your CPU time in RTTI work was going to be spent in invocation (using the RTTI system to call methods) and not in moving data in and out of TValue variables.

And he was right. If you wanna really break your brain sometime, trace into a call to TRttiMethod.Invoke and take a look at what all is taking place under the hood. That’s a huge amount of work going on, and the interesting thing is how much of it will be exactly the same every time you invoke the same method, assuming you have a valid parameter list.
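For reference, here is a minimal sketch of the kind of call being discussed. `TMyClass` and `FuncA` are placeholder names chosen for illustration, and the scoped unit names assume Delphi XE2 or later. The lookup is done once up front, but everything inside `Invoke` still runs on every call.

[code lang="delphi"]
program InvokeBaseline;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Rtti;

type
  // Placeholder class for illustration only.
  TMyClass = class
  public
    function FuncA(value: Integer): Integer;
  end;

function TMyClass.FuncA(value: Integer): Integer;
begin
  Result := value * 2;
end;

var
  ctx: TRttiContext;
  method: TRttiMethod;
  obj: TMyClass;
  ret: TValue;
begin
  ctx := TRttiContext.Create;
  obj := TMyClass.Create;
  try
    // Find the method by name once and keep the TRttiMethod around.
    method := ctx.GetType(TMyClass).GetMethod('FuncA');
    // Each Invoke call checks the arguments and builds the native
    // parameter list again, even though the method has not changed.
    ret := method.Invoke(obj, [21]);
    Writeln(ret.AsInteger);
  finally
    obj.Free;
  end;
end.
[/code]

Tracing into that single `method.Invoke` line in the debugger is the quickest way to see how much setup is repeated per call.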

To do it quickly, you have to recompute as little of that setup code as possible every time. The fastest way would be to write your own invocation routines, something like this:

[code lang="delphi"]
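// Hand-written invoker for one specific method: TMyClass.FuncA(Integer).
function InvokeFuncA(var params: TArray<TValue>; const self: TValue): TValue;
begin
   // Unpack the instance and the argument, call directly, box the result.
   result := (self.AsObject as TMyClass).FuncA(params[0].AsInteger);
end;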
[/code]

Of course, that’s a static invocation routine which only works for one method, which is the exact opposite of what TRttiMethod.Invoke does: provides you with a generic invocation routine that works for any method. But it’s fast.

So… what if we could find a way to get TRttiMethod to create something like that for us, at runtime? Instead of doing a bunch of work every time you call it to figure out what goes into memory where in order to convert your list of TValues into a native parameter list for a method, maybe it could work out how to perform that mapping, and express it in machine code.

It’s not as strange as it sounds. TMethodImplementation already does a certain amount of machine code generation at runtime to make some of the fancier RTTI tricks possible. And with RTTI being used in more and more places in Delphi (it’s all over the place in LiveBindings, for example), the last thing anyone wants is for it to be slow.
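To make the "work out the mapping once" idea concrete without any code generation, here is a rough sketch that swaps in a simpler technique: it inspects the TRttiMethod a single time and returns a closure that calls the method's code address directly. This is not the JIT approach, it handles exactly one signature shape (a non-virtual instance `function(Integer): Integer` using the default register calling convention), and `TFastInvoker`, `TIntToIntMethod`, `BuildInvoker` and `TMyClass` are all names invented for the example.

[code lang="delphi"]
program CachedInvoker;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.TypInfo, System.Rtti;

type
  // Hypothetical type: a prebuilt invoker bound to one method.
  TFastInvoker = reference to function(instance: TObject;
    const args: TArray<TValue>): TValue;

  // The one native signature this sketch knows how to call.
  TIntToIntMethod = function(value: Integer): Integer of object;

  // Placeholder class for illustration only.
  TMyClass = class
  public
    function FuncA(value: Integer): Integer;
  end;

function TMyClass.FuncA(value: Integer): Integer;
begin
  Result := value * 2;
end;

function BuildInvoker(method: TRttiMethod): TFastInvoker;
var
  code: Pointer;
  params: TArray<TRttiParameter>;
begin
  params := method.GetParameters;
  // All signature checks happen here, once, instead of on every call.
  if (method.MethodKind <> mkFunction)
    or (method.DispatchKind <> dkStatic)
    or (method.CallingConvention <> ccReg)
    or (Length(params) <> 1)
    or (params[0].Flags <> [])
    or (params[0].ParamType.Handle <> TypeInfo(Integer))
    or (method.ReturnType = nil)
    or (method.ReturnType.Handle <> TypeInfo(Integer)) then
    raise Exception.Create('Signature not covered by this sketch');

  code := method.CodeAddress;
  Result :=
    function(instance: TObject; const args: TArray<TValue>): TValue
    var
      fn: TIntToIntMethod;
    begin
      // Assemble a method pointer from the cached code address
      // and the instance, then make an ordinary native call.
      TMethod(fn).Code := code;
      TMethod(fn).Data := instance;
      Result := fn(args[0].AsInteger);
    end;
end;

var
  ctx: TRttiContext;
  invoker: TFastInvoker;
  obj: TMyClass;
begin
  ctx := TRttiContext.Create;
  obj := TMyClass.Create;
  try
    invoker := BuildInvoker(ctx.GetType(TMyClass).GetMethod('FuncA'));
    Writeln(invoker(obj, TArray<TValue>.Create(TValue.From<Integer>(21))).AsInteger);
  finally
    obj.Free;
  end;
end.
[/code]

A general solution would need one such thunk per signature shape, or generated machine code, which is exactly the hard part. The sketch only shows where the per-call savings would come from.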

So, I’d like to challenge the community to step up to the plate again. Is there anyone out there who knows enough about low-level code to build what would essentially be a JIT compiler that takes a TRttiMethod as input and outputs a shim for fast invocation? I’ll be poking at things from one angle, but I’d like to invite anyone else who’s interested to help out. Let’s see how fast we can get RTTI invocation.

Tags: Dark Corners, Delphi, Optimization, RTTI

15 Comments

Iztok Kacin says:

March 1, 2013 at 2:58 am

Stop it, I am still working on your previous TValue findings 🙂

Joking aside, I find it a little sad that the Delphi community had to fix TValue. I am afraid it would otherwise have stayed as it was forever. But even with that fix included, it is still just slow. TAnyValue is way faster, and I am building a little framework on top of it, more as a personal challenge than anything else.

Would be nice to see if someone can do something about Invoke. I bet Eric would be able to 🙂

Stefan Glienke says:

March 1, 2013 at 5:19 am

@Iztok Kacin: Depending on what you are testing, TValue is *not* slower than TAnyValue. Running your profiling project with DoInteger, DoExtended and DoInteger64 set to True tells me that TValue is ahead of TAnyValue by approx. 10% (only on versions after XE, though, because XE did not have the optimizations for the Implicit operator overloads and the AsXXX methods). If you apply these fixes to 2010 and XE (which I did; I will publish them soon), TValue is also faster on these versions.

However, when strings come into play TAnyValue gains performance, *but* only when you are reusing the same T(Any)Value variable for all types, because then the TValueDataImpl instance gets created and destroyed all the time, which causes that huge performance impact.

Iztok Kacin says:

March 1, 2013 at 5:45 am

@Stefan

It is faster. Forget the code you were testing when we had the debate. Look at this post:

http://www.cromis.net/blog/2013/02/tanyvalue-an-attempt-to-make-the-best-variable-data-container/

I hooked FinalizeRecord and its friends, so now not only is TAnyValue way faster, it also takes only 9 bytes on both x32 and x64. TValue has a big memory footprint. But even after that I made improvements to it, feature-wise and speed-wise. Just an hour ago I made the final commits to my internal SVN. I will publish the changes today or over the weekend. Look at my last post on array handling built right into TAnyValue. I also experimented with an improved dynamic array implementation, improving deletes and inserts by a factor of 100.

It's not only speed, it's also the power of use. When I am done, TAnyValue will do all that TValue does and have great array support built right into it (way better than TList). There are also flexible hash classes that are very fast and can store “anything” as a value.

To be honest, the speed tests I used are flawed in a way. They only measure assigning values; they do not measure memory management. That means they do not measure creation and destruction of values. I am still surprised how fast variants are if used correctly, but I dislike how they work, and they also have a big memory footprint. My aim is to make a flexible, fast and memory-friendly container with rich supporting infrastructure behind it.

Stefan Glienke says:

March 1, 2013 at 6:53 am

@Iztok Kacin: I was using the code you posted in exactly that post.

With my fixes for 2010 and XE (which are basically the same as they did in XE2) it’s (TValue vs TAnyValue):

6377 vs 3312 using all 4 types
780 vs 1058 when not using string
2950 vs 2700 when using different variables for each type (so TValue does not create and destroy the TValueDataImpl instance all the time – only in my fix)
780 vs 520 when not using string and different variables (no performance gain here for TValue because Integer, Int64 and Extended are directly stored in the record anyway)

The performance gain from assigning and retrieving values is most important to me, since when using the RTTI in many places the implicit operators are often used (for example, if the caller passes arguments in an array of TValue).

Stefan Glienke says:

March 1, 2013 at 6:57 am

@Mason: I think using the Invoke or even RawInvoke routine (which is only in the implementation part) would provide some performance gain already without the need to dynamically create asm stubs at runtime.

Iztok Kacin says:

March 1, 2013 at 7:14 am

@Stefan

That is interesting. I will wait until you release your fixes and do the tests again. I tested on XE3 and there TAnyValue was faster for all data types as you can see from my post. I am interested what you have done 🙂

V. Antonov says:

March 1, 2013 at 8:37 am

Mr Kelly transformed Delphi into a bloated pig… and left.

- Stefan Glienke says:
  
  March 1, 2013 at 9:24 am
  
  I guess you are one of the ever living in the past Delphi 7 users?
  
  Barry added features to the language that make it able to keep up at least a bit with other languages.
  How they are implemented sometimes is another story, and not his fault but a problem of the annual release cycle.
  
- Iztok Kacin says:
  
  March 1, 2013 at 9:35 am
  
  Barry was a good engineer and proved that many times with his posts and answers on SO. I am certain that if he had more freedom he would make some things better than they currently are.
  
ObjectMethodology.com says:

March 1, 2013 at 11:29 am

More good analysis, thx.

Isopod says:

March 1, 2013 at 8:08 pm

Hi,

this is unrelated to this article, but I couldn’t find a general contact form. I just stumbled across your blog and noticed it has spam at the top with JavaScript disabled:

http://i.imgur.com/2IsHfcK.png

You might want to take action…

(Sorry for withholding my real email address, but since your site has apparently been hacked, I’m not sure how safe my personal information really is here…)

Greetings,
Isopod

- Stuart Kelly says:
  
  March 2, 2013 at 5:20 pm
  
  Hi Isopod, I noticed that too, yesterday. I could not find a contact page or email, so I sent a PM to Mason Wheeler on Embarcadero Discussion Forums.
  
Nenad says:

March 3, 2013 at 4:37 pm

I think that emballo (the DI framework) tried to do something similar with BeaEngine, although the author seems to have stopped development in Jan 2012, and I don’t know how far this was implemented.

https://bitbucket.org/magnomp/emballo
From their googlecode website:
currently, the project cover areas such as mocking, dependency injection, dynamic proxies, etc.

Arioch says:

March 21, 2013 at 6:43 am

I believe direct codegen would be questionable behavior.

It would not be portable (Mobile Studio / Android? FPC?)
It would be insecure (Windows 8 market?)

But I think the invocation can be reduced to a fixed set of choices.

How to invoke, in other words the calling convention.
Then how the parameters should be prepared on the stack, one by one?

It comes down to http://en.wikipedia.org/wiki/Threaded_code
And I remember reading about a .NET 1.x implementation called Portable.NET. They used the same approach, and before JITting emerged, that interpreter both worked very fast and was ported to new architectures in very short time frames.

Well, that would probably require careful crafting of register parameter passing…
Either Turbo Pascal style interrupt calling (via a register snapshot variable, simple but not very fast), or maybe parameter passing can be implemented while preserving all the used registers, dunno.

Arnaud Bouchez says:

November 21, 2015 at 8:02 am

Why not just get rid of RTTI use for method invocation, and use plain good “interface” types instead?

Using RTTI is a tricky business… which should be exceptional.
Relying on RTTI at runtime, for usual code, smells to me like a violation of the Liskov Substitution Principle (and the Open/Closed Principle).

Most of the time, code would be much cleaner when using proper OOP and interfaces, instead of relying on RTTI and TValue.

It is not just my own opinion.
In fact, Microsoft, in its newest orientation toward “Native” compilation in .NET, does advise avoiding RTTI (ab)use.
See https://msdn.microsoft.com/en-us/library/dn600640

