Friday, April 07, 2006
3D rendering speed now up to twice as fast!
As some of you already know, 3D rendering speed went up by a factor of 1.4-2.0 on all supported GeForce style cards. TNT2(M64) render 1-4% faster, which is not very impressive compared to the GeForce speedup, but is nevertheless noticeable on some occasions.
The 3D driver renders at approx. 40% of the Windows driver speed for TNTx style cards, if the Windows driver uses the same setup as our BeOS driver: blits for swapbuffers, 16-bit texturing and disabled retrace-sync. For GeForce style cards, the driver now runs at approx. 30% of the Windows driver speed.
Needless to say, I am very happy with the big improvement in speed we now see. If you compare different cards for speed using Windows, and you put those results next to a speed comparison using BeOS, you'll find that the relative scores are starting to match! In other words, if a certain card performs fastest in Windows, it will now also perform fastest in BeOS.
Hey, what happened?
Well, after seeing the results of the delta-speed test I did, I (once again) did a sweep of the engine init code in the 2D driver, only this time in much more detail than ever before. And then I found what I needed: just a few tiny bits needed toggling to get the new speeds! I don't really know what they do, though. We had sub-optimal settings, especially for GeForce cards. When you think about it, this could have been expected, as the UtahGLX driver my work is based on is very old: it was probably first created in the pre-GeForce era.
Anyhow, I did the sweep on all cards I have here: TNT1, TNT2, TNT2-M64, GeForce2MX, GeForce2Ti, GeForce 4MX440, and GeForce 4MX4000. I am very sure the init setup code is now optimal speed-wise. Of course I also looked for degraded quality, but I didn't see any. This, combined with the speed comparison against Windows (for all those cards), leads me to believe this should be perfectly OK now.
So, thinking about the engine's inner workings: those ROPs in there (parallel raster outputs) are all working already. That means two on most cards, and four on the GeForce 2Ti cards. The new speed comparisons on BeOS confirm this (more or less).
Release?
I'll release a new 2D driver ASAP: probably current SVN version 0.79. I'll also recompile the Alpha4 3D add-on against it and publish the combination as well, as Alpha 4.1. Note: the 3D driver has not changed one bit; it's just that the shared_info in the 2D driver was expanded with a new (unrelated) nv.setting, and that's why the recompile is needed.
Furthermore, I'll publish a new benchmark table with all those results on both Windows and BeOS. It is still done with Quake2 timedemo 1 (with sound running), as I always did, so you can easily compare it to those old Alpha 2 benchmarking results I published.
The new benchmark table will be on the driver's homepage (3D news page), instead of here though. Just so you know.
So, why are we not yet at 100% of 'Windows' speed?
Well, there are a number of things to account for that.
1. The acceleration engines are still idle at times during rendering, even on the faster CPU systems. This has to do with the bursty nature of the vertices being sent, and can only be fixed by using a higher-level interface to Mesa. Luckily Mesa 3.4.2 turns out to have this interface, so I'll try to interface to it for a later Alpha release. This interface sends entire glBegin()/glEnd() sequences in one call to the driver, instead of the current one-triangle-per-call system. Furthermore, this higher-level interface is needed for current Mesa as well, so interfacing to that would be a 'translation step' for me: which is good. :-)
Having this higher-level interface in place might very well increase rendering speed by a factor of 1.5 or so, and even (much) more on slower CPU systems. Of course I'm guessing a bit about the possible speed gain at this point.
2. On TNT style cards, no more gain is possible: this high-level Mesa interface needs to be translated down to the current low-level hardware interface, which means sending separate triangles.
On GeForce style cards however, this higher-level interface Mesa has is also supported in the hardware! This means that in theory it's possible to tell the engine what type of primitives we want to render (at glBegin() time), like separate triangles, tri-strips, tri-fans, and all other styles glBegin() supports. The nice thing about this is, of course, that we don't need to send separate triangles to the engine, but just the vertices needed to describe the primitive we want to render (so literally like the app that sends the commands does).
Seems that this would save the driver sending tremendous amounts of redundant vertices: after all, in a real scene a lot of triangles have sides in common.
(Example: a quad. Sending separate triangles means we need to send 6 vertices, while a quad by itself only needs 4 vertices to describe it.)
(Example 2: a cube. There are 8 vertices needed to describe it. Now break it down into separate triangles, and we'd need 36. Right?)
Well, this would increase rendering speed: can't miss. And the (very) good news is, that we might even be able to get this hardware interface up and running! Thanks to a very new open nVidia DRI driver attempt that is...
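To put some numbers on the two points above, here is a minimal Python sketch of the counting involved. It is not taken from the driver or from Mesa: the function names and the string mode names are made up for illustration, and triangle winding order is ignored because only the counts matter here.

```python
# Illustrative sketch only: hypothetical helper names, not driver or Mesa code.

def triangles_from_primitive(mode, vertices):
    """Break one glBegin()/glEnd() style primitive into separate triangles.

    Winding order is ignored; we only care about how many triangles
    (and therefore vertices) end up being sent.
    """
    v = vertices
    if mode == "triangles":
        return [tuple(v[i:i + 3]) for i in range(0, len(v) - 2, 3)]
    if mode == "triangle_strip":
        return [(v[i], v[i + 1], v[i + 2]) for i in range(len(v) - 2)]
    if mode == "triangle_fan":
        return [(v[0], v[i], v[i + 1]) for i in range(1, len(v) - 1)]
    if mode == "quads":
        tris = []
        for i in range(0, len(v) - 3, 4):
            a, b, c, d = v[i:i + 4]
            tris += [(a, b, c), (a, c, d)]  # two triangles per quad
        return tris
    raise ValueError("unsupported primitive mode: " + mode)


def vertices_sent_separately(mode, vertices):
    """Vertices sent when every triangle is sent on its own (3 each)."""
    return 3 * len(triangles_from_primitive(mode, vertices))


def driver_calls_per_triangle(mode, vertices):
    """Driver calls in a one-triangle-per-call scheme."""
    return len(triangles_from_primitive(mode, vertices))


# A single quad: 4 vertices from the app, 6 when split into triangles.
quad = ["v0", "v1", "v2", "v3"]
print(len(quad), vertices_sent_separately("quads", quad))

# A cube drawn as six quads: 24 vertices from the app, 36 when split.
cube_faces = ["f%d_v%d" % (f, k) for f in range(6) for k in range(4)]
print(len(cube_faces), vertices_sent_separately("quads", cube_faces))

# One-triangle-per-call needs 12 driver calls for the cube,
# a whole-sequence interface needs 1.
print(driver_calls_per_triangle("quads", cube_faces))
```

Note that the 8 in the cube example counts unique corners. An application drawing the six faces as quads through glBegin() would pass 24 vertices, which is still a lot fewer than the 36 needed for separate triangles. Strips and fans save more: a strip of n triangles needs only n + 2 vertices instead of 3n.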
How about maximizing effective use of the RAM-datapath-width?
We talked about this a bit: about how fetching a single point (pixel) would waste valuable bandwidth, and how a crossbar memory controller could improve effectiveness by using smaller 'lanes'. Remember? Well, that crossbar controller was invented for GeForce3 and later: so we can't be suffering from that.
No. It turns out that it's like this:
- the hardware already maximises effective use of bandwidth! After all, we are not drawing single pixels, we are drawing 'entire' triangles! These consist of a large number of pixels (most of the time), so the engine can 'auto-optimize' access by doing parallel pixel rendering!
- a crossbar memory controller only comes in handy when you render lots of very small triangles: here the internal engine parallelisation fails (we deal with just one, or a few pixels). So, this crossbar controller is needed for next-generation games, where many more (smaller) triangles make up a scene. It all makes sense now, no?
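The two points above can be illustrated with a toy calculation. This is only a rough sketch under assumed numbers: the 128-bit bus, the 32-bit crossbar 'lane' and the 16-bit pixel are illustration values and not specifications of any particular card, and the model simply assumes that the pixels of a triangle span sit next to each other in memory, aligned to the access size.

```python
# Toy model with assumed numbers; not a description of real hardware.

def bus_utilisation(useful_bits, access_bits):
    """Fraction of the fetched bits that is actually used.

    Memory is assumed to be accessed in whole chunks of access_bits.
    """
    accesses = -(-useful_bits // access_bits)  # ceiling division
    return useful_bits / (accesses * access_bits)


PIXEL_BITS = 16   # assumed 16-bit colour
FULL_BUS = 128    # assumed width of one undivided memory access
LANE = 32         # assumed width of one crossbar lane

# One lonely pixel (a very small triangle):
print(bus_utilisation(1 * PIXEL_BITS, FULL_BUS))   # undivided bus
print(bus_utilisation(1 * PIXEL_BITS, LANE))       # crossbar lane

# A span of 64 neighbouring pixels (part of a big triangle):
print(bus_utilisation(64 * PIXEL_BITS, FULL_BUS))
print(bus_utilisation(64 * PIXEL_BITS, LANE))
```

With these assumed values a single pixel uses only an eighth of an undivided access but half of a lane access, while a long span fills the bus completely either way. That matches the argument: large triangles keep the wide bus busy by themselves, and narrower lanes only start to pay off once triangles shrink to a few pixels.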
So, we don't have more bottlenecks than the two described just now, apart from probably some extra Mesa software overhead caused by the fact that we will 'never' be able to utilize all hardware tweaks and features that could exist in the cards: lack of docs (and time), as usual.
Personally, I can live with that. I mean, if those two bottlenecks could be fixed, I'd be very satisfied. Ah, I'm happy with what we already have now as well... ;-)
Have fun!