Tuesday, April 18, 2006
3D driver Alpha 4.1 / 2D driver 0.80 benchmarks
As promised, I posted the benchmark results on the nVidia 3D news page for 3D driver Alpha 4.1 combined with 2D driver 0.80. Have a look there for the details. I've placed a link in the left column of this blog page that will take you there.
About my driver's pages:
I'll try to update them a bit in the coming time, as they're all very much outdated these days. I'll probably take down some stuff and point at Haiku's Bugzilla and the Bebits download entries instead, since that's more official anyway and saves me updating effort at the same time.
That's it for now. Seems I can now work a bit more on the videonode-related stuff. See you later! :-)
Friday, April 07, 2006
3D rendering speed now up to twice as fast!
As some of you already know, 3D rendering speed went up by a factor of 1.4-2.0 on all supported GeForce style cards. TNT2 (M64) cards render 1-4% faster, which is not very impressive compared to the GeForce speedup, but nevertheless noticeable on some occasions.
The 3D driver renders at approx. 40% of the Windows driver speed for TNTx style cards, if the Windows driver uses the same setup as our BeOS driver: blits for swapbuffers, 16-bit texturing and disabled retrace-sync. For GeForce style cards, the driver now runs at approx. 30% of the Windows driver speed.
Needless to say, I am very happy with the big improvement in speed we now see. If you compare different cards for speed using Windows, and you put those results next to a speed comparison using BeOS, you'll find that the relative scores are starting to match! In other words, if a certain card performs fastest in Windows, it will now also perform fastest in BeOS.
Hey, what happened?
Well, after finding the results of the delta-speed test I did, I (once again) did a sweep of the engine init code in the 2D driver. Only this time, much more detailed than ever before. And then I found what I needed: just a few tiny bits needed toggling to get the new speeds! I don't really know what they do, though... We had sub-optimal settings, especially for GeForce cards. When you think about it, this could have been expected, as the UtahGLX driver my work is based on was very old: probably first created in the pre-GeForce era.
Anyhow, I did the sweep on all the cards I have here: TNT1, TNT2, TNT2-M64, GeForce2 MX, GeForce2 Ti, GeForce4 MX440, and GeForce4 MX4000. I am very sure the init setup code is now optimal speed-wise. Of course I also watched for degraded quality, but I didn't see any. This, combined with the speed comparison against Windows (for all those cards), leads me to believe it should be perfectly OK now.
So, thinking about the engine's inner workings: those ROPs in there (parallel raster outputs) are all working already. That's two on most cards, and four on the GeForce2 Ti cards. The new speed comparisons on BeOS confirm this (more or less).
Release?
I'll release a new 2D driver ASAP: probably the current SVN version, 0.79. I'll also recompile the Alpha 4 3D add-on against it and publish the combination as Alpha 4.1. Note: the 3D driver has not changed one bit; it's just that the shared_info in the 2D driver was expanded with a new (unrelated) nv.setting, which is why the recompile is needed.
Furthermore, I'll publish a new benchmark table with all those results on both Windows and BeOS. It's still done with Quake2 timedemo 1 (with sound running), as I always did, so you can easily compare it to those old Alpha 2 benchmarking results I published.
The new benchmark table will be on the driver's homepage (3D news page), instead of here though. Just so you know.
So, why are we not yet at 100% of 'Windows' speed?
Well, there are a number of things that account for that.
1. The acceleration engines are still idle at times during rendering, even on the faster CPU systems. This has to do with the bursting nature of the vertices being sent, and can only be fixed by using a higher-level interface to Mesa. Luckily, Mesa 3.4.2 turns out to have this interface, so I'll try to hook up to it for a later Alpha release. This interface sends entire GLbegin()/GLend() sequences in one call to the driver, instead of the current one-triangle-per-call system. Furthermore, this higher-level interface is needed for current Mesa as well, so interfacing to it would be a 'translation step' for me: which is good. :-)
Having this higher-level interface in place might very well increase rendering speed by a factor of 1.5 or so: even (much) more on slower CPU systems. Of course, I'm guessing a bit about the possible speed gain at this point.
2. On TNT style cards, no more gain is possible beyond that: the high-level Mesa interface needs to be translated back down to the current low-level hardware interface, sending separate triangles.
On GeForce style cards however, this higher-level interface Mesa has is also supported in the hardware! This means that in theory it's possible to tell the engine what type of primitive we want to render (at GLbegin() time): separate triangles, tri-strips, tri-fans, and all the other styles GLbegin() supports. The nice thing about this is of course that we don't need to send separate triangles to the engine, but just the vertices needed to describe the primitive we want to render (so literally like the app that issues the commands does).
It seems this would save the driver from sending tremendous amounts of redundant vertices: after all, in a real scene a lot of triangles have sides in common.
(Example: a quad. Sending separate triangles means we need to send 6 vertices, while the quad by itself only needs 4 vertices to describe it.)
(Example 2: a cube. Only 8 vertices are needed to describe it. Break it down into separate triangles and we'd need 36. Right?)
Well, this would increase rendering speed: it can't miss. And the (very) good news is that we might even be able to get this hardware interface up and running, thanks to a very new open nVidia DRI driver attempt... (The little sketch below works out those vertex counts.)
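To make those vertex counts concrete, here's a minimal C sketch of the bookkeeping. All the names in it are hypothetical: this is not the actual Mesa or driver interface, it just counts vertices under the assumptions above.

/* Rough sketch only -- hypothetical names, not the real Mesa or driver
 * interface.  It counts how many vertices reach the engine when a
 * GLbegin()/GLend() run is broken down into separate triangles (the
 * current low-level path, and the only option on TNT class engines)
 * versus handed to the engine as a native primitive (what GeForce class
 * engines could accept). */
#include <stdio.h>

typedef enum { PRIM_TRIANGLES, PRIM_TRI_STRIP, PRIM_TRI_FAN } prim_type;

/* vertices sent when everything is decomposed into separate triangles */
static int verts_as_separate_triangles(prim_type type, int vertex_count)
{
    int triangles = 0;
    switch (type) {
        case PRIM_TRIANGLES: triangles = vertex_count / 3; break;
        case PRIM_TRI_STRIP:    /* fall through: both hold n-2 triangles */
        case PRIM_TRI_FAN:   triangles = vertex_count - 2; break;
    }
    return 3 * triangles;    /* 3 vertices per separate triangle */
}

/* vertices sent when the engine consumes the primitive type natively */
static int verts_as_native_primitive(int vertex_count)
{
    return vertex_count;
}

int main(void)
{
    /* a quad drawn as a 4-vertex strip: 6 vertices versus 4 */
    printf("quad: %d vs %d\n",
        verts_as_separate_triangles(PRIM_TRI_STRIP, 4),
        verts_as_native_primitive(4));

    /* a cube: 12 triangles, so 36 vertices the current way; getting it
     * down to the 8 unique vertices would additionally need some form
     * of vertex sharing/indexing -- an assumption on my part */
    printf("cube: %d vs %d\n", 3 * 12, 8);
    return 0;
}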
How about maximizing effective use of the RAM datapath width?
We talked about this a bit: about how fetching a single point (pixel) would waste valuable bandwidth, and how a crossbar memory controller could improve effectiveness by using smaller 'lanes'. Remember? Well, that crossbar controller was only introduced with the GeForce3 and later, so that can't be what we're suffering from here.
No. It turns out that it's like this:
- The hardware already maximises effective use of bandwidth! After all, we are not drawing single pixels, we are drawing 'entire' triangles! These consist of a large number of pixels (most of the time), so the engine can 'auto-optimize' access by doing parallel pixel rendering.
- A crossbar memory controller only comes in handy when you render lots of very small triangles: there the internal engine parallelisation fails (we deal with just one pixel, or a few). So this crossbar controller is needed for next-generation games, where many more (smaller) triangles make up a scene. Makes sense now, no? (A tiny worked example follows below.)
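Here's a tiny back-of-the-envelope sketch of that in C. The 128-bit bus width, the four 32-bit crossbar lanes and the 16-bit pixels are assumptions chosen to keep the arithmetic simple, not measured hardware behaviour.

/* Illustration only: how much of one memory transaction is actually
 * useful when touching a single pixel.  The 128-bit bus, the 4 x 32-bit
 * crossbar split and the 16-bit pixels are assumptions for this sketch. */
#include <stdio.h>

int main(void)
{
    const int bus_bits       = 128;    /* one monolithic memory controller */
    const int lane_bits      = 32;     /* one lane of a 4-lane crossbar    */
    const int pixel_bits     = 16;     /* 16-bit colour                    */
    const int pixels_touched = 1;      /* a 'triangle' covering one pixel  */

    int used = pixels_touched * pixel_bits;

    printf("monolithic: %d of %d bits useful (%.1f%% wasted)\n",
        used, bus_bits, 100.0 * (bus_bits - used) / bus_bits);
    printf("crossbar lane: %d of %d bits useful (%.1f%% wasted)\n",
        used, lane_bits, 100.0 * (lane_bits - used) / lane_bits);

    /* With big triangles the engine touches long runs of adjacent
     * pixels, so nearly every bit of the wide access is useful and the
     * crossbar buys you little -- which is exactly the point above. */
    return 0;
}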
So, we don't have more bottlenecks than the two just described, apart from probably some extra Mesa software overhead caused by the fact that we will 'never' be able to utilize all the hardware tweaks and features that might exist in the cards: lack of docs (and time), as usual.
Personally, I can live with that. I mean, if those two bottlenecks could be fixed, I'd be very satisfied. Ah, I'm happy with what we already have now as well... ;-)
Have fun!
Monday, April 03, 2006
RAM access bandwidth test for 3D in nVidia driver
I promised to tell you about that 'delta speed test' I did with the 3D driver, which I did to get a better idea of how fast the RAM can actually be accessed by the 3D part of the acceleration engine. I considered this interesting because the 3D driver was rather slow compared to its Windows counterpart.
So how did I test?
It was rather simple, really. In my previous post I already 'calculated' some numbers for RAM bandwidth, and for the bandwidth the CRTC needs to fetch the data that shows us the memory content on the monitor. So I thought: if I can ascertain how many fps are gained by stopping CRTC accesses to the memory, I also know how many fps are theoretically feasible if the engine could use the complete bandwidth.
In a formula
nominal fps possible = (total RAM bandwidth / bandwidth needed by CRTC accesses) * (fps without CRTC accessing memory - fps with CRTC accessing memory)
The setup
Just modify the 2D driver to enter DPMS sleep mode as soon as the cursor is turned off, and to go back to DPMS on mode when the cursor is turned back on. DPMS sleep mode in fact puts the CRTC in a 'reset' state, since it's not required to fetch any data anymore: we won't be looking at it anyway (the monitor is shut off). So this DPMS sleep state gives the memory bandwidth otherwise used by the CRTC back to the 3D engine.
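For the curious, here's a minimal sketch of what that test hack could look like inside the accelerant's cursor hook. SHOW_CURSOR and SET_DPMS_MODE are the standard BeOS accelerant hooks; wiring them together like this is just my experiment, and the real code in the nv accelerant looks different.

/* Sketch of the temporary test hack, not shipping driver code.  The
 * cursor hook is abused to toggle DPMS, so starting Quake2 (which hides
 * the cursor) also blanks the CRTC and frees its memory bandwidth. */
#include <Accelerant.h>

extern status_t SET_DPMS_MODE(uint32 dpms_flags);    /* existing DPMS hook */

void SHOW_CURSOR(bool is_visible)
{
    if (is_visible) {
        /* game quit: wake the CRTC up again */
        SET_DPMS_MODE(B_DPMS_ON);
        /* ... the normal 'show hardware cursor' code goes here ... */
    } else {
        /* game start: put the CRTC to sleep so it stops fetching the
         * frame buffer; the monitor blanks and the saved bandwidth
         * goes to the 3D engine */
        SET_DPMS_MODE(B_DPMS_OFF);
        /* ... the normal 'hide hardware cursor' code goes here ... */
    }
}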
Since:
- you can start Quake2 from a terminal 'command line', including instructing it to do a timedemo test,
- you see the fps results back on the command line after quitting the game, and
- starting Quake2 turns off the driver's hardware cursor, while stopping it turns it back on,
This setup will work nicely.
Result for the GeForce4 MX, NV18
(6000 / 225) * (29.9 - 26.3) = 96 fps. (Windows measured value = 119 fps)
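If you want to play with the numbers yourself, here's the same calculation as a tiny C snippet. The bandwidth figures are the ones from my previous post (presumably MB/s; only their ratio matters here).

/* Reproduces the delta-speed estimate above for the NV18 card. */
#include <stdio.h>

int main(void)
{
    const double total_bw = 6000.0;  /* total RAM bandwidth                */
    const double crtc_bw  = 225.0;   /* bandwidth eaten by the CRTC fetch  */
    const double fps_dark = 29.9;    /* timedemo fps with CRTC in DPMS off */
    const double fps_live = 26.3;    /* timedemo fps with CRTC running     */

    /* fps gained per unit of bandwidth freed, scaled up to the full bus */
    double nominal_fps = (total_bw / crtc_bw) * (fps_dark - fps_live);

    printf("nominal fps possible: %.0f\n", nominal_fps);    /* prints 96 */
    return 0;
}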
Well, for my taste this already proves that RAM access is actually OK, and that the fault is not low-clocked RAM or something like that. So the reason for the low fps on BeOS has to be found somewhere else.
Conclusion
Why is the delta speed from the CRTC test actually OK, while the total rendering speed is much too low? Well, it's interesting to realize that CRTC accesses are spread evenly through time, while 3D engine memory access requests are of a bursting nature.
The conclusion would be that there's some bottleneck in the GPU somehow after all... And with this new knowledge I went to sleep, not knowing yet what to make of this new information.