Saturday, February 18, 2006
nVidia 3D: back on Mesa 3.4.2
While searching for a good way to implement the delayed swap I mentioned earlier as a fix for the drawing errors, I once again compared Mesa 3.2.1 and 3.4.2. One of the differences between them turns out to be a Mesa feature added in 3.4.2 that completes pending drawing commands right before a driver issues the swapbuffer() command.
In other words, I switched back to Mesa 3.4.2, as that solves the drawing problem neatly once I execute that Mesa internal command in the driver's swapbuffer function.
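To illustrate why completing pending commands before the swap matters, here is a minimal toy sketch in Python. It is not Mesa or driver code: `ToyContext`, `flush()` and `swap_buffers()` are hypothetical names, and `flush()` merely stands in for the Mesa internal command mentioned above.

```python
class ToyContext:
    """Toy stand-in for a rendering context; not Mesa's real data structures."""

    def __init__(self):
        self.pending = []      # drawing commands queued but not yet executed
        self.backbuffer = []   # primitives already rendered into the backbuffer
        self.frontbuffer = []  # what the screen currently shows

    def draw(self, primitive):
        self.pending.append(primitive)

    def flush(self):
        # Hypothetical stand-in for the Mesa internal command:
        # execute everything that is still queued.
        self.backbuffer.extend(self.pending)
        self.pending.clear()

    def swap_buffers(self, flush_first=True):
        if flush_first:
            self.flush()
        self.frontbuffer = self.backbuffer
        self.backbuffer = []


def shown_after_swap(flush_first):
    ctx = ToyContext()
    ctx.draw("triangle A")
    ctx.flush()
    ctx.draw("triangle B")  # still pending when the swap is requested
    ctx.swap_buffers(flush_first)
    return ctx.frontbuffer
```

In this model, swapping without the flush leaves "triangle B" out of the displayed frame, which is the kind of drawing error the flush avoids.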
Mesa speed
The reason I fell back to the older Mesa before (alpha 3.5) was the apparently lower rendering speed of the newer version. Luckily it turns out I made a mistake myself: I forgot to enable hardware accelerated Z-buffer clearing! Once I enabled that, speed came a lot closer to Mesa 3.2.1's. Mesa 3.4.2 is indeed a bit slower, but just a tiny bit now.
Of course, I wanted to see if I could at least match the old speed, so I started looking once more at the hardware rendering commands, hoping to find something I could optimize a bit more. Well, I found something indeed. Instead of issuing the vertices and drawing commands separately, I now issue them in one single burst of writes into the DMA cmd buffer. I also use other vertex offsets in the engine, so the last vertex written automatically points me at the first drawing command entry. This saves the overhead of explicitly setting engine register pointers in there, increasing rendering speed a bit (5-10% fewer words written into the DMA cmd buffer). If only Mesa could send 4 points, 4 lines and 5 triangles in one call... then I could increase the burst to its maximum, increasing speed another few percent and saving a lot of software routine calling overhead.
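As a rough illustration of where the saving comes from, here is a small Python sketch of a simplified model. It assumes that every run of contiguous register writes costs one header word plus its data words; that model, the function names and the word counts in the usage note are assumptions for illustration, not the actual nVidia command format.

```python
def words_separate(runs):
    """Words written when each run gets its own header word.

    `runs` is a list of data-word counts, one entry per run of
    contiguous register writes (for example vertices, then draw commands).
    """
    return sum(1 + n for n in runs)


def words_burst(runs):
    """Words written when all runs are laid out contiguously in one burst.

    A single header word covers the whole burst, because the last vertex
    written already points at the first drawing command entry.
    """
    return 1 + sum(runs)


def saving_percent(runs):
    """Relative reduction in words written, in percent."""
    before = words_separate(runs)
    return 100.0 * (before - words_burst(runs)) / before
```

For a hypothetical triangle with 24 vertex data words and 1 drawing command word, `words_separate([24, 1])` gives 27 words and `words_burst([24, 1])` gives 26. The real saving depends on the actual vertex size and command layout.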
Vertical retrace sync
In the meantime I added a new nv.settings option in the 2D driver called 'force_sync'. This option is now used by the 3D accelerant to enforce retrace syncing for the swapbuffers command. With a small addition to the 2D driver's acc engine init code, we can now instruct the acceleration engine to wait for a retrace to occur before continuing to execute commands. This is a bit nicer than explicitly waiting for a retrace (by CPU), as it lets us keep sending rendering commands to the engine while the engine waits for the retrace. One of the important things for optimum fps rates is that we try to keep the engine's DMA command buffer 'filled' at all times... The app should fill the buffer faster than the engine empties and executes it.
This engine wait exists in NV11 and newer cards, so the driver autoselects the retrace sync method depending on card architecture.
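The selection logic could look roughly like the following Python sketch. The enum, the function name and the way "NV11 and newer" is tested (comparing the hexadecimal chip number) are assumptions for illustration; the real driver may identify architectures differently. It also assumes that no retrace sync is done at all when 'force_sync' is off.

```python
from enum import Enum


class SyncMethod(Enum):
    NONE = "none"                # no retrace sync requested
    CPU_WAIT = "cpu_wait"        # CPU explicitly waits for the retrace
    ENGINE_WAIT = "engine_wait"  # acc engine itself waits for the retrace


def select_sync_method(arch, force_sync):
    """Pick a retrace sync method for swapbuffers.

    `arch` is a chip name such as "NV11" or "NV18". Assumption: the part
    after "NV" is compared as a hexadecimal number to decide whether the
    card counts as "NV11 or newer".
    """
    if not force_sync:
        return SyncMethod.NONE
    chip_number = int(arch[2:], 16)
    if chip_number >= 0x11:
        return SyncMethod.ENGINE_WAIT
    return SyncMethod.CPU_WAIT
```

With this sketch, `select_sync_method("NV18", True)` picks the engine wait, while an older chip such as "NV10" falls back to the CPU wait.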
Swapbuffers: swapping instead of blitting (copying)
Adding real swapping turns out to be a challenge! The reason for wanting to add this function is to remove some of the acc engine load (used for blitting), so that engine time becomes available for 3D acceleration.
Unfortunately, swapping requires a sync to retrace: the CRTC (cathode ray tube controller) part of the GPU cannot do a swap at all times, as it's very busy fetching data while the screen is drawn. You need to issue such a swap during retraces, otherwise the point in time at which the actual switch occurs cannot be guaranteed (over here it is typically delayed by some 100 'lines': the register holding the pointer is apparently doublebuffered in hardware).
Well, you guessed it: if we have to wait for a retrace, then we waste valuable GPU time we could otherwise use for 3D rendering! This contradicts our goal of course... In effect we lose speed instead of gaining it.
So: is there a solution to this problem? Yes, I think there is (I still need to check this out though). On BeOS, all 3D rendering uses double buffering. We have the 'frontbuffer' and the 'backbuffer': two buffers. You could switch between those once rendering to the backbuffer is done. The backbuffer then becomes the frontbuffer and vice versa. The next frame will be rendered to the old frontbuffer, which is now the backbuffer. You see the problem: we need to wait until the CRTC actually displays the new frontbuffer before we can erase and render into the old frontbuffer.
The solution presents itself: set up triple buffering. When we use that we don't really care when exactly the CRTC switches to the new buffer, as we leave the old one alone anyway! Instead we use the third buffer to render into... Of course we still need to wait if the CRTC needs more time to switch than we need to render one frame: I don't know yet how to sync rendering to this limitation.
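The buffer rotation described above can be sketched as a small Python model. This is only a toy under stated assumptions, not driver code: the class and method names are hypothetical, and `retrace()` stands for the moment the CRTC actually picks up the new buffer pointer.

```python
class TripleBuffer:
    """Toy model of three buffers (0, 1, 2) rotating between roles."""

    def __init__(self):
        self.displayed = 0   # buffer the CRTC currently scans out
        self.pending = None  # buffer queued for display, not yet picked up
        self.render = 1      # buffer we are rendering into

    def swap(self):
        """Queue the finished frame for display.

        Returns False when the previous swap has not been picked up by the
        CRTC yet: that is the case where rendering still has to wait.
        """
        if self.pending is not None:
            return False
        self.pending = self.render
        # Render next into the one buffer that is neither shown nor queued.
        self.render = ({0, 1, 2} - {self.displayed, self.pending}).pop()
        return True

    def retrace(self):
        """The CRTC switches to the queued buffer, if there is one."""
        if self.pending is not None:
            self.displayed = self.pending
            self.pending = None
```

After the first `swap()` rendering continues in buffer 2 without touching the buffer still on screen. A second `swap()` before any `retrace()` returns False, which is the remaining wait case mentioned above.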
All in all I don't know yet if swapping will make it into alpha 4: I want to see the thing actually work OK before I promise that. Oh, this solution has a downside as well (of course): we need extra graphics memory to hold the third buffer. Though that is no real problem in practice.
Resizing buffers (viewports, the output window)
This is another subject I am once again putting time into. The last thing I did (late yesterday evening) was determine that we have a hardware (sync) problem here! This is good news for me, as I should be able to fix that. Once I do, I can retest the Mesa internal function for resizing buffers (and resizing viewports). Looks like Mesa deserves (much) more credit than I initially gave it: it's well thought out, internal sync wise. I was under the impression that hangs and render errors could occur with out-of-sync events like resizing output windows, but it might well be that this is not the case at all. I'm a happy camper.
If I can find a solution to the hardware problem, and Mesa's function for resizing works well enough (under heavy acc engine load), I'll enable driver support for resizing. This means the teapot can be stretched etc. by resizing the window it spins in (instead of the repeating pots pattern you see now). It also means that apps that initially create very small buffers and resize later will now work (without modification).
Conclusion
Well, it looks like this will be an important update: Alpha4. Although in effect the speed won't differ that much (up to 10% speed gain depending on CPU and GPU power: P4-2800, NV18 +10%, dualP3-500 NV11 +3%), there are a lot of visible bugfixes concerning rendering.
I hope this will stimulate people to do some more 3D apps... ;-)
Comments:
would be cool if more openGL stuff ran on BeOS :-)
BTW, MrX is currently testing a first pre-release of Alpha 4 with modded Quake2 versions to see if he can find trouble (he always can :).
I am already convinced that this alpha 4 is going to be the best version I've done yet.
Thanks for the compliment BTW! I really hope Haiku will be that big a success...
I'm working on an OpenGL project as well, and using the existing alpha driver on a GF4 MX440. I'm always anxious to see what comes next with this driver development.
ModeenF:
The switch to current Mesa is relatively hard. The change was _after_ version 3.4.2 (3.5 is different).
Mesa 6.2.1 is the newest version that runs on BeOS AFAIK: 6.3 added some stuff that needs looking into (said Philippe Houdoin, maintainer for BeOS AFAIK).
So you can already use that version. The problem is rewriting the current old style nVidia driver and plugging it in.
I'll be looking into that subject again anyway someday. Maybe I'll take the Matrox route instead of VIA (I have 'separate' cards I could test with on relatively fast systems).
For now I'll concentrate on just a more stable driver with fewer faults. BTW resizing the viewport works now, so I can rescale the Teapot to fullscreen now :-)
---
Quake3 would be cool to work on indeed :-) Thanks for the pointer to that site. I really hope someone picks that up and starts working on it.
---
to PS:
Me too :)
It's cool that I seem to be able to dig in further and further. For instance I am now even looking at mipmap versus trilinear filtering: the driver supports that, but at some point quake2 hangs when you use it. It seems the engine is fed an illegal setup: the trick is to find out what it is and let the driver check for that, falling back to a lesser quality setting for those setups, so it won't crash.
OTOH: we could also be looking at a plain error in Quake2 here (GL_LINEAR_MIPMAP_LINEAR setting).
Anyone have any clues here?
BTW:
Using trilinear filtering as compared to 'standard Q2 filtering' does not slow the engine down one bit: it seems the speed bottleneck sits somewhere else... ;-)