Saturday, February 25, 2006
nVidia 3D status update
High time for a status update, I'd say. A lot has happened.
Resizing outputwindows and another BeOS bug
This function is implemented and working reliably now. However, that took some time: there turns out to be another BeOS (R5 and dano) bug here. When you enable window resizing events in the constructor of the BGLView, the corresponding routines are called, but the size passed to the resize routine is NOT the new one! Instead you get the size from the resize before the latest one. Needless to say it took a lot of testing and trying to recognize this bug. By the time I knew what was going wrong, I saw a workaround in Be's teapot code: not using the given size, but having the view ask for its current size itself.
After I added this workaround to the driver as well, resizing worked correctly. But not before I added LockLooper()/UnlockLooper() calls to the LockGL() and UnlockGL() routines: this was the trouble I saw with resizing being asynchronous to rendering. So not a Mesa problem, but 'my' problem: syncing threads. Fortunately this solution was already there in current Mesa, so I just copied it over to my driver.
Resizing sometimes still temporarily distorts the rendered output, but that's not very important, although it could look better of course. The driver does not call Mesa's glViewport or resize-buffer functions; that's a task for the application programmer (as can be seen in various code examples out there). The driver only takes care of programming the card right. It's nice to see a very large teapot spin around, and to recognize the speed effects of resizing it. :-)
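The teapot workaround can be sketched like this. This is a minimal mock of my own, not Be's actual code: the names and the mock view class are invented for illustration; the point is only that the resize hook ignores the (possibly stale) event arguments and re-queries the view's current size instead.

```cpp
#include <cassert>

// Hypothetical stand-in for a BGLView. On BeOS R5/dano the resize hook
// can receive a width/height that is one event behind; the teapot
// workaround is to ignore those arguments and ask the view itself.
struct MockView {
    float curWidth = 0, curHeight = 0;   // what Bounds() would report now
    float usedWidth = 0, usedHeight = 0; // what the GL code ends up using

    // The resize hook: 'w'/'h' are the (possibly stale) event values.
    void FrameResized(float w, float h) {
        (void)w; (void)h;        // workaround: don't trust these
        usedWidth  = curWidth;   // re-query the real, current size instead
        usedHeight = curHeight;
    }
};
```

With the real BGLView the re-query would of course be a Bounds() call on the view itself.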
Trilinear filtering: 'random' engine crashes
As usual (so I can say these days :-) BeOS Mr.X is 'betatesting' the driver for me. As he has a lot of knowledge about the Quake series of games and the console commands you can give them, he finds bugs rather easily compared to me. One of the problems he encountered was that using GL_LINEAR_MIPMAP_LINEAR texture rendering in Quake2 hung the driver. After two days of searching I discovered what was wrong: the original UtahGLX driver I based my work on contained an error. The rendering quality setting was increased 'by one step' when rendering switched from textured to non-textured rendering, while of course the driver meant to set a basic 'level 1', as textures aren't actually used then.
The driver crashed the acceleration engine because it was fed with an illegal setup: a non-existing filtering mode.
So: alpha 4 will have this problem fixed. And the really funny part is that I didn't even know the driver supported filtering in this area... But it does: you can choose no filtering, bilinear filtering, or trilinear filtering, all just a setting fed into the engine. Check out the gl_texturemode console command for Quake1 and Quake2 if you are interested.
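The nature of the bug and the fix can be sketched in a few lines. The mode names and numeric codes below are invented for illustration; the real engine register values differ. The point is that the old code effectively did 'mode + 1' when textures were switched off, which for a card already set to trilinear produced a non-existing mode.

```cpp
#include <cassert>

// Hypothetical filter-mode codes as fed to the engine (illustrative
// values only). Anything above FILTER_TRILINEAR does not exist and
// crashes the acceleration engine.
enum FilterMode { FILTER_NONE = 1, FILTER_BILINEAR = 2, FILTER_TRILINEAR = 3 };

// The UtahGLX bug: on switching to non-textured rendering the quality
// setting was stepped up by one instead of reset to the base level, so
// trilinear (3) became the illegal mode 4. The fix: always select the
// base 'level 1' mode when texturing is off.
int select_filter_mode(bool texturing_enabled, FilterMode requested) {
    if (!texturing_enabled)
        return FILTER_NONE;   // base level, never 'requested + 1'
    return requested;
}
```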
Quake1: touching the keyboard hangs the game
Another interesting thing to have a look at. Since the sources of GLquake are out there, I made it compile on R5 and dano. That turned out to be a relatively easy thing to do. After having the game run, I could go on a bug hunt for this error. It turns out that the keyboard routine is protected by a semaphore, which is also grabbed when a new frame is rendered. This is a bad situation apparently, as rendering stops completely once the keyboard is touched. I could fix it anyway by modifying the game executable a bit: acquiring/releasing the semaphore as close to the actual rendering as possible (inside the LockGL()/UnlockGL() part instead of outside of it). You still see the keyboard lagging in high-res modes, however. Anyway: it has to do, unless someone else is going to have a better look of course. (Enable the networking support, someone?)
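The shape of that fix — narrowing the critical section — looks roughly like this. This is my own sketch with invented names, using a std::mutex in place of the game's semaphore, not GLquake's actual code:

```cpp
#include <cassert>
#include <functional>
#include <mutex>

std::mutex frame_lock; // stands in for GLquake's semaphore

// Old pattern: input handling and rendering share one big critical
// section, so a held keyboard lock stalls rendering for the whole frame.
void render_frame_old(const std::function<void()>& poll_keyboard,
                      const std::function<void()>& draw) {
    std::lock_guard<std::mutex> guard(frame_lock); // held across everything
    poll_keyboard();  // keyboard work serializes against rendering here
    draw();
}

// Fixed pattern: the lock is acquired as close to the actual rendering
// as possible (the LockGL()/UnlockGL() span), so keyboard handling no
// longer blocks a frame in progress.
void render_frame_fixed(const std::function<void()>& poll_keyboard,
                        const std::function<void()>& draw) {
    poll_keyboard();  // input handled outside the lock
    std::lock_guard<std::mutex> guard(frame_lock);
    draw();           // only the rendering itself is serialized
}
```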
With Be's software GL the game did not hang, but then, this renders rather slowly. I could prevent hanging in the accelerated driver by adding a snooze(100000)...
If someone knows of a driver-side way to overcome this problem I'd be interested to hear it: after all, R4.5's accelerated GL did not suffer from this symptom.
Quake1 rendering faults
While I was playing around with Q1, I decided to try to find the reason for the wrongly rendered game scores at the mid-bottom of Quake1's screen. I found it after another day or two: yet another 'original' UtahGLX driver bug. It forgot to send the active texture's colorspace to the engine when a new texture was activated...
Oh, doing a timerefresh in the Q1 console was strange to see as well: it rendered in the foreground buffer. As the nVidia driver doesn't support that yet, I modified this command to use the backbuffer so it's accelerated: it's nasty to have to wait for 128 frames when rendering a single one costs about a second or so.
Anyway: it makes more sense as well to have it render in the backbuffer as otherwise we would always see the 'distortions' of the engine building up the frames in plain sight.
Next up..
So: all in all, all rendering faults for both Quake1 and Quake2 have been solved. I guess I should release an updated version of Quake1 to BeBits (including source) so this game can be played using OpenGL once more on BeOS. It would be nice if someone could take it from there to update it for networking and such, I'd say.
OK, back to work. I want to have a look at switching resolutions inside Quake2 (which only works partly for some reason), and then comes the 'real' swapbuffer thing. Apart from these items the driver seems ready for a new release.
Talk to you later!
Saturday, February 18, 2006
nVidia 3D: back on Mesa 3.4.2
While searching for a good solution for that delayed swap I mentioned to solve drawing errors, I once again compared Mesa 3.2.1 and 3.4.2. One of the differences between them turns out to be the added Mesa feature to complete pending drawing commands right before a driver issues the swapbuffer() command.
In other words, I switched back to Mesa 3.4.2, as that solves the drawing problem neatly if I add executing that Mesa internal command in the driver's swapbuffer function.
Mesa speed
The reason for me to fall back to the older Mesa before (alpha 3.5) was the apparent lower rendering speed of the newer version. Luckily it turns out I made a mistake myself: I forgot to enable hardware accelerated Z-buffer clearing! Once I enabled that, speed came a lot closer to Mesa 3.2.1's. Mesa 3.4.2 is indeed a bit slower, but just a tiny bit now.
Of course, I wanted to see if I could at least match the old speed, so I started looking once more at the hardware rendering commands, hoping to find something I could optimize a bit more. Well, I found something indeed. Instead of issuing the vertices and drawing commands separately, I now issue them in one single burst of writes into the DMA command buffer. I also use other vertex offsets in the engine, so the last vertex written automatically points me at the first drawing command entry. This saves the overhead of explicitly setting engine register pointers in there, increasing rendering speed a bit (5-10% fewer words written into the DMA command buffer). If only Mesa could send 4 points, 4 lines and 5 triangles in one call... then I could increase the burst to its max, gaining another few percent of speed and saving a lot of software routine calling overhead.
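A toy model shows where the saved words come from. All numbers and names below are invented for illustration (the real method headers and word counts are card-specific): each separate write group into the DMA buffer needs its own header word, while a single contiguous burst pays that cost once.

```cpp
#include <cassert>
#include <cstddef>

// Separate issue: every vertex write group and the draw command each
// carry their own one-word method header.
size_t words_separate(size_t vertices, size_t words_per_vertex,
                      size_t draw_words) {
    size_t words = 0;
    for (size_t v = 0; v < vertices; v++)
        words += 1 + words_per_vertex;   // header + vertex data
    words += 1 + draw_words;             // header + draw command
    return words;
}

// Burst issue: one header covers the whole vertex run, and with the
// vertex offsets chosen so the last vertex lands right before the
// draw-command entry, the draw words follow in the same burst.
size_t words_burst(size_t vertices, size_t words_per_vertex,
                   size_t draw_words) {
    return 1 + vertices * words_per_vertex + draw_words;
}
```

With realistic group sizes the saving works out to the few-percent range mentioned above; the toy numbers here exaggerate it.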
Vertical retrace sync
In the meantime I added a new nv.settings option in the 2D driver called 'force_sync'. This option is now used by the 3D accelerant to enforce retrace syncing for the swapbuffers command. With a small addition to the 2D driver's acc engine init code, we can now instruct the acceleration engine to wait for a retrace occurring before continuing execution of commands. This is a bit nicer than explicitly waiting for a retrace (by CPU), as it enables us to keep sending rendering commands to the engine while the engine waits for the retrace. One of the important things for optimum fps rates is that we try to keep the engine's DMA command buffer 'filled' at all times... The app should stay ahead, filling the buffer faster than the engine empties and executes it.
This engine wait exists in NV11 and newer cards, so the driver autoselects the retrace sync method depending on card architecture.
Swapbuffers: swapping instead of blitting (copying)
Adding real swapping turns out to be a challenge! The reason for wanting to add this function is to remove some of the acc engine load (used for blitting), so that space-in-time becomes available for 3D acceleration.
Unfortunately, swapping requires a sync to retrace: the CRTC (cathode ray tube controller) part of the GPU cannot do a swap at all times, as it's very busy with data fetching while the screen is drawn. You need to issue such a swap during retraces, otherwise the point in time the actual switch occurs cannot be guaranteed (over here it typically delays some 100 'lines': the register holding the pointer is double-buffered in hardware apparently).
Well, you guessed it: if we have to wait for a retrace, then we waste valuable GPU time we could otherwise use for 3D rendering! This contradicts our goal of course... In effect we lose speed instead of gaining it.
So: is there a solution to this problem? Yes, there is, I think (I still need to check this out though). On BeOS, all 3D rendering uses double buffering: we have the 'frontbuffer' and the 'backbuffer'. You switch between those once rendering to the backbuffer is done. The backbuffer then becomes the frontbuffer and vice versa, and the next frame will be rendered to the old frontbuffer, now being the backbuffer. You see the problem: we need to wait until the CRTC actually displays the new frontbuffer before we can erase and render in the old frontbuffer.
The solution presents itself: set up triple buffering. When we use that we don't really care when exactly the CRTC switches to the new buffer, as we leave the old one alone anyway! Instead we use the third buffer to render into... Of course we still need to wait if the CRTC needs more time to switch than we need to render one frame: I don't know yet how to sync rendering to this limitation.
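The triple-buffer bookkeeping can be sketched as a simple role rotation. This is my own illustration, not the driver's actual code: three buffer indices cycle through the roles 'displayed' (the CRTC may still be scanning it out), 'pending' (handed over at the last swap) and 'render target', and the renderer never touches the buffer the CRTC might still be showing.

```cpp
#include <cassert>

// Sketch of triple-buffer role rotation (names invented). By the time
// the NEXT swap comes around, the previously displayed buffer is
// certainly free, so it becomes the new render target.
struct TripleBuffer {
    int displayed = 0;  // buffer the CRTC is (or may still be) showing
    int pending   = 1;  // buffer handed to the CRTC at the last swap
    int target    = 2;  // buffer we render the next frame into

    void swap() {
        int old_displayed = displayed;
        displayed = pending;   // the last handover has taken effect
        pending   = target;    // the just-finished frame goes to the CRTC
        target    = old_displayed; // recycle the old frontbuffer
    }
};
```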
All in all I don't know yet if swapping will make it into alpha 4: I want to see the thing actually work OK before I promise that. Oh, this solution has a downside as well (of course): we need extra graphics memory to hold the third buffer. Though that is no real problem in practice.
Resizing buffers (viewports, the output window)
This is another subject I am once again putting time into. The last action I took (late yesterday evening) was determining that we have a hardware (sync) problem here! This is good news for me, as I should be able to fix that. Once I do, I can retest the Mesa internal function for resizing buffers (and resizing viewports). It looks like Mesa deserves (much) more credit than I initially gave it: it's well thought out, internal-sync-wise. I was under the impression that hangs and render errors could occur with out-of-sync events like resizing output windows, but it might well be that this is not the case at all. I'm a happy camper.
If I can find a solution to the hardware problem, and Mesa's function for resizing works well enough (under heavy acc engine load), I'll enable driver support for resizing. This means the teapot can be stretched etc. via resizing the window it spins in (unlike the repeating pots pattern you see now). It also means that apps that initially create very small buffers and resize later will now work (without modification).
Conclusion
Well, it looks like this will be an important update: alpha 4. Although in effect the speed won't differ that much (up to 10% speed gain depending on CPU and GPU power: P4-2800 with NV18 +10%, dual P3-500 with NV11 +3%), there are a lot of visible bugfixes concerning rendering.
I hope this will stimulate people to do some more 3D apps... ;-)
Sunday, February 12, 2006
Mesa 3.2.1 and accelerated 3D on nVidia (again)
While working on updating the nVidia 2D driver for a new BeBits release, I decided to clean up for 3D support. I tested a lot of register configuration settings with the help of two cards in one system: I could never test as speedily as these days (reboots are no longer required).
Some 'nonsense' 3D setup was removed, and I could also find one point where more speed could be gained for 3D: I modified the rendering output colorspace from some 'special' type with different input/output spaces to 'standard' ARGB32. This apparently means less drawing overhead, which led to an 11% rendering speedup in B_RGB32 space on my P4-2.8GHz using the NV18 card. On my dual P3-500 with NV11 a 7% speedup was still gained. 15 and 16 bit spaces are unmodified speed-wise.
While I was so close to the 3D subject again, I wandered off more into the 3D accelerant and Mesa 3.2.1. I ended up trying to set up a real swapbuffer command (instead of using blitting), which could give us another 5% speed gain for all spaces in fullscreen modes. It still doesn't work correctly (I am thinking another Mesa bug...), but it gave me an interesting view of the rendering behind the scenes (seeing a scene being constructed in the backbuffer).
It became apparent to me that although we have several drawing errors with Quake2 (missing texts, missing parts of text, missing bitmaps, intermittently missing crosshair and scores), these items were drawn nonetheless! As it turns out, the normally visible rendered buffer (with the just mentioned errors) is rendered in the background, then swapped on-screen, and then the missing pieces are rendered in the now obsolete background buffer! Of course they are never shown, as after this final rendering part the erase buffers command comes up...
So, I tried a delayed swap, and YES!! Q2 renders without any drawing fault (32bit mode atm)!
This sudden success means I'll put some more time into the alpha 3.5 3D add-on and see what I can do to modify it for general improvement here: I am hoping I can get this incorporated so that it's still usable with other 3D apps as well. I will update the BeBits entry to alpha 4 after I release the 'current' 2D driver, hopefully with both the 'perfect draw' and the fullscreen swap function in place. All in all, for instance 1024x768x32 mode in Q2 would go up from 22 to 28 fps then, and without drawing errors anymore.
Well, 'back to work'. Talk to you later!