Saturday, March 25, 2006


3D add-on Alpha 4 released..


Well, it's done: Alpha 4 is out there. I had to do a lot more work than I anticipated, hence the delay. But I think it was well worth it: a lot more bugs were fixed. In other words, Quake 1 and 2 showed more rendering errors after all; they were just harder to spot for someone who doesn't run games much (yet :).

The driver entry on BeBits contains a list of all errors solved, so have a look there for the nifty details, or just download Alpha 4 and read the included HTML file (which contains the same list plus updated application running-status info).

What can I add to that info?
Well, rendering speed is higher than ever before, although still slow compared to the Windows and Linux closed-source drivers. But I mentioned that already. I'll try to do a new benchmark using Alpha 4 to give you detailed current results: that way you can (finally) compare it to the old Alpha 2 speeds I once posted.

GPU and RAM speed (overclocking and bottlenecks)
Let's talk a bit about rendering speed and bottlenecks. I spent a lot of time trying to find out why the speed is so much lower than with the Windows and Linux closed-source drivers. I also tried to get NV20 and higher going once more. I did not find the solution to either problem, but I learned more about the cards and the Windows drivers in the meantime: who knows, it might help one day.

Anyway, one of the things I did was add tweaking options for GPU and RAM clock speed in the 2D driver. Of course the 3D driver benefits from this as well, which was the intended result.
I did a test on my P4 at 2.8GHz/533MHz FSB, using the GeForce4 MX440. This card's BIOS specifies a 275MHz GPU clock and 400MHz RAM clock. Coldstarting the card confirmed that these speeds are actually programmed.
Here's the result of testing GPU speed with the RAM speed at its default (400MHz):
GPU speed (MHz), Q2 timedemo1 fps in 800x600x16 mode (Alpha 4)
50, 38.8
100, 66.2
150, 78.8
200, 82.8
275, 86.0

I find this interesting: doubling the GPU speed did NOT double the rendering speed. Now look at RAM testing with the GPU at its default (275MHz):
RAM speed (MHz), Q2 fps
100, engine hang
150, engine hang
200, engine hang
250, 55.7
300, 66.4
350, 76.4
400, 86.1
450, 92.9 (overclocking 12%!)

What's interesting here is that increasing the RAM clock by a certain percentage increases fps by (nearly) the same percentage! Combining both tables, we can conclude that RAM access speed is the bottleneck, not GPU speed. If you benchmark Q2 some more using different texture-filtering settings, this conclusion holds: fps is not influenced one bit by the filtering mode. The GPU, at least, doesn't care.
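A quick sanity check of that claim, using only the numbers from the RAM table above: if RAM bandwidth is the bottleneck, fps divided by RAM clock should be roughly constant.

```python
# Q2 timedemo1 fps vs RAM clock (GPU fixed at 275MHz), from the table above.
ram_fps = {250: 55.7, 300: 66.4, 350: 76.4, 400: 86.1, 450: 92.9}

# fps per MHz of RAM clock: roughly constant if RAM is the bottleneck.
ratios = {clk: fps / clk for clk, fps in ram_fps.items()}
for clk in sorted(ratios):
    print(f"{clk} MHz: {ratios[clk]:.3f} fps/MHz")
```

The ratio stays within about 8% across the whole range (it flattens slightly at the top, where the GPU presumably starts to matter), which supports the linear-scaling reading.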

About the engine hangs at low speeds: most RAM used on graphics cards is of the dynamic type: it must be refreshed within a certain amount of time to keep its contents. When refreshing happens too slowly, the contents get corrupted: hence the trouble.

So, why is RAM access the bottleneck?

Background: RAM bandwidth considerations
So how much data can be transferred to and from RAM anyway? Well, the raw numbers look about like this:
the NV18 (like most of these cards, in fact) has a 128-bit-wide path between GPU and RAM. This means that per clock cycle (SDRAM) 128/8 = 16 bytes of data are transferred. With a 400MHz clock, that means 400,000,000 * 16 bytes = 6.4GB/second can be transferred.
We need to deduct some room for refresh cycles, so let's say for argument's sake we keep about 6GB/sec of bandwidth for our card's functions.
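The raw-bandwidth arithmetic above, as a two-line sketch:

```python
# 128-bit bus: 16 bytes move per clock cycle; at 400MHz (SDRAM)
# that gives the 6.4GB/sec raw figure from the text.
bus_width_bits = 128
clock_hz = 400_000_000

bytes_per_clock = bus_width_bits // 8        # 16 bytes per cycle
raw_bandwidth = clock_hz * bytes_per_clock   # bytes per second
print(raw_bandwidth / 1e9)                   # 6.4 (GB/sec, decimal)
```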

So which functions are running? Well, we need to send RAM content to the screen (monitor). In 1024x768x32 mode at a 75Hz refresh rate, that means 1024*768*4*75 ≈ 225MB/sec is transferred.

This leaves some 5.8GB/sec of bandwidth for the GPU accesses and CPU accesses combined.
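The scanout arithmetic, checked in plain Python (note the mixed units in the text: the MB figure is binary, the GB figures decimal):

```python
# Scanout cost for 1024x768x32 at 75Hz refresh: bytes pushed to the
# monitor every second, and what that leaves of the ~6GB/sec that
# survives refresh-cycle overhead.
width, height, bytes_per_pixel, refresh_hz = 1024, 768, 4, 75

scanout = width * height * bytes_per_pixel * refresh_hz  # bytes/sec
usable = 6_000_000_000                                   # ~6GB/sec usable

print(scanout / 2**20)           # 225.0 (MB/sec, binary) for scanout
print((usable - scanout) / 1e9)  # GB/sec left for GPU + CPU accesses
```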
Note that when you run, for instance, the Quake2 timedemo, the textures are first loaded into card RAM, and only then does the demo start running. While the demo runs, no more data is transferred between the host system (CPU) and card RAM: everything needed is already there. Apart from the actual rendering commands, that is, but those are a relatively small amount of data which resides in main system RAM and is fetched by the GPU directly (AGP DMA accesses). So these commands don't load the card's RAM bandwidth, just the GPU.

Bottleneck identification?
One serious 'problem' with these calculations is that the GPU does not always need chunks of 16 bytes (128 bits, the width of the datapath: the data transferred in one clock cycle). If you render using a 16-bit Z-buffer and no serious hardware access optimisation exists, those two bytes cost 16 bytes' worth of bandwidth. In other words: these accesses run at 2/16 = 12.5% of maximum speed. For a 32-bit colorbuffer, this would be 4/16 = 25%.
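The efficiency numbers above follow directly from the access size versus the bus width (assuming, as in the text, that narrow accesses cannot be packed and each one burns a full bus cycle):

```python
# Worst-case bus efficiency for narrow writes on a 128-bit (16-byte)
# datapath without any access-combining in the memory controller.
BUS_BYTES = 16  # 128 bits per clock cycle

def efficiency(access_bytes):
    """Fraction of bus bandwidth actually carrying useful data."""
    return access_bytes / BUS_BYTES

print(efficiency(2))  # 16-bit Z-buffer write -> 0.125 (12.5%)
print(efficiency(4))  # 32-bit color write    -> 0.25  (25%)
```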

This could be what we are looking at. Unfortunately, I don't have a clue how to engage the GPU's optimisation for this kind of thing: the crossbar memory controller (if it even exists on these cards; I should check the coarse specs from nVidia). This piece of hardware is capable of splitting those 16 bytes into separate, smaller lanes, so to speak.

On the other hand, this same problem should exist on TNT-class cards, yet there we are already running at relatively high speeds compared to GeForce-class cards, if you look at the Windows driver results. But then again, there might be completely different reasons for that. It remains guesswork for now.

So, how are those speeds again? I'll sum them up once more (for the P4 2.8GHz system in 1024x768x32 mode):
card, Windows (blit-swapbuffer function forced, 16-bit textures), BeOS Alpha 4
TNT2 (ASUS V3800) , 41.3, 15.6
GF2MX400, 86.0, ---
GF2Ti, 165, ---
GF4MX440, 119, 26.3
(--- means: not tested)

So, the TNT2 runs at 15.6/41.3 = 38% of maximum speed, the GF4MX440 at 26.3/119 = 22%. Both cards have 128-bit-wide buses, by the way.
Note to self: GeForce-class cards seem to be running at roughly 50% of the relative speed of the TNT-class cards. Do we need to enable DDR (double data rate) explicitly?? At least more cards should be compared for this. I seem to remember the GF2Ti running at some 23fps on BeOS, which would be relatively much slower still than the GF4MX440.
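The relative-speed percentages above, computed from the table:

```python
# BeOS Alpha 4 speed as a fraction of the Windows driver, from the
# 1024x768x32 Q2 timedemo1 results above.
results = {            # card: (windows_fps, beos_fps)
    "TNT2":     (41.3, 15.6),
    "GF4MX440": (119,  26.3),
}
for card, (win, beos) in results.items():
    print(f"{card}: {beos / win:.0%} of Windows speed")
```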

Indications for actual RAM access bandwidth on BeOS
I had in mind to give you the results of an interesting delta-speed test I did on BeOS, but that will have to wait for another day. Time is up for now. But I'll post it as the next item, I promise: it's very much related to the story above after all!

In the meantime: have fun with 3D! It seems we can now run Quake 1, 2 and 3 on BeOS. With acceleration...

Signing off. Good night :-)

How portable is this? Would it be easy to make it work with Xorg? Are you in contact with anyone interested in porting your work to Xorg? What is the copyright status of your 3D work? I heard it was derived from utah-glx; is everything kosher: did you leave the copyright info intact and the licence the same?

I guess it can be made to work with Xorg; the driver has been ported before. The driver does need to be rewritten for current Mesa, though, something I need to do as well.

Copyright: I now licence it BSD/MIT, since, after talking to Haiku's OpenGL kit lead, all parts of this stuff should be compatible with that. I did not remove any copyright notices, nor did I add any of my own to the files. You'll still find nVidia's own messages there, for example, as they should be.

Anyway, I am no copyright/licence expert, so I gladly leave discussions about that to other people.

And: I kind of hope some Linux group picks it up again, since it seems a waste that this driver was never developed further. Currently I work on this alone. The switch to current Mesa seems very difficult to me; I can't do it at the moment: first I need to port some other DRI driver to BeOS to learn more about the current Mesa driver interface.

But I am convinced this driver will speed up while lowering the software overhead at the same time, once it's on current Mesa.
Further reading cut and pasted from Anandtech:

Lightspeed Memory Architecture

The GeForce3's memory controller is actually drastically changed from the GeForce2 Ultra. Instead of having a 128-bit interface to memory, there are actually four fully independent memory controllers that are present within the GPU in what NVIDIA likes to call their Crossbar based memory controller.

These four memory controllers are each 32-bits in width that are essentially interleaved, meaning that they all add up to the 128-bit memory controller we're used to, and they do all support DDR SDRAM. The crossbar memory architecture dictates that these four independent memory controllers are also load balancing in reference to the bandwidth they share with the rest of the GPU.

The point of having four independent, load balanced memory controllers is for increased parallelism in the GPU (is anyone else picking up on the fact that this is starting to sound like a real CPU?). The four narrower memory controllers come quite in handy when dealing with a lot of small datasets. If the GPU is requesting 64-bits of data, the GeForce2 Ultra uses a total of 256-bits of bandwidth (128-bit DDR) in order to fetch it from the local memory. This results in quite a bit of wasted bandwidth. However in the case of the GeForce3, if the GPU requests 64-bits of data, that request can be handled in 32-bit chunks, leaving much less bandwidth unused. Didn't your mother ever tell you that it's bad to leave food wasted? It looks like NVIDIA is finally listening to their mother.
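The waste Anandtech describes can be sketched with a small model (a simplification: real controllers burst and reorder, but the ratio is the point):

```python
# Bits of bus bandwidth consumed by a 64-bit fetch: one 128-bit DDR
# controller (GeForce2 Ultra style) vs one of the GeForce3's four
# 32-bit DDR controllers. A request occupies whole transactions.
def bits_consumed(request_bits, controller_bits, ddr=True):
    per_txn = controller_bits * (2 if ddr else 1)  # bits per transaction
    txns = -(-request_bits // per_txn)             # ceiling division
    return txns * per_txn

print(bits_consumed(64, 128))  # 256 bits spent on 64 useful (25% efficient)
print(bits_consumed(64, 32))   # 64 bits spent on 64 useful (no waste)
```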

Thanks Euan,

Interesting read. After writing this article I remembered that the crossbar controller was 'invented' after the GF2, so that's also not why my driver performs relatively slowly.

So, thinking further, I am now looking into parallel ROPs, which all cards the driver currently supports have: all have two of them, except the Pro line of GF2, which has four.

This is interesting to look at. I was pointed in this direction by an alternate explanation I found of what ROP means. Mostly it is explained as Raster OPerations, but another version is Raster OutPut (or something like that).

I am now briefly testing for a solution in that very direction today. Hold on, I'll keep you posted. :-)
A small update (still working on it):

I was able to increase GeForce2Ti rendering speed by some 21%, and GeForce2MX by some 44%.

GeForce2Ti:
640x480x16 Q2 timedemo1 = 140fps
1024x768x32 = 28.3fps (was 23.3)

GeForce2MX:
640x480x16 = 109fps (was 60-70)
1024x768x32 = 18.9fps (was 13.1)

Others are unchanged.

GF4MX440 is still at:
640x480x16 = 125fps
1024x768x32 = 26.3fps

So, GeForce2Ti has taken the lead.
I have a feeling this is the maximum I can do for now, because the relative comparison between cards seems OK now (if you take both ROPs and RAM bandwidth into account).

I'll probably do some calculations about that, and, as said, I'll keep fiddling around a bit more.

I'm glad I got this going at least :-)


