C++ and Perl bakeoff

mikeblas said:
<Regarding working per-character>

For slickness? It would be kind of neat; you'd invent a nifty little state machine. That's kind of sexy, in a CS nerd way.

For performance? It would suck. You have to do a call to get each character, juggle around a buffer, and then finally get the character. In my implementation (IIRC) I'm using strchr() to find things. strchr() is very highly optimized, if you have a compiler with a decent and mature runtime library implementation. Like, instead of using an eight-bit register to get a character and decide, it's going to get four bytes at a time in a 32-bit register and do fancy shifting and masking to find the beginning of your string. Doesn't that sound lots faster than setting up a function call for each character?
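
For illustration only (this isn't the real CRT code), here's a rough sketch of that word-at-a-time idea: load four bytes into a 32-bit register, use the classic zero-byte mask test to check all of them at once, and finish (and locate the exact hit) with a plain byte loop.

Code:
#include <cstdint>
#include <cstddef>
#include <cstring>

// Sketch only: find the first occurrence of 'target' in p[0..n), scanning
// four bytes per iteration instead of one.
const char* find_byte32( const char* p, std::size_t n, char target ) {
    const std::uint32_t ones  = 0x01010101u;
    const std::uint32_t highs = 0x80808080u;
    const std::uint32_t pat   = ones * static_cast<unsigned char>( target );
    while ( n >= 4 ) {
        std::uint32_t word;
        std::memcpy( &word, p, 4 );                // safe unaligned load
        const std::uint32_t x = word ^ pat;        // matching bytes become zero
        if ( ( x - ones ) & ~x & highs ) break;    // word contains a zero byte
        p += 4;
        n -= 4;
    }
    for ( ; n; ++p, --n )                          // pin down the hit bytewise
        if ( *p == target ) return p;
    return 0;
}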

This is basically how I thought about it: Map the file to memory, and progress through it per-byte. It will still be read in larger chunks, but that is entirely hidden from me, allowing the code to be fairly neat.
(Ok, the neatness of my code can probably be disputed, but it's still conceptually simpler than having to handle a buffer by hand.)

Of course, given that it's C++, you could make an object around the file stream that contained a buffer and had a popByte() method (or something prettier with streams; I'm not used to C++), allowing the core loop to be written as if you were reading a single byte at a time. While the actual operations performed would be similar, it could be a useful level of abstraction.
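
A minimal sketch of that wrapper (popByte() is the only name taken from the post; everything else here is made up for illustration): the caller sees one byte at a time, while the underlying ifstream is read a block at a time.

Code:
#include <fstream>
#include <vector>
#include <cstddef>

class ByteSource {
public:
    ByteSource( std::ifstream& in, std::size_t blockSize = 32768 )
        : in_( in ), buf_( blockSize ), pos_( 0 ), len_( 0 ) {}

    // Returns false at end of file; otherwise stores the next byte in c.
    bool popByte( char& c ) {
        if ( pos_ == len_ ) {                      // buffer exhausted: refill
            in_.read( &buf_[0], static_cast<std::streamsize>( buf_.size() ) );
            len_ = static_cast<std::size_t>( in_.gcount() );
            pos_ = 0;
            if ( len_ == 0 ) return false;         // nothing left to read
        }
        c = buf_[pos_++];
        return true;
    }

private:
    std::ifstream& in_;
    std::vector<char> buf_;
    std::size_t pos_, len_;
};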
 
HHunt said:
Of course, given that it's C++, you could make an object around the file stream that contained a buffer and had a popByte() method (or something prettier with streams; I'm not used to C++),
Right; but that would mean you're no longer "reading each char from the file", as Shadow specified. The larger the blocksize for your I/O (to a point) the better your performance can be. Making a call per large block is fine because it amortizes more cost; making a call per character is unforgivable.
 
@Mike

Want to give this one a shot?

The previous code took ~ 5.5 minutes for me on my old pc. This one only takes 2 minutes, so this should definitely complete in under 92 seconds on your system.

Code:
#include <iostream>
#include <string>
#include <fstream>

using namespace std;

// Scans a quoted string literal: handles the \' and \\ escapes, turns '_'
// into ' ', drops literal spaces, and returns at the closing quote.
inline void qstate( string::const_iterator& i, string& out, const string::const_iterator& size ) {
    while ( i < size ) {
        ++i;
        if ( *i == '\\' && i + 1 < size ) {
            char h = *( i + 1 );
            if ( h == '\'' ) {
                out.push_back('\'');
                ++i;
            } else if (  h == '\\' ) {
                out.push_back('\\');
                ++i;
            } 
        } else if ( *i == '\'' ) {
            return;
        } else if ( *i == '_' ) {
            out.push_back(' ');
        } else if ( *i != ' ' ) {
            out.push_back( *i );
        }
    }
}

// Converts one INSERT line to tab-separated fields with CRLF record breaks,
// skipping the 26-character "INSERT INTO `page` VALUES " prefix.
inline void parseLine( const string& s, ofstream& out ) {
    const string::const_iterator size = s.end();
    string buffer;
    for ( string::const_iterator i = s.begin() + 26 ; i < size; ++i ) {
        if ( *i == '\'' ) {
            qstate( i, buffer, size );
        } else if ( *i == ',' ) {
            buffer.push_back('\t');
        } else if ( *i == ')' ) {
            buffer.push_back('\r');
            buffer.push_back('\n');
            ++i;    // skip the ',' that follows the ')'
            ++i;    // ... and the '(' that opens the next record
        } else if ( *i != '(' ) {
            buffer.push_back( *i );
        }
    }
    out << buffer;
}

int main( int argc, char* argv[] ) {
    if ( argc != 3 ) {
        return 1;
    }
    ifstream in( argv[1], ios::binary );
    if ( !in ) {
        return 1;
    }
    ofstream out( argv[2], ios::binary );
    if ( !out ) {
        return 1;
    }
    for ( string s; getline( in, s ); ) {
        if ( s.find( "INSERT INTO `page` VALUES" ) == 0 ) {
            parseLine( s, out );
        }
    }
}

// g++ -Wall -Wextra -ansi -pedantic fixer3.cpp -o fixer3 -O3 -msse3 -mtune=i686 -s
 
mikeblas said:
Right; but that would mean you're no longer "reading each char from the file", as Shadow specified. The larger the blocksize for your I/O (to a point) the better your performance can be. Making a call per large block is fine because it amortizes more cost; making a call per character is unforgivable.

True, but it would make the algorithm per-byte (or character; not necessarily the same thing), and that might or might not be what he was thinking about.
 
HHunt said:
True, but it would make the algorithm per-byte (or character; not necessarily the same thing), and that might or might not be what he was thinking about.

I think it was, since he specifically contrasted it against parsing the whole line.
 
HHunt said:
True, but it would make the algorithm per-byte (or character; not necessarily the same thing), and that might or might not be what he was thinking about.

Basically, I was asking if:

grab char from infile
decide if it or some other char goes to outfile
write the char to outfile

would be fast if the infile was on one drive and the outfile was created on another.

The reason I thought it might be fast is that it would use very little memory because I would just be allocating a byte ( or 2 when looking ahead for an escape ) instead of stuffing a whole line in a buffer, scanning the buffer and then writing it to the file.
 
You're trading memory allocation (which happens once per line, worst case) for inefficiently handling individual characters (which happens once per character, always).
 
Here's another attempt at my Python version. I think this one will actually work correctly; it completes in 2m30s on my machine (dual Xeons @ 3 GHz).

parser.py
 
Shadow2531 said:
@Mike

Want to give this one a shot?

This code crashes in release. In debug, it asserts at this callstack:

Code:
>	Shadow.exe!std::_String_const_iterator<char,std::char_traits<char>,std::allocator<char> >::operator++()  Line 135 + 0x52 bytes	C++
 	Shadow.exe!parseLine(const std::basic_string<char,std::char_traits<char>,std::allocator<char> > & s="INSERT INTO `page` VALUES (1,0,'AaA','',8,1,0,0.116338664774167,'20060401120725',46448774,70),(5,0,'AlgeriA','',0,1,0,0.553221851171201,'20060301005610',18063769,41),(6,0,'AmericanSamoa','',0,1,0,0.387834867823715,'20060301005610',18063795,48),(8,0,'AppliedEthics','',0,1,0,0.279508816338618,'20050628012319',15898943,30),(10,0,'AccessibleComputing','',0,1,0,0.33167112649574,'20060603165541',56681914,36),(12,0,'Anarchism','edit=autoconfirmed:move=sysop',5252,0,0,0.786172332974311,'20060702174137',61712927,41221),(13,0,'AfghanistanHistory','',5,1,0,0.0621502865684687,'20050628012319',15898948,36),(14,0,'AfghanistanGeography','',0,1,0,0.952234464653055,'20050628012319',15898949,40),(15,0,'AfghanistanPeople','',4,1,0,0.574721494293512,'20050628012319',15898950,41),(17,0,'AfghanistanEconomy','',0,1,0,0.36035608906332,'20050628012319',15898951,38),(18,0,'AfghanistanCommunications','',8,1,0,0.75106815132412,'20050628012319',15898952,43),(19,0,'AfghanistanTransportations','',2,1,0,0.674272520164282,'20060430070203", std::basic_ofstream<char,std::char_traits<char> > & out={...})  Line 38 + 0x59 bytes	C++
 	Shadow.exe!main(int argc=3, char * * argv=0x00355ca8)  Line 69 + 0x13 bytes	C++
 	Shadow.exe!__tmainCRTStartup()  Line 586 + 0x19 bytes	C
 	Shadow.exe!mainCRTStartup()  Line 403	C
 	kernel32.dll!7c816d4f() 	
 	[Frames below may be incorrect and/or missing, no symbols loaded for kernel32.dll]	
 	kernel32.dll!7c8399f3()

I expect the problem is that, in the loop, you do two ++i increments and that pushes you past s.end().

LATER: Yeah, that's precisely it. I had to turn off _SECURE_SCL to get your code working.

With iterator checking off, it runs in 70 seconds on my rig.
 
I hacked the code to only run through five INSERT statements:

Code:
    int n = 0;
    for ( string s; getline( in, s ); ) {
        if ( s.find( "INSERT INTO `page` VALUES" ) == 0 ) {
            if ( ++n == 5 )
                break;
            parseLine( s, out );
        }
    }

so that I could get a manageable run under the profiler. The instrumented run takes just less than 600 ms. Of that time, you're spending 263 ms in std::basic_string::push_back, and 191 ms in std::getline().

Your qstate() function calls push_back 1382718 times, and parseLine calls it 2426127 times.

I'm amused to see your program declare both qstate and parseLine as "inline".

By preallocating space in your output string:
Code:
inline void parseLine( const string& s, ofstream& out ) {
    const string::const_iterator size = s.end();
    string buffer;
    buffer.reserve(1024*1024);
    for ( string::const_iterator i = s.begin() + 26 ; i < size; ++i ) {

I get a 10% improvement, down to 62 seconds. Reallocating buffers is a problem, but the fundamental issue is using low-bandwidth I/O routines.
 
chipmonk010 said:
Here's another attempt at my Python version. I think this one will actually work correctly; it completes in 2m30s on my machine (dual Xeons @ 3 GHz).

parser.py

This is lots better, but you've still got some correctness problems. You're not converting \' to ', for example.

The code runs in 146 seconds on my machine. I ran again with the -O option and noticed no change in runtime.

My test machine runs at 3.2 GHz, and the input file is 412 million bytes.

My C++ solution runs in 9 seconds, so that's about 70 clock cycles per character.
Shadow's current solution takes 70 seconds, for 545 cycles per character.
The Python solution runs in 146 seconds... about 1135 cycles per character.
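
For reference, those cycles-per-character figures are just clock speed × runtime ÷ file size, treating the 412-million-byte file as 412 million characters:

Code:
3.2e9 Hz *   9 s / 412e6 chars ≈   70 cycles/char   (my C++)
3.2e9 Hz *  70 s / 412e6 chars ≈  545 cycles/char   (Shadow's C++)
3.2e9 Hz * 146 s / 412e6 chars ≈ 1135 cycles/char   (Python)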
 
@mike, could you give me a little more detail on the backslashes you're referring to?

Here's the same code but with Psyco enabled: parser2.py
This version requires you to install the Psyco optimizer for Python. On my machine this version runs 50 seconds faster.
 
Post #1 in this thread explains the escaping. A specific example of a record that should change when the \' escape is applied is in post #90 of this thread.

You have the same problem with \":

Code:
Line 2694
3375    0       "Love and Theft"                97      0       0       0.0622300403772202      20060629143033  61200299        22851

3375    0       \"Love and Theft\"              97      0       0       0.0622300403772202      20060629143033  61200299        22851

If you can provide a link to and instructions for installing Psyco, I can give it a twirl. But I think it's more important to get code that's working than to play around with magic optimizations.
 
To use Psyco, simply download and install the version that corresponds to your version of Python from the following link. No configuration is required. The updated version of my code below will automatically use Psyco if it is installed and will display a message indicating whether it's running with or without Psyco optimizations.
http://psyco.sourceforge.net/psycoguide/binaries.html

Here's my updated code; I fixed the escaped-characters problem. I thought I had already handled it, which is why I was confused when you mentioned it. Anyway, it's working correctly now.


parser3.py
 
mikeblas said:
This code crashes in release.

Thanks.

I can ditch the iterator way anyway. It didn't do anything for my program. I just left it that way after trying it to see if it was any faster. No detectable difference. It crashed for me at first too because I was using + 26 and != end(), which was causing it to jump past end(). That's why I used < end(), but I guess that doesn't cut it.

I get a 10% improvement, down to 62 seconds. Reallocating buffers is a problem, but the fundamental issue is using low-bandwidth I/O routines.

I tried reserve() before posting the code and it didn't help, but I only reserved 1024 or so.

I'm amused to see your program declare both qstate and parseLine as "inline".

Should I get rid of those? They should help in this situation. No?

Your qstate() function calls push_back 1382718 times, and parseLine calls it 2426127 times.

Oh wow. Thanks. I'm gonna squeeze as much as I can out of this before I try a better method of scanning the string. I really figured you'd get a better result seeing as I cut my time more than half, but ...

I've already been messing with a python version, but not there yet. Might as well throw in a ruby one eventually too.

Thanks.
 
chipmonk010 said:
Here's my updated code; I fixed the escaped-characters problem.

Not quite. If the escaped character is last in the string, your code gets it wrong:

Code:
Line 40584
45027   0       Kievan Rus'             1027    0       0       0.733274703596974       20060702204438  61737352        29011
45027   0       Kievan Rus\             1027    0       0       0.733274703596974       20060702204438  61737352        29011

That topic name started life as 'Kievan Rus\''. I suppose the problem is here:

Code:
	[newitem.append(field.strip("'")) for field in item]	#does not remove ' from inside string literals

where you strip all apostrophes off the end, not just one. You also fail to handle \\.

The runtime is 149 seconds, but I don't have Psyco installed yet.
 
mikeblas said:
You're trading memory allocation (which happens once per line, worst case) for inefficiently handling individual characters (which happens once per character, always).

Sort of related to this, I've been wondering. Would it be better to read a fixed chunk at a time instead of reading lines?
When asking for a line, the library has to scan the data for a newline as it reads, while reading a fixed amount of data could be cheaper. With a strictly per-byte algorithm, it won't matter how much you get at a time anyway, so there's no point in getting a line.

I'll code up an example later; right now I'm too tired (and not entirely sober), probably not the best state for reinventing buffered IO.
 
HHunt said:
Sort of related to this, I've been wondering. Would it be better to read a fixed chunk at a time instead of reading lines?
You're always reading a fixed chunk, eventually, from the OS, since the OS doesn't know what a line is.

HHunt said:
When asking for a line, the library has to scan the data for a newline as it reads, while reading a fixed amount of data could be cheaper.

Sure; this is what the fgets() function in the C standard runtime does. Just step through it and see. The overhead involved is copying; fgets() calls fread(), which calls read(), which calls ReadFile() (on Windows) and reads into a buffer internal to the FILE* you're using. fgets() then copies from that buffer to the buffer you gave fgets().

So when you reinvent buffered I/O, you might have the opportunity to do less copying; or do more efficient copying, or more efficient testing for the newline.
 
I might tool around with this just for my own personal learning experience. Oh and also, if you're still looking for a text editor that can handle this file, try Context. It's handling mine decently well.
 
My machine has 1 gig of memory (to be upgraded to 2 on Monday, when the FedEx truck comes); it's a 3.0 GHz Pentium 4 in an Asus P4P800.

Opening the Pages.SQL file with ConTEXT takes more than 2 minutes. If I cause the window to repaint (eg, maximize IE, ALT+TAB back to ConTEXT) it takes about 2 seconds to repaint the text visible on the screen.
 
parser-3.1.py

Here's the updated Python proggy. I fixed the literal-that-ends-with-a-' problem, and I added a line that turns \\ into \, but I'm not sure if the latter is needed. The output seems to be correct regardless... Python does like to get rid of double backslashes, so I'm not sure whether it's performing correctly or not when it comes to \\.

If all occurrences of \\ were left in, wouldn't the output print a \ before each escaped character, like below?
Code:
\ammerikka\'s most wanted\

cheers
 
mikeblas said:
My machine has 1 gig of memory (to be upgraded to 2 on Monday, when the FedEx truck comes); it's a 3.0 GHz Pentium 4 in an Asus P4P800.

Opening the Pages.SQL file with ConTEXT takes more than 2 minutes. If I cause the window to repaint (eg, maximize IE, ALT+TAB back to ConTEXT) it takes about 2 seconds to repaint the text visible on the screen.

Really? That sucks. I have a similar system (1 GB RAM, AMD 3500) and I can load the SQL file in approx. 40 seconds, and that was over a Samba share. I did just recently reformat, so that might help some.
 
Just to confirm the earlier assertion that reading more at a time is more effective (up to a point, at least):
I wrote a C++ class that will let you read a byte at a time from a file, through a buffer of a given size. When it reaches the end of the buffer, it refills it with a single read(). I then put it to use in a program that will simply read and discard a file a byte at a time. The run time, for assorted buffer sizes:
Code:
4194304 : 4.125000 sec.
2097152 : 4.063000 sec.
1048576 : 4.062000 sec.
524288 : 4.109000 sec.
262144 : 3.813000 sec.
131072 : 3.531000 sec.
65536 : 3.563000 sec.
32768 : 3.500000 sec.
16384 : 3.547000 sec.
8192 : 3.593000 sec.
4096 : 3.766000 sec.
2048 : 3.984000 sec.
1024 : 4.532000 sec.
512 : 5.640000 sec.
256 : 7.688000 sec.
128 : 11.906000 sec.
64 : 20.969000 sec.
32 : 37.468000 sec.
16 : 71.000000 sec.
8 : 190.329000 sec.
4 : 347.406000 sec.
Of course, my buffer handling code adds even more overhead per buffer fill, but I sincerely doubt it's the biggest contributor.

edit: I ran it once more but with the smallest sizes first (in case there are any caching effects), and got basically the same results. 32K seems to be the ideal size for me.
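
Here's a self-contained sketch of that kind of read-and-discard test (names and structure are illustrative, not my actual code, which goes through the buffered-reader class): each pass reads the whole file through a buffer of a given size and touches every byte once.

Code:
#include <cstdio>
#include <ctime>
#include <cstddef>
#include <fstream>
#include <vector>

// Read the whole file through a buffer of bufSize bytes, touching every byte,
// and return the elapsed time as reported by std::clock().
static double timeRun( const char* path, std::size_t bufSize ) {
    std::ifstream in( path, std::ios::binary );
    std::vector<char> buf( bufSize );
    unsigned long sink = 0;                        // keeps the byte loop honest
    const std::clock_t start = std::clock();
    for ( ;; ) {
        in.read( &buf[0], static_cast<std::streamsize>( buf.size() ) );
        const std::streamsize got = in.gcount();
        if ( got == 0 ) break;
        for ( std::streamsize i = 0; i < got; ++i )
            sink += static_cast<unsigned char>( buf[i] );   // "process" one byte
    }
    (void)sink;
    return double( std::clock() - start ) / CLOCKS_PER_SEC;
}

int main( int argc, char* argv[] ) {
    if ( argc != 2 ) return 1;
    for ( std::size_t size = 4194304; size >= 4; size /= 2 )
        std::printf( "%lu : %f sec.\n", (unsigned long)size, timeRun( argv[1], size ) );
    return 0;
}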
 
Looks like you found the optimum point around 32K. This'll vary from system to system. On my RAID 0 array, I expect it's 128K or 256K. (I/O request size tuning is one of the "secrets" that the "experts" in the Disk and Storage section of the forum usually forget to take into account when parroting their "RAID 0 is a scam" advice.)

It'll also vary with the frequency of the requests. After 32K, you're probably spending more time waiting for the disk to rotate again at the next call to read() than you are waiting for I/O. If your processing between two calls didn't take as long (or took longer) you'd spend less time waiting and that might change your curve.
 
mikeblas said:
Looks like you found the optimum point around 32K. This'll vary from system to system. On my RAID 0 array, I expect it's 128K or 256K. (I/O request size tuning is one of the "secrets" that the "experts" in the Disk and Storage section of the forum usually forget to take into account when parroting their "RAID 0 is a scam" advice.)

It'll also vary with the frequency of the requests. After 32K, you're probably spending more time waiting for the disk to rotate again at the next call to read() than you are waiting for I/O. If your processing between two calls didn't take as long (or took longer) you'd spend less time waiting and that might change your curve.

True. I'll link you a copy of my code (Win32-friendly this time) so you can check as soon as I find out why VS 2005 suddenly complains about my .NET runtime.
I'll also re-run it when I've added the actual fixer code to see if/how it changes.

edit: Geez, my Windows install has passed its best-before date. One of these days I'll get around to doing a full reinstall.
edit again: I can neither repair nor uninstall VS, and I found an unused hard drive. This is that day.
 
With the output buffered in a similar way (write a byte at a time to the buffer, write() the whole buffer to disk in a single operation when full), and the actual fixer code added, my times look like this.

I haven't verified the output yet, but if it looks ok I'll zip up my files and link them.

edit: Yeh, it looks good to me. The code is a bit unpolished, but it seems to work.
Apologies for what is probably rather ugly C++; it's not a language I claim to know.
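
A minimal sketch of the output side described above (the names are illustrative, not the actual code): bytes are appended to an in-memory buffer and written out with one large write() whenever it fills, plus a final flush at the end.

Code:
#include <fstream>
#include <vector>
#include <cstddef>

class ByteSink {
public:
    ByteSink( std::ofstream& out, std::size_t blockSize = 65536 )
        : out_( out ), buf_( blockSize ), pos_( 0 ) {}
    ~ByteSink() { flush(); }                       // push out any leftover bytes

    void putByte( char c ) {
        buf_[pos_++] = c;
        if ( pos_ == buf_.size() ) flush();        // buffer full: one big write
    }

    void flush() {
        if ( pos_ ) {
            out_.write( &buf_[0], static_cast<std::streamsize>( pos_ ) );
            pos_ = 0;
        }
    }

private:
    std::ofstream& out_;
    std::vector<char> buf_;
    std::size_t pos_;
};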
 
I just noticed that you posted your code; editing a note doesn't put it back onto my control panel, so I'm likely to not see it. (Even posting a new note seems to sometimes miss my control panel, here.)

Your code is very fast; nice work! It runs in its default configuration in about six seconds on my machine, faster than the high 8.9's I was getting on my code. With larger read blocks, it runs in just less than 5.8 seconds, best case.

Most of your performance increase actually comes from carefully thinking through the state machine for parsing. You can skip ranges of characters, for instance, without any extra testing (even for boundary conditions).
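
Purely as an illustration of that point (this is not HHunt's code): outside a quoted string no escapes can occur, so those stretches can be copied by a tight loop that only watches for the next delimiter, keeping all escape handling inside the quoted-string state. (The end check here is a safety belt that a version trusting the input grammar could drop.)

Code:
#include <string>

// Copy an unquoted field verbatim until the next delimiter; no escape tests.
const char* copyUnquotedField( const char* p, const char* end, std::string& out ) {
    while ( p < end && *p != ',' && *p != ')' && *p != '\'' )
        out.push_back( *p++ );
    return p;
}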

Since you're using /LTCG in the linker, you get some interesting optimizations. In particular, you end up inlining everything, so testing EOF isn't a function call, nor is getting the next character:

Code:
	while (! br->isEOF() ) {
		val = br->getByte();
00401136  mov         eax,dword ptr [edi+4] 
00401139  add         dword ptr [edi+0Ch],ebp 
0040113C  mov         ecx,dword ptr [edi] 
0040113E  mov         bl,byte ptr [ecx+eax] 
00401141  add         eax,1 
00401144  cmp         eax,dword ptr [edi+8] 
00401147  mov         dword ptr [edi+4],eax 
0040114A  jne         wmain+9Fh (40117Fh) 
0040114C  cmp         byte ptr [edi+18h],0 
00401150  jne         wmain+9Bh (40117Bh) 
00401152  mov         eax,dword ptr [edi+10h] 
00401155  push        eax  
00401156  push        ecx  
00401157  mov         ecx,dword ptr [edi+14h] 
0040115A  push        ecx  
0040115B  mov         dword ptr [edi+4],0 
00401162  call        dword ptr [__imp___read (4020BCh)] 
00401168  add         esp,0Ch 
0040116B  cmp         eax,dword ptr [edi+10h] 
0040116E  mov         dword ptr [edi+8],eax 
00401171  jae         wmain+97h (401177h) 
00401173  mov         byte ptr [edi+18h],1 
00401177  test        eax,eax 
00401179  jne         wmain+9Fh (40117Fh) 
0040117B  mov         byte ptr [edi+19h],1

Here's your BENCHMARK run on my rig.

Code:
C:\sqlfix\Release>sqlfix
In: 4096         Out: 4096       : 6.625 sec.
In: 4096         Out: 8192       : 6.359 sec.
In: 4096         Out: 16384      : 6.516 sec.
In: 4096         Out: 32768      : 6.171 sec.
In: 4096         Out: 65536      : 6.110 sec.
In: 4096         Out: 131072     : 6.250 sec.
In: 8192         Out: 4096       : 6.250 sec.
In: 8192         Out: 8192       : 6.234 sec.
In: 8192         Out: 16384      : 5.985 sec.
In: 8192         Out: 32768      : 5.859 sec.
In: 8192         Out: 65536      : 6.187 sec.
In: 8192         Out: 131072     : 5.829 sec.
In: 16384        Out: 4096       : 6.437 sec.
In: 16384        Out: 8192       : 5.969 sec.
In: 16384        Out: 16384      : 5.922 sec.
In: 16384        Out: 32768      : 6.172 sec.
In: 16384        Out: 65536      : 5.812 sec.
In: 16384        Out: 131072     : 6.141 sec.
In: 32768        Out: 4096       : 6.172 sec.
In: 32768        Out: 8192       : 5.984 sec.
In: 32768        Out: 16384      : 6.078 sec.
In: 32768        Out: 32768      : 5.828 sec.
In: 32768        Out: 65536      : 5.797 sec.
In: 32768        Out: 131072     : 6.047 sec.
In: 65536        Out: 4096       : 6.156 sec.
In: 65536        Out: 8192       : 6.188 sec.
In: 65536        Out: 16384      : 5.859 sec.
In: 65536        Out: 32768      : 5.859 sec.
In: 65536        Out: 65536      : 6.125 sec.
In: 65536        Out: 131072     : 5.922 sec.
In: 131072       Out: 4096       : 6.407 sec.
In: 131072       Out: 8192       : 5.953 sec.
In: 131072       Out: 16384      : 5.859 sec.
In: 131072       Out: 32768      : 6.031 sec.
In: 131072       Out: 65536      : 5.797 sec.
In: 131072       Out: 131072     : 5.953 sec.

Studying the code, you'll find that the compiler also flattens out the references made through your object pointers. That is, it makes br->member and bw->member references work as if they'd been coded against local variables instead of pointers. I've never seen the GNU compiler perform these optimizations (doing the inlining, then letting the inlining results influence global optimizations).
 
mikeblas said:
...

Code:
In: 32768        Out: 65536      : 5.797 sec.
In: 131072       Out: 65536      : 5.797 sec.

Those times are really impressive to me. Are those times the result of the first dry run, i.e. when the SQL file has not been cached/paged into memory in any way? I know you are running a RAID in your system, so his code benefits from the read-speed advantages of that. A back-of-the-envelope calculation finds that this program is doing 130-140 MB/s of raw I/O if you count both reading and writing.
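
Spelling that estimate out (taking, as an upper bound, an output file the same size as the 412-million-byte input):

Code:
(412 MB read + 412 MB written) / 5.8 s  ≈  142 MB/s

With the output somewhat smaller after the cleanup, that lands right in the 130-140 MB/s range.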
 
As you can see from the code, the "BENCHMARK" mode in HHunt's program just runs again and again. There's no supported way to clear the file system cache on Windows.

I can try a cold run after rebooting again, but I did that earlier in the thread and found that the differences were quite small.
 
Also, please help me put together a scoreboard. The thread is really spread out; if you can PM me with the post # that links to your newest version, I'll add it to the scoreboard. (I'll try to get to this later tonight.) That way, we can have the results in one post instead of spread throughout the thread.
 
generelz said:
Those times are really impressive to me. Are those times the result of the first dry run, i.e. when the SQL file has not been cached/paged into memory in any way? I know you are running a RAID in your system, so his code benefits from the read-speed advantages of that. A back-of-the-envelope calculation finds that this program is doing 130-140 MB/s of raw I/O if you count both reading and writing.

I got even larger numbers when I was toying with the FreeBSD buffer system.
With a P4 Xeon 2.8GHz (533MHz FSB), two ATA100 disks, FreeBSD 7-CURRENT, I got the time down to 3.66 seconds, or roughly 210 MB/s. Of course, I had to make sure the entire input file was in RAM before I ran the program, and the output didn't necessarily touch the physical disk until some unspecified time later. I imagine a cold run would take something like 15s.

However, Windows does not cache files as aggressively as FreeBSD can be coaxed into doing, and my IO is quite modest compared to Mike's.

Indeed, a few consecutive runs of one of these programs could be an interesting test of both IO and the effects of OS-level caching for different systems.
 
mikeblas said:
I just noticed that you posted your code; editing a note doesn't put it back onto my control panel, so I'm likely to not see it. (Even posting a new note seems to sometimes miss my control panel, here.)
No problem.

Your code is very fast; nice work! It runs in its default configuration in about six seconds on my machine, faster than the high 8.9's I was getting on my code. With larger read blocks, it runs in just less than 5.8 seconds, best case.

In general, your setup has much flatter times with various block sizes than mine. Could you try extending the range a bit, so we can see when it begins breaking down?
Mine just bombs when I move under 1K or above 32K input, while the size of the output buffer is less important.
I ran a small benchmark version that just read and discarded the entire file, and from 512 bytes down the time roughly doubled with each halving of buffer size. As expected, really.
And thanks. :)

Most of your performance increase actually comes from carefully thinking through the state machine for parsing. You can skip ranges of characters, for instance, without any extra testing (even for boundary conditions).
It's actually copied from the Java version, with a few minor changes (to remove all seeking). I'm rather happy with it. :)

Since you're using /LTCG in the linker, you get some interesting optimizations. In particular, you end up inlining everything, so testing EOF isn't a function call, nor is getting the next character:

<code omitted>

Studying the code, you'll find that the compiler also flattens out the references made through your object pointers. That is, it makes br->member and bw->member references work as if they'd been coded against local variables instead of pointers. I've never seen the GNU compiler perform these optimizations (doing the inlining, then letting the inlining results influence global optimizations).

I haven't really looked into the linker and compiler settings in VS, so I'll chalk that one up as good default settings. Some day I'll play around with icc and gcc and see how well it's possible to make them perform. I have some hope for icc, but ... we'll see.
 
Here's a run right after a reboot:

Code:
C:\sqlfix\Release>sqlfix
In: 4096         Out: 4096       : 6.937 sec.
In: 4096         Out: 8192       : 6.484 sec.
In: 4096         Out: 16384      : 6.125 sec.
In: 4096         Out: 32768      : 6.172 sec.
In: 4096         Out: 65536      : 6.203 sec.
In: 4096         Out: 131072     : 6.234 sec.
In: 8192         Out: 4096       : 6.188 sec.
In: 8192         Out: 8192       : 6.203 sec.
In: 8192         Out: 16384      : 5.859 sec.
In: 8192         Out: 32768      : 5.844 sec.
In: 8192         Out: 65536      : 5.969 sec.
In: 8192         Out: 131072     : 5.797 sec.
In: 16384        Out: 4096       : 6.281 sec.
In: 16384        Out: 8192       : 5.938 sec.
In: 16384        Out: 16384      : 5.828 sec.
In: 16384        Out: 32768      : 6.062 sec.
In: 16384        Out: 65536      : 5.750 sec.
In: 16384        Out: 131072     : 6.000 sec.
In: 32768        Out: 4096       : 6.063 sec.
In: 32768        Out: 8192       : 5.906 sec.
In: 32768        Out: 16384      : 6.078 sec.
In: 32768        Out: 32768      : 5.750 sec.
In: 32768        Out: 65536      : 5.813 sec.
In: 32768        Out: 131072     : 6.062 sec.
In: 65536        Out: 4096       : 6.125 sec.
In: 65536        Out: 8192       : 6.063 sec.
In: 65536        Out: 16384      : 5.796 sec.
In: 65536        Out: 32768      : 5.750 sec.
In: 65536        Out: 65536      : 5.938 sec.
In: 65536        Out: 131072     : 5.734 sec.
In: 131072       Out: 4096       : 6.187 sec.
In: 131072       Out: 8192       : 6.063 sec.
In: 131072       Out: 16384      : 5.875 sec.
In: 131072       Out: 32768      : 5.922 sec.
In: 131072       Out: 65536      : 5.734 sec.
In: 131072       Out: 131072     : 5.891 sec.
 
mikeblas said:
Here's a run right after a reboot:
<data>
The more moderate caching of NT and/or the IO capabilities of your rig do seem to almost completely remove the differences between a hot and cold run. (Very much like the last time you tried.)
Conversely, the much more aggressive caching of FreeBSD combined with my rather moderate disk IO makes for a clear difference.

Practically speaking, this makes your setup a much more convenient benchmarking platform. :D
 