C++ and Perl bakeoff

They've recently changed the GUI look in this new 6.0 release. I'm not liking it much either, but I'm getting used to it. (I had to turn off that current-line highlighting, and I really hate how the search panel has changed.) I've never noticed any flickering; I'll check later, but I was just pleased that it did have some large-file consideration.

I've used it for so long that it would have to take something majorly upsetting to get me to switch. ;o)
 
I've got the C++ code down to nine seconds. I think I can use a similar technique to get the time down for the C# code. See Post #56 for links.
 
fluxion said:
holy hell man....this is what i came up with for the comma-in-title case. brace yourself...

Code:
#replace commas in titles with <z>
$records =~ s/(\(\d+,\d+,'[^']+?),(?!')/$1<z>/g;

#replace commas with tabs
$records =~ s/,/\t/g;

#replace <z> with commas again
$records =~ s/<z>/,/g;

2 passes for that particular case... nasty regex... big capture+rewrite... unsurprisingly, it totally kills performance. i couldn't for the life of me figure out how to just ignore the commas-in-titles flat out. about an extra 30s on my machine with what is essentially your script. i haven't given up on it; a better method most certainly needs to be found. but it works... kinda
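For comparison, here's a rough Python sketch of the same placeholder idea, using a substitution callback so that every comma inside a title gets protected, not just the first one the scan reaches (the `(id,ns,'title',...` field layout is assumed from the regex above; this is a sketch, not anyone's posted solution):

```python
import re

SENTINEL = "<z>"

# Matches a tuple's opening fields up to and including the quoted title.
# The (id,ns,'title',... layout is assumed from the regex above.
TITLE = re.compile(r"(\(\d+,\d+,')((?:[^'\\]|\\.)*)(')")

def protect_title_commas(records):
    """Hide every comma inside the quoted title behind a sentinel,
    turn the remaining commas into tabs, then restore the hidden ones."""
    records = TITLE.sub(
        lambda m: m.group(1) + m.group(2).replace(",", SENTINEL) + m.group(3),
        records)
    records = records.replace(",", "\t")
    return records.replace(SENTINEL, ",")
```

Same caveat as any sentinel scheme: input that already contains the sentinel will be corrupted.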

Code:
fluxion@purity:~/scripts/perl_bakeoff$ head -3 output
1       0       AaA             8       1       0       0.116338664774167       20060401120725  46448774        70
5       0       AlgeriA         0       1       0       0.553221851171201       20060301005610  18063769        41
6       0       AmericanSamoa           0       1       0       0.387834867823715       20060301005610  18063795        48
fluxion@purity:~/scripts/perl_bakeoff$ tail -3 output
5803516 6       CVCOG logo.png          0       0       1       0.723697823784  20060702231120  61757981        68
5803517 0       Phalacrocorax nigrogularis              0       1       1       0.955231507353  20060702231129  61758005        31
5803518 6       Uniontown PA.jpg                0       0       1       0.111196242907  20060702231130  61758009        89

i get a discrepancy on the file output: 374,030,735 bytes instead of the 378,689,587 bytes mikeblas got. so i don't know what's up; everything seems right

here's the full script if you wanna check it against your output, mikeblas. replace the \r\n with \n and i think you should get the proper output this time

Code:
#!/usr/bin/perl

use strict;
use warnings;

open(FILE, '>', 'output') or die "can't open output: $!";

while (my $records = <>) {
  #skip all uninteresting lines
  next unless ($records =~ /INSERT INTO `page` VALUES/o);

  #remove sql info from beginning, remove ); from the end
  $records = substr($records, 27, -3);

  #replace commas in titles with <z>
  $records =~ s/(\(\d+,\d+,'[^']+?),(?!')/$1<z>/g;

  #replace ),( with newlines
  $records =~ s/\),\(/\n/g;

  #remove non-escaped quotes 
  $records =~ s/(?<!\\)'//g;

  #commas with tabs
  $records =~ s/,/\t/g;

  #put embedded commas back
  $records =~ s/<z>/,/g;

  #underscores with spaces
  $records =~ s/_/ /g;

  #remove an escape off any sequence of escapes
  $records =~ s/(?<!\\)(\\+)\\(?!\\)/$1/g;

  print FILE "$records\r\n";
}

close FILE;

that's enough for one night

Well I've been out of this thread for a couple days, but let's see...

Now that mikeblas has brought his down to a sub-10 second processing time, I think our favorite dirty scripting language is done for in terms of processing (conciseness of solution can be debated). :D

I was hoping to keep my code as far from performance hindrances in Perl as possible, but I think that to obtain the correct solution with regular expressions, we will have to give up runtime, fluxion. If you read my original (non-correct) code, you'll see I try to avoid hitting up regular expressions and captures as much as possible. Doing several million captures is a pretty big performance hit, I'd imagine. And achieving something closer to correctness balloons the runtime significantly in this case.

I'll try to take a look and see what I can come up with. I will not go down without kicking and screaming. :)
 
I don't know perl, but how fast would a stateful per-byte solution (like mine or Mike's) in perl be?
It's against the spirit of perl and what we're testing, but if it's faster ...

By the way, there's one thing that's missing so far. Someone has got to make a shellscript. I'm sure this can be done with sh/awk/sed/tr, but I do not want to be the one to prove it. :p
 
I'd love to see a Python solution, too.

I don't think a single-pass solution in Perl can be efficient. You can't index directly into a string's characters because Perl doesn't offer it, AFAICT. You have to use substr(), which builds a new string (which probably involves an allocation, and so on). Maybe, in the end, it's still faster than splitting and regexing and so on.

What bothers me more, though, is that we still don't have a Perl implementation that's robust. Replacing commas with "<z>" and then replacing "<z>" back leaves you open to tripping on input that already had "<z>" in it.
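A sentinel-free way to do it, sketched in Python rather than Perl: make one pass with an alternation that matches either a whole quoted string or a bare comma, and only rewrite the bare commas. Nothing is ever substituted into the data, so "<z>" in the input is harmless (the \' and \\ escape conventions are assumed from the dump format):

```python
import re

# Either one complete quoted SQL string (with \' and \\ escapes kept
# intact) or a bare field-separating comma.
TOKEN = re.compile(r"'(?:[^'\\]|\\.)*'|,")

def commas_to_tabs(line):
    """Single pass: quoted strings are matched whole and kept as-is,
    so only the commas between fields become tabs."""
    return TOKEN.sub(lambda m: "\t" if m.group() == "," else m.group(), line)
```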
 
I know there is a way to speed my Java version up considerably...however unfortunately I guess I don't have the prowess as far as working with raw byte buffers to do it.
 
It so happens that I just finished a rather ugly java version that's basically a port of my C. While slower, it's still the fastest java version so far, unless I've overlooked something.
I'm certain there's room for improvement, though.


edit: New version which produces a file with the correct length, and looks good as far as I can see. I haven't looked too closely at the unicode, but since everything is done per-byte, there's a decent chance it's ok. Takes about 30 sec to run, and spikes my CPU at 100% from start to finish. (Definitely not IO bound, then.)
I suggest running it with java -Xms200m -Xmx200m . It seems to remove a few seconds.
 
HHunt said:
I suggest running it with java -Xms200m -Xmx200m . It seems to remove a few seconds.
200 megs of stack space? Why?
 
BillLeeLee said:
Well I've been out of this thread for a couple days, but let's see...

Now that mikeblas has brought his down to a sub-10 second processing time, I think our favorite dirty scripting language is done for in terms of processing (conciseness of solution can be debated). :D

I was hoping to keep my code as far from performance hindrances in Perl as possible, but I think that to obtain the correct solution with regular expressions, we will have to give up runtime, fluxion. If you read my original (non-correct) code, you'll see I try to avoid hitting up regular expressions and captures as much as possible. Doing several million captures is a pretty big performance hit, I'd imagine. And achieving something closer to correctness balloons the runtime significantly in this case.

I'll try to take a look and see what I can come up with. I will not go down without kicking and screaming. :)

yah, the regex definitely has to be streamlined. but even then... i'm thinking a whole different approach is in order, just a single traversal through each line or something. i did a misguided attempt at a multithreaded solution, thinking if i could keep reading in lines while the processing took place i could speed it up a bit... but that didn't work out too well. probably had more to do with my approach than the concept though. i'm definitely gonna have to put this on the backburner till this weekend though. lookin forward to seeing what you might come up with though

edit: wow, lots of "though"'s in there
 
mikeblas said:
200 megs of stack space? Why?

I have no idea. It's a slightly insane amount, but it does actually seem to have an effect.
(Maybe the GC runs less often when there's more space to work in? I'd love input from some java guru.)

And by all means, do test without specifying, and with different amounts. Maybe it was just a fluke, I wasn't especially thorough.
 
mikeblas said:
200 megs of stack space? Why?

That argument is not stack space, it is heap space in the JVM. By default I believe the JVM allocates 16Mb of heap space. My guess is that increasing the heap space allows java to map more of the file into memory.
 
generelz said:
That argument is not stack space, it is heap space in the JVM. By default I believe the JVM allocates 16Mb of heap space. My guess is that increasing the heap space allows java to map more of the file into memory.

According to the help for my JRE, Xms is stack space and Xmx is heap space. Heap I can understand (I guess), but 200 megs of stack space is absurd, especially when there's no deep recursion going on.
 
According to the command line options here, it's initial and max heap size. (Should have looked at that yesterday, I guess.)
Given that I do indeed map the file into memory, I can see how that might be why it helps.
 
Oh! You're right; I misread the table. -Xss is stack size, not -Xms. Man, that's a real relief!
 
mikeblas said:
Oh! You're right; I misread the table. -Xss is stack size, not -Xms. Man, that's a real relief!
Heh. Restores your faith in java as at least somewhat sane?
 
HHunt said:
Heh. Restores your faith in java as at least somewhat sane?
Indeed, I'm happy to be wrong on that one. 200 megs of stack space is really hard to explain.
 
mikeblas said:
I've got the C++ code down to nine seconds. I think I can use a similar technique to get the time down for the C# code. See Post #56 for links.

Wow. 9 seconds is rather impressive. I don't even think I could get C# to read the contents of the .sql file in less than 9 seconds, let alone hope that I can do all the processing.

Right now, just a tight loop reading the contents of the file takes 22 seconds on my computer (see sig for details). I may be doing something wrong, or it might be the amount of garbage collection required to clean up the resources of the unreferenced memory.

I'd post the details of the loop, but I'm at work right now. From memory, it seems like it was:

Code:
string infile = [...];
StreamReader reader = new StreamReader(infile);
while (!reader.EndOfStream)
{
    string line = reader.ReadLine();
}

Edit: This is using .NET 2.0 and Visual C# 2005 Express Edition.
 
It might be your machine. (Your sig doesn't say anything about your disk configuration.)

According to my notes, my C# version does the reading, plus the processing and the writing, in 34 seconds... and I haven't gotten 'round to the same optimization I did for the C++ version just yet.
 
Reading + processing (using my C version on FreeBSD) varies between 26 and 3.6 seconds depending entirely on how much of the input file is in the file system buffers. In the best case, it does no physical reads, which explains the big difference. I would expect similar things to come into play on windows, independent of the language.

BTW, did you get any timings on my java?
 
Windows isn't as aggressive. (I provided timings and notes about this earlier in the thread.) I haven't tuned anything, but I'd actually want this code to not cache any file access. It's doing a sequential read, for one pass, through the file. Why evict other useful stuff from the cache, and spend time caching (which means copying) when you don't need to? Same for writing; sequential one-pass writes are best unbuffered.

My C++ code is using fopen() to get the files it uses. If I were to truly optimize this code, I'd use CreateFile(), or use one of the MS-specific fopen() replacements, that allows me to change the access mode of the file so I can specify FILE_FLAG_SEQUENTIAL and FILE_FLAG_NOBUFFERING and so on.
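On POSIX systems a similar hint is available without special open flags; here's a minimal Python sketch (assuming os.posix_fadvise exists, which it does on Linux/FreeBSD — and note it's a kernel hint, not a guarantee):

```python
import os

def open_sequential(path):
    """Open a file for reading and hint the kernel that we'll scan it
    once, front to back -- the POSIX cousin of FILE_FLAG_SEQUENTIAL_SCAN."""
    fd = os.open(path, os.O_RDONLY)
    if hasattr(os, "posix_fadvise"):  # not available on every platform
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
    return os.fdopen(fd, "rb")
```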

I haven't done timings yet for your Java code yet, no; when I have time to do it, I'll post them. Hopefully, that'll be tonight or Saturday.
 
Incidentally, I looked into that with the file reading benchmark I put together earlier in the thread : In FreeBSD, using read() will buffer aggressively. Using mmap() will apparently not buffer at all.

In order to get down to the lowest times, I cheated by reading through the file in some other way (I think I used your C++) to move it into memory first.
 
If I understand right, you're doing physical I/O operations to read the file into your address space, with the side-effect of getting it into cache. Then, logical I/O operations to find it in cache and move it into your application.

Why is that faster?

In your toolset, isn't fgets() implemented in terms of read() (eventually)?
 
mikeblas said:
If I understand right, you're doing physical I/O operations to read the file into your address space, with the side-effect of getting it into cache. Then, logical I/O operations to find it in cache and move it into your application.

Why is that faster?

In your toolset, isn't fgets() implemented in terms of read() (eventually)?

It's faster because the physical IO is done before I start the timer. (This is why I referred to it as cheating.)
The entire process that led to the 3.6 second time was:
1: Run a program that reads the file with read(). As an intended side effect, the entire file is now in the system cache.
2: Time a run of my program, which then ends up only reading from the cache.

Omitting step 1 will most likely yield a run time slightly slower than an ideal implementation using read. [1] And yes, I'm fairly sure fgets is implemented with read.

[1] Our programs performed fairly close to each other with a cold file. Mine seemed to gain more from caching than yours, for some reason. I haven't tried your newest version yet.
However, just reading the entire file and summing the bytes is slightly faster with read than mmap when it's fully cached.
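The read()-versus-mmap comparison is easy to reproduce; here's a hedged Python sketch of that "read the whole file and sum the bytes" micro-benchmark (timing left to the reader, e.g. with time.perf_counter; as described above, the results swing heavily with cache state):

```python
import mmap

def sum_read(path, bufsize=1 << 20):
    """Sum every byte using plain buffered read() calls."""
    total = 0
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            total += sum(chunk)
    return total

def sum_mmap(path):
    """Sum every byte through a memory mapping of the whole file."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return sum(m[:])
```

Run each twice in a row to see the cold-cache versus warm-cache difference.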
 
HHunt said:
It so happens that I just finished a rather ugly java version that's basically a port of my C. While slower, it's still the fastest java version so far, unless I've overlooked something.
I'm certain there's room for improvement, though.

Takes about 39 seconds on my rig, but it doesn't handle Unicode correctly.
 
mikeblas said:
Takes about 39 seconds on my rig, but it doesn't handle Unicode correctly.

The time sounds about right; strange that the unicode acts up. (Everything should be per-byte, so I wonder how it happens.) I'll look at it, and also try a few different approaches to see if I can make it faster. (Or slower; it might be interesting to know if any plausible methods are much worse.)
 
HHunt said:
The time sounds about right; strange that the unicode acts up. (Everything should be per-byte, so I wonder how it happens.) I'll look at it, and also try a few different approaches to see if I can make it faster. (Or slower; it might be interesting to know if any plausible methods are much worse.)

I think you have to construct your OutputStreamWriter to be UTF-8 aware by doing the following:

Code:
ofb = new BufferedWriter(new OutputStreamWriter(oStream, Charset.forName("UTF-8")));
 
generelz said:
I think you have to construct your OutputStreamWriter to be UTF-8 aware by doing the following:

Code:
ofb = new BufferedWriter(new OutputStreamWriter(oStream, Charset.forName("UTF-8")));

Ah, right. (Or I could find a buffered output method that works with bytes instead of characters, I guess.)
 
hey guys, i'm learning python and thought this 'challenge' would be a good exercise, so i wrote this little script.
note: i only read the first three pages of this thread, so sorry if a python script has already been posted or if the bake off is basically over. anyway, here it is:

Code:
def parse_line(infile, outfile):
        for line in infile:
                #skip useless lines
                                                                #jump to next line if:
                if (len(line) < 27                              #line is shorter than 27 chars
                        or line[26] != '('                      #line's 27th char is not (
                        or line.find('(`') != -1): continue     #line contains (`
                #parse useful text line by line
                line = line[26:]                                #chop beginning insert statement
                line = line.replace('),', '\n')                 #make newline for each data-item
                line = line.replace('(', '')                    #remove the ( from front
                line = line.replace('_', ' ')                   #replace _'s with spaces
                line = line.replace("'","")                     #remove single quotes
                line = line.replace(',','\t')                   #replace commas with tabs
                outfile.write(line)                             #write the parsed line to output file



#uncomment below to use timer
#replace 'dump-file' with the name of the wikipedia file
#this script must be in the same dir as that file

#import time
#start = time.clock()
parse_line(open('dump-file', 'r'), open('parsed', 'w'))
#end = time.clock() - start
#print end
 
chipmonk010 said:
hey guys, i'm learning python and thought this 'challenge' would be a good exercise, so i wrote this little script.
note: i only read the first three pages of this thread, so sorry if a python script has already been posted or if the bake off is basically over. anyway, here it is:
Thanks for sharing your code. I don't think it's close to working right, though; you're not doing the escapes (like \' to ') and you're replacing too many commas.
 
heres a snippet of my output:
Code:
2945    0       Affectional orientation         136     0       0       0.336258885763827       20060630015336  59882690        1740
2946    1       Ariel Sharon            179     0       0       0.466582354592702       20060630193320  61024222        99015
2947    0       Anoa            143     0       0       0.233213187412557       20060702193913  59138971        3199
2948    0       Agner Krarup Erlang             504     0       0       0.157289310504952       20060624144102  58976322        5280
2949    0       Arab-Israeli conflict           4507    0       0       0.0868070210207319      20060702053647  61446102        11190
2950    0       Anyone Can Whistle              98      0       0       0.962167204322449       20060630032913  61167368        3596
2951    0       Althusser               12      1       0       0.550414989283695       20050628012321  15901329        31
2952    0       Alcopop         195     0       0       0.865573364184772       20060630233839  61460885        5492
2953    0       Afrikaner               544     0       0       0.67662188380642        20060630174723  61212457        19437
2954    1       AI-complete             22      0       0       0.156289613019945       20051025001225  25917003        494
2955    0       Alkali          267     0       0       0.786389357211431       20060630230838  61456655        4524
2956    0       Ain\t I a Woman? book)          296     0       0       0.902081165371538       20060325055539  45380184        2765

Are the backslashes you are talking about like the one in the last line?
What commas should i not be replacing?
 
chipmonk010 said:
Are the backslashes you are talking about like the one in the last line?
What commas should i not be replacing?
You're replacing all commas in the line with tabs. There are commas within the quoted strings that should not be replaced. This was covered earlier in the thread when someone else made the same mistake. (LATER: see my response to BillLeeLee in post #71 of this thread.)

Indeed, you've found one of the problems with the escaped characters; your code doesn't handle any escaped character cases.
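To make the point concrete, here's a small quote-aware splitter in Python (a sketch of the state-machine idea, not anyone's posted solution): it splits on commas only outside single-quoted strings, and decodes \' and \\ along the way:

```python
def split_fields(row):
    """Split one tuple body on commas, skipping commas that fall inside
    single-quoted strings; \' and \\ escapes are decoded."""
    fields, buf, in_quote, i = [], [], False, 0
    while i < len(row):
        c = row[i]
        if in_quote and c == "\\" and i + 1 < len(row):
            buf.append(row[i + 1])     # keep the escaped character itself
            i += 2
            continue
        if c == "'":
            in_quote = not in_quote    # quotes delimit; they aren't data
        elif c == "," and not in_quote:
            fields.append("".join(buf))
            buf = []
        else:
            buf.append(c)
        i += 1
    fields.append("".join(buf))
    return fields
```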
 
ok, i revised my code. the only thing i don't really know what to do with are things like \xa1, \xc3, etc. what are these? should i let them pass through? anyway, here's my code:

Code:
def parse_line(infile, outfile):
        for line in infile:
                #skip useless lines
                #jump to next line if:
                if (len(line) < 27                              #line is shorter than 27 chars
                        or line[26] != '('                      #line's 27th char is not (
                        or line.find('(`') != -1): continue     #line contains (`
                #parse useful text line by line
                line = line[27:]                                #chop beginning insert statement
                line = line.replace('),', '\n')                 #make newline for each data-item
                line = line.replace('\n(','\n')                 #drop the ( after each newline
                line = line.replace(',_','\c')                  #protect commas inside titles
                line = line.replace(',','\t')                   #replace remaining commas with tabs
                line = line.replace('\c',', ')                  #restore protected commas
                line = line.replace(',(', '')                   #remove leftover ,( pairs
                line = line.replace(');','')                    #remove end of line marker
                line = line.replace('_', ' ')                   #replace _'s with spaces
                line = line.replace("\t'","\t")                 #strip quotes around titles
                line = line.replace("'\t","\t")
                line = line.replace("\\'","'")                  #unescape apostrophes
                outfile.write(line)                             #write the parsed line to output file


#uncomment below to use timer
#replace 'dump-file' with the name of the wikipedia file
#this script must be in the same dir as that file

#import time
#start = time.clock()
parse_line(open('dump-file', 'r'), open('parsed', 'w'))
#end = time.clock() - start
#print end
 
I'm really surprised, chipmonk010, that there isn't a better way to do this in Python. The code you've written makes 11 separate passes over the string. The C++ solution I wrote only goes through the string once. At best, this means you'll be about 11 times slower than the C++ code.

Your code still fails to handle the string literals correctly. The string literal field can't be touched at all, except for the escapement characters. Your code will replace characters within string literals because it is completely unaware of them. For example, what if I have a topic named 'Canaan (town), Connecticut'? Your code will see the '),' in it, and replace it with a newline. There's no topic named 'Canaan (town\n Connecticut' on Wikipedia.

There are escapements that you don't handle; \\, for example, should be translated to \.
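The escapement handling on its own is nearly a one-liner; here's a hedged Python sketch covering just the two escapes mentioned here, \' and \\ (the real dump may use others, such as \"):

```python
import re

# One regex pass, so that the input \\' decodes correctly: the engine
# consumes each escape pair whole and never re-reads a backslash it
# has already emitted.
ESCAPE = re.compile(r"\\(['\\])")

def unescape(s):
    """Decode \' -> ' and \\ -> \ in a single pass."""
    return ESCAPE.sub(r"\1", s)
```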
 
i'm sure there is a better way to do this in python. i'm learning the language and programming in general, so i thought i would use this as an exercise. your points are well taken, and i see that this design simply isn't going to work.

as for speed: as is, this program completes the wikipedia file in 75 seconds. i started another version that used a nested while loop, and i projected it would take 200 minutes to complete the conversion.

i realize this is more a contest for experienced programmers, and i won't clutter the thread if it's just an annoyance, but i'm willing to learn and any pointers would be helpful.
 
This isn't a contest. I've already provided what pointers I can think of offering -- if you have a specific question, I'd be happy to try to answer it.
 
@Mike

I want to figure out python and perl versions that do not use regex, but I need to test the methods I want to use first before I translate them.

Could you check the following to make sure it's O.K.

Code:
#include <iostream>
#include <string>
#include <fstream>
using namespace std;

inline void qstate( const string& s, size_t& i, ostream& out ) {
    while ( i < s.size() ) {
        ++i;
        if ( s[i] == '\\' && i + 1 < s.size() ) {
            if ( s[ i + 1 ] == '\'' ) {
                out << '\'';
                ++i;
            } else if (  s[ i + 1 ] == '\\' ) {
                out << '\\';
                ++i;
            } 
        } else if ( s[i] == '\'' ) {
            break;
        } else if ( s[i] == '_' ) {
            out << ' ';
        } else if ( s[i] != ' ' ) {
            out << s[i];
        }
    }
}

inline void parseLine( const string& s, ostream& out ) {
    for ( size_t i = 26; i < s.size(); ++i ) {
        if ( s[i] == '\'' ) {
            qstate( s, i, out );
        } else if ( s[i] == ',' ) {
            out << '\t';
        } else if ( s[i] == ')' ) {
            out << "\r\n";
            i += 2;
        } else if ( s[i] != '(' ) {
            out << s[i];
        }
    }
}

int main( int argc, char* argv[] ) {
    if ( argc != 3 ) {
        return 1;
    }
    ifstream in( argv[1], ios::binary );
    if ( !in ) {
        return 1;
    }
    ofstream out( argv[2], ios::binary );
    if ( !out ) {
        return 1;
    }
    for ( string s; getline( in, s ); ) {
        if ( s.find( "INSERT INTO `page` VALUES" ) == 0 ) {
            parseLine( s, out );
        }
    }
}

// g++ -Wall -Wextra -ansi -pedantic fixer2.cpp -o fixer2 -O3 -msse3 -mtune=i686 -s

I'm doing that on an ancient PII system on Kubuntu. It's a pain to even check a few lines of output to know if it's right. Also, it's really slow for me; could you check it on your system to see how slow it is?

I think you said that you wanted to do a perl method that didn't use regex, but you were worried about substr() being too slow. Are you sure that substr( $str, pos, 1 ) is not as quick as accessing a char in C++ with []?

Also, what do you think about just reading each char from the file and deciding what goes to the output file instead of reading in each line at a time and parsing that line? If you could do that while properly detecting the insert lines, would that be faster (from drive a to drive b)?

Thanks
 
Hi, Shadow.

I built it and ran it. It hangs and doesn't write anything to the output file. I invoked it with this command line:

Code:
shadow f:\links\enwiki-20060702-externallinks.sql output

I'm in the middle of something at the moment ... I can give it a look in about an hour.

Shadow said:
Are you sure that substr( $str, pos, 1 ) is not as quick as accessing a char in c++ with []?
I don't think it is because I believe Perl will create a new string object instead of just getting a character.

Let's think about it in low-level pseudocode. In C, getting a character means:

Code:
get string
add index to string address
load a byte from that address
work on the byte value we loaded

in Perl, what it means is:

Code:
ask the string object for its base address
add the index offset to that address, if it is in bounds
load a byte from that address
create a new string object of length one
   allocate memory for it
   or die if memory fails
   tell refcount/garbage collector about new string object
copy byte to new string object
set length of new string object

For a few (or even a big bunch) of strings, this isn't a problem. People will think it outweighs the C/C++ requirement of managing the memory your own bad self. But if we're doing this for each character of a 480 megabyte file (and that's the smallest file I've got at the moment) this is past "a big bunch" and is up to a "real fucking lot".

Shadow said:
Also, what do you think about just reading each char from the file and deciding what goes to the output file instead of reading in each line at a time and parsing that line? If you could do that while properly detecting the insert lines, would that be faster (from drive a to drive b)?
For slickness? It would be kind of neat; you'd invent a nifty little state machine. That's kind of sexy, in a CS nerd way.

For performance? It would suck. You have to do a call to get each character, juggle around a buffer, and then finally get the character. In my implementation (IIRC) I'm using strchr() to find things. strchr() is very highly optimized, if you have a compiler with a decent and mature runtime library implementation. Instead of using an eight-bit register to get a character and decide, it's going to get four bytes at a time in a 32-bit register and do fancy shifting and masking to find the beginning of your string. Doesn't that sound lots faster than setting up a function call for each character?
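The same effect shows up in any high-level language. Here's a quick Python sketch contrasting an interpreted per-character loop with the C-implemented str.find, which is Python's rough analogue of strchr():

```python
TEXT = "abc,def," * 100_000   # roughly 800 KB of sample data

def count_loop(s, ch):
    """One index and one compare per character, all in interpreted
    code -- roughly what calling substr() per character costs in Perl."""
    n = 0
    for i in range(len(s)):
        if s[i] == ch:
            n += 1
    return n

def count_find(s, ch):
    """Let the C-implemented str.find do the scanning, strchr()-style."""
    n, i = 0, s.find(ch)
    while i != -1:
        n += 1
        i = s.find(ch, i + 1)
    return n
```

Timing both with timeit typically shows the find() version several times faster; the exact ratio depends on the interpreter and the data.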
 
mikeblas said:
Hi, Shadow.

I built it and ran it. It hangs and doesn't write anything to the output file. I invoked it with this command line:

Thanks

hmm. It doesn't hang for me, but I only tested with enwiki-20060702-page.sql and tiny portions of it.

I'm doing:

./fixer2 enwiki-20060702-page.sql fixed.sql

g++ version is 4.0.3

Edit:

What happens if you drop the ios::binary from the ifstream? I don't remember if getline() works right in binary mode on windows.
 
It isn't hung; it's just very very slow. C++ stream I/O isn't noted for its performance, and going character-by-character against the streams is the worst of both worlds.

On my machine, a release build runs in about 92 seconds.

The good news is that your output is correct; IIRC, you're the first person to post a correct solution on the first try.
 
mikeblas said:
It isn't hung; it's just very very slow.
O.K. I'll have to make it faster.

C++ stream I/O isn't noted for its performance, and going character-by-character against the streams is the worst of both worlds.

I believe I can make better use of c++ streams.

On my machine, a release build runs in about 92 seconds.
It takes 4 to 5 times longer on mine. I'll see if I can knock a significant amount off of that.

Edit:

I can cut my time in half by using out.put(char) everywhere instead of out << char.

The good news is that your output is correct
Awesome. Thanks.
 