I have 3+ GB text files (~2.5 million records) that I need to parse some data out of. Ideally this would need to happen in under 60 seconds.
The lines I care about all start with one of two patterns: "TOT" or "TRL". They're essentially "trailer" records and the file contains many of them.
I wrote a first version of this in bash, and it works, but it's incredibly slow (takes maybe 10+ minutes depending on the load of the server I happen to be running it on). The bash implementation mostly uses egrep to capture the lines, and then I just did capture-group regular expressions from there. This works fairly quickly on smaller files (less than 1 GB), but it just chokes on the bigger ones. The egrep over the whole file is obviously the slow part.
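In rough outline the bash approach looks something like this (a simplified sketch; the file name, the somefield= capture pattern, and the echo output are just placeholders for illustration, the real script pulls out more than this):
Code:
egrep '^(TOT|TRL)' bigfile.txt | while read -r line; do
    # pull the record type plus one field out of each trailer line
    # (the real capture pattern is more involved; this just shows the shape of it)
    if [[ $line =~ ^(TOT|TRL).*somefield=([0-9]+) ]]; then
        echo "${BASH_REMATCH[1]} ${BASH_REMATCH[2]}"
    fi
done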
We have Perl 5.8 available to us (other tools are somewhat out of date as well; Python is v2.4, etc.).
The Perl version shaves off about 4 minutes, using a loop like this that hits the file one line at a time:
Code:
my $cnt = 0;
open my $fh, '<', $ARGV[0] or die "Could not open file: $!";   # file name passed on the command line
while (<$fh>) {
    $cnt++ if /^TRL/;   # count the trailer records
}
print "$cnt matching lines in file.\n";
When I execute this code I get about a 6-minute run time, as I mentioned before:
Code:
-bash-3.2$ time test.pl
16 matching lines in file.
real 6m8.820s
user 0m12.399s
sys 0m14.822s
Is that about as efficient as it gets with perl?
I guess doing this in a compiled language is also an option, although I'd have to learn the syntax - it's not exactly a scary proposition - just new ground for me.