I have 3+ GB text files (~2.5 million records) that I need to parse some data out of. Ideally this would need to happen in under 60 seconds.
The lines I care about all start with one of two patterns: "TOT" or "TRL". They're essentially "trailer" records and the file contains many of them.
I wrote a first version of this in bash, and it works, but it's incredibly slow (takes maybe 10+ minutes depending on the load of the server I happen to be running it on). The bash implementation mostly uses egrep to capture the lines, and then I just did capture-group regular expressions from there. This works fairly quickly on smaller files (less than 1 GB), but it just chokes on the bigger ones. The egrep over the whole file is obviously the slow part.
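In rough outline the bash approach looks something like this (a simplified sketch; the file name, the somefield= capture pattern, and the echo output are just placeholders for illustration, the real script pulls out more than this):
Code:
egrep '^(TOT|TRL)' bigfile.txt | while read -r line; do
    # pull the record type plus one field out of each trailer line
    # (the real capture pattern is more involved; this just shows the shape of it)
    if [[ $line =~ ^(TOT|TRL).*somefield=([0-9]+) ]]; then
        echo "${BASH_REMATCH[1]} ${BASH_REMATCH[2]}"
    fi
done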
We have Perl 5.8 available to us (other tools are somewhat out of date as well; Python is v2.4, etc.).
The Perl version shaves off about 4 minutes, using a loop like this that hits the file one line at a time:
Code:
my $cnt = 0;
open my $fh, '<', $ARGV[0] or die "Could not open file: $!";   # file name passed on the command line
while (<$fh>) {
    $cnt++ if /^TRL/;   # count the trailer records
}
print "$cnt matching lines in file.\n";
When I execute this code I get about a 6-minute run time, as I mentioned before:
Code:
-bash-3.2$ time test.pl
16 matching lines in file.
real 6m8.820s
user 0m12.399s
sys 0m14.822s
Is that about as efficient as it gets with perl?
I guess doing this in a compiled language is also an option, although I'd have to learn the syntax - it's not exactly a scary proposition - just new ground for me.