Java's split too slow

CefiroZ · Jan 12, 2010

I am currently working on a project where I am parsing data line-by-line. I only need parts of each line so I am splitting on spaces to extract the string tokens and then continue on. I have already created a compiled Pattern and am using Pattern's split instead of String's version. I did this because I have to split on " +" because the spacing is not consistent between tokens.

My problem is split is still taking too long. The split required for each line is using up almost all of the execution time and this program needs to run fast. And before you say "use C++!" I'm limited to using Java for this due to cross-platform needs.

So, is there a faster implementation of split out there? Or an alternative way to do it? I know StringTokenizer is faster but it's not exactly recommended these days and I need the regex capability of split obviously.

J-Will · Jan 12, 2010

Are you reading in input from the Java program, or from a text? If you're readig it in from a text doc you can just input it directly into an array using a loop.

Tawnos · Jan 12, 2010

Ignore this post, it was dumb

ambientZ · Jan 12, 2010

splitting a string is a relatively simple operation - im not sure what exactly could be done to speed it up.

I also find it suspect that string splitting performance is hindering your application that much.

PTNL · Jan 12, 2010

ambientZ said:
splitting a string is a relatively simple operation - im not sure what exactly could be done to speed it up.

I also find it suspect that string splitting performance is hindering your application that much.

Agreed. There's enough grey area commenting that needs to be further detailed (if not already evaluated). String size, iterative logic, object instantiation/cleanup, split/regex statement, and string source all come to mind. Not to mention what metrics were used and how they were gathered that point toward string-splitting as the OP's primary culprit.

oplin · Jan 12, 2010

Post some code if you want help.

Unless you're doing a bunch of operations on each line, you'll usually easily max out your disk before you do cpu. I have a program that parses 200-1000meg files, that are csv or space delimited files, and it does validation on the 50 columns in there and i'm getting 60MB/s on those files, a simple string split wouldn't do this.

This is on a Quad core Xeon with 8 threads and the program is multi threaded, it does max out cpu, but that's only cause of all the validation tests. If you are maxing out cpu, check your code make sure it's efficient and if it is, then make it multi threaded, it's easy to do if it's per line you do a thread and order of the rows doesn't matter.

keenan · Jan 12, 2010

The only reason I can think it would be particularly slow is if you're using it on large strings, as it creates copies of everything into a new array. Still should be pretty fast though unless you're running out of memory.

It's probably marginally faster to implement your own parser by iterating over the string characters, and then only copying/analyzing the fields you do care about. But +1 to posting some code if you're having particular problems.

The fact that your profile indicates this part of the processing is taking up the majority of CPU time isn't necessarily indicative of any sort of issue, you could easily still be I/O bound. Your program will always have bits that use more CPU than others.

Also writing portable C++ is pretty easy if you don't need to get into GUI or system libraries.

blanck · Jan 13, 2010

CefiroZ said:
I only need parts of each line so I am splitting on spaces to extract the string tokens and then continue on.

Does this imply you need all parts of each line that are delimited by spaces, or only a subset of the parts? If you only need a small subset of each line, it may be more efficient to write a regex with capturing groups as opposed to creating extra String objects during the split that never get used.

Honestly, unless you are parsing out only a very small subset of the split substrings, I doubt this change would result in a significant speed up over manually splitting each String. It might be worth trying if you run out of ideas, though.

mikeblas · Jan 13, 2010

CefiroZ said:
I need the regex capability of split obviously.

Actually, that's not obvious. You've given nearly zero details of what you're doing; you've also used relative terms like "slow" without grounding them quantitatively. Maybe what you're calling "slow" really is as fast as it goes, and you do have to change languages if you want something faster. Or, maybe what you're calling "slow" really is slow, and you've done some bad algorithm work (why are regular expressions necessary?), made some programming mistakes, or have neglected a more efficient approach.

Regular expressions are powerful, but they're not free. Their expense is often a trap to people who don't realize they're not necessary. Perhaps if you show what you're specifically doing, with both sample data and code, then we can help you do some analysis and suggest alternatives. Until then, we're obviously just guessing.

CefiroZ · Jan 13, 2010

Thanks everyone for the responses so far. I'm currently reworking parts of this project so after I get those straightened out I can work on getting up some examples to see if anyone can help. Until then here's some more details of the problem. I would have posted these earlier but I had to gather some "clean" stuff that I can publicly post.

Below is some sample data that is coming out of tshark, which is interpreting a pcap file.

Code:

  1   0.000000    10.0.2.15 -> 192.168.11.1 DNS Standard query AAAA ncsu.edu
  2   0.000495    10.0.2.15 -> 192.168.11.1 DNS Standard query A ncsu.edu
  3   0.125069 192.168.11.1 -> 10.0.2.15    DNS Standard query response
  4   0.125251 192.168.11.1 -> 10.0.2.15    DNS Standard query response A 152.1.226.10
138   1.768626    10.0.2.15 -> 192.168.11.1 DNS Standard query AAAA www.lib.ncsu.edu
139   1.772099    10.0.2.15 -> 192.168.11.1 DNS Standard query AAAA ncsu.edu
146   1.795223 192.168.11.1 -> 10.0.2.15    DNS Standard query response
151   1.826449 192.168.11.1 -> 10.0.2.15    DNS Standard query response CNAME web.lib.ncsu.edu
287   1.472035    10.0.2.15 -> 192.168.11.1 DNS Standard query A www.google-analytics.com
290   1.491738 192.168.11.1 -> 10.0.2.15    DNS Standard query response CNAME www-google-analytics.l.google.com A 72.14.213.100 A 72.14.213.139 A 72.14.213.102 A 72.14.213.101 A 72.14.213.113 A 72.14.213.138
16189 145.244953 192.168.11.1 -> 10.0.2.15    DNS Standard query response A 128.103.241.91
16190 145.365353    10.0.2.15 -> 192.168.11.1 DNS Standard query AAAA harvardscience.harvard.edu
16191 145.457259 192.168.11.1 -> 10.0.2.15    DNS Standard query response CNAME haggis.harvard.edu
16192 145.457555    10.0.2.15 -> 192.168.11.1 DNS Standard query A harvardscience.harvard.edu
16193 145.511170 192.168.11.1 -> 10.0.2.15    DNS Standard query response CNAME haggis.harvard.edu A 128.103.60.12

The first and second columns are irrelevant and are thrown away. The middle two IP numbers need to be extracted depending on the type of entry it is. The "DNS Standard query" is also irrelevant. After that things get parsed differently whether it is a DNS request (an A or AAAA line) or a response and the program matches up responses to requests. As you can see from the above there are many different forms of entries and even similar entries such as CNAME can have variation, for example the last line of the example there is one A response given in the CNAME, whereas the CNAME entry for google-analytics.com gives several responses. You'll also see the spacing is inconsistent between the items so it can't simply be split on a space without creating inconsistent arrays from the splits since some would have a 'blank' array slots. I chose to use the regex to eliminate the need to worry about the spacing but it could easily be the result of the poor performance and not the split itself, though I tried to eliminate this as much as possible by compiling the expression at the start of the program and then reusing it for the splits.

As I stated earlier, after profiling this application I found the split to be slow. And by slow I meant taking up the majority of the execution time, somewhere around 65-80%. This was identified by putting the split into a separate method and NetBeans's profiler saying that function was taking up that much execution. The method overhead didn't seem to make much of a difference as I got similar numbers for the calling function before moving the split code into its own function. Unfortunately another language is not an option because using Java is a business decision from on high.

ambientZ · Jan 13, 2010

Are there tabs and spaces in there? spacing looks funny for just spaces.....

CefiroZ · Jan 13, 2010

ambientZ said:
Are there tabs and spaces in there? spacing looks funny for just spaces.....

Yeah tell me about it, but they are indeed spaces.

J-Will · Jan 13, 2010

If this is coming from a text file I would:
with starting character offset, starting at the first IP address, store the string text into an array, where each block in the array is a line of text. I.E. 10.0.2.15 -> 192.168.11.1 DNS Standard query AAAA ncsu.edu

From there I would run another loop to put the entries you want to look at into another array (A or AAAA line). This can be accomplished by applying a filter.

Once you have that, then you can pretty it up with replace string method to take out the
'DNS Standard query' line and replace it with ' ' with is effectively nothing.

Tawnos · Jan 13, 2010

I'm trying to interpret what you need...

CefiroZ said:

Code:

  1   0.000000    10.0.2.15 -> 192.168.11.1 DNS Standard query AAAA ncsu.edu
  2   0.000495    10.0.2.15 -> 192.168.11.1 DNS Standard query A ncsu.edu
  3   0.125069 192.168.11.1 -> 10.0.2.15    DNS Standard query response
  4   0.125251 192.168.11.1 -> 10.0.2.15    DNS Standard query response A 152.1.226.10
138   1.768626    10.0.2.15 -> 192.168.11.1 DNS Standard query AAAA www.lib.ncsu.edu
139   1.772099    10.0.2.15 -> 192.168.11.1 DNS Standard query AAAA ncsu.edu
146   1.795223 192.168.11.1 -> 10.0.2.15    DNS Standard query response
151   1.826449 192.168.11.1 -> 10.0.2.15    DNS Standard query response CNAME web.lib.ncsu.edu
287   1.472035    10.0.2.15 -> 192.168.11.1 DNS Standard query A www.google-analytics.com
290   1.491738 192.168.11.1 -> 10.0.2.15    DNS Standard query response CNAME www-google-analytics.l.google.com A 72.14.213.100 A 72.14.213.139 A 72.14.213.102 A 72.14.213.101 A 72.14.213.113 A 72.14.213.138
16189 145.244953 192.168.11.1 -> 10.0.2.15    DNS Standard query response A 128.103.241.91
16190 145.365353    10.0.2.15 -> 192.168.11.1 DNS Standard query AAAA harvardscience.harvard.edu
16191 145.457259 192.168.11.1 -> 10.0.2.15    DNS Standard query response CNAME haggis.harvard.edu
16192 145.457555    10.0.2.15 -> 192.168.11.1 DNS Standard query A harvardscience.harvard.edu
16193 145.511170 192.168.11.1 -> 10.0.2.15    DNS Standard query response CNAME haggis.harvard.edu A 128.103.60.12

1) The first and second columns are irrelevant and are thrown away.

2) The middle two IP numbers need to be extracted depending on the type of entry it is.

3) The "DNS Standard query" is also irrelevant.

4) After that things get parsed differently whether it is a DNS request (an A or AAAA line) or a response and the program matches up responses to requests.

You'll also see the spacing is inconsistent between the items so it can't simply be split on a space without creating inconsistent arrays from the splits since some would have a 'blank' array slots.

Based on the broken-up and numbered stuff, above, and presuming you have a line of input read into a char[] buffer... It's been a while since I've done java, so this might need a bit of tweaking...

Code:

int i = 0;

// max length of IP is 4 blocks, 3 numbers long each, plus 3 dots and trailing null
char[] ip1 = new char[16]; 
char[] ip2 = new char[16];

for (; buffer[i] == ' '; ++i); // past leading spaces
for (; buffer[i] != ' '; ++i); // past first number
for (; buffer[i] == ' '; ++i); // past spaces between first and second number
for (; buffer[i] != ' '; ++i); // past second number // EDIT: this was faulty: i += 8; // second number is always 8 characters: #.######
for (; buffer[i] == ' '; ++i); // past spaces between second number and first IP

// copy first IP into char buffer
for (int j = 0; buffer[i] != ' ' && j < 16; ++i, ++j)
{
   ip1[j] = buffer[i];
}

// always appears to be one space, arrow, one space, then next IP; else you'll have to use the same loop pattern
i += 4;

// copy second IP into char buffer
for (int j = 0; buffer[i] != ' ' && j < 16; ++i, ++j)
{
   ip2[j] = buffer[i];
}

... other code here following same pattern

Would the above work? It's about as fast as I can think of making this off the cuff.

mikeblas · Jan 13, 2010

You still haven't shown us the regex that you're using, but given what you've posted so far, it seems like it's the most likely problem. The code that Tawnos provides (once you get it completed and working) is going to be faster than a regex--even if the regex is precompiled.

CefiroZ · Jan 13, 2010

mikeblas said:
You still haven't shown us the regex that you're using, but given what you've posted so far, it seems like it's the most likely problem.

I actually did give it in the first post, though it may not have been clear. I said I was splitting on " +", which is the regex.

mikeblas said:
The code that Tawnos provides (once you get it completed and working) is going to be faster than a regex--even if the regex is precompiled.

I'll give something of that sort a try and report back on how it affected performance.

keenan · Jan 13, 2010

CefiroZ said:
As I stated earlier, after profiling this application I found the split to be slow. And by slow I meant taking up the majority of the execution time, somewhere around 65-80%.

That doesn't really mean much. The fact that it takes up 65-80% of execution time is really irrelevant without context. If all the program is doing is parsing this file and quitting, that seems quite reasonable to me. Does it even matter if it can easily keep up with disk I/O?

Instead of splitting on a regex, I would try parsing the whole line with a a regex and extracting the fields you need that way. Should be faster.

CefiroZ · Jan 15, 2010

CefiroZ said:
I'll give something of that sort a try and report back on how it affected performance.

I was able to implement the manual parsing method using substring combined with either indexOfs or simply adding to the current index and was able to get the execution for one test from ~14 sec down to ~2, which is right about the performance throughput I need. I'm rather surprised how much removing the split sped up the program. Thanks for pointing me in the right direction guys!

mikeblas · Jan 15, 2010

Great! I'm glad you've got it working better.

Regular expressions are powerful, but really aren't that great for large-scale work. We had a bakeoff project here a while ago that stressed some of the issues with the dynamic languages and their regular expression implementations, in particular.

Java's split too slow

CefiroZ

Limp Gawd

J-Will

[H]ard|Gawd

Tawnos

2[H]4U

ambientZ

[H]ard|Gawd

PTNL

Supreme [H]ardness

oplin

Gawd

keenan

2[H]4U

blanck

n00b

mikeblas

[H]ard|DCer of the Month - May 2006

CefiroZ

Limp Gawd

ambientZ

[H]ard|Gawd

CefiroZ

Limp Gawd

J-Will

[H]ard|Gawd

Tawnos

2[H]4U

mikeblas

[H]ard|DCer of the Month - May 2006

CefiroZ

Limp Gawd

keenan

2[H]4U

CefiroZ

Limp Gawd

mikeblas

[H]ard|DCer of the Month - May 2006