Splitting a large XML file on Linux 2.4

A client recently had a problem processing an XML file with his PHP script. “File too large” was the error, and the data file was over 2 gigabytes in size.

It turns out you can recompile PHP to deal with large files (see the Requirements section here). But this blog entry made recompiling sound problematic: you get access to the large file, but certain file functions break due to integer overflows. So I wanted to avoid that option.

The Linux command-line tools seemed able to deal with it fine: less, head, and tail all worked on it. So I decided to try breaking the XML file into parts. The split command worked, but it indiscriminately cut right through the middle of a data record. I considered csplit, but as the file contained over 700,000 data records, I didn’t want to deal with that many individual XML files.

I decided to write a Perl script to split the file into blocks of 100,000 records each. It didn’t take long to put together, and Perl’s regular expression matching made handling the records easy on the small test data. For some reason I thought Perl would be OK with the large data file, but when I went to run the script, it too choked on it. I would have had to recompile Perl to get around that, and as the client’s box was using the Red Hat package for Perl, I didn’t want to mess with it.

Then I had an idea. Since the Linux command-line tools were handling the file OK, I wondered if I could trick my script by feeding it one line at a time. Instead of opening the file and looping over it inside the Perl script, I read from standard input with a loop like this:

while ( my $line = <STDIN> ) {
  # do stuff with the current line
}


Then I called the script like this:

cat bigfile.xml | ./split.pl

And it worked!
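
For reference, the core of such a script might look something like this. It is only a rough sketch: the </record> pattern stands in for whatever the real closing tag of a data record is, and the chunk_NNN.xml output names are made up.

#!/usr/bin/perl
use strict;
use warnings;

my $chunk_size = 100_000;         # records per output file
my $end_tag    = qr{</record>};   # assumed closing tag of one data record
my $chunk_num  = 1;
my $rec_count  = 0;

open( my $out, '>', sprintf( 'chunk_%03d.xml', $chunk_num ) )
    or die "Can't open output file: $!";

while ( my $line = <STDIN> ) {
    print {$out} $line;

    # Count a record every time its closing tag goes by.
    if ( $line =~ $end_tag ) {
        $rec_count++;

        # After 100,000 records, close this chunk and start the next.
        if ( $rec_count >= $chunk_size ) {
            close $out;
            $chunk_num++;
            $rec_count = 0;
            open( $out, '>', sprintf( 'chunk_%03d.xml', $chunk_num ) )
                or die "Can't open output file: $!";
        }
    }
}

close $out;

Because the script only reads from standard input, Perl never has to open or seek within the 2 GB file itself; cat takes care of that. The chunks it writes aren't complete XML documents on their own, though: the XML declaration and the opening and closing root element lines still have to be dealt with separately.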