How to Work With a Very Large XML file in PHP

It doesn’t matter what kind of business you are running: Fair Go Casino, a toy store, a bookstore, or an information website.  At some point you are going to be dealing with data, and you will need to process, add to, delete from, or modify that data.

In the past, text-based data transfer took two major forms.  The first was the fixed-width file, where each item in a record occupies a specific range of columns.  The second was comma-separated values (CSV), a name that is used even when the separator between items in a record is not a comma but a different character.

For the most part, these two methods of text based data transfer work when the data that you are working with is stored in a single line.  Names, addresses, phone numbers, status, etc. are all examples of these types of data fields.
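To make the two formats concrete, here is a minimal sketch that parses the same record both ways.  The column widths, the sample values, and the pipe separator are assumptions made up for this illustration.

```php
<?php
// Hypothetical layout for illustration: name (10 columns), city (12 columns), phone (8 columns).
$fixedWidthLine = sprintf('%-10s%-12s%-8s', 'Jane Doe', 'Springfield', '555-0100');

// Fixed width: each field is cut out by its column position and the padding is trimmed.
$fixedRecord = array(
    'name'  => rtrim(substr($fixedWidthLine, 0, 10)),
    'city'  => rtrim(substr($fixedWidthLine, 10, 12)),
    'phone' => rtrim(substr($fixedWidthLine, 22, 8)),
);

// Delimited: str_getcsv splits on the separator, which does not have to be a comma.
$commaRecord = str_getcsv('Jane Doe,Springfield,555-0100', ',', '"', '\\');
$pipeRecord  = str_getcsv('Jane Doe|Springfield|555-0100', '|', '"', '\\');

print_r($fixedRecord);
print_r($commaRecord);
print_r($pipeRecord);
```

Both approaches assume one record per line, which is exactly the limit described above.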

But when a data field does not match this standard (for example, multiple lines of text as a single data item), a different form of data transfer is needed.

Two forms are used for this type of data transfer.  The first is XML.  XML looks a lot like HTML, except that instead of HTML tags, you are dealing with data field tags.  The second form, which is even more common today than XML, is JSON.  But for this article, we are going to focus on XML.
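As a small illustration of a multi-line data item, the sketch below stores two lines of text in one XML field and then shows the same record as JSON.  The record and its field names are made up for this example.

```php
<?php
// Hypothetical record with a multi-line text field, for illustration only.
$xml = <<<XML
<page>
  <title>Page One</title>
  <free_text>First line of text.
Second line of text.</free_text>
</page>
XML;

$page = simplexml_load_string($xml);

// The line break is part of the field value, not a record separator.
$text  = (string) $page->free_text;
$lines = explode("\n", $text);
echo count($lines) . " lines stored in one field\n";

// The same record expressed as JSON.
$json = json_encode(array('title' => (string) $page->title, 'free_text' => $text));
echo $json . "\n";
```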

What XML processing libraries are included in standard PHP and what is the limit of these libraries?

The following libraries are part of standard PHP.  I am only going to talk about the ones that deal with XML files that are not connected with a database or other data source:

  1. libxml (low-level library that most of the other libraries depend on; do not use it directly)
  2. SimpleXML (provides a very simple and easily usable toolset that converts XML to an object that can be processed with normal property selectors and array iterators)
  3. XML Parser (a lower-level, event-based library where you register a handler function for each kind of XML item)
  4. XMLReader (a pull parser that moves forward through the document one node at a time)

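The main problem with SimpleXML, the easiest of these to use, is that it is designed around the model that you read the entire XML file into memory and then manipulate it.  That is great when you are working with a single HTML page or something else of a similar length.  XML Parser and XMLReader can work through a file piece by piece, but they are lower level and leave the work of assembling each record to you.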
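For reference, the read-everything-first model looks roughly like this with SimpleXML.  This is a minimal sketch that uses a small in-memory string as a stand-in for a real data file.

```php
<?php
// Small in-memory stand-in for a data file; a real file would go through simplexml_load_file().
$document = '<pages>'
    . '<page><title>Page One</title></page>'
    . '<page><title>Page Two</title></page>'
    . '</pages>';

// The whole document is parsed into memory before the first record can be read.
$pages = simplexml_load_string($document);

$titles = array();
foreach ($pages->page as $page) {
    $titles[] = (string) $page->title;
}
print_r($titles);
```

With two records this costs nothing.  With tens of thousands of records, the entire tree has to fit in memory at once.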

But I was dealing with a file that had 6,000 records on the low end and 35,000 records on the high end.  On top of that, my 6,000-record file was 6 megabytes in size, and there was no guarantee that any single read from the file would bring in a whole record.  So I needed another option … prewk/xml-string-streamer

What is prewk/xml-string-streamer?

xml-string-streamer is a PHP library for parsing XML that is specifically designed to read and process very large XML files, where the whole file cannot be read into memory at once and where there is no guarantee that even a single data record arrives in one read.
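To show why a record may not arrive in one read, here is a minimal sketch of the general idea: read small chunks, keep them in a buffer, and hand out a record only once its closing tag has shown up.  This is an illustration only and not the library's actual implementation.  The function name readRecords is hypothetical, and the sketch assumes records are not nested inside each other and carry no attributes.

```php
<?php
/**
 * Illustration only: read a stream in small chunks and yield each complete
 * <tag>...</tag> record as a string.
 */
function readRecords($handle, $chunkSize, $tag)
{
    $open   = '<' . $tag . '>';
    $close  = '</' . $tag . '>';
    $buffer = '';

    while (($chunk = fread($handle, $chunkSize)) !== false && $chunk !== '') {
        $buffer .= $chunk;

        // Emit every record that is now complete; the unfinished tail stays in the buffer.
        while (($end = strpos($buffer, $close)) !== false) {
            $end  += strlen($close);
            $start = strpos($buffer, $open);
            if ($start !== false && $start < $end) {
                yield substr($buffer, $start, $end - $start);
            }
            $buffer = substr($buffer, $end);
        }
    }
}

// In-memory stand-in for a large file.
$handle = fopen('php://memory', 'w+');
fwrite($handle, '<pages><page><title>Page One</title></page><page><title>Page Two</title></page></pages>');
rewind($handle);

// An 8-byte chunk is far smaller than one record, so every record spans several reads.
$records = array();
foreach (readRecords($handle, 8, 'page') as $record) {
    $records[] = $record;
}
fclose($handle);
print_r($records);
```

Only the buffer and the current record are held in memory, no matter how large the file is.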

The library hands you each record as a string of XML, which you can then pass to the standard SimpleXML library, so you get all of the benefits of that library.  That means that once a record is read into memory and parsed, you can access the individual XML items as easily as you can access elements in an array.
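Here is a minimal sketch of that kind of access, using a hand-written record string in place of one that came from the streamer.

```php
<?php
// A single record as a plain string of XML (written by hand for this example).
$node = '<page><title>Page One</title><free_text>The quick brown fox.</free_text></page>';

$page = simplexml_load_string($node);

// Child elements are read as object properties and cast to strings.
$title = (string) $page->title;
echo $title . "\n";

// children() can be iterated like an array; getName() returns the element name.
$fields = array();
foreach ($page->children() as $child) {
    $fields[$child->getName()] = (string) $child;
}
print_r($fields);
```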

There was a learning curve in getting my first application written, but once I figured it out, I found it to be a work of beauty … simple in its design and easy to use.

As the saying goes, “A picture is worth a thousand words,” or in this case, “A simple example program is worth a thousand words.”

Example source code using the xml-string-streamer library.

File: EditArticle.php

Purpose: Process a list of article records, each with a “title” field and a “free_text” field.

<source lang="php">
<?php
require_once(__DIR__.'/XmlStringStreamer.php');
require_once(__DIR__.'/XmlStringStreamer/StreamInterface.php');
require_once(__DIR__.'/XmlStringStreamer/ParserInterface.php');
require_once(__DIR__.'/XmlStringStreamer/Parser/StringWalker.php');
require_once(__DIR__.'/XmlStringStreamer/Parser/UniqueNode.php');
require_once(__DIR__.'/XmlStringStreamer/Stream/File.php');
require_once(__DIR__.'/XmlStringStreamer/Stream/Stdin.php');
use Prewk\XmlStringStreamer;
use Prewk\XmlStringStreamer\Stream\File;
use Prewk\XmlStringStreamer\Parser\StringWalker;
use Prewk\XmlStringStreamer\Stream;
use Prewk\XmlStringStreamer\Parser;
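// Alternative parser: UniqueNode captures every <page> element by its tag name.
// It is left commented out because this example uses StringWalker below.
//$parser = new Parser\UniqueNode(array("uniqueNode" => "page"));
// Prepare our stream to be read with a 16 KB buffer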
$file = "gigantic.xml";
$totalSize = filesize($file);
// Construct the file stream
// 16384 is the default read size.
// stdin is only 1024, do not know why.
$stream = new File($file, 16384); 
//$stream = new File($file, 1024); 
// Construct the parser
$parser = new StringWalker;
// Construct the streamer
$streamer = new XmlStringStreamer($parser, $stream);
// Start parsing
$recordCount = 0;
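while ($node = $streamer->getNode()) {
    $recordCount++;
    // Escape bare & characters in the data fields, leaving existing entities such as &amp; alone.
    $simpleXmlNode = simplexml_load_string(preg_replace('/&(?!#?[A-Za-z0-9]+;)/', '&amp;', $node));
    if ($simpleXmlNode === false) {
        echo "Record " . $recordCount . ": could not be parsed\n";
        continue;
    }
    // XML element names are case-sensitive, and the data file uses <title>.
    echo "Record " . $recordCount . ":" . (string)$simpleXmlNode->title . "\n";
}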
echo "\n";
?>
</source>

File: gigantic.xml

Purpose: our data file

<source lang="xml">
<pages>
<page>
  <title>Page One </title>
  <free_text>
The quick brown fox jumps over the lazy dog.
  </free_text>
</page>
<page>
  <title>Page Two </title>
  <free_text>
Romeo, oh Romeo, where far out thou Romeo.
  </free_text>
</page>
<page>
  <title>Page Three </title>
  <free_text>
This file contains example text that contains an & character
which is normally an invalid character and would cause 
unexpected processing of our text.
  </free_text>
</page>
<page>
  <title>Page Four </title>
  <free_text>
"We the People of the United States, in Order to form a more perfect Union, establish Justice, insure domestic Tranquility, provide for the common defense, promote the general Welfare, and secure the Blessings of Liberty to ourselves and our Posterity, do ordain and establish this Constitution for the United States of America."
Yes, we are testing text with quote characters in it.
  </free_text>
</page>
<page>
  <title>Page Five </title>
  <free_text>
This is our last page.
  </free_text>
</page>
</pages>
</source>
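Page Three in the data file contains a bare & character, which SimpleXML rejects unless it is escaped first.  The sketch below shows one possible way to escape only the & characters that do not already begin an entity.  The helper name escapeBareAmpersands and the sample text are made up for this illustration, and the pattern is a simple heuristic: text such as "R&D;" would be mistaken for an entity and left alone.

```php
<?php
// Hypothetical helper: escapes & unless it already starts an entity such as &amp; or &#38;.
function escapeBareAmpersands($xml)
{
    return preg_replace('/&(?!#?[A-Za-z0-9]+;)/', '&amp;', $xml);
}

$raw  = '<free_text>Fish & chips, already escaped: &amp; and &#38;</free_text>';
$safe = escapeBareAmpersands($raw);
echo $safe . "\n";

// After escaping, SimpleXML parses the record and decodes the entities again.
$node = simplexml_load_string($safe);
echo (string) $node . "\n";
```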

Summary

The above contains the source code and a simple example file.  The code is simple and works exactly how you would expect it to work. 

The library itself can be installed using Composer, but I am not fully versed in Composer, so I installed it manually instead.  I hope you learned something reading this article; I did writing it.  Now it is time to try it out with your own data files, but don’t forget to check out the actual documentation, because there are options and other “goodies” that I did not touch on in this article.  I simply wanted to focus on getting a simple program written from start to finish.
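For readers who do use Composer, here is a minimal sketch of what the setup might look like.  It assumes the package was installed with "composer require prewk/xml-string-streamer" in the same directory as the script, so that Composer's vendor/autoload.php takes the place of the manual require_once lines.

```php
<?php
// Assumption: "composer require prewk/xml-string-streamer" was run in this directory,
// which creates vendor/autoload.php.
$autoload = __DIR__ . '/vendor/autoload.php';
if (is_file($autoload)) {
    require_once $autoload;
}

// True only when the library's main class can be loaded.
$available = class_exists('Prewk\\XmlStringStreamer');
echo $available ? "xml-string-streamer is ready\n" : "xml-string-streamer is not installed\n";
```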