I found a freely available data dump of USPTO patent data in XML format. Part of the XML for most of the patents has the following structure:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v45-2014-04-03.dtd" [ ]>
<us-patent-grant lang="EN" dtd-version="v4.5 2014-04-03" file="US09226443-20160105.XML" status="PRODUCTION" id="us-patent-grant" country="US" date-produced="20151221" date-publ="20160105">
...
<claims>
...
<claim id="CLM-00015" num="00015">
<claim-text>15. The system of <claim-ref idref="CLM-00013">claim 13</claim-ref>, wherein ...</claim-text>
</claim>
</claims>
</us-patent-grant>
When I run PHP's simplexml_load_string function on the XML, the <claim-ref idref="CLM-00013">claim 13</claim-ref> part disappears, and I'm left with the following claim text:
15. The system of , wherein ...
I also tried calling simplexml_load_string with the LIBXML_NOCDATA option:
$xml = simplexml_load_string($xmlTxt, 'SimpleXMLElement', LIBXML_NOCDATA);
But it didn't change anything.
What do I need to do so that the text inside the claim-ref tags is retained as part of the text content of the claim-text element? Note that I don't need to keep the claim-ref tags themselves, just the text inside them.
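For reference, here is a minimal sketch that reproduces the behavior and shows one possible way to get the full text. It assumes a trimmed-down sample document modeled on the structure above (not the real USPTO file) and that PHP's DOM extension is available alongside SimpleXML. The variable name $claimText is only illustrative.

```php
<?php
// Hypothetical, trimmed-down sample modeled on the structure in the question.
$xmlTxt = <<<XML
<us-patent-grant>
<claims>
<claim id="CLM-00015" num="00015">
<claim-text>15. The system of <claim-ref idref="CLM-00013">claim 13</claim-ref>, wherein ...</claim-text>
</claim>
</claims>
</us-patent-grant>
XML;

$xml = simplexml_load_string($xmlTxt);

// Element names containing a hyphen need the {'...'} property syntax.
$claimText = $xml->claims->claim->{'claim-text'};

// Casting a SimpleXMLElement to string returns only the text directly
// inside the element, so the text of the child <claim-ref> is skipped.
echo (string) $claimText, PHP_EOL;

// Converting to a DOM node and reading textContent concatenates the text
// of the element and all of its descendants, without the tags.
echo dom_import_simplexml($claimText)->textContent, PHP_EOL;
```

With this sample, the first echo should print "15. The system of , wherein ..." and the second "15. The system of claim 13, wherein ...". Another option along the same lines would be strip_tags($claimText->asXML()), which serializes the element and then removes the markup.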