Anonymous user 2014-02-13 18:43
Extracting XML nodes as plain text with a regular expression

I have a LARGE XML file. While troubleshooting some things, I would like to extract specific nodes from it. I don't want a SimpleXML object; I want to write a new file containing the raw string of the nodes I want (I'm tagging this bash/sed/php).

<?xml version="1.0" encoding="UTF-8"?>
<definition>
    <metadata></metadata>
    <nodeToRegex>
        <nodeImightwant>
            <subnode>
                <subsubnode1></subsubnode1>
                <subsubnodeToCheck>stringCheck</subsubnodeToCheck>
                <subsubnode2></subsubnode2>
            </subnode>
        </nodeImightwant>
        <nodeImightwant></nodeImightwant>
        <nodeImightwant></nodeImightwant>
    </nodeToRegex>
</definition>

So from this XML file, I want all lines from every node except the nodeToRegex. From nodeToRegex, I only want the nodeImightwant if the stringCheck string equals "aValidString". Can this be done via regex or should I just copy and paste the stuff out of the file? (my regex skills are subpar)

1 answer

  • dongyu3967 2014-02-13 18:56
    Don't parse XML with regexes. There is no reason you can't repackage/rearrange the data using SimpleXML, but trying to do it with a regex is a recipe for lots of headaches and, ultimately, broken code.

    See this classic example for why parsing XML/HTML/XHTML with regexes is the road to madness.
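    As a rough sketch of the parser-based route, the filtering described in the question could be done with PHP's DOM extension and an XPath query, then written back out as a raw string. The function name `filterNodes` is made up for illustration, and the sketch assumes `$wanted` contains no double quotes and that the file fits in memory:

```php
<?php
// Hypothetical helper (not from the original post): drop every <nodeImightwant>
// under <nodeToRegex> that has no <subsubnodeToCheck> equal to $wanted,
// and return the remaining document as an XML string.
function filterNodes(string $xml, string $wanted): string
{
    $doc = new DOMDocument();
    $doc->loadXML($xml);

    $xpath = new DOMXPath($doc);
    // Assumption: $wanted contains no double quotes, so it can be embedded as-is.
    $query = sprintf(
        '//nodeToRegex/nodeImightwant[not(.//subsubnodeToCheck[. = "%s"])]',
        $wanted
    );

    // Copy the matches into an array first, then detach each one from its parent.
    foreach (iterator_to_array($xpath->query($query)) as $node) {
        $node->parentNode->removeChild($node);
    }

    // Everything outside <nodeToRegex> is left untouched.
    return $doc->saveXML();
}
```

    The result is a plain string, so it can be passed to `file_put_contents()` to create the new file. For a file too large to load at once, a streaming reader such as `XMLReader` would probably be a better fit than this in-memory approach.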

    If you insist on using a regex, just replace the nodes you don't want, like this:

    $myxml = preg_replace('~<nodeToRegex>.*?</nodeToRegex>~s', '', $myxml);
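    One detail worth checking if you go the regex route: in PCRE, `.` does not match newlines by default, so a node that spans several lines is only matched when the pattern carries the `s` (PCRE_DOTALL) modifier. A small sketch of the difference, using a made-up sample string:

```php
<?php
// Made-up multi-line sample, loosely modelled on the XML in the question.
$myxml = "<definition>\n  <metadata></metadata>\n  <nodeToRegex>\n"
       . "    <nodeImightwant></nodeImightwant>\n  </nodeToRegex>\n</definition>";

// Without a modifier, "." stops at newlines, so the multi-line node is left untouched.
$unchanged = preg_replace('~<nodeToRegex>.*?</nodeToRegex>~', '', $myxml);

// With the "s" modifier, "." also matches newlines and the whole node is removed.
$stripped = preg_replace('~<nodeToRegex>.*?</nodeToRegex>~s', '', $myxml);

var_dump($unchanged === $myxml);                       // bool(true)
var_dump(strpos($stripped, 'nodeToRegex') === false);  // bool(true)
```

    Note that this removes the whole `<nodeToRegex>` block; it does not keep the `nodeImightwant` entries that pass the string check.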
    
