doumu4916 2013-11-05 05:13

I need to match all characters in a group, as long as they don't match a certain word

I'm not sure if this is a simple question, but I have been unable to find an answer to it so far. I am trying to write a regular expression that processes the XML of a .docx file and replaces all <w:tab /> tags with <w:ind /> tags, as the <w:tab /> tags don't seem to preserve tabs correctly when they are translated to HTML. I am working in PHP, and so far I have been unsuccessful at writing a regular expression that does this correctly.

The problem is, I can't just run a simple find-and-replace function here. I have to remove the <w:tab /> tag and inject the <w:ind /> tag within the nearest opening and closing <w:rPr></w:rPr> tags.

A sample XML string would look something like this:

    <w:p w14:paraId="2679030C" w14:textId="4E6FFA99" w:rsidR="00ED4314" w:rsidRPr="00254747" w:rsidRDefault="00ED4314" w:rsidP="00322270">
        <w:pPr>
            <w:pStyle w:val="NoSpacing" />
            <w:spacing w:line="480" w:lineRule="auto" />
            <w:jc w:val="both" />
            <w:rPr>
                <w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman" />
                <w:sz w:val="24" />
                <w:szCs w:val="24" />
            </w:rPr>
        </w:pPr>
        <w:r w:rsidRPr="00254747">
            <w:rPr>
                <w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman" />
                <w:sz w:val="24" />
                <w:szCs w:val="24" />
            </w:rPr>
            <w:tab />
            <w:t>SOME text</w:t>
        </w:r>
        <w:r w:rsidR="0003297C">
            <w:rPr>
                <w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman" />
                <w:sz w:val="24" />
                <w:szCs w:val="24" />
            </w:rPr>
            <w:t>SOME more text</w:t>
        </w:r>
        <w:r w:rsidRPr="00254747">
            <w:rPr>
                <w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman" />
                <w:sz w:val="24" />
                <w:szCs w:val="24" />
            </w:rPr>
            <w:t>EVEN more text</w:t>
        </w:r>
    </w:p>

So each instance of <w:tab /> would need to be removed, and then I would need to trace backwards to the previous <w:rPr> tag and inject a <w:ind /> tag inside of it.

Here's what I have so far:

    $content = preg_replace("/<w:rPr>(.*?)<\/w:rPr>(.*?)<w:tab\/>/", "<w:rPr><w:ind w:firstLine=\"720\"/>$1</w:rPr>$2", $content);

This sort of works, but I think the search is too broad. Even though I'm specifying that it should not be greedy, the matches it returns contain far more content than they should. Can anyone suggest a good way to refine this? Thanks in advance!

1 answer

  • dpka7974 2013-11-05 05:58

    I think you're confusing non-greediness with the regular expression "knowing" to stop before it runs into more tags, which it can't do. If you mean to disallow tags between </w:rPr> and <w:tab/>, then this should roughly work:

    /<w:rPr>(.*?)<\/w:rPr>([^<]*?)<w:tab\/>/
                           ^^^^
    

    This is known as a negated character class. It matches any character that isn't <, and therefore won't consume any other tags before finding a <w:tab/>.
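
A minimal sketch of the difference, assuming a compacted, made-up sample in which only the second run contains a tab (the variable names are illustrative, not from the question). It isolates the part of the pattern between </w:rPr> and <w:tab/> and compares what the lazy dot and the negated character class capture there:

```php
<?php
// Hypothetical sample: the first run has no tab, the second one does.
$xml = '<w:r><w:rPr><w:sz w:val="24"/></w:rPr><w:t>A</w:t></w:r>'
     . '<w:r><w:rPr><w:b/></w:rPr><w:tab/><w:t>B</w:t></w:r>';

// Lazy dot: the match starts at the FIRST </w:rPr> and grows until a tab
// is found, swallowing every tag in between.
preg_match('/<\/w:rPr>(.*?)<w:tab\/>/', $xml, $lazy);

// Negated character class: no "<" may be consumed, so the attempt at the
// first </w:rPr> fails and the engine moves on to the one right before the tab.
preg_match('/<\/w:rPr>([^<]*?)<w:tab\/>/', $xml, $negated);

var_export($lazy[1]);     // text captured between </w:rPr> and <w:tab/>
echo PHP_EOL;
var_export($negated[1]);
echo PHP_EOL;
```

Laziness only keeps a match short from wherever it started; it does not move the starting point closer to the tab. Excluding < is what forces the match to begin at the nearest </w:rPr>.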


    Edit. In response to your clarification, i.e. allowing all tags except <w:rPr> before finding a <w:tab/>, you'd need to use a negative lookahead assertion, because, as you correctly understood, negated character classes only exclude characters, not strings.

    /<w:rPr>(.*?)<\/w:rPr>((?:(?!<w:rPr>).)*?)<w:tab\/>/
                           ^^^^^^^^^^^^^^^^
    

    Ignore the (?:xyz) if that's confusing; it's merely a way to get parentheses that don't capture, and I need the parentheses for the quantifier, *. The important piece here is the (?!xyz), which is known as a negative lookahead assertion (and, incidentally, doesn't capture either). It succeeds if it looks ahead and does not find "xyz". So what we're doing above is this: (1) look ahead, and (2) if what follows is not <w:rPr>, then (3) match one character, ., and (4) repeat, until a <w:tab/> is found.
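
Here is a sketch of the lookahead pattern plugged into the preg_replace call from the question, run on a compacted, made-up sample. One caveat it tries to illustrate: the first group, (.*?), is still unrestricted, so when a run without a tab comes before the run with the tab, that group can stretch across a </w:rPr> and the <w:ind /> ends up in the earlier <w:rPr>. The $guarded variant is an assumed refinement, not part of the original answer. It applies the same lookahead idea to the first group and allows optional whitespace inside the tab tag. The s modifier is added on the assumption that the real XML may contain line breaks.

```php
<?php
// Hypothetical sample, compacted to one line: only the second run has a tab.
$xml = '<w:r><w:rPr><w:sz w:val="24"/></w:rPr><w:t>A</w:t></w:r>'
     . '<w:r><w:rPr><w:b/></w:rPr><w:tab/><w:t>B</w:t></w:r>';

// Pattern from the answer: only the gap in front of the tab is guarded.
$answer = '/<w:rPr>(.*?)<\/w:rPr>((?:(?!<w:rPr>).)*?)<w:tab\/>/s';

// Assumed refinement: group 1 may not run past its own </w:rPr> either,
// and "<w:tab />" with a space is accepted as well.
$guarded = '/<w:rPr>((?:(?!<\/w:rPr>).)*)<\/w:rPr>((?:(?!<w:rPr>).)*?)<w:tab\s*\/>/s';

// Same replacement as in the question: put the indent first inside <w:rPr>,
// keep everything else, and drop the tab.
$replacement = '<w:rPr><w:ind w:firstLine="720"/>$1</w:rPr>$2';

$fromAnswer  = preg_replace($answer, $replacement, $xml);
$fromGuarded = preg_replace($guarded, $replacement, $xml);

echo $fromAnswer, PHP_EOL;   // indent lands in the first run's <w:rPr>
echo $fromGuarded, PHP_EOL;  // indent lands in the run that held the tab
```

On real documents an XML parser is likely to be more robust than either pattern, but for a quick transformation the guarded version keeps both groups inside the run they belong to.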

    This answer was accepted by the asker as the best answer.