doumu4916 2013-11-05 05:13

I need to match all characters in a group, as long as they don't match a certain word

I'm not sure if this is a simple question, but I have been unable to find an answer to it so far. I am trying to write a regular expression that pulls apart a .docx file and replaces all <w:tab /> tags with <w:ind /> tags, as the <w:tab /> tags don't seem to preserve tabs correctly when they are translated to HTML. I am working in PHP, and so far I have been unsuccessful at writing a regular expression that does what I need correctly.

The problem is, I can't just run a simple find-and-replace function here. I have to remove the <w:tab /> tag and inject the <w:ind /> tag within the nearest opening and closing <w:rPr></w:rPr> tags.

A sample XML string would look something like this:

    <w:p w14:paraId="2679030C" w14:textId="4E6FFA99" w:rsidR="00ED4314" w:rsidRPr="00254747" w:rsidRDefault="00ED4314" w:rsidP="00322270">
        <w:pPr>
            <w:pStyle w:val="NoSpacing" />
            <w:spacing w:line="480" w:lineRule="auto" />
            <w:jc w:val="both" />
            <w:rPr>
                <w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman" />
                <w:sz w:val="24" />
                <w:szCs w:val="24" />
            </w:rPr>
        </w:pPr>
        <w:r w:rsidRPr="00254747">
            <w:rPr>
                <w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman" />
                <w:sz w:val="24" />
                <w:szCs w:val="24" />
            </w:rPr>
            <w:tab />
            <w:t>SOME text</w:t>
        </w:r>
        <w:r w:rsidR="0003297C">
            <w:rPr>
                <w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman" />
                <w:sz w:val="24" />
                <w:szCs w:val="24" />
            </w:rPr>
            <w:t>SOME more text</w:t>
        </w:r>
        <w:r w:rsidRPr="00254747">
            <w:rPr>
                <w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman" />
                <w:sz w:val="24" />
                <w:szCs w:val="24" />
            </w:rPr>
            <w:t>EVEN more text</w:t>
        </w:r>
    </w:p>

So each instance of <w:tab /> would need to be removed, and then I would need to trace backwards to the previous <w:rPr> tag and inject a <w:ind /> tag inside of it.

Here's what I have so far:

    $content = preg_replace("/<w:rPr>(.*?)<\/w:rPr>(.*?)<w:tab\/>/", "<w:rPr><w:ind w:firstLine=\"720\"/>$1</w:rPr>$2", $content);

This sort of works, but I think the search is too broad. Even though I'm specifying that it should not be greedy, the matches it returns contain far more content than they should. Can anyone suggest a good way to refine this? Thanks in advance!

1 answer

  • dpka7974 2013-11-05 05:58

    I think you're confusing non-greediness with the regular expression somehow "knowing" to stop before it runs into more tags, which it can't. If you mean to disallow any tags between </w:rPr> and <w:tab/>, then this should roughly work:

    /<w:rPr>(.*?)<\/w:rPr>([^<]*?)<w:tab\/>/
                           ^^^^
    

    The marked part is known as a negated character class. It matches any character that isn't <, so it can't consume any other tags before finding a <w:tab/>.
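
To make the difference concrete, here is a minimal sketch that isolates the tail of the pattern (everything from </w:rPr> onward). The compact one-line markup is made up for the example, and it uses <w:tab/> without a space so that it matches the patterns as written above; the sample in the question has <w:tab /> and line breaks, which these patterns would not match as they stand.

```php
<?php
// Two adjacent runs on a single line; only the second one contains <w:tab/>.
// This markup is invented for the example, not taken from a real .docx.
$xml = '<w:r><w:rPr><w:b/></w:rPr><w:t>A</w:t></w:r>'
     . '<w:r><w:rPr><w:i/></w:rPr><w:tab/><w:t>B</w:t></w:r>';

// Lazy dot: the match starts at the FIRST </w:rPr> and grows until a tab shows up.
// "Lazy" only means "shortest extension from this start", not "nearest start".
preg_match('~</w:rPr>(.*?)<w:tab/>~', $xml, $lazy);

// Negated class: nothing containing "<" may sit between </w:rPr> and the tab,
// so the first </w:rPr> is rejected and only the second one can match.
preg_match('~</w:rPr>([^<]*?)<w:tab/>~', $xml, $strict);

echo $lazy[1], PHP_EOL;        // <w:t>A</w:t></w:r><w:r><w:rPr><w:i/></w:rPr>
var_dump($strict[1] === '');   // bool(true)
```

The lazy version swallows the whole first run plus the second run's properties, which is the "way more content" effect described in the question.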


    Edit. In response to your clarification, i.e. allowing all tags except <w:rPr> before finding a <w:tab/>, you'd need to use a negative lookahead assertion, because, as you correctly understood, negated character classes only exclude characters, not strings.

    /<w:rPr>(.*?)<\/w:rPr>((?:(?!<w:rPr>).)*?)<w:tab\/>/
                           ^^^^^^^^^^^^^^^^
    

    Ignore the (?:xyz) if that's confusing. It is merely a way to get parentheses that don't capture, and I need the parentheses so that the quantifier, *, has something to apply to. The important piece here is the (?!xyz), which is known as a negative lookahead assertion (and which, incidentally, doesn't capture either). It succeeds only if the text right after the current position is not "xyz", and it consumes nothing. So what we're doing above is this: (1) look ahead, and (2) if what follows is not <w:rPr>, then (3) match one character, ., and (4) repeat until a <w:tab/> is found.
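
One thing worth checking before relying on this: the first group, (.*?), is still unrestricted. If an earlier <w:rPr> (such as the one inside <w:pPr>) has no tab after it, group 1 can stretch across the following </w:rPr> and <w:rPr> tags until the rest of the pattern fits, and the <w:ind /> then lands in that earlier block. The sketch below applies the same lookahead idea to group 1 as well. That change, the \s* before the slash (to accept <w:tab />), the s modifier (so that . also matches newlines), and the function name are all my assumptions for illustration, not part of the pattern above.

```php
<?php
// Hypothetical helper; the name and the pattern tweaks are assumptions for illustration.
function injectIndentForTabs(string $xml): string
{
    // Group 1: the run properties; may not cross any <w:rPr> or </w:rPr>.
    // Group 2: whatever sits between </w:rPr> and the tab; may not cross a new <w:rPr>.
    // \s* accepts both <w:tab/> and <w:tab />; the s flag lets . match newlines.
    $pattern = '~<w:rPr>((?:(?!</?w:rPr>).)*)</w:rPr>((?:(?!<w:rPr>).)*?)<w:tab\s*/>~s';
    $replacement = '<w:rPr><w:ind w:firstLine="720"/>$1</w:rPr>$2';

    $result = preg_replace($pattern, $replacement, $xml);
    if ($result === null) {
        // preg_replace returns null on a PCRE failure, e.g. a backtrack limit on large input.
        throw new RuntimeException('PCRE error code ' . preg_last_error());
    }
    return $result;
}

// Paragraph properties followed by one run that contains a tab (made-up, compact markup).
$xml = '<w:pPr><w:rPr><w:sz w:val="24"/></w:rPr></w:pPr>'
     . '<w:r><w:rPr><w:sz w:val="24"/></w:rPr><w:tab /><w:t>SOME text</w:t></w:r>';

echo injectIndentForTabs($xml), PHP_EOL;
```

With this input the <w:ind /> should end up in the run's own <w:rPr>, and the <w:rPr> inside <w:pPr> is left alone. Note that a second tab inside the same run would not be touched, because its <w:rPr> has already been consumed by the first match.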

    This answer was accepted by the asker as the best answer.