I'm not sure if this is a simple question, but I have been unable to find an answer to it so far. I am trying to write a regular expression that goes through the XML of a .docx file and replaces all <w:tab /> tags with <w:ind /> tags, as the <w:tab /> tags don't seem to preserve tabs correctly when they are translated to HTML. I am working in PHP, and so far I have been unsuccessful at writing a regular expression that does this correctly.
The problem is that I can't just run a simple find-and-replace here. I have to remove the <w:tab /> tag and inject the <w:ind /> tag inside the nearest preceding <w:rPr></w:rPr> pair.
A sample XML string would look something like this:
<w:p w14:paraId="2679030C" w14:textId="4E6FFA99" w:rsidR="00ED4314" w:rsidRPr="00254747" w:rsidRDefault="00ED4314" w:rsidP="00322270">
<w:pPr>
<w:pStyle w:val="NoSpacing" />
<w:spacing w:line="480" w:lineRule="auto" />
<w:jc w:val="both" />
<w:rPr>
<w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman" />
<w:sz w:val="24" />
<w:szCs w:val="24" />
</w:rPr>
</w:pPr>
<w:r w:rsidRPr="00254747">
<w:rPr>
<w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman" />
<w:sz w:val="24" />
<w:szCs w:val="24" />
</w:rPr>
<w:tab />
<w:t>SOME text</w:t>
</w:r>
<w:r w:rsidR="0003297C">
<w:rPr>
<w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman" />
<w:sz w:val="24" />
<w:szCs w:val="24" />
</w:rPr>
<w:t>SOME more text</w:t>
</w:r>
<w:r w:rsidRPr="00254747">
<w:rPr>
<w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman" />
<w:sz w:val="24" />
<w:szCs w:val="24" />
</w:rPr>
<w:t>EVEN more text</w:t>
</w:r>
</w:p>
So each instance of <w:tab /> would need to be removed, and then I would need to trace backwards to the previous <w:rPr> tag and inject a <w:ind /> tag inside of it.
Here's what I have so far:
$content = preg_replace("/<w:rPr>(.*?)<\/w:rPr>(.*?)<w:tab\/>/", "<w:rPr><w:ind w:firstLine=\"720\"/>$1</w:rPr>$2", $content);
This sort of works, but I think the match is too broad. Even though I'm using non-greedy quantifiers, the matches it returns contain far more content than they should. Can anyone suggest a better way to refine this? Thanks in advance!
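For reference, here is a minimal sketch of one way the match could be narrowed. A lazy quantifier only keeps the match short from the point where it starts. The engine still begins at the leftmost <w:rPr> and keeps expanding until a <w:tab/> turns up, so a match can start in the paragraph-level <w:rPr> or in an earlier run. The sketch stops the first group from crossing a </w:rPr> and allows only whitespace between </w:rPr> and the tab. It assumes the tab always directly follows the run's </w:rPr>, as in the sample above. The function name replaceTabsWithIndent is hypothetical, and the 720 value is carried over from my attempt.

```php
<?php
// Hypothetical helper: moves a leading <w:tab/> into the run's own <w:rPr> as <w:ind>.
// Assumption: the tab directly follows </w:rPr>, with at most whitespace in between.
function replaceTabsWithIndent(string $content): string
{
    // (?:(?!</w:rPr>).)* consumes one character at a time and refuses to step
    // over a closing </w:rPr>, so group 1 is the body of exactly one <w:rPr>.
    // (\s*) keeps any whitespace between </w:rPr> and the tab.
    // <w:tab\s*/> accepts both <w:tab/> and <w:tab />.
    // The s modifier lets . match newlines in pretty-printed XML.
    $pattern = '~<w:rPr>((?:(?!</w:rPr>).)*)</w:rPr>(\s*)<w:tab\s*/>~s';
    $replacement = '<w:rPr><w:ind w:firstLine="720"/>$1</w:rPr>$2';

    // preg_replace returns null on a PCRE error; fall back to the input then.
    return preg_replace($pattern, $replacement, $content) ?? $content;
}

// Shortened version of the sample: a paragraph-level <w:rPr> followed by one run with a tab.
$sample = '<w:pPr><w:rPr><w:sz w:val="24"/></w:rPr></w:pPr>'
        . '<w:r><w:rPr><w:sz w:val="24"/></w:rPr><w:tab/><w:t>SOME text</w:t></w:r>';

echo replaceTabsWithIndent($sample), PHP_EOL;
```

With this pattern the paragraph-level <w:rPr> is left alone, because what follows its </w:rPr> is </w:pPr> rather than a tab. Runs that have no <w:rPr>, or that place the tab after some text, are not covered by this sketch and would probably be easier to handle with an XML parser such as DOMDocument.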