dqstti8945 2009-11-23 15:29

Replace all "\" characters that are *not* inside "<code>" tags

First things first: Neither this, this, this nor this answered my question. So I'll open a new one.

Please read

Okay okay. I know that regexes are not the way to parse general HTML. Please take note that the created documents are written using a limited, controlled HTML subset. And people writing the docs know what they're doing. They are all IT professionals!

Given the controlled syntax it is possible to parse the documents I have here using regexes.

I am not trying to download arbitrary documents from the web and parse them!

And if parsing does fail, the document is edited until it parses. The problem I am addressing here is more general than that (i.e. do not replace a pattern when it sits between two other patterns).

A little bit of background (you can skip this...)

In our office we are supposed to "pretty print" our documentation, which is why some people suggested putting it all into Word documents. Thankfully we are not quite there yet. And, if I get this done, we might not need to go there.

The current state (... and this)

The main part of the docs is stored in a TikiWiki database. I've created a daft PHP script which converts the documents from HTML (via LaTeX) to PDF. One of the must-have features of the selected wiki system was a WYSIWYG editor, which, as expected, leaves us with documents with a less than formal DOM.

Consequently, I am transliterating the document using "simple" regexes. It all works (mostly) fine so far, but I encountered one problem I haven't figured out on my own yet.

The problem

Some special characters need to be replaced by LaTeX markup. For example, the \ character should be replaced by $\backslash$ (unless someone knows another solution?).

Except while in a verbatim block!

I do replace <code> tags with verbatim sections. But if this code block contains backslashes (as is the case for Windows folder names), the script still replaces these backslashes.

I reckon I could solve this using negative LookBehinds and/or LookAheads. But my attempts did not work.

Granted, I would be better off with a real parser. In fact, it is something on my "in-brain-roadmap", but it is currently out of the scope. The script works well enough for our limited knowledge domain. Creating a parser would require me to start pretty much from scratch.

My attempt

Example Input

The Hello \ World document is located in:
<code>C:\documents\hello_world.txt</code>

Expected output

The Hello $\backslash$ World document is located in:
\begin{verbatim}C:\documents\hello_world.txt\end{verbatim}

This is the best I could come up with so far:

<?php
$patterns = array(
    "special_chars2" => array( '/(?<!<code[^>]*>.*)\\\\[^$](?!.*<\/code>)/U', '$\\backslash$'),
);

foreach( $patterns as $name => $p ){
    $tex_input = preg_replace( $p[0], $p[1], $tex_input );
}
?>

Note that this is only an excerpt, and the [^$] is another LaTeX requirement.

Another attempt which seemed to work:

<?php
$patterns = array(
    "special_chars2" => array( '/\\\\[^$](?!.*<\/code>)/U', '$\\backslash$'),
);

foreach( $patterns as $name => $p ){
    $tex_input = preg_replace( $p[0], $p[1], $tex_input );
}
?>

... in other words: leaving out the negative lookbehind.

But this looks more error-prone than with both lookbehind and lookahead.

A related question

As you may have noticed, the pattern is ungreedy (/.../U). So will this match only as little as possible inside a <code> block, considering the look-arounds?


  • dousi8931 2009-11-23 15:59

    If it were me, I would look for an HTML parser and use that.

    Another option is to split the string into the <code>.*?</code> chunks and everything else, apply the replacement only to the other parts, and then recombine them.

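    $x="The Hello \ World document is located in:
    <br>
    <code>C:\documents\hello_world.txt</code>";

    // Split into alternating chunks: even indexes hold plain text,
    // odd indexes hold the captured <code>...</code> blocks.
    // The s modifier lets . match newlines, so multi-line code blocks stay whole.
    $r=preg_split("/(<code>.*?<\/code>)/s", $x,-1,PREG_SPLIT_DELIM_CAPTURE);

    // Replace backslashes only in the plain-text chunks.
    for($i=0;$i<count($r);$i+=2)
        $r[$i]=str_replace("\\","$\\backslash$",$r[$i]);

    $x=implode('',$r);

    echo $x;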
    

    Here is the result, as rendered in a browser:

    The Hello $\backslash$ World document is located in: 
    C:\documents\hello_world.txt
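
The split-and-recombine idea above could also produce the LaTeX output the question asks for. The following is only a rough sketch under a few assumptions: the function name tex_escape_outside_code is made up for illustration, <code> elements are assumed never to be nested, and the [^$] requirement mentioned in the question is not handled.

```php
<?php
// Hypothetical helper; the name is an assumption, not part of the original script.
function tex_escape_outside_code(string $html): string
{
    // With one capture group and PREG_SPLIT_DELIM_CAPTURE, even indexes hold
    // the text around the code blocks and odd indexes hold the code contents.
    $parts = preg_split('~<code\b[^>]*>(.*?)</code>~s', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            // Plain text: escape every backslash for LaTeX.
            $parts[$i] = str_replace('\\', '$\\backslash$', $part);
        } else {
            // Code contents: keep backslashes and wrap in a verbatim environment.
            $parts[$i] = '\\begin{verbatim}' . $part . '\\end{verbatim}';
        }
    }

    return implode('', $parts);
}

$input = "The Hello \\ World document is located in:\n"
       . "<code>C:\\documents\\hello_world.txt</code>";

echo tex_escape_outside_code($input), "\n";
```

For the example input from the question, this should print the expected output given there. Because the chunks are processed separately, no look-arounds are needed.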
    

    Sorry if my approach is not suitable for you.
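
If a single pattern is preferred over splitting, one possible alternative uses the PCRE backtracking verbs (*SKIP)(*F) instead of look-arounds. This is a sketch, not a drop-in replacement: the function name escape_backslashes_single_pass is hypothetical, nested <code> elements are assumed not to occur, and the <code> tags themselves are left untouched for a later step to convert.

```php
<?php
// Hypothetical helper; the name is an assumption made for this example.
function escape_backslashes_single_pass(string $html): string
{
    // Left branch: match a whole <code> element, then (*SKIP)(*F) throws the
    // match away and resumes scanning after it.
    // Right branch: a literal backslash anywhere else.
    $pattern = '~<code\b[^>]*>.*?</code>(*SKIP)(*F)|\\\\~s';

    // A callback returns the replacement literally, so no replacement-string
    // escaping rules apply to the backslash.
    return preg_replace_callback($pattern, function (array $m): string {
        return '$\\backslash$';
    }, $html);
}

$sample = "The Hello \\ World document is located in:\n"
        . "<code>C:\\documents\\hello_world.txt</code>";

echo escape_backslashes_single_pass($sample), "\n";
```

The lazy .*? stops at the first closing tag, so each code block is skipped on its own and backslashes between two code blocks are still replaced.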

    Accepted by the asker as the best answer.