dqstti8945 2009-11-23 15:29


Replace all "\" characters which are *not* inside "<code>" tags

First things first: Neither this, this, this nor this answered my question. So I'll open a new one.

Please read

Okay okay. I know that regexes are not the way to parse general HTML. Please take note that the created documents are written using a limited, controlled HTML subset. And people writing the docs know what they're doing. They are all IT professionals!

Given the controlled syntax it is possible to parse the documents I have here using regexes.

I am not trying to download arbitrary documents from the web and parse them!

And if parsing does fail, the document is edited until it parses. The problem I am addressing here is more general than that: how to avoid replacing a pattern when it sits between two other patterns.

A little bit of background (you can skip this...)

In our office we are supposed to "pretty print" our documentation, which is why some people came up with the idea of putting it all into Word documents. Thankfully, we're not quite there yet. And if I get this done, we might not need to.

The current state (... and this)

The main part of the docs is stored in a TikiWiki database. I've created a daft PHP script which converts the documents from HTML (via LaTeX) to PDF. One of the must-have features of the selected wiki system was a WYSIWYG editor, which, as expected, leaves us with documents that have a less than formal DOM.

Consequently, I am transliterating the document using "simple" regexes. It all works (mostly) fine so far, but I encountered one problem I haven't figured out on my own yet.

The problem

Some special characters need to be replaced by LaTeX markup. For example, the \ character should be replaced by $\backslash$ (unless someone knows another solution?).

Except while in a verbatim block!

I do replace <code> tags with verbatim sections. But if this code block contains backslashes (as is the case for Windows folder names), the script still replaces these backslashes.

I reckon I could solve this using negative LookBehinds and/or LookAheads. But my attempts did not work.

Granted, I would be better off with a real parser. In fact, it is something on my "in-brain-roadmap", but it is currently out of the scope. The script works well enough for our limited knowledge domain. Creating a parser would require me to start pretty much from scratch.

My attempt

Example Input

The Hello \ World document is located in:
<code>C:\documents\hello_world.txt</code>

Expected output

The Hello $\backslash$ World document is located in:
\begin{verbatim}C:\documents\hello_world.txt\end{verbatim}

This is the best I could come up with so far:

<?php
$patterns = array(
    "special_chars2" => array( '/(?<!<code[^>]*>.*)\\\\[^$](?!.*<\/code>)/U', '$\\backslash$'),
);

foreach( $patterns as $name => $p ){
    $tex_input = preg_replace( $p[0], $p[1], $tex_input );
}
?>

Note that this is only an excerpt, and the [^$] is another LaTeX requirement.
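
A quick check of why this first attempt cannot work as written: PCRE only accepts lookbehinds whose length is bounded, and `(?<!<code[^>]*>.*)` is not. The sketch below feeds the pattern above to preg_replace unchanged; the sample input string is made up for illustration. The call should return NULL because the pattern fails to compile (the exact warning text depends on the PCRE version).

```php
<?php
// First attempt from above, copied verbatim.
$pattern = '/(?<!<code[^>]*>.*)\\\\[^$](?!.*<\/code>)/U';

// Made-up sample input (an assumption for illustration only).
$input = 'The Hello \ World document';

// "@" hides the compile warning so that only the return value is inspected.
$result = @preg_replace($pattern, '$\\backslash$', $input);

var_dump($result); // NULL: the pattern was rejected, nothing was replaced
```

So the lookbehind variant never gets as far as matching anything; it is the pattern itself that is refused.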

Another attempt which seemed to work:

<?php
$patterns = array(
    "special_chars2" => array( '/\\\\[^$](?!.*<\/code>)/U', '$\\backslash$'),
);

foreach( $patterns as $name => $p ){
    $tex_input = preg_replace( $p[0], $p[1], $tex_input );
}
?>

... in other words: leaving out the negative lookbehind.

But this looks more error-prone than with both lookbehind and lookahead.
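
To make the "error-prone" worry concrete, here is a small sketch that runs the second pattern on two made-up inputs (both strings are assumptions for illustration, not taken from the real documents). It suggests two weak spots: `[^$]` consumes the character after the backslash, and the lookahead also skips backslashes outside `<code>` when a `</code>` appears later on the same line.

```php
<?php
// Second attempt from above, copied verbatim.
$pattern = '/\\\\[^$](?!.*<\/code>)/U';
$replacement = '$\\backslash$';

// Made-up inputs (assumptions for illustration only).
$two_lines = "The Hello \\ World\n<code>C:\\documents</code>";
$one_line  = 'See \ here: <code>C:\tmp</code>';

// Case 1: the <code> block sits on its own line. The outer backslash is
// replaced, but the character matched by [^$] (a space) is swallowed too.
$out_two_lines = preg_replace($pattern, $replacement, $two_lines);
echo $out_two_lines, "\n";

// Case 2: a backslash outside <code> shares a line with a later </code>.
// The lookahead sees that </code>, so the outer backslash is left alone.
$out_one_line = preg_replace($pattern, $replacement, $one_line);
echo $out_one_line, "\n";
```

Swapping `[^$]` for a lookahead such as `(?!\$)` would probably avoid eating the following character, but the second case is harder to fix with look-arounds alone.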

A related question

As you may have noticed, the pattern is ungreedy (/.../U). So will this match as little as possible inside a <code> block, considering the look-arounds?
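
For reference, a minimal sketch of what the /U modifier does on its own, using a made-up input with two `<code>` blocks on one line (the input is an assumption for illustration). It only shows how `.*` behaves with and without /U; inside a negative lookahead the only thing that matters is whether `.*<\/code>` can match at all, so /U should not change the outcome of the attempts above.

```php
<?php
// Made-up input with two <code> blocks on one line (assumption for illustration).
$html = '<code>a\b</code> and <code>c\d</code>';

// Default: ".*" is greedy and runs to the last </code>.
preg_match('/<code>.*<\/code>/', $html, $greedy);

// With /U the same ".*" becomes lazy and stops at the first </code>.
preg_match('/<code>.*<\/code>/U', $html, $ungreedy);

echo $greedy[0], "\n";   // the whole input string
echo $ungreedy[0], "\n"; // only the first <code>...</code> block
```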

dousi8931 2009-11-23 15:59
If it were me, I would look for an HTML parser and do it with that.

Another option is to split the string into <code>.*?</code> chunks and the other parts, update only the other parts, and then recombine everything.

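<?php
$x = "The Hello \ World document is located in: <br> <code>C:\documents\hello_world.txt</code>";

// Split into alternating chunks: even indexes are ordinary text,
// odd indexes are the captured <code>...</code> blocks.
$r = preg_split("/(<code>.*?<\/code>)/", $x, -1, PREG_SPLIT_DELIM_CAPTURE);

// Replace backslashes only in the ordinary-text chunks.
for ($i = 0; $i < count($r); $i += 2)
    $r[$i] = str_replace("\\", "$\\backslash$", $r[$i]);

$x = implode($r);
echo $x;
?>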

Here is the result:

The Hello $\backslash$ World document is located in: C:\documents\hello_world.txt
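
The same split-and-recombine idea can also be pushed one step further so that the `<code>` blocks come out as verbatim sections, as in the expected output of the question. This is only a sketch under some assumptions: the function name `html_to_tex` is made up, `<code>` tags are assumed to carry no attributes, and the `[^$]` requirement from the question is not handled. The `s` modifier is added so that a code block may span several lines.

```php
<?php
/**
 * Hypothetical helper (the name is an assumption, not part of the original script).
 * Escapes backslashes outside <code> blocks and turns the blocks into verbatim sections.
 */
function html_to_tex(string $html): string
{
    // Even indexes: text between blocks. Odd indexes: captured <code>...</code> blocks.
    $parts = preg_split('/(<code>.*?<\/code>)/s', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            // Ordinary text: escape every backslash for LaTeX.
            $parts[$i] = str_replace('\\', '$\\backslash$', $part);
        } else {
            // Code block: drop "<code>" (6 chars) and "</code>" (7 chars),
            // keep the content untouched inside a verbatim section.
            $parts[$i] = '\\begin{verbatim}' . substr($part, 6, -7) . '\\end{verbatim}';
        }
    }

    return implode('', $parts);
}

// Example input from the question.
$input = 'The Hello \ World document is located in:' . "\n"
       . '<code>C:\documents\hello_world.txt</code>';

echo html_to_tex($input), "\n";
```

Because the backslashes are only touched in the even-indexed chunks, no look-around is needed at all.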

Sorry if my approach is not suitable for you.
(This answer was selected as the best answer by the asker.)
