dtjkl42086 2012-03-24 15:37

A reliable, efficient custom search-and-replace function - preg or str replace

In a few different guises I've asked about this "filter" on here and WPSE. I'm now taking a different approach to it, and I'd like to make it solid and reliable.

My situation:

When I create a post in my WordPress CMS, I want to run a filter which searches for certain terms and replaces them with links.
I have the terms that I want to search for in two arrays: $glossary_terms and $species_terms.
$species_terms is a list of scientific names of fishes, such as Apistogramma panduro.
$glossary_terms is a list of fishkeeping glossary terms such as abdomen, caudal-fin and Gram's Method.

There are a few nuances worth noting:

Speed is not an issue, as I will be running this filter in the background rather than when a user visits the page or when an author submits/edits a species profile or post.
Some of the post content being filtered may contain HTML with these terms in, like <img src="image.jpg" title="Apistogramma panduro male" />. Obviously these shouldn't be replaced.
Species are often referred to with an abbreviated Genus, so instead of Apistogramma panduro, you'll often see A. panduro. This means I need to search & replace all of the species terms as an abbreviation too - Apistogramma panduro, A. panduro, Satanoperca daemon, S. daemon etc.
If caudal-fin and caudal both exist in the glossary terms, caudal-fin should be replaced first.

I was contemplating simply adding a preg_replace which searched for the terms, but only with a space on the left, (i.e. ( )term) and a space, comma, exclamation, full-stop or hyphen on the right (i.e. term(, . ! - )) but that won't help me to not break the image HTML.

Example content

<br />
It looks very similar to fishes of the <i><a href="species/betta-foerschi" rel="species/betta-foerschi/?hover=true" class="link_species">B. foerschi</a></i> group/complex but its breeding strategy, adult size and observed behaviour preclude its inclusion in that <a href="glossary/a/assemblage" rel="glossary/a/assemblage?hover=true" class="link_glossary">assemblage</a>.

Instead it appears to be a member of the <i><a href="species/betta-coccina" rel="species/betta-coccina/?hover=true" class="link_species">B. coccina</a></i> group which currently includes <i><a href="species/betta-brownorum" rel="species/betta-brownorum/?hover=true" class="link_species">B. brownorum</a></i>, <i><a href="species/betta-burdigala" rel="species/betta-burdigala/?hover=true" class="link_species">B. burdigala</a></i>, <i><a href="species/betta-coccina" rel="species/betta-coccina/?hover=true" class="link_species">B. coccina</a></i>, <i><a href="species/betta-livida" rel="species/betta-livida/?hover=true" class="link_species">B. livida</a></i>, <i>B. miniopinna</i>, <i><a href="species/betta-persephone" rel="species/betta-persephone/?hover=true" class="link_species">B. persephone</a></i>, <i>B. tussyae</i>, <i><a href="species/betta-rutilans" rel="species/betta-rutilans/?hover=true" class="link_species">B. rutilans</a></i> and <i><a href="species/betta-uberis" rel="species/betta-uberis/?hover=true" class="link_species">B. uberis</a></i>.

Of these it's most similar in appearance to <i><a href="species/betta-uberis" rel="species/betta-uberis/?hover=true" class="link_species">B. uberis</a></i> but can be distinguished by its noticeably shorter <a href="glossary/d/dorsal" rel="glossary/d/dorsal?hover=true" class="link_glossary">dorsal</a>-<a href="glossary/f/fin" rel="glossary/f/fin?hover=true" class="link_glossary">fin</a> <a href="glossary/b/base" rel="glossary/b/base?hover=true" class="link_glossary">base</a> and overall blue-greenish (vs. green/reddish) colouration.

Members of this group are characterised by their small adult size (&lt; 40 mm SL), a uniform red or black <a href="glossary/b/base" rel="glossary/b/base?hover=true" class="link_glossary">base</a> body colour, the presence of a <a href="glossary/m/midlateral" rel="glossary/m/midlateral?hover=true" class="link_glossary">midlateral</a> body blotch in some <a href="glossary/s/species" rel="glossary/s/species?hover=true" class="link_glossary">species</a> and the fact they have 9 abdominal <a href="glossary/v/vertebrae" rel="glossary/v/vertebrae?hover=true" class="link_glossary">vertebrae</a> compared with 10-12 in the other <a href="glossary/s/species" rel="glossary/s/species?hover=true" class="link_glossary">species</a> groups. In addition all are <a href="glossary/o/obligate" rel="glossary/o/obligate?hover=true" class="link_glossary">obligate</a> <a href="glossary/p/peat" rel="glossary/p/peat?hover=true" class="link_glossary">peat</a> <a href="glossary/s/swamp" rel="glossary/s/swamp?hover=true" class="link_glossary">swamp</a> dwellers (Tan and Ng, 2005).<br />

^^^ This example here has had the correct links manually inserted. The filter shouldn't break these links!

It looks very similar to fishes of the B. foerschi group/complex but its breeding strategy, adult size and observed behaviour preclude its inclusion in that assemblage.

Instead it appears to be a member of the B. coccina group which currently includes B. brownorum, B. burdigala, B. coccina, B. livida, B. miniopinna, B. persephone, B. tussyae, B. rutilans and B. uberis.

Of these it's most similar in appearance to B. uberis but can be distinguished by its noticeably shorter dorsal-fin base and overall blue-greenish (vs. green/reddish) colouration.

Members of this group are characterised by their small adult size (< 40 mm SL), a uniform red or black base body colour, the presence of a midlateral body blotch in some species and the fact they have 9 abdominal vertebrae compared with 10-12 in the other species groups. In addition all are obligate peat swamp dwellers (Tan and Ng, 2005).

^^^ Same example pre-formatting.

[caption id="attachment_542" align="alignleft" width="125" caption="Amazonas Magazine - now in English!"]<a href="http://www.seriouslyfish.comwp-content/uploads/2011/12/Amazonas-English-1.jpg"><img class="size-thumbnail wp-image-542" title="Amazonas English" src="/wp-content/uploads/2011/12/Amazonas-English-1-288x381.jpg" alt="Amazonas English" width="125" height="165" /></a>[/caption]

Edited by Hans-Georg Evers, the magazine 'Amazonas' has been widely-regarded as among the finest regular publications in the hobby since its launch in 2005, an impressive achievment considering it's only been published in German to date. The long-awaited English version is just about to launch, and we think a subscription should be top of any serious fishkeeper's Xmas list...

The magazine is published in a bi-monthly basis and the English version launches with the January/February 2012 issue with distributors already organised in the United States, Canada, the United Kingdom, South Africa, Australia, and New Zealand. There are also mobile apps availablen which allow digital subscribers to read on portable devices.

It's fair to say that there currently exists no better publication for dedicated hobbyists with each issue featuring cutting-edge articles on fishes, invertebrates, aquatic plants, field trips to tropical destinations plus the latest in husbandry and breeding breakthroughs by expert aquarists, all accompanied by excellent photography throughout.

U.S. residents can subscribe to the printed edition for just $29 USD per year, which also includes a free digital subscription, with the same offer available to Canadian readers for $41 USD or overseas subscribers for $49 USD. Please see the <a href="http://www.amazonasmagazine.com/">Amazonas website</a> for further information and a sample digital issue!

Alternatively, subscribe directly to the print version <a href="https://www.amazonascustomerservice.com/subscribe/index2.php">here</a> or digital version <a href="https://www.amazonascustomerservice.com/subscribe/digital.php">here</a>.

^^^ This will likely only have a few Glossary terms in rather than any species links.

Example terms

$species_terms

339 => 'Aulonocara maylandi maylandi',
340 => 'Aulonocara maylandi kandeensis',
341 => 'Aulonocara sp. "walteri"',
342 => 'Aulonocara sp. "stuartgranti maleri"',
343 => 'Aulonocara stuartgranti',
344 => 'Benthochromis tricoti',
345 => 'Boulengerochromis microlepis',
346 => 'Buccochromis lepturus',
347 => 'Buccochromis nototaenia',
348 => 'Betta brownorum',
349 => 'Betta foerschi',
350 => 'Betta coccina',
351 => 'Betta uberis'

As you can see above, the general format for these scientific names is "Genus species", but can often include "sp." or "aff." (for species which aren't officially described) and "Genus species subspecies" formats.

$glossary_terms

1 => 'abdomen',
2 => 'caudal',
3 => 'caudal-fin',
4 => 'caudal-fin peduncle',
5 => 'Gram\'s Method'

If anyone can come up with a filter which meets all these conditions and requirements, I'd like to offer a bounty.

Thanks in advance,


duanpoqiu0919 2012-03-24 18:44

I think it's better to use DOMDocument functionality than regexps. Here is a working prototype:

// Each dynamically constructed regexp will contain at most 70 subpatterns
define('GROUPS_PER_REGEXPS', 70);

$speciesTerms = array(
  339 => '(?:Aulonocara|A\.) maylandi maylandi',
  340 => '(?:Aulonocara|A\.) maylandi kandeensis',
  344 => '(?:Benthochromis|B\.) tricoti',
  345 => '(?:Boulengerochromis|B\.) microlepis',
);

function matchTerms($text) {
  // Globals are not good. I left it for the simplicity
  global $speciesTerms;

  $result = array();
  // Process the terms in chunks of at most GROUPS_PER_REGEXPS,
  //  keeping the term ids as array keys
  foreach (array_chunk($speciesTerms, GROUPS_PER_REGEXPS, true) as $chunk) {
    // Maps capturing group identifiers to term ids
    $termMapping = array();

    // Dynamically construct regexp
    $groups = array();
    $c = 1;
    foreach ($chunk as $termId => $termPattern) {
      // Match word boundaries, so we don't capture "B. tricotisomeramblingstring"
      $groups[] = '(\b' . $termPattern . '\b)';
      $termMapping[$c++] = $termId;
    }
    $regexp = '/' . implode('|', $groups) . '/';
    preg_match_all($regexp, $text, $matches, PREG_OFFSET_CAPTURE);
    for ($i = 1; $i < $c; $i++) {
      foreach ($matches[$i] as $matchData) {
        // matchData[0] holds matched string, e.g. Benthochromis tricoti
        // matchData[1] holds offset, e.g. 15
        if (isset($matchData[0]) && !empty($matchData[0])) {
          $result[] = array(
            'text' => $matchData[0],
            'offset' => $matchData[1],
            'id' => $termMapping[$i],
          );
        }
      }
    }
  }
  // Sort by offset in descending order
  usort($result, function($a, $b) {
    return $b['offset'] <=> $a['offset'];
  });
  return $result;
}

// loadHTML() is an instance method; calling it statically fails on PHP 8
$doc = new DOMDocument();
$doc->loadHTML($html);

// Stack will be used to avoid recursive functions
$stack = new SplStack;
$stack->push($doc);
while (!$stack->isEmpty()) {
  $node = $stack->pop();
  if ($node->nodeType == XML_TEXT_NODE && $node->parentNode instanceof DOMElement) {
    // $node represents text node
    //  and it's inside a tag (second condition in the statement above)

    // Check that this text is not wrapped in <a> tag
    //  as we don't want to wrap it twice
    if ($node->parentNode->tagName != 'a') {
      $matches = matchTerms($node->wholeText);
      foreach ($matches as $match) {
        // Create new link element in the DOM
        $link = $doc->createElement('a', $match['text']);
        $link->setAttribute('href', 'species/' . $match['id']);
        $link->setAttribute('class', 'link_species');

        // Save the text after the link
        $remainingText = $node->splitText($match['offset'] + strlen($match['text']));
        // Save the text before the link
        $linkText = $node->splitText($match['offset']);

        // Replace $linkText with $link node
        //  i.e. 'something' becomes '<a href="..">something</a>'
        $node->parentNode->replaceChild($link, $linkText);
      }
    }
  }
  if ($node->hasChildNodes()) {
    foreach ($node->childNodes as $childNode) {
      $stack->push($childNode);
    }
  }
}

$body = $doc->getElementsByTagName('body');
echo $doc->saveHTML($body->item(0));

Implementation details

I've only shown how to replace species terms; glossary terms can be handled the same way. Links take the form "species/$id". Abbreviations are handled correctly. DOMDocument is a reliable parser: it copes with broken markup and is fast.
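
For the glossary, the question asks that longer terms such as caudal-fin win over shorter ones such as caudal. A minimal sketch of one way to get that, assuming the terms are plain strings rather than regexps: sort them longest first and join them into a single alternation. The helper name buildGlossaryRegexp is hypothetical, not part of the prototype above.

```php
<?php
// Hypothetical helper: builds one regexp that matches any glossary term.
function buildGlossaryRegexp(array $glossaryTerms) {
  // Longest terms first, so "caudal-fin" is tried before "caudal"
  usort($glossaryTerms, function ($a, $b) {
    return strlen($b) <=> strlen($a);
  });
  // preg_quote() escapes regexp metacharacters and the "/" delimiter
  $alternatives = array_map(function ($term) {
    return preg_quote($term, '/');
  }, $glossaryTerms);
  return '/\b(?:' . implode('|', $alternatives) . ')\b/';
}

$glossaryTerms = array(
  1 => 'abdomen',
  2 => 'caudal',
  3 => 'caudal-fin',
  4 => 'caudal-fin peduncle',
  5 => 'Gram\'s Method',
);
$regexp = buildGlossaryRegexp($glossaryTerms);
preg_match_all($regexp, 'The caudal-fin peduncle and the caudal-fin differ.', $m);
// $m[0] is array('caudal-fin peduncle', 'caudal-fin')
```

Because PCRE tries alternatives from left to right, the order of the alternation decides which term wins at a given position. This sketch drops the term ids; to keep them, wrap each alternative in its own capturing group as matchTerms does.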

The ?: prefix makes a subpattern non-capturing, so it is not counted as a capturing group (see the documentation on subpatterns). Without an exact count of capturing groups, we can't map a match back to its termId. The idea is to build one big regexp by wrapping each pattern from the $speciesTerms array in its own capturing group and joining the groups with a pipe |. The final regexp for the first two species would be (spaces added for clarity):

       First capturing group             Alternation       Second capturing group
( (?:Aulonocara|A\.) maylandi maylandi )      |       ( (?:Aulonocara|A\.) maylandi kandeensis )

So, the text "Examples: Aulonocara maylandi maylandi, A. maylandi kandeensis" will give the following matches:

$matches[1] = array('Aulonocara maylandi maylandi') // Captured by the first group
$matches[2] = array('A. maylandi kandeensis') // Captured by the second group

We can clearly say that all elements in matches[1] are referring to the species Aulonocara maylandi maylandi or A. maylandi maylandi which has id = 339.

In short: Use (?:) if you're using subpatterns in $speciesTerms.
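
The patterns in $speciesTerms don't have to be written by hand. A rough sketch, assuming every name starts with a single-word genus in plain ASCII: the hypothetical helper speciesPattern below derives the "(?:Genus|G\.) rest" pattern from a raw scientific name.

```php
<?php
// Hypothetical helper: turns "Betta coccina" into "(?:Betta|B\.) coccina"
function speciesPattern($name) {
  // Split the genus from the rest of the name at the first space
  $parts = explode(' ', $name, 2);
  $genus = preg_quote($parts[0], '/');
  // Abbreviated genus: first letter followed by a literal dot
  $abbreviation = preg_quote($parts[0][0], '/') . '\.';
  // Non-capturing group, so the capturing group count stays one per term
  $pattern = '(?:' . $genus . '|' . $abbreviation . ')';
  if (isset($parts[1])) {
    $pattern .= ' ' . preg_quote($parts[1], '/');
  }
  return $pattern;
}

// array_map() keeps the keys (term ids) when given a single array
$speciesTerms = array_map('speciesPattern', array(
  348 => 'Betta brownorum',
  350 => 'Betta coccina',
));
// $speciesTerms[350] is '(?:Betta|B\.) coccina'
```

Names that end in a non-word character, such as Aulonocara sp. "walteri", need extra care: the trailing \b that matchTerms appends does not match after a closing quote followed by a space.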

UPDATE: Each dynamically created regexp contains at most a fixed number of subpatterns, defined as a constant (GROUPS_PER_REGEXPS) at the top. This avoids hitting the PCRE limit on the number of subpatterns in a single regexp.
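
The chunked regexps can also be built once, up front, and reused for every text node. A minimal sketch under that assumption; buildTermRegexps and TERMS_PER_REGEXP are hypothetical names, and the chunk size of 2 in the usage lines is only there to make the split visible.

```php
<?php
// Assumed default chunk size, mirroring GROUPS_PER_REGEXPS above
const TERMS_PER_REGEXP = 70;

// Hypothetical helper: returns a list of array(regexp, groupNumber => termId)
function buildTermRegexps(array $terms, $chunkSize = TERMS_PER_REGEXP) {
  $regexps = array();
  // preserve_keys = true keeps the term ids as array keys
  foreach (array_chunk($terms, $chunkSize, true) as $chunk) {
    $groups = array();
    $mapping = array();
    $group = 1;
    foreach ($chunk as $termId => $pattern) {
      $groups[] = '(\b' . $pattern . '\b)';
      $mapping[$group++] = $termId;
    }
    $regexps[] = array('/' . implode('|', $groups) . '/', $mapping);
  }
  return $regexps;
}

$terms = array(
  344 => '(?:Benthochromis|B\.) tricoti',
  345 => '(?:Boulengerochromis|B\.) microlepis',
  349 => '(?:Betta|B\.) foerschi',
);
// Three terms with a chunk size of 2 give two regexps
$regexps = buildTermRegexps($terms, 2);
```

Each entry pairs a regexp with its own group-to-id mapping, so a matching function only has to loop over $regexps instead of rebuilding the patterns on every call.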

Important notes:

If you have a lot of terms, matchTerms has to split them across several regexps, because a regexp has a limit on the number of subpatterns. In this case it's best to prebuild an array of regexps, one for every N terms.
matchTerms generates the regexps on every call; obviously this only needs to be done once.
It's possible to use advanced regexps in $speciesTerms.
strlen => mb_strlen if you're using multibyte encodings. The offsets reported by PREG_OFFSET_CAPTURE are byte offsets too, so convert them to character offsets before passing them to splitText.
Supplied $html will be wrapped in a <body> tag (unless it's already wrapped)
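
The multibyte note can be made concrete. preg_match_all with PREG_OFFSET_CAPTURE reports offsets in bytes, while splitText, as far as I can tell, counts characters. A small sketch of the conversion, assuming UTF-8 text and the mbstring extension; byteToCharOffset is a hypothetical helper name.

```php
<?php
// Hypothetical helper: converts a byte offset (as reported by
// PREG_OFFSET_CAPTURE) into a character offset for UTF-8 text
function byteToCharOffset($text, $byteOffset) {
  // Count the characters in the bytes that precede the match
  return mb_strlen(substr($text, 0, $byteOffset), 'UTF-8');
}

// "\u{00F6}" is "ö" and "\u{00DF}" is "ß": two bytes each in UTF-8
$text = "Gr\u{00F6}\u{00DF}e: Betta coccina";
preg_match('/Betta coccina/', $text, $m, PREG_OFFSET_CAPTURE);

$byteOffset = $m[0][1];                              // 9
$charOffset = byteToCharOffset($text, $byteOffset);  // 7
$charLength = mb_strlen($m[0][0], 'UTF-8');          // 13
```

In the prototype, $charOffset and $charOffset + $charLength would then replace $match['offset'] and $match['offset'] + strlen($match['text']) in the two splitText calls.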
