dongtuo4723 2017-03-30 14:18
Accepted

Make all absolute links relative

I am looking for a regex solution to this problem. It can be a multi-step solution if that makes things easier. Important: the test string is just a snippet of a complete HTML DOM, and only images should be touched; every other URL should be left alone.

Here's an image:

<img 
src="https://www.example.com/de/wp-content/uploads/sites/1/2017/03/image.jpg"
data-srcset="
 https://www.example.com/de/wp-content/uploads/sites/1/2017/03/img1.jpg 507w,
 https://www.example.com/de/wp-content/uploads/sites/1/2017/03/img2.jpg 780w,
 https://www.example.com/de/wp-content/uploads/sites/74/2017/03/img3.jpg 950w"
data-sizes="
 (min-width: 80em) calc(0.5 * (100vw - (100vw- 57em))),
 (min-width: 48em) calc(0.5 * (100vw - 5em)),
 calc(100vw - 1em)"
alt="image" class="lazyload">

As a one-liner:

<img src="https://www.example.com/de/wp-content/uploads/sites/1/2017/03/image.jpg" data-srcset="https://www.example.com/de/wp-content/uploads/sites/1/2017/03/img1.jpg 507w, https://www.example.com/de/wp-content/uploads/sites/1/2017/03/img2.jpg 780w, https://www.example.com/de/wp-content/uploads/sites/74/2017/03/img3.jpg 950w" data-sizes="(min-width: 80em) calc(0.5 * (100vw - (100vw- 57em))), (min-width: 48em) calc(0.5 * (100vw - 5em)), calc(100vw - 1em)" alt="image" class="lazyload">

The desired result is to get rid of the protocol, domain, and first directory, that is, everything in front of /wp-content. The language I am doing this in is PHP.

For the src part I have

 $string = preg_replace("/(<img.*?src=\")(.*?)(\/wp-content.*?\")(.*>)/", '$1$3$4', $string);

The answer below is correct. Most HTML documents should load without problems. Do yourself a favor and keep your HTML as valid as possible; that is a good thing anyway. If you don't produce the HTML in question yourself, try to clean it up before you consume it.

For the data-srcset problem just parse that argument separately.
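
Parsing that attribute separately could look roughly like the sketch below. The helper name relativizeSrcset is made up for illustration, and splitting on commas assumes that none of the URLs contain a comma themselves.

```php
<?php
// Hypothetical helper (not from the original post): strips everything in
// front of /wp-content/ from each candidate in a srcset-style value.
// Assumption: URLs contain no commas, so a plain explode() is good enough.
function relativizeSrcset(string $srcset): string
{
    $candidates = [];
    foreach (explode(',', $srcset) as $candidate) {
        $candidate = trim($candidate);
        if ($candidate === '') {
            continue; // skip empty pieces, e.g. after a trailing comma
        }
        // Drop scheme, host and leading path up to (not including) /wp-content/.
        $candidates[] = preg_replace('~^https?://.+?(?=/wp-content/)~', '', $candidate);
    }
    return implode(', ', $candidates);
}

$srcset = "
 https://www.example.com/de/wp-content/uploads/sites/1/2017/03/img1.jpg 507w,
 https://www.example.com/de/wp-content/uploads/sites/1/2017/03/img2.jpg 780w";

echo relativizeSrcset($srcset), PHP_EOL;
```

With a DOM parser, the result would be written back with something like $img->setAttribute('data-srcset', relativizeSrcset($img->getAttribute('data-srcset'))).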

Compare your DOM before and after thoroughly. The $dom->saveHTML() method serializes void elements without the self-closing slash, so <meta arg="yada"/> becomes <meta arg="yada"> (the trailing slash is dropped). Also see "Are (non-void) self-closing tags valid in HTML5?"
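
A small sketch of that serialization difference, using a made-up fragment. saveXML() on the root element is shown only as one possible way to keep the slash; whether XML-style output suits the rest of the page is an assumption to check.

```php
<?php
$dom = new DOMDocument();
// NOIMPLIED / NODEFDTD keep libxml from adding <html><body> and a doctype.
$dom->loadHTML('<p><img src="a.png" alt=""/></p>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// HTML serialization: void elements are written without the trailing slash.
$asHtml = $dom->saveHTML();

// XML serialization of the root element: empty elements keep the slash.
$asXml = $dom->saveXML($dom->documentElement);

echo $asHtml, $asXml, PHP_EOL;
```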


1 answer

  • dougu3591 2017-03-30 14:36

    Don't. Use a parser to analyze the DOM and apply the regex on the DOM elements/attributes directly.

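    <?php

    // $html is assumed to hold the markup to process (it is not defined in this snippet).
    $dom = new DOMDocument();
    // LIBXML_HTML_NOIMPLIED stops libxml from wrapping the fragment in <html><body>.
    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED);

    // Select only <img> elements whose src mentions wp-content; other URLs stay untouched.
    $xpath = new DOMXPath($dom);
    $images = $xpath->query("//img[contains(@src, 'wp-content')]");

    // Lazily match everything up to (but not including) /wp-content/.
    $regex = '~^.+?(?=/wp-content/)~';
    foreach ($images as $img) {
        $img->setAttribute('src',
            // Swaps the prefix for another origin; pass '' instead to get a root-relative URL.
            preg_replace($regex, 'https://anotherdomain.com', $img->getAttribute('src'))
        );
    }

    echo $dom->saveHTML();
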

    It has been answered a dozen times why it is not a good idea to parse HTML with regular expressions, one of the most popular answers being this: RegEx match open tags except XHTML self-contained tags.


    However, if your HTML is not valid, you could use the following regex (in verbose mode):
    (?:\G(?!\A)|<img)
    (?s:.+?\bsrc=['"])\K
    https?://.+?(?=/wp-content/)
    

    See it working on regex101.com.
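
A minimal sketch of how that pattern might be applied from PHP, assuming ~ as the delimiter and the x flag for verbose mode. The sample $html string is made up. Note that after the first <img the \G branch keeps scanning, so any later src= attribute with a matching URL would be rewritten as well, not only those on images.

```php
<?php
// The verbose-mode pattern from above, wrapped in ~ delimiters with the x flag.
$regex = <<<'REGEX'
~
(?:\G(?!\A)|<img)             # start at an <img tag, or continue after the previous match
(?s:.+?\bsrc=['"])\K          # skip ahead to the next src= and reset the match start
https?://.+?(?=/wp-content/)  # scheme, host and path up to /wp-content/
~x
REGEX;

// Sample input, invented for this sketch.
$html = '<img src="https://www.example.com/de/wp-content/uploads/image.jpg" alt="image">';

// Replacing the matched prefix with an empty string leaves a root-relative URL.
$result = preg_replace($regex, '', $html);
echo $result, PHP_EOL;
// prints: <img src="/wp-content/uploads/image.jpg" alt="image">
```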

    This answer was selected as the best answer by the asker.
