duanci8209
2018-08-17 20:23
Views: 199
Accepted

PHP str_replace with a wildcard to strip scraped content?

I'm looking for a way to strip some HTML from a scraped HTML page. The page contains repetitive markup that I would like to delete, so I tried preg_replace() to remove the variable parts.

Data I want to strip:

Producent:<td class="datatable__body__item" data-title="Producent">Example
Groep:<td class="datatable__body__item" data-title="Produkt groep">Example1
Type:<td class="datatable__body__item" data-title="Produkt type">Example2
.... 
...

Must be like this afterwards:

Producent:Example
Groep:Example1
Type:Example2

So most of each line is identical; only the value inside the data-title attribute differs. How can I delete that part?

I tried a few things like this one:

$pattern = '/<td class=\"datatable__body__item\"(.*?)>/';
$tech_specs = str_replace($pattern,"", $tech_specs);

But that didn't work. Is there any solution to this?


3 answers

  • duanmo7075 2018-08-27 09:57
    Accepted

    Well, maybe my question wasn't very well written. I had a table that I needed to scrape from a website. I needed the info in the table, but had to clean up some parts as mentioned. The solution I finally came up with is the one below, and it works. It still needs a few manual replacements, but that is because of the stupid " they use for inch. ;-)

    Solution:

    // find the tables in the source code
    foreach ($techdata->find('table') as $table) {

        // filter out the rows
        foreach ($table->find('tr') as $row) {

            // take the innertext using simplehtmldom
            $tech_specs = $row->innertext;

            // strip some 'garbage'
            $tech_specs = str_replace("  \t\t\t\t\t\t\t\t\t\t\t<td class=\"datatable__body__item\">", "", $tech_specs);

            // find the first word of the string so I can use it
            $spec1 = explode('</td>', $tech_specs)[0];

            // use the found string to strip down the rest of the table
            $tech_specs = str_replace("<td class=\"datatable__body__item\" data-title=\"" . $spec1 . "\">", ":", $tech_specs);

            // manual correction because of the " used
            $tech_specs = str_replace("<td class=\"datatable__body__item\" data-title=\"tbv Montage benodigde 19\">", ":", $tech_specs);

            // manual correction because of the " used
            $tech_specs = str_replace("<td class=\"datatable__body__item\" data-title=\"19\">", ":", $tech_specs);

            // strip some 'garbage': turn each run of ten tabs into a line break
            $tech_specs = str_replace("\t\t\t\t\t\t\t\t\t\t", "\n", $tech_specs);
            $tech_specs = str_replace("</td>", "", $tech_specs);
            $tech_specs = str_replace("  ", "", $tech_specs);

            // put the clean row in an array ready for usage
            $specs[] = $tech_specs;
        }
    }
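
    The two manual corrections might be avoidable with a pattern that does not depend on the title text at all. A minimal sketch, assuming the inch mark shows up as a raw " inside the data-title attribute; the sample row and the helper name cleanRow() are made up for illustration and are not part of the scraped page or of simplehtmldom:

    ```php
    <?php
    // Hypothetical helper: turns one row's inner HTML into "label:value".
    function cleanRow(string $row): string
    {
        // Drop the opening tag of the label cell (the one without data-title).
        $row = str_replace('<td class="datatable__body__item">', '', $row);

        // Replace the opening tag of any data-title cell with a colon.
        // [^>]* runs up to the closing > no matter which quotes the title contains.
        $row = preg_replace('/<td class="datatable__body__item" data-title=[^>]*>/', ':', $row);

        // Remove the closing tags and surrounding whitespace.
        return trim(str_replace('</td>', '', $row));
    }

    // Assumed sample markup, modelled on the snippets in the question.
    $sample = '<td class="datatable__body__item">Montage</td>'
            . '<td class="datatable__body__item" data-title="tbv Montage benodigde 19"">19" rack</td>';

    echo cleanRow($sample), PHP_EOL; // Montage:19" rack
    ```

    This would still break if a title ever contained a > character, so treat it as a starting point rather than a drop-in replacement.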
    
  • douyi2664 2018-08-17 20:35

    Assuming that the string looked like this:

    $string = 'Producent:<td class="datatable__body__item" data-title="Producent">Example';
    

    You could get the beginning and the end of the string with this:

    preg_match('/^(\w+:).*\>(\w+)/', $string, $matches);
    
    echo implode([$matches[1], $matches[2]]);
    

    Which, in this case, prints Producent:Example. You could then add this output to another variable or array you intend to use. Or, since you mentioned replacing:

    $string = preg_replace('/^(\w+:).*\>(\w+)/', '$1$2', $string);
    

    But then again, the input will probably come as a variable number of lines, so handle each line separately:

    $string = 'Producent:<td class="datatable__body__item" data-title="Producent">Example
    Groep:<td class="datatable__body__item" data-title="Produkt groep">Example1
    Type:<td class="datatable__body__item" data-title="Produkt type">Example2';
    
    $stringRows = explode(PHP_EOL, $string);
    
    $pattern = '/^(\w+:).*\>(\w+)/';
    $replacement = '$1$2';
    foreach ($stringRows as &$stringRow) {
        $stringRow = preg_replace($pattern, $replacement, $stringRow);
    }
    unset($stringRow); // break the reference so later code cannot overwrite the last row by accident
    
    $string = implode(PHP_EOL, $stringRows);
    

    This produces the string you expect.

    Explaining my regex: the first group captures the first word up to and including the colon, and the second group captures the last word. I had previously anchored both ends, but that didn't work as expected once each line was handled separately, so I kept only the start anchor.

    ^(\w+:) => the word at the beginning of the string, up to and including the colon
    .*\>    => everything else up to the last greater-than sign (the backslash escape is optional)
    (\w+)   => the word after the greater-than sign
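
    As a possible shortcut, the same pattern can be applied to the whole block at once with the m (multiline) modifier, so that ^ matches at the start of every line and the explode/implode round trip is not needed. A minimal sketch, assuming the lines are separated by "\n" and carry no leading whitespace:

    ```php
    <?php
    // Sample input from the question, joined with "\n".
    $string = 'Producent:<td class="datatable__body__item" data-title="Producent">Example' . "\n"
            . 'Groep:<td class="datatable__body__item" data-title="Produkt groep">Example1' . "\n"
            . 'Type:<td class="datatable__body__item" data-title="Produkt type">Example2';

    // With /m, ^ anchors at each line start; . does not cross line breaks,
    // so every match stays inside a single line.
    $result = preg_replace('/^(\w+:).*>(\w+)/m', '$1$2', $string);

    echo $result, PHP_EOL;
    // Producent:Example
    // Groep:Example1
    // Type:Example2
    ```

    Note that (\w+) only covers a single word, so this assumes each value starts with one; anything after that first word is left in place untouched.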
    
  • douyuqing_12345 2018-08-27 23:25

    Just use a wildcard:

    $newstr = preg_replace('/<td class="datatable__body__item" data-title=".*?">/', '', $str);
    

    .*? matches any run of characters but is not greedy: it stops at the first "> that follows.
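
    For context, the attempt in the question most likely failed because str_replace() searches for its first argument as literal text and never interprets it as a regular expression. A small sketch comparing the two calls on the question's own pattern; the one-line sample input is taken from the question:

    ```php
    <?php
    $tech_specs = 'Producent:<td class="datatable__body__item" data-title="Producent">Example';
    $pattern = '/<td class=\"datatable__body__item\"(.*?)>/';

    // str_replace() looks for the pattern text character by character,
    // slashes included, so nothing is found and the input comes back unchanged.
    $literal = str_replace($pattern, '', $tech_specs);

    // preg_replace() treats the same string as a regular expression.
    $regex = preg_replace($pattern, '', $tech_specs);

    var_dump($literal === $tech_specs); // bool(true)
    echo $regex, PHP_EOL;               // Producent:Example
    ```

    So the pattern from the question was already usable; only the function had to change.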
