duanci8209 2018-08-17 20:23
浏览 304
已采纳

PHP str_replace使用通配符刮取内容?

I'm looking for a solution to strip some HTML from a scraped HTML page. The page has some repetitive data I would like to delete so I tried with preg_replace() to delete the variable data.

Data I want to strip:

Producent:<td class="datatable__body__item" data-title="Producent">Example
Groep:<td class="datatable__body__item" data-title="Produkt groep">Example1
Type:<td class="datatable__body__item" data-title="Produkt type">Example2
.... 
...

Must be like this afterwards:

Producent:Example
Groep:Example1
Type:Example2

So a big piece is the same except the word within the data-title piece. How could I delete this piece of data?

I tried a few things like this one:

$pattern = '/<td class=\"datatable__body__item\"(.*?)>/';
$tech_specs = str_replace($pattern,"", $tech_specs);

But that didn't work. Is there any solution to this?

  • 写回答

3条回答 默认 最新

  • duanmo7075 2018-08-27 09:57
    关注

    Well maybe my question wasn't that good written. I had a table which I needed to scrape from a website. I needed the info in the table, but had to cleanup some parts as mentioned. The solution I finally made was this one and it works. It still has a little work to do with manual replacements but that is because of the stupid " they use for inch. ;-)

    Solution:

       \\ find the table in the sourcecode
       foreach($techdata->find('table') as $table){
    
        \\ filter out the rows
        foreach($table->find('tr') as $row){
    
        \\ take the innertext using simplehtmldom
        $tech_specs = $row->innertext;
    
        \\ strip some 'garbage'
        $tech_specs = str_replace("  \t\t\t\t\t\t\t\t\t\t\t<td class=\"datatable__body__item\">","", $tech_specs);
    
        \\ find the first word of the string so I can use it    
        $spec1 = explode('</td>', $tech_specs)[0];
    
        \\ use the found string to strip down the rest of the table
        $tech_specs = str_replace("<td class=\"datatable__body__item\" data-title=\"" . $spec1 . "\">",":", $tech_specs);
    
        \\ manual correction because of the " used
        $tech_specs = str_replace("<td class=\"datatable__body__item\" data-title=\"tbv Montage benodigde 19\">",":", $tech_specs);
    
        \\ manual correction because of the " used
        $tech_specs = str_replace("<td class=\"datatable__body__item\" data-title=\"19\">",":", $tech_specs);
    
        \\ strip some 'garbage'
        $tech_specs = str_replace("\t\t\t\t\t\t\t\t\t\t","
    ", $tech_specs);
        $tech_specs = str_replace("</td>","", $tech_specs);
        $tech_specs = str_replace("  ","", $tech_specs);
    
        \\ put the clean row in an array ready for usage
        $specs[] = $tech_specs;
        }
      }
    
    本回答被题主选为最佳回答 , 对您是否有帮助呢?
    评论
查看更多回答(2条)

报告相同问题?

悬赏问题

  • ¥30 Hyper-v虚拟机相关问题,求解答。
  • ¥15 TSM320F2808PZA芯片 Bootloader
  • ¥45 谷歌浏览器出现开发者工具无法显示已创建的,但您可以调试已部署的代码。 状态代码 404, net::ERR HTTP RESPONSE CODE FAILURE
  • ¥15 如何解决蓝牙通话音频突发失真问题
  • ¥15 安装opengauss数据库报错
  • ¥15 【急】在线问答CNC雕刻机的电子电路与编程
  • ¥60 在mc68335芯片上移植ucos ii 的成功工程文件
  • ¥15 笔记本外接显示器正常,但是笔记本屏幕黑屏
  • ¥15 Python pandas
  • ¥15 蓝牙硬件,可以用哪几种方法控制手机点击和滑动