duanci8209 2018-08-17 20:23
Views: 304
Accepted

PHP str_replace with wildcards to scrape content?

I'm looking for a way to strip some HTML from a scraped HTML page. The page has some repetitive markup I would like to delete, so I tried preg_replace() to remove the variable parts.

Data I want to strip:

Producent:<td class="datatable__body__item" data-title="Producent">Example
Groep:<td class="datatable__body__item" data-title="Produkt groep">Example1
Type:<td class="datatable__body__item" data-title="Produkt type">Example2
.... 
...

Must be like this afterwards:

Producent:Example
Groep:Example1
Type:Example2

So most of the markup is identical; only the value inside the data-title attribute differs. How can I delete this piece of markup?

I tried a few things like this one:

$pattern = '/<td class=\"datatable__body__item\"(.*?)>/';
$tech_specs = str_replace($pattern,"", $tech_specs);

But that didn't work. Is there any solution to this?
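
For reference, the attempt above passes a regular-expression pattern to str_replace(), which only matches literal strings, so the pattern text is never found in the page. A minimal sketch of the same idea with preg_replace() follows. It assumes the tags look exactly like the sample rows above and that no attribute value contains a `>`; the `$clean` variable is introduced here for illustration only.

```php
<?php
// Sample input modelled on the rows shown in the question.
$tech_specs = <<<'HTML'
Producent:<td class="datatable__body__item" data-title="Producent">Example
Groep:<td class="datatable__body__item" data-title="Produkt groep">Example1
Type:<td class="datatable__body__item" data-title="Produkt type">Example2
HTML;

// [^>]* matches everything up to the ">" that closes the tag,
// so the varying data-title value does not have to be known in advance.
$pattern = '/<td class="datatable__body__item"[^>]*>/';

// preg_replace() interprets the pattern as a regex; str_replace() would not.
$clean = preg_replace($pattern, '', $tech_specs);

echo $clean, PHP_EOL;
```

With the three sample rows this should print them without the `<td ...>` tags. Regular expressions are fragile on arbitrary HTML, so this is only reasonable when the markup is as regular as it is here.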

3 answers

  • duanmo7075 2018-08-27 09:57

    Well, maybe my question wasn't worded that well. I had a table that I needed to scrape from a website. I needed the info in the table, but had to clean up some parts, as mentioned. This is the solution I ended up with, and it works. It still needs a few manual replacements, but that is because of the stupid " they use for inches. ;-)

    Solution:

        // $techdata is the already-loaded simplehtmldom document
        $specs = [];

        // find the table in the source code
        foreach ($techdata->find('table') as $table) {

            // filter out the rows
            foreach ($table->find('tr') as $row) {

                // take the innertext using simplehtmldom
                $tech_specs = $row->innertext;

                // strip some 'garbage'
                $tech_specs = str_replace("  \t\t\t\t\t\t\t\t\t\t\t<td class=\"datatable__body__item\">", "", $tech_specs);

                // find the first word of the string so I can use it
                $spec1 = explode('</td>', $tech_specs)[0];

                // use the found string to strip down the rest of the table
                $tech_specs = str_replace("<td class=\"datatable__body__item\" data-title=\"" . $spec1 . "\">", ":", $tech_specs);

                // manual correction because of the " used
                $tech_specs = str_replace("<td class=\"datatable__body__item\" data-title=\"tbv Montage benodigde 19\">", ":", $tech_specs);

                // manual correction because of the " used
                $tech_specs = str_replace("<td class=\"datatable__body__item\" data-title=\"19\">", ":", $tech_specs);

                // strip some 'garbage'
                $tech_specs = str_replace("\t\t\t\t\t\t\t\t\t\t", "\n", $tech_specs);
                $tech_specs = str_replace("</td>", "", $tech_specs);
                $tech_specs = str_replace("  ", "", $tech_specs);

                // put the clean row in an array ready for usage
                $specs[] = $tech_specs;
            }
        }
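
    The per-row cleanup above could perhaps be condensed with preg_replace(), which would also avoid the manual corrections for titles containing a ". The following is only a sketch: `clean_spec_row()` is a hypothetical helper, not part of simplehtmldom, the second sample row is made up for illustration, and it assumes that no attribute value contains a `>`.

    ```php
    <?php
    // Hypothetical helper: turns the innertext of one <tr> into "label:value".
    function clean_spec_row(string $rowHtml): string
    {
        // Replace the value cell's opening tag (the one with data-title) by ":".
        // [^>]* accepts any title, including one with a stray double quote.
        $row = preg_replace(
            '/\s*<td class="datatable__body__item" data-title=[^>]*>\s*/',
            ':',
            $rowHtml
        );

        // Drop every remaining opening or closing <td> tag.
        $row = preg_replace('#</?td\b[^>]*>#', '', $row);

        // Remove the leading and trailing whitespace left by the page layout.
        return trim($row);
    }

    // Sample rows modelled on the markup described in the answer.
    $rowA = "  \t\t\t<td class=\"datatable__body__item\">Producent</td>"
          . "\t\t\t<td class=\"datatable__body__item\" data-title=\"Producent\">Example</td>";
    $rowB = "<td class=\"datatable__body__item\">Breedte</td>"
          . "\t<td class=\"datatable__body__item\" data-title=\"19\"\">19\"</td>";

    echo clean_spec_row($rowA), PHP_EOL; // Producent:Example
    echo clean_spec_row($rowB), PHP_EOL; // Breedte:19"
    ```

    Inside the loops above, this would be called as `$specs[] = clean_spec_row($row->innertext);`.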
    
    This answer was selected by the asker as the best answer.