dtnat7146 2016-03-03 07:49

Using RegEx to find content between HTML tags

I want to extract the content of elements on a page that have an itemprop attribute. Suppose the page contains several different HTML tags with an itemprop attribute; I want the text between the opening and closing tag of each one.

For a heading:

<h1 itemprop="name" class="h2">Whirlpool Direct Drive Washer Motor Coupling</h1>

Table data from td tag:

<td itemprop="productID">AP3963893</td>

Here the itemprop attribute is common to both elements. So I need the data between these tags, such as Whirlpool Direct Drive Washer Motor Coupling and AP3963893, using a regular expression.

Below is my code (which is currently not working):

preg_match_all(
    '/<div class=\"pdct\-inf\">(.*?)<\/div>/s',
    $producturl,
    $posts    
);

My full code:

<?php
    define('CSV_PATH','csvfiles/');
    $csv_file = CSV_PATH . "producturl.csv"; // Name of your producturl file
    $csvfile = fopen($csv_file, 'r');
    $csv_fileoutput = CSV_PATH . "productscraping.csv"; // Name of your product page data file
    $csvfileoutput = fopen($csv_fileoutput, 'a');

    $websitename = "http://www.appliancepartspros.com";

    while($data = fgetcsv($csvfile)) 
    {
        $producturl = $websitename . trim($data[1]);

        preg_match_all(
            '/<.*itemprop=\".*\".*>(.*?)<\/.*>/s',
            $producturl,
            $posts    
        );
        print_r($posts);
    }

2 answers

  • drex88669 2016-03-03 07:51

    Firstly, never ever use RegEx to parse HTML. Secondly, you can achieve this using jQuery quite simply by using the attribute selector:

    var nameItemprop = $('[itemprop="name"]').text(); // = 'Whirlpool Direct Drive Washer Motor Coupling'
    var productIdItemprop = $('[itemprop="productID"]').text(); // = 'AP3963893'
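
To illustrate why the regex approach is fragile, here is a small sketch that runs a JavaScript transcription of the question's second pattern over the two example elements. It assumes the two elements appear in one HTML string separated by a newline, and that the PCRE pattern behaves the same way when written as a JavaScript regex with the s flag. The variable names html, pattern and captured are only illustrative.

```javascript
// The two elements from the question, joined into one HTML string.
var html =
  '<h1 itemprop="name" class="h2">Whirlpool Direct Drive Washer Motor Coupling</h1>\n' +
  '<td itemprop="productID">AP3963893</td>';

// Transcription of the question's pattern; the `s` flag lets `.` match newlines,
// and `g` is required by String.prototype.matchAll.
var pattern = /<.*itemprop=".*".*>(.*?)<\/.*>/gs;

// Collect the first capture group of every match.
var captured = Array.from(html.matchAll(pattern), function (m) {
  return m[1];
});

// The greedy `.*` at the start swallows the whole <h1> element, so only one
// match is found and the heading text is lost.
console.log(captured.length); // 1
console.log(captured[0]);     // AP3963893
```

The heading never shows up in the results because each greedy .* runs as far as it can before backtracking, which is one of the reasons an HTML-aware selector is the safer choice.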
    

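    Note that itemprop itself is a standard HTML microdata attribute, so the markup above is valid. It is only invalid HTML to create your own non-standard attributes. If you control the markup and need to attach custom data to elements, you should use data-* attributes instead, for example: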

    <h1 data-itemprop="name" class="h2">Whirlpool Direct Drive Washer Motor Coupling</h1>
    <td data-itemprop="productID">AP3963893</td>
    
    var nameItemprop = $('[data-itemprop="name"]').text();
    var productIdItemprop = $('[data-itemprop="productID"]').text();
    

    Finally, if multiple elements share the same itemprop value, .text() returns their combined text, so you would need to loop through them to get the value from each individual element.
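
The loop could be sketched roughly as follows. This is only an illustration: collectItemprops is a hypothetical helper name, not part of jQuery, and it assumes you want the trimmed text of every element grouped by its itemprop value. It uses only the jQuery calls .each(), .attr() and .text().

```javascript
// Group the text of every [itemprop] element by its itemprop value.
// `$` is passed in so the caller decides which jQuery instance to use.
function collectItemprops($) {
  var values = {};
  $('[itemprop]').each(function (index, element) {
    var $el = $(element);
    var key = $el.attr('itemprop');
    var text = $el.text().trim();
    // The same itemprop value may occur several times, so keep a list per key.
    if (!Object.prototype.hasOwnProperty.call(values, key)) {
      values[key] = [];
    }
    values[key].push(text);
  });
  return values;
}

// Usage in a page where jQuery is loaded (guarded so the script also loads elsewhere).
if (typeof jQuery !== 'undefined') {
  console.log(collectItemprops(jQuery));
}
```

Each key in the returned object maps to an array, so a page with several productID cells yields every ID separately rather than one concatenated string.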
