dti3914 2013-04-20 18:08
Views: 49

Echo the contents of <a> tags with class="pret" from an HTML document

I have an HTML document in a PHP variable, $content. I can echo it, but I only need the <a ...> tags with class="pret". Once I have them, I need the non-word part (a code such as d3852) from each tag's href attribute and the number (e.g. 2352.2345) between <a> and </a>.

I have tried several examples from the web, but I get either empty arrays or PHP errors.

A regex example that gives me an empty array (the <a> tag is inside a table):

$pattern = "#<table\s.*?>.*?<a\s.*?class=[\"']pret[\"'].*?>(.*?)</a>.*?</table>#i";
preg_match_all($pattern, $content, $results);
print_r($results[1]);

Another example that just gives an error:

$a=$content->getElementsByTagName(a);

Reasons for the various errors: invalid HTML and non-UTF-8 characters.

Next I did this on another website and matched the contents in a single SQL table; the result is a copy of the website with updated data from my country. I no longer need to search the web for matching single results.

2 answers

  • dongxing5525 2013-04-20 18:42

    Assuming you are parsing a valid (or at least valid enough) HTML document, you should use DOM for this:

    // Simple example from php manual from comments
    $xml = new DOMDocument(); 
    $xml->loadHTMLFile($url); 
    $links = array(); 
    
    foreach($xml->getElementsByTagName('a') as $link) { 
        $links[] = array('url' => $link->getAttribute('href'),
                         'text' => $link->nodeValue); 
    } 
    

    Note the use of loadHTMLFile (or loadHTML for a string) rather than load: it is more robust against errors. You may also set DOMDocument::recover (as suggested in a comment by hakre) so that the parser tries to recover from errors.
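
    Since the question holds the page in a string rather than at a URL, here is a minimal sketch of the same loop using loadHTML, assuming the markup looks roughly like the hypothetical sample below. The sample string and the filter on class are illustrative assumptions, not taken from the asker's real page.

    ```php
    <?php
    // Hypothetical sample standing in for the asker's $content;
    // the <td> and <tr> tags are deliberately left unclosed.
    $content = '<table><tr><td><a class="pret" href="/item/d3852">2352.2345</a>'
             . '<td><a href="/other">skip me</a></table>';

    libxml_use_internal_errors(true);  // collect parse warnings instead of printing them
    $doc = new DOMDocument();
    $doc->recover = true;              // ask libxml to try to recover from errors
    $doc->loadHTML($content);          // parse from a string, not from a file or URL
    libxml_clear_errors();

    $links = array();
    foreach ($doc->getElementsByTagName('a') as $link) {
        // Keep only anchors whose class attribute is exactly "pret".
        if ($link->getAttribute('class') === 'pret') {
            $links[] = array('url'  => $link->getAttribute('href'),
                             'text' => trim($link->nodeValue));
        }
    }
    print_r($links);
    ```

    Calling libxml_use_internal_errors(true) before parsing keeps warnings about sloppy markup out of the output, which may explain some of the PHP errors mentioned in the question.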

    Or you could use XPath:

    $xpath = new DOMXPath($xml); // $xml is the DOMDocument loaded above
    $elements = $xpath->query("//a[@class='pret']");
    
    // query() returns false if the expression is malformed
    if ($elements !== false) {
        foreach ($elements as $element) {
            $links[] = array('url' => $element->getAttribute('href'),
                             'text' => $element->nodeValue); 
        }
    }
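
    To get from the matched anchors to the two values the question asks for, something like the following sketch may help. The sample markup, the shape of the code in href (a letter followed by digits) and the shape of the number are all assumptions; adjust the two small regexes to the real data.

    ```php
    <?php
    // Hypothetical sample; the href format and the number format are assumptions.
    $content = '<table><tr><td><a class="pret big" href="/item/d3852">2352.2345</a></td></tr></table>';

    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($content);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Match "pret" as one token of a possibly multi-valued class attribute.
    $query = "//a[contains(concat(' ', normalize-space(@class), ' '), ' pret ')]";

    $rows = array();
    foreach ($xpath->query($query) as $a) {
        // Assumed code shape: a letter followed by digits, e.g. d3852.
        $hasCode  = preg_match('/[a-z]\d+/i', $a->getAttribute('href'), $code);
        // Assumed number shape: digits with an optional decimal part.
        $hasValue = preg_match('/\d+(?:\.\d+)?/', $a->nodeValue, $value);
        if ($hasCode && $hasValue) {
            $rows[] = array('code' => $code[0], 'value' => $value[0]);
        }
    }
    print_r($rows);
    ```

    The contains(concat(...)) test is a common XPath 1.0 idiom for matching one class among several; the plain @class='pret' comparison only matches when the attribute is exactly "pret".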
    

    And if the HTML is invalid, you may use a regexp like this:

    $a1 = '\s*[^\'"=<>]+\s*=\s*"[^"]*"'; # Attribute with " - space tolerant
    $a2 = "\s*[^'\"=<>]+\s*=\s*'[^']*'"; # Attribute with ' - space tolerant
    $a3 = '\s*[^\'"=<>]+\s*=\s*\w*';     # Unquoted value - space tolerant
    # [^'"=<>]* # Junk - I'm not inserting this to regexp but you may have to
    
    $a = "(?:$a1|$a2|$a3)*"; # Any number of attributes
    $class = 'class=([\'"])pret\\1'; # Using ?: carefully is crucial for \\1 to work
                                     # otherwise you can use ["']
    # Delimiters plus the i (case-insensitive) and s (dot matches newline) flags;
    # the link text ends up in capture group 2
    $reg = "#<a{$a}\s*{$class}{$a}\s*>(.*?)</a#is";
    

    Then just pass $reg to preg_match_all. All of these regexps were written off the top of my head, so you may have to debug them.
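
    As a rough sketch of that last step, the pieces above can be assembled and run against a small hypothetical input. The `#` delimiters and the `i` and `s` flags are assumptions added here so that the pattern compiles and `.` can span newlines; the sample string is made up.

    ```php
    <?php
    // Hypothetical sample input with mixed quoting and a newline.
    $content = "<td><a href='/item/d3852' class=\"pret\">2352.2345</a>\n<a href=\"/x\">no</a>";

    $a1 = '\s*[^\'"=<>]+\s*=\s*"[^"]*"';  // attribute with double quotes
    $a2 = "\s*[^'\"=<>]+\s*=\s*'[^']*'";  // attribute with single quotes
    $a3 = '\s*[^\'"=<>]+\s*=\s*\w*';      // unquoted value
    $a  = "(?:$a1|$a2|$a3)*";             // any number of attributes
    $class = 'class=([\'"])pret\\1';      // group 1 is the quote character

    // Assumed delimiters (#) and flags: i = case-insensitive, s = dot matches newline.
    $reg = "#<a{$a}\s*{$class}{$a}\s*>(.*?)</a#is";

    preg_match_all($reg, $content, $m);
    print_r($m[2]);                       // group 2 holds the text between <a> and </a>
    ```

    Without the `s` flag, `.` stops at line breaks, which is one plausible reason the pattern in the question returned an empty array on a multi-line table.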
