dongxuxian1123 2016-10-14 06:37

Scraping an HTML list structure with PHP

I want to scrape an HTML list structure so that I can save parent and child categories separately.

Here is the HTML source of the page:

<ul class="categories_list">
    <li><a href="/sports-nutrition">Sports Nutrition</a>
        <ul class="categories_list">
            <li><a href="/protein">Protein</a>
                <ul class="categories_list">
                    <li><a href="/protein-powder">Protein Powder</a>
                        <ul class="categories_list">
                            <li><a href="/whey-protein">Whey Protein</a>
                                <ul class="categories_list">
                                    <li><a href="/whey-protein-isolate">Whey Protein Isolate</a></li>
                                </ul>
                            </li>
                        </ul>
                    </li>
                </ul>
            </li>
        </ul>
        <ul class="categories_list">
            <li><a href="/pre-workout-supplements">Pre Workout Supplements</a></li>
        </ul>
        <ul class="categories_list">
            <li><a href="/creatine">Creatine</a>
                <ul class="categories_list">
                    <li><a href="/creatine-monohydrate">Creatine Monohydrate</a></li>
                </ul>
            </li>
        </ul>
        <ul class="categories_list">
            <li><a href="/amino-acids">Amino Acids</a>
                <ul class="categories_list">
                    <li><a href="/essential-amino-acids">Essential Amino Acids</a>
                        <ul class="categories_list">
                            <li><a href="/bcaa">BCAA</a></li>
                        </ul>
                    </li>
                </ul>
            </li>
        </ul>
        <ul class="categories_list">
            <li><a href="/joint-supplements">Joint Supplements</a>
                <ul class="categories_list">
                    <li><a href="/curcumin">Curcumin</a>
                        <ul class="categories_list">
                            <li><a href="/curcumin-phytosome">Curcumin Phytosome</a></li>
                        </ul>
                    </li>
                </ul>
            </li>
        </ul>
        <ul class="categories_list">
            <li><a href="/energy-endurance">Energy &amp; Endurance</a>
                <ul class="categories_list">
                    <li><a href="/stimulants">Stimulants</a></li>
                </ul>
            </li>
        </ul>
    </li>
</ul>

I am using Simple HTML DOM for scraping. I can get all the categories, but not in their proper hierarchy. I also tried the children approach, but that didn't work.

So I am looking for help to make my existing code work. Here it is:

$html= file_get_html($url);

foreach ($html->find('ul.categories_list li') as $link) {
    echo $link->plaintext.'<br>';
}

1 answer

  • donglian4770 2016-10-14 10:33

    Here is a script that tries to collect all the elements. It still needs to be improved:

    <?php

    require_once("simple_html_dom.php");
    $dom = file_get_html("source.php");

    // Collected categories; getCategory() appends to this through `global`.
    $categoryList = array();

    getCategory($dom);
    print_r($categoryList);

    function getCategory(simple_html_dom $dom){
        global $categoryList;

        foreach($dom->find('ul.categories_list li') as $li){
            // extract the <a> tag if found; skip list items without a link
            $a = $li->find('a',0);
            if($a === null){
                continue;
            }

            $categoryList[] = array(
                "categoryName"  =>  $a->href,
                "categoryLabel" =>  $a->innertext,
            );

            // remove the <a> node so that only the nested lists remain
            $a->outertext = '';

            $string = $li->innertext;
            if(trim($string) == ''){
                continue;
            }

            // re-parse the remaining markup and recurse into the nested lists
            $dom2 = str_get_html($string);
            if($dom2){
                getCategory($dom2);
            }
        }
    }
    

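    It recurses into the markup left inside each <li>, appending to the global $categoryList on every call. The result is a flat list, so the parent of each entry still has to be tracked separately.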
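
One possible way to keep the hierarchy is to inspect only the direct children at each level and recurse from there. The sketch below is not the script above: it swaps Simple HTML DOM for PHP's built-in DOM extension (DOMDocument) so that it runs without extra files. The helper names buildTree() and flattenTree() are hypothetical, and the markup is a shortened copy of the list from the question.

```php
<?php
// Sketch only: uses PHP's built-in DOM extension, not Simple HTML DOM.
// buildTree() and flattenTree() are hypothetical helper names.

// Return the categories found in the <ul> lists directly under $node
// as a nested array. Only direct children are inspected at each level,
// so a nested <li> is never picked up twice.
function buildTree(DOMElement $node): array
{
    $tree = [];
    foreach ($node->childNodes as $ul) {
        if (!($ul instanceof DOMElement) || $ul->tagName !== 'ul') {
            continue; // skip text nodes and non-list elements
        }
        foreach ($ul->childNodes as $li) {
            if (!($li instanceof DOMElement) || $li->tagName !== 'li') {
                continue;
            }
            $entry = ['href' => null, 'label' => null, 'children' => []];
            // the first <a> that is a direct child of this <li>
            foreach ($li->childNodes as $child) {
                if ($child instanceof DOMElement && $child->tagName === 'a') {
                    $entry['href']  = $child->getAttribute('href');
                    $entry['label'] = trim($child->textContent);
                    break;
                }
            }
            // nested <ul> lists under this <li> become its children
            $entry['children'] = buildTree($li);
            $tree[] = $entry;
        }
    }
    return $tree;
}

// Turn the nested array into flat rows that each name their parent,
// so parent and child can be saved separately.
function flattenTree(array $tree, ?string $parent = null): array
{
    $rows = [];
    foreach ($tree as $entry) {
        $rows[] = [
            'parent' => $parent,
            'href'   => $entry['href'],
            'label'  => $entry['label'],
        ];
        $rows = array_merge($rows, flattenTree($entry['children'], $entry['href']));
    }
    return $rows;
}

// Shortened sample of the markup from the question.
$html = <<<'HTML'
<ul class="categories_list">
    <li><a href="/sports-nutrition">Sports Nutrition</a>
        <ul class="categories_list">
            <li><a href="/protein">Protein</a>
                <ul class="categories_list">
                    <li><a href="/protein-powder">Protein Powder</a></li>
                </ul>
            </li>
        </ul>
        <ul class="categories_list">
            <li><a href="/energy-endurance">Energy &amp; Endurance</a></li>
        </ul>
    </li>
</ul>
HTML;

libxml_use_internal_errors(true); // real pages often trigger parser warnings
$doc = new DOMDocument();
$doc->loadHTML($html);            // the fragment is wrapped in <html><body>
$body = $doc->getElementsByTagName('body')->item(0);

$tree = buildTree($body);
$rows = flattenTree($tree);

foreach ($rows as $row) {
    printf("%s <- %s (%s)\n", $row['parent'] ?? 'ROOT', $row['href'], $row['label']);
}
```

Each row carries the href of its parent (null for top-level categories), which could serve as a parent key when saving the categories. With Simple HTML DOM the same idea should apply by walking each node's children() and checking the tag, rather than calling find() with a descendant selector.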
