doudieheng5322 2016-06-02 21:36

An efficient way to parse HTML

I am working with cURL, and once the request has executed I end up with a variable $data which contains the whole page's HTML content.

Here is a small part of the HTML held in that variable, to demonstrate.

<table class="TabTopGroup" width="100%" height="100%" cellspacing="0" cellpadding="0" border="0">
    <tr>
        <td align="Left" class="HtmlGridCell" colspan="5"> </td>
        <td align="Left" class="HtmlGridCell" colspan="2"><span class="progress" title="0%"><span class="indicator" style="width: 0%"> </span></span></td>
    </tr>
    <tr valign="top">
        <td align="Left" class="HtmlGridCell no-bottom-border"><span class="priority-1"> </span></td>
        <td align="Left" class="HtmlGridCell no-bottom-border"><a href="jobview.aspx?id=12514845" class="link">J000005</a></td>
        <td align="Left" class="HtmlGridCell no-bottom-border">Student</td>
        <td align="Left" class="HtmlGridCell no-bottom-border">test job</td>
        <td align="Left" class="HtmlGridCell no-bottom-border"><span id='jobstate_12514845'>Planned</span><span class='inline-dropdown' onclick='return jqe.c();' onmouseover='jqe.s(12514845, this, event);'> </span></td>
        <td align="Left" class="HtmlGridCell no-bottom-border">02-Jun</td>
        <td align="Left" class="HtmlGridCell no-bottom-border">02-Jun</td>
    </tr>
    <tr>
        <td align="Left" class="HtmlGridCell" colspan="5"> </td>
        <td align="Left" class="HtmlGridCell" colspan="2"><span class="progress" title="0%"><span class="indicator" style="width: 0%"> </span></span></td>
    </tr>
    <tr valign="top">
        <td align="Left" class="HtmlGridCell no-bottom-border"><span class="priority-1"> </span></td>
        <td align="Left" class="HtmlGridCell no-bottom-border"><a href="jobview.aspx?id=12514850" class="link">J000006</a></td>
        <td align="Left" class="HtmlGridCell no-bottom-border">Student</td>
        <td align="Left" class="HtmlGridCell no-bottom-border">test job</td>
        <td align="Left" class="HtmlGridCell no-bottom-border"><span id='jobstate_12514850'>Planned</span><span class='inline-dropdown' onclick='return jqe.c();' onmouseover='jqe.s(12514850, this, event);'> </span></td>
        <td align="Left" class="HtmlGridCell no-bottom-border">02-Jun</td>
        <td align="Left" class="HtmlGridCell no-bottom-border">02-Jun</td>
    </tr>
</table>

On the other side, I have an array which contains the following type of data:

$jobs = 
    array( 
        array( 
            'jID' => "J000005", 
            'Name' => "Something"
        ),
        array( 
            'jID' => "J000006", 
            'Name' => "Something"
        ),
        array(
            'jID' => "J16453", 
            'Name' => "Something"
        )
    );

What I am trying to do is search for occurrences of each jID within the HTML string. If a jID is found, I need to obtain the id parameter from the href of its parent anchor and add both values to an array. So if I cross-check the above array against the HTML, I should end up with something like this:

$outcome = 
    array( 
        array( 
            'jID' => "J000005", 
            'aID' => "12514845"
        ),
        array( 
            'jID' => "J000006", 
            'aID' => "12514850"
        )
    );
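For illustration, the cross-check described above could be sketched with DOMDocument and DOMXPath roughly as follows. This is only a sketch under assumptions: the function name findJobIds is hypothetical, and it assumes every relevant link points at jobview.aspx with an id query parameter and has the jID as its only text, as in the sample markup. It walks the anchors once and builds a lookup table, so repeated occurrences of a jID are simply ignored.

```php
<?php
// Hypothetical helper (not from the original post): maps each known jID to
// the id query parameter of the anchor whose text is that jID.
function findJobIds(string $data, array $jobs): array
{
    if ($data === '') {
        return [];
    }

    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of emitting them,
    // since real-world markup is rarely valid.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($data);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // One pass over the document: anchor text => id parameter.
    $idsByText = [];
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//a[contains(@href, "jobview.aspx")]') as $a) {
        $text = trim($a->textContent);
        $query = parse_url($a->getAttribute('href'), PHP_URL_QUERY);
        if (!is_string($query)) {
            continue;
        }
        parse_str($query, $params);
        // Keep only the first occurrence of each jID.
        if (isset($params['id']) && !isset($idsByText[$text])) {
            $idsByText[$text] = $params['id'];
        }
    }

    // Cross-check the known jobs against the lookup table.
    $outcome = [];
    foreach ($jobs as $job) {
        if (isset($idsByText[$job['jID']])) {
            $outcome[] = ['jID' => $job['jID'], 'aID' => $idsByText[$job['jID']]];
        }
    }
    return $outcome;
}
```

Calling findJobIds($data, $jobs) with the sample data should leave out J16453, because no anchor in the HTML carries that text.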

The example shown above is a small dataset. The real HTML string has a lot more data, and my initial array will contain about 50 jIDs.

Really, I am after advice on the best way to handle this. I was initially thinking of using DOMDocument, but I don't think this is the best way. Another option would be to use preg_match_all somehow, but I am not sure how efficient that would be.

Another problem I face is that the HTML might contain more than one occurrence of a jID. I am not bothered how many occurrences of J000005 there are, for instance; all I want is its associated id, which is contained as a parameter within its parent anchor's href.

Any advice on how this can be achieved would be appreciated. I would be interested to understand which way is the most efficient, because I have read that preg_match_all is faster than doing it via DOMDocument.
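As a point of comparison, the preg_match_all route might look something like the sketch below. It rests on assumptions that go beyond the post: the function name findJobIdsRegex is hypothetical, and the pattern assumes the href is exactly jobview.aspx?id= followed by digits and that the anchor contains only the jID as text, as in the sample markup. Markup that deviates from that shape would be missed without warning, which is the usual trade-off of a regex against an HTML parser. Whether it is actually faster would have to be timed against the real page.

```php
<?php
// Hypothetical helper (not from the original post): regex-based variant.
function findJobIdsRegex(string $data, array $jobs): array
{
    // Group 1: the numeric id parameter. Group 2: the anchor text (the jID).
    $pattern = '~<a\s[^>]*href=["\']jobview\.aspx\?id=(\d+)["\'][^>]*>\s*([^<]+?)\s*</a>~i';
    preg_match_all($pattern, $data, $matches, PREG_SET_ORDER);

    // One pass over the matches: anchor text => id parameter.
    $idsByText = [];
    foreach ($matches as $m) {
        // Keep only the first occurrence of each jID.
        if (!isset($idsByText[$m[2]])) {
            $idsByText[$m[2]] = $m[1];
        }
    }

    // Cross-check the known jobs against the lookup table.
    $outcome = [];
    foreach ($jobs as $job) {
        if (isset($idsByText[$job['jID']])) {
            $outcome[] = ['jID' => $job['jID'], 'aID' => $idsByText[$job['jID']]];
        }
    }
    return $outcome;
}
```

Either way, the page is scanned once and each of the roughly 50 jIDs becomes a single array lookup, so the HTML is not searched again for every jID.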
