douzhimao8656 2013-07-23 00:48

Parsing a fire dispatch website feed into its discrete elements [closed]

I would like to parse the following website and separate each dispatch page into discrete elements, such as the time, date, address, and each individual unit dispatched to a call.

http://lebanonema.org/pager/html/monitor.html

I would then like to use these discrete elements to display the pages on a different website.

For example I would like to turn

this:

20:15:09 22-07-13 POCSAG-1 West Cornwall Township SPANGLER RD HORSESHOE PIKE MV - Accident w/Injuries **NON EMERGENCY RESPONSE*** TK5 Fire-Box 37-03 EMS-Box 190-7 Station 05

<tr>
<td class="COL2">20:15:09</td>
<td class="COL3">22-07-13</td>
<td class="COL4">POCSAG-1</td>
<td class="COL7">
West Cornwall Township SPANGLER RD HORSESHOE PIKE MV - Accident w/Injuries **NON EMERGENCY RESPONSE*** TK5 Fire-Box 37-03 EMS-Box 190-7
<span class="M">Station 05</span>
</td>
</tr>

into individual elements that I could somehow use on another website, such as the following:

time:20:15:09
date:22-07-13
pageid:POCSAG-1
address:West Cornwall Township SPANGLER RD HORSESHOE PIKE
incident:MV - Accident w/Injuries
additional_details:**NON EMERGENCY RESPONSE***
responding_unit_1:TK5
responding_unit_2:
responding_unit_3:
etc...
fire_box:37-03 
ems_box:190-7
station:05

I have moderate experience in HTML, CSS, and Java, and I am open to learning much more. If someone can provide a snippet of code that does what I am asking, I should be able to learn enough from it to complete the rest myself.

Please keep in mind that the site is constantly updated with new pages, so whatever method is used needs to accommodate such an environment.


1 answer

  • douxue4395 2013-07-23 12:02

    You are actually asking two questions here. One is how to parse HTML; that is outlined in "How do you parse and process HTML/XML in PHP?" and has been answered extensively, so I skip that part. The other is how to parse a string.

    How to parse a string depends entirely on its format. In PHP this is normally done with the string functions and the regular expression functions; consult the PHP manual for more information about these.
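
    As a minimal sketch of the regular-expression approach, the following pulls the two box numbers out of one dispatch text. The function name `parseBoxes` and the pattern are assumptions inferred from the sample rows in the question, not a documented format of the site.

    ```php
    <?php
    // Sketch only: the pattern is inferred from the sample rows in the question,
    // not from a documented format.
    function parseBoxes(string $text): array
    {
        $pattern = '/Fire-Box\s+(?<fire_box>\d+-\d+)\s+EMS-Box\s+(?<ems_box>\d+-\d+)/';

        // preg_match() returns 1 on a match and 0 otherwise.
        if (preg_match($pattern, $text, $m) !== 1) {
            return []; // e.g. free-text announcements carry no box numbers
        }

        return ['fire_box' => $m['fire_box'], 'ems_box' => $m['ems_box']];
    }

    $text = 'West Cornwall Township SPANGLER RD HORSESHOE PIKE MV - Accident w/Injuries '
          . '**NON EMERGENCY RESPONSE*** TK5 Fire-Box 37-03 EMS-Box 190-7';

    print_r(parseBoxes($text));
    ```

    For the sample row this yields `fire_box` 37-03 and `ems_box` 190-7; a message without box numbers, such as a training announcement, yields an empty array.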

    Besides the functions outlined above, you also need the format specification of the string. So far, your question only contains examples of the strings; what is missing is a specification of which part is what and what the decision criteria are.

    Write that specification first; I would do so before writing the first line of code. You can then implement it in any programming language you like. Whether that is PHP or Java is not that important; it matters much more that you have properly specified how the format works. You then encode that processing into code.
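
    To illustrate what encoding a specification into code might look like, here is a sketch for the unit list only. The three rules in the comment are a hypothetical specification guessed from the sample rows, and `parseUnits` is an illustrative name; the real rules would have to come from whoever operates the feed.

    ```php
    <?php
    // Hypothetical specification (an assumption, not confirmed by the site operator):
    //   1. The unit list directly precedes the literal "Fire-Box".
    //   2. A unit is a token of letters, an optional hyphen, then digits
    //      (TK5, FG-3, AmbCo190).
    //   3. Reading right to left, the list ends at the first token that breaks rule 2.
    function parseUnits(string $text): array
    {
        $head = strstr($text, 'Fire-Box', true); // text before "Fire-Box", or false
        if ($head === false) {
            return [];
        }

        $tokens = preg_split('/\s+/', trim($head));
        $units  = [];

        // Walk backwards from "Fire-Box" and collect tokens while rule 2 holds.
        for ($i = count($tokens) - 1; $i >= 0; $i--) {
            if (preg_match('/^[A-Za-z]+-?\d+$/', $tokens[$i]) !== 1) {
                break;
            }
            array_unshift($units, $tokens[$i]);
        }

        return $units;
    }

    $text = 'Jackson Township W LINCOLN AVE N LOCUST ST MV - Accident w/Injuries '
          . 'FG-3 E30 R31 Fire-Box 30-01 EMS-Box 140-2';

    echo implode(', ', parseUnits($text)), "\n";
    ```

    Whether a token such as FG-3 really is a responding unit is exactly the kind of decision the specification has to settle before the code can be trusted.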


    Some rough example code (excerpt), to demonstrate how it could be done in PHP:

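    $url = 'http://lebanonema.org/pager/html/monitor.html';

    // Fetch the raw page.
    $buffer = file_get_contents($url);

    // Convert to UTF-8; utf8_encode() assumes the input is ISO-8859-1.
    $buffer = utf8_encode($buffer);

    // Let Tidy repair the invalid HTML and emit well-formed XML.
    $config = [
        'doctype'    => 'omit',
        'output-xml' => 1,
    ];

    $buffer = tidy_repair_string($buffer, $config, 'utf8');

    $xml = simplexml_load_string($buffer);

    // DecoratingIterator, SimpleXMLXPathIterator and NodeParser are not
    // built-in PHP classes; their definitions are not part of this excerpt.
    // Each data row (a <tr> with more than one <td>) is wrapped in a NodeParser.
    $nodes = new DecoratingIterator(
        new SimpleXMLXPathIterator($xml, '//tr[count(td) > 1]'),
        'NodeParser'
    );

    foreach ($nodes as $index => $node) {
        echo $index, ': ', json_encode($node, JSON_PRETTY_PRINT), "\n";
    }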
    

    Example output:

    0: {
        "date": "23-07-13",
        "time": "07:56:28",
        "pageid": "POCSAG-1",
        "text": "Jackson Township W LINCOLN AVE N LOCUST ST MV - Accident w\/Injuries FG-3 E30 R31 Fire-Box 30-01 EMS-Box 140-2",
        "station": "Station 31"
    }
    1: {
        "date": "23-07-13",
        "time": "07:56:26",
        "pageid": "POCSAG-1",
        "text": "Jackson Township W LINCOLN AVE N LOCUST ST MV - Accident w\/Injuries FG-3 E30 R31 Fire-Box 30-01 EMS-Box 140-2",
        "station": "Station 30"
    }
    2: {
        "date": "23-07-13",
        "time": "07:56:25",
        "pageid": "POCSAG-1",
        "text": "Jackson Township W LINCOLN AVE N LOCUST ST MV - Accident w\/Injuries FG-3 E30 R31 Fire-Box 30-01 EMS-Box 140-2",
        "station": "Sta 31 Siren"
    }

    ...

    497: {
        "date": "22-07-13",
        "time": "12:21:27",
        "pageid": "POCSAG-1",
        "text": "South Lebanon Township 1700 S LINCOLN AVE VA Medical CenterAFA - Auto Fire Alarm FG-4 E25 E26 W36 R25 TK26 TK36 AmbCo190 Fire-Box 25-08 EMS-Box 190-4",
        "station": "Station 26"
    }
    498: {
        "date": "22-07-13",
        "time": "12:21:20",
        "pageid": "POCSAG-1",
        "text": "South Lebanon Township 1700 S LINCOLN AVE VA Medical CenterAFA - Auto Fire Alarm FG-4 E25 E26 W36 R25 TK26 TK36 AmbCo190 Fire-Box 25-08 EMS-Box 190-4",
        "station": "Station 25"
    }
    499: {
        "date": "22-07-13",
        "time": "12:18:19",
        "pageid": "POCSAG-1",
        "text": "Company 34 Correction..No Training TOMORROW night..Training Will Be Held Thursday At 1830",
        "station": "Station 34"
    }
    

    This example also shows that you need to deal with more than just the parsing: for example, cleaning up invalid HTML (in PHP, Tidy can be used for this) and dealing with charset encodings.
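
    If the Tidy extension is not available, one possible alternative is libxml's lenient HTML parser via `DOMDocument`. This is a swapped-in technique, not what the code above does, and it assumes the DOM extension is installed. The `$html` fragment is made up for illustration, with deliberately unclosed `<td>` tags; charset handling still has to be dealt with separately.

    ```php
    <?php
    // Assumption: the DOM extension is available. This swaps Tidy for libxml's
    // lenient HTML parser; the markup below is an invented, deliberately broken sample.
    $html = '<table><tr><td class="COL2">20:15:09<td class="COL3">22-07-13</table>';

    libxml_use_internal_errors(true); // collect parse warnings instead of printing them
    $doc = new DOMDocument();
    $doc->loadHTML($html);            // tolerates the unclosed <td> and <tr> tags
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);

    // Same row filter as in the SimpleXML example: rows with more than one cell.
    foreach ($xpath->query('//tr[count(td) > 1]') as $tr) {
        foreach ($xpath->query('td', $tr) as $td) {
            echo $td->getAttribute('class'), ': ', trim($td->textContent), "\n";
        }
    }
    ```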

    The NodeParser object just wraps a concrete <TR> element returned by the xpath() operation; this is basic SimpleXML parsing and has been outlined previously. As a bonus, the object implements the JsonSerializable interface so that it can easily be converted and displayed.
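
    The answer's own NodeParser class is not shown, so the following is only a hypothetical stand-in with the same output keys. The class names COL2, COL3, COL4, COL7 and M are taken from the sample row in the question; the helper `cellText` is an illustrative name.

    ```php
    <?php
    // Hypothetical stand-in: the answer's own NodeParser class is not shown.
    // The class names COL2, COL3, COL4, COL7 and M come from the question's sample row.
    final class NodeParser implements JsonSerializable
    {
        private SimpleXMLElement $row;

        public function __construct(SimpleXMLElement $row)
        {
            $this->row = $row;
        }

        // Direct text of the first <td> with the given class, whitespace collapsed.
        private function cellText(string $class): string
        {
            $cells = $this->row->xpath(sprintf('td[@class="%s"]', $class));
            if (!$cells) {
                return '';
            }
            return trim(preg_replace('/\s+/', ' ', (string) $cells[0]));
        }

        public function jsonSerialize(): array
        {
            $station = $this->row->xpath('td[@class="COL7"]/span[@class="M"]');

            return [
                'date'    => $this->cellText('COL3'),
                'time'    => $this->cellText('COL2'),
                'pageid'  => $this->cellText('COL4'),
                'text'    => $this->cellText('COL7'),
                'station' => $station ? trim((string) $station[0]) : '',
            ];
        }
    }

    $row = simplexml_load_string(
        '<tr><td class="COL2">20:15:09</td><td class="COL3">22-07-13</td>'
        . '<td class="COL4">POCSAG-1</td><td class="COL7">West Cornwall Township '
        . 'SPANGLER RD HORSESHOE PIKE MV - Accident w/Injuries TK5 Fire-Box 37-03 '
        . 'EMS-Box 190-7 <span class="M">Station 05</span></td></tr>'
    );

    echo json_encode(new NodeParser($row), JSON_PRETTY_PRINT), "\n";
    ```

    Without the iterator classes, the rows could be wrapped one by one in a plain loop over `$xml->xpath('//tr[count(td) > 1]')`.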

    Using a parser object allows you to change and tweak the parsing over time. For example, as this code shows, the text has not yet been parsed any further (because the specification is missing).

    I hope this helps and at least shows how it could be done.
