dsfs21312 2012-04-06 15:32

Getting today's "On this day in history" entries into a PHP array

I'm trying to get the four or five events that happened on this day in history and add a plaintext representation of each one to an array in PHP.

So far, I'm using this code:

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://en.wikipedia.org/w/api.php?action=featuredfeed&feed=onthisday&feedformat=rss');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_TIMEOUT, 3); // Timeout in seconds; expects an integer, not a string
curl_setopt($ch, CURLOPT_USERAGENT, 'My random user agent'); // Needed for Wikipedia to prevent IP blocking
$contents = trim(curl_exec($ch));
curl_close($ch);

$xml = simplexml_load_string($contents);
$json = json_encode($xml);
$array = json_decode($json, true);


$noOfDays = count($array['channel']['item']);
$r = $noOfDays - 1;
$input = $array['channel']['item'][$r]['description'];

I know this is not very dynamic or efficient, but only one person will be calling this page, once a day, so it's not terribly important.

At this point, $input contains a block of HTML, which looks something like this:

<p><b><a href="/wiki/April_6" title="April 6">April 6</a></b>: <b><a href="/wiki/Good_Friday" title="Good Friday">Good Friday</a></b> (Western Christianity, 2012); <b><a href="/wiki/Fast_of_the_Firstborn" title="Fast of the Firstborn">Fast of the Firstborn</a></b> begins at dawn and <b><a href="/wiki/Passover" title="Passover">Passover</a></b> begins at sunset (Judaism, 2012)
</p>
<div style="float:right;margin-left:0.5em">
<p><a href="/wiki/File:Sir_Arthur_Wellesley,_1st_Duke_of_Wellington.png" class="image" title="Arthur Wellesley, the Earl of Wellington"><img alt="Arthur Wellesley, the Earl of Wellington" src="//upload.wikimedia.org/wikipedia/commons/thumb/8/83/Sir_Arthur_Wellesley%2C_1st_Duke_of_Wellington.png/78px-Sir_Arthur_Wellesley%2C_1st_Duke_of_Wellington.png" width="78" height="100" /></a>
</p>
</div>
<li style="-moz-float-edge: content-box">
<a href="/wiki/1250" title="1250">1250</a> – <a href="/wiki/Seventh_Crusade" title="Seventh Crusade">Seventh Crusade</a>: Egyptian <a href="/wiki/Ayyubid" title="Ayyubid" class="mw-redirect">Ayyubids</a> <b><a href="/wiki/Battle_of_Fariskur" title="Battle of Fariskur">annihilated the crusader army</a></b> and captured King <a href="/wiki/Louis_IX_of_France" title="Louis IX of France">Louis&#160;IX of France</a> as a hostage.
<li style="-moz-float-edge: content-box">
<a href="/wiki/1320" title="1320">1320</a> – The <b><a href="/wiki/Declaration_of_Arbroath" title="Declaration of Arbroath">Declaration of Arbroath</a></b>, a declaration of <a href="/wiki/Scottish_independence" title="Scottish independence">Scottish independence</a>, was adopted.
<li style="-moz-float-edge: content-box">
<a href="/wiki/1812" title="1812">1812</a> – <a href="/wiki/Peninsular_War" title="Peninsular War">Peninsular War</a>: After a <b><a href="/wiki/Siege_of_Badajoz_(1812)" title="Siege of Badajoz (1812)">three-week siege</a></b>, the <a href="/wiki/Anglo-Portuguese_Army" title="Anglo-Portuguese Army">Anglo-Portuguese Army</a>, under the <a href="/wiki/Arthur_Wellesley,_1st_Duke_of_Wellington" title="Arthur Wellesley, 1st Duke of Wellington">Earl of Wellington</a> <i>(pictured)</i>, captured <a href="/wiki/Badajoz" title="Badajoz">Badajoz</a>, Spain and forced the surrender of the French garrison.
<li style="-moz-float-edge: content-box">
<a href="/wiki/1947" title="1947">1947</a> – The <a href="/wiki/1st_Tony_Awards" title="1st Tony Awards">first</a> <b><a href="/wiki/Tony_Award" title="Tony Award">Tony Awards</a></b>, recognizing achievement in live American <a href="/wiki/Theatre" title="Theatre">theatre</a>, were handed out at the <a href="/wiki/Waldorf-Astoria_Hotel" title="Waldorf-Astoria Hotel">Waldorf-Astoria Hotel</a> in <a href="/wiki/New_York_City" title="New York City">New York City</a>.
<li style="-moz-float-edge: content-box">
<a href="/wiki/2008" title="2008">2008</a> – Egyptian workers staged <b><a href="/wiki/2008_Egyptian_general_strike" title="2008 Egyptian general strike">an illegal general strike</a></b>, two days before <a href="/wiki/Egyptian_municipal_elections,_2008" title="Egyptian municipal elections, 2008">key municipal elections</a>.
</li>
</ul>
<p>More anniversaries: <span class="nowrap"><a href="/wiki/April_5" title="April 5">April 5</a> &#8211;</span> <span class="nowrap"><b><a href="/wiki/April_6" title="April 6">April 6</a></b> &#8211;</span> <span class="nowrap"><a href="/wiki/April_7" title="April 7">April 7</a></span>
</p>
<div style="text-align: right;" class="noprint"><span class="nowrap"><b><a href="/wiki/Wikipedia:Selected_anniversaries/April" title="Wikipedia:Selected anniversaries/April">Archive</a></b> &#8211;</span> <span class="nowrap"><b><a href="https://lists.wikimedia.org/mailman/listinfo/daily-article-l" class="extiw" title="mail:daily-article-l">By email</a></b> &#8211;</span> <span class="nowrap"><b><a href="/wiki/List_of_historical_anniversaries" title="List of historical anniversaries">List of historical anniversaries</a></b></span></div>
<div style="text-align: right;"><small>It is now <span class="nowrap">April 6, 2012</span> (<a href="/wiki/Coordinated_Universal_Time" title="Coordinated Universal Time">UTC</a>) &#8211; <span class="plainlinks" id="purgelink"><span class="nowrap"><a class="external text" href="//en.wikipedia.org/w/index.php?title=MediaWiki:Ffeed-onthisday-transcludeme&amp;action=purge">Refresh this page</a></span></span></small></div>

The only parts I'm interested in are the bits following each <li style="-moz-float-edge: content-box"> tag.

I've got no idea why they didn't close these <li> tags properly, but there you go.

So, in essence, what I want to do is take the actual information, strip away the links, and add each entry to an array, which should look something like this:

Array (
    [0] => 1250 – Seventh Crusade: Egyptian Ayyubids annihilated the crusader army and captured King Louis&#160;IX of France as a hostage.
    [1] => Next one...
    [2] => And another...
)

There's also a slight problem with the &#160; entity in this line. How would I translate that into plaintext? I have a feeling HTML parsing may be the answer.

I've already tried regex and HTML parsing, but as the tags don't close I've had some difficulty doing this.

Any suggestions?

1 answer

  • dongyakui8675 2012-04-06 16:08

    As @zzzzBov points out, closing tags such as </li> are optional in HTML (but not in XHTML). Unfortunately, this is one of several things that make HTML incompatible with XML (and XML parsers). For your task, I would recommend parsing the DOM with a library like phpQuery or PHP Simple HTML DOM Parser.

    In phpQuery your code would look something like this:

    $doc   = phpQuery::newDocumentHTML( $input );
    $items = $doc->find('li');
    
    foreach($items as $item) {
      echo pq($item)->text();
    }
    
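    // Or, to collect the text into an array. A plain loop is used here
    // because array_map() expects a real array, not a phpQuery result object.

    $items = array();
    foreach($doc->find('li') as $item) {
      $items[] = pq($item)->text();
    }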
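
If adding a third-party library is not an option, PHP's bundled DOM extension may be enough. This is a different technique from the phpQuery approach above, shown only as a minimal sketch. It assumes the input is UTF-8 and that libxml's HTML parser closes each open <li> when the next one starts. extractListItems() is a hypothetical helper name, and the sample string is a shortened version of the markup in the question.

```php
<?php
// Sketch only: extractListItems() is a hypothetical helper, not a library function.
function extractListItems($html)
{
    // Sloppy markup can trigger libxml warnings; keep them out of the output.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    // The XML declaration prefix is a common workaround that makes libxml
    // read the fragment as UTF-8 (assumed input encoding).
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $items = array();
    foreach ($doc->getElementsByTagName('li') as $li) {
        // textContent drops the markup and decodes entities such as &#160;,
        // which becomes a non-breaking space (bytes C2 A0 in UTF-8).
        $text = str_replace("\xC2\xA0", ' ', $li->textContent);
        // Collapse runs of whitespace and trim the ends.
        $items[] = trim(preg_replace('/\s+/u', ' ', $text));
    }
    return $items;
}

// Shortened sample with unclosed <li> tags, as in the feed.
$sample = '<li style="-moz-float-edge: content-box">' . "\n"
        . '<a href="/wiki/1250" title="1250">1250</a>: captured King '
        . '<a href="/wiki/Louis_IX_of_France">Louis&#160;IX of France</a> as a hostage.' . "\n"
        . '<li style="-moz-float-edge: content-box">' . "\n"
        . '<a href="/wiki/1320" title="1320">1320</a>: The <b>Declaration of Arbroath</b> was adopted.' . "\n"
        . '</li>';

print_r(extractListItems($sample));
```

In the question's code, the call would be extractListItems($input). Replacing the non-breaking space with a regular space is a design choice; drop the str_replace() call if the original spacing should be preserved.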
    

    As for &#160;, try html_entity_decode().
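
A minimal sketch of that suggestion, assuming the text is handled as UTF-8: &#160; is the numeric entity for a non-breaking space (U+00A0), so html_entity_decode() turns it into that character rather than an ordinary space. The variable names and the final str_replace() step are illustrative additions, not part of the original answer.

```php
<?php
// Fragment taken from the expected output in the question.
$raw = 'captured King Louis&#160;IX of France';

// Pass the charset explicitly: before PHP 5.4 the default was ISO-8859-1.
$decoded = html_entity_decode($raw, ENT_QUOTES, 'UTF-8');

// $decoded now contains U+00A0 (bytes C2 A0 in UTF-8) between "Louis" and "IX".
// Optionally swap it for a regular space to get plain ASCII spacing.
$plain = str_replace("\xC2\xA0", ' ', $decoded);

echo $plain, "\n";
```

If the non-breaking space is acceptable in the output, the decoded string can be used as-is.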

    This answer was accepted by the asker as the best answer.