dtf24224 2016-11-27 15:33

too long

I'm working on a tool for Wikipedia. I retrieve the page https://de.wikipedia.org/wiki/Spezial:Linkliste/Hans_Jansen_(Arabist) with file_get_contents, then extract all list items by locating the list and exploding it at the list-item delimiter.

Afterwards I want to retrieve the article texts named after the list items. For that I do

 file_get_contents("https://de.wikipedia.org/w/index.php?action=raw&title=".urlencode($article));

Everything works until the article named Ka'b ibn As'ad, which leads to retrieval of

https://de.wikipedia.org/w/index.php?action=raw&title=Ka

When I copy the article name as plain text, everything goes well:

 $article = "Ka'b ibn As'ad";
 $page = "https://".$server."/w/index.php?action=raw&title=".urlencode($article);

Comparing the output of urlencode for $article typed manually and retrieved from website shows the difference:

  manually: Ka%27b+ibn+As%27ad
  website:  Ka%26%23039%3Bb%20ibn%20As%26%23039%3Bad

Comparing the output with htmlspecialchars() makes the difference even more obvious:

  manually: Ka'b ibn As'ad
  website:  Ka&#039;b ibn As&#039;ad

How do I get rid of those &#039; special characters? Apparently htmlspecialchars_decode() does not work.


  • dongwen3093 2016-11-27 15:33

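    htmlspecialchars_decode() only converts a small fixed set of HTML entities (&amp;, &quot;, &lt;, &gt;, and &#039; only when ENT_QUOTES is set). With the default flags before PHP 8.1 (ENT_COMPAT), the numeric entity &#039; is left untouched. Use html_entity_decode() with the ENT_QUOTES flag for this.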
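
A minimal sketch of the decode-then-encode step, assuming the scraped title still contains the numeric entity shown in the question. The variable name $scraped is hypothetical and stands in for one list item extracted from the page. ENT_QUOTES is passed explicitly so that single-quote entities are decoded on PHP versions before 8.1 as well.

```php
<?php
// Hypothetical input: a title as scraped from the HTML, with the
// apostrophes still encoded as numeric entities.
$scraped = 'Ka&#039;b ibn As&#039;ad';

// Decode HTML entities back to plain characters. ENT_QUOTES ensures
// that both double- and single-quote entities are converted.
$article = html_entity_decode($scraped, ENT_QUOTES, 'UTF-8');

// Only now URL-encode the plain title for the raw-page request.
$url = 'https://de.wikipedia.org/w/index.php?action=raw&title=' . urlencode($article);

echo $article, "\n";
echo $url, "\n";
```

The decoded title can then be passed to file_get_contents($url) as in the question. Decoding has to happen before urlencode(), otherwise the "&" of the entity ends up in the query string and cuts the title parameter off after "Ka".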

    Accepted by the asker as the best answer.
