duanmeng1858 2017-09-16 18:47

Getting text from a .jsonp URL with PHP

I'm trying to get the plain text from this webpage: https://html2-f.scribdassets.com/55ssxtbbb45pk2eg/pages/319-42c28ee981.jsonp which, upon inspection, is a callback function that inserts HTML. I want to scrape the page, reformat the text so that it is comprehensible, and render the HTML instead of showing it as plain text.

PHP:

echo file_get_contents("https://html2-f.scribdassets.com/55ssxtbbb45pk2eg/pages/319-42c28ee981.jsonp");

The returned text is a complete mess:

����X321-5db7e88872.jsonp�Y]n�6���E�ıH�;��E�@���b�PM��%�f#K�H��}�;�z���:�eG"e��:@�E����j��XޖdJ���$�&$~����>a�8#��p�ӥy��X��8�r��(#kZ���85�j�A�%��������Ȇ�...

Whereas it should look like this:

"<div class=\"newpage\" id=\"page319\" style=\"width: 902px; height:1167px\">
<div class=text_layer style=\"z-index:2\"><div class=ie_fix>
&nbsp;
<div class=\"ff81\" style=\"font-size:114px\">
<span class=a style=\"left:331px;top:75px;color:#ffffff\">1<span class=w9></span>3</span></div>...

Although I could manually copy and paste the text from the webpage into a text editor for later use, I would like to eliminate this step, as I'll need to do this for 320 pages.

Is there a workaround for .jsonp URLs? Or is the data encrypted by the server? (I just don't know.)


1 answer

  • douxing5598 2017-09-16 19:08

    The response is gzip-compressed. You can see it in the response headers:

    Content-Encoding: gzip
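
If the headers are not at hand, the same thing can be checked from the bytes themselves: a gzip stream starts with the two magic bytes 0x1f 0x8b. The following is a minimal sketch, not part of the original answer. The helper name `looksGzipped` and the sample payload are made up for illustration, and the zlib extension is assumed to be available.

```php
<?php
// Illustrative stand-in for the raw bytes that file_get_contents() returned.
// The payload text here is invented; the real one comes from the server.
$raw = gzencode('callback("<div class=\"newpage\"></div>");');

// Hypothetical helper: a gzip stream always begins with the bytes 0x1f 0x8b.
function looksGzipped(string $bytes): bool
{
    return strncmp($bytes, "\x1f\x8b", 2) === 0;
}

// Decompress only when the payload is actually gzip data.
$text = looksGzipped($raw) ? gzdecode($raw) : $raw;

echo $text, PHP_EOL;
```

Applied to the output of the plain `file_get_contents()` call from the question, this would turn the unreadable bytes back into the JSONP text.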
    

    So you need to decompress it. You can either change your approach and use cURL, or keep file_get_contents() and use the compress.zlib:// stream wrapper. Just prepend the wrapper to the URL:

    echo file_get_contents("compress.zlib://https://html2-f.scribdassets.com/55ssxtbbb45pk2eg/pages/319-42c28ee981.jsonp");
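
For the cURL route mentioned above, a rough sketch could look like the following. It is not taken from the original answer: the function name `fetchDecoded` and the timeout value are assumptions, and the curl extension must be installed. Setting `CURLOPT_ENCODING` to an empty string asks cURL to accept every encoding it supports and to decode the body automatically.

```php
<?php
// Hypothetical helper: fetch a URL and let cURL undo any content encoding.
function fetchDecoded(string $url): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_ENCODING       => '',   // accept gzip/deflate and decode automatically
        CURLOPT_FOLLOWLOCATION => true, // follow redirects, if any
        CURLOPT_TIMEOUT        => 30,   // assumed value, adjust as needed
    ]);

    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('Request failed: ' . curl_error($ch));
    }

    return $body;
}
```

Calling `echo fetchDecoded("https://html2-f.scribdassets.com/55ssxtbbb45pk2eg/pages/319-42c28ee981.jsonp");` should then print the same JSONP text as the compress.zlib:// version.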
    

    That will get you the correct response. Note that this is still a JSONP response, so it is in the form of a callback. You need to decide what to do with it.
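
One possible way to handle the callback is to strip the wrapper and decode the argument as JSON. This is only a sketch under assumptions: the helper name `unwrapJsonp`, the callback name `cb`, and the sample payload are invented, and the argument is assumed to be a single valid JSON value (a string of HTML, judging by the excerpt in the question).

```php
<?php
// Hypothetical helper: remove the callback wrapper from a JSONP payload
// and decode the JSON value passed to the callback.
function unwrapJsonp(string $jsonp)
{
    // Match  name( ... )  with an optional trailing semicolon.
    if (!preg_match('/^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$/s', $jsonp, $m)) {
        throw new InvalidArgumentException('Not a JSONP payload');
    }

    $value = json_decode($m[1], true);
    if (json_last_error() !== JSON_ERROR_NONE) {
        throw new UnexpectedValueException('Argument is not valid JSON: ' . json_last_error_msg());
    }

    return $value;
}

// Made-up sample shaped like the excerpt in the question.
$sample = 'cb("<div class=\"newpage\" id=\"page319\">13</div>");';

// The decoded value is the HTML with its quotes unescaped.
echo unwrapJsonp($sample), PHP_EOL;
```

If the real payload is not strict JSON, `json_decode()` will fail and the argument would need to be cleaned up by other means.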

    This answer was selected as the best answer by the asker.
