2017-09-16 18:47



I'm trying to get the plain text from this webpage: https://html2-f.scribdassets.com/55ssxtbbb45pk2eg/pages/319-42c28ee981.jsonp which, upon inspection, is a callback function that inserts HTML. I want to scrape the page and reformat the text so that it is comprehensible and renders as HTML instead of being shown as plain text.


echo file_get_contents("https://html2-f.scribdassets.com/55ssxtbbb45pk2eg/pages/319-42c28ee981.jsonp");

The returned text is a complete mess.


Whereas it should look like this:

"<div class=\"newpage\" id=\"page319\" style=\"width: 902px; height:1167px\">
<div class=text_layer style=\"z-index:2\"><div class=ie_fix>
<div class=\"ff81\" style=\"font-size:114px\">
<span class=a style=\"left:331px;top:75px;color:#ffffff\">1<span class=w9></span>3</span></div>...

Although I could manually copy/paste the text from the webpage into a text editor for future usage, I would like to eliminate this step as I'll need to do this for 320 pages.

Is there a workaround for .jsonp URLs? Or is the data encrypted by the server? (I just don't know.)



  • douxing5598, 4 years ago

    The response is gzip'd. You can see it in the response headers:

    Content-Encoding: gzip

    So you need to decompress it, because file_get_contents() does not decode the content encoding for you. You can do this either by switching to cURL, or by using the compress.zlib:// stream wrapper. Just prepend the wrapper to the URL:

    echo file_get_contents("compress.zlib://https://html2-f.scribdassets.com/55ssxtbbb45pk2eg/pages/319-42c28ee981.jsonp");

    That will give you the correct response. Note that this is still a JSONP response, so it is in the form of a callback. You need to decide what to do with it.
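
    One possible way to handle the callback is to strip the wrapper and decode the JSON argument. This is only a sketch: the helper name unwrapJsonp() and the sample payload are made up for illustration, and it assumes the response has the shape callbackName(<json>); with a single JSON argument. The real callback name and payload layout should be checked against the actual response.

```php
<?php
// Hypothetical helper (not part of PHP): removes a JSONP wrapper of the
// assumed form `callbackName( <json> );` and decodes the JSON inside it.
// Returns null when no parentheses are found or the JSON is invalid.
function unwrapJsonp(string $jsonp)
{
    $start = strpos($jsonp, '(');   // first opening parenthesis
    $end   = strrpos($jsonp, ')');  // last closing parenthesis
    if ($start === false || $end === false || $end < $start) {
        return null;
    }
    $json = substr($jsonp, $start + 1, $end - $start - 1);
    return json_decode($json, true);
}

// Made-up sample payload, assumed to resemble the real response.
$sample  = 'window.page319_callback(["<div class=\"newpage\" id=\"page319\">13</div>"]);';
$decoded = unwrapJsonp($sample);

// The argument may be a single string or an array of strings; join arrays.
$html = is_array($decoded) ? implode('', $decoded) : (string) $decoded;
echo $html;
```

    To use it on the real page, pass the string returned by file_get_contents() with the compress.zlib:// prefix to unwrapJsonp() instead of $sample.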
