drba1172 2015-02-05 06:56

PHP: file_get_contents and cURL give a 404 because of a "mysterious" line break

I am using PhpQuery to get all links with a specific class, and then I want to fetch the HTML source of each link so I can work with it.

First, some context: I am getting the links from a site I am researching. I do not have administrator rights on that site.

So, in order to do all my research in my localhost environment, I managed to change all the links from something like this:

<a class="linkHtml" href="search?q=HUGE_QUERY_HERE">link</a>

to:

<a class="linkHtml" href="http://www.domain.com/search?q=HUGE_QUERY_HERE">link</a>

using this PHP code here:

foreach (pq('.linkHtml') as $link) {
    $id = pq($link)->parent()->prev()->text();
    $search = 'search?q=';
    $replace = 'http://www.domain.com/busca/search?q=';
    $subject = pq($link)->attr('href');
    $pageUrl = str_replace($search, $replace, $subject);

    pq($link)->attr('href', $pageUrl);

    /* more code here */
}

The problem is that somehow the first ? is breaking the string. I can't even reproduce the error in text, so I'll have to upload pictures of it.

Considering the code above, if I do a var_dump($pageUrl) and try to connect, it results in this:

You can see that it looks like there is a line break after search?, even though the source clearly doesn't have one. I have already tried to remove all line breaks based on these answers and others, with no luck. It also tries to connect as if the URL ended at the question mark.

If I change the code to:

$pageUrl = str_replace("search?q=", "searchq=", $pageUrl);
var_dump($pageUrl);

the result is this:

As you can see, it now tries to connect to the full URL, but obviously searchq= is wrong.

What am I missing? Where does that line break come from? I said I cannot reproduce it in text because if I copy the URL it looks normal, as if nothing were there, and the site works normally.
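Since the character disappears when the URL is copied as text, one way to find out what it actually is would be to print the string with its control characters escaped. This is only a minimal sketch: the sample string and the helper name showHidden are made up for illustration, and it assumes the stray character is an ASCII control character such as a carriage return or line feed.

```php
<?php
// Hypothetical sample: a URL with a carriage return and line feed hidden after "search?".
$pageUrl = "http://www.domain.com/busca/search?\r\nq=HUGE_QUERY_HERE";

// Hypothetical helper: make control characters visible as escape sequences.
function showHidden(string $s): string
{
    // addcslashes() escapes the listed bytes: \0..\37 are the ASCII
    // control characters and \177 is DEL.
    return addcslashes($s, "\0..\37\177");
}

// Prints the URL with any hidden control characters written out as escapes.
echo showHidden($pageUrl), PHP_EOL;

// The byte length also gives it away: it is larger than the visible text suggests.
echo strlen($pageUrl), PHP_EOL;
```

If the escaped output shows something like \r or \n after the question mark, the href itself contains the break, and it has to be removed before the URL is passed to file_get_contents or cURL.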

EDIT: I also tried this, with no luck:

$pageUrl = urlencode($pageUrl);
$pageUrl = str_replace("%2f","/",$pageUrl); 
$pageUrl = str_replace("%3A",":",$pageUrl); 
$content = file_get_contents($pageUrl);
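This attempt probably could not have worked whatever the hidden character is. urlencode() escapes the whole URL, including ? and =, and it emits uppercase hex, so the lowercase "%2f" replacement never matches. The sketch below shows this with a made-up URL and query value. It also shows one possible alternative, encoding only the query value with http_build_query(), on the assumption that the base URL is known separately.

```php
<?php
// Hypothetical URL standing in for the real one.
$pageUrl = 'http://www.domain.com/busca/search?q=some value';

// Same steps as in the attempt above.
$encoded = urlencode($pageUrl);
$encoded = str_replace("%2f", "/", $encoded); // no match: urlencode() produces "%2F"
$encoded = str_replace("%3A", ":", $encoded);
echo $encoded, PHP_EOL; // the slashes, "?" and "=" are all still percent-encoded

// Alternative sketch: leave the URL structure alone and encode only the query value.
$base  = 'http://www.domain.com/busca/search';
$fixed = $base . '?' . http_build_query(['q' => 'some value']);
echo $fixed, PHP_EOL;
```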

After changing the strings to single quotes, a var_dump of each of them gives this:

[screenshot of the var_dump output]

Apologies for all the images, but I don't know a better way to show the exact problem. Sorry, too, for always blanking out the site's domain, which makes the images messy, but I have to.


1 answer

  • dpdt79577 2015-02-05 14:48

    I assume your URL contains some line-break characters that are breaking it. Try this:

    $findStringsArr = array("\0", "\n", "\r", "search?q=");
    $replaceStringsWithArr = array("", "", "", "http://www.domain.com/busca/search?q=");
    $subject = pq($link)->attr('href');
    $pageUrl = str_replace($findStringsArr, $replaceStringsWithArr, $subject);
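A more general variant of the same idea is to strip every ASCII control character rather than listing them one by one. This is a minimal sketch and not part of PhpQuery. The helper name cleanHref and the sample href are made up, and it assumes the relative href only needs the base URL prepended.

```php
<?php
// Hypothetical helper: remove control characters from an href and
// turn it into an absolute URL by prepending a base URL.
function cleanHref(string $href, string $baseUrl): string
{
    // Strip ASCII control characters (0x00-0x1F and 0x7F); this covers \0, \n, \r and \t.
    $href = (string) preg_replace('/[\x00-\x1F\x7F]/', '', $href);
    // Trim ordinary whitespace left at either end.
    $href = trim($href);

    return $baseUrl . $href;
}

// Made-up href with a hidden line break after the question mark.
$pageUrl = cleanHref("search?\r\nq=HUGE_QUERY_HERE", 'http://www.domain.com/busca/');
var_dump($pageUrl);
```

Inside the loop this would replace the str_replace call, for example $pageUrl = cleanHref(pq($link)->attr('href'), 'http://www.domain.com/busca/');. If the hidden character turns out to be something other than an ASCII control character (a Unicode line separator, say), the pattern would need to be extended.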
    