doukeyong3746487 2013-01-19 21:30
Views: 63

Using cURL to retrieve a website and bypass the same-origin restriction, injecting JavaScript

I need to load several websites in iframes and also add the Google Translate plugin to each page so that it can be translated. Here's my code for the insertion part:

<iframe onload="googleJS1(); googleJS2(); googleJS3();" class=iframe2 src=http://localhost:8888/mysitep></iframe>

<script>
    function googleJS1() {
        var iframe = document.getElementsByTagName('iframe')[0];
        var doc = iframe.contentWindow.document;
        var newScript = doc.createElement('div');
        newScript.setAttribute("id", "google_translate_element");
        var bodyClass = doc.getElementsByTagName('body')[0];
        bodyClass.insertBefore(newScript, bodyClass.childNodes[0]);
    }

    function googleJS2() {
        var iframe = document.getElementsByTagName('iframe')[0];
        var doc = iframe.contentWindow.document;
        var newScript = doc.createElement('script');
        newScript.setAttribute("src", "http://translate.google.com/translate_a/element.js?cb=googleTranslateElementInit");
        var bodyClass = doc.getElementsByTagName('head')[0];
        bodyClass.insertBefore(newScript, bodyClass.childNodes[1]);
    }

    function googleJS3() {
        var iframe = document.getElementsByTagName('iframe')[0];
        var doc = iframe.contentWindow.document;
        var newScript = doc.createElement('script');
        newScript.setAttribute("src", "http://localhost:8888/mysite/google.js");
        var bodyClass = doc.getElementsByTagName('head')[0];
        bodyClass.insertBefore(newScript, bodyClass.childNodes[2]);
    }
</script>
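
For context, the functions above reach into iframe.contentWindow.document, which browsers only allow when the framed page has the same origin (scheme, host, and port) as the parent page. A minimal sketch of that comparison, using the standard URL class; isSameOrigin is a hypothetical helper name, and the URLs are simply the ones that appear in this question:

```javascript
// Two URLs share an origin when scheme, host, and port all match.
// isSameOrigin is a hypothetical helper, not part of any browser API.
function isSameOrigin(a, b) {
    return new URL(a).origin === new URL(b).origin;
}

// Parent page and iframe both served from localhost:8888.
console.log(isSameOrigin("http://localhost:8888/index.html",
                         "http://localhost:8888/mysite/google.js"));

// Parent page on localhost, iframe pointing at an external site.
console.log(isSameOrigin("http://localhost:8888/index.html",
                         "http://www.selfridges.com/"));
```

Only in the first case would the parent page be allowed to touch the iframe's document.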

This works as long as the iframe's target URL is on the same server. I have read that to get around the same-origin policy I should set up a proxy server and route all URL requests through it. So I read up on cURL and tried this as a test:

<?php

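function get_data($url) {
    $ch = curl_init();
    $timeout = 5; // connection timeout in seconds
    // $userAgent was never defined in the original; this value is only a placeholder.
    $userAgent = 'Mozilla/5.0 (compatible; proxy-test)';
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);   // return the body instead of printing it
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow HTTP redirects
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}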

$test = get_data("http://www.selfridges.com");
echo $test;

?>

The basic HTML elements load, but no CSS or images do, and the links still point to the original URL. I need suggestions on how to also pull the CSS, images, and JS from the target URL through the proxy and serve the pages from there, so that they appear to come from the same domain and port and the same-origin policy no longer applies. I also need the links to work the same way.

e.g:

main page - http://localhost:8888/proxy.php 

links     - http://localhost:8888/proxy.php/products/2012/shoes
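
One way to read the mapping above, as a rough sketch: every link that points at the target site is rewritten to hang off proxy.php, and the proxy turns the trailing path back into a target URL before fetching it. The names PROXY_BASE, TARGET_ORIGIN, toProxyUrl, and toTargetUrl are hypothetical. The sketch also assumes a single target site, and that the server hands the part after proxy.php to the script (in PHP this is commonly exposed as PATH_INFO, depending on configuration).

```javascript
// Hypothetical values: the proxy entry point and the one site being proxied.
const PROXY_BASE = "http://localhost:8888/proxy.php";
const TARGET_ORIGIN = "http://www.selfridges.com";

// Turn a link found in a fetched page into a link that goes back through the proxy.
// pageUrl is the target-site URL of the page the link was found on.
function toProxyUrl(href, pageUrl) {
    const absolute = new URL(href, pageUrl);   // resolves relative links
    if (absolute.origin !== TARGET_ORIGIN) {
        return absolute.href;                  // leave off-site links untouched
    }
    return PROXY_BASE + absolute.pathname + absolute.search + absolute.hash;
}

// Reverse direction: which target URL the proxy should fetch for a given trailing path.
function toTargetUrl(pathInfo) {
    return TARGET_ORIGIN + pathInfo;
}

console.log(toProxyUrl("/products/2012/shoes", "http://www.selfridges.com/"));
console.log(toTargetUrl("/products/2012/shoes"));
```

Query strings need extra care in a real proxy, since anything after `?` reaches the script as its own query parameters rather than as part of the trailing path.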

Any other methods or alternatives are also welcome.

Thanks

1 answer

  • dsq2015 2013-01-19 21:47

    Assuming all the links and images in your target documents are relative, you could inject a base tag into the head. Relative URLs would then resolve against the target domain (not yours), which effectively makes the links and images absolute.

    http://reference.sitepoint.com/html/base

    Not sure how this would work with CSS images, though.
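
The base tag idea could be sketched as a plain string operation applied to the HTML the proxy fetched, before it is sent to the browser. This is only an illustration: injectBase is a hypothetical helper, it assumes the page has an opening head tag, and it assumes baseHref contains no double quotes.

```javascript
// Insert <base href="..."> right after the opening <head> tag of an HTML string.
// injectBase is a hypothetical helper; baseHref is assumed to contain no double quotes.
function injectBase(html, baseHref) {
    const tag = '<base href="' + baseHref + '">';
    // Matches <head> or <head ...attributes>, but not <header>.
    const headOpen = /<head(\s[^>]*)?>/i;
    if (!headOpen.test(html)) {
        return html;   // no <head> found: leave the document unchanged
    }
    // A function replacement avoids "$" in baseHref being treated specially.
    return html.replace(headOpen, (match) => match + tag);
}

const page = "<html><head><title>Shop</title></head><body><a href=\"shoes\">Shoes</a></body></html>";
console.log(injectBase(page, "http://www.selfridges.com/"));
```

Placing the tag first in the head matters, because a base element only affects URLs that appear after it in the document.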

    A solution that works consistently for any target site is going to be tough: you'll need to parse out links not only in the HTML but also in any CSS references. Some sites might use AJAX to populate their pages, which will cause same-origin policy issues with the target site too.
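
For the CSS side, a rough sketch of what "parsing out" references might look like. Relative url(...) references in an external stylesheet resolve against the stylesheet's own URL, so each one is made absolute first and then handed to a caller-supplied rewrite function. rewriteCssUrls and the rewrite callback are hypothetical names, the regular expression is a simplification rather than a full CSS parser, and @import rules written without url(...) are not covered.

```javascript
// Rewrite every url(...) reference in a stylesheet.
// css:           stylesheet text as fetched by the proxy
// stylesheetUrl: absolute URL the stylesheet was fetched from
// rewrite:       caller-supplied function mapping an absolute URL to its replacement
function rewriteCssUrls(css, stylesheetUrl, rewrite) {
    // url( optional-quote reference same-quote ); simplified, not a real CSS tokenizer.
    const urlRef = /url\(\s*(['"]?)([^'")]+)\1\s*\)/gi;
    return css.replace(urlRef, (match, quote, ref) => {
        if (/^data:/i.test(ref)) {
            return match;                          // inline data: leave untouched
        }
        const absolute = new URL(ref, stylesheetUrl).href;
        return "url(" + quote + rewrite(absolute) + quote + ")";
    });
}

const css = 'body { background: url("img/bg.png"); }';
console.log(rewriteCssUrls(css, "http://www.selfridges.com/css/main.css",
    (u) => "http://localhost:8888/proxy.php" + new URL(u).pathname));
```

The same approach does nothing for pages that build their content with AJAX, which remains the hard part noted above.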
