dongzhuo5425 2019-02-28 13:30
Views: 62

Scraping Instagram with PHP - WordPress plugin

I'm trying to retrieve a user's pictures from Instagram without an access token. To do that, I built a scraper in PHP. It works quite well, but sometimes, and only with some accounts, it doesn't work.

Here is the function that scrapes Instagram:

    function get_instagram_feed( $number, $username ) {
        error_reporting(0);

        require 'simple-cache.php';
        $cacheFolder = 'instagram-cache';

        $user = strtolower( $username );
        if (!file_exists($cacheFolder)) {
            mkdir($cacheFolder, 0777, true);
        }
        $cache = new Gilbitron\Util\SimpleCache();
        $cache->cache_path = $cacheFolder . '/';
        $cache->cache_time = 3600;
        $scraped_website = $cache->get_data("user-$user", "https://www.instagram.com/$user/");
        $document = new DOMDocument();
        libxml_use_internal_errors(true);
        $document->loadHTML($scraped_website);
        libxml_use_internal_errors(false);
        $selector = new DOMXPath($document);
        $anchors = $selector->query('/html/body//script');

        $images = array();
        $insta_feed = array();

        foreach($anchors as $a) {
            $text = $a->nodeValue;
            preg_match('/window._sharedData = \{(.*?)\};/', $text, $matches);
            $json = json_decode('{' . $matches[1] . '}', true);
            $images[] = $json['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media']['edges'];
        }

        for ( $i = 0; $i < count($images); $i++ ) {
            $insta_feed[] = array(
                'thumbnail' => $images[0][$i]['node']['thumbnail_resources'][0]['src'],
                'small' => $images[0][$i]['node']['thumbnail_resources'][1]['src'],
                'medium' => $images[0][$i]['node']['thumbnail_resources'][2]['src'],
                'large' => $images[0][$i]['node']['thumbnail_resources'][4]['src'],
                'original' => $images[0][$i]['node']['display_url'],
                'link'  => trailingslashit( '//instagram.com/p/' . $images[0][$i]['node']['shortcode'] ),
                'caption' => $images[0][$i]['node']['edge_media_to_caption']['edges'][0]['node']['text']
            );
        }

        if ( !empty( $insta_feed ) ) {
            return ( $number ) ? array_slice( $insta_feed, 0, $number ) : $insta_feed;
        }
    }

When it works, $a in the foreach loop is a DOMElement object that I can navigate to get the image URLs.
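For the working case, a minimal sketch of pulling the window._sharedData JSON out of one script's text might look like the following. The helper name extract_shared_data is hypothetical (it is not part of the function above), and the sketch assumes the page still embeds the data as `window._sharedData = {...};`, which is what the original regular expression expects. It uses a greedy match so that a `};` inside a caption does not cut the JSON short; that is a choice made for this sketch, not something the original code does.

```php
<?php
// Hypothetical helper, not part of the original function.
// Returns the decoded _sharedData array, or null when the script text
// does not contain the assignment or the JSON cannot be decoded.
function extract_shared_data(string $scriptText): ?array
{
    // Capture the object literal assigned to window._sharedData.
    // The /s flag lets "." span newlines inside the script body.
    if (!preg_match('/window\._sharedData\s*=\s*(\{.*\});/s', $scriptText, $matches)) {
        return null; // this script is not the one carrying the data
    }

    $json = json_decode($matches[1], true);

    // json_decode() returns null on malformed input, so normalise to null.
    return is_array($json) ? $json : null;
}
```

Returning null instead of indexing into an unset $matches[1] makes the "no data in this script" case explicit, so the caller can skip that node.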

When it doesn't work, my $a looks like this:

DOMElement Object ( [tagName] => script [schemaTypeInfo] => [nodeName] => script [nodeValue] => (function(){ function normalizeError(err) { var errorInfo = err.error || {}; var getConfigProp = function(propName, defaultValueIfNotTruthy) { var propValue = window._sharedData && window._sharedData[propName]; return propValue ? propValue : defaultValueIfNotTruthy; }; return { line: err.line || errorInfo.message || 0, column: err.column || 0, name: 'InitError', message: err.message || errorInfo.message || '', script: errorInfo.script || '', stack: errorInfo.stackTrace || errorInfo.stack || '', timestamp: Date.now(), ref: window.location.href, deployment_stage: getConfigProp('deployment_stage', ''), is_canary: getConfigProp('is_canary', false), rollout_hash: getConfigProp('rollout_hash', ''), is_prerelease: window.__PRERELEASE__ || false, bundle_variant: getConfigProp('bundle_variant', null), request_url: err.url || window.location.href, response_status_code: errorInfo.statusCode || 0 } } window.addEventListener('load', function(){ if (window.__bufferedErrors && window.__bufferedErrors.length) { if (window.caches && window.caches.keys && window.caches.delete) { window.caches.keys().then(function(keys) { keys.forEach(function(key) { window.caches.delete(key) }) }) } window.__bufferedErrors.map(function(error) { return normalizeError(error) }).forEach(function(normalizedError) { var request = new XMLHttpRequest(); request.open('POST', '/client_error/', true); request.setRequestHeader('Content-Type', 'application/json; charset=utf-8'); request.send(JSON.stringify(normalizedError)); }) } }) }()); [nodeType] => 1 [parentNode] => (object value omitted) [childNodes] => (object value omitted) [firstChild] => (object value omitted) [lastChild] => (object value omitted) [previousSibling] => (object value omitted) [nextSibling] => [attributes] => (object value omitted) [ownerDocument] => (object value omitted) [namespaceURI] => [prefix] => [localName] => script [baseURI] => 
[textContent] => (function(){ function normalizeError(err) { var errorInfo = err.error || {}; var getConfigProp = function(propName, defaultValueIfNotTruthy) { var propValue = window._sharedData && window._sharedData[propName]; return propValue ? propValue : defaultValueIfNotTruthy; }; return { line: err.line || errorInfo.message || 0, column: err.column || 0, name: 'InitError', message: err.message || errorInfo.message || '', script: errorInfo.script || '', stack: errorInfo.stackTrace || errorInfo.stack || '', timestamp: Date.now(), ref: window.location.href, deployment_stage: getConfigProp('deployment_stage', ''), is_canary: getConfigProp('is_canary', false), rollout_hash: getConfigProp('rollout_hash', ''), is_prerelease: window.__PRERELEASE__ || false, bundle_variant: getConfigProp('bundle_variant', null), request_url: err.url || window.location.href, response_status_code: errorInfo.statusCode || 0 } } window.addEventListener('load', function(){ if (window.__bufferedErrors && window.__bufferedErrors.length) { if (window.caches && window.caches.keys && window.caches.delete) { window.caches.keys().then(function(keys) { keys.forEach(function(key) { window.caches.delete(key) }) }) } window.__bufferedErrors.map(function(error) { return normalizeError(error) }).forEach(function(normalizedError) { var request = new XMLHttpRequest(); request.open('POST', '/client_error/', true); request.setRequestHeader('Content-Type', 'application/json; charset=utf-8'); request.send(JSON.stringify(normalizedError)); }) } }) }()); )
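
The script dumped above only reads window._sharedData (inside getConfigProp); it never assigns it, so the regular expression in the function finds nothing in that node and $matches[1] is unset. Because the loop appends to $images for every script, $images[0] can end up holding the result of a script like this one. One possible guard, sketched here on the assumption that the JSON layout is the one the function already relies on, is to skip scripts without the assignment and return an empty list when none of them carries the data. The name find_timeline_edges is hypothetical.

```php
<?php
// Hypothetical helper: scan the text of every <script> node and return the
// timeline edges from the first one that assigns window._sharedData.
// Returns an empty array when no script carries the data.
function find_timeline_edges(array $scriptTexts): array
{
    foreach ($scriptTexts as $text) {
        if (!preg_match('/window\._sharedData\s*=\s*(\{.*\});/s', $text, $matches)) {
            continue; // e.g. the error-reporting script shown in the dump
        }

        $json = json_decode($matches[1], true);

        // Same path as the original function; ?? yields null if any key is missing.
        $edges = $json['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media']['edges'] ?? null;

        if (is_array($edges)) {
            return $edges;
        }
    }

    return [];
}

// Usage with made-up script contents: the first entry mimics the dump, the second carries data.
$scripts = [
    'var p = window._sharedData && window._sharedData[propName];',
    'window._sharedData = {"entry_data":{"ProfilePage":[{"graphql":{"user":{"edge_owner_to_timeline_media":{"edges":[{"node":{"shortcode":"abc"}}]}}}}]}};',
];
$edges = find_timeline_edges($scripts);
```

An empty result then means the fetched page did not contain the profile data at all, which is worth logging before the response is cached for an hour.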

I really can't understand it. From localhost it always works like a charm. On a live website, and with some accounts, I get this DOMElement object without any useful data. It happens mostly with verified accounts.

Can somebody help me with this little challenge? Thanks.
