duanchu7271 2017-09-11 14:35
Accepted

Scraping the contents of a <script> tag from a web page's source

I'm looking for a way to scrape data from a page's source code. The information I need is inside a <script> tag similar to this:

<script>
.......
var playerIdMap = {};
playerIdMap['4'] = '614';
playerIdMap['5'] = '84';
playerIdMap['6'] = '65';
playerIdMap['7'] = '701';
getPlayerIdMap = function() { return playerIdMap; };   // global
}
enclosePlayerMap();
</script>

I am trying to grab the playerIdMap entries, e.g. the key 4 and the value 614, or the whole line for that matter.

1 answer

  • doutai1509 2017-09-11 16:25

    Edit-2

    Complete PHP code, adapted from the code in "How to get data from API - php - curl":

    <?php
    /**
     * Handles making a cURL request
     *
     * @param string $url         URL to call out to for information.
     * @param bool   $callDetails Optional condition to allow for extended
     *   information return including error and getinfo details.
     *
     * @return array $returnGroup cURL response and optional details.
     */
    function makeRequest($url, $callDetails = false)
    {
      // Set handle
      $ch = curl_init($url);
    
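      // Set options. Disabling peer verification is insecure for https URLs;
      // it has no effect on the plain http URL used below.
      curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
      curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);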
    
      // Execute the cURL handle and add the result to the return array.
      // Note: curl_exec() returns false on failure.
      $result = curl_exec($ch);
      $returnGroup = ['curlResult' => $result];
    
      // If details of the cURL execution were requested, add them to the return group.
      if ($callDetails) {
        $returnGroup['info'] = curl_getinfo($ch);
        $returnGroup['errno'] = curl_errno($ch);
        $returnGroup['error'] = curl_error($ch);
      }
    
      // Close cURL and return response.
      curl_close($ch);
      return $returnGroup;
    }
    
    $url = "http://www.bullshooterlive.com/my-stats/999/";
    $response = makeRequest($url, true);
    
    $re = '/playerIdMap\[\'(?P<id>\d+)\']\s+=\s+\'(?P<value>\d+)\'/';
    
    preg_match_all($re, $response['curlResult'], $matches, PREG_SET_ORDER, 0);
    
    // Print the entire match result
    var_dump($matches);
    
    //var_dump($response);
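
For comparison with the Python snippets further down, here is a minimal Python sketch of the post-processing step: it turns the regex matches into a plain key-to-value mapping instead of dumping the raw match arrays. The function name `extract_player_id_map` and the sample string are assumptions made for illustration; the pattern is the one from the answer, with the closing bracket escaped for readability.

```python
import re

# Same pattern as in the answer; only the closing bracket is escaped.
PLAYER_ID_RE = re.compile(r"playerIdMap\['(?P<id>\d+)'\]\s+=\s+'(?P<value>\d+)'")


def extract_player_id_map(html):
    """Return {key: value} for every playerIdMap['k'] = 'v' assignment found in html."""
    return {m.group("id"): m.group("value") for m in PLAYER_ID_RE.finditer(html)}


# Hypothetical sample input, copied from the snippet in the question.
sample = """<script>
var playerIdMap = {};
playerIdMap['4'] = '614';
playerIdMap['5'] = '84';
playerIdMap['6'] = '65';
playerIdMap['7'] = '701';
getPlayerIdMap = function() { return playerIdMap; };   // global
</script>"""

print(extract_player_id_map(sample))
# {'4': '614', '5': '84', '6': '65', '7': '701'}
```

Keys and values stay strings here, as in the page source; wrap them in `int()` if numeric IDs are needed.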
    

    Edit-1

    Sorry, I didn't realize you had asked a PHP question; I'm not sure why I assumed Scrapy. Anyway, the PHP code below should help.

    $re = '/playerIdMap\[\'(?P<id>\d+)\']\s+=\s+\'(?P<value>\d+)\'/';
    $str = '<script>
    .......
    var playerIdMap = {};
    playerIdMap[\'4\'] = \'614\';
    playerIdMap[\'5\'] = \'84\';
    playerIdMap[\'6\'] = \'65\';
    playerIdMap[\'7\'] = \'701\';
    getPlayerIdMap = function() { return playerIdMap; };   // global
    }
    enclosePlayerMap();
    </script>';
    
    preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
    
    // Print the entire match result
    var_dump($matches);
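
The pattern is easier to follow when split into its parts. The sketch below restates it in Python with `re.VERBOSE`, which ignores whitespace inside the pattern and allows inline comments. PHP's PCRE accepts the same `(?P<name>...)` named-group syntax, so the annotations apply to the PHP version as well. The name `PATTERN` is only illustrative.

```python
import re

PATTERN = re.compile(
    r"""
    playerIdMap\[      # literal text: playerIdMap followed by an opening bracket
    '(?P<id>\d+)'      # single-quoted key; the digits are captured as group id
    \]                 # closing bracket
    \s+=\s+            # equals sign with at least one whitespace character on each side
    '(?P<value>\d+)'   # single-quoted value; the digits are captured as group value
    """,
    re.VERBOSE,
)

match = PATTERN.search("playerIdMap['4'] = '614';")
print(match.groupdict())
# {'id': '4', 'value': '614'}
```

Because the pattern requires whitespace around `=` (`\s+`), it would miss minified source such as `playerIdMap['4']='614'`; relaxing both to `\s*` should cover that case.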
    

    Previous answer

    You can use something like the following:

    >>> data = """
    ... <script>
    ... .......
    ... var playerIdMap = {};
    ... playerIdMap['4'] = '614';
    ... playerIdMap['5'] = '84';
    ... playerIdMap['6'] = '65';
    ... playerIdMap['7'] = '701';
    ... getPlayerIdMap = function() { return playerIdMap; };   // global
    ... }
    ... enclosePlayerMap();
    ... </script>
    ... """
    >>> import re
    >>>
    >>> regex = r"playerIdMap\['(?P<id>\d+)']\s+=\s+'(?P<value>\d+)'"
    >>> re.findall(regex, data)
    [('4', '614'), ('5', '84'), ('6', '65'), ('7', '701')]
    

    In Scrapy, first select the script tag with the XPath below, then apply the same regex:

    import re
    
    data = response.xpath("//script[contains(text(),'getPlayerIdMap')]").extract_first()
    
    regex = r"playerIdMap\['(?P<id>\d+)']\s+=\s+'(?P<value>\d+)'"
    print(re.findall(regex, data))
    # Output: [('4', '614'), ('5', '84'), ('6', '65'), ('7', '701')]
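
If Scrapy is not available, the script tag can also be isolated with Python's standard library before applying the regex. The following is a rough sketch using `html.parser`; the class `ScriptCollector`, the helper `find_player_script`, and the sample HTML are made up for illustration, and real pages may need more robust handling.

```python
import re
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script text may arrive in several chunks, so append to the current entry.
        if self._in_script:
            self.scripts[-1] += data


def find_player_script(html):
    """Return the body of the first script mentioning getPlayerIdMap, or None."""
    parser = ScriptCollector()
    parser.feed(html)
    parser.close()
    for body in parser.scripts:
        if "getPlayerIdMap" in body:
            return body
    return None


# Hypothetical page with one unrelated script and one containing the map.
SAMPLE_HTML = (
    "<html><head><script>var x = 1;</script></head><body><script>\n"
    "var playerIdMap = {};\n"
    "playerIdMap['4'] = '614';\n"
    "playerIdMap['5'] = '84';\n"
    "getPlayerIdMap = function() { return playerIdMap; };\n"
    "</script></body></html>"
)

script = find_player_script(SAMPLE_HTML)
print(re.findall(r"playerIdMap\['(\d+)'\]\s+=\s+'(\d+)'", script))
# [('4', '614'), ('5', '84')]
```

Narrowing the search to one script body first keeps the regex from matching similar text elsewhere on the page.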
    