dongye1942 2016-12-05 10:15
73 views
Accepted

How do I get around bot protection when scraping full article text from the New York Times?

I am trying to scrape full book reviews from the New York Times in order to perform sentiment analysis on them. I am aware of the NY Times API and am using it to get book review URLs, but I need to devise a scraper to get the full article text, as the API only gives a snippet. I believe that nytimes.com has bot protection to prevent bots from scraping the website but I know there are ways to circumvent it.

I found this Python scraper that works and can pull full text from nytimes.com, but I would prefer to implement my solution in Go. Should I just port it to Go, or is that solution unnecessarily complex? I have already tried changing the User-Agent header, but everything I do in Go ends in an infinite redirect loop error.

Code:

package main

import (
    //"fmt"
    "io/ioutil"
    "log"
    "math/rand"
    "net/http"
    "time"
    //"net/url"
)

func main() {

    rand.Seed(time.Now().Unix())

    userAgents := [5]string{
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0",
        "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:41.0) Gecko/20100101 Firefox/41.0",
    }

    url := "http://www.nytimes.com/2015/10/25/books/review/the-tsar-of-love-and-techno-by-anthony-marra.html"

    client := &http.Client{}

    req, err := http.NewRequest("GET", url, nil)
    if err != nil {
        log.Fatalln(err)
    }

    req.Header.Set("User-Agent", userAgents[rand.Intn(len(userAgents))])

    resp, err := client.Do(req)
    if err != nil {
        log.Fatalln(err)
    }

    defer resp.Body.Close()
    body, err := ioutil.ReadAll(resp.Body)
    if err != nil {
        log.Fatalln(err)
    }

    log.Println(string(body))
}

Results in:

2016/12/05 21:57:53 Get http://www.nytimes.com/2015/10/25/books/review/the-tsar-of-love-and-techno-by-anthony-marra.html?_r=4: stopped after 10 redirects
exit status 1

Any help is appreciated! Thank you!


1 answer

  • duanmei2805 2016-12-06 10:03

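    You just need to give your HTTP client a cookie jar. Without one, any cookie the site sets during a redirect is never sent back, which is the likely cause of the loop: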

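    // Requires: import "net/http/cookiejar"
    cookieJar, err := cookiejar.New(nil)
    if err != nil {
        log.Fatalln(err)
    }
    // The jar stores cookies from each response and replays them on redirects.
    client := &http.Client{Jar: cookieJar}

    resp, err := client.Do(req)
    if err != nil {
        log.Fatalln(err)
    }
    defer resp.Body.Close()
    // resp now holds the page; read resp.Body and
    // print it to the console or save it to a file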
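
To see why the jar matters without hitting the real site, here is a minimal, self-contained sketch. It assumes the redirect loop comes from a cookie check. The local test server, the cookie name `seen`, and the helper names `newCookieGateServer` and `fetch` are made up for illustration and are not taken from nytimes.com.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"net/http/cookiejar"
	"net/http/httptest"
)

// newCookieGateServer starts a local stand-in for a site that redirects
// every visitor who does not present a "seen" cookie. It is a hypothetical
// fixture, not a model of how nytimes.com actually behaves.
func newCookieGateServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if _, err := r.Cookie("seen"); err != nil {
			// No cookie yet: set it and bounce the client back to the same path.
			http.SetCookie(w, &http.Cookie{Name: "seen", Value: "1", Path: "/"})
			http.Redirect(w, r, r.URL.Path, http.StatusFound)
			return
		}
		fmt.Fprint(w, "full article text")
	}))
}

// fetch GETs url and returns the response body. When useJar is true the
// client keeps cookies from each response and sends them on later requests,
// including the ones it issues while following redirects.
func fetch(url string, useJar bool) (string, error) {
	client := &http.Client{}
	if useJar {
		jar, err := cookiejar.New(nil)
		if err != nil {
			return "", err
		}
		client.Jar = jar
	}

	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return "", err
	}
	req.Header.Set("User-Agent", "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:41.0) Gecko/20100101 Firefox/41.0")

	resp, err := client.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	srv := newCookieGateServer()
	defer srv.Close()

	// Without a jar the cookie is dropped, so the server keeps redirecting.
	if _, err := fetch(srv.URL, false); err != nil {
		log.Println("without jar:", err)
	}

	// With a jar the second request carries the cookie and gets the page.
	body, err := fetch(srv.URL, true)
	if err != nil {
		log.Fatalln(err)
	}
	log.Println("with jar:", body)
}
```

Against this fixture, the jar-less client gives up after Go's default limit of 10 redirects, the same failure mode as in the question, while the client with a jar gets the page on the second request. If the real site still loops with a jar attached, setting `CheckRedirect` on the client to log each hop is one way to see what the server is asking for.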
    
    This answer was accepted by the asker as the best answer.
