doumiebiao6827 2019-02-14 18:16

Colly isn't finding any links

I've written a few programs like this before in essentially the same way (only the domains differed). This time, however, colly doesn't find a single link and quits after visiting the first page. Can anyone see what's wrong? *NOTE: I have omitted the parts of the program that are not relevant to the problem.

*EDIT: I have found the problem but not a solution. Running curl https://trendmicro.com/vinfo/us/security/research-and-analysis/threat-reports returns a 301 Moved Permanently response in the terminal, but opening the same link in a browser gets the page I want. Why is this happening, and how do I fix it?

*EDIT2: I have found that running curl -L makes curl follow redirects, which then returns the page I need. How do I translate that to colly? Colly is still stopping at the 301 response.

package main

import (
    "fmt"
    "strings"

    "github.com/gocolly/colly"
)

func main() {
    // Only follow links whose href contains this path.
    tld1 := "/vinfo/us/security/research-and-analysis/threat-reports"

    c := colly.NewCollector(
        colly.AllowedDomains("trendmicro.com", "documents.trendmicro.com"),
    )

    // Print every link on the page and visit the ones under tld1.
    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        link := e.Attr("href")
        fmt.Printf("Link found: %q -> %s\n", e.Text, link)
        if strings.Contains(link, tld1) {
            c.Visit(e.Request.AbsoluteURL(link))
        }
    })

    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Visiting", r.URL.String())
    })

    c.Visit("https://trendmicro.com/vinfo/us/security/research-and-analysis/threat-reports")
}

1 answer

  • douxiawei9318 2019-02-14 18:39

    I have found the solution. I plugged my link https://trendmicro.com/vinfo/us/security/research-and-analysis/threat-reports into https://wheregoes.com/retracer.php to see where the 301 redirects to, and found that it prepends www. to the host name. Adding www. to the URL in the initial c.Visit call and to the colly.AllowedDomains list worked like a charm.
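
The redirect lookup can also be done from Go instead of an online tracer. Go's net/http client follows redirects by default, so nothing like curl -L is needed; the likely cause here is that the redirect target's host was not in the allowed list. The following is a minimal sketch, not colly's actual code: it assumes the allowed-domain check is an exact host comparison, which matches the behaviour described above. The helper names redirectTarget, hostAllowed and newStandInServer are hypothetical, and a local test server stands in for the bare trendmicro.com host so the example runs offline.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
)

// redirectTarget sends one GET request without following redirects and
// returns the absolute URL from the Location header, or "" if there is none.
func redirectTarget(rawURL string) (string, error) {
	client := &http.Client{
		// ErrUseLastResponse makes the client return the 3xx response itself.
		CheckRedirect: func(req *http.Request, via []*http.Request) error {
			return http.ErrUseLastResponse
		},
	}
	resp, err := client.Get(rawURL)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	loc, err := resp.Location()
	if err == http.ErrNoLocation {
		return "", nil
	}
	if err != nil {
		return "", err
	}
	return loc.String(), nil
}

// hostAllowed reports whether the host of rawURL exactly matches an entry
// in allowed (assumption: this mirrors how the allowed-domain list is applied).
func hostAllowed(rawURL string, allowed []string) bool {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false
	}
	for _, d := range allowed {
		if u.Hostname() == d {
			return true
		}
	}
	return false
}

// newStandInServer starts a local server that answers every request with a
// 301 to target, imitating the bare domain redirecting to its www. host.
func newStandInServer(target string) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		http.Redirect(w, r, target, http.StatusMovedPermanently)
	}))
}

func main() {
	srv := newStandInServer("https://www.trendmicro.com/vinfo/us/security/research-and-analysis/threat-reports")
	defer srv.Close()

	target, err := redirectTarget(srv.URL)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Println("redirects to:", target)

	before := []string{"trendmicro.com", "documents.trendmicro.com"}
	after := []string{"www.trendmicro.com", "documents.trendmicro.com"}
	fmt.Println("allowed before fix:", hostAllowed(target, before)) // false
	fmt.Println("allowed after fix:", hostAllowed(target, after))   // true
}
```

To check the real site, pass the original https://trendmicro.com/... URL to redirectTarget in place of srv.URL. In the collector itself the change is just the two strings: "www.trendmicro.com" in colly.AllowedDomains and https://www.trendmicro.com/... in the first c.Visit call.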
