doudouji2016 2015-06-01 07:53

Go web scraper: ignoring specific cells of a table

I'm working on a small web scraper to get a feel for Go. It currently grabs a table from a wiki page and then pulls the text out of the table's cells. I don't have the code on me (I'm not at home), but it looks fairly similar to this:

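    package main

    import (
        "fmt"
        "log"

        "github.com/PuerkitoBio/goquery"
    )

    func main() {
        doc, err := goquery.NewDocument("http://monsterhunter.wikia.com/wiki/MH4:_Item_List")
        if err != nil {
            log.Fatal(err)
        }

        doc.Find("tbody").Each(func(i int, s *goquery.Selection) {
            title := s.Find("td").Text()
            fmt.Println(title) // Println, so a '%' in the text is not treated as a format verb
        })
    }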

The issue is that on this website the first cell is an image, so it prints the image source which I don't want. How can I ignore the first cell in each row of the large table?

1 answer

  • duandu2980 2015-06-01 09:21

    Let's clear up a few things. A Selection is a collection of nodes matching some criteria.

    doc.Find() is Selection.Find() (a Document embeds a *Selection), which returns a new Selection containing the descendant elements that match the selector. Selection.Each() iterates over each element of the collection and calls the function value passed to it.

    So in your case Find("tbody") will find all tbody elements, Each() will iterate over all tbody elements and call your anonymous function.

    Inside your anonymous function, s is a Selection holding a single tbody element. You call s.Find("td"), which returns a new Selection containing all the td elements of the current tbody. So when you call Text() on it, you get the combined text content of all those td elements, including their descendants. This is not what you want.

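    What you should do is call another Each() on the Selection returned by s.Find("td"), and inside that second anonymous function check whether the td Selection it receives contains an img element. Note that Find() matches descendants at any depth, not only direct children.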

    Example code:

    doc.Find("tbody").Each(func(i int, s *goquery.Selection) {
        // s here is a tbody element
        s.Find("td").Each(func(j int, s2 *goquery.Selection) {
            // s2 here is a td element
            // Find() never returns nil, so checking Length() is enough
            if s2.Find("img").Length() > 0 {
                return // This td contains at least one img descendant, skip it
            }
            fmt.Println(s2.Text())
        })
    })
    

    Alternatively you could search tr elements and skip the first td child of each row by checking if the index passed to the 3rd anonymous function is 0 (first child), something like this:

    doc.Find("tbody").Each(func(i int, s *goquery.Selection) {
        // s here is a tbody element
        s.Find("tr").Each(func(j int, s2 *goquery.Selection) {
            // s2 here is a tr element
            s2.Find("td").Each(func(k int, s3 *goquery.Selection) {
                // s3 here is a td element
                if k == 0 {
                    return // This is the first td in the row, skip it
                }
                fmt.Println(s3.Text())
            })
        })
    })
    
    This answer was selected by the asker as the best answer.
