go中的对象被替换

I'm learning go by writing a web spider. I'm trying to get a list of all the business categories from allpages.com.

Below is my entire program. Unfortunately I can't isolate the issue so I've pasted it all.

If you run this program, you'll see that first of all it correctly downloads the first page, and adds all the extracted categories to the list of categories.

However, when it then downloads subsequent pages, it seems to mess up the reference to the parent category. E.g. it incorrectly calculates the URL http://www.allpages.com/travel-tourism/political-ideological-organizations/, when in fact political-ideological-organizations/ is not a subcategory of travel-tourism/. Digging through the logs it seems to overwrite the data in the parent object. The error is more pronounced the more workers there are.

This was working a bit better before I started passing data by reference to the goroutine, but I had essentially the same issue.

I've got several questions:

How can I debug this without resorting to picking through log lines?

What's wrong/why isn't it working and how can it be fixed?

package main

import (
        "fmt"
        "github.com/PuerkitoBio/goquery"
        "log"
        "strconv"
        "strings"
        "regexp"
)

const domain = "http://www.allpages.com/"
const categoryPage = "category.html"

type Category struct {
        url string
        level uint
        name string
        entries int
        parent *Category
}

type DownloadResult struct {
        doc *goquery.Document
        category *Category
}

const WORKERS = 2
const SEPARATOR = "§§§"

func main() {

        allCategories := make([]Category, 0)

        downloadChannel := make(chan *Category)
        resultsChannel := make(chan *DownloadResult, 100)

        for w := 1; w <= WORKERS; w++ {
                go worker(downloadChannel, resultsChannel)
        }

        numRequests := 1
        downloadChannel <- &Category{ domain + categoryPage, 0, "root", 0, nil }

        for result := range resultsChannel {
                var extractor func(doc *goquery.Document) []string

                if result.category.level == 0 {
                        extractor = topLevelExtractor
                } else if result.category.level == 1 {
                        extractor = secondLevelExtractor
                } else {
                        extractor = thirdLevelExtractor
                }

                categories := extractCategories(result.doc, result.category, extractor)
                allCategories = append(allCategories, *categories...)

                //fmt.Printf("Appending categories: %v", *categories)

                fmt.Printf("total categories = %d, total requests = %d
", len(allCategories), numRequests)

                for _, category := range *categories {
                        numRequests += 1
                        downloadChannel <- &category
                }

                // close the channels when there are no more jobs
                if len(allCategories) > numRequests {
                        close(downloadChannel)
                        close(resultsChannel)
                }
        }

        fmt.Println("Done")
}

func worker(downloadChannel <-chan *Category, results chan<- *DownloadResult) {
        for target := range downloadChannel {
                fmt.Printf("Downloading %v (addr %p) ...", target, &target)

                doc, err := goquery.NewDocument(target.url)
                if err != nil {
                        log.Fatal(err)
                        panic(err)
                }

                fmt.Print("done 
")

                results <- &DownloadResult{doc, target}
        }
}

func extractCategories(doc *goquery.Document, parent *Category, extractor func(doc *goquery.Document) []string) *[]Category {

        numberRegex, _ := regexp.Compile("[0-9,]+")

        log.Printf("Extracting subcategories for page %s
", parent)

        subCategories := extractor(doc)

        categories := make([]Category, 0)

        for _, subCategory := range subCategories {
                log.Printf("Got subcategory=%s from parent=%s", subCategory, parent)
                extracted := strings.Split(subCategory, SEPARATOR)

                numberWithComma := numberRegex.FindString(extracted[2])
                number := strings.Replace(numberWithComma, ",", "", -1)

                numRecords, err := strconv.Atoi(number)
                if err != nil {
                        log.Fatal(err)
                        panic(err)
                }

                var category Category

                level := parent.level + 1

                if parent.level == 0 {
                        category = Category{ domain + extracted[1], level, extracted[0], numRecords, parent }
                } else {
                        log.Printf("category URL=%s, parent=%s, parent=%v", extracted[1], parent.url, parent)
                        category = Category{ parent.url + extracted[1], level, extracted[0], numRecords, parent }
                }

                log.Printf("Appending category=%v (pointer=%p)", category, &category)

                categories = append(categories, category)
        }

        return &categories
}

func topLevelExtractor(doc *goquery.Document) []string {
        return doc.Find(".cat-listings-td .c-1s-2m-1-td1").Map(func(i int, s *goquery.Selection) string {
                title := s.Find("a").Text()
                url := s.Find("a").Map(func(x int, a *goquery.Selection) string {
                        v, _ := a.Attr("href")
                        return v
                })
                records := s.Clone().Children().Remove().End().Text()

                //log.Printf("Item %d: %s, %s - %s
", i, title, records, url)

                res := []string{title, url[0], records}
                return strings.Join(res, SEPARATOR)
        })
}

func secondLevelExtractor(doc *goquery.Document) []string {
        return doc.Find(".c-2m-3c-1-table .c-2m-3c-1-td1").Map(func(i int, s *goquery.Selection) string {
                title := s.Find("a").Text()
                url := s.Find("a").Map(func(x int, a *goquery.Selection) string {
                        v, _ := a.Attr("href")
                        return v
                })
                records := s.Clone().Children().Remove().End().Text()

                //log.Printf("Item %d: %s, %s - %s
", i, title, records, url)

                res := []string{title, url[0], records}
                return strings.Join(res, SEPARATOR)
        })
}

func thirdLevelExtractor(doc *goquery.Document) []string {
        return doc.Find(".c-2m-3c-1-table .c-2m-3c-1-td1").Map(func(i int, s *goquery.Selection) string {
                title := s.Find("a").Text()
                url := s.Find("a").Map(func(x int, a *goquery.Selection) string {
                        v, _ := a.Attr("href")
                        return v
                })
                records := s.Clone().Children().Remove().End().Text()

                //log.Printf("Item %d: %s, %s - %s
", i, title, records, url)

                res := []string{title, url[0], records}
                return strings.Join(res, SEPARATOR)
        })
}

Update Fixed - see comment below.

写回答
好问题 0 提建议
追加酬金
关注问题
分享
邀请回答
编辑收藏删除结题
收藏举报

1条回答默认最新

关注

码龄粉丝数原力等级 --

被采纳

被点赞

采纳率
douyasihefu6214 2017-02-08 12:49
关注
Looping over:

for _, category := range *categories { numRequests += 1 downloadChannel <- &category }

meant I was sending a reference to the temporary variable category to the channel, instead of the actual memory address of that value.

I've fixed this by using a different loop:

for i := 0; i < len(*categories); i++ { fmt.Printf("Queuing category: %v (%p)", categoriesValues[i], categoriesValues[i]) downloadChannel <- &categoriesValues[i] }
本回答被题主选为最佳回答 , 对您是否有帮助呢?

解决无用
评论打赏
分享
举报

评论

按下Enter换行，Ctrl+Enter发表内容

报告相同问题？

关注问题

在Golang中替换数组结构中的数据
2019-03-03 02:14

回答 2 已采纳 Your logic assumes that the length of a is always the proper count to conditionally iterate throug
对象Golang中的对象中的对象 mongodb
2017-02-28 15:51

回答 1 已采纳 Your described structure, as I understand it, can be represented in Go as follows: type Hole stru
在golang中替换字符串中的字符
2019-08-01 12:50

回答 2 已采纳 Strings in Go are immutable, you can't change their content. To change the value of a string varia
SQL Server中利用正则表达式替换字符串的方法
2020-12-16 16:45

建立正则替换函数，利用了OLE对象，以下是函数代码： --如果存在则删除原有函数 IF OBJECT_ID(N'dbo.RegexReplace') IS NOT NULL DROP FUNCTION dbo.RegexReplace GO --开始创建正则替换函数 CREATE FUNCTION dbo...
在Golang中快速复制对象的更快方法 json
2017-10-17 12:31

回答 1 已采纳 JSON vs gob difference The encoding/gob package needs to transmit type definitions: The imple
Golang SQL查询变量替换 database oracle sql
2018-09-14 23:45

回答 2 已采纳 Parameter Placeholder Syntax (reference: http://go-database-sql.org/prepared.html ) The syntax
golang时间对象中有多少字节
2019-07-10 03:58

回答 2 已采纳 The answer to this question is not as straight forward as it might seem. It depends a lot on how
替换yaml模板中的变量
2022-11-15 16:53

朋泽的博客根据某种文件格式的template，通过替换其中的变量生成结果文件
在Go中初始化空对象
2018-11-07 21:36

回答 1 已采纳 You can declare a typed variable before you have a value for it. var cmd *exec.Cmd // or cmd :
在golang中循环遍历数组对象和分组的最佳方法
2018-07-26 16:00

回答 2 已采纳 You can use a map to store a slice of books per collection ID. type Book struct { Title
如何在Go中删除struct对象？
2017-02-06 11:34

回答 2 已采纳 Go is a garbage collected language. You are not supposed to, and you cannot delete objects from me
Go语言如何实现面向对象？
2021-11-18 21:26

ClimberCoding的博客 Go 由于缺乏类型层次，Go 中的 "对象 "比 C++ 或 Java 等语言更轻巧。 Go 实现面向对象编程封装面向对象中的 “封装” 指的是可以隐藏对象的内部属性和实现细节，仅对外提供公开接口调用，这样子用户就不需要关注...
如何在Go中替换字符串中的最后两个'/'字符
2017-04-20 07:55

回答 1 已采纳 First, from the documentation: Replace returns a copy of the string s with the first n non-ove
Golang 如何实现面向对象方法
2022-05-19 11:21

@航空母舰的博客在了解 Go 语言是不是面向对象（简称：OOP）之前，我们必须先知道 OOP 是啥，得先给他 “下定义”。根据 Wikipedia 的定义，我们梳理出 OOP 的几个基本认知：面向对象编程（OOP）是一种基于 "对象" 概念的...
一键替换工程文件和场景中的UI对象字体
2023-09-03 18:06

阳光下的大嘴猴王的博客一键替换工程和场景中的Text的字体
没有解决我的问题, 去提问

悬赏问题

¥15 smptlib使用465端口发送邮件失败
¥200 总是报错，能帮助用python实现程序实现高斯正反算吗？有偿
¥15 对于squad数据集的基于bert模型的微调
¥15 为什么我运行这个网络会出现以下报错？CRNN神经网络
¥20 steam下载游戏占用内存
¥15 CST保存项目时失败
¥15 树莓派5怎么用camera module 3啊
¥20 java在应用程序里获取不到扬声器设备
¥15 echarts动画效果的问题，请帮我添加一个动画。不要机器人回答。
¥15 Attention is all you need 的代码运行

go中的对象被替换

1条回答 默认 最新

悬赏问题

1条回答默认最新