多个goroutine访问/修改列表/地图

I am trying to implement a multithreaded crawler using a go lang as a sample task to learn the language.

It supposed to scan pages, follow links and save them do DB.

To avoid duplicates I'm trying to use map where I save all the URLs I've already saved.

The synchronous version works fine, but I have troubles when I'm trying to use goroutines.

I'm trying to use mutex as a sync object for map, and channel as a way to coordinate goroutines. But obviously I don't have clear understanding of them.

The problem is that I have many duplicate entries, so my map store/check does not work properly.

Here is my code:

package main

import (
    "fmt"
    "net/http"
    "golang.org/x/net/html"
    "strings"
    "database/sql"
    _ "github.com/ziutek/mymysql/godrv"
    "io/ioutil"
    "runtime/debug"
    "sync"
)

const maxDepth = 2;

var workers = make(chan bool)

type Pages struct {
    mu sync.Mutex
    pagesMap map[string]bool
}

func main() {
    var pagesMutex Pages
    fmt.Println("Start")
    const database = "gotest"
    const user = "root"
    const password = "123"

    //open connection to DB
    con, err := sql.Open("mymysql", database + "/" + user + "/" + password)
    if err != nil { /* error handling */
        fmt.Printf("%s", err)
        debug.PrintStack()
    }

    fmt.Println("call 1st save site")
    pagesMutex.pagesMap = make(map[string]bool)
    go pagesMutex.saveSite(con, "http://golang.org/", 0)

    fmt.Println("saving true to channel")
    workers <- true

    fmt.Println("finishing in main")
    defer con.Close()
}


func (p *Pages) saveSite(con *sql.DB, url string, depth int) {
    fmt.Println("Save ", url, depth)
    fmt.Println("trying to lock")
    p.mu.Lock()
    fmt.Println("locked on mutex")
    pageDownloaded := p.pagesMap[url] == true
    if pageDownloaded {
        p.mu.Unlock()
        return
    } else {
        p.pagesMap[url] = true
    }
    p.mu.Unlock()

    response, err := http.Get(url)
    if err != nil {
        fmt.Printf("%s", err)
        debug.PrintStack()
    } else {
        defer response.Body.Close()

        contents, err := ioutil.ReadAll(response.Body)
        if err != nil {
            if err != nil {
                fmt.Printf("%s", err)
                debug.PrintStack()
            }
        }

        _, err = con.Exec("insert into pages (url) values (?)", string(url))
        if err != nil {
            fmt.Printf("%s", err)
            debug.PrintStack()
        }
        z := html.NewTokenizer(strings.NewReader((string(contents))))

        for {
            tokenType := z.Next()
            if tokenType == html.ErrorToken {
                return
            }

            token := z.Token()
            switch tokenType {
            case html.StartTagToken: // <tag>

                tagName := token.Data
                if strings.Compare(string(tagName), "a") == 0 {
                    for _, attr := range token.Attr {
                        if strings.Compare(attr.Key, "href") == 0 {
                            if depth < maxDepth  {
                                urlNew := attr.Val
                                if !strings.HasPrefix(urlNew, "http")  {
                                    if strings.HasPrefix(urlNew, "/")  {
                                        urlNew = urlNew[1:]
                                    }
                                    urlNew = url + urlNew
                                }
                                //urlNew = path.Clean(urlNew)
                                go  p.saveSite(con, urlNew, depth + 1)

                            }
                        }
                    }

                }
            case html.TextToken: // text between start and end tag
            case html.EndTagToken: // </tag>
            case html.SelfClosingTagToken: // <tag/>

            }

        }

    }
    val := <-workers
    fmt.Println("finished Save Site", val)
}

Could someone explain to me how to do this properly, please?

写回答
好问题 0 提建议
追加酬金
关注问题
分享
邀请回答
编辑收藏删除结题
收藏举报

1条回答默认最新

dspym82000 2016-02-19 18:00

关注

Well you have two chooses, for a little and simple implementation, I would recommend to separate the operations on the map into a separate structure.

// Index is a shared page index
type Index struct {
    access sync.Mutex
    pages map[string]bool
}

// Mark reports that a site have been visited
func (i Index) Mark(name string) {
    i.access.Lock()
    i.pages[name] = true
    i.access.Unlock()
}

// Visited returns true if a site have been visited
func (i Index) Visited(name string) bool {
    i.access.Lock()
    defer i.access.Unlock()

    return i.pages[name]
}

Then, add another structure like this:

// Crawler is a web spider :D
type Crawler struct {
    index Index
    /* ... other important stuff like visited sites ... */
}

// Crawl looks for content
func (c *Crawler) Crawl(site string) {
    // Implement your logic here 
    // For example: 
    if !c.index.Visited(site) {
        c.index.Mark(site) // When marked
    }
}

That way you keep things nice and clear, probably a little more code, but definitely more readable. You need to instance crawler like this:

sameIndex := Index{pages: make(map[string]bool)}
asManyAsYouWant := Crawler{sameIndex, 0} // They will share sameIndex

If you want to go further with a high level solution, then I would recommend Producer/Consumer architecture.

本回答被题主选为最佳回答 , 对您是否有帮助呢?

报告相同问题？

关注问题

如何等待其他多个Goroutine的单个Goroutine响应？
2019-02-24 16:50

回答 1 已采纳 What I would to to solve your task is I would use a goroutine pool for this. There would be a prod
具有多个通道的多个goroutine的死锁
2018-11-05 05:11

回答 1 已采纳 We can iterate through values sent over a channel. To break such iteration channel needs to be clo
使用通道同步多个goroutine
2018-05-13 06:29

回答 2 已采纳 Use sync.WaitGroup to wait for goroutines to complete. Close channels to cause loops reading on c
*为什么Rust Async/await 设计成这样？
2024-04-30 16:03

xuejianxinokok的博客 Rust 中的 Async/await 语法最初...再次引用上的评论，讨论最近关于该主题的博客文章：我真的无法理解为什么有人会看到 Rust 的异步混乱，并认为对于一种已经以编写起来非常复杂而闻名的语言来说，这是一个很好的设计。
Goroutine I / O调度
2015-08-05 11:09

回答 2 已采纳 You can have a look in file https://github.com/golang/go/blob/master/src/net/fd_unix.go (Write fun
如果一次执行中发生错误，则关闭多个goroutine
2017-08-04 07:37

回答 1 已采纳 You may use the context package which was created for things like this ("carries deadlines, cancel
Golang：在多个goroutine中发送关闭通道错误
2018-08-08 07:32

回答 2 已采纳 When you work with channels in Go always the sender should close the channel. Because that signals
Go Web 爬虫快速启动指南（二）
2024-07-11 10:48

绝不原创的飞龙的博客在我们之前使用互斥锁的示例中，多个线程试图通过创建释放锁来访问包含每个网站状态的地图。我们可以通过将刮削器线程作为单独的 goroutines 启动，并通过通道将它们的状态传递回主 goroutine 来使用通道作为更安全...
这个goroutine是否永远泄漏/阻塞？
2018-05-20 20:10

回答 1 已采纳 To close this question out...the simple answer is yes.
Golang多个goroutine通过引用共享相同的变量
2017-02-05 01:15

回答 1 已采纳 Are you sure you need goroutines to perform simple validations? Anyway the code you have written u
如何从多个goroutine共享的单个通道读取
2018-10-14 14:34

回答 2 已采纳 The code will return with an error if you not create buffer channels reason being the channel is c
Go框架，库和软件的精选列表
2019-05-05 15:55

思月行云的博客 2018最新精选的Go框架，库和软件的精选列表一 https://awesome-go.com/ 2018最新精选的Go框架，库和软件的精选列表二 https://awesome-go.com/ 2018最新精选的Go框架，库和软件的精选列表三 ...
http://c.biancheng.net/
2023-07-11 08:40

mingo_敏的博客 Go语言变量的声明（使用var关键字） http://c.biancheng.net/view/10.html 标题是：Go语言变量的初始化 http://c.biancheng.net/view/11.html 标题是：Go语言多个变量同时赋值 http://c.biancheng.net/view/12.html ...
Go 语言入门三部曲（一）：能看懂 Go 语言
2022-06-14 21:59

看，未来的博客一个标识符实际上就是一个或是多个字母(A~Z和a~z)数字(0~9)、下划线_组成的序列，但是第一个字符必须是字母或下划线而不能是数字。关键字和预定义标识符这不用去记，按我的方法来起名字（凡是起名必带 _）是不...
2018最新精选的Go框架，库和软件的精选列表三
2019-01-04 11:05

秋天的春的博客 - 路由和代理Selenium Wedriver请求多个Selenium集线器的轻量级服务器。 selenoid - 在容器中启动浏览器的备用Selenium中心服务器。文字处理用于解析和操作文本的库。具体格式 align - 一个...
没有解决我的问题, 去提问

悬赏问题

¥15 adb push异常 adb: error: 1409-byte write failed: Invalid argument
¥15 android报错 brut.common.BrutException: could not exec (exit code = 1)
¥15 nginx反向代理获取ip，java获取真实ip
¥15 eda：门禁系统设计
¥50 如何使用js去调用vscode-js-debugger的方法去调试网页
¥15 376.1电表主站通信协议下发指令全被否认问题
¥15 物体双站RCS和其组成阵列后的双站RCS关系验证
¥15 复杂网络，变滞后传递熵，FDA
¥20 csv格式数据集预处理及模型选择
¥15 部分网页页面无法显示！

码龄粉丝数原力等级 --

多个goroutine访问/修改列表/地图

1条回答默认最新

码龄粉丝数原力等级 --

悬赏问题

多个goroutine访问/修改列表/地图

1条回答 默认 最新

悬赏问题

1条回答默认最新