dor2p0520 2016-02-16 21:55
35 views
Accepted

Multiple goroutines accessing/modifying a list/map

I am trying to implement a multithreaded crawler in Go as a sample task to learn the language.

It is supposed to scan pages, follow links and save them to a DB.

To avoid duplicates I'm using a map where I keep all the URLs I have already saved.

The synchronous version works fine, but I run into trouble when I try to use goroutines.

I'm trying to use a mutex as the sync object for the map, and a channel as a way to coordinate the goroutines, but obviously I don't have a clear understanding of them.

The problem is that I get many duplicate entries, so my map store/check does not work properly.

Here is my code:

package main

import (
    "fmt"
    "net/http"
    "golang.org/x/net/html"
    "strings"
    "database/sql"
    _ "github.com/ziutek/mymysql/godrv"
    "io/ioutil"
    "runtime/debug"
    "sync"
)

const maxDepth = 2

var workers = make(chan bool)

type Pages struct {
    mu sync.Mutex
    pagesMap map[string]bool
}

func main() {
    var pagesMutex Pages
    fmt.Println("Start")
    const database = "gotest"
    const user = "root"
    const password = "123"

    //open connection to DB
    con, err := sql.Open("mymysql", database + "/" + user + "/" + password)
    if err != nil { /* error handling */
        fmt.Printf("%s", err)
        debug.PrintStack()
    }

    fmt.Println("call 1st save site")
    pagesMutex.pagesMap = make(map[string]bool)
    go pagesMutex.saveSite(con, "http://golang.org/", 0)

    fmt.Println("saving true to channel")
    workers <- true

    fmt.Println("finishing in main")
    defer con.Close()
}


func (p *Pages) saveSite(con *sql.DB, url string, depth int) {
    fmt.Println("Save ", url, depth)
    fmt.Println("trying to lock")
    p.mu.Lock()
    fmt.Println("locked on mutex")
    pageDownloaded := p.pagesMap[url] == true
    if pageDownloaded {
        p.mu.Unlock()
        return
    } else {
        p.pagesMap[url] = true
    }
    p.mu.Unlock()

    response, err := http.Get(url)
    if err != nil {
        fmt.Printf("%s", err)
        debug.PrintStack()
    } else {
        defer response.Body.Close()

        contents, err := ioutil.ReadAll(response.Body)
        if err != nil {
            fmt.Printf("%s", err)
            debug.PrintStack()
        }

        _, err = con.Exec("insert into pages (url) values (?)", string(url))
        if err != nil {
            fmt.Printf("%s", err)
            debug.PrintStack()
        }
        z := html.NewTokenizer(strings.NewReader((string(contents))))

        for {
            tokenType := z.Next()
            if tokenType == html.ErrorToken {
                return
            }

            token := z.Token()
            switch tokenType {
            case html.StartTagToken: // <tag>

                tagName := token.Data
                if strings.Compare(string(tagName), "a") == 0 {
                    for _, attr := range token.Attr {
                        if strings.Compare(attr.Key, "href") == 0 {
                            if depth < maxDepth  {
                                urlNew := attr.Val
                                if !strings.HasPrefix(urlNew, "http")  {
                                    if strings.HasPrefix(urlNew, "/")  {
                                        urlNew = urlNew[1:]
                                    }
                                    urlNew = url + urlNew
                                }
                                //urlNew = path.Clean(urlNew)
                                go  p.saveSite(con, urlNew, depth + 1)

                            }
                        }
                    }

                }
            case html.TextToken: // text between start and end tag
            case html.EndTagToken: // </tag>
            case html.SelfClosingTagToken: // <tag/>

            }

        }

    }
    val := <-workers
    fmt.Println("finished Save Site", val)
}

Could someone explain to me how to do this properly, please?


1 answer

  • dspym82000 2016-02-19 18:00

    Well, you have two choices. For a small and simple implementation, I would recommend moving the operations on the map into a separate structure.

    // Index is a shared page index
    type Index struct {
        access sync.Mutex
        pages  map[string]bool
    }

    // Mark records that a site has been visited
    func (i *Index) Mark(name string) {
        i.access.Lock()
        i.pages[name] = true
        i.access.Unlock()
    }

    // Visited reports whether a site has been visited
    func (i *Index) Visited(name string) bool {
        i.access.Lock()
        defer i.access.Unlock()

        return i.pages[name]
    }
    

    Then, add another structure like this:

    // Crawler is a web spider :D
    type Crawler struct {
        index *Index
        /* ... other important stuff like visited sites ... */
    }

    // Crawl looks for content
    func (c *Crawler) Crawl(site string) {
        // Implement your logic here
        // For example:
        if !c.index.Visited(site) {
            c.index.Mark(site) // When marked
        }
    }
    
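    One caveat with the check-then-mark pattern in Crawl: two goroutines can both see Visited return false for the same URL and both crawl it, which would reproduce the kind of duplicate entries the question describes. Doing the check and the set under a single lock avoids that; the MarkIfNew helper below is only a sketch and is not part of the original answer.

    // MarkIfNew marks a site as visited and reports whether this call
    // was the first one to do so. The check and the write happen under
    // one lock, so two goroutines never both get true for the same URL.
    // (Sketch only, not part of the original answer.)
    func (i *Index) MarkIfNew(name string) bool {
        i.access.Lock()
        defer i.access.Unlock()

        if i.pages[name] {
            return false
        }
        i.pages[name] = true
        return true
    }

    In Crawl you would then call c.index.MarkIfNew(site) instead of the Visited/Mark pair.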

    That way you keep things nice and clear; it is probably a little more code, but definitely more readable. You need to instantiate the crawler like this:

    sameIndex := &Index{pages: make(map[string]bool)}
    asManyAsYouWant := Crawler{index: sameIndex} // They will all share sameIndex
    
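    As a usage sketch (not part of the original answer), several crawlers sharing sameIndex could be launched like this; the sync.WaitGroup keeps main from returning before the goroutines finish, something the single send on the workers channel in the question's code does not guarantee.

    var wg sync.WaitGroup
    seeds := []string{"http://golang.org/", "http://golang.org/doc/"} // example seed URLs
    for _, site := range seeds {
        c := Crawler{index: sameIndex} // every crawler shares the same Index
        wg.Add(1)
        go func(c Crawler, site string) {
            defer wg.Done()
            c.Crawl(site)
        }(c, site)
    }
    wg.Wait() // block until every crawler goroutine has finished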

    If you want to go further with a higher-level solution, then I would recommend a producer/consumer architecture, roughly along the lines of the sketch below.
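
    A rough sketch of that idea (everything below, including the worker count and seed list, is made up for illustration and is not from the original answer): one producer feeds seed URLs into a channel and a small fixed pool of consumer goroutines crawls them, so the number of goroutines stays bounded and main only returns once the channel has been drained.

    jobs := make(chan string)
    index := &Index{pages: make(map[string]bool)}

    var wg sync.WaitGroup
    for w := 0; w < 4; w++ { // a pool of 4 consumer workers
        wg.Add(1)
        go func() {
            defer wg.Done()
            c := Crawler{index: index}
            for site := range jobs { // loop ends when jobs is closed
                c.Crawl(site)
            }
        }()
    }

    // Producer: feed the seed URLs, then close the channel.
    for _, seed := range []string{"http://golang.org/"} {
        jobs <- seed
    }
    close(jobs)
    wg.Wait()

    Feeding links discovered while crawling back into jobs takes a bit more care (for example a pending-work counter that decides when to close the channel), but the overall shape stays the same.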

