dongweishi2028 2016-01-05 01:28

Running Multiple GTK WebKitWebViews via Goroutines

I'm using Go with the gotk3 and webkit2 libraries to build a web crawler that can evaluate JavaScript in the context of a WebKitWebView.

With performance in mind, I'm trying to figure out the best way to have it crawl concurrently (ideally in parallel across multiple processors), using all available resources.

GTK, threads, and goroutines are all fairly new to me. The gotk3 goroutines example states:

Native GTK is not thread safe, and thus, gotk3's GTK bindings may not be used from other goroutines. Instead, glib.IdleAdd() must be used to add a function to run in the GTK main loop when it is in an idle state.
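
The rule quoted above boils down to one goroutine owning all toolkit state while other goroutines hand it functions to run. Below is a toolkit-free sketch of that hand-off using only the standard library. The `mainLoop` type, its `IdleAdd` method, and `loadAll` are hypothetical stand-ins for illustration, not gotk3's API, and the "load" is simulated by appending to a slice.

```go
package main

import (
	"fmt"
	"sync"
)

// mainLoop is a hypothetical stand-in for the GTK main loop: a single
// goroutine runs every queued function, so the state those functions
// touch needs no locking.
type mainLoop struct {
	queue chan func()
}

func newMainLoop() *mainLoop {
	return &mainLoop{queue: make(chan func(), 16)}
}

// IdleAdd queues fn to run on the loop goroutine. It may be called from
// any goroutine, which is the role glib.IdleAdd plays in the quoted text.
func (m *mainLoop) IdleAdd(fn func()) { m.queue <- fn }

// Run executes queued functions in order until Quit closes the queue.
func (m *mainLoop) Run() {
	for fn := range m.queue {
		fn()
	}
}

// Quit stops the loop once the already queued functions have run.
func (m *mainLoop) Quit() { close(m.queue) }

// loadAll lets one goroutine per URL ask the loop to "load" it and
// returns the URLs in the order the loop processed them.
func loadAll(urls []string) []string {
	loop := newMainLoop()
	var loaded []string // only touched by functions run inside loop.Run

	var wg sync.WaitGroup
	for _, u := range urls {
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			loop.IdleAdd(func() { loaded = append(loaded, u) })
		}(u)
	}
	// Close the queue only after every producer has finished queuing.
	go func() {
		wg.Wait()
		loop.Quit()
	}()

	loop.Run() // blocks on the caller's goroutine, like gtk.Main()
	return loaded
}

func main() {
	urls := []string{"https://www.google.com", "https://github.com"}
	fmt.Println(loadAll(urls))
}
```

The order of the returned URLs depends on goroutine scheduling, so it can differ between runs. In this model the queued functions still run one at a time, which means the hand-off gives safety rather than parallelism.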

Go panics and prints a stack trace when I run a function that creates a new WebView in a goroutine. I'm not sure exactly why this happens, but I suspect it is related to the thread-safety note quoted above. An example is shown below.

Current Code

Here's my current code, which has been adapted from the webkit2 example:

package main

import (
    "fmt"
    "github.com/gotk3/gotk3/glib"
    "github.com/gotk3/gotk3/gtk"
    "github.com/sourcegraph/go-webkit2/webkit2"
    "github.com/sqs/gojs"
)

func crawlPage(url string) {
    web := webkit2.NewWebView()

    web.Connect("load-changed", func(_ *glib.Object, i int) {
        loadEvent := webkit2.LoadEvent(i)

        switch loadEvent {
        case webkit2.LoadFinished:
            fmt.Printf("Load finished for: %v\n", url)

            web.RunJavaScript("window.location.hostname", func(val *gojs.Value, err error) {
                if err != nil {
                    fmt.Println("JavaScript error.")
                } else {
                    fmt.Printf("Hostname (from JavaScript): %q\n", val)
                }

                //gtk.MainQuit()
            })
        }
    })

    glib.IdleAdd(func() bool {
        web.LoadURI(url)
        return false
    })
}

func main() {
    gtk.Init(nil)

    crawlPage("https://www.google.com")
    crawlPage("https://www.yahoo.com")
    crawlPage("https://github.com")
    crawlPage("http://deelay.me/2000/http://deelay.me/img/1000ms.gif")

    gtk.Main()
}

It seems that creating a new WebView for each URL allows the pages to load concurrently. Running glib.IdleAdd() in a goroutine, as in the gotk3 example, doesn't seem to make any difference (although I'm only judging by eye, not measuring):

go glib.IdleAdd(func() bool { // Works
    web.LoadURI(url)
    return false
})

However, trying to create a goroutine for each crawlPage() call ends in a panic:

go crawlPage("https://www.google.com") // Panics and shows stack trace

I can run web.RunJavaScript() in a goroutine without issue:

        switch loadEvent {
        case webkit2.LoadFinished:
            fmt.Printf("Load finished for: %v\n", url)

            go web.RunJavaScript("window.location.hostname", func(val *gojs.Value, err error) { // Works
                if err != nil {
                    fmt.Println("JavaScript error.")
                } else {
                    fmt.Printf("Hostname (from JavaScript): %q\n", val)
                }

                //gtk.MainQuit()
            })
        }

Best Method?

The current methods I can think of are:

  1. Spawn new WebViews to crawl each page, as shown in the current code. Track how many WebViews are open and either keep destroying and recreating them or reuse a fixed set created up front, sized so that all available resources on the machine are used. Would this be limited in how many processor cores it can use?
  2. The same idea as #1, but run the binary several times (for example, four gocrawler processes on the machine instead of one) to use all cores and resources.
  3. Run the GUI (gtk3) portion of the app in its own goroutine, then pass data to other goroutines that do their own heavy processing, such as searching through content.
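
The hand-off in option 3 could look roughly like the sketch below, which uses only the standard library. The `page` type and the `countMatches` function are hypothetical names for illustration. In a real crawler the pages would be sent from the load-finished callback, while here they are fixed strings so the example runs on its own.

```go
package main

import (
	"fmt"
	"runtime"
	"strings"
	"sync"
)

// page is a hypothetical record the GUI side would emit once a WebView
// has finished loading; here it is filled with fixed strings.
type page struct {
	URL  string
	Body string
}

// countMatches fans pages out to a pool of worker goroutines and returns,
// for each URL, how many times needle occurs in that page's body.
func countMatches(pages []page, needle string, workers int) map[string]int {
	if workers < 1 {
		workers = 1 // avoid blocking forever with no consumers
	}
	in := make(chan page)
	results := make(map[string]int)
	var mu sync.Mutex // guards results
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range in {
				n := strings.Count(p.Body, needle) // the "heavy" work
				mu.Lock()
				results[p.URL] = n
				mu.Unlock()
			}
		}()
	}

	// The producer side: in the crawler this would be the GUI goroutine.
	for _, p := range pages {
		in <- p
	}
	close(in)
	wg.Wait()
	return results
}

func main() {
	pages := []page{
		{URL: "https://www.google.com", Body: "search search engine"},
		{URL: "https://github.com", Body: "code hosting"},
	}
	fmt.Println(countMatches(pages, "search", runtime.NumCPU()))
}
```

Only the content processing is spread across cores in this sketch. The page loads themselves would still be driven from the single GTK main loop.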

What would actually be the best way to run this code concurrently, if possible, and max out performance?

Update

Methods 1 and 2 are probably out of the picture: I ran a test that spawned ~100 WebViews, and they seem to load synchronously.
