I'm using Go with the gotk3 and webkit2 libraries to build a web crawler that can parse JavaScript in the context of a WebKitWebView.
For performance, I'm trying to figure out the best way to have it crawl concurrently (or ideally in parallel, across multiple processors), using all available resources.
GTK, threads, and goroutines are all fairly new to me. The gotk3 goroutines example states:
Native GTK is not thread safe, and thus, gotk3's GTK bindings may not be used from other goroutines. Instead, glib.IdleAdd() must be used to add a function to run in the GTK main loop when it is in an idle state.
Go panics and shows a stack trace when I run a function that creates a new WebView in a goroutine. I'm not exactly sure why this happens, but I suspect it is related to the note quoted above. An example is shown below.
Current Code
Here's my current code, which has been adapted from the webkit2 example:
package main

import (
	"fmt"

	"github.com/gotk3/gotk3/glib"
	"github.com/gotk3/gotk3/gtk"
	"github.com/sourcegraph/go-webkit2/webkit2"
	"github.com/sqs/gojs"
)

func crawlPage(url string) {
	web := webkit2.NewWebView()
	web.Connect("load-changed", func(_ *glib.Object, i int) {
		loadEvent := webkit2.LoadEvent(i)
		switch loadEvent {
		case webkit2.LoadFinished:
			fmt.Printf("Load finished for: %v\n", url)
			web.RunJavaScript("window.location.hostname", func(val *gojs.Value, err error) {
				if err != nil {
					fmt.Println("JavaScript error.")
				} else {
					fmt.Printf("Hostname (from JavaScript): %q\n", val)
				}
				//gtk.MainQuit()
			})
		}
	})
	glib.IdleAdd(func() bool {
		web.LoadURI(url)
		return false
	})
}

func main() {
	gtk.Init(nil)
	crawlPage("https://www.google.com")
	crawlPage("https://www.yahoo.com")
	crawlPage("https://github.com")
	crawlPage("http://deelay.me/2000/http://deelay.me/img/1000ms.gif")
	gtk.Main()
}
It seems that creating a new WebView for each URL allows them to load concurrently. Running glib.IdleAdd() in a goroutine, as per the gotk3 example, doesn't seem to have any effect (although I'm only doing a visual benchmark):
go glib.IdleAdd(func() bool { // Works
	web.LoadURI(url)
	return false
})
However, trying to create a goroutine for each crawlPage() call ends in a panic:
go crawlPage("https://www.google.com") // Panics and shows stack trace
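If the thread-safety note is the cause, the rule would be that only one thread may touch GTK objects, and every other goroutine has to hand its work to that thread. The following is a minimal, stdlib-only model of that rule, not gotk3 code: mainLoop, idleAdd, and crawlAll are hypothetical names, with idleAdd standing in for glib.IdleAdd() and the append standing in for WebView creation.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// mainLoop is a hypothetical stand-in for the GTK main loop: a single
// goroutine, pinned to one OS thread, that runs queued functions in turn.
type mainLoop struct {
	tasks chan func()
	done  chan struct{}
}

func newMainLoop() *mainLoop {
	return &mainLoop{tasks: make(chan func(), 64), done: make(chan struct{})}
}

// run executes queued functions one at a time until tasks is closed.
func (l *mainLoop) run() {
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()
	for f := range l.tasks {
		f()
	}
	close(l.done)
}

// idleAdd queues f for the loop; it is safe to call from any goroutine.
func (l *mainLoop) idleAdd(f func()) { l.tasks <- f }

// crawlAll starts one goroutine per URL. Each goroutine only queues work;
// the shared slice is touched exclusively by the loop goroutine.
func crawlAll(urls []string) []string {
	loop := newMainLoop()
	go loop.run()

	var visited []string // owned by the loop goroutine
	var wg sync.WaitGroup
	for _, u := range urls {
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			loop.idleAdd(func() { visited = append(visited, u) })
		}(u)
	}

	wg.Wait()         // every task has been queued
	close(loop.tasks) // let the loop drain and exit
	<-loop.done
	return visited
}

func main() {
	urls := []string{"https://www.google.com", "https://github.com"}
	fmt.Println(len(crawlAll(urls)))
}
```

In the real program, the equivalent change would presumably be to move webkit2.NewWebView() and Connect() inside the function passed to glib.IdleAdd(), so that a goroutine only ever queues work and never calls GTK directly.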
I can run web.RunJavaScript() in a goroutine without issue:
switch loadEvent {
case webkit2.LoadFinished:
	fmt.Printf("Load finished for: %v\n", url)
	go web.RunJavaScript("window.location.hostname", func(val *gojs.Value, err error) { // Works
		if err != nil {
			fmt.Println("JavaScript error.")
		} else {
			fmt.Printf("Hostname (from JavaScript): %q\n", val)
		}
		//gtk.MainQuit()
	})
}
Best Method?
The current methods I can think of are:
1. Spawn new WebViews to crawl each page, as shown in the current code. Track how many WebViews are open and either continually destroy and create them or reuse a fixed set created at startup, sized so that all available resources on the machine are used. Would this be limited in terms of the processor cores being used?
2. Same basic idea as #1, but run the binary multiple times (four gocrawler processes on the machine instead of one) to utilize all cores/resources.
3. Run the GUI (gtk3) portion of the app in its own goroutine. I could then pass data to other goroutines that do their own heavy processing, such as searching through content.
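Method 3 could look roughly like the sketch below, which uses only the standard library. A producer goroutine stands in for the GTK side and hands page content to a pool of workers over a channel; page, result, and processPages are hypothetical names, and word counting is a placeholder for the real content processing.

```go
package main

import (
	"fmt"
	"runtime"
	"strings"
	"sync"
)

// page is a hypothetical record the GTK side would hand off after a load finishes.
type page struct {
	URL  string
	Body string
}

// result holds the outcome of the CPU-bound step for one page.
type result struct {
	URL   string
	Words int
}

// processPages fans pages out to the given number of worker goroutines
// and collects one word count per URL.
func processPages(pages []page, workers int) map[string]int {
	in := make(chan page)
	out := make(chan result)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range in {
				// Placeholder for heavy processing of the page content.
				out <- result{URL: p.URL, Words: len(strings.Fields(p.Body))}
			}
		}()
	}

	// Producer: stands in for the goroutine that owns the GTK main loop.
	go func() {
		for _, p := range pages {
			in <- p
		}
		close(in)
	}()

	// Close out once every worker has finished.
	go func() {
		wg.Wait()
		close(out)
	}()

	counts := make(map[string]int)
	for r := range out {
		counts[r.URL] = r.Words
	}
	return counts
}

func main() {
	pages := []page{
		{URL: "https://www.google.com", Body: "search the web"},
		{URL: "https://github.com", Body: "where the world builds software"},
	}
	counts := processPages(pages, runtime.NumCPU())
	fmt.Println(counts["https://www.google.com"], counts["https://github.com"])
}
```

Sizing the pool with runtime.NumCPU() is only a starting point; this arrangement spreads the processing step across cores but does nothing to make the page loads themselves run in parallel.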
What would actually be the best way to run this code concurrently, if possible, and max out performance?
Update
Methods 1 and 2 are probably out of the picture: I ran a test that spawned ~100 WebViews, and they seemed to load synchronously.