douyin2435 2017-09-02 06:42

How do I write results to a CSV file from a concurrent web scraper in Go?

I'm new to Go and am trying to take advantage of its concurrency to build a basic scraper that extracts the title, meta description, and meta keywords from URLs.

I can print the results to the terminal concurrently, but I can't figure out how to write the output to a CSV file. I've tried every variation I could think of with my limited knowledge of Go, and many of them end up breaking the concurrency, so I'm losing my mind a bit.

My code and the URL input file are below. Thanks in advance for any tips!

// file name: metascraper.go
package main

import (
    // import standard libraries
    "encoding/csv"
    "fmt"
    "io"
    "log"
    "os"
    "time"
    // import third party libraries
    "github.com/PuerkitoBio/goquery"
)

func csvParsing() {
    file, err := os.Open("data/sample.csv")
    checkError("Cannot open file ", err)

    if err != nil {
        // err is printable
        // elements passed are separated by space automatically
        fmt.Println("Error:", err)
        return
    }

    // automatically call Close() at the end of current method
    defer file.Close()
    //
    reader := csv.NewReader(file)
    // options are available at:
    // http://golang.org/src/pkg/encoding/csv/reader.go?s=3213:3671#L94
    reader.Comma = ';'
    lineCount := 0

    fileWrite, err := os.Create("data/result.csv")
    checkError("Cannot create file", err)
    defer fileWrite.Close()

    writer := csv.NewWriter(fileWrite)
    defer writer.Flush()

    for {
        // read just one record
        record, err := reader.Read()
        // end-of-file is fitted into err
        if err == io.EOF {
            break
        } else if err != nil {
            fmt.Println("Error:", err)
            return
        }

        go func(url string) {
            // fmt.Println(msg)
            doc, err := goquery.NewDocument(url)
            if err != nil {
                checkError("No URL", err)
            }

            metaDescription := make(chan string, 1)
            pageTitle := make(chan string, 1)

            go func() {
                // time.Sleep(time.Second * 2)
                // use CSS selector found with the browser inspector
                // for each, use index and item
                pageTitle <- doc.Find("title").Contents().Text()

                doc.Find("meta").Each(func(index int, item *goquery.Selection) {
                    if item.AttrOr("name", "") == "description" {
                        metaDescription <- item.AttrOr("content", "")
                    }
                })
            }()
            select {
            case res := <-metaDescription:
                resTitle := <-pageTitle
                fmt.Println(res)
                fmt.Println(resTitle)

                // Have been trying to output to CSV here but it's not working

                // writer.Write([]string{url, resTitle, res})
                // err := writer.WriteString(`res`)
                // checkError("Cannot write to file", err)

            case <-time.After(time.Second * 2):
                fmt.Println("timeout 2")
            }

        }(record[0])

        fmt.Println()

        lineCount++
    }
}

func main() {

    csvParsing()

    //Code is to make sure there is a pause before program finishes so we can see output
    var input string
    fmt.Scanln(&input)
}

func checkError(message string, err error) {
    if err != nil {
        log.Fatal(message, err)
    }
}

The data/sample.csv input file with URLs:

    http://jonathanmh.com
    http://keshavmalani.com
    http://google.com
    http://bing.com
    http://facebook.com

1 answer

  • dongqin1861 2017-09-02 07:17

    In the code you supplied, you had commented out the following lines:

    // Have been trying to output to CSV here but it's not working
    err = writer.Write([]string{url, resTitle, res})
    checkError("Cannot write to file", err)
    

    That write call is correct; the problem is elsewhere. Earlier in the function, you have the following code:

    fileWrite, err := os.Create("data/result.csv")
    checkError("Cannot create file", err)
    defer fileWrite.Close()
    

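    The deferred call closes fileWrite as soon as csvParsing() returns, and csvParsing() returns right after the loop has started the goroutines, not after they have finished. The deferred writer.Flush() runs at that same moment. By the time a goroutine calls writer.Write, the file is already closed and nothing is left to flush the buffered row to disk.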
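    The timing can be reproduced in isolation. The sketch below is not the asker's code: writeAfterReturn and lateWriteFails are hypothetical names, and a temporary file stands in for data/result.csv. It writes to the file directly so that the failure is visible as an error. With csv.Writer in between, a short row is only buffered, so the original program would most likely report nothing and just leave result.csv empty.

```go
package main

import (
	"errors"
	"os"
)

// writeAfterReturn mimics the shape of csvParsing: it opens a file, defers
// Close, starts a goroutine that writes later, and returns immediately.
// The goroutine waits on start, so its write always happens after the return.
// (os.CreateTemp needs Go 1.16 or newer.)
func writeAfterReturn(start <-chan struct{}, result chan<- error) error {
	f, err := os.CreateTemp("", "result-*.csv")
	if err != nil {
		return err
	}
	defer os.Remove(f.Name()) // clean up the temp file (runs after Close)
	defer f.Close()           // runs as soon as writeAfterReturn returns

	go func() {
		<-start
		_, werr := f.WriteString("url,title,description\n")
		result <- werr
	}()
	return nil
}

// lateWriteFails reports whether the goroutine's write hit a closed file.
func lateWriteFails() bool {
	start := make(chan struct{})
	result := make(chan error, 1)
	if err := writeAfterReturn(start, result); err != nil {
		return false
	}
	close(start) // writeAfterReturn has returned, so the file is already closed
	return errors.Is(<-result, os.ErrClosed)
}
```

    Calling lateWriteFails() from main should report true: the write is attempted only after the deferred Close has run.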

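    Solution: close fileWrite (and flush writer) only after the goroutines have written to it. You can move the cleanup into your concurrent func, or make csvParsing() wait for all goroutines to finish before it returns. With more than one goroutine the second approach is safer, because a Close inside one goroutine would cut off the others. Note also that csv.Writer is not documented as safe for concurrent use, so it is best to let a single goroutine do the writing.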

    The asker selected this answer as the best answer.