Processing a large CSV file and limiting goroutines

I'm trying to find the most efficient way to read a CSV file (~1M rows). Each row contains an HTTP link to an image that I need to download.

This is my current code using worker pools:

func worker(queue chan []string, worknumber int, done, ks chan bool) {
    for {
        select {
        case url := <-queue:
            fmt.Println("doing work!", url, "worknumber", worknumber)
            processData(url) // HTTP download
            done <- true
        case <-ks:
            fmt.Println("worker halted, number", worknumber)
            return
        }
    }
}

func main() {
    start := time.Now()
    flag.Parse()
    fmt.Print(strings.Join(flag.Args(), "\n"))
    if *filename == "REQUIRED" {
        return
    }

    csvfile, err := os.Open(*filename)
    if err != nil {
        fmt.Println(err)
        return
    }
    count, _ := lineCounter(csvfile)
    fmt.Printf("Total count: %d\n", count)
    csvfile.Seek(0, 0)

    defer csvfile.Close()

    //bar := pb.StartNew(count)
    bar := progressbar.NewOptions(count)
    bar.RenderBlank()

    reader := csv.NewReader(csvfile)

    //channel for terminating the workers
    killsignal := make(chan bool)

    //queue of jobs
    q := make(chan []string)
    // done channel takes the result of the job
    done := make(chan bool)

    numberOfWorkers := *numChannels
    for i := 0; i < numberOfWorkers; i++ {
        go worker(q, i, done, killsignal)
    }

    i := 0
    for {
        record, err := reader.Read()
        if err == io.EOF {
            break
        } else if err != nil {
            fmt.Println(err)
            return
        }
        i++

        go func(r []string, i int) {
            q <- r
            bar.Add(1)
        }(record, i)
    }

    // a deadlock occurs if c >= numberOfJobs
    for c := 0; c < count; c++ {
        <-done
    }

    fmt.Println("finished")

    // cleaning workers
    close(killsignal)
    time.Sleep(2 * time.Second)

    fmt.Printf("\n%.2fs", time.Since(start).Seconds())
}

My issue is that this spawns a huge number of goroutines, uses all the memory, and crashes.

What would be the best way to limit it?

3 answers

duanlun2827 2019-05-27 14:51

I stripped out the progress bar because I did not want to deal with it, but overall this is closer to what you are looking for.

It does not really handle errors; any error simply terminates the program via log.Fatal.

I have added context and cancellation support.

You might want to check out https://godoc.org/golang.org/x/sync/errgroup#Group.Go

As a general recommendation, you need to learn the common Go concurrency patterns and how to use them.

It is obvious that you have not practiced them enough yet, or that you are still in the process of learning.

It's not the fastest program by any means, but it does the job.

This is only a draft to point you in a better direction.

package main

import (
    "context"
    "encoding/csv"
    "flag"
    "fmt"
    "io"
    "log"
    "os"
    "os/signal"
    "sync"
    "time"
)

func worker(ctx context.Context, dst chan string, src chan []string) {
    for {
        select {
        case url, ok := <-src: // you must check for readable state of the channel.
            if !ok {
                return
            }
            dst <- fmt.Sprintf("out of %v", url) // do something useful.
        case <-ctx.Done(): // if the context is cancelled, quit.
            return
        }
    }
}

func main() {

    // create a context
    ctx, cancel := context.WithCancel(context.Background())
    defer cancel()
    // that cancels at ctrl+C
    go onSignal(os.Interrupt, cancel)

    // parse command line arguments
    var filename string
    var numberOfWorkers int
    flag.StringVar(&filename, "filename", "", "src file")
    flag.IntVar(&numberOfWorkers, "c", 2, "concurrent workers")
    flag.Parse()

    // check arguments
    if filename == "" {
        log.Fatal("filename required")
    }

    start := time.Now()

    csvfile, err := os.Open(filename)
    if err != nil {
        log.Fatal(err)
    }
    defer csvfile.Close()

    reader := csv.NewReader(csvfile)

    // create the pair of input/output channels for controller => workers communication.
    src := make(chan []string)
    out := make(chan string)

    // use a waitgroup to manage synchronization
    var wg sync.WaitGroup

    // declare the workers
    for i := 0; i < numberOfWorkers; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            worker(ctx, out, src)
        }()
    }

    // read the csv and write it to src
    go func() {
        for {
            record, err := reader.Read()
            if err == io.EOF {
                break
            } else if err != nil {
                log.Fatal(err)
            }
            src <- record // you might select on ctx.Done().
        }
        close(src) // close src to signal workers that no more jobs are coming.
    }()

    // wait for worker group to finish and close out
    go func() {
        wg.Wait() // wait for writers to quit.
        close(out) // when you close(out) it breaks the below loop.
    }()

    // drain the output
    for res := range out {
        fmt.Println(res)
    }

    fmt.Printf("\n%.2fs", time.Since(start).Seconds())
}

func onSignal(s os.Signal, h func()) {
    c := make(chan os.Signal, 1)
    signal.Notify(c, s)
    <-c
    h()
}

This answer was selected by the asker as the best answer.

Question asked: 2019-05-27 11:46