dongmei1828 2017-05-30 07:13
Accepted

Concurrent filesystem scan

I want to obtain file information (file name & size in bytes) for the files in a directory. But there are a lot of sub-directories (~1,000) and files (~40,000).

Currently my solution uses filepath.Walk() to obtain file information for each file. But this is quite slow.

package main

import (
    "flag"
    "fmt"
    "os"
    "path/filepath"
)

func visit(path string, f os.FileInfo, err error) error {
    if f.Mode().IsRegular() {
        fmt.Printf("Visited: %s File name: %s Size: %d bytes\n", path, f.Name(), f.Size())
    }
    return nil
}

func main() {
    flag.Parse()
    root := "C:/Users/HERNOUX-06523/go/src/boilerpipe" //flag.Arg(0)
    filepath.Walk(root, visit)
}

Is it possible to do parallel/concurrent processing using filepath.Walk()?


1 answer

  • dqjmq28248 2017-05-30 07:40

    You may do concurrent processing by modifying your visit() function so that it does not descend into subfolders itself, but instead launches a new goroutine for each subfolder.

    In order to do that, return the special filepath.SkipDir error from your visit() function if the entry is a directory. Don't forget to check whether the path inside visit() is the subfolder the goroutine ought to process, because the folder itself is also passed to visit(); without this check you would launch goroutines endlessly for the initial folder.

    You will also need some kind of "counter" of how many goroutines are still working in the background; for that you may use sync.WaitGroup.

    Here's a simple implementation of this:

    package main

    import (
        "flag"
        "fmt"
        "os"
        "path/filepath"
        "sync"
    )

    var wg sync.WaitGroup
    
    func walkDir(dir string) {
        defer wg.Done()
    
        visit := func(path string, f os.FileInfo, err error) error {
            if err != nil {
                return err // f may be nil when err is non-nil
            }
            if f.IsDir() && path != dir {
                wg.Add(1)
                go walkDir(path)
                return filepath.SkipDir
            }
            if f.Mode().IsRegular() {
                fmt.Printf("Visited: %s File name: %s Size: %d bytes\n",
                    path, f.Name(), f.Size())
            }
            return nil
        }
    
        filepath.Walk(dir, visit)
    }
    
    func main() {
        flag.Parse()
        root := "folder/to/walk" //flag.Arg(0)
    
        wg.Add(1)
        walkDir(root)
        wg.Wait()
    }
    

    Some notes:

    Depending on the "distribution" of files among subfolders, this may not fully utilize your CPU / storage: if, for example, 99% of all the files are in one subfolder, that goroutine will still take the majority of the time.

    Also note that fmt.Printf() calls are serialized, so that will also slow down the process. I assume this was just an example, and in reality you will do some kind of processing / statistics in-memory. Don't forget to also protect concurrent access to variables accessed from your visit() function.
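    For instance, a shared counter updated from visit() must be guarded. Here is a minimal sketch using sync.Mutex; the `stats` type and its field names are illustrative, not from the original answer:

    ```go
    package main

    import (
    	"fmt"
    	"sync"
    )

    // stats accumulates results from concurrent walkers;
    // the mutex guards both fields against concurrent access.
    type stats struct {
    	mu    sync.Mutex
    	files int
    	bytes int64
    }

    func (s *stats) add(size int64) {
    	s.mu.Lock()
    	defer s.mu.Unlock()
    	s.files++
    	s.bytes += size
    }

    func main() {
    	var s stats
    	var wg sync.WaitGroup
    	// Simulate many goroutines reporting file sizes concurrently.
    	for i := 0; i < 100; i++ {
    		wg.Add(1)
    		go func(size int64) {
    			defer wg.Done()
    			s.add(size)
    		}(int64(i))
    	}
    	wg.Wait()
    	fmt.Println(s.files, s.bytes) // 100 4950
    }
    ```

    For plain counters sync/atomic would also do; a mutex generalizes to updating several fields together.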

    Don't worry about the high number of subfolders. It is normal and the Go runtime is capable of handling even hundreds of thousands of goroutines.

    Also note that most likely the performance bottleneck will be your storage / hard disk speed, so you may not gain the performance you wish. After a certain point (your hard disk limit), you won't be able to improve performance.

    Also, launching a new goroutine for each subfolder may not be optimal; you may get better performance by limiting the number of goroutines walking your folders. For that, check out and use a worker pool:

    Is this an idiomatic worker thread pool in Go?
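    As a lighter alternative to a full worker pool, a buffered channel can serve as a semaphore that bounds how many directories are scanned at once. A sketch under that assumption; the temporary tree built in main (3 subfolders × 4 files) exists only so the example is self-contained:

    ```go
    package main

    import (
    	"fmt"
    	"os"
    	"path/filepath"
    	"sync"
    )

    var (
    	wg     sync.WaitGroup
    	sem    = make(chan struct{}, 8) // at most 8 directories scanned at once
    	mu     sync.Mutex
    	nFiles int
    )

    func walkDir(dir string) {
    	defer wg.Done()
    	sem <- struct{}{}        // acquire a concurrency slot
    	defer func() { <-sem }() // release it when this directory is done

    	filepath.Walk(dir, func(path string, f os.FileInfo, err error) error {
    		if err != nil {
    			return err
    		}
    		if f.IsDir() && path != dir {
    			wg.Add(1)
    			go walkDir(path) // descend concurrently, bounded by sem
    			return filepath.SkipDir
    		}
    		if f.Mode().IsRegular() {
    			mu.Lock()
    			nFiles++
    			mu.Unlock()
    		}
    		return nil
    	})
    }

    func main() {
    	// Build a small temporary tree to scan: 3 subdirs x 4 files each.
    	root, _ := os.MkdirTemp("", "scan")
    	defer os.RemoveAll(root)
    	for i := 0; i < 3; i++ {
    		sub := filepath.Join(root, fmt.Sprintf("sub%d", i))
    		os.Mkdir(sub, 0o755)
    		for j := 0; j < 4; j++ {
    			os.WriteFile(filepath.Join(sub, fmt.Sprintf("f%d.txt", j)), []byte("x"), 0o644)
    		}
    	}

    	wg.Add(1)
    	walkDir(root)
    	wg.Wait()
    	fmt.Println("files:", nFiles) // files: 12
    }
    ```

    Unlike a worker pool this still creates one goroutine per subfolder, but only 8 of them do I/O at any moment, which is usually what matters for disk-bound scans.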

    This answer was accepted as the best answer by the asker.
