dongmei1828 2017-05-30 07:13
Accepted

Concurrent filesystem scan

I want to obtain file information (file name & size in bytes) for the files in a directory. But there are a lot of sub-directories (~1,000) and files (~40,000).

Currently my solution uses filepath.Walk() to obtain file information for each file. But this is quite slow.

package main

import (
    "flag"
    "fmt"
    "os"
    "path/filepath"
)

func visit(path string, f os.FileInfo, err error) error {
    if f.Mode().IsRegular() {
        fmt.Printf("Visited: %s File name: %s Size: %d bytes\n", path, f.Name(), f.Size())
    }
    return nil
}

func main() {
    flag.Parse()
    root := "C:/Users/HERNOUX-06523/go/src/boilerpipe" //flag.Arg(0)
    filepath.Walk(root, visit)
}

Is it possible to do parallel/concurrent processing using filepath.Walk()?


1 answer

  • dqjmq28248 2017-05-30 07:40

    You may do concurrent processing by modifying your visit() function so that it does not descend into subfolders itself, but instead launches a new goroutine for each subfolder.

    In order to do that, return the special filepath.SkipDir error from your visit() function if the entry is a directory. Don't forget to check whether the path inside visit() is the subfolder the goroutine ought to process, because the folder itself is also passed to visit(); without this check you would launch goroutines endlessly for the initial folder.

    You will also need some kind of "counter" of how many goroutines are still working in the background; for that you may use sync.WaitGroup.

    Here's a simple implementation of this:

    package main

    import (
        "flag"
        "fmt"
        "os"
        "path/filepath"
        "sync"
    )

    var wg sync.WaitGroup
    
    func walkDir(dir string) {
        defer wg.Done()
    
        visit := func(path string, f os.FileInfo, err error) error {
            if err != nil {
                return err // f may be nil when err is non-nil
            }
            if f.IsDir() && path != dir {
                wg.Add(1)
                go walkDir(path)
                return filepath.SkipDir
            }
            if f.Mode().IsRegular() {
                fmt.Printf("Visited: %s File name: %s Size: %d bytes\n",
                    path, f.Name(), f.Size())
            }
            return nil
        }
    
        filepath.Walk(dir, visit)
    }
    
    func main() {
        flag.Parse()
        root := "folder/to/walk" //flag.Arg(0)
    
        wg.Add(1)
        walkDir(root)
        wg.Wait()
    }
    

    Some notes:

    Depending on the "distribution" of files among subfolders, this may not fully utilize your CPU / storage: if, for example, 99% of all the files are in one subfolder, that goroutine will still take the majority of the time.

    Also note that fmt.Printf() calls are serialized, so that will also slow down the process. I assume this was just an example, and in reality you will do some kind of processing / statistics in-memory. Don't forget to also protect concurrent access to variables accessed from your visit() function.
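    For instance, a shared counter updated from visit() must be guarded. Here is a minimal sketch using sync.Mutex; the `stats` type and its field names are illustrative, not from the original answer:

    ```go
    package main

    import (
    	"fmt"
    	"sync"
    )

    // stats accumulates results from concurrent walkers;
    // the mutex guards both fields against concurrent access.
    type stats struct {
    	mu    sync.Mutex
    	files int
    	bytes int64
    }

    func (s *stats) add(size int64) {
    	s.mu.Lock()
    	defer s.mu.Unlock()
    	s.files++
    	s.bytes += size
    }

    func main() {
    	var s stats
    	var wg sync.WaitGroup
    	// Simulate many goroutines reporting file sizes concurrently.
    	for i := 0; i < 100; i++ {
    		wg.Add(1)
    		go func(size int64) {
    			defer wg.Done()
    			s.add(size)
    		}(int64(i))
    	}
    	wg.Wait()
    	fmt.Println(s.files, s.bytes) // 100 4950
    }
    ```

    For plain counters sync/atomic would also do; a mutex generalizes to updating several fields together.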

    Don't worry about the high number of subfolders. It is normal and the Go runtime is capable of handling even hundreds of thousands of goroutines.

    Also note that most likely the performance bottleneck will be your storage / hard disk speed, so you may not gain the performance you wish. After a certain point (your hard disk limit), you won't be able to improve performance.

    Also, launching a new goroutine for each subfolder may not be optimal; you may get better performance by limiting the number of goroutines walking your folders. For that, check out and use a worker pool:

    Is this an idiomatic worker thread pool in Go?
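    As a lighter alternative to a full worker pool, a buffered channel can serve as a semaphore that bounds how many directories are scanned at once. A sketch under that assumption; the temporary tree built in main (3 subfolders × 4 files) exists only so the example is self-contained:

    ```go
    package main

    import (
    	"fmt"
    	"os"
    	"path/filepath"
    	"sync"
    )

    var (
    	wg     sync.WaitGroup
    	sem    = make(chan struct{}, 8) // at most 8 directories scanned at once
    	mu     sync.Mutex
    	nFiles int
    )

    func walkDir(dir string) {
    	defer wg.Done()
    	sem <- struct{}{}        // acquire a concurrency slot
    	defer func() { <-sem }() // release it when this directory is done

    	filepath.Walk(dir, func(path string, f os.FileInfo, err error) error {
    		if err != nil {
    			return err
    		}
    		if f.IsDir() && path != dir {
    			wg.Add(1)
    			go walkDir(path) // descend concurrently, bounded by sem
    			return filepath.SkipDir
    		}
    		if f.Mode().IsRegular() {
    			mu.Lock()
    			nFiles++
    			mu.Unlock()
    		}
    		return nil
    	})
    }

    func main() {
    	// Build a small temporary tree to scan: 3 subdirs x 4 files each.
    	root, _ := os.MkdirTemp("", "scan")
    	defer os.RemoveAll(root)
    	for i := 0; i < 3; i++ {
    		sub := filepath.Join(root, fmt.Sprintf("sub%d", i))
    		os.Mkdir(sub, 0o755)
    		for j := 0; j < 4; j++ {
    			os.WriteFile(filepath.Join(sub, fmt.Sprintf("f%d.txt", j)), []byte("x"), 0o644)
    		}
    	}

    	wg.Add(1)
    	walkDir(root)
    	wg.Wait()
    	fmt.Println("files:", nFiles) // files: 12
    }
    ```

    Unlike a worker pool this still creates one goroutine per subfolder, but only 8 of them do I/O at any moment, which is usually what matters for disk-bound scans.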

    This answer was accepted as the best answer by the asker.
