dongpang4470 2017-04-06 16:53

Parsing records from a binary file concurrently in Go

I have a binary file that I want to parse. The file is broken up into records that are 1024 bytes each. The high level steps needed are:

  1. Read 1024 bytes at a time from the file.
  2. Parse each 1024-byte "record" (chunk) and place the parsed data into a map or struct.
  3. Return the parsed data to the user and any error(s).

I'm not looking for code, just design/approach help.

Due to I/O constraints, I don't think it makes sense to attempt concurrent reads from the file. However, I see no reason why the 1024-byte records can't be parsed using goroutines so that multiple 1024-byte records are being parsed concurrently. I'm new to Go, so I wanted to see if this makes sense or if there is a better (faster) way:

  1. A main function opens the file and reads 1024 bytes at a time into byte arrays (records).
  2. The records are passed to a function that parses the data into a map or struct. The parser function would be called as a goroutine on each record.
  3. The parsed maps/structs are appended to a slice via a channel. I would preallocate the slice's backing array with capacity equal to the file size in bytes divided by 1024, since that should be the exact number of records (assuming no errors).

I'd also have to make sure I don't run out of memory, as the file can be anywhere from a few hundred MB up to 256 TB (rare, but possible). Does this make sense, or am I thinking about the problem all wrong? Will this be slower than simply parsing the file linearly as I read it 1024 bytes at a time, or will parsing the records concurrently as byte arrays perform better?
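For concreteness, here's a minimal sketch of the reading step, assuming records are exactly 1024 bytes; io.ReadFull and the channel handoff are just one possible shape, not settled code. Giving the output channel a small buffer capacity would bound how many unparsed records sit in memory at once, which speaks to the memory concern above.

```go
package parser

import (
	"io"
	"os"
)

const recordSize = 1024

// readRecords reads fixed-size records from f and sends each one to out.
// A fresh buffer is allocated per record because the parser goroutines
// will still be using it after this loop has moved on.
func readRecords(f *os.File, out chan<- []byte) error {
	defer close(out)
	for {
		buf := make([]byte, recordSize)
		if _, err := io.ReadFull(f, buf); err != nil {
			if err == io.EOF {
				return nil // clean end of file
			}
			return err // io.ErrUnexpectedEOF means a truncated final record
		}
		out <- buf
	}
}
```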


Cross-posted on Software Engineering


1 answer

  • douquan9826 2017-04-06 17:43

    This is an instance of the producer-consumer problem: the producer is your main function, which generates the 1024-byte records, and the consumers parse those records and send the results to a channel so they can be appended to the final slice. There are a few existing questions tagged producer-consumer and Go that should get you started. As for what is fastest in your case, it depends on too many factors to answer in the abstract; the best solution may be anywhere from a completely sequential implementation to a cluster of servers in which the records are moved around by RabbitMQ or something similar.
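    As a minimal sketch of that layout in Go (Record, parseRecord, and the worker count are placeholders, not anything prescribed by your file format):

    ```go
    package parser

    import "sync"

    // Record is a placeholder for whatever the 1024-byte parser produces.
    type Record struct{}

    // parseRecord stands in for the real parsing logic.
    func parseRecord(b []byte) (Record, error) { return Record{}, nil }

    // parseAll fans raw records out to nWorkers goroutines and collects
    // the parsed results on a single output channel. Results arrive in
    // arbitrary order; carry an index with each record if order matters.
    func parseAll(raw <-chan []byte, nWorkers int) <-chan Record {
    	out := make(chan Record)
    	var wg sync.WaitGroup
    	wg.Add(nWorkers)
    	for i := 0; i < nWorkers; i++ {
    		go func() {
    			defer wg.Done()
    			for b := range raw {
    				rec, err := parseRecord(b)
    				if err != nil {
    					continue // a real implementation would report this
    				}
    				out <- rec
    			}
    		}()
    	}
    	go func() {
    		wg.Wait()
    		close(out) // close once every worker has drained raw
    	}()
    	return out
    }
    ```

    With this shape, a fully sequential version is just nWorkers = 1, which makes it easy to benchmark whether the concurrent parsing actually pays off for your workload.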

    Accepted by the asker as the best answer.
