doupingdiao3546 2017-07-06 11:20
Views: 97
Accepted

io.Reader and line-break problems with CSV files

I have an application which deals with CSVs being delivered via RabbitMQ from many different upstream applications, typically 5,000-15,000 rows per file. Most of the time it works great. However, a couple of these upstream applications are old (12-15 years) and the people who wrote them are long gone.

I'm unable to read CSV files from these older applications due to the line breaks. I find this a bit weird, as the line breaks seem to map to UTF-8 carriage returns (http://www.fileformat.info/info/unicode/char/000d/index.htm). Typically the app reads in only the headers from those older files and nothing else.

If I open one of these files in a text editor and save it with UTF-8 encoding, overwriting the existing file, then it works with no issues at all.

Things I've tried that I expected to work:

-Using a Reader:

    ba := make([]byte, 262144000)
    if _, err := file.Read(ba); err != nil {
        return nil, err
    }
    ba = bytes.Trim(ba, "\x00")
    bb := bytes.NewBuffer(ba)
    reader := csv.NewReader(bb)
    records, err := reader.ReadAll()
    if err != nil {
        return nil, err
    }

-Using the Scanner to read line by line (this fails with "bufio.Scanner: token too long"):

    scanner := bufio.NewScanner(file)
    var bb bytes.Buffer
    for scanner.Scan() {
        bb.WriteString(fmt.Sprintf("%s\n", scanner.Text()))
    }

    // check for errors
    if err = scanner.Err(); err != nil {
        return nil, err
    }


    reader := csv.NewReader(&bb)
    records, err := reader.ReadAll()
    if err != nil {
        return nil, err
    }
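A note on the Scanner attempt above: the default split function, bufio.ScanLines, only ends a token at \n. If a file uses lone \r as its line break, the whole file would look like a single line, and anything over the default 64 KiB token limit (bufio.MaxScanTokenSize) would produce exactly this "token too long" error. The sketch below shows one possible custom split function that treats \n, \r\n, and a lone \r as line terminators. The names scanAnyNewline and splitLines are made up for illustration, and the sample input is invented; very long individual lines would still need a larger limit via Scanner.Buffer.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"strings"
)

// scanAnyNewline is a hypothetical bufio.SplitFunc that ends a token at
// \n, \r\n, or a lone \r, and drops the terminator from the token.
func scanAnyNewline(data []byte, atEOF bool) (advance int, token []byte, err error) {
	if atEOF && len(data) == 0 {
		return 0, nil, nil
	}
	if i := bytes.IndexAny(data, "\r\n"); i >= 0 {
		if data[i] == '\n' {
			return i + 1, data[:i], nil
		}
		// data[i] is \r: decide whether a \n follows it.
		if i+1 < len(data) {
			if data[i+1] == '\n' {
				return i + 2, data[:i], nil
			}
			return i + 1, data[:i], nil
		}
		if atEOF {
			return i + 1, data[:i], nil
		}
		// \r is the last byte buffered so far; ask for more data first.
		return 0, nil, nil
	}
	if atEOF {
		// Final line without a terminator.
		return len(data), data, nil
	}
	return 0, nil, nil
}

// splitLines collects every line produced by scanAnyNewline.
func splitLines(s string) ([]string, error) {
	sc := bufio.NewScanner(strings.NewReader(s))
	sc.Split(scanAnyNewline)
	var lines []string
	for sc.Scan() {
		lines = append(lines, sc.Text())
	}
	return lines, sc.Err()
}

func main() {
	// Invented sample mixing all three line-ending styles.
	lines, err := splitLines("a,b\r1,2\r\n3,4\n5,6")
	fmt.Println(len(lines), err)
}
```

Note that this splits on every \r, including one inside a quoted CSV field, so it is only a reasonable fit when fields never contain embedded line breaks.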

Things I tried that I expected not to work (and that didn't):

  • Writing file contents to a new file (.txt) and reading the file back in (including running dos2unix against the created txt file)
  • Reading file into a standard string (hoping Go's UTF-8 encoding would magically kick in which of course it doesn't)
  • Reading file to Rune slice, then transforming to a string via byte slice

I'm aware of the https://godoc.org/golang.org/x/text/transform package but not too sure of a viable approach - it looks like the src encoding needs to be known to transform.

Am I stupidly overlooking something? Are there any suggestions for how to transform these files into UTF-8, or update the line endings without knowing the file encoding, whilst keeping the application working for all the other valid CSV files being delivered? Are there any options I've not considered that don't involve me going byte by byte and doing a bytes.Replace? I'm hoping there's something really obvious I've overlooked.

Apologies - I can't share the CSV files for obvious reasons.

2 answers

  • duanbei7005 2018-02-27 00:27

    For anyone who's stumbled on this and wants an answer that doesn't involve strings.Replace, here's a method that wraps an io.Reader to replace solo carriage returns. It could probably be more efficient, but works better with huge files than a strings.Replace-based solution.

    https://gist.github.com/b5/78edaae9e6a4248ea06b45d089c277d6

    // ReplaceSoloCarriageReturns wraps an io.Reader. On every call of Read it
    // looks for instances of lonely \r, replacing them with \r\n before
    // returning to the end customer.
    // Lots of files in the wild will come without "proper" line breaks, which irritates Go's
    // standard csv package. This'll fix it by wrapping the reader passed to csv.NewReader:
    //    rdr := csv.NewReader(ReplaceSoloCarriageReturns(r))
    //
    func ReplaceSoloCarriageReturns(data io.Reader) io.Reader {
        return &crlfReplaceReader{
            rdr: bufio.NewReader(data),
        }
    }

    // crlfReplaceReader wraps a reader
    type crlfReplaceReader struct {
        rdr *bufio.Reader
        // pendingLF is set when a \n still has to be emitted after a lonely \r.
        // Keeping it as state avoids writing past the end of p when the \r
        // lands in the last free slot of the caller's buffer.
        pendingLF bool
    }

    // Read implements io.Reader for crlfReplaceReader
    func (c *crlfReplaceReader) Read(p []byte) (n int, err error) {
        for n < len(p) {
            // emit the \n owed for a previously seen lonely \r
            if c.pendingLF {
                p[n] = '\n'
                c.pendingLF = false
                n++
                continue
            }

            p[n], err = c.rdr.ReadByte()
            if err != nil {
                return n, err
            }

            // any time we encounter \r, check to see if \n follows;
            // if the next char is not \n (or we are at EOF), add it in manually
            if p[n] == '\r' {
                if pk, perr := c.rdr.Peek(1); (perr == nil && pk[0] != '\n') || perr == io.EOF {
                    c.pendingLF = true
                }
            }

            n++
        }
        return n, nil
    }
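For comparison, the in-memory replace approach that this answer contrasts itself with could look roughly like the sketch below. It is not part of the linked gist: normalizeNewlines and parseCSV are hypothetical helper names, the sample data is invented, and it assumes the whole file fits comfortably in memory.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
)

// normalizeNewlines is a hypothetical helper: it rewrites \r\n and lone \r
// as \n so that encoding/csv sees one record per line.
func normalizeNewlines(data []byte) []byte {
	data = bytes.ReplaceAll(data, []byte("\r\n"), []byte("\n"))
	return bytes.ReplaceAll(data, []byte("\r"), []byte("\n"))
}

// parseCSV normalizes line endings in memory and then parses all records.
func parseCSV(data []byte) ([][]string, error) {
	return csv.NewReader(bytes.NewReader(normalizeNewlines(data))).ReadAll()
}

func main() {
	// Invented sample that uses lone carriage returns as line breaks.
	records, err := parseCSV([]byte("id,name\r1,alice\r2,bob\r"))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(records), records[1][1])
}
```

This makes two extra passes over the data and can allocate new copies of the file, which is why a streaming wrapper tends to suit huge files better. It also rewrites any \r that sits inside a quoted field.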
    
    This answer was selected by the asker as the best answer.