io.Reader and line-break issues with CSV files

I have an application which deals with CSVs being delivered via RabbitMQ from many different upstream applications, typically 5,000-15,000 rows per file. Most of the time it works great. However, a couple of these upstream applications are old (12-15 years) and the people who wrote them are long gone.

I'm unable to read CSV files from these older applications due to the line breaks. I find this a bit weird, as the line breaks seem to map to UTF-8 carriage returns (http://www.fileformat.info/info/unicode/char/000d/index.htm). Typically the app reads in only the headers from those older files and nothing else.

If I open one of these files in a text editor and save it with UTF-8 encoding, overwriting the existing file, then it works with no issues at all.
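
A hedged note on the symptom described above: U+000D is the single byte 0x0D in both ASCII and UTF-8, so the cause is likely the line-ending convention (bare CR with no LF) rather than the character encoding, and re-saving in an editor probably just rewrites the line endings. A minimal sketch for checking what a file actually contains follows; countLineEndings, lineEndingCounts, and the sample data are hypothetical names used only for illustration.

```go
package main

import "fmt"

// lineEndingCounts tallies the three line-ending styles found in a byte slice.
type lineEndingCounts struct {
	CRLF, LF, CR int
}

// countLineEndings walks data once and classifies every line break it sees.
func countLineEndings(data []byte) lineEndingCounts {
	var c lineEndingCounts
	for i := 0; i < len(data); i++ {
		switch data[i] {
		case '\r':
			if i+1 < len(data) && data[i+1] == '\n' {
				c.CRLF++
				i++ // skip the \n that belongs to this pair
			} else {
				c.CR++
			}
		case '\n':
			c.LF++
		}
	}
	return c
}

func main() {
	// Hypothetical sample: a header and two rows separated by bare \r.
	sample := []byte("id,name\r1,alice\r2,bob\r")
	fmt.Printf("%+v\n", countLineEndings(sample))
}
```

If the CR count dominates and LF is zero, the file uses CR-only line endings, which encoding/csv does not treat as record separators.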

Things I've tried that I expected to work:

- Using a Reader:

    ba := make([]byte, 262144000)
    if _, err := file.Read(ba); err != nil {
        return nil, err
    }
    ba = bytes.Trim(ba, "\x00")
    bb := bytes.NewBuffer(ba)
    reader := csv.NewReader(bb)
    records, err := reader.ReadAll()
    if err != nil {
        return nil, err
    }

- Using the Scanner to read line by line (I get "bufio.Scanner: token too long"):

    scanner := bufio.NewScanner(file)
    var bb bytes.Buffer
    for scanner.Scan() {
        bb.WriteString(fmt.Sprintf("%s\n", scanner.Text()))
    }

    // check for errors
    if err = scanner.Err(); err != nil {
        return nil, err
    }


    reader := csv.NewReader(&bb)
    records, err := reader.ReadAll()
    if err != nil {
        return nil, err
    }
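
The "token too long" error is consistent with a CR-only file: the default split function, bufio.ScanLines, only ends a line at \n, so the whole file becomes one token and exceeds bufio.MaxScanTokenSize (64 KiB). A sketch of a custom split function that also accepts a bare \r is shown here; scanAnyLine and splitLines are hypothetical names, and the approach assumes no quoted CSV field contains an embedded line break.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"strings"
)

// scanAnyLine is a bufio.SplitFunc that ends a line at \n, \r\n, or a bare \r.
func scanAnyLine(data []byte, atEOF bool) (advance int, token []byte, err error) {
	if atEOF && len(data) == 0 {
		return 0, nil, nil
	}
	if i := bytes.IndexAny(data, "\r\n"); i >= 0 {
		if data[i] == '\n' {
			return i + 1, data[:i], nil
		}
		// data[i] is \r: look at the next byte to see whether \n follows.
		if i+1 < len(data) {
			if data[i+1] == '\n' {
				return i + 2, data[:i], nil
			}
			return i + 1, data[:i], nil
		}
		// \r is the last byte available so far.
		if atEOF {
			return i + 1, data[:i], nil
		}
		return 0, nil, nil // ask the Scanner for more data
	}
	if atEOF {
		return len(data), data, nil // final line without a terminator
	}
	return 0, nil, nil
}

// splitLines collects every line of s using scanAnyLine.
func splitLines(s string) ([]string, error) {
	sc := bufio.NewScanner(strings.NewReader(s))
	sc.Split(scanAnyLine)
	var lines []string
	for sc.Scan() {
		lines = append(lines, sc.Text())
	}
	return lines, sc.Err()
}

func main() {
	lines, err := splitLines("id,name\r1,alice\r\n2,bob\n3,carol")
	if err != nil {
		fmt.Println("scan error:", err)
		return
	}
	for _, line := range lines {
		fmt.Printf("%q\n", line)
	}
}
```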

Things I tried that I expected not to work (and which didn't):

- Writing the file contents to a new file (.txt) and reading the file back in (including running dos2unix against the created txt file)
- Reading the file into a standard string (hoping Go's UTF-8 encoding would magically kick in, which of course it doesn't)
- Reading the file into a rune slice, then transforming it to a string via a byte slice

I'm aware of the https://godoc.org/golang.org/x/text/transform package but not too sure of a viable approach - it looks like the src encoding needs to be known to transform.

Am I stupidly overlooking something? Are there any suggestions for how to transform these files into UTF-8, or update the line endings without knowing the file encoding, whilst keeping the application working for all the other valid CSV files being delivered? Are there any options I've not considered that don't involve me going byte by byte and doing a bytes.Replace? I'm hoping there's something really obvious I've overlooked.
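
For reference, the in-memory baseline that the question hopes to avoid can be sketched in a few lines. It assumes the whole file fits comfortably in memory (plausible for 5,000-15,000 rows) and that rewriting carriage returns inside quoted fields is acceptable; normalizeNewlines and parseCSV are hypothetical helper names.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
)

// normalizeNewlines rewrites \r\n and bare \r to \n.
// The \r\n pass runs first so a pair is not turned into two newlines.
func normalizeNewlines(data []byte) []byte {
	data = bytes.ReplaceAll(data, []byte("\r\n"), []byte("\n"))
	return bytes.ReplaceAll(data, []byte("\r"), []byte("\n"))
}

// parseCSV normalizes line endings and then parses all records.
func parseCSV(data []byte) ([][]string, error) {
	return csv.NewReader(bytes.NewReader(normalizeNewlines(data))).ReadAll()
}

func main() {
	// Hypothetical sample with CR-only line endings.
	records, err := parseCSV([]byte("id,name\r1,alice\r2,bob\r"))
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(len(records), "records")
}
```

Files that already use \n or \r\n pass through with the same records, so this can sit in front of the parser for every upstream source.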

Apologies - I can't share the CSV files for obvious reasons.


duanbei7005 2018-02-26 16:27


For anyone who's stumbled on this and wants an answer that doesn't involve strings.Replace, here's a method that wraps an io.Reader to replace solo carriage returns. It could probably be more efficient, but works better with huge files than a strings.Replace-based solution.

https://gist.github.com/b5/78edaae9e6a4248ea06b45d089c277d6

    // ReplaceSoloCarriageReturns wraps an io.Reader; on every call of Read it
    // looks for instances of a lonely \r and replaces them with \r\n before
    // returning to the caller. Lots of files in the wild come without "proper"
    // line breaks, which irritates Go's standard csv package. This fixes it by
    // wrapping the reader passed to csv.NewReader:
    //
    //    rdr := csv.NewReader(ReplaceSoloCarriageReturns(r))
    func ReplaceSoloCarriageReturns(data io.Reader) io.Reader {
        return &crlfReplaceReader{
            rdr: bufio.NewReader(data),
        }
    }

    // crlfReplaceReader wraps a reader
    type crlfReplaceReader struct {
        rdr       *bufio.Reader
        pendingLF bool // a \n is still owed for the lone \r written last
    }

    // Read implements io.Reader for crlfReplaceReader
    func (c *crlfReplaceReader) Read(p []byte) (n int, err error) {
        for n < len(p) {
            // emit the \n owed for a lone \r; deferring it this way avoids
            // writing past the end of p when \r lands in the last slot
            if c.pendingLF {
                p[n] = '\n'
                n++
                c.pendingLF = false
                continue
            }

            p[n], err = c.rdr.ReadByte()
            if err != nil {
                return n, err
            }

            // any time we encounter \r, check to see if \n follows;
            // if the next char is not \n (or we hit EOF), add it in manually
            if p[n] == '\r' {
                pk, perr := c.rdr.Peek(1)
                if (perr == nil && pk[0] != '\n') || perr == io.EOF {
                    c.pendingLF = true
                }
            }

            n++
        }
        return n, nil
    }


This answer was accepted by the asker as the best answer.
