I have an application which deals with CSV's being delivered via RabbitMQ from many different upstream applications - typically 5000-15,000 rows per file. Most of the time it works great. However a couple of these upstream applications are old (12-15 years) and the people who wrote them are long gone.
I'm unable to read CSV files from these older aplications due to the line breaks. I'm finding this a bit weird as the line breaks see to map to UTF-8 Carriage Returns (http://www.fileformat.info/info/unicode/char/000d/index.htm). Typically the app reads in only the headers from those older files and nothing else.
If I open one of these files in a text editor and save as utf-8 encoding overwriting the exiting file then it works with no issues at all.
Things I've tried I expected to work:
-Using a Reader:
ba := make([]byte, 262144000)
if _, err := file.Read(ba); err != nil {
return nil, err
}
ba = bytes.Trim(ba, "\x00")
bb := bytes.NewBuffer(ba)
reader := csv.NewReader(bb)
records, err := reader.ReadAll()
if err != nil {
return nil, err
}
-Using the Scanner to read line by line (get a bufio.Scanner: token too long)
scanner := bufio.NewScanner(file)
var bb bytes.Buffer
for scanner.Scan() {
bb.WriteString(fmt.Sprintf("%s
", scanner.Text()))
}
// check for errors
if err = scanner.Err(); err != nil {
return nil, err
}
reader := csv.NewReader(&bb)
records, err := reader.ReadAll()
if err != nil {
return nil, err
}
Things I tried I expected not to work (and didn't):
- Writing file contents to a new file (.txt) and reading the file back in (including running dos2unix against the created txt file)
- Reading file into a standard string (hoping Go's UTF-8 encoding would magically kick in which of course it doesn't)
- Reading file to Rune slice, then transforming to a string via byte slice
I'm aware of the https://godoc.org/golang.org/x/text/transform package but not too sure of a viable approach - it looks like the src encoding needs to be known to transform.
Am I stupidly overlooking something? Are there any suggestions how to transform these files into UTF-8 or update the line endings without knowing the file encoding whilst keeping the application working for all the other valid CSV files being delivered? Are there any options that don't involve me going byte to byte and doing a bytes.Replace I've not considered? I'm hoping there's something really obvious I've overlooked.
Apologies - I can't share the CSV files for obvious reasons.