Golang XML parsing stops with an "invalid UTF-8" error

I am having an issue unmarshaling XML that contains non-ASCII Unicode characters.

When I parse XML that contains only standard English characters, the entire file is parsed and unmarshaled correctly. However, if the XML file contains a character such as ñ, á, or – (en dash), parsing stops and only the items in the array that come before the item with that character are returned.

For example, here is XML:

<items>
  <item>
    <ID value="1" name="Item 1" GCName="Item 1" />
  </item>
  <item>
    <ID value="2" name="Item 2" GCName="Item 2" />
  </item>
  <item>
    <ID value="3" name="Item 3" GCName="Item 3 With ñ" />
  </item>
  <item>
    <ID value="4" name="Item 4" GCName="Item 4" />
  </item>
</items>

This is my Go code (rough without any imports):

# main.go

type Response struct {
    Items []Items `xml:"items"`
}

type Items struct {
    Item []Item `xml:"item"`
}

type Item struct {
    ID    ItemID `xml:"ID"`
}

type ItemID struct {
    Value  string `xml:"value,attr"`
    Name   string `xml:"name,attr"`
    GCName string `xml:"GCName,attr"`
}

func main() {
    xmlFile, err := os.Open(`C:\path\to\xml\file.xml`) // raw string: backslashes are not escape sequences
    if err != nil {
        fmt.Println("Error opening file!")
        fmt.Println(err.Error())
    }
    defer xmlFile.Close()

    xmlData, err := io.ReadAll(xmlFile)
    if err != nil {
        fmt.Println("Error reading file!")
        fmt.Println(err.Error())
    }

    var response Response
    err = xml.Unmarshal(xmlData, &response) // err is already declared above, so assign with =
    if err != nil {
        fmt.Println("Error unmarshaling XML")
        fmt.Println(err.Error())
    }
    fmt.Println(response)
}

This code will print out only the first two items, as if they were the only two. It will also output:

Error unmarshaling XML
XML syntax error on line 9; Invalid UTF-8

I have also tried using xml.Decoder with a CharsetReader using a different encoding, but this did not yield any different results. FWIW, I am using Windows.

Is there a way I can get around this error? Swap out the "bad" characters for something else? It was my understanding that those characters are valid UTF-8...so what gives??
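
Whether those characters are valid UTF-8 depends on how the file was saved, not on the characters themselves. A minimal sketch for checking this, assuming the file contents are already in a byte slice: the helper name `firstInvalidUTF8` is hypothetical, and the sample bytes are made up to mirror the `GCName` value from the XML above. In UTF-8, ñ is the two bytes 0xC3 0xB1; in ISO-8859-1 (and Windows-1252) it is the single byte 0xF1, which is not valid UTF-8 on its own.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstInvalidUTF8 (hypothetical helper) returns the byte offset of the first
// byte that does not start a valid UTF-8 sequence, or -1 if data is entirely
// valid UTF-8.
func firstInvalidUTF8(data []byte) int {
	for i := 0; i < len(data); {
		r, size := utf8.DecodeRune(data[i:])
		if r == utf8.RuneError && size == 1 {
			// DecodeRune signals an invalid byte as RuneError with width 1.
			return i
		}
		i += size
	}
	return -1
}

func main() {
	// Made-up sample data: the same text saved in two different encodings.
	asUTF8 := []byte("Item 3 With \xc3\xb1") // ñ encoded as UTF-8
	asLatin1 := []byte("Item 3 With \xf1")   // ñ encoded as ISO-8859-1

	fmt.Println(utf8.Valid(asUTF8), firstInvalidUTF8(asUTF8))
	fmt.Println(utf8.Valid(asLatin1), firstInvalidUTF8(asLatin1))
}
```

If the check reports an invalid byte at the position of the accented character, the file is most likely saved in a single-byte encoding rather than UTF-8. Note also that `encoding/xml` only consults `CharsetReader` when the XML declaration names a non-UTF-8 encoding, so a file without such a declaration never reaches it.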

Thanks in advance!

2 answers

duanliaolan6178 2016-10-18 08:49

Reader that filters out invalid UTF-8 characters

package main

import (
    "bufio"
    "io"
    "unicode"
    "unicode/utf8"
)

// ValidUTF8Reader implements an io.Reader that passes through only bytes
// that form valid UTF-8 sequences.
type ValidUTF8Reader struct {
    buffer *bufio.Reader
}

// Read fills b with valid UTF-8 from the underlying reader and silently drops
// invalid bytes. n is the number of bytes written to b.
func (rd ValidUTF8Reader) Read(b []byte) (n int, err error) {
    for {
        var r rune
        var size int
        r, size, err = rd.buffer.ReadRune()
        if err != nil {
            return
        }
        if r == unicode.ReplacementChar && size == 1 {
            // ReadRune reports an invalid byte as U+FFFD with size 1; skip it.
            continue
        } else if n+size <= len(b) {
            // The rune fits in the remaining space (<= so the last slot is usable).
            utf8.EncodeRune(b[n:], r)
            n += size
        } else {
            // No room left in b; push the rune back for the next call.
            rd.buffer.UnreadRune()
            break
        }
    }
    return
}

// NewValidUTF8Reader constructs a new ValidUTF8Reader that wraps an existing io.Reader.
func NewValidUTF8Reader(rd io.Reader) ValidUTF8Reader {
    return ValidUTF8Reader{bufio.NewReader(rd)}
}

taken from here

This answer was selected by the asker as the best answer.
