duanjian7617 2016-10-17 22:59

Go XML parsing stops with an "invalid UTF-8" error

I am having an issue unmarshaling XML with unicode characters.

When attempting to parse XML with standard English characters, it parses the entire file and unmarshals correctly without any issues. However, if the XML file contains a character such as ñ, á, or – (em-dash), it stops parsing the XML and only returns the items in the array that come before the item with that character.

For example, here is XML:

<items>
  <item>
    <ID value="1" name="Item 1" GCName="Item 1" />
  </item>
  <item>
    <ID value="2" name="Item 2" GCName="Item 2" />
  </item>
  <item>
    <ID value="3" name="Item 3" GCName="Item 3 With ñ" />
  </item>
  <item>
    <ID value="4" name="Item 4" GCName="Item 4" />
  </item>
</items>

This is my Go code:

// main.go
package main

import (
    "encoding/xml"
    "fmt"
    "io"
    "os"
)

type Response struct {
    Items []Items `xml:"items"`
}

type Items struct {
    Item []Item `xml:"item"`
}

type Item struct {
    ID    ItemID `xml:"ID"`
}

type ItemID struct {
    Value  string `xml:"value,attr"`
    Name   string `xml:"name,attr"`
    GCName string `xml:"GCName,attr"`
}

func main() {
    // Raw string literal, so the backslashes in the Windows path are not treated as escapes.
    xmlFile, err := os.Open(`C:\path\to\xml\file.xml`)
    if err != nil {
        fmt.Println("Error opening file!")
        fmt.Println(err.Error())
    }
    defer xmlFile.Close()

    xmlData, err := io.ReadAll(xmlFile)
    if err != nil {
        fmt.Println("Error reading file!")
        fmt.Println(err.Error())
    }

    var response Response
    err = xml.Unmarshal(xmlData, &response) // err is already declared above, so assign with =
    if err != nil {
        fmt.Println("Error unmarshaling XML")
        fmt.Println(err.Error())
    }
    fmt.Println(response)
}

This code will print out only the first two items, as if they were the only two. It will also output:

Error unmarshaling XML
XML syntax error on line 9: invalid UTF-8

I have also tried using xml.Decoder with a CharsetReader using a different encoding, but this did not yield any different results. FWIW, I am using Windows.

Is there a way I can get around this error? Swap out the "bad" characters for something else? It was my understanding that those characters are valid UTF-8...so what gives??

Thanks in advance!

2 answers

  • duanliaolan6178 2016-10-18 08:49

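    Those characters are valid UTF-8 only if the file is actually saved as UTF-8. On Windows the file was most likely written in a legacy code page (for example Windows-1252), where ñ is a single byte that is not valid UTF-8 on its own. Here is a Reader that filters out invalid UTF-8 bytes before they reach the XML parser: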

    package main

    import (
        "bufio"
        "io"
        "unicode"
        "unicode/utf8"
    )

    // ValidUTF8Reader implements an io.Reader that passes through only bytes
    // that form valid UTF-8 and silently drops everything else.
    type ValidUTF8Reader struct {
        buffer *bufio.Reader
    }

    // Read fills b with valid UTF-8. n is the number of bytes written to b.
    func (rd ValidUTF8Reader) Read(b []byte) (n int, err error) {
        for {
            var r rune
            var size int
            r, size, err = rd.buffer.ReadRune()
            if err != nil {
                return
            }
            if r == unicode.ReplacementChar && size == 1 {
                // ReadRune reports an invalid byte as (U+FFFD, 1); skip it.
                continue
            } else if n+size <= len(b) {
                utf8.EncodeRune(b[n:], r)
                n += size
            } else {
                // No room left in b; push the rune back for the next call.
                rd.buffer.UnreadRune()
                break
            }
        }
        return
    }

    // NewValidUTF8Reader constructs a new ValidUTF8Reader that wraps an existing io.Reader
    func NewValidUTF8Reader(rd io.Reader) ValidUTF8Reader {
        return ValidUTF8Reader{bufio.NewReader(rd)}
    }
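
    To use the reader, wrap the file and hand it to a decoder: xml.NewDecoder(NewValidUTF8Reader(xmlFile)).Decode(&response). If the whole file is already in memory, as in the question, roughly the same filtering can be done with bytes.ToValidUTF8 from the standard library (Go 1.13 or later). The sketch below assumes that; parseLossy, itemID and itemList are made-up names for illustration, not part of the question's code.

    ```go
    package main

    import (
        "bytes"
        "encoding/xml"
        "fmt"
    )

    // itemID and itemList are illustrative types mirroring the question's XML.
    type itemID struct {
        Value  string `xml:"value,attr"`
        GCName string `xml:"GCName,attr"`
    }

    type itemList struct {
        Items []struct {
            ID itemID `xml:"ID"`
        } `xml:"item"`
    }

    // parseLossy (hypothetical helper) removes every invalid UTF-8 byte
    // sequence from data and then unmarshals what is left.
    func parseLossy(data []byte) (itemList, error) {
        var list itemList
        clean := bytes.ToValidUTF8(data, nil) // nil replacement drops the bad bytes
        err := xml.Unmarshal(clean, &list)
        return list, err
    }

    func main() {
        // 0xF1 is "ñ" in Latin-1/Windows-1252 but is not valid UTF-8 by itself.
        data := []byte("<items><item><ID value=\"3\" GCName=\"Item 3 With \xf1\" /></item>" +
            "<item><ID value=\"4\" GCName=\"Item 4\" /></item></items>")

        list, err := parseLossy(data)
        if err != nil {
            fmt.Println("unmarshal failed:", err)
            return
        }
        for _, it := range list.Items {
            fmt.Printf("%s: %q\n", it.ID.Value, it.ID.GCName)
        }
    }
    ```

    Either way the parser no longer stops at item 3, but the ñ itself is lost, so this only fits when dropping those characters is acceptable.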
    

    taken from here
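
    If you would rather keep the characters than drop them, the input can be re-encoded instead. The following is a minimal sketch that assumes the file is ISO-8859-1 (Latin-1), where every byte value equals its Unicode code point, so the standard library is enough. latin1ToUTF8 and ensureUTF8 are hypothetical helper names. For Windows-1252, which differs from Latin-1 in the 0x80-0x9F range (where the dash characters live), a proper decoder such as the one in golang.org/x/text/encoding/charmap would be needed instead.

    ```go
    package main

    import (
        "fmt"
        "unicode/utf8"
    )

    // latin1ToUTF8 (hypothetical helper) re-encodes ISO-8859-1 bytes as UTF-8.
    // Each Latin-1 byte value is also its Unicode code point.
    func latin1ToUTF8(in []byte) []byte {
        runes := make([]rune, len(in))
        for i, b := range in {
            runes[i] = rune(b)
        }
        return []byte(string(runes))
    }

    // ensureUTF8 (hypothetical helper) returns data unchanged when it is already
    // valid UTF-8 and otherwise treats it as Latin-1. That fallback is an
    // assumption about the file, not something the bytes can prove.
    func ensureUTF8(data []byte) []byte {
        if utf8.Valid(data) {
            return data
        }
        return latin1ToUTF8(data)
    }

    func main() {
        raw := []byte("Item 3 With \xf1") // 0xF1 is "ñ" in Latin-1
        fmt.Println("valid UTF-8 before:", utf8.Valid(raw))

        fixed := ensureUTF8(raw)
        fmt.Println("valid UTF-8 after: ", utf8.Valid(fixed))
        fmt.Printf("%s\n", fixed)
    }
    ```

    The result of ensureUTF8(xmlData) can then be passed to xml.Unmarshal as in the question.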

    Accepted by the asker as the best answer.