dongtai419309 2013-11-25 14:07
Views: 97
Accepted

Reading a UTF-8 encoded string from io.Reader

I am writing a small communication protocol over TCP sockets. I am able to read and write basic data types such as integers, but I have no idea how to read a UTF-8 encoded string from a slice of bytes.

The protocol client is written in Java and the server is Go.

From what I have read, Go runes are 32 bits long and UTF-8 characters are 1 to 4 bytes long, which makes it impossible to simply cast a byte slice to a string.

I'd like to know how I can read and write this UTF-8 stream.

Note that I know the length of the byte buffer at the time I read the string.

1 answer

  • doutuo7815 2013-11-25 22:06

    Some theory first:

    • A rune in Go represents a Unicode code point: a number assigned to a particular character in Unicode. It's an alias for int32.
    • UTF-8 is a Unicode encoding: a format for representing Unicode code points for the purposes of storage and transmission. UTF-8 uses 1 to 4 bytes to encode a single code point.
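
    To make the distinction between a code point and its encoding concrete, here is a minimal sketch using the standard unicode/utf8 package. The helper name encodedLen is made up for this illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodedLen (a hypothetical helper) reports how many bytes UTF-8 needs
// to encode the code point r.
func encodedLen(r rune) int {
	return utf8.RuneLen(r)
}

func main() {
	// One code point each from the 1-, 2-, 3- and 4-byte ranges of UTF-8.
	for _, r := range []rune{'a', '\u00e9', '\u20ac', '\U0001F600'} {
		buf := make([]byte, utf8.UTFMax)
		n := utf8.EncodeRune(buf, r) // writes the UTF-8 encoding of r into buf
		fmt.Printf("U+%04X -> % x (%d bytes, RuneLen=%d)\n", r, buf[:n], n, encodedLen(r))
	}
}
```

    The rune is always a single 32-bit number; only its serialized form varies in length.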

    How this maps on Go data types:

    • Both []byte and string store a series of bytes (a byte in Go is an alias for uint8).

      The chief difference is that strings are immutable, so while you can

      b := make([]byte, 2)
      b[0] = byte('a')
      b[1] = byte('z')
      

      you can't

      var s string
      s[0] = byte('a')
      

      The latter fact is even underlined by the inability to set the string length explicitly (like in imaginary s := make(string, 10)).

    • While strings in Go contain abstract bytes (you're free to store in them, say, characters encoded using Windows-1252), certain Go statements and type conversions interpret strings as being encoded in UTF-8, in particular:
      • A type conversion from string to []rune parses the string as a sequence of UTF-8-encoded code points and produces a slice of them. The reverse type conversion takes the Unicode code points from the slice of runes and produces a UTF-8-encoded string.
      • A range loop over a string loops through Unicode code points comprising the string, not just bytes.
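
    The two behaviours listed above can be sketched roughly as follows. The helper codePoints is a name invented for this example; the sample string is spelled with an escape so that its byte length is unambiguous.

```go
package main

import "fmt"

// codePoints (a hypothetical helper) decodes s as UTF-8 and returns
// its code points.
func codePoints(s string) []rune {
	return []rune(s)
}

func main() {
	s := "h\u00e9llo" // U+00E9 takes two bytes in UTF-8

	// len counts bytes; the []rune conversion counts code points.
	fmt.Println(len(s), len(codePoints(s)))

	// range yields the byte offset and the decoded code point.
	for i, r := range s {
		fmt.Printf("byte offset %d: %c\n", i, r)
	}

	// Converting back re-encodes the code points as UTF-8.
	fmt.Println(string(codePoints(s)) == s)
}
```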

    Go also supplies the type conversions between string and []byte and back. Now recall that strings are read-only, while slices of bytes are not. This means a construct like

    b := make([]byte, 1000)
    // io.ReadFull fills b completely or returns an error
    if _, err := io.ReadFull(r, b); err != nil {
        return err
    }
    s := string(b)
    

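    copies the data, whether you convert a slice to a string or the other way around (the compiler may skip the copy only in a few special cases where it can prove the result is never modified or retained). This costs extra memory but is type-safe and enforces the immutability of strings.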
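
    A small sketch of what the copy buys you: after the conversion the string no longer depends on the slice. The function name toStringThenMutate is invented for this illustration.

```go
package main

import "fmt"

// toStringThenMutate (a hypothetical helper) converts b to a string and
// then overwrites the first byte of b.
func toStringThenMutate(b []byte) string {
	s := string(b) // copies the bytes of b into a new string
	b[0] = 'X'     // changes only the slice, not the string
	return s
}

func main() {
	b := []byte("az")
	s := toStringThenMutate(b)
	fmt.Println(s, string(b)) // prints "az Xz"
}
```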

    Now back to your task at hand.

    If you work with reasonably small strings and are not under memory pressure, just convert the byte slices filled by io.ReadFull() (or whatever you use to read) to strings. Be sure to reuse the slice you read the data into, to ease the pressure on the garbage collector; that is, do not allocate a new slice for each read, since the data the reading code puts into it is going to be copied off to a string anyway.

    Finally, if you absolutely must not copy the data (say, you're dealing with multi-megabyte strings and have tight memory requirements), you may try to play dirty tricks by working with memory unsafely: here is an example of how you might transplant the memory from a byte slice to a string. Note that should you resort to something like this, you must understand very well that it's free to break with any new release of Go, and it's not even guaranteed to work at all.
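
    One way such a trick might look, as a sketch only and assuming Go 1.20 or later, where the unsafe package provides unsafe.String and unsafe.SliceData. The function name bytesToStringNoCopy is made up for this example, and the resulting string is only valid as long as nobody modifies the slice.

```go
package main

import (
	"fmt"
	"unsafe"
)

// bytesToStringNoCopy (a hypothetical helper) returns a string that shares
// the backing array of b instead of copying it. The caller must never
// modify b afterwards, otherwise the "immutable" string changes under it.
func bytesToStringNoCopy(b []byte) string {
	if len(b) == 0 {
		return ""
	}
	return unsafe.String(unsafe.SliceData(b), len(b))
}

func main() {
	b := []byte("hello")
	s := bytesToStringNoCopy(b)
	fmt.Println(s)
}
```

    Because the buffer is now owned by the string, it cannot be reused for the next read, so this trades the copy for a fresh allocation per string.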

    This answer was selected by the asker as the best answer.
