dongtai419309 2013-11-25 14:07
Accepted

Reading a UTF-8 encoded string from an io.Reader

I am writing a small communication protocol over TCP sockets. I am able to read and write basic data types such as integers, but I have no idea how to read a UTF-8 encoded string from a slice of bytes.

The protocol client is written in Java and the server in Go.

From what I have read, Go runes are 32 bits long and UTF-8 characters are 1 to 4 bytes long, which makes it impossible to simply cast a byte slice to a string.

I'd like to know how I can read and write this UTF-8 stream.

Note that I know the byte length of the string at the time I read it.


1 answer

  • doutuo7815 2013-11-25 22:06

    Some theory first:

    • A rune in Go represents a Unicode code point: a number assigned to a particular character in Unicode. It's an alias for int32.
    • UTF-8 is a Unicode encoding: a format for representing Unicode code points for storage and transmission. UTF-8 uses 1 to 4 bytes to encode a single code point.
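
To make the 1-to-4-byte range concrete, here is a small sketch using the standard unicode/utf8 package. The helper name encodedLen and the sample characters are picked only for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodedLen reports how many bytes UTF-8 needs to encode the code point r.
// (Hypothetical helper; it only wraps utf8.RuneLen.)
func encodedLen(r rune) int {
	return utf8.RuneLen(r)
}

func main() {
	// One sample code point for each possible encoded length.
	for _, r := range []rune{'a', '\u00e9', '\u20ac', '\U0001F600'} {
		buf := make([]byte, utf8.UTFMax)
		n := utf8.EncodeRune(buf, r) // writes the UTF-8 bytes of r into buf
		fmt.Printf("U+%04X takes %d byte(s), same as encodedLen: %v\n",
			r, n, n == encodedLen(r))
	}
}
```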

    How this maps on Go data types:

    • Both []byte and string store a series of bytes (a byte in Go is an alias for uint8).

      The chief difference is that strings are immutable, so while you can

      b := make([]byte, 2)
      b[0] = byte('a')
      b[1] = byte('z')
      

      you can't

      var s string
      s[0] = byte('a')
      

      The latter fact is even underlined by the inability to set the string length explicitly (like in imaginary s := make(string, 10)).

    • While strings in Go contain abstract bytes (you're free to store in them, say, characters encoded using Windows-1252), certain Go statements and type conversions interpret strings as being encoded in UTF-8, in particular:
      • A type conversion from string to []rune parses the string as a sequence of UTF-8-encoded code points and produces a slice of them. The reverse type conversion takes the Unicode code points from the slice of runes and produces a UTF-8-encoded string.
      • A range loop over a string loops through Unicode code points comprising the string, not just bytes.
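
The difference between the byte view and the code-point view can be seen in a short sketch. The function name countBytesAndRunes and the sample string are assumptions made for the example.

```go
package main

import "fmt"

// countBytesAndRunes returns the byte length of s and the number of code
// points produced by converting it to []rune. (Hypothetical helper.)
func countBytesAndRunes(s string) (nBytes, nRunes int) {
	return len(s), len([]rune(s))
}

func main() {
	s := "h\u00e9llo" // U+00E9 takes two bytes in UTF-8

	nBytes, nRunes := countBytesAndRunes(s)
	fmt.Println("bytes:", nBytes, "code points:", nRunes)

	// range decodes UTF-8: i is the byte offset, r the decoded code point.
	for i, r := range s {
		fmt.Printf("offset %d: U+%04X\n", i, r)
	}

	// Converting to []rune and back re-encodes to the same UTF-8 string.
	fmt.Println(string([]rune(s)) == s)
}
```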

    Go also supplies the type conversions between string and []byte and back. Now recall that strings are read-only, while slices of bytes are not. This means a construct like

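    b := make([]byte, 1000)
    _, err := io.ReadFull(r, b) // r is an io.Reader; check err before using b
    s := string(b)              // copies the bytes of b into a new string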
    

    always copies the data, no matter if you convert a slice to a string or back. This wastes space but is type-safe and enforces the semantics.
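
A quick way to observe the copy is to modify the slice after the conversion. This is only a sketch; toStringThenMutate is a made-up name.

```go
package main

import "fmt"

// toStringThenMutate converts b to a string, then overwrites the first byte
// of b, and returns the string so the caller can see it was not affected.
// b must not be empty. (Hypothetical helper.)
func toStringThenMutate(b []byte) string {
	s := string(b) // the conversion copies the bytes
	b[0] = 'X'     // changes the slice only
	return s
}

func main() {
	b := []byte("az")
	s := toStringThenMutate(b)
	fmt.Println(s, string(b)) // prints "az Xz"
}
```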

    Now back to your task at hand.

    If you work with reasonably small strings and are not under memory pressure, just convert the byte slices filled by io.ReadFull() (or whatever) to strings. Be sure to reuse the slice you read the data into, to ease the pressure on the garbage collector: do not allocate a new slice for each read, since you are going to copy the data out of it into a string anyway.

    Finally, if you absolutely must not copy the data (say, you're dealing with multi-megabyte strings and you have tight memory requirements), you may try to play dirty tricks by unsafely working with memory; here is an example of how you might transplant the memory from a byte slice to a string. Note that should you resort to something like this, you must understand very well that it is free to break with any new release of Go, and it is not even guaranteed to work at all.
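
One commonly seen form of such a trick is sketched below. It is not necessarily the example the answer linked to, and the name bytesToStringNoCopy is made up. It relies on the in-memory layout of slices and strings, which the language does not promise, and the slice must never be modified afterwards, because that would change a supposedly immutable string.

```go
package main

import (
	"fmt"
	"unsafe"
)

// bytesToStringNoCopy reinterprets the slice header of b as a string header,
// so the returned string shares b's memory instead of copying it.
// Assumption: a string header matches the first two words of a slice header
// (data pointer, length). Do not modify b after calling this.
func bytesToStringNoCopy(b []byte) string {
	return *(*string)(unsafe.Pointer(&b))
}

func main() {
	b := []byte("hello")
	s := bytesToStringNoCopy(b)
	fmt.Println(s, len(s))
}
```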
