duandie5707 2012-04-12 22:59
浏览 45
已采纳

共享的GAE数据存储区,Go <-> Java,regexp.FindStringIndex索引移位(字节索引与utf-8-char-index)

Short version: This prints 3, which makes sense because in Go strings are basically a slice of bytes, and it takes three bytes to represent this character. How can I get len, and regexp functions to work in terms of characters, not bytes.

package main
import "fmt"
func main() {
    fmt.Println(len("ウ"))//returns 3
    fmt.Println(utf8.RuneCountInString("ウ"))//returns 1
}

Background:

I'm saving text into the GAE datastore using JDO (Java).

Then I'm processing the text using Go, specifically I'm using regexp.FindStringIndex and saving the index to the datastore.

Then back in Java land I send the unmodified text, and index to the GWT client via json.

Somewhere along the way the indexes are 'shifting', so by the time its on the client, they are off.

It seems the issue has to do with character encoding, I'm assuming Java/Go are interpreting the text (indexes) differently utf-8 char/byte?. I see references to Runes in the regexp package.

I think I can either make regexp.FindStringIndex return byte indexes in go, or make GWT client understand the utf-8 indexes.

Any suggestions? I should be using UTF-8 incase I need to internationalize the app in the future, right?

Thanks

EDIT:

Also when I was finding the index using Java on the server things just worked.

On the client (GWT) I'm using text.substring(start,end)

TEST:

package main

import "regexp"
import "fmt"

func main() {
    fmt.Print(regexp.MustCompile(`a`).FindStringIndex("ウィキa")[1])
}

The code outputs 10, not 4.

The plan is to get FindStringIndex to return 4, any ideas?

Update 2: Position Conversion

func main() {
    s:="ab日aba本語ba";
    byteIndex:=regexp.MustCompile(`a`).FindAllStringIndex(s,-1)
    fmt.Println(byteIndex)//[[0 1] [5 6] [7 8] [15 16]]

    offset :=0
    posMap := make([]int,len(s))//maps byte-positions to char-positions
    for pos, char := range s {
        fmt.Printf("character %c starts at byte position %d, has an offset of %d, and a char position of %d.
", char, pos,offset,pos-offset)
        posMap[pos]=offset
        offset += utf8.RuneLen(char)-1
    }
    fmt.Println("posMap =",posMap)
    for pos ,value:= range byteIndex{
        fmt.Printf("pos:%d value:%d subtract %d
",pos,value,posMap[value[0]])
        value[1]-=posMap[value[0]]
        value[0]-=posMap[value[0]]
    }
    fmt.Println(byteIndex)//[[0 1] [3 4] [5 6] [9 10]]

}

* Update 2 *

    lastPos:=-1
    for pos, char := range s {
        offset +=pos-lastPos-1
        fmt.Printf("character %c starts at byte position %d, has an offset of %d, and a char position of %d.
", char, pos,offset,pos-offset)
        posMap[pos]=offset
        lastPos=pos
    }
  • 写回答

1条回答 默认 最新

  • douqian1296 2012-04-13 01:14
    关注

    As you've probably gathered, Go and Java treat strings differently. In Java, a string is a series of codepoints (characters); in Go, a string is a series of bytes. Text manipulation functions in Go understand UTF-8 codepoints when necessary, but since the string is represented as bytes, the indices they return and work with are byte indexes, not character indexes.

    As you observe in the comments, you can use a RuneReader and FindReaderIndex to get indexes in characters rather than bytes. strings.Reader provides an implementation of RuneReader, so you can use strings.NewReader to wrap a string in a RuneReader.

    Another option is to take the substring you want the length of in characters and pass it to utf8.RuneLen, which returns the number of characters in the UTF-8 string. Using a RuneReader is probably more efficient, however.

    本回答被题主选为最佳回答 , 对您是否有帮助呢?
    评论

报告相同问题?

悬赏问题

  • ¥15 用stata实现聚类的代码
  • ¥15 请问paddlehub能支持移动端开发吗?在Android studio上该如何部署?
  • ¥170 如图所示配置eNSP
  • ¥20 docker里部署springboot项目,访问不到扬声器
  • ¥15 netty整合springboot之后自动重连失效
  • ¥15 悬赏!微信开发者工具报错,求帮改
  • ¥20 wireshark抓不到vlan
  • ¥20 关于#stm32#的问题:需要指导自动酸碱滴定仪的原理图程序代码及仿真
  • ¥20 设计一款异域新娘的视频相亲软件需要哪些技术支持
  • ¥15 stata安慰剂检验作图但是真实值不出现在图上