Shared GAE datastore, Go <-> Java: regexp.FindStringIndex index shifting (byte index vs. UTF-8 character index)

Short version: the first line below prints 3, which makes sense because in Go a string is basically a slice of bytes, and it takes three bytes to represent this character in UTF-8. How can I get len and the regexp functions to work in terms of characters, not bytes?

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    fmt.Println(len("ウ"))                    // prints 3 (bytes)
    fmt.Println(utf8.RuneCountInString("ウ")) // prints 1 (characters)
}

Background:

I'm saving text into the GAE datastore using JDO (Java).

Then I'm processing the text using Go; specifically, I'm using regexp.FindStringIndex and saving the index to the datastore.

Then, back in Java land, I send the unmodified text and the index to the GWT client via JSON.

Somewhere along the way the indexes are 'shifting', so by the time the text is on the client, they are off.

It seems the issue has to do with character encoding. I'm assuming Java and Go interpret the indexes differently (UTF-8 characters vs. bytes?). I see references to runes in the regexp package.

I think I can either make regexp.FindStringIndex return character indexes in Go, or make the GWT client understand the byte indexes.

Any suggestions? I should be using UTF-8 in case I need to internationalize the app in the future, right?

Thanks

EDIT:

Also when I was finding the index using Java on the server things just worked.

On the client (GWT) I'm using text.substring(start,end)

TEST:

package main

import "regexp"
import "fmt"

func main() {
    fmt.Print(regexp.MustCompile(`a`).FindStringIndex("ウィキa")[1])
}

The code outputs 10, not 4.

The plan is to get FindStringIndex to return 4, any ideas?

Update 2: Position Conversion

package main

import (
    "fmt"
    "regexp"
    "unicode/utf8"
)

func main() {
    s := "ab日aba本語ba"
    byteIndex := regexp.MustCompile(`a`).FindAllStringIndex(s, -1)
    fmt.Println(byteIndex) // [[0 1] [5 6] [7 8] [15 16]]

    offset := 0
    posMap := make([]int, len(s)) // maps each byte position to the offset to subtract from it
    for pos, char := range s {
        fmt.Printf("character %c starts at byte position %d, has an offset of %d, and a char position of %d.\n", char, pos, offset, pos-offset)
        posMap[pos] = offset
        offset += utf8.RuneLen(char) - 1
    }
    fmt.Println("posMap =", posMap)
    for pos, value := range byteIndex {
        fmt.Printf("pos:%d value:%d subtract %d\n", pos, value, posMap[value[0]])
        // The start offset is applied to both ends, which is only correct
        // when the match itself contains no multi-byte characters.
        value[1] -= posMap[value[0]]
        value[0] -= posMap[value[0]]
    }
    fmt.Println(byteIndex) // [[0 1] [3 4] [5 6] [9 10]]
}

* Update 2 *

    lastPos := -1
    for pos, char := range s {
        offset += pos - lastPos - 1
        fmt.Printf("character %c starts at byte position %d, has an offset of %d, and a char position of %d.\n", char, pos, offset, pos-offset)
        posMap[pos] = offset
        lastPos = pos
    }

1 answer

douqian1296 2012-04-13 01:14
As you've probably gathered, Go and Java treat strings differently. In Java, a string is a sequence of UTF-16 code units (char values), so for text in the Basic Multilingual Plane, such as your Japanese examples, a string index is effectively a character index; in Go, a string is a sequence of bytes, normally holding UTF-8. Text manipulation functions in Go understand UTF-8 code points when necessary, but since the string is represented as bytes, the indexes they return and work with are byte indexes, not character indexes.
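To make the difference concrete, here is a minimal sketch that measures the same string in the three units involved: UTF-8 bytes (what Go indexes), code points (runes), and UTF-16 code units (what Java's String and a JavaScript/GWT client index). The helper name lengths is made up for this illustration.

```go
package main

import (
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

// lengths (a hypothetical helper) reports the size of s in UTF-8 bytes,
// in code points (runes), and in UTF-16 code units.
func lengths(s string) (bytes, runes, utf16Units int) {
	return len(s), utf8.RuneCountInString(s), len(utf16.Encode([]rune(s)))
}

func main() {
	for _, s := range []string{"a", "ウ", "😀"} {
		b, r, u := lengths(s)
		fmt.Printf("%q: %d bytes, %d runes, %d UTF-16 units\n", s, b, r, u)
	}
}
```

For "ウ" the three counts are 3, 1 and 1, so a rune index lines up with a Java index. For a character outside the Basic Multilingual Plane such as "😀" they are 4, 1 and 2, so even rune indexes would drift on the Java side for that kind of text.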

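As you observe in the comments, regexp also offers FindReaderIndex, which reads its input from a RuneReader; strings.Reader implements RuneReader, so you can use strings.NewReader to wrap a string in one. Note, however, that the regexp documentation describes the FindReaderIndex result as byte offsets into the input stream, so test it against multi-byte input before relying on it for character indexes.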
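A quick way to check what FindReaderIndex reports is to run it next to FindStringIndex on the test string from the question. This is only a sketch, assuming a current Go release where the regexp documentation describes the FindReaderIndex result as byte offsets; the helper names readerLoc and stringLoc are made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// readerLoc (hypothetical helper) matches pattern against s through a
// strings.Reader, which implements io.RuneReader.
func readerLoc(pattern, s string) []int {
	return regexp.MustCompile(pattern).FindReaderIndex(strings.NewReader(s))
}

// stringLoc (hypothetical helper) matches the same pattern directly
// against the string.
func stringLoc(pattern, s string) []int {
	return regexp.MustCompile(pattern).FindStringIndex(s)
}

func main() {
	s := "ウィキa"
	fmt.Println("FindReaderIndex:", readerLoc(`a`, s))
	fmt.Println("FindStringIndex:", stringLoc(`a`, s))
}
```

If both lines print the same pair, the reader variant is returning byte offsets as well, and the indexes still need to be converted before they go back to Java.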

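Another option is to take the part of the string that precedes the byte index, s[:i], and pass it to utf8.RuneCountInString, which returns the number of characters (runes) in a UTF-8 string; that count is the character index. (utf8.RuneLen is a different function: it takes a single rune and returns how many bytes its encoding needs.) This rescans the prefix for each index, so when there are many matches, building a byte-to-character position map in a single pass, as in your update, is probably more efficient.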
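Both conversions can be sketched in a few lines. The code below assumes valid UTF-8 input and match boundaries that fall on rune boundaries, which is what regexp produces for such input; runeIndex and runeLocs are hypothetical helper names, not part of any library. The resulting indexes count code points, so they line up with Java and GWT indexes only while the text stays within the Basic Multilingual Plane.

```go
package main

import (
	"fmt"
	"regexp"
	"unicode/utf8"
)

// runeIndex converts one byte offset into s to a character (rune) offset
// by counting the runes in the prefix that precedes it.
func runeIndex(s string, byteOffset int) int {
	return utf8.RuneCountInString(s[:byteOffset])
}

// runeLocs converts every [start, end) byte pair in locs to rune offsets
// using a single pass over s. It returns a new slice and leaves locs untouched.
func runeLocs(s string, locs [][]int) [][]int {
	// byteToRune[b] is the rune index of the rune that starts at byte b.
	// The extra slot at len(s) covers matches that end at the end of s.
	byteToRune := make([]int, len(s)+1)
	n := 0
	for b := range s {
		byteToRune[b] = n
		n++
	}
	byteToRune[len(s)] = n

	out := make([][]int, len(locs))
	for i, loc := range locs {
		out[i] = []int{byteToRune[loc[0]], byteToRune[loc[1]]}
	}
	return out
}

func main() {
	s := "ab日aba本語ba"
	locs := regexp.MustCompile(`a`).FindAllStringIndex(s, -1)
	fmt.Println(locs)              // byte offsets
	fmt.Println(runeLocs(s, locs)) // character offsets
	fmt.Println(runeIndex("ウィキa", 10))
}
```

Unlike subtracting the start offset from both ends, looking up each end separately also handles matches that contain multi-byte characters. On the question's test string, runeIndex("ウィキa", 10) gives 4, the value the question was after.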

This answer was selected by the asker as the best answer.
