Shared GAE datastore, Go <-> Java, regexp.FindStringIndex index shifting (byte index vs. UTF-8 char index)

Short version: in the program below, len prints 3, which makes sense because a Go string is essentially a slice of bytes and this character takes three bytes in UTF-8. How can I get len and the regexp functions to work in terms of characters, not bytes?

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    fmt.Println(len("ウ"))                    // prints 3 (bytes)
    fmt.Println(utf8.RuneCountInString("ウ")) // prints 1 (characters)
}

Background:

I'm saving text into the GAE datastore using JDO (Java).

Then I process the text in Go: I call regexp.FindStringIndex and save the resulting index to the datastore.

Then, back in Java land, I send the unmodified text and the index to the GWT client via JSON.

Somewhere along the way the indexes are 'shifting', so by the time the text is on the client, they are off.

The issue seems to be character encoding: I assume Java and Go interpret the indexes differently (UTF-8 characters vs. bytes). I see references to runes in the regexp package.

I think I can either make regexp.FindStringIndex return character indexes in Go, or make the GWT client understand the UTF-8 byte indexes.

Any suggestions? I should be using UTF-8 in case I need to internationalize the app in the future, right?

Thanks

EDIT:

Also when I was finding the index using Java on the server things just worked.

On the client (GWT) I'm using text.substring(start,end)

TEST:

package main

import "regexp"
import "fmt"

func main() {
    fmt.Print(regexp.MustCompile(`a`).FindStringIndex("ウィキa")[1])
}

The code outputs 10, not 4.

The plan is to get FindStringIndex to return 4, any ideas?

Update 2: Position Conversion

package main

import (
    "fmt"
    "regexp"
    "unicode/utf8"
)

func main() {
    s := "ab日aba本語ba"
    byteIndex := regexp.MustCompile(`a`).FindAllStringIndex(s, -1)
    fmt.Println(byteIndex) // [[0 1] [5 6] [7 8] [15 16]]

    offset := 0
    posMap := make([]int, len(s)) // maps each byte position to the offset to subtract from it
    for pos, char := range s {
        fmt.Printf("character %c starts at byte position %d, has an offset of %d, and a char position of %d.\n", char, pos, offset, pos-offset)
        posMap[pos] = offset
        offset += utf8.RuneLen(char) - 1
    }
    fmt.Println("posMap =", posMap)
    for pos, value := range byteIndex {
        fmt.Printf("pos:%d value:%d subtract %d\n", pos, value, posMap[value[0]])
        // Shifting both ends by the start's offset assumes the match itself is ASCII.
        value[1] -= posMap[value[0]]
        value[0] -= posMap[value[0]]
    }
    fmt.Println(byteIndex) // [[0 1] [3 4] [5 6] [9 10]]
}

* Update 2 *

    lastPos := -1
    for pos, char := range s {
        offset += pos - lastPos - 1
        fmt.Printf("character %c starts at byte position %d, has an offset of %d, and a char position of %d.\n", char, pos, offset, pos-offset)
        posMap[pos] = offset
        lastPos = pos
    }

1 answer

douqian1296 2012-04-13 01:14
As you've probably gathered, Go and Java treat strings differently. In Java, a string is a sequence of UTF-16 code units (char values), and its indexes count those units; for text in the Basic Multilingual Plane, such as the Japanese here, that is one index per character. In Go, a string is a series of bytes, conventionally UTF-8. Text manipulation functions in Go understand UTF-8 codepoints when necessary, but since the string is represented as bytes, the indexes they return and work with are byte indexes, not character indexes.
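
To make the difference concrete, here is a small sketch (my own illustration, not code from the question) that measures the same string three ways: in bytes, which is what Go's len reports; in code points (runes); and in UTF-16 code units, which is what Java's String.length() and substring count. The helper names byteLen, runeCount and utf16Len are hypothetical.

```go
package main

import (
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

// byteLen is what Go's len reports: the number of UTF-8 bytes in s.
func byteLen(s string) int { return len(s) }

// runeCount is the number of Unicode code points in s.
func runeCount(s string) int { return utf8.RuneCountInString(s) }

// utf16Len is the number of UTF-16 code units in s, i.e. the length a
// Java String holding the same text would report.
func utf16Len(s string) int { return len(utf16.Encode([]rune(s))) }

func main() {
	for _, s := range []string{"ウィキa", "a😀"} {
		fmt.Printf("%q: bytes=%d runes=%d utf16=%d\n",
			s, byteLen(s), runeCount(s), utf16Len(s))
	}
}
```

For the Japanese sample the rune and UTF-16 counts agree (4 and 4, against 10 bytes), so counting characters is enough there. They only diverge for characters outside the Basic Multilingual Plane, such as the emoji in the second string, which is one rune but two UTF-16 units.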

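As you observe in the comments, you can use a RuneReader and FindReaderIndex. strings.Reader provides an implementation of RuneReader, so you can use strings.NewReader to wrap a string in a RuneReader. Note, however, that the regexp documentation describes the result of FindReaderIndex as byte offsets into the input stream, and strings.Reader reports each rune's width in bytes, so with a plain strings.Reader the indexes still come back in bytes; check what you actually get before relying on this.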

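Another option is to take the part of the string that precedes the match, s[:byteIndex], and pass it to utf8.RuneCountInString, which returns the number of characters (runes) in a UTF-8 string; that count is the character index of the match. (utf8.RuneLen is not the right function here: it returns the number of bytes needed to encode a single rune.) This rescans the prefix for every match, so if you have many matches in the same text, a single pass over the string is more efficient.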

This answer was selected by the asker as the best answer.
