dquv73115 2013-12-26 03:55
浏览 61
已采纳

建立一个ngram频率表并处理多字节符文

I am currently learning Go and am making a lot of progress. One way I do this is to port past projects and prototypes from a prior language to a new one.

Right now I am busying myself with a "language detector" I prototyped in Python a while ago. In this module, I generate an ngram frequency table, where I then calculate the difference between a given text and a known corpora.

This allows one to effectively determine which corpus is the best match by returning the cosine of two vector representations of the given ngram tables. Yay. Math.

I have a prototype written in Go that works perfectly with plain ascii characters, but I would very much like to have it working with unicode multibyte support. This is where I'm doing my head in.

Here is a quick example of what I'm dealing with: http://play.golang.org/p/2bnAjZX3r0

I've only posted the table generating logic since everything already works just fine.

As you can see by running the snippet, the first text works quite well and builds an accurate table. The second text, which is German, has a few double-byte characters in it. Due to the way I am building the ngram sequence, and due to the fact that these specific runes are made of two bytes, there appear 2 ngrams where the first byte is cut off.

Could someone perhaps post a more efficient solution or, at the very least, guide me through a fix? I'm almost positive I am over analysing this problem.

I plan on open sourcing this package and implementing it as a service using Martini, thus providing a simple API people can use for simple linguistic computation.

As ever, thanks!

  • 写回答

2条回答 默认 最新

  • double2022 2013-12-26 06:43
    关注

    If I understand correctly, you want chars in your Parse function to hold the last n characters in the string. Since you're interested in Unicode characters rather than their UTF-8 representation, you might find it easier to manage it as a []rune slice, and only convert back to a string when you have your ngram ready to add to the table. This way you don't need to special case non-ASCII characters in your logic.

    Here is a simple modification to your sample program that does the above: http://play.golang.org/p/QMYoSlaGSv

    本回答被题主选为最佳回答 , 对您是否有帮助呢?
    评论
查看更多回答(1条)

报告相同问题?

悬赏问题

  • ¥60 求一个简单的网页(标签-安全|关键词-上传)
  • ¥35 lstm时间序列共享单车预测,loss值优化,参数优化算法
  • ¥15 基于卷积神经网络的声纹识别
  • ¥15 Python中的request,如何使用ssr节点,通过代理requests网页。本人在泰国,需要用大陆ip才能玩网页游戏,合法合规。
  • ¥100 为什么这个恒流源电路不能恒流?
  • ¥15 有偿求跨组件数据流路径图
  • ¥15 写一个方法checkPerson,入参实体类Person,出参布尔值
  • ¥15 我想咨询一下路面纹理三维点云数据处理的一些问题,上传的坐标文件里是怎么对无序点进行编号的,以及xy坐标在处理的时候是进行整体模型分片处理的吗
  • ¥15 CSAPPattacklab
  • ¥15 一直显示正在等待HID—ISP