Building an ngram frequency table and handling multibyte runes

I am currently learning Go and am making a lot of progress. One way I do this is to port past projects and prototypes from a prior language to a new one.

Right now I am busying myself with a "language detector" I prototyped in Python a while ago. In this module, I generate an ngram frequency table, then calculate the difference between a given text and a known corpus.

This allows one to determine which corpus is the best match by returning the cosine of the angle between the vector representations of the two ngram tables. Yay. Math.
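For illustration, the cosine comparison described above could be sketched in Go as follows. This is a minimal sketch, not the code from the prototype: the name `CosineSimilarity` and the choice of `map[string]float64` for the frequency table are assumptions made for the example.

```go
package main

import (
	"fmt"
	"math"
)

// CosineSimilarity treats two ngram frequency tables as sparse vectors
// and returns the cosine of the angle between them.
// The name and the map-based table type are assumptions for this sketch.
// It returns 0 when either table has no non-zero counts.
func CosineSimilarity(a, b map[string]float64) float64 {
	var dot, normA, normB float64
	for gram, x := range a {
		dot += x * b[gram] // an ngram missing from b reads as 0
		normA += x * x
	}
	for _, y := range b {
		normB += y * y
	}
	if normA == 0 || normB == 0 {
		return 0
	}
	return dot / math.Sqrt(normA*normB)
}

func main() {
	text := map[string]float64{"th": 2, "he": 1}
	corpus := map[string]float64{"th": 4, "he": 2}
	fmt.Printf("%.3f\n", CosineSimilarity(text, corpus))
}
```

Tables with the same proportions score 1 regardless of text length, and tables with no ngrams in common score 0, which is why the measure suits comparing a short text against a large corpus.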

I have a prototype written in Go that works perfectly with plain ASCII characters, but I would very much like to have it working with multibyte Unicode characters. This is where I'm doing my head in.

Here is a quick example of what I'm dealing with: http://play.golang.org/p/2bnAjZX3r0

I've only posted the table-generating logic since everything else already works just fine.

As you can see by running the snippet, the first text works quite well and builds an accurate table. The second text, which is German, has a few two-byte characters in it. Because of the way I am building the ngram sequence, and because these specific runes are encoded as two bytes each, two ngrams appear in which the first byte is cut off.
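The effect can be reproduced in isolation. The sketch below is an assumption about the general failure mode, not a copy of the linked snippet: `ByteNgrams` is a hypothetical helper that windows a string by bytes, which splits any rune that UTF-8 encodes in more than one byte.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// ByteNgrams is a hypothetical helper that slices s into windows of
// n bytes. Indexing a Go string works on bytes, so a window boundary
// can fall in the middle of a multibyte rune.
func ByteNgrams(s string, n int) []string {
	var out []string
	for i := 0; i+n <= len(s); i++ {
		out = append(out, s[i:i+n])
	}
	return out
}

func main() {
	// "ü" occupies two bytes in UTF-8, so "für" is 4 bytes but 3 runes.
	for _, g := range ByteNgrams("für", 2) {
		fmt.Printf("%q valid UTF-8: %v\n", g, utf8.ValidString(g))
	}
}
```

Only the middle window is valid UTF-8 here; the windows on either side each contain half of the "ü".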

Could someone perhaps post a more efficient solution or, at the very least, guide me through a fix? I'm almost positive I am overanalysing this problem.

I plan on open-sourcing this package and implementing it as a service using Martini, providing a simple API people can use for simple linguistic computation.

As ever, thanks!

double2022 2013-12-26 06:43
If I understand correctly, you want chars in your Parse function to hold the last n characters in the string. Since you're interested in Unicode characters rather than their UTF-8 representation, you might find it easier to manage chars as a []rune slice, and only convert back to a string when you have your ngram ready to add to the table. This way you don't need to special-case non-ASCII characters in your logic.

Here is a simple modification to your sample program that does the above: http://play.golang.org/p/QMYoSlaGSv
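For readers who cannot open the playground link, here is a minimal sketch of the []rune approach described above. It is not the code behind the link: `NgramTable` is a hypothetical name, and it converts the whole input up front rather than maintaining a sliding chars buffer as Parse does.

```go
package main

import "fmt"

// NgramTable counts every window of n runes in text.
// The name is hypothetical; it is not the Parse function from the question.
func NgramTable(text string, n int) map[string]int {
	table := make(map[string]int)
	if n <= 0 {
		return table
	}
	runes := []rune(text) // decode UTF-8 once; each element is a whole code point
	for i := 0; i+n <= len(runes); i++ {
		// Convert back to a string only when the ngram is ready to be counted.
		table[string(runes[i:i+n])]++
	}
	return table
}

func main() {
	for gram, count := range NgramTable("für für", 2) {
		fmt.Printf("%q: %d\n", gram, count)
	}
}
```

Because the window moves over runes, "ü" is never split. Note that a rune is a single code point, so text that uses combining marks may still need normalizing before it is counted.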

This answer was selected as the best answer by the asker.


