Programming task: 1-gram sequence, uni-gram set, and uni-gram vector

Implement the following in any language: parse a keyword as a 1-gram sequence; we call the resulting representation the uni-gram set. For example, the keyword "secure" is transformed to {s1, e1, c1, u1, r1, e2}, where "e1" is the first "e" in "secure" and "e2" is the second "e". The uni-gram set is represented by a 160-bit vector called the uni-gram vector. The uni-gram vector consists of 26 ∗ 5 + 30 bits: the 26 ∗ 5 bits represent the letter uni-grams (each of the 26 letters in up to five occurrences), and the 30 bits represent commonly used symbols and digits. A given bit is set to 1 if it corresponds to a uni-gram in the set; otherwise it remains 0.
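The parsing rule can be sketched in a few lines of Python. This is a minimal illustration rather than part of the original question: the function name `unigram_set` is hypothetical, and it assumes the keyword contains only lowercase letters, as in the "secure" example.

```python
from collections import Counter


def unigram_set(keyword):
    """Label each character with its running occurrence count.

    Hypothetical helper for illustration; assumes a lowercase keyword.
    """
    seen = Counter()
    labels = []
    for ch in keyword:
        seen[ch] += 1  # how many times this character has appeared so far
        labels.append(f"{ch}{seen[ch]}")
    return labels


print(unigram_set("secure"))  # ['s1', 'e1', 'c1', 'u1', 'r1', 'e2']
```

The result is kept as a list so that the labels stay in keyword order, which is the order used in the expected output file.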

Test file: keyword.txt
stategov
selfempnotinc
federalgov
localgov
priv

Expected output 1: uni-gram set.txt
s1,t1,a1,t2,e1,g1,o1,v1
s1,e1,l1,f1,e2,m1,p1,n1,o1,t1,i1,n2,c1
f1,e1,d1,e2,r1,a1,l1,g1,o1,v1
l1,o1,c1,a1,l2,g1,o2,v1
p1,r1,i1,v1

Expected output 2: uni-gram vector.txt
{1,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,0,1,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,...}
{0,0,1,0,1,1,0,0,1,0,0,1,1,1,1,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,...}
{1,0,0,1,1,1,1,0,0,0,0,1,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...}
{1,0,1,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,...}
{0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...}
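The rows above appear consistent with a layout in which the k-th occurrence of a letter sets bit 26 ∗ (k − 1) + (offset of the letter from "a"). That layout is inferred from the sample rows and is not stated in the question. A minimal sketch under that assumption, with the hypothetical name `unigram_vector`:

```python
def unigram_vector(labels, size=160):
    """Turn labels such as 't2' into a 0/1 list of length `size`.

    Assumed layout (inferred, not stated in the question): bits 0-129 hold
    the letters, grouped by occurrence; bits 130-159 are left for symbols
    and digits and are never set here.
    """
    bits = [0] * size
    for label in labels:
        letter, occurrence = label[0], int(label[1:])
        # Ignore anything outside a-z or beyond the fifth occurrence.
        if 'a' <= letter <= 'z' and 1 <= occurrence <= 5:
            bits[26 * (occurrence - 1) + ord(letter) - ord('a')] = 1
    return bits


vector = unigram_vector(["s1", "t1", "a1", "t2", "e1", "g1", "o1", "v1"])
print([i for i, bit in enumerate(vector) if bit])  # [0, 4, 6, 14, 18, 19, 21, 45]
```

For "stategov", the second "t" lands at index 26 + 19 = 45, which is where the lone 1 in the second group of 26 sits in the first expected row.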

Java大魔王 2022-10-24 15:23

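I don't know how the 30 bits for common symbols and digits should map to positions, so for now this only handles keywords made up entirely of English letters.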

if __name__ == '__main__':
    # Read keyword.txt, one keyword per line
    with open("keyword.txt", "r", encoding="utf-8") as f:
        text_line_list = f.read().splitlines()
    # Build the uni-gram set of each keyword, e.g. "secure" -> s1,e1,c1,u1,r1,e2
    uni_gram_list = []
    for text in text_line_list:
        uni_gram_dict = {}  # letter -> number of occurrences seen so far
        uni_gram_item_list = []
        for c in text:
            # Only lowercase a-z is handled; other characters are skipped
            if 'a' <= c <= 'z':
                uni_gram_dict[c] = uni_gram_dict.get(c, 0) + 1
                uni_gram_item_list.append(c + str(uni_gram_dict[c]))
        uni_gram_list.append(uni_gram_item_list)
    # Write uni-gram set.txt
    with open('uni-gram set.txt', 'w', encoding='utf-8') as f:
        for line in uni_gram_list:
            f.write(','.join(line) + '\n')
    # Write uni-gram vector.txt
    uni_gram_vector_list = [[0] * 160 for _ in uni_gram_list]
    for index, value in enumerate(uni_gram_list):
        for item in value:
            # Bit position from the ASCII code: letter offset + 26 * (occurrence - 1)
            occurrence = int(item[1:])
            # Only 26 * 5 letter bits exist; bits 130-159 are for symbols and digits
            if occurrence <= 5:
                position = ord(item[0]) - ord('a') + (occurrence - 1) * 26
                uni_gram_vector_list[index][position] = 1
    with open('uni-gram vector.txt', 'w', encoding='utf-8') as f:
        for line in uni_gram_vector_list:
            f.write(','.join(map(str, line)) + '\n')
    print("success")
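The answer leaves the 30 symbol and digit bits unassigned. One way to cover them would be to fix an explicit ordering for the extra characters. The question does not say which symbols are included or in what order, so the table below (ten digits followed by twenty punctuation characters) is purely an assumption for illustration, as is the name `symbol_bit`.

```python
# Hypothetical ordering for bits 130-159; the question does not define one.
EXTRA_CHARS = "0123456789" + "-_.@#$%&*+=!?/:;,'()"
assert len(EXTRA_CHARS) == 30

LETTER_BITS = 26 * 5  # bits 0-129 are taken by the letter uni-grams


def symbol_bit(ch):
    """Return the assumed bit index for a digit or symbol, or None if unmapped."""
    if len(ch) != 1:
        return None
    pos = EXTRA_CHARS.find(ch)
    return LETTER_BITS + pos if pos >= 0 else None


print(symbol_bit("0"), symbol_bit(")"), symbol_bit("~"))  # 130 159 None
```

With this scheme each symbol or digit owns a single bit, so repeated occurrences of the same symbol are not distinguished the way repeated letters are.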

The asker selected this answer as the best answer.

Question events

Closed by the system: November 1
Answer accepted: October 24
Bounty of ¥15 sponsored: October 24
Question created: October 24
