Adolf K Wiseman 2022-10-24 11:01 Acceptance rate: 70%
Views: 62
Closed

Programming task: 1-gram sequence, uni-gram set, and uni-gram vector

Implement the following in any programming language: we choose a 1-gram sequence to parse a keyword, and we call this representation the uni-gram set. For example, the keyword “secure” is transformed into {s1, e1, c1, u1, r1, e2}, where “e1” is the first “e” in “secure” and “e2” is the second “e”. The uni-gram set is represented by a 160-bit vector, which is called the uni-gram vector. The uni-gram vector consists of 26 ∗ 5 + 30 bits: the 26 ∗ 5 bits represent the 26 letters with up to five occurrences each, and the remaining 30 bits represent commonly used symbols and digits. A given bit is set to 1 if the corresponding uni-gram is present in the set; otherwise it remains 0.
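
The parsing step described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `uni_gram_set` is hypothetical, and it assumes that every character is labeled with its running occurrence count within the keyword.

```python
from collections import Counter

def uni_gram_set(keyword):
    """Label each character with its 1-based occurrence number (hypothetical helper)."""
    seen = Counter()  # character -> how many times it has appeared so far
    grams = []
    for ch in keyword:
        seen[ch] += 1
        grams.append(f"{ch}{seen[ch]}")
    return grams

print(uni_gram_set("secure"))  # ['s1', 'e1', 'c1', 'u1', 'r1', 'e2']
```

The sketch returns a list instead of a Python `set` so that the uni-grams keep the order in which they appear in the keyword, as in the example from the task.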

Test file: keyword.txt
stategov
selfempnotinc
federalgov
localgov
priv

Expected output 1: uni-gram set.txt
s1,t1,a1,t2,e1,g1,o1,v1
s1,e1,l1,f1,e2,m1,p1,n1,o1,t1,i1,n2,c1
f1,e1,d1,e2,r1,a1,l1,g1,o1,v1
l1,o1,c1,a1,l2,g1,o2,v1
p1,r1,i1,v1

Expected output 2: uni-gram vector.txt
{1,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,0,1,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,...}
{0,0,1,0,1,1,0,0,1,0,0,1,1,1,1,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,...}
{1,0,0,1,1,1,1,0,0,0,0,1,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...}
{1,0,1,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,...}
{0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...}
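
The sample rows above appear consistent with a layout in which the k-th occurrence of a letter maps to bit index (k - 1) ∗ 26 + (position of the letter in the alphabet, with a = 0). This layout is inferred from the sample output and is not stated in the task, so treat it as an assumption. A minimal sketch under that assumption, with hypothetical helper names `gram_index` and `uni_gram_vector`:

```python
VECTOR_BITS = 160  # 26 * 5 letter bits + 30 symbol/digit bits
MAX_REPEATS = 5    # each letter has bits for its 1st..5th occurrence

def gram_index(letter, occurrence):
    """Bit index of a letter uni-gram under the inferred layout (an assumption)."""
    return (occurrence - 1) * 26 + (ord(letter) - ord("a"))

def uni_gram_vector(keyword):
    """Build the 160-bit vector for a lowercase a-z keyword (hypothetical helper)."""
    bits = [0] * VECTOR_BITS
    counts = {}
    for ch in keyword:
        if not "a" <= ch <= "z":
            continue  # symbols and digits are out of scope for this sketch
        counts[ch] = counts.get(ch, 0) + 1
        if counts[ch] <= MAX_REPEATS:
            bits[gram_index(ch, counts[ch])] = 1
    return bits

vec = uni_gram_vector("stategov")
print([i for i, b in enumerate(vec) if b])  # [0, 4, 6, 14, 18, 19, 21, 45]
```

Here bit 45 is the second “t” of “stategov” (26 + 19); the other set bits are first occurrences.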

1 answer

  • Java大魔王 2022-10-24 15:23

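    I don't know how the 30 bits for common symbols and digits are supposed to map to positions, so for now this only handles keywords made up entirely of English letters.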

    if __name__ == '__main__':
        # Read keyword.txt, one keyword per line
        with open("keyword.txt", "r", encoding="utf-8") as f:
            text_line_list = f.read().splitlines()
        uni_gram_list = []
        for text in text_line_list:
            uni_gram_dict = {}  # letter -> number of occurrences seen so far
            uni_gram_item_list = []
            # Uppercase is folded to lowercase (an assumption; the samples are all lowercase)
            for c in text.lower():
                # Only ASCII letters a-z are handled; symbols and digits are skipped
                if 'a' <= c <= 'z':
                    uni_gram_dict[c] = uni_gram_dict.get(c, 0) + 1
                    uni_gram_item_list.append(c + str(uni_gram_dict[c]))
            uni_gram_list.append(uni_gram_item_list)
        # Write uni-gram set.txt
        with open('uni-gram set.txt', 'w', encoding="utf-8") as f:
            for line in uni_gram_list:
                f.write(','.join(line) + '\n')
        # Build the 160-bit vectors
        uni_gram_vector_list = [[0] * 160 for _ in uni_gram_list]
        for index, value in enumerate(uni_gram_list):
            for item in value:
                occurrence = int(item[1:])
                # The 26 * 5 letter block has no bit for a sixth or later occurrence
                if occurrence <= 5:
                    # Bit position: ASCII offset of the letter plus 26 per repeat
                    position = ord(item[0]) - ord('a') + (occurrence - 1) * 26
                    uni_gram_vector_list[index][position] = 1
        # Write uni-gram vector.txt
        with open('uni-gram vector.txt', 'w', encoding="utf-8") as f:
            for line in uni_gram_vector_list:
                f.write(','.join(map(str, line)) + '\n')
        print("success")
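
The task does not say which 30 symbols and digits occupy the last 30 bits, or in what order, so any mapping is a convention that the encoder and the decoder have to agree on. One possible convention, purely an assumption for illustration, places the digits 0-9 at bits 130-139 and a hypothetical list of 20 punctuation characters at bits 140-159:

```python
LETTER_BITS = 26 * 5  # bits 0..129 are reserved for letter uni-grams

# Hypothetical ordering: 10 digits followed by 20 symbols (not specified by the task).
EXTRA_CHARS = "0123456789" + "-_.@#$%&*+=!?/:;,'()"
assert len(EXTRA_CHARS) == 30

EXTRA_INDEX = {ch: LETTER_BITS + i for i, ch in enumerate(EXTRA_CHARS)}

def extra_bit(ch):
    """Return the assumed bit index for a digit or symbol, or None if unmapped."""
    return EXTRA_INDEX.get(ch)

print(extra_bit("0"), extra_bit("9"), extra_bit(")"))  # 130 139 159
```

Under this convention each digit or symbol gets a single bit, so repeated occurrences of the same symbol are not distinguished the way repeated letters are.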
    
    This answer was selected by the asker as the best answer.

Question events

  • Closed by the system on November 1
  • Answer accepted on October 24
  • Bounty of ¥15 offered on October 24
  • Question created on October 24