shero_f 2021-09-26 10:42 Acceptance rate: 100%

Python: I can't figure out how to write this one. Can anyone help?

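Read text from an English document and represent each sentence as a bag-of-words feature vector. The requirements are:

1) Read all the English sentences from the file;

2) Collect the words that occur across all sentences;

3) Represent each sentence as a bag-of-words vector;

4) Save each sentence's vector to a new file.

The contents of the document set are shown below.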

"State-of-the-art named entity recognition systems rely heavily on hand-crafted features and domain-specific knowledge in order to learn effectively from the small",

"supervised training corpora that are available. In this paper, we introduce two new neural architectures: one based on bidirectional LSTMs and conditional random fields",

"and the other that constructs and labels segments using a transition-based approach inspired by shift-reduce parsers. Our models rely on two sources of information about words",

"character-based word representations learned from the supervised corpus and unsupervised word representations learned from unannotated corpora",

"Our models obtain state-of-the-art performance in NER in four languages without resorting to any language-specific knowledge or resources such as gazetteers"

2 answers

  • CSDN专家-kaily 2021-09-26 14:13
    import re
    import numpy as np

    def onehot_matrix(list1):
        # Strip commas, periods, and colons, then split each sentence into words
        docs = [re.sub(r"[,.:]", "", sentence).split() for sentence in list1]

        # Vocabulary: every distinct word, in order of first appearance
        vocab = list(dict.fromkeys(word for doc in docs for word in doc))

        # Dictionary mapping each word to its column index in the matrix
        word2id = {word: idx for idx, word in enumerate(vocab)}
        print(word2id)  # Print the dictionary

        # M x V zero matrix: M is the number of sentences,
        # V is the number of distinct words (the vector dimension)
        M, V = len(docs), len(vocab)
        bow = np.zeros((M, V), dtype=int)

        # Bag of words: count how often each word occurs in each sentence
        for i, doc in enumerate(docs):
            for word in doc:
                bow[i][word2id[word]] += 1
        return bow.tolist()
    
    list1 = ["State-of-the-art named entity recognition systems rely heavily on hand-crafted features and domain-specific knowledge in order to learn effectively from the small",
             "supervised training corpora that are available. In this paper, we introduce two new neural architectures: one based on bidirectional LSTMs and conditional random fields",
             "and the other that constructs and labels segments using a transition-based approach inspired by shift-reduce parsers. Our models rely on two sources of information about words",
             "character-based word representations learned from the supervised corpus and unsupervised word representations learned from unannotated corpora",
             "Our models obtain state-of-the-art performance in NER in four languages without resorting to any language-specific knowledge or resources such as gazetteers"]
    print(onehot_matrix(list1))
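
The code above works on a hard-coded list, so requirements 1) and 4) (reading from a file and saving the vectors) are not covered. Below is a minimal sketch of the full pipeline using only the standard library. The file names `input.txt` and `vectors.txt`, the helper names, and the output layout (vocabulary on the first line, then one vector per line) are assumptions made for illustration, not part of the original answer. Sentences are split with a simple regex at `.`, `!` or `?` followed by whitespace, which will mis-split abbreviations such as "e.g."; if the file holds one sentence per line, `splitlines()` would be the simpler choice.

```python
import re
import tempfile
from collections import Counter
from pathlib import Path


def read_sentences(path):
    """Read a text file and split it at ., ! or ? followed by whitespace."""
    text = Path(path).read_text(encoding="utf-8")
    text = " ".join(text.split())  # collapse line breaks and repeated spaces
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def tokenize(sentence):
    """Drop commas, periods, and colons, then split on whitespace."""
    return re.sub(r"[,.:]", "", sentence).split()


def bag_of_words(sentences):
    """Return (vocab, vectors); vocab is ordered by first appearance."""
    docs = [tokenize(s) for s in sentences]
    vocab = list(dict.fromkeys(w for doc in docs for w in doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)  # a Counter returns 0 for missing words
        vectors.append([counts[w] for w in vocab])
    return vocab, vectors


def save_vectors(path, vocab, vectors):
    """Write the vocabulary on line 1, then one space-separated vector per line."""
    lines = [" ".join(vocab)] + [" ".join(map(str, v)) for v in vectors]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")


# Demo in a throwaway directory; both file names are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "input.txt"
    dst = Path(tmp) / "vectors.txt"
    src.write_text("Our models rely on two sources. Our models obtain results\n",
                   encoding="utf-8")
    sentences = read_sentences(src)
    vocab, vectors = bag_of_words(sentences)
    save_vectors(dst, vocab, vectors)
    saved = dst.read_text(encoding="utf-8")

print(saved)
```

To run this on real data, point `read_sentences` at the actual input file and `save_vectors` at the desired output path. Like `onehot_matrix`, the tokenizer is case-sensitive and leaves hyphenated words such as "State-of-the-art" intact.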
    
    This answer was selected by the asker as the best answer.

Question events

  • Closed by the system on Oct 4
  • Answer accepted on Sep 26
  • Question created on Sep 26
