shero_f 2021-09-26 10:42 Acceptance rate: 100%

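Python: I don't know how to write this one. Could anyone help?

Read text from an English document and represent each sentence as a bag-of-words feature vector. The requirements are:

1) Read all the English sentences from the file;

2) Collect the words that occur across all sentences;

3) Represent each sentence as a bag-of-words vector;

4) Save each sentence's vector to a new document.

The contents of the document set are shown below.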

"State-of-the-art named entity recognition systems rely heavily on hand-crafted features and domain-specific knowledge in order to learn effectively from the small",

"supervised training corpora that are available. In this paper, we introduce two new neural architectures: one based on bidirectional LSTMs and conditional random fields",

"and the other that constructs and labels segments using a transition-based approach inspired by shift-reduce parsers. Our models rely on two sources of information about words",

"character-based word representations learned from the supervised corpus and unsupervised word representations learned from unannotated corpora",

"Our models obtain state-of-the-art performance in NER in four languages without resorting to any language-specific knowledge or resources such as gazetteers"

2 answers

  • CSDN专家-kaily 2021-09-26 14:13
    import re

    import numpy as np


    def onehot_matrix(list1):
        # Strip commas, periods and colons, then split each sentence on whitespace
        docs = [re.sub(r"[,.:]", "", sentence).split() for sentence in list1]

        # Vocabulary: every distinct word, numbered in order of first appearance
        vocab = {}
        for doc in docs:
            for word in doc:
                vocab.setdefault(word, len(vocab))
        print(vocab)  # print the dictionary (word -> column index)

        # Build an M x V zero matrix: M is the number of sentences and
        # V is the number of distinct words, i.e. the vector dimension
        bow = np.zeros((len(docs), len(vocab)), dtype=int)

        # Bag of words: count how often each word occurs in each sentence
        for i, doc in enumerate(docs):
            for word in doc:
                bow[i][vocab[word]] += 1
        return bow.tolist()
    
    list1 = ["State-of-the-art named entity recognition systems rely heavily on hand-crafted features and domain-specific knowledge in order to learn effectively from the small",
             "supervised training corpora that are available. In this paper, we introduce two new neural architectures: one based on bidirectional LSTMs and conditional random fields",
             "and the other that constructs and labels segments using a transition-based approach inspired by shift-reduce parsers. Our models rely on two sources of information about words",
             "character-based word representations learned from the supervised corpus and unsupervised word representations learned from unannotated corpora",
             "Our models obtain state-of-the-art performance in NER in four languages without resorting to any language-specific knowledge or resources such as gazetteers"]
    print(onehot_matrix(list1))
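
The answer above works on a hard-coded list, so requirements 1) and 4) (reading from a file and saving the vectors) are not covered. A minimal sketch of those two steps follows, using only the standard library. It assumes the input file holds one sentence per non-empty line, optionally wrapped in double quotes with a trailing comma as in the question. The function names and the file names `sentences.txt` and `bow_vectors.txt` are hypothetical, chosen only for illustration, and the output format (one line of space-separated counts per sentence) is also an assumption.

```python
import re
from collections import Counter
from pathlib import Path


def read_sentences(path):
    """Return one sentence per non-empty line (assumed file layout)."""
    text = Path(path).read_text(encoding="utf-8")
    # Drop surrounding whitespace, a trailing comma, and enclosing double quotes
    return [
        line.strip().rstrip(",").strip('"')
        for line in text.splitlines()
        if line.strip()
    ]


def tokenize(sentence):
    """Remove commas, periods and colons, then split on whitespace."""
    return re.sub(r"[,.:]", "", sentence).split()


def bag_of_words(sentences):
    """Return (vocab, vectors); vocab keeps the order of first appearance."""
    docs = [tokenize(s) for s in sentences]
    vocab = list(dict.fromkeys(word for doc in docs for word in doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)  # missing words count as 0
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors


def save_vectors(path, vectors):
    """Write one vector per line as space-separated counts."""
    lines = [" ".join(str(count) for count in vector) for vector in vectors]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")


def sentences_to_bow_file(in_path, out_path):
    """Read sentences, build bag-of-words vectors, and save them."""
    sentences = read_sentences(in_path)
    vocab, vectors = bag_of_words(sentences)
    save_vectors(out_path, vectors)
    return vocab, vectors
```

Calling `sentences_to_bow_file("sentences.txt", "bow_vectors.txt")` would then write one vector per sentence. Like the answer above, this sketch is case-sensitive, so "Our" and "our" would be counted as different words; lowercasing inside `tokenize` is one way to change that if desired.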
    
    The asker selected this answer as the best answer.

Question events

  • Closed by the system on October 4
  • Answer accepted on September 26
  • Question created on September 26
