Janehiwang 2019-04-04 23:06 · acceptance rate: 0%
555 views

Using NLTK to count words that appear in more than 5000 records

One column of a dataframe contains text.

I want to find the words that appear in more than 5000 of the records. Is there a function I can use for this?

I'm stuck and have no idea where to start...


1 answer

  • ly_2333 2019-04-06 14:16

First strip the punctuation, then call the string method split() to break the text into a word list, then call nltk.FreqDist() to count the words:

    import nltk
    
    def my_split(s):
        # Replace punctuation in the text with spaces before splitting;
        # extend this list with whatever punctuation you need to handle
        temp = [",",".","?","!",":",";","-","#","$","%","^","&","*","(",")","_","=","+","{","}","[","]","\\","|","'","<",">","~","`"]
        for e in temp:
            s = s.replace(e, " ")
        return s
    
    test_str = my_split("I have a dream. A nice dream")
    freq_words = dict(nltk.FreqDist(test_str.split()))
    
    print(freq_words)
    

    Output:
    {'I': 1, 'have': 1, 'a': 1, 'dream': 2, 'A': 1, 'nice': 1}

    For finer-grained processing you can also lowercase the text, apply stemming, and drop stopwords if you don't want them; NLTK has classes for each of these.
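    Note that FreqDist counts total occurrences, while the question asks for words that appear in more than 5000 *records*, i.e. document frequency: each word should be counted at most once per record. A minimal sketch of that, assuming a pandas DataFrame with a hypothetical `text` column and a toy stopword list (the real data and column name are not shown in the question):

    ```python
    from collections import Counter

    import pandas as pd

    # Hypothetical data standing in for the asker's 5000+ records
    df = pd.DataFrame({"text": ["I have a dream. A nice dream",
                                "a dream of nice weather",
                                "no dream here"] * 3000})

    # A tiny illustrative stopword set; nltk.corpus.stopwords provides a
    # fuller one (after nltk.download("stopwords"))
    stopwords = {"a", "of", "no", "i"}

    punctuation = ",.?!:;-#$%^&*()_=+{}[]\\|'<>~`"

    doc_freq = Counter()
    for record in df["text"]:
        cleaned = record.lower()
        for ch in punctuation:
            cleaned = cleaned.replace(ch, " ")
        # set() ensures each word is counted once per record (document frequency)
        doc_freq.update(set(cleaned.split()) - stopwords)

    # Words that appear in more than 5000 records
    common = sorted(w for w, n in doc_freq.items() if n > 5000)
    print(common)
    ```

    With this toy data, "dream" appears in all 9000 records and "nice" in 6000, so both clear the 5000-record threshold; words confined to a single 3000-copy text do not.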

