MichiMeme 2021-11-09 14:24 Acceptance rate: 0%
Views: 18

Tokenization problem when building a meta-learning dataset with spaCy

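In the paper FEW-SHOT TEXT CLASSIFICATION WITH DISTRIBUTIONAL SIGNATURES, each sample of the Amazon dataset consists of three fields: text, raw, and label. text is the tokenized form of raw, stored as a list. This is the text of one sample: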

list1 = ['i', 'was', 'pleasantly', 'surprised', 'with', 'this', '"', 'out', 'of', 'the', 'box', '"', 'series', '.', ' ', 'good', 'writing', ',', 'good', 'acting', ',', 'laugh', 'out', 'loud', 'situations', '.', ' ', 'devito', 'showing', 'up', 'in', 'the', 'second', 'season', 'gave', 'it', 'a', 'little', 'boost', 'as', 'he', "'s", 'always', 'dependable', 'for', 'turning', 'the', 'mundane', 'into', 'the', 'hilarious', '.', 'it', "'s", 'basically', 'about', '3', 'jackass', 'friends', 'in', 'philly', 'who', 'own', 'a', 'bar', 'and', 'get', 'themselves', 'into', 'offbeat', 'situations', '.', ' ', 'what', 'i', 'liked', 'best', 'is', 'that', 'it', 'is', 'not', 'the', 'clice', 'venue', 'for', 'the', 'young', 'and', 'the', 'beautiful', '.', ' ', 'it', 'often', 'hi', '-', 'lightes', 'the', 'old', 'and', 'the', 'ugly', 'and', 'in', 'doing', 'so', 'cultivates', 'a', 'good', 'portion', 'of', 'the', 'laughs', '.', 'worth', 'you', 'time', 'and', 'money', '....', 'bg']
The paper does not say which tokenizer was used.
This is what I get when I tokenize raw with spaCy's en_core_web_sm:
list2 = ['i', 'was', 'pleasantly', 'surprised', 'with', 'this', '"', 'out', 'of', 'the', 'box', '"', 'series', '.', ' ', 'good', 'writing', ',', 'good', 'acting', ',', 'laugh', 'out', 'loud', 'situations', '.', ' ', 'devito', 'showing', 'up', 'in', 'the', 'second', 'season', 'gave', 'it', 'a', 'little', 'boost', 'as', 'he', "'s", 'always', 'dependable', 'for', 'turning', 'the', 'mundane', 'into', 'the', 'hilarious.it', "'s", 'basically', 'about', '3', 'jackass', 'friends', 'in', 'philly', 'who', 'own', 'a', 'bar', 'and', 'get', 'themselves', 'into', 'offbeat', 'situations', '.', ' ', 'what', 'i', 'liked', 'best', 'is', 'that', 'it', 'is', 'not', 'the', 'clice', 'venue', 'for', 'the', 'young', 'and', 'the', 'beautiful', '.', ' ', 'it', 'often', 'hi', '-', 'lightes', 'the', 'old', 'and', 'the', 'ugly', 'and', 'in', 'doing', 'so', 'cultivates', 'a', 'good', 'portion', 'of', 'the', 'laughs.worth', 'you', 'time', 'and', 'money', '....', 'bg']

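Every mismatched token is a word that contains '.', such as hilarious.it or c.g.i.
I want to split the tokens in the spaCy output that contain '.' by hand, but doing so also affects tokens that consist only of '.' characters (such as '....'), and I have not found a good way to do the split manually.
Alternatively, is there a more suitable tokenizer that produces the text result directly?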
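
To see exactly where the two tokenizations diverge, the lists can be diffed with the standard library. The following is a minimal sketch on a shortened excerpt of list1 and list2; `find_mismatches` is a hypothetical helper name introduced here for illustration, not something from the paper or from spaCy.

```python
from difflib import SequenceMatcher


def find_mismatches(reference, candidate):
    """Return (reference_slice, candidate_slice) pairs where the token lists differ."""
    matcher = SequenceMatcher(a=reference, b=candidate, autojunk=False)
    return [
        (reference[i1:i2], candidate[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Shortened excerpts of list1 (dataset) and list2 (spaCy output).
reference = ['into', 'the', 'hilarious', '.', 'it', "'s", 'basically']
candidate = ['into', 'the', 'hilarious.it', "'s", 'basically']

for ref_tokens, cand_tokens in find_mismatches(reference, candidate):
    print(ref_tokens, "<->", cand_tokens)
```

Running the same function over the full lists should show whether every difference really is a fused token containing '.'.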

1 answer

  • XINFINFZ 2021-11-09 17:27

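    The main issue is really whitespace: in some places one sentence ends and the next begins with no space after the period, so the two words are merged into a single token. The elif branch in the code below handles that case.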

    
    import re
    import spacy

    nlp = spacy.load("en_core_web_sm")
    str1 = "I was pleasantly surprised with this \"out of the box\" series.  Good writing, good acting, laugh out loud situations.  Devito showing up in the second season gave it a little boost as he's always dependable for turning the mundane into the hilarious.It's basically about 3 jackass friends in Philly who own a bar and get themselves into offbeat situations.  What I liked best is that it is not the clice venue for the young and the beautiful.  It often hi-lightes the old and the ugly and in doing so cultivates a good portion of the laughs.Worth you time and money....bg"
    list1 = []
    for token in nlp(str1.lower()):
        text = token.text
        if text == '"':
            list1.append(text)
        # Token contains '.' but is not made up only of dots, e.g. 'hilarious.it'.
        elif '.' in text and text.count('.') != len(text):
            # Split into non-dot runs and single dots: 'hilarious', '.', 'it'.
            # [^.]+ is used instead of \w+ so that hyphens, apostrophes, etc. are not dropped.
            list1.extend(re.findall(r'[^.]+|\.', text))
        else:
            # Everything else, including dot-only tokens such as '.' and '....', is kept as is.
            list1.append(text)
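
The same rule can also be applied to a token list that has already been produced, without re-running spaCy. The following is a minimal sketch under that assumption; `split_fused_tokens` is a hypothetical helper name, and the pattern `[^.]+|\.` (rather than `\w+|\.`) is a choice made here so that characters such as hyphens or apostrophes inside a token are not silently dropped.

```python
import re

# Matches either a run of non-dot characters or a single dot.
_PIECES = re.compile(r"[^.]+|\.")


def split_fused_tokens(tokens):
    """Split tokens such as 'hilarious.it' on '.', leaving dot-only tokens intact."""
    result = []
    for token in tokens:
        # strip(".") is empty only when the token consists solely of dots.
        if "." in token and token.strip(".") != "":
            result.extend(_PIECES.findall(token))
        else:
            result.append(token)
    return result


tokens = ['the', 'hilarious.it', "'s", 'money', '....', 'bg', '.', ' ']
print(split_fused_tokens(tokens))
```

Tokens such as '.' and '....' pass through unchanged, which is the case the question was worried about. It may also be worth checking whether tokenizing the original-case text and lowercasing the tokens afterwards changes the result, since spaCy's default English infix rules treat a period between a lowercase and an uppercase letter as a split point; this has not been verified against the dataset.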
    