Implement in any programming language: parse each keyword into a sequence of 1-grams; we call this representation the uni-gram set. For example, the keyword “secure” is transformed into {s1, e1, c1, u1, r1, e2}, where “e1” is the first “e” in “secure” and “e2” is the second “e”. The uni-gram set is represented by a 160-bit vector called the uni-gram vector. The uni-gram vector consists of 26 ∗ 5 + 30 bits: the 26 ∗ 5 bits represent the 26 letters, each occurring up to 5 times, and the remaining 30 bits represent commonly used symbols and digits. A bit is set to 1 if the corresponding uni-gram occurs in the keyword; otherwise it remains 0.
Test file: keyword.txt
stategov
selfempnotinc
federalgov
localgov
priv
Expected output 1: uni-gram set.txt
s1,t1,a1,t2,e1,g1,o1,v1
s1,e1,l1,f1,e2,m1,p1,n1,o1,t1,i1,n2,c1
f1,e1,d1,e2,r1,a1,l1,g1,o1,v1
l1,o1,c1,a1,l2,g1,o2,v1
p1,r1,i1,v1
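The uni-gram sets listed above can be reproduced by scanning each keyword once and counting how many times each character has appeared so far. The following is only a minimal sketch under that reading of the task: the function name `unigram_set` is hypothetical, and reading keyword.txt and writing the output file are left out, so the keywords are hard-coded here.

```python
from collections import defaultdict


def unigram_set(keyword):
    """Return the uni-grams of a keyword in order of appearance.

    Each character is tagged with its occurrence number, so the second
    "e" in a keyword becomes "e2".
    """
    seen = defaultdict(int)  # character -> number of occurrences so far
    grams = []
    for ch in keyword:
        seen[ch] += 1
        grams.append(f"{ch}{seen[ch]}")
    return grams


if __name__ == "__main__":
    # Keywords copied from the sample keyword.txt above.
    for kw in ["stategov", "selfempnotinc", "federalgov", "localgov", "priv"]:
        print(",".join(unigram_set(kw)))
```

Joining the result with commas gives one line per keyword in the same shape as the expected output.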
Expected output 2: uni-gram vector.txt
{1,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,0,1,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,...}
{0,0,1,0,1,1,0,0,1,0,0,1,1,1,1,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,...}
{1,0,0,1,1,1,1,0,0,0,0,1,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...}
{1,0,1,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,...}
{0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...}
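The visible part of the expected vectors is consistent with one particular bit layout: the k-th occurrence of a letter sets bit 26·(k−1) + (letter index, a = 0). For example, “t2” in “stategov” lands on bit 45. The sketch below assumes that layout, which is inferred from the sample output rather than stated in the task. The function name `unigram_vector` is hypothetical, lowercasing the input is an assumption, and the mapping of symbols and digits onto the last 30 bits is not specified in the task, so those characters are simply skipped.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"
MAX_REPEATS = 5                          # each letter may occur up to 5 times
VECTOR_BITS = 26 * MAX_REPEATS + 30      # 160 bits in total


def unigram_vector(keyword):
    """Build the 160-bit uni-gram vector for a keyword (assumed layout)."""
    bits = [0] * VECTOR_BITS
    counts = {}
    for ch in keyword.lower():
        if ch not in ALPHABET:
            # Assumption: symbol/digit positions are unspecified, so skip them.
            continue
        counts[ch] = counts.get(ch, 0) + 1
        k = counts[ch]
        if k > MAX_REPEATS:
            continue  # no bit is reserved for a 6th or later occurrence
        bits[(k - 1) * 26 + ALPHABET.index(ch)] = 1
    return bits


if __name__ == "__main__":
    for kw in ["stategov", "selfempnotinc", "federalgov", "localgov", "priv"]:
        print("{" + ",".join(str(b) for b in unigram_vector(kw)) + "}")
```

Unlike the truncated samples above, this prints all 160 bits of each vector.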