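Python programming task: the files sensitive keyword unigram vector2000.csv and nonsensitive keyword unigram vector2000.csv each hold information for 2000 files. Each row has four cells, and the four cell values are the vectors of four words contained in one file. The first requirement is to process each file's vectors with the k hash functions of the provided p-stable-distribution locality-sensitive hashing (LSH) and generate a Bloom filter for each file (by the nature of a Bloom filter, ideally each vector sets k bit positions to 1 in the filter).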
For the p-stable LSH, refer to the p-stable-lsh-python-main project; for the Bloom filter and its hash functions, refer to the BloomFilter-master project.
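The idea is to plug the k hash functions h_{a,b}(v) from p-stable-lsh-python-main into BloomFilter-master; the hash functions that BloomFilter-master ships with can all be dropped. Then generate one Bloom filter per row, using the insert function in BloomFilter-master to insert the 4 vectors in each row of the CSV file. One file, i.e. one row, corresponds to one Bloom filter named "<index>.bin"; for example, the first row gives 1.bin. Take k=3 as an example, i.e. three hash functions h_{a1,b1}(v), h_{a2,b2}(v) and h_{a3,b3}(v). The hash function details are in p-stable-lsh-python-main; it is not yet clear whether that project can generate multiple functions, so this needs testing. The difficulty of the problem is uncertain and the payment (¥) can be raised; message me privately if interested.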
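A minimal sketch of the idea above, in case it helps whoever picks this up. It does not use the actual code of p-stable-lsh-python-main or BloomFilter-master; the names PStableHash and LSHBloomFilter, and the parameters num_bits, r and seed, are made up for illustration. It assumes the Gaussian (2-stable) case, h_{a,b}(v) = floor((a·v + b) / r), and maps each hash value to a bit position by taking it modulo the filter size. Drawing (a, b) independently k times gives k different functions.

```python
import math
import random


class PStableHash:
    """One hash h_{a,b}(v) = floor((a . v + b) / r), with a drawn from N(0, 1)."""

    def __init__(self, dim, r, rng):
        self.a = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # Gaussian is 2-stable
        self.b = rng.uniform(0.0, r)                        # random offset in [0, r]
        self.r = r

    def __call__(self, v):
        dot = sum(ai * vi for ai, vi in zip(self.a, v))
        return math.floor((dot + self.b) / self.r)


class LSHBloomFilter:
    """Bloom filter whose k hash functions are p-stable LSH functions (hypothetical class)."""

    def __init__(self, num_bits, dim, k=3, r=4.0, seed=0):
        rng = random.Random(seed)  # same seed -> same (a, b) for every filter
        self.num_bits = num_bits
        self.bits = bytearray((num_bits + 7) // 8)
        self.hashes = [PStableHash(dim, r, rng) for _ in range(k)]

    def _positions(self, v):
        # Python's % is non-negative for a positive modulus, even for negative hashes.
        return [h(v) % self.num_bits for h in self.hashes]

    def insert(self, v):
        for p in self._positions(v):
            self.bits[p // 8] |= 1 << (p % 8)

    def is_contain(self, v):
        return all((self.bits[p // 8] >> (p % 8)) & 1 for p in self._positions(v))
```

Two of the k hashes can land on the same bit after the modulo, so a vector may set fewer than k bits; that is why k bits per vector is only the ideal case.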
Test files: there are two; take sensitive keyword unigram vector2000.csv as an example.
First row:
1,0,1,0,1,0,1,0,0,0,1,1,0,0,1,0,0,1,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,1,0,0,0,1,0,0,1,1,1,0,0,0,1,1,1,1,1,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,1,1,0,1,1,0,0,1,1,1,1,1,0,1,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,1,1,0,1,0,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
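For reading rows like the one above, here is a rough sketch. It assumes each of the four cells is stored as a quoted string of comma-separated 0/1 values; the actual encoding in the CSV files may differ and should be checked first. The function names parse_row and read_vectors are made up.

```python
import csv
import io


def parse_row(row):
    """Turn one CSV row (a list of cell strings) into a list of integer vectors."""
    # Assumption: each cell looks like "1,0,1,0,..." (one word vector per cell).
    return [[int(x) for x in cell.split(",")] for cell in row if cell.strip()]


def read_vectors(fileobj):
    """Yield, for every row of the CSV, the list of vectors found in that row."""
    for row in csv.reader(fileobj):
        yield parse_row(row)


# Tiny in-memory example with 3-dimensional vectors instead of the real ones.
sample = io.StringIO('"1,0,1","0,0,1","1,1,1","0,1,0"\n')
rows = list(read_vectors(sample))
```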
Expected output 1: 1.bin through 2000.bin, which must pass the test with the is_contain function in the project.
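A small sketch of how the .bin files could be written and read back, assuming the filter's bit array is simply dumped as raw bytes (the on-disk format that BloomFilter-master expects may be different). save_filter and load_filter are hypothetical helpers, not functions from either project.

```python
import os


def save_filter(bits, index, out_dir="."):
    """Write the raw bit array of one filter to '<index>.bin' and return the path."""
    path = os.path.join(out_dir, f"{index}.bin")
    with open(path, "wb") as f:
        f.write(bytes(bits))
    return path


def load_filter(path):
    """Read a '<index>.bin' file back into a mutable bit array."""
    with open(path, "rb") as f:
        return bytearray(f.read())
```

Note that the file holds only the bits. For is_contain to succeed on a reloaded filter, the same (a, b) parameters must be used again, for example by fixing the random seed or by saving the parameters separately.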