Adolf K Wiseman 2022-10-30 09:41 Acceptance rate: 70%
Views: 60
Closed

Programming task: a Bloom filter that supports locality-sensitive hashing


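Implement in Python: the files sensitive keyword unigram vector2000.csv and nonsensitive keyword unigram vector2000.csv each hold information for 2000 files. The four cells in each row are the vectors of four words contained in one file. Using the k hash functions of the given p-stable locality-sensitive hashing (LSH) scheme, process these vectors for every file and generate one Bloom filter per file (by the nature of a Bloom filter, ideally each vector ends up with k positions in the filter whose bit is 1).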

For the p-stable LSH, see the p-stable-lsh-python-main project; for the Bloom filter and its hash functions, see the BloomFilter-master project.

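The idea is to take the k hash functions h_{a,b}(v) from the p-stable-lsh-python-main project and use them inside the BloomFilter-master project; the hash functions that BloomFilter-master ships with can all be dropped. Then build one Bloom filter per row: use the insert function from BloomFilter-master to insert the 4 vectors in each row of the CSV file. One file (that is, one row) corresponds to one Bloom filter, named "<index>.bin", so the first row becomes 1.bin. Take k=3 as an example, i.e. three hash functions h_{a1,b1}(v), h_{a2,b2}(v) and h_{a3,b3}(v). The hash function details are in p-stable-lsh-python-main; I am not sure yet whether it can generate multiple functions, so that needs testing. I cannot judge the difficulty; the bounty can be raised, so message me privately if you are interested.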
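A minimal sketch of what k p-stable hash functions could look like, written from scratch with NumPy rather than taken from p-stable-lsh-python-main (whose actual API is not shown in this thread). It assumes the standard 2-stable (Gaussian) form h_{a,b}(v) = floor((a·v + b) / r); the names `make_pstable_hashes`, `r`, `seed` and the filter size 1024 are illustrative assumptions.

```python
import numpy as np

def make_pstable_hashes(k, dim, r=4.0, seed=0):
    """Return a function computing k hashes h_i(v) = floor((a_i . v + b_i) / r).

    Assumption: a_i is drawn from a standard normal (2-stable) distribution
    and b_i uniformly from [0, r). A fixed seed keeps the k functions
    identical across all filters, so results stay comparable.
    """
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((k, dim))   # one projection vector a_i per row
    B = rng.uniform(0.0, r, size=k)     # one offset b_i per hash function

    def hash_all(v):
        v = np.asarray(v, dtype=float)
        return np.floor((A @ v + B) / r).astype(np.int64)

    return hash_all

# Toy example: k = 3 hash functions over an 8-dimensional 0/1 vector.
hash_all = make_pstable_hashes(k=3, dim=8)
v = np.array([1, 0, 1, 0, 1, 0, 0, 0])
buckets = hash_all(v)        # k integer bucket ids, possibly negative
positions = buckets % 1024   # map each bucket id to a bit index in [0, 1024)
print(buckets, positions)
```

Because each of the k functions has its own (a_i, b_i) pair, generating "multiple functions" amounts to drawing k independent pairs; whether the referenced project exposes this directly still has to be checked against its code.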

Test files: there are two. Taking sensitive keyword unigram vector2000.csv as an example,
the first row is:
1,0,1,0,1,0,1,0,0,0,1,1,0,0,1,0,0,1,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,1,0,0,0,1,0,0,1,1,1,0,0,0,1,1,1,1,1,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,1,1,0,1,1,0,0,1,1,1,1,1,0,1,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,1,1,0,1,0,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

Expected output 1: 1.bin through 2000.bin, which must pass the test with the project's is_contain function
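The per-row workflow described above (one filter per row, four inserts, save as "<index>.bin", verify with a membership check) might be sketched as follows. This is a self-contained stand-in, not the BloomFilter-master code: the class `LshBloomFilter`, the methods `save` and `load_bits`, the filter size `m=1024` and the bucket width `r` are all assumptions; only the method names `insert` and `is_contain` mirror the ones mentioned in the question.

```python
import os
import tempfile
import numpy as np

class LshBloomFilter:
    """Hypothetical Bloom filter whose k hash functions are p-stable LSH hashes."""

    def __init__(self, m, k, dim, r=4.0, seed=0):
        rng = np.random.default_rng(seed)     # same seed => same hashes in every filter
        self.m, self.r = m, r
        self.A = rng.standard_normal((k, dim))
        self.B = rng.uniform(0.0, r, size=k)
        self.bits = np.zeros(m, dtype=np.uint8)

    def _positions(self, v):
        v = np.asarray(v, dtype=float)
        buckets = np.floor((self.A @ v + self.B) / self.r).astype(np.int64)
        return buckets % self.m               # non-negative bit indices

    def insert(self, v):
        self.bits[self._positions(v)] = 1

    def is_contain(self, v):
        return bool(self.bits[self._positions(v)].all())

    def save(self, path):
        with open(path, "wb") as f:
            f.write(np.packbits(self.bits).tobytes())

    def load_bits(self, path):
        with open(path, "rb") as f:
            packed = np.frombuffer(f.read(), dtype=np.uint8)
        self.bits = np.unpackbits(packed)[: self.m].copy()

def build_filters(rows, out_dir, m=1024, k=3):
    """Write one filter per row as 1.bin, 2.bin, ... into out_dir."""
    dim = len(rows[0][0])
    for i, row in enumerate(rows, start=1):
        bf = LshBloomFilter(m, k, dim)
        for v in row:                          # the four word vectors of one file
            bf.insert(v)
        bf.save(os.path.join(out_dir, f"{i}.bin"))

# Toy data standing in for the CSV: 5 rows, 4 vectors each, 16 dimensions.
rows = np.random.default_rng(1).integers(0, 2, size=(5, 4, 16))
out_dir = tempfile.mkdtemp()
build_filters(rows, out_dir)

# Reload 1.bin and confirm every vector of the first row is reported as present.
check = LshBloomFilter(1024, 3, 16)
check.load_bits(os.path.join(out_dir, "1.bin"))
print(all(check.is_contain(v) for v in rows[0]))
```

Because several vectors (or several hash functions) can land on the same bit, a filter holding four vectors has at most 4k bits set, not always exactly 4k.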


3 answers

  • Java大魔王 2022-10-31 11:34

    I can help with the part about reading the CSV file.

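    from csv import reader
    import numpy as np
    
    if __name__ == '__main__':
        # Adjust the path to wherever you put the file.
        # newline='' is what the csv module recommends when opening files.
        with open('nonsensitive keyword unigram vector2000.csv', 'r', encoding='utf-8', newline='') as f:
            # Read row by row into a list
            data = list(reader(f))
        # Once everything is read, convert to a NumPy array (of strings)
        data = np.array(data)
        # First row
        print(data[0])
        # First row, first column (one cell, i.e. one word vector as a string)
        print(data[0][0])
        # The same cell with the commas removed
        print(data[0][0].replace(",", ""))
        # Length of that cell after removing the commas
        print(len(data[0][0].replace(",", "")))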
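To feed the rows into hash functions, the cell strings still need to become numeric vectors. The sketch below shows one way to do that, assuming each CSV cell is a quoted, comma-separated string of 0/1 values (which is what the `replace(",", "")` calls above imply); `parse_rows` is a hypothetical helper, and the in-memory sample stands in for the real file.

```python
import csv
import io
import numpy as np

def parse_rows(f):
    """Turn each CSV record into a list of integer NumPy vectors, one per cell."""
    rows = []
    for record in csv.reader(f):
        rows.append([np.array([int(x) for x in cell.split(",")]) for cell in record])
    return rows

# Tiny stand-in for the real file: 2 rows, 4 cells each, 3 values per cell.
sample = io.StringIO(
    '"1,0,1","0,0,1","1,1,0","0,1,0"\n'
    '"0,0,0","1,1,1","1,0,0","0,0,1"\n'
)
rows = parse_rows(sample)
print(len(rows), len(rows[0]), rows[0][0])
```

With the real data, `parse_rows(open(path, encoding='utf-8', newline=''))` would be the corresponding call, provided the file follows the assumed layout.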
    

Question events

  • Closed: November 5
  • Question edited: October 30
  • Question created: October 30
