Using a TreeMap in a Hadoop reducer: each new value overwrites all the old values

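When writing a top-N MapReduce program, one approach is to rely on TreeMap's sorting: store the field to sort by as the key and the field to output as the value, then keep either the first or the last entries of the TreeMap depending on whether you want the smallest or the largest. When I wrote this program, however, I used a custom class as the TreeMap value, and then every put ended up setting every value under every key in the TreeMap to the value of the newly put element, which is extremely frustrating.
The custom class student implements Writable and has just a Text name field and an int age field. The code for student is at the end.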

The MapReduce code is as follows.

The map outputs a key that is a randomly generated number between 0 and 9 and a student whose name is the input value; the age is random and is not otherwise used.

The reducer's output key is the map key, and the output Text is the TreeMap's toString(), that is, the contents of the TreeMap after that reduce call.

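The TreeMap key is the key passed to the reducer (the long inside the LongWritable) and the value is the student passed in. The goal is to keep the three largest keys. Because a TreeMap is already sorted in ascending order, it is enough to take its last three entries.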
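The trimming idea described above can be sketched in plain Java, without Hadoop. This is only an illustration: the class name `TopNSketch`, the helper `putAndTrim`, and the use of String values are assumptions made for the example, not part of the original program.

```java
import java.util.TreeMap;

// Minimal top-N sketch in plain Java (no Hadoop): keep only the n largest keys.
class TopNSketch {

    // Inserts (key, value), then evicts the smallest keys until at most n remain.
    static void putAndTrim(TreeMap<Long, String> map, long key, String value, int n) {
        map.put(key, value);
        while (map.size() > n) {
            // A TreeMap iterates in ascending key order, so firstKey() is the minimum.
            map.remove(map.firstKey());
        }
    }

    public static void main(String[] args) {
        TreeMap<Long, String> top = new TreeMap<>();
        long[] keys = {0, 2, 3, 4, 5};
        String[] names = {"d", "b", "f", "h", "a"};
        for (int i = 0; i < keys.length; i++) {
            putAndTrim(top, keys[i], names[i], 3);
        }
        System.out.println(top); // prints {3=f, 4=h, 5=a}
    }
}
```

Removing from the front after each insert keeps the map at no more than n entries, so memory stays bounded no matter how many keys arrive.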

The problem itself is described below the output file.

    public static class MapImpl extends
            Mapper<LongWritable, Text, LongWritable, student> {

        private Random ra = new Random(System.currentTimeMillis());
        private LongWritable k = new LongWritable();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            int d = ra.nextInt(10);
            student s = new student();
            s.setName(value);
            s.setAge(ra.nextInt(100));

            k.set(d);
            context.write(k, s);
        }
    }

    public static class ReducerImpl extends
            Reducer<LongWritable, student, LongWritable, Text> {
        private int n = 0;

        private TreeMap<Long, ArrayList<student>> tmap = new TreeMap<Long, ArrayList<student>>();

        private static NullWritable k = NullWritable.get();

        @Override
        protected void setup(Context context) throws IOException,
                InterruptedException {
            super.setup(context);
            n = Integer.valueOf(context.getConfiguration().get("n", "5"));
        }

        @Override
        protected void reduce(LongWritable arg0, Iterable<student> arg1,
                Context arg2) throws IOException, InterruptedException {

            Long k = arg0.get();
            for (student text : arg1) {

                ArrayList<student> arrayList = tmap.get(k);
                if (arrayList == null) {
                    arrayList = new ArrayList<student>();
                }
                arrayList.add(text);
                tmap.put(k, arrayList);
            }

            while (tmap.size() > n) {
                Long firstKey = tmap.firstKey();
                tmap.remove(firstKey);
            }
            arg2.write(arg0, new Text(tmap.toString()));
        }
    }

    public static void main(String[] args) {

        try {
            Configuration conf = new Configuration();

            conf.set("n", "3");

            String[] argss = new GenericOptionsParser(conf, args)
                    .getRemainingArgs();

            if (argss.length != 2) {
                System.out.println("parameter error");
                System.exit(1);
            }

            Job job = Job.getInstance(conf);

            job.setJarByClass(test02.class);


            job.setMapperClass(MapImpl.class);
            job.setReducerClass(ReducerImpl.class);

            job.setMapOutputKeyClass(LongWritable.class);
            job.setMapOutputValueClass(student.class);
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);


            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);

            Path inpath = new Path(argss[0]);
            Path outpath = new Path(argss[1]);

            HadoopUtils.PathCheck(conf, inpath, outpath);
            FileInputFormat.addInputPath(job, inpath);

            FileOutputFormat.setOutputPath(job, outpath);

            System.exit(job.waitForCompletion(true) ? 0 : 1);

        } catch (Exception e) {
            e.printStackTrace();
        }

    }

The input file is as follows

a
b
c
d
e
f
g
h
i

The output file is as follows

0   {0=[d   55]}
2   {0=[b   6], 2=[b    6, b    6]}
3   {0=[f   86], 2=[f   86, f   86], 3=[f   86]}
4   {2=[h   37, h   37], 3=[h   37], 4=[h   37]}
5   {3=[a   98], 4=[a   98], 5=[a   98, a   98]}
6   {4=[i   18], 5=[i   18, i   18], 6=[i   18]}
8   {5=[g   55, g   55], 6=[g   55], 8=[g   55]}

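This shows clearly that after the first reduce call the value for key 0 was still [d 55], but after the second reduce call it had been overwritten by [b 6], which is the value newly added under key 2. On top of that, key 2 holds [b 6] twice. Every line of the input file has a different letter, so there should be no duplicates at all, which is very puzzling.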

The student class implements Writable and has just a Text name field and an int age field.

public class student implements Writable {
    Text name = new Text();
    int age = 0;

    public Text getName() {
        return name;
    }

    public void setName(Text name) {
        this.name = name;
    }

    public int getAge() {
        return age;
    }

    public void setAge(int age) {
        this.age = age;
    }

    @Override
    public String toString() {
        return name + "\t" + age;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        this.name.write(out);
        out.writeInt(this.age);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.name.readFields(in);
        this.age = in.readInt();
    }

}

pengyu111111 2020-03-30 11:49
This is probably caused by passing references. In the map and reduce methods, make sure to create a new value object each time.
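To illustrate the answer: as far as I know, Hadoop reuses a single value object while iterating over the values in reduce, refilling it through readFields for each record. Storing that reference in a collection therefore stores the same object many times over. The sketch below simulates this in plain Java without Hadoop; `ReuseSketch`, `Record`, `reusingIterable`, `collectByReference` and `collectByCopy` are hypothetical names standing in for the reducer, the student class and the framework's iterable.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Plain-Java simulation (no Hadoop) of an iterable that reuses one value object.
class ReuseSketch {

    // Hypothetical stand-in for the question's mutable student class.
    static class Record {
        String name;
        int age;

        Record() {}

        // Copy constructor: takes a snapshot of the other record's current state.
        Record(Record other) {
            this.name = other.name;
            this.age = other.age;
        }

        @Override
        public String toString() {
            return name + "\t" + age;
        }
    }

    // Assumption: mimics the framework by refilling ONE shared instance on every next().
    static Iterable<Record> reusingIterable(String[] names, int[] ages) {
        final Record shared = new Record();
        return () -> new Iterator<Record>() {
            private int i = 0;

            @Override
            public boolean hasNext() {
                return i < names.length;
            }

            @Override
            public Record next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                shared.name = names[i];
                shared.age = ages[i];
                i++;
                return shared; // always the same object, with new contents
            }
        };
    }

    // Buggy pattern: every list slot points at the same shared object.
    static List<Record> collectByReference(Iterable<Record> values) {
        List<Record> out = new ArrayList<>();
        for (Record r : values) {
            out.add(r);
        }
        return out;
    }

    // Fixed pattern: store an independent copy of each record.
    static List<Record> collectByCopy(Iterable<Record> values) {
        List<Record> out = new ArrayList<>();
        for (Record r : values) {
            out.add(new Record(r));
        }
        return out;
    }

    public static void main(String[] args) {
        String[] names = {"d", "b", "f"};
        int[] ages = {55, 6, 86};
        // The last record appears three times, because all entries alias one object.
        System.out.println(collectByReference(reusingIterable(names, ages)));
        // Three distinct records, because each entry is its own copy.
        System.out.println(collectByCopy(reusingIterable(names, ages)));
    }
}
```

In the reducer from the question, the equivalent change would be to add a copy rather than `text` itself to the ArrayList, for example a new student whose name is set to `new Text(text.getName())` and whose age is set to `text.getAge()`. The name needs its own Text because readFields overwrites the existing Text in place.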
