Hadoop: when using a TreeMap in a reducer, each new value overwrites all the old values

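One way to write a top-N MapReduce program is to rely on TreeMap's sorting: store the field to sort by as the key and the field to output as the value, then keep either the first or the last entries of the TreeMap, depending on whether you want the smallest or the largest. When I wrote such a program, I used a custom class as the TreeMap value. The result was that every put replaced the values under every key in the TreeMap with the value of the element just put in, which is very frustrating.
The custom class student implements the Writable interface and has only a Text name field and an int age field. The code for student is at the end.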

The MapReduce code is as follows.

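The map output key is a randomly generated number between 0 and 9. The student's name is taken from the input value, and age is random; the age field is not otherwise used.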

The reducer's output key is the map key, and the output Text is the TreeMap's toString(), i.e. the contents of the TreeMap after each reduce call.

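The TreeMap key is the key passed to the reducer (the long inside the LongWritable), and the value is the list of student objects passed in for that key. The goal is to keep the three largest keys. Because a TreeMap already keeps its keys in ascending order, it is enough to keep its last three entries.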
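The trimming idea described above can be sketched in plain Java, without any Hadoop classes. This is only an illustration under assumptions: the class name `TopNSketch`, its methods, and the use of String values are hypothetical and are not part of the job in this question.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.TreeMap;

    // Hypothetical helper (not from the original job): keeps only the n largest keys.
    final class TopNSketch {
        private final int n;
        // TreeMap keeps its keys in ascending order.
        private final TreeMap<Long, List<String>> top = new TreeMap<>();

        TopNSketch(int n) {
            this.n = n;
        }

        void add(long key, String value) {
            // Group values that share the same key.
            top.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
            // Evict the smallest key until at most n keys remain.
            while (top.size() > n) {
                top.pollFirstEntry();
            }
        }

        TreeMap<Long, List<String>> result() {
            return top;
        }

        public static void main(String[] args) {
            TopNSketch sketch = new TopNSketch(3);
            long[] keys = {0, 2, 2, 3, 4, 5};
            String[] names = {"d", "b", "c", "f", "h", "a"};
            for (int i = 0; i < keys.length; i++) {
                sketch.add(keys[i], names[i]);
            }
            // prints {3=[f], 4=[h], 5=[a]}
            System.out.println(sketch.result());
        }
    }

Because the values here are immutable Strings, this sketch does not show the overwriting problem that the rest of the question is about.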

The problem itself is described below the output file.

    public static class MapImpl extends
            Mapper<LongWritable, Text, LongWritable, student> {

        private Random ra = new Random(System.currentTimeMillis());
        private LongWritable k = new LongWritable();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            int d = ra.nextInt(10);
            student s = new student();
            s.setName(value);
            s.setAge(ra.nextInt(100));

            k.set(d);
            context.write(k, s);
        }
    }

    public static class ReducerImpl extends
            Reducer<LongWritable, student, LongWritable, Text> {
        private int n = 0;

        private TreeMap<Long, ArrayList<student>> tmap = new TreeMap<Long, ArrayList<student>>();

        private static NullWritable k = NullWritable.get();

        @Override
        protected void setup(Context context) throws IOException,
                InterruptedException {
            super.setup(context);
            n = Integer.valueOf(context.getConfiguration().get("n", "5"));
        }

        @Override
        protected void reduce(LongWritable arg0, Iterable<student> arg1,
                Context arg2) throws IOException, InterruptedException {

            Long k = arg0.get();
            for (student text : arg1) {

                ArrayList<student> arrayList = tmap.get(k);
                if (arrayList == null) {
                    arrayList = new ArrayList<student>();
                }
                arrayList.add(text);
                tmap.put(k, arrayList);
            }

            while (tmap.size() > n) {
                Long firstKey = tmap.firstKey();
                tmap.remove(firstKey);
            }
            arg2.write(arg0, new Text(tmap.toString()));
        }
    }

    public static void main(String[] args) {

        try {
            Configuration conf = new Configuration();

            conf.set("n", "3");

            String[] argss = new GenericOptionsParser(conf, args)
                    .getRemainingArgs();

            if (argss.length != 2) {
                System.out.println("parameter error");
                System.exit(1);
            }

            Job job = Job.getInstance(conf);

            job.setJarByClass(test02.class);


            job.setMapperClass(MapImpl.class);
            job.setReducerClass(ReducerImpl.class);

            job.setMapOutputKeyClass(LongWritable.class);
            job.setMapOutputValueClass(student.class);
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);


            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);

            Path inpath = new Path(argss[0]);
            Path outpath = new Path(argss[1]);

            HadoopUtils.PathCheck(conf, inpath, outpath);
            FileInputFormat.addInputPath(job, inpath);

            FileOutputFormat.setOutputPath(job, outpath);

            System.exit(job.waitForCompletion(true) ? 0 : 1);

        } catch (Exception e) {
            e.printStackTrace();
        }

    }

The input file is as follows:

a
b
c
d
e
f
g
h
i

The output file is as follows:

0   {0=[d   55]}
2   {0=[b   6], 2=[b    6, b    6]}
3   {0=[f   86], 2=[f   86, f   86], 3=[f   86]}
4   {2=[h   37, h   37], 3=[h   37], 4=[h   37]}
5   {3=[a   98], 4=[a   98], 5=[a   98, a   98]}
6   {4=[i   18], 5=[i   18, i   18], 6=[i   18]}
8   {5=[g   55, g   55], 6=[g   55], 8=[g   55]}

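This shows the problem clearly. After the first reduce call, the value under key 0 is still [d 55], but after the second reduce call it has been overwritten by [b 6], which is the value that just arrived for key 2. On top of that, key 2 holds [b 6] twice. Every line of the input file contains a different letter, so there should be no duplicates at all. This is very puzzling.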

The student class implements the Writable interface and has only a Text name field and an int age field.

    public class student implements Writable {
        Text name = new Text();
        int age = 0;

        public Text getName() {
            return name;
        }

        public void setName(Text name) {
            this.name = name;
        }

        public int getAge() {
            return age;
        }

        public void setAge(int age) {
            this.age = age;
        }

        @Override
        public String toString() {
            return name + "\t" + age;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            this.name.write(out);
            out.writeInt(this.age);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            this.name.readFields(in);
            this.age = in.readInt();
        }

    }


1 answer

pengyu111111 2020-03-30 11:49
This is most likely caused by passing references. In the map and reduce methods, make sure to create a new value object each time.
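To illustrate the answer: Hadoop's reducer reuses a single value object while iterating over the values, so storing that reference in a collection leaves every stored entry pointing at the same instance. The sketch below imitates this in plain Java, without Hadoop. `ReuseSketch`, its nested `Student` class, and `reusingIterable` are hypothetical stand-ins, not the classes from the question.

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;
    import java.util.NoSuchElementException;

    // Plain-Java stand-in for the reduce-side behaviour; all names are illustrative.
    final class ReuseSketch {
        static final class Student {
            String name;
            int age;

            Student() {}

            // Copy constructor: the "new value each time" part of the fix.
            Student(Student other) {
                this.name = other.name;
                this.age = other.age;
            }

            @Override
            public String toString() {
                return name + " " + age;
            }
        }

        // Mimics an iterable that refills one shared instance on every next().
        static Iterable<Student> reusingIterable(String[] names, int[] ages) {
            Student shared = new Student();
            return () -> new Iterator<Student>() {
                private int i = 0;

                @Override
                public boolean hasNext() {
                    return i < names.length;
                }

                @Override
                public Student next() {
                    if (!hasNext()) {
                        throw new NoSuchElementException();
                    }
                    shared.name = names[i];
                    shared.age = ages[i];
                    i++;
                    return shared; // always the same object
                }
            };
        }

        // Stores either the shared reference (buggy) or a fresh copy (fixed).
        static List<Student> collect(Iterable<Student> values, boolean copy) {
            List<Student> out = new ArrayList<>();
            for (Student s : values) {
                out.add(copy ? new Student(s) : s);
            }
            return out;
        }

        public static void main(String[] args) {
            String[] names = {"d", "b"};
            int[] ages = {55, 6};
            // prints [b 6, b 6]: both entries are the same reused object
            System.out.println(collect(reusingIterable(names, ages), false));
            // prints [d 55, b 6]: each entry is an independent copy
            System.out.println(collect(reusingIterable(names, ages), true));
        }
    }

In the reducer from the question, the analogous change would be to add a copy of `text` to `arrayList` instead of `text` itself. One option is a copy constructor on `student` that also copies the `Text` contents (for example with `new Text(other.name)`), since `setName` only stores the reference. Another option that should work is `WritableUtils.clone(text, arg2.getConfiguration())`.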
