Hadoop Python reduce fails with code 1

Here is my mapperG.py:

import sys

for line in sys.stdin:
    if not line:  # Skip empty lines
        continue
    line = line.strip()
    words = line.strip().split('->')
    for i in range(0,len(words)-1):
        if words[i]<words[i+1]:
            print("{}\t{}".format(words[i]+words[i+1],0))
        else:
            print("{}\t{}".format(words[i + 1]+words[i], 1))

Here is my reducerG.py:

from operator import itemgetter
import sys

current_nodes = None
current_direct = -1

for line in sys.stdin:
    line = line.strip()
    try:
        nodes, direct = line.split('\t', 1)
    except ValueError:
        # This handles lines that do not conform to the expected format
        print(f"Error processing line: {line}", file=sys.stderr)
        continue  # Skip this line and move to the next
    
    # Check if we're still processing the same pairs
    if current_nodes == nodes:
        if current_direct!=int(direct):
            current_direct=2
    else:
        # Output the previous pairs
        if current_nodes:
            if current_direct!=2:
                s = "{0} and {1} are in a one-way relationship".format(current_nodes[0], current_nodes[
                    1]) if current_direct == 0 else "{} and {} are in a one-way relationship".format(
                    current_nodes[1], current_nodes[0])
                print(s)
        # Reset for the new pairs
        current_nodes = nodes
        current_direct=int(direct)

# Output the last pairs after finishing all lines
if current_direct != 2:
    s = "{0} and {1} are in a one-way relationship".format(current_nodes[0], current_nodes[
        1]) if current_direct == 0 else "{} and {} are in a one-way relationship".format(
        current_nodes[1], current_nodes[0])
    print(s)

This is my command:

hadoop jar $HADOOP_HOME/hadoop-streaming-3.2.3.jar -input relations.txt -output outputG3 -mapper "python3 mapperG.py" -reducer "python3 reducerG.py" -file mapperG.py -file reducerG.py

The input file looks like this:

Running it locally with the following command produces results:

cat relations.txt|python mapperG.py|sort|python reducerG.py

But as soon as I run it on Hadoop, it fails with an error:

GISer Liu 2024-02-20 15:21
This answer is based on GPT-3.5 and was written by the blogger GIS_Liu:

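Based on the information you provided, the error occurs only when the job runs on Hadoop, while the same logic works locally. This is likely caused by a limitation or configuration issue in the Hadoop environment.

Approach:

Check the Hadoop logs: Start by looking at the Hadoop logs for more detailed error information, which helps pin down where the problem lies.

Confirm the input and output formats: Make sure your input and output formats match what Hadoop Streaming expects, especially the format of the data exchanged between the mapper and the reducer.

Environment variables: Make sure the Hadoop environment variables are set correctly and that the Python version is compatible with Hadoop. The Python version used in the Hadoop environment may differ from the one you use locally, which can lead to inconsistent behavior.

Debug the code: Add some print statements to the code (written to stderr, since stdout carries the records in Hadoop Streaming) so that you can inspect intermediate results when it runs on the Hadoop cluster and locate the problem.

Rule out permission problems: Make sure your user has sufficient permissions on the Hadoop cluster to run MapReduce jobs, including read and write access to the input and output paths.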
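
The logging and debugging steps above can be combined in a small wrapper. The following is only a sketch: `run_with_diagnostics` and `reducer_main` are hypothetical names, not part of the original scripts. It relies on Hadoop Streaming capturing a task's stderr in the task logs, so the interpreter version and any Python traceback end up next to the "code 1" failure instead of being lost.

```python
import sys
import traceback


def run_with_diagnostics(main, stderr=sys.stderr):
    """Run main(); on failure write the traceback to stderr and return 1."""
    # Record which interpreter the task actually uses on the cluster node.
    print("python: {}".format(sys.version.split()[0]), file=stderr)
    try:
        main()
    except Exception:
        # The full traceback goes to stderr, never to stdout,
        # because stdout is the record stream in a streaming job.
        traceback.print_exc(file=stderr)
        return 1
    return 0


def reducer_main():
    # Placeholder (assumption): the real reducer loop over sys.stdin goes here.
    for line in sys.stdin:
        pass


# At the bottom of the script one would call:
#     sys.exit(run_with_diagnostics(reducer_main))
```

The wrapper still exits with a non-zero status on failure, so the job is reported as failed exactly as before; the only difference is that the reason is written to the log.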

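Solution:

Looking at your code, I noticed a potential problem in the reducer: when comparing node relationships, you index into what is treated as a tuple to get the nodes, but the input text is never parsed into such a tuple. This may be one of the causes of the error.

I suggest modifying the reducer so that it parses the mapper output correctly before comparing node relationships. Specifically, the node pair needs to be parsed into tuple form, such as (node1, node2). You can achieve this by having the mapper emit its output in a suitable format.

Also, make sure the input and output formats of the mapper and reducer match what Hadoop Streaming expects; a mismatch is a common reason for a job to fail.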
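
To make the parsing point concrete, here is a small illustration of what the reducer receives from the mapper in the question. The node names `A` and `B` are assumptions, since the input file is not shown. The key is the two names concatenated with no separator, so indexing the key string only separates the nodes when every name is a single character, and splitting on whitespace does not separate them at all.

```python
# Record format produced by the mapper in the question: two node names
# concatenated with no separator, then a tab, then a direction flag.
sample = "AB\t0"  # assumed example record

nodes, direct = sample.split("\t", 1)

# Indexing the key string yields single characters, so this recovers the
# two nodes only when each node name is exactly one character long.
by_index = (nodes[0], nodes[1])

# Splitting on whitespace does not help: the key contains no whitespace,
# so the result is a one-element list.
by_split = nodes.split()

print(by_index)  # ('A', 'B')
print(by_split)  # ['AB']
```

This suggests that any tuple-based parsing in the reducer needs the mapper to emit an explicit separator between the two names.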

Modified code example:

# Modified reducerG.py
from operator import itemgetter
import sys

current_nodes = None
current_direct = -1

for line in sys.stdin:
    line = line.strip()
    try:
        nodes, direct = line.split('\t', 1)
    except ValueError:
        # This handles lines that do not conform to the expected format
        print(f"Error processing line: {line}", file=sys.stderr)
        continue  # Skip this line and move to the next

    # Convert nodes to a tuple
    nodes = tuple(nodes.split())

    # Check if we're still processing the same pairs
    if current_nodes == nodes:
        if current_direct != int(direct):
            current_direct = 2
    else:
        # Output the previous pairs
        if current_nodes:
            if current_direct != 2:
                s = "{0} and {1} are in a one-way relationship".format(
                    current_nodes[0], current_nodes[1]
                ) if current_direct == 0 else "{} and {} are in a one-way relationship".format(
                    current_nodes[1], current_nodes[0])
                print(s)
        # Reset for the new pairs
        current_nodes = nodes
        current_direct = int(direct)

# Output the last pairs after finishing all lines
if current_direct != 2:
    s = "{0} and {1} are in a one-way relationship".format(
        current_nodes[0], current_nodes[1]
    ) if current_direct == 0 else "{} and {} are in a one-way relationship".format(
        current_nodes[1], current_nodes[0])
    print(s)

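Apply these changes to your code and rerun the job to see whether the problem is resolved. If other problems remain, use the Hadoop logs to investigate further.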
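
As a complement, here is a hedged sketch of a mapper and reducer pair written as plain functions so the logic can be checked without a cluster. It makes two assumptions that are not in the original post: a hypothetical separator `SEP` between the two node names, and a guard for a reducer that receives no records at all. A reduce task on a cluster can receive no input (for example when more than one reducer is configured), and in that case a trailing "output the last pair" block that indexes `None` would raise an exception. Whether that is the cause here cannot be determined from the post, because the error output is not shown.

```python
SEP = ","  # hypothetical separator; assumed never to occur in a node name


def map_line(line):
    """Yield 'a,b<TAB>flag' records for each adjacent pair in 'x->y->z'."""
    words = line.strip().split("->")
    for left, right in zip(words, words[1:]):
        if not left or not right:
            continue  # skip malformed pairs such as 'A->'
        if left < right:
            yield "{}{}{}\t0".format(left, SEP, right)
        else:
            yield "{}{}{}\t1".format(right, SEP, left)


def reduce_sorted(records):
    """Yield one message per pair that was seen in only one direction."""
    current, flag = None, None

    def message():
        a, b = current.split(SEP, 1)
        first, second = (a, b) if flag == 0 else (b, a)
        return "{} and {} are in a one-way relationship".format(first, second)

    for record in records:
        try:
            key, value = record.rstrip("\n").split("\t", 1)
            value = int(value)
        except ValueError:
            continue  # ignore lines that are not 'key<TAB>int'
        if key == current:
            if value != flag:
                flag = 2  # both directions were seen for this pair
        else:
            if current is not None and flag != 2:
                yield message()
            current, flag = key, value
    # Guard: with no input at all, current is still None and nothing is emitted.
    if current is not None and flag != 2:
        yield message()
```

In the streaming scripts themselves, the mapper would print each record from `map_line(line)` for every line of `sys.stdin`, and the reducer would print each message from `reduce_sorted(sys.stdin)`. Keeping the logic in functions also lets the local `sort` step be imitated with `sorted(...)` in a quick test.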

