2301_80063666 2025-03-24 12:44

Multi-node, multi-GPU distributed launch problem

The bash launch script fails to start the job.


/home/server/anaconda3/envs/cod/lib/python3.9/site-packages/torch/distributed/launch.py:180: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(

It hangs at this point.
The network is fine, the firewall has been turned off, torch.distributed.launch itself works, and other people can run the same setup.
#!/bin/bash

export TORCH_USE_CUDA_DSA=1
export CUDA_LAUNCH_BLOCKING=1
SCRIPT_PATH="$(cd "$(dirname "$0")"; pwd -P)"
SCRIPT_NAME=$(basename "$0")
export NCCL_DEBUG=INFO
LOG_DATE="$(date +'%Y%m%d')"
LOG_DIR="${SCRIPT_PATH}/logs"
LOG_FILE="${LOG_DIR}/${SCRIPT_NAME}.log-${LOG_DATE}"
sudo sysctl -w net.ipv6.conf.all.disable_ipv6=1
sudo sysctl -w net.ipv6.conf.default.disable_ipv6=1
MASTER_PORT=$(shuf -i 10000-65000 -n 1)
echo $MASTER_PORT > /home/server/project/gtc/ddp_master_port.txt
# Create the log directory
mkdir -p "${LOG_DIR}"

# ------------------ Parse GPU argument ------------------
GPU_IDS=""
if [[ $# -gt 0 ]]; then
    GPU_IDS="$1"
fi

# If no GPU_IDS were provided, default to using all GPUs
if [[ -z "$GPU_IDS" ]]; then
    GPU_IDS="0,1,2,3"  # adjust this default to match the machine
fi
fi

export CUDA_VISIBLE_DEVICES=$GPU_IDS
echo "Using GPUs: ${GPU_IDS:-All available}"

run_script() {
    local script_name=$1
    local MASTER_PORT=$2

    echo "-----------------------------------------------"
    echo "Running $script_name with GPUs ${GPU_IDS:-All}..."
    echo "-----------------------------------------------"

    export MASTER_PORT=$MASTER_PORT
    echo "开始训练"
    /home/server/anaconda3/envs/cod/bin/python -m torch.distributed.launch \
        --nproc_per_node=1 \
        --nnodes=2 \
        --node_rank=0 \
        --master_addr="IP" \
        --master_port=$MASTER_PORT \
        "$script_name".py \
        2>&1 | tee -a "$LOG_FILE"

    echo "执行完毕"
    echo "$script_name completed. Logs appended to $LOG_FILE"
}

run_script My_Train_dis 39500
run_script My_Testing_dis 39500
run_script eval_dis 39500

echo "All scripts executed. Check logs at $LOG_FILE"



3 answers

  • 道友老李 2025-03-24 12:44
    This answer was compiled with reference to GPT output and organized here.
    According to the warning and the code, the `torch.distributed.launch` module is deprecated and should be replaced with `torchrun`, with the training script reading the local rank from `os.environ['LOCAL_RANK']` instead of a `--local_rank` argument. There is another problem as well: `--master_addr` is not set to a real address and must be filled in with the master node's actual IP. Here is the modified script:
    #!/bin/bash
    export TORCH_USE_CUDA_DSA=1
    export CUDA_LAUNCH_BLOCKING=1
    SCRIPT_PATH="$(cd "$(dirname "$0")"; pwd -P)"
    SCRIPT_NAME=$(basename "$0")
    export NCCL_DEBUG=INFO
    LOG_DATE="$(date +'%Y%m%d')"
    LOG_DIR="${SCRIPT_PATH}/logs"
    LOG_FILE="${LOG_DIR}/${SCRIPT_NAME}.log-${LOG_DATE}"
    sudo sysctl -w net.ipv6.conf.all.disable_ipv6=1
    sudo sysctl -w net.ipv6.conf.default.disable_ipv6=1
    MASTER_PORT=$(shuf -i 10000-65000 -n 1)
    echo $MASTER_PORT > /home/server/project/gtc/ddp_master_port.txt
    # Create the log directory
    mkdir -p "${LOG_DIR}"
    # Parse GPU argument
    GPU_IDS=""
    if [[ $# -gt 0 ]];
    then
        GPU_IDS="$1"
    fi
    if [[ -z "${GPU_IDS}" ]];
    then
        GPU_IDS="0,1,2,3"  # 这里可以根据实际情况修改默认值
    fi
    export CUDA_VISIBLE_DEVICES="${GPU_IDS}"
    echo "Using GPUs: ${GPU_IDS:-All available}"
    run_script() {
        local script_name=$1
        local MASTER_PORT=$2
        echo "-----------------------------------------------"
        echo "Running $script_name with GPUs ${GPU_IDS:-All}..."
        echo "-----------------------------------------------"
        export MASTER_PORT=$MASTER_PORT
        echo "开始训练"
        # torch.distributed.run is the module form of the torchrun command
        /home/server/anaconda3/envs/cod/bin/python -m torch.distributed.run \
            --nproc_per_node=1 \
            --nnodes=2 \
            --node_rank=0 \
            --master_addr="IP" \
            --master_port=$MASTER_PORT \
            "$script_name.py" \
            2>&1 | tee -a "$LOG_FILE"
        echo "执行完毕"
        echo "$script_name completed. Logs appended to $LOG_FILE"
    }
    run_script My_Train_dis 39500
    run_script My_Testing_dis 39500
    run_script eval_dis 39500
    echo "All scripts executed. Check logs at $LOG_FILE"
    

    Replace `IP` with the master node's actual IP address and make sure `torchrun` (or the equivalent `python -m torch.distributed.run`) is available in the environment. Hopefully this resolves the problem.
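    It can also help to confirm that the rendezvous port is actually reachable between the two machines before launching; a quick check, assuming the OpenBSD variant of nc (netcat) is installed and using the same placeholder address:

    # On node 0, open a temporary listener on the chosen master port:
    nc -l 39500

    # On node 1, test whether node 0's port is reachable:
    nc -zv 192.168.1.10 39500
    # or, with plain bash and no netcat:
    timeout 3 bash -c '</dev/tcp/192.168.1.10/39500' && echo "port reachable"

    If the port is unreachable even with the firewall off, pinning the network interface NCCL should use (for example export NCCL_SOCKET_IFNAME=eth0, with eth0 adjusted to the real interface) is a common next step, since NCCL may otherwise pick a virtual or docker interface.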
