Questions about DataParallel and DistributedDataParallel in PyTorch - V2EX


This topic was created 2385 days ago; the information mentioned may have changed since then.

Environment: a single machine (one node) with 4 GPUs

PyTorch offers two modes for multi-GPU training: DataParallel (DP) and DistributedDataParallel (DDP). In my own measurements I found:

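When a single GPU can reach 100% utilization, DP on 4 GPUs gives a speedup of about 2
When a single GPU can reach 100% utilization, DDP on 4 GPUs gives a speedup of about 4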
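
To put the two measurements on a common scale, here is a small sketch that turns a speedup into a scaling efficiency (speedup divided by GPU count). The function name `scaling_efficiency` is made up for illustration, and the inputs are the approximate speedups quoted above, not new measurements.

```python
def scaling_efficiency(speedup: float, num_gpus: int) -> float:
    """Fraction of ideal linear scaling achieved (1.0 means perfect scaling)."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    return speedup / num_gpus


# Approximate speedups reported in this post for 4 GPUs.
dp_efficiency = scaling_efficiency(2.0, 4)   # DP: roughly half of linear scaling
ddp_efficiency = scaling_efficiency(4.0, 4)  # DDP: roughly linear scaling

print(f"DP efficiency:  {dp_efficiency:.0%}")
print(f"DDP efficiency: {ddp_efficiency:.0%}")
```

By this measure DP reaches about 50% of ideal scaling on this machine and DDP about 100%.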

I have read the official documentation: https://pytorch.org/tutorials/intermediate/ddp_tutorial.html

I have a few questions:

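Under DP, is there a way to tell whether each GPU is bound to its own process, or whether all 4 GPUs are bound to a single process?
A colleague at my company who knows CUDA internals well told me that these two modes are essentially equivalent. Is that correct? If so, why are they equivalent? If not, where do they differ?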
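
One way to probe the first question empirically is to record the process ID and thread ID inside `forward` while the model is wrapped in `nn.DataParallel`. This is a minimal sketch, not a definitive diagnostic: the module `WhoAmI`, the global `SEEN`, and the helper `probe` are hypothetical names introduced here. It assumes only the documented behavior that `nn.DataParallel` is single-process and multi-threaded, while DDP runs one process per rank.

```python
import os
import threading

import torch
import torch.nn as nn

# (pid, thread id) pairs observed during forward calls.
SEEN = set()


class WhoAmI(nn.Module):
    """Toy module that records which process and thread run each forward call."""

    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(8, 2)

    def forward(self, x):
        SEEN.add((os.getpid(), threading.get_ident()))
        return self.linear(x)


def probe(batch_size=16):
    """Run one forward pass under DataParallel and report the pids and threads seen."""
    SEEN.clear()
    use_cuda = torch.cuda.is_available()
    model = WhoAmI()
    x = torch.randn(batch_size, 8)
    if use_cuda:
        model, x = model.cuda(), x.cuda()
    model = nn.DataParallel(model)  # uses all visible GPUs; plain call if there are none
    with torch.no_grad():
        out = model(x)
    pids = {pid for pid, _ in SEEN}
    threads = {tid for _, tid in SEEN}
    return out.shape, pids, threads


if __name__ == "__main__":
    shape, pids, threads = probe()
    print("output shape:", tuple(shape))
    print("process ids seen:", pids)
    print("thread ids seen:", threads)
```

On a multi-GPU machine this would be expected to show a single process ID and typically several thread IDs, since DP drives its per-GPU replicas from worker threads inside one Python process. Running the same `os.getpid()` check in a DDP script (for example one launched with `torchrun`) should instead print a different process ID per rank, which is one concrete difference between the two modes.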

No Comments Yet
