Fixing 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'

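I have recently been experimenting with large-model training, and even our company's four-GPU RTX 4090 machine ran out of GPU memory. So I tried DeepSpeed's zero3_offload, which failed with: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'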
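For context, here is a minimal sketch of what a ZeRO-3 configuration with CPU offload can look like. The original post does not show its config, so the batch size below is a placeholder and the whole dictionary is an assumption, not the setup actually used. As far as I understand, offloading the optimizer to the CPU is what brings DeepSpeedCPUAdam into play, and that optimizer needs a compiled C++ extension.

```python
import json

# Sketch of a ZeRO stage 3 config with optimizer and parameter offload to CPU.
# All values are illustrative assumptions, not the config from the post.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,  # placeholder value
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
}

# Serialize to JSON, the format DeepSpeed config files normally use.
config_json = json.dumps(ds_config, indent=2)
print(config_json)
```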

CUDA version check

DeepSpeed Op Builder: Installed CUDA version 12.3 does not match the version torch was compiled with 11.7, unable to compile cuda/cpp extensions without a matching cuda version.
...
AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'

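Judging from the message, the CUDA version installed on the system (12.3) does not match the CUDA version torch was compiled with (11.7).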
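The two versions in that message can also be read directly. The sketch below is only an illustration: the helper names are made up for this post, and it assumes `nvcc` is on the PATH and prints its usual "release X.Y" line. It is not the check DeepSpeed itself performs.

```python
import re
import subprocess
from typing import Optional


def parse_nvcc_release(nvcc_output: str) -> Optional[str]:
    """Extract 'major.minor' from the text printed by `nvcc --version`."""
    match = re.search(r"release\s+(\d+)\.(\d+)", nvcc_output)
    return f"{match.group(1)}.{match.group(2)}" if match else None


def installed_cuda_version() -> Optional[str]:
    """Return the toolkit version reported by nvcc, or None if nvcc is unavailable."""
    try:
        result = subprocess.run(
            ["nvcc", "--version"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return parse_nvcc_release(result.stdout)


def torch_cuda_version() -> Optional[str]:
    """Return the CUDA version torch was built with (None if torch is missing or CPU-only)."""
    try:
        import torch
    except ImportError:
        return None
    return torch.version.cuda


if __name__ == "__main__":
    print("installed CUDA:", installed_cuda_version())
    print("torch built with:", torch_cuda_version())
```

If the two printed values differ, the machine is in the same situation the error message describes.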

Following the article "DeepSpeed ZeRO-3 + offload error: AttributeError: "DeepSpeedCPUAdam" object has no attribute "ds_opt_adam"", I set an environment variable before running the code so that the version check is skipped:

export DS_SKIP_CUDA_CHECK=1
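When the training script is launched from an IDE or a notebook, where a shell `export` may not reach the process, the same variable can be set from Python. This is a sketch under the assumption that setting it before DeepSpeed is imported is early enough; the variable name is the one from the command above.

```python
import os

# Set the flag before DeepSpeed gets a chance to build or load its ops.
os.environ["DS_SKIP_CUDA_CHECK"] = "1"

try:
    import deepspeed  # noqa: F401  (imported only after the variable is set)
except ImportError:
    deepspeed = None  # DeepSpeed is not installed in this environment

print("DS_SKIP_CUDA_CHECK =", os.environ["DS_SKIP_CUDA_CHECK"])
```

Keep in mind that this only silences the check; as the DeepSpeed warning itself says, mixing CUDA versions may result in unexpected behavior.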

Installing ninja-build

With the variable set, I tried loading the op again:

python -c 'import deepspeed; deepspeed.ops.adam.cpu_adam.CPUAdamBuilder().load()'

The error output contained some Ninja-related messages, so I tried the following command:

[root@gpu4 MGM]# ninja -v
ninja: error: loading 'build.ninja': no such file or directory

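So I decided to reinstall ninja. (Strictly speaking, this message only says that the current directory has no build.ninja file; `ninja --version` is a more reliable check of the installation itself.) Reference: Installing ninja on CentOS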

yum install -y epel-release
yum install -y ninja-build
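To confirm that the Python process can actually find a working ninja after the install, a small check like the following may help. It is a sketch: `ninja_version` is a helper name invented here, and it assumes the executable is called `ninja` on the PATH.

```python
import shutil
import subprocess
from typing import Optional


def ninja_version() -> Optional[str]:
    """Return the output of `ninja --version`, or None if ninja is not usable."""
    path = shutil.which("ninja")
    if path is None:
        return None
    try:
        result = subprocess.run(
            [path, "--version"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


if __name__ == "__main__":
    version = ninja_version()
    print("ninja not found on PATH" if version is None else f"ninja {version}")
```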

Upgrading GCC

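Trying to load again, I was now told that the GCC version was too old and that version 5 or later is required. Reference: Upgrading the gcc version on CentOS 7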

yum install centos-release-scl
yum install devtoolset-8-gcc* # to install a different version, replace 8 with that version number
scl enable devtoolset-8 bash

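That does it, but the new gcc version only applies to the current session; a new session falls back to the original 4.8. To switch versions again later, just run: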

source /opt/rh/devtoolset-8/enable
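Because the switch is per session, it is easy to launch training from a shell that still uses gcc 4.8. The sketch below checks the active compiler before building; the helper names are hypothetical, and the minimum major version of 5 is taken from the error described above.

```python
import re
import subprocess
from typing import Optional, Tuple


def parse_gcc_version(text: str) -> Optional[Tuple[int, ...]]:
    """Turn a version string such as '8.3.1' or '8' into a tuple of ints."""
    match = re.match(r"\s*(\d+(?:\.\d+)*)", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def active_gcc_version() -> Optional[Tuple[int, ...]]:
    """Ask the gcc on the current PATH for its version; None if gcc cannot be run."""
    try:
        result = subprocess.run(
            ["gcc", "-dumpversion"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return parse_gcc_version(result.stdout)


def gcc_is_new_enough(version: Optional[Tuple[int, ...]], minimum_major: int = 5) -> bool:
    """True when a version was found and its major number meets the minimum."""
    return version is not None and version[0] >= minimum_major


if __name__ == "__main__":
    version = active_gcc_version()
    print("gcc version:", version, "| new enough:", gcc_is_new_enough(version))
```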

Problem solved

Trying to load once more, the op now builds and loads successfully:

[root@gpu4 MGM]# python -c 'import deepspeed; deepspeed.ops.adam.cpu_adam.CPUAdamBuilder().load()'
[2024-05-28 15:03:50,860] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
 [WARNING]  DeepSpeed Op Builder: Installed CUDA version 12.3 does not match the version torch was compiled with 11.7.Detected `DS_SKIP_CUDA_CHECK=1`: Allowing this combination of CUDA, but it may result in unexpected behavior.
Using /root/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py310_cu117/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 0.4856259822845459 seconds

References

DeepSpeed ZeRO-3 + offload error: AttributeError: "DeepSpeedCPUAdam" object has no attribute "ds_opt_adam"

Installing ninja on CentOS

Upgrading the gcc version on CentOS 7

Permalink: https://blog.kimi360.top/fea046133020/

This article is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0).