PyTorch Pitfall Notes

2019-09-21

RuntimeError: cuda runtime error (4) : unspecified launch failure

The following error occurred while running a segmentation task.

THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1525796793591/work/torch/lib/THC/generic/THCStorage.cu line=58 error=4 : unspecified launch failure
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1525796793591/work/torch/lib/THC/generic/THCStorage.cu line=58 error=4 : unspecified launch failure
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1525796793591/work/torch/lib/THC/generic/THCStorage.cu line=58 error=4 : unspecified launch failure
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1525796793591/work/torch/lib/THC/generic/THCStorage.cu line=58 error=4 : unspecified launch failure
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1525796793591/work/torch/lib/THC/generic/THCStorage.cu line=58 error=4 : unspecified launch failure
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1525796793591/work/torch/lib/THC/generic/THCStorage.cu line=58 error=4 : unspecified launch failure
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1525796793591/work/torch/lib/THC/generic/THCStorage.cu line=58 error=4 : unspecified launch failure
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1525796793591/work/torch/lib/THC/generic/THCStorage.cu line=58 error=4 : unspecified launch failure
INFO train_net_step_amodalinmodalseg.py: 442: Save ckpt on exception ...
Traceback (most recent call last):
  File "tools/train_net_step_amodalinmodalseg.py", line 427, in main
    loss.backward()
  File "/home/wanggh/miniconda3/envs/pytorch0.3/lib/python3.5/site-packages/torch/autograd/variable.py", line 167, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
  File "/home/wanggh/miniconda3/envs/pytorch0.3/lib/python3.5/site-packages/torch/autograd/__init__.py", line 99, in backward
    variables, grad_variables, retain_graph)
RuntimeError: cuda runtime error (4) : unspecified launch failure at /opt/conda/conda-bld/pytorch_1525796793591/work/torch/lib/THC/generic/THCStorage.cu:58

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "tools/train_net_step_amodalinmodalseg.py", line 454, in <module>
    main()
  File "tools/train_net_step_amodalinmodalseg.py", line 443, in main
    save_ckpt(output_dir, args, step, train_size, maskRCNN, optimizer, logger)
  File "tools/train_net_step_amodalinmodalseg.py", line 132, in save_ckpt
    'optimizer': optimizer.state_dict()}, save_name)
  File "/home/wanggh/miniconda3/envs/pytorch0.3/lib/python3.5/site-packages/torch/serialization.py", line 135, in save
    return _with_file_like(f, "wb", lambda f: _save(obj, f, pickle_module, pickle_protocol))
  File "/home/wanggh/miniconda3/envs/pytorch0.3/lib/python3.5/site-packages/torch/serialization.py", line 117, in _with_file_like
    return body(f)
  File "/home/wanggh/miniconda3/envs/pytorch0.3/lib/python3.5/site-packages/torch/serialization.py", line 135, in <lambda>
    return _with_file_like(f, "wb", lambda f: _save(obj, f, pickle_module, pickle_protocol))
  File "/home/wanggh/miniconda3/envs/pytorch0.3/lib/python3.5/site-packages/torch/serialization.py", line 204, in _save
    serialized_storages[key]._write_file(f)
RuntimeError: cuda runtime error (4) : unspecified launch failure at /opt/conda/conda-bld/pytorch_1525796793591/work/torch/csrc/generic/serialization.cpp:38
terminate called without an active exception
Aborted (core dumped)

Analysis:

INFO train_net_step_amodalinmodalseg.py: 442: Save ckpt on exception ...
Traceback (most recent call last):
  File "tools/train_net_step_amodalinmodalseg.py", line 427, in main
    loss.backward()
  File "/home/wanggh/miniconda3/envs/pytorch0.3/lib/python3.5/site-packages/torch/autograd/variable.py", line 167, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
  File "/home/wanggh/miniconda3/envs/pytorch0.3/lib/python3.5/site-packages/torch/autograd/__init__.py", line 99, in backward
    variables, grad_variables, retain_graph)
RuntimeError: cuda runtime error (4) : unspecified launch failure at /opt/conda/conda-bld/pytorch_1525796793591/work/torch/lib/THC/generic/THCStorage.cu:58

This corresponds to the following part of the source code:

try:
    ...  # rest of the training loop omitted in this excerpt
    loss.backward()
except (RuntimeError, KeyboardInterrupt):
    del dataiterator
    logger.info('Save ckpt on exception ...')
    save_ckpt(output_dir, args, step, train_size, maskRCNN, optimizer, logger)
    logger.info('Save ckpt done.')
    stack_trace = traceback.format_exc()
    print(stack_trace)
finally:
    if args.use_tfboard and not args.no_save:
        tblogger.close()

An exception is raised while executing loss.backward(), so the save_ckpt call in the except block runs. save_ckpt then raises a second exception of its own.
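
The two stacked tracebacks are ordinary Python exception chaining: when code inside an except block raises, the interpreter reports both errors, joined by "During handling of the above exception, another exception occurred". The sketch below reproduces that pattern without a GPU. The functions failing_backward and failing_save are hypothetical stand-ins for loss.backward() and save_ckpt, and guarding the save with an inner try/except is only one possible way to keep the original error visible, not something the training script does.

```python
import traceback


def failing_backward():
    # Hypothetical stand-in for loss.backward() failing on the GPU.
    raise RuntimeError("cuda runtime error (4) : unspecified launch failure")


def failing_save():
    # Hypothetical stand-in for save_ckpt(): once the CUDA context is broken,
    # copying tensors off the device fails as well.
    raise RuntimeError("cuda runtime error (4) : failure while saving")


def train_step_guarded():
    """Run one step and return the formatted tracebacks that were captured."""
    captured = []
    try:
        failing_backward()
    except RuntimeError:
        # Record the original error first, before attempting anything else.
        captured.append(traceback.format_exc())
        try:
            failing_save()
        except RuntimeError:
            # The save failed too; record it instead of letting it escape.
            captured.append(traceback.format_exc())
    return captured


if __name__ == "__main__":
    for trace in train_step_guarded():
        print(trace)
```

The second captured traceback contains both errors, which mirrors the structure of the log above.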

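The meaning of the 4 in cuda runtime error (4) can be looked up in either of two CUDA documentation pages. Their descriptions of this code are quoted below, in that order; the second is the one whose wording matches the "unspecified launch failure" message in the log.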

This indicates that a CUDA Runtime API call cannot be executed because it is being called during process shut down, at a point in time after CUDA driver has been unloaded.

An exception occurred on the device while executing a kernel. Common causes include dereferencing an invalid device pointer and accessing out of bounds shared memory. The device cannot be used until cudaThreadExit() is called. All existing device memory allocations are invalid and must be reconstructed if the program is to continue using CUDA.
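
The second description names out-of-bounds access as a common cause. In a segmentation task, one way this can happen (an assumption here, not something the log confirms) is a label value outside the range the loss expects. The helper below is a hypothetical, framework-free sketch of such a sanity check; find_out_of_range_labels, num_classes and ignore_index are names introduced for illustration only.

```python
def find_out_of_range_labels(labels, num_classes, ignore_index=None):
    """Return the sorted label values that fall outside [0, num_classes)."""
    bad = set()
    for row in labels:
        for value in row:
            # Skip the designated "ignore" value, if one is configured.
            if ignore_index is not None and value == ignore_index:
                continue
            if not 0 <= value < num_classes:
                bad.add(value)
    return sorted(bad)


# Toy 2x3 label mask: 255 plays the role of an "ignore" value, 3 is invalid
# when there are only 3 classes (valid ids are 0, 1, 2).
mask = [
    [0, 1, 1],
    [2, 255, 3],
]
print(find_out_of_range_labels(mask, num_classes=3, ignore_index=255))  # [3]
print(find_out_of_range_labels(mask, num_classes=3))  # [3, 255]
```

Running a check like this on the CPU side of the data pipeline costs little and turns a vague device fault into a concrete list of offending values.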

Remaining question:

What file is /opt/conda/conda-bld/pytorch_1525796793591/work/torch/lib/THC/generic/THCStorage.cu? There is no conda directory under /opt at all.

RuntimeError: merge_sort: failed to synchronize: unspecified launch failure

After upgrading the code to PyTorch 0.4, a new problem appeared. This time the error points directly at merge_sort.

INFO train_net_step_amodalinmodalseg.py: 443: Save ckpt on exception ...
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1535491974311/work/torch/csrc/generic/serialization.cpp line=17 error=4 : unspecified launch failure
Traceback (most recent call last):
  File "tools/train_net_step_amodalinmodalseg.py", line 428, in main
    loss.backward()
  File "/home/wanggh/miniconda3/envs/pytorch0.4/lib/python3.6/site-packages/torch/tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/wanggh/miniconda3/envs/pytorch0.4/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: merge_sort: failed to synchronize: unspecified launch failure
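
The phrase "failed to synchronize" suggests the failure was noticed when the sort waited for the GPU. CUDA kernels are launched asynchronously, so the kernel that actually faulted may be an earlier one, and merge_sort may simply be where the error surfaced. One common way to narrow this down is the CUDA_LAUNCH_BLOCKING environment variable, which makes kernel launches synchronous so that the Python traceback points closer to the failing operation. A minimal sketch, assuming it is placed at the very top of the training script:

```python
import os

# Must be set before the CUDA context is created, so set it before importing
# torch (or export it in the shell that launches the training script).
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import torch only after the variable has been set

print("CUDA_LAUNCH_BLOCKING =", os.environ["CUDA_LAUNCH_BLOCKING"])
```

The same effect can be had from the shell by prefixing the usual launch command with CUDA_LAUNCH_BLOCKING=1. Training runs more slowly in this mode, so it is meant for debugging only.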
