RuntimeError: cuda runtime error (4) : unspecified launch failure
The following error occurred while running a segmentation task.
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1525796793591/work/torch/lib/THC/generic/THCStorage.cu line=58 error=4 : unspecified launch failure
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1525796793591/work/torch/lib/THC/generic/THCStorage.cu line=58 error=4 : unspecified launch failure
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1525796793591/work/torch/lib/THC/generic/THCStorage.cu line=58 error=4 : unspecified launch failure
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1525796793591/work/torch/lib/THC/generic/THCStorage.cu line=58 error=4 : unspecified launch failure
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1525796793591/work/torch/lib/THC/generic/THCStorage.cu line=58 error=4 : unspecified launch failure
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1525796793591/work/torch/lib/THC/generic/THCStorage.cu line=58 error=4 : unspecified launch failure
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1525796793591/work/torch/lib/THC/generic/THCStorage.cu line=58 error=4 : unspecified launch failure
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1525796793591/work/torch/lib/THC/generic/THCStorage.cu line=58 error=4 : unspecified launch failure
INFO train_net_step_amodalinmodalseg.py: 442: Save ckpt on exception ...
Traceback (most recent call last):
File "tools/train_net_step_amodalinmodalseg.py", line 427, in main
loss.backward()
File "/home/wanggh/miniconda3/envs/pytorch0.3/lib/python3.5/site-packages/torch/autograd/variable.py", line 167, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
File "/home/wanggh/miniconda3/envs/pytorch0.3/lib/python3.5/site-packages/torch/autograd/__init__.py", line 99, in backward
variables, grad_variables, retain_graph)
RuntimeError: cuda runtime error (4) : unspecified launch failure at /opt/conda/conda-bld/pytorch_1525796793591/work/torch/lib/THC/generic/THCStorage.cu:58
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "tools/train_net_step_amodalinmodalseg.py", line 454, in <module>
main()
File "tools/train_net_step_amodalinmodalseg.py", line 443, in main
save_ckpt(output_dir, args, step, train_size, maskRCNN, optimizer, logger)
File "tools/train_net_step_amodalinmodalseg.py", line 132, in save_ckpt
'optimizer': optimizer.state_dict()}, save_name)
File "/home/wanggh/miniconda3/envs/pytorch0.3/lib/python3.5/site-packages/torch/serialization.py", line 135, in save
return _with_file_like(f, "wb", lambda f: _save(obj, f, pickle_module, pickle_protocol))
File "/home/wanggh/miniconda3/envs/pytorch0.3/lib/python3.5/site-packages/torch/serialization.py", line 117, in _with_file_like
return body(f)
File "/home/wanggh/miniconda3/envs/pytorch0.3/lib/python3.5/site-packages/torch/serialization.py", line 135, in <lambda>
return _with_file_like(f, "wb", lambda f: _save(obj, f, pickle_module, pickle_protocol))
File "/home/wanggh/miniconda3/envs/pytorch0.3/lib/python3.5/site-packages/torch/serialization.py", line 204, in _save
serialized_storages[key]._write_file(f)
RuntimeError: cuda runtime error (4) : unspecified launch failure at /opt/conda/conda-bld/pytorch_1525796793591/work/torch/csrc/generic/serialization.cpp:38
terminate called without an active exception
Aborted (core dumped)
Analysis:
INFO train_net_step_amodalinmodalseg.py: 442: Save ckpt on exception ...
Traceback (most recent call last):
File "tools/train_net_step_amodalinmodalseg.py", line 427, in main
loss.backward()
File "/home/wanggh/miniconda3/envs/pytorch0.3/lib/python3.5/site-packages/torch/autograd/variable.py", line 167, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
File "/home/wanggh/miniconda3/envs/pytorch0.3/lib/python3.5/site-packages/torch/autograd/__init__.py", line 99, in backward
variables, grad_variables, retain_graph)
RuntimeError: cuda runtime error (4) : unspecified launch failure at /opt/conda/conda-bld/pytorch_1525796793591/work/torch/lib/THC/generic/THCStorage.cu:58
This corresponds to the following part of the source code:
try:
    # ... training loop elided ...
    loss.backward()
except (RuntimeError, KeyboardInterrupt):
    # On any runtime error or Ctrl-C, try to save a checkpoint before exiting.
    del dataiterator
    logger.info('Save ckpt on exception ...')
    save_ckpt(output_dir, args, step, train_size, maskRCNN, optimizer, logger)
    logger.info('Save ckpt done.')
    stack_trace = traceback.format_exc()
    print(stack_trace)
finally:
    if args.use_tfboard and not args.no_save:
        tblogger.close()
An exception is raised while executing loss.backward(), so the except branch runs save_ckpt. Executing save_ckpt then raises a second exception.
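The double failure described above can be made easier to read by capturing the first traceback before attempting the save, and by guarding the save itself. The following is only a minimal sketch, not the training script's actual code: run_step_with_guarded_save, train_step and save_ckpt are hypothetical names standing in for the real training step and checkpoint function.

```python
import traceback


def run_step_with_guarded_save(train_step, save_ckpt):
    """Run one step; on failure, try to save without masking the original error.

    train_step and save_ckpt are hypothetical zero-argument callables.
    Returns None on success, otherwise a dict describing what happened.
    """
    try:
        train_step()
        return None
    except (RuntimeError, KeyboardInterrupt):
        # Capture the original traceback first, before anything else can fail.
        original_trace = traceback.format_exc()
        try:
            save_ckpt()
            saved = True
        except RuntimeError:
            # After a CUDA launch failure the device may be unusable, so
            # copying tensors off the GPU for saving can fail as well.
            saved = False
        return {"saved": saved, "trace": original_trace}
```

With this shape, the first error stays visible even when the checkpoint cannot be written, instead of being buried under "During handling of the above exception, another exception occurred".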
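What the 4 in cuda runtime error (4) means can be looked up in the CUDA documentation. The two pages consulted describe it as follows, respectively (the message text "unspecified launch failure" matches the second one):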
This indicates that a CUDA Runtime API call cannot be executed because it is being called during process shut down, at a point in time after CUDA driver has been unloaded.
An exception occurred on the device while executing a kernel. Common causes include dereferencing an invalid device pointer and accessing out of bounds shared memory. The device cannot be used until cudaThreadExit() is called. All existing device memory allocations are invalid and must be reconstructed if the program is to continue using CUDA.
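Because kernel launches are asynchronous, the Python line named in the traceback is not necessarily the operation whose kernel failed. One common way to narrow this down is to set the CUDA_LAUNCH_BLOCKING environment variable so that launches run synchronously. The sketch below assumes it is placed at the very top of the training script, before torch is imported; the placement is an assumption about this script, not something taken from it.

```python
import os

# Assumption: this runs before CUDA is initialized, i.e. before `import torch`.
# With CUDA_LAUNCH_BLOCKING=1 each kernel launch blocks until it finishes,
# so a failing kernel is reported at the call that launched it.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # the real script would import torch only after this point
```

The same variable can also be set in the shell when launching the script. Training runs noticeably slower in this mode, so it is meant for debugging only.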
Remaining question:
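What file is /opt/conda/conda-bld/pytorch_1525796793591/work/torch/lib/THC/generic/THCStorage.cu? There is no conda directory under /opt at all. Most likely it is the source path on the machine where the conda package was built, compiled into the binary, rather than a file on this machine.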
RuntimeError: merge_sort: failed to synchronize: unspecified launch failure
After upgrading the code to PyTorch 0.4, a new problem appeared. This time the error points directly at merge_sort.
INFO train_net_step_amodalinmodalseg.py: 443: Save ckpt on exception ...
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1535491974311/work/torch/csrc/generic/serialization.cpp line=17 error=4 : unspecified launch failure
Traceback (most recent call last):
File "tools/train_net_step_amodalinmodalseg.py", line 428, in main
loss.backward()
File "/home/wanggh/miniconda3/envs/pytorch0.4/lib/python3.6/site-packages/torch/tensor.py", line 93, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/wanggh/miniconda3/envs/pytorch0.4/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: merge_sort: failed to synchronize: unspecified launch failure
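The "failed to synchronize" wording suggests that merge_sort may only be the place where an earlier asynchronous failure was noticed, not necessarily its origin. One way to localize it is to synchronize after each stage of a step and see which stage first reports an error. This is a hedged sketch: run_stages and the stage names are hypothetical, and in a real script the synchronize argument would be torch.cuda.synchronize.

```python
def run_stages(stages, synchronize):
    """Run named stages in order, synchronizing after each one.

    stages: list of (name, zero-argument callable) pairs (hypothetical).
    synchronize: callable that blocks until queued GPU work has finished.
    Returns the name of the first stage whose work fails, or None.
    """
    for name, stage in stages:
        try:
            stage()
            synchronize()  # surfaces any asynchronous error from this stage
        except RuntimeError:
            return name
    return None
```

For example, passing stages such as ("forward", ...), ("loss", ...), ("backward", ...) would return the first stage after which the device reports a failure.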