'THCudaCheck FAIL' Using Cuda7.5 Docker Image #1

Open
spadavec opened this issue Jul 7, 2016 · 15 comments

@spadavec

spadavec commented Jul 7, 2016

After installing NVIDIA Docker and starting the torch-rnn Docker image via:

nvidia-docker run --rm -ti crisbal/torch-rnn:cuda7.5 bash

and preprocessing via

root@3da15ad69af8:~/torch-rnn# python scripts/preprocess.py --input_txt data/library.txt --output_h5 data/library.h5 --output_json data/library.json

Attempting to train the system results in the following:

root@3da15ad69af8:~/torch-rnn# th train.lua -input_h5 data/library.h5 -input_json data/library.json
Running with CUDA on GPU 0
THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-9234/cutorch/lib/THC/THCGeneral.c line=608 error=8 : invalid device function
/root/torch/install/bin/luajit: /root/torch/install/share/lua/5.1/nn/Container.lua:67:
In 2 module of nn.Sequential:
./LSTM.lua:128: cuda runtime error (8) : invalid device function at /tmp/luarocks_cutorch-scm-1-9234/cutorch/lib/THC/THCGeneral.c:608
stack traceback:
[C]: in function 'resize'
./LSTM.lua:128: in function <./LSTM.lua:118>
[C]: in function 'xpcall'
/root/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
/root/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
train.lua:130: in function 'opfunc'
/root/torch/install/share/lua/5.1/optim/adam.lua:33: in function 'adam'
train.lua:187: in main chunk
[C]: in function 'dofile'
/root/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00406670

WARNING: If you see a stack trace below, it doesn't point to the place where this error occured. Please use only the one above.
stack traceback:
[C]: in function 'error'
/root/torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
/root/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
train.lua:130: in function 'opfunc'
/root/torch/install/share/lua/5.1/optim/adam.lua:33: in function 'adam'
train.lua:187: in main chunk
[C]: in function 'dofile'
/root/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00406670

@crisbal
Owner

crisbal commented Jul 8, 2016

I think this is an issue that one needs to report to the main torch-rnn repo (https://github.com/jcjohnson/torch-rnn) and not on this one.

First of all, are you sure you are running a CUDA-capable video card?
If so, let's try something: what happens if you run nvidia-smi inside the container?
Does it show any relevant info?

@spadavec
Author

@crisbal thanks for the heads up. I will post this to the torch-rnn repo instead. For what it's worth, I do have a GPU installed:

root@9be35619d034:~/torch-rnn# nvidia-smi
Mon Jul 11 19:17:26 2016
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.27                 Driver Version: 367.27                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 0000:01:00.0      On |                  N/A |
| 28%   41C    P8     7W / 180W |    725MiB /  8113MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

@crisbal
Owner

crisbal commented Jul 11, 2016

Let me know if in the end it is my fault or theirs :)

One random thought I had: since you have a 1080, maybe it uses a newer kind of CUDA that is not well supported by either nvidia-docker or torch.

@spadavec
Author

@crisbal it looks like the issue is that a newer version of CUDA is needed:

jcjohnson/torch-rnn#122

Do you have any plans to make a CUDA 8 version of the Docker image? Thanks for all the work you've done!
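
To illustrate the diagnosis: "invalid device function" typically means the compiled kernels contain no code for the GPU's architecture. The GTX 1080 is a Pascal card (compute capability 6.1), which the CUDA toolkit supports natively from 8.0 onward. A minimal sketch of that check follows; the lookup table is deliberately partial, and `toolkit_supports` is a hypothetical helper, not part of torch or CUDA.

```python
# Partial, illustrative table: first CUDA toolkit release with native
# support for a given compute capability. Only Pascal is listed here.
MIN_TOOLKIT = {
    (6, 0): (8, 0),  # Pascal (e.g. Tesla P100)
    (6, 1): (8, 0),  # Pascal (e.g. GeForce GTX 1080)
}


def toolkit_supports(compute_capability, toolkit_version):
    """Return True if the toolkit can target the given compute capability.

    Raises KeyError for capabilities missing from the partial table above.
    """
    return toolkit_version >= MIN_TOOLKIT[compute_capability]


gtx_1080 = (6, 1)
print(toolkit_supports(gtx_1080, (7, 5)))  # the cuda7.5 image from this issue
print(toolkit_supports(gtx_1080, (8, 0)))  # a CUDA 8 image
```

Versions are compared as tuples so that, for example, (7, 5) sorts below (8, 0) without any string parsing.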

@crisbal
Owner

crisbal commented Jul 12, 2016

As soon as I get my hands on a CUDA machine and a fast Internet connection, I will.
Sorry I can't do it ASAP.

@xoryouyou

@spadavec I had the same issue and built this today: https://hub.docker.com/r/xoryouyou/torch-rnn/

@HandsomeDevilv112

I got this error today as I'm using a 1080 and have CUDA 8 installed.
@xoryouyou, I tried the command on the page you posted, but I'm getting an error:

docker pull xoryouyou/torch-rnn
Using default tag: latest
Error response from daemon: manifest for xoryouyou/torch-rnn:latest not found

@xoryouyou

@HandsomeDevilv112 yeah, the image was only tagged as 1.0 and not latest. I updated it.
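
For context, the pull failed because a reference without a tag is resolved as `latest`, as the "Using default tag: latest" line shows. A rough sketch of that rule, assuming a simplified reference format (`with_default_tag` is a hypothetical helper, not a Docker API):

```python
def with_default_tag(image):
    """Return the image reference, appending ':latest' if no tag or digest is given."""
    if "@" in image:
        # A digest reference (name@sha256:...) pins the image already.
        return image
    # Only the last path component can carry a tag; a colon earlier in the
    # reference belongs to a registry port such as localhost:5000.
    last_component = image.rsplit("/", 1)[-1]
    if ":" in last_component:
        return image
    return image + ":latest"


print(with_default_tag("xoryouyou/torch-rnn"))
print(with_default_tag("xoryouyou/torch-rnn:1.0"))
```

Pulling with an explicit tag, here the 1.0 tag mentioned above, avoids depending on `latest` existing at all.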

@HandsomeDevilv112

@xoryouyou: Cool! Much obliged. That seems to have done the trick. My apologies if there was a way for me to fix that myself and I just didn't catch it.

@valentinvieriu

@xoryouyou Do you think you could share the Dockerfile as well? I want to have a look at how you built your image.
I'm trying to use https://github.com/crisbal/docker-torch-rnn/blob/master/CUDA/8.0/Dockerfile but it does not build.
It fails at this section:

RUN git clone https://github.com/jcjohnson/torch-rnn && \
    pip install -r torch-rnn/requirements.txt

@xoryouyou

@valentinvieriu sorry, I currently don't have access to the machine where I built the torch-rnn image, but I'll see if I can reproduce your issue.

@valentinvieriu

This is the issue that pops out:
'''
copying h5py/tests/hl/test_file.py -> build/lib.linux-x86_64-2.7/h5py/tests/hl
running build_ext
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/tmp/pip-build-MyYa9y/h5py/setup.py", line 140, in <module>
    cmdclass = CMDCLASS,
  File "/usr/lib/python2.7/distutils/core.py", line 151, in setup
    dist.run_commands()
  File "/usr/lib/python2.7/distutils/dist.py", line 953, in run_commands
    self.run_command(cmd)
  File "/usr/lib/python2.7/distutils/dist.py", line 972, in run_command
    cmd_obj.run()
  File "/usr/lib/python2.7/dist-packages/wheel/bdist_wheel.py", line 179, in run
    self.run_command('build')
  File "/usr/lib/python2.7/distutils/cmd.py", line 326, in run_command
    self.distribution.run_command(command)
  File "/usr/lib/python2.7/distutils/dist.py", line 972, in run_command
    cmd_obj.run()
  File "/usr/lib/python2.7/distutils/command/build.py", line 128, in run
    self.run_command(cmd_name)
  File "/usr/lib/python2.7/distutils/cmd.py", line 326, in run_command
    self.distribution.run_command(command)
  File "/usr/lib/python2.7/distutils/dist.py", line 972, in run_command
    cmd_obj.run()
  File "/tmp/pip-build-MyYa9y/h5py/setup_build.py", line 140, in run
    from Cython.Build import cythonize
ImportError: No module named Cython.Build
'''
This is with https://github.com/crisbal/docker-torch-rnn/blob/master/CUDA/8.0/Dockerfile

As I said, it fails at the

RUN git clone https://github.com/jcjohnson/torch-rnn && \
    pip install -r torch-rnn/requirements.txt

section

Any help is appreciated. I'm not very familiar with the dependencies; I only plan to use this as a tool.

Thank you @xoryouyou

@valentinvieriu

OK, for future reference, this fixed the build issue on Ubuntu 16.04.
Replace

RUN git clone https://github.com/jcjohnson/torch-rnn && \
    pip install -r torch-rnn/requirements.txt

from https://github.com/crisbal/docker-torch-rnn/blob/master/CUDA/8.0/Dockerfile
with

# torch-rnn and Python requirements
# we use https://github.com/jcjohnson/torch-rnn/blob/master/requirements.txt as a guideline
WORKDIR /root
RUN apt-get install -y cython
RUN pip install --upgrade pip
RUN pip install Cython==0.23.4
RUN pip install numpy==1.10.4
RUN pip install argparse==1.2.1
RUN HDF5_DIR=/usr/lib/x86_64-linux-gnu/hdf5/serial/ pip install h5py==2.5.0
RUN pip install six==1.10.0
RUN git clone https://github.com/jcjohnson/torch-rnn

I will work on a Docker image and share it here when it's finished.
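
The reason the reordering helps, going by the traceback above: the h5py 2.5.0 build imports Cython, so Cython has to be installed before pip reaches h5py. A small sketch of that ordering rule, assuming Cython and numpy are the only build-time dependencies (`install_order` and `BUILD_FIRST` are hypothetical names used for illustration):

```python
# Packages that must already be installed when h5py is built from source.
# Assumption for this sketch: Cython and numpy are the only such packages.
BUILD_FIRST = ("cython", "numpy")


def install_order(requirements):
    """Return the pinned requirements with build-time dependencies first."""
    def rank(req):
        name = req.split("==")[0].strip().lower()
        if name in BUILD_FIRST:
            return BUILD_FIRST.index(name)
        return len(BUILD_FIRST)
    # sorted() is stable, so the remaining packages keep their original order.
    return sorted(requirements, key=rank)


pins = [
    "argparse==1.2.1",
    "h5py==2.5.0",
    "numpy==1.10.4",
    "Cython==0.23.4",
    "six==1.10.0",
]
for req in install_order(pins):
    print("pip install " + req)
```

Installing the pins one command at a time, as in the Dockerfile snippet above, makes this order explicit instead of leaving it to a single `pip install -r` run.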

@xoryouyou

@valentinvieriu I am currently building with the crisbal/docker-torch-rnn image on Arch and it looks like it builds fine. Will report when done.

@xoryouyou

Built on Linux 4.12.8-2-ARCH #1 SMP PREEMPT Fri Aug 18 14:08:02 UTC 2017 x86_64 GNU/Linux with Docker version 17.06.0-ce, build 3dfb8343
build_log.txt
