CUDA

Installing CUDA

Refer to NVIDIA's official installation guide.

Training Models

Machine learning packages iterate quickly, so code often fails to run because of mismatched package and driver versions. We recommend using Docker to avoid these problems. Official prebuilt containers are available on the TensorFlow and NVIDIA websites.

Taking TensorFlow as an example, when you need to fine-tune a BERT model using TensorFlow v1.11, you can follow these steps:

  1. Check Docker version
docker --version

The following steps assume the reported version is 20.10.
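
The version matters because the --gpus option used in the next step was introduced in Docker 19.03. As a minimal sketch, the check can be automated by parsing the text printed by docker --version. The function names and the sample version strings are illustrative assumptions, not part of any Docker tooling.

```python
import re


def parse_docker_version(output):
    """Extract (major, minor) from text such as 'Docker version 20.10.7, build f0df350'."""
    match = re.search(r"(\d+)\.(\d+)", output)
    if match is None:
        raise ValueError("no version number found in: %r" % output)
    return int(match.group(1)), int(match.group(2))


def supports_gpus_flag(output):
    """Return True if the reported Docker version is 19.03 or newer."""
    # The --gpus option first appeared in Docker 19.03.
    return parse_docker_version(output) >= (19, 3)
```

To use it, pass in the standard output of docker --version, for example as captured with subprocess.run(["docker", "--version"], capture_output=True, text=True).stdout.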

Permission denied

This error usually means the administrator has not added you to the docker user group. To fix it:

  1. The administrator creates the docker user group if it does not already exist: sudo groupadd docker
  2. The administrator changes the group of docker.sock to docker: sudo chgrp docker /var/run/docker.sock
  3. The administrator adds the user to the docker user group: sudo usermod -aG docker <username>
  4. The user logs out and back in (or closes and reopens the terminal session) so that the new group membership takes effect
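
After logging back in, you can confirm that the permission fix took effect before retrying Docker. The sketch below only checks whether the current user can read and write the socket file. It assumes the default socket path /var/run/docker.sock from the steps above, and the function name is hypothetical.

```python
import os


def socket_is_accessible(path="/var/run/docker.sock"):
    """Return True if the current user may read and write the Docker socket."""
    # A missing socket usually means the Docker daemon is not running.
    return os.path.exists(path) and os.access(path, os.R_OK | os.W_OK)
```

If this returns False although the socket exists, the group change has probably not been picked up by the current session yet.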
  2. Download and run the official compiled image

TensorFlow official image (recommended for more flexibility in versions):

docker run --gpus all -it --rm docker.io/tensorflow/tensorflow:1.11.0-gpu-py3 bash

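Replace 1.11.0 with the version you need. You can also run a script inside a container directly from the host shell, without opening an interactive session. The example below uses the default (CPU-only) image; add --gpus all and a -gpu image tag if the script needs the GPU: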

docker run -it --rm -v $PWD:/tmp -w /tmp tensorflow/tensorflow python ./script.py
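
If you launch such runs often, it can help to assemble the command in one place. The following is a minimal sketch that builds the same argument list as the command above; the function name and its parameters are illustrative assumptions, and only options already shown in this section are used.

```python
import os


def build_docker_run(image, script, gpu=False, host_dir=None, mount_point="/tmp"):
    """Return the argument list for running `script` in a throwaway container."""
    host_dir = host_dir or os.getcwd()
    # -it needs a terminal; --rm removes the container when the script exits.
    cmd = ["docker", "run", "-it", "--rm"]
    if gpu:
        cmd += ["--gpus", "all"]
    # Mount the host directory and make it the working directory in the container.
    cmd += ["-v", "%s:%s" % (host_dir, mount_point), "-w", mount_point]
    cmd += [image, "python", script]
    return cmd
```

The returned list can be passed to subprocess.run from a terminal session, for example build_docker_run("tensorflow/tensorflow:1.11.0-gpu-py3", "./script.py", gpu=True).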

You can also use the official NVIDIA image:

docker run --gpus all -it --rm nvcr.io/nvidia/tensorflow:20.10-tf1-py3

docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]]

This error usually means that the NVIDIA Container Toolkit is missing or the GPU driver is not installed properly. The cause can be more involved, so ask the administrator to resolve it.
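
Before contacting the administrator, a quick look at which NVIDIA tools are on your PATH can narrow the problem down. This is only a heuristic sketch: it assumes that nvidia-smi is installed together with the driver and that nvidia-container-cli is installed together with the NVIDIA container tooling, which may differ between distributions. The function name is hypothetical.

```python
import shutil


def gpu_tooling_report():
    """Report which NVIDIA command-line tools are found on PATH (heuristic only)."""
    tools = ["nvidia-smi", "nvidia-container-cli"]
    # shutil.which returns None when the executable cannot be found.
    return {tool: shutil.which(tool) is not None for tool in tools}
```

A False entry for nvidia-smi points towards the driver, while a False entry for nvidia-container-cli points towards the container tooling. Neither result is conclusive on its own.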

  3. Check GPU availability in Python
import tensorflow as tf
print(tf.test.is_gpu_available())
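
The call above suits TensorFlow 1.11; in TensorFlow 2.x, tf.test.is_gpu_available is deprecated in favour of tf.config.list_physical_devices. Since you may swap the image version, here is a sketch of a check that tries the newer call first and falls back to the older one. The function name is an assumption made for illustration.

```python
def gpu_available():
    """Return True/False for GPU visibility, or None if TensorFlow is not installed."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    config = getattr(tf, "config", None)
    if config is not None and hasattr(config, "list_physical_devices"):
        # Available in TensorFlow 2.1 and later.
        return len(config.list_physical_devices("GPU")) > 0
    # Older releases such as 1.11 only offer this call.
    return bool(tf.test.is_gpu_available())
```

Inside the container, print(gpu_available()) should report True when the GPU is visible to TensorFlow.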

Other operations:

  • After entering the container, you can detach from it without stopping it by pressing Ctrl+P followed by Ctrl+Q.
  • docker ps lists the currently running containers.
  • docker attach <container id> returns you to the corresponding container.
  • Copying files to the container: docker cp file <container id>:/path
  • Copying files from the container: docker cp <container id>:/path/file /path
  • Stopping a container: docker kill <container id>. Containers started with --rm are removed automatically once stopped; otherwise, delete them with docker rm <container id>.
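
When several containers are running, finding the right container id by eye gets tedious. The sketch below parses the output of docker ps --format '{{.ID}}|{{.Image}}|{{.Status}}' into dictionaries. The "|" separator, the function name, and the sample id in the usage note are assumptions chosen for illustration.

```python
def parse_docker_ps(output):
    """Parse lines of the form 'ID|IMAGE|STATUS' into a list of dictionaries."""
    containers = []
    for line in output.splitlines():
        if not line.strip():
            continue  # skip blank lines
        container_id, image, status = line.split("|", 2)
        containers.append({"id": container_id, "image": image, "status": status})
    return containers
```

For example, a line such as 3f2a1b4c5d6e|tensorflow/tensorflow:1.11.0-gpu-py3|Up 2 hours yields one entry whose "id" value can be passed to docker attach or docker cp.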

Last update: September 16, 2023