Stern School of Business, New York University

Running Singularity containers under Slurm

Containers:

Containers are a way of using pre-built images that contain a particular application. Unlike virtual machines, they rely on an underlying system to run them. On the plus side, they consume fewer resources than complete virtual machines. The most common container system is “docker”. There are thousands of docker images available for specific applications, like GPU processing, web sites, WordPress, … However, docker containers generally have to run with “root” permissions, so they are most appropriate in a business application environment and not in a shared environment such as an HPC cluster.

To avoid this problem, a slightly different container system, singularity, has been developed, where the containers run with the privileges of the user that started them. When you combine these containers with a scheduling system like slurm, you have a facility where users can run a container on any one of many systems that satisfies their needs, i.e. memory, cpus, gpus, disk, …

Docker containers can be easily (one command) turned into singularity containers. An existing container (docker or singularity) can be used as the basis of a new container. This opens up almost unlimited possibilities. Slurm can be used to find an appropriate host to run the container on.
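The one-command conversion mentioned above might look like the following sketch. The Docker Hub image tensorflow/tensorflow:latest-gpu and the output file name are assumptions chosen for illustration; this is not how the SCRC image was necessarily built. The guard only keeps the snippet harmless on a machine where singularity is not installed.

```shell
#!/bin/bash
# Assumed example values: any Docker image URI and output name can be used.
DOCKER_URI="docker://tensorflow/tensorflow:latest-gpu"
SIF_FILE="tensorflow-gpu.sif"

if command -v singularity >/dev/null 2>&1; then
    # Pull the Docker image and convert it into a single .sif image file.
    singularity build "$SIF_FILE" "$DOCKER_URI"
else
    echo "singularity is not installed here; would run:"
    echo "singularity build $SIF_FILE $DOCKER_URI"
fi
```

The resulting .sif file can then be given to singularity run or singularity exec in place of the SCRC image path.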

For instance, the SCRC has a container designed to run tensorflow applications on one of the two GPU slurm systems that are available. The container is publicly available, so anyone can run it. A sample application is to run the MNIST tensorflow application, which categorizes and recognizes handwritten digits. To run the example in a singularity container under slurm on the SCRC HPC grid, several pieces of code are necessary.

1. A slurm job that requests an appropriate server and starts the container there, telling the container to execute the shell script that actually runs the tensorflow MNIST analysis.

2. The shell script to run inside the container.

3. The python program, mnisttest.py, that does the analysis.

Here is the code. All of the pieces are at /gridapps/singularity/slurmexamples; see the README file there for how to copy them.

Note the virtual environment, virttensor, at /gridapps/singularity/slurmexamples. It has all the python modules needed to run tensorflow.

Code examples to run tensorflow singularity container under slurm

Code is at /gridapps/singularity/slurmexamples. Copy it from there.

1. The slurm job that runs the container.
#!/bin/bash
#slurmGPU.sbatch
#
# This is a slurm job to run tensorflow on a GPU node with a
# singularity container that is built with tensorflow 2 and nvidia 11.1
#
# It uses a virtual environment that has tensorflow 2, etc. installed
#
#

#SBATCH --mem=32G
#SBATCH --time=2:00:00
#SBATCH --job-name=GPUSING

##### Uncomment one of the sbatch lines below (change ##SBATCH to #SBATCH) to
##### be more specific in your request for a GPU
##### We currently only have two servers with GPUs: one has a single V100 GPU, and the other has 2 RTX6000 GPUs

################################################
# run the job and only request 1 GPU but any type
#SBATCH --gres=gpu:1

# Request two GPUs, any type
##SBATCH --gres=gpu:2

# Request 1 V100 GPU
##SBATCH --gres=gpu:v100:1

# Request 1 RTX6000 GPU (requested with the gres type p6000)
##SBATCH --gres=gpu:p6000:1

# Request 2 RTX6000 GPUs (requested with the gres type p6000)
##SBATCH --gres=gpu:p6000:2

###############################################

# run the custom SCRC singularity container and bind the nvidia cards on that server to the container
#  Pass the /mnt file system to the container as /mnt in the container

singularity run --nv --bind /mnt/:/mnt /gridapps/singularity/images/scrctensorflow2.sif  ~/MYJOB.sh
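Before submitting a long job, it can help to confirm that the container sees the GPU and the /mnt bind. The lines below are a minimal sketch, assuming they are run on a node that actually has a GPU (for example inside an interactive slurm session) and that nvidia-smi is made available inside the container by --nv. On any other host the guard just prints a message.

```shell
#!/bin/bash
# Image path taken from the slurm job above.
IMAGE=/gridapps/singularity/images/scrctensorflow2.sif

if command -v singularity >/dev/null 2>&1 && [ -f "$IMAGE" ]; then
    # exec runs a single command in the container instead of its default runscript.
    # --nv exposes the host NVIDIA driver, so nvidia-smi should list the GPU(s).
    singularity exec --nv --bind /mnt/:/mnt "$IMAGE" nvidia-smi
    # The bind option should make the host /mnt visible as /mnt in the container.
    singularity exec --nv --bind /mnt/:/mnt "$IMAGE" ls /mnt
else
    echo "singularity or $IMAGE is not available on this host"
fi
```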

2. The shell script to run in the container.

MYJOB.sh 

#!/bin/bash

# First start up the virtual environment that has python 3.6.5, tensorflow, numpy, ...
source /gridapps/singularity/slurmexamples/virttensor/bin/activate
#
# cd to the directory that has your python3 tensorflow program
# (a bare cd goes to your home directory, where mnisttest.py is expected)
cd
# run your tensorflow job
python mnisttest.py

3. The python program, mnisttest.py, to do the analysis.

mnisttest.py 

import tensorflow as tf # Import tensorflow library
#import matplotlib.pyplot as plt # Import matplotlib library
import numpy as np # Import numpy library

mnist = tf.keras.datasets.mnist # Object of the MNIST dataset
(x_train, y_train),(x_test, y_test) = mnist.load_data() # Load data

#plt.imshow(x_train[0], cmap="gray") # Draw the first training image
#plt.show() # Show the plot


x_train = tf.keras.utils.normalize(x_train, axis=1) # Normalize the training dataset
x_test = tf.keras.utils.normalize(x_test, axis=1) # Normalize the testing dataset

#Build the model object
model = tf.keras.models.Sequential()
# Add the Flatten Layer
model.add(tf.keras.layers.Flatten())
# Build the input and the hidden layers
model.add(tf.keras.layers.Dense(128, activation=tf.nn.relu))
model.add(tf.keras.layers.Dense(128, activation=tf.nn.relu))
# Build the output layer
model.add(tf.keras.layers.Dense(10, activation=tf.nn.softmax))

model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

model.fit(x=x_train, y=y_train, epochs=5) # Start training process

# Evaluate the model performance
test_loss, test_acc = model.evaluate(x=x_test, y=y_test)
# Print out the model accuracy 
print('\nTest accuracy:', test_acc)

predictions = model.predict(x_test) # Predict class probabilities for every test image

print(np.argmax(predictions[1000])) # Print the predicted digit for test image 1000

#plt.imshow(x_test[1000], cmap="gray") # Draw test image 1000
#plt.show() # Show the image





This example can easily be modified to pick a different GPU, or even to run different tensorflow python programs.
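Submitting the job, and picking a different GPU without editing the job script, could look like the sketch below. It assumes slurmGPU.sbatch is in the current directory and that MYJOB.sh and mnisttest.py are in the home directory, as the listings above expect. Options given on the sbatch command line take precedence over the #SBATCH lines in the script, so the --gres value here replaces the default request.

```shell
#!/bin/bash
JOB_SCRIPT=slurmGPU.sbatch
GRES="gpu:v100:1"     # one V100; any of the requests listed in the job script works

if command -v sbatch >/dev/null 2>&1 && [ -f "$JOB_SCRIPT" ]; then
    # --parsable prints only the job id (possibly followed by ";cluster").
    JOB_ID=$(sbatch --parsable --gres="$GRES" "$JOB_SCRIPT")
    JOB_ID=${JOB_ID%%;*}
    # Show the job's state in the queue.
    squeue -j "$JOB_ID"
    # By default slurm writes the job's output to slurm-<jobid>.out
    # in the directory the job was submitted from.
    echo "Output will be written to slurm-${JOB_ID}.out"
else
    echo "sbatch or $JOB_SCRIPT not found; nothing submitted"
fi
```

To run a different tensorflow program, change the python line in MYJOB.sh to name that program instead of mnisttest.py.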