How to run a Hadoop/Stratosphere job on TomPouce cluster

TomPouce is a cluster of 20 compute nodes (240 cores in total) located in the Inria Turing building (next to Ecole Polytechnique). It is shared among Inria teams, and jobs are scheduled on it using the SGE scheduler.

TomPouce specifications:

– Compute:

20 nodes × 2 processors × 6 cores = 240 cores in total

48 GB of RAM per node

471 GB of local disk space per node

Up to 16 nodes available for reservation

– Storage

Up to 16 × 471 GB for HDFS

Dell R510: /home, 19 TB, NFS

2 × Dell R710: /scratch, 37 TB, FhGFS (Fraunhofer FS)

– Network:

Switch Dell 5548

Switch infiniband Mellanox InfiniScale IV QDR

– Hadoop specs:

Version 1.2.1

Java7

HDFS storage point: /local/hadoop

HDFS user-specific storage: /local/hadoop/username

Hadoop tmp directory: /local/hadoop/tmp-username/

– Hadoop & jobs:

/local/hadoop/ is recursively deleted after each qsub job

To connect to the TomPouce cluster, you need to have your SSH public key in the Inria LDAP. If you do not, send an e-mail to helpmi@saclay.inria.fr with your SSH public key attached, stating that you would like to connect to the TomPouce cluster and asking them to add the key to the LDAP.
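
If you do not have an SSH key pair yet, you can generate one on your local machine before writing to the helpdesk; a minimal sketch using the OpenSSH defaults:

$ ssh-keygen -t rsa -b 4096    # creates ~/.ssh/id_rsa (private key) and ~/.ssh/id_rsa.pub (public key)

Attach only the .pub file to your e-mail, never the private key.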

Here are the instructions to run a Hadoop job on TomPouce cluster:

1. Copy your Hadoop/Stratosphere job from your local machine to the TomPouce front node.

Let’s consider that our job is the jar file myjob.jar.

$ scp myjob.jar username@195.83.212.209:~/

username refers to your Inria user name (e.g. arandaan); the corresponding SSH key must already be stored in the LDAP, as explained above. The command above copies the jar file containing the job to your home directory on the cluster.

2. Connect via ssh to the cluster front node

Type the command:

$ ssh username@195.83.212.209
Welcome to Bright Cluster Manager 6.0
Based on Scientific Linux release 6
Cluster Manager ID: #120054
Use the following commands to adjust your environment:
'module avail '            - show available modules
'module add 'module' '     - adds a module to your environment for this session
'module initadd 'module' ' - configure module to be loaded at every login
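
For example, once logged in you can check that the Hadoop module used in the scripts below is available and load it for your session (hadoop/1.2.1 is the module loaded in the script of step 3):

$ module avail                # list the modules installed on the cluster
$ module add hadoop/1.2.1     # load Hadoop 1.2.1 for this session
$ hadoop version              # verify that the hadoop command is now on the PATH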

3. Creation of an execution script

The best way to execute jobs on the TomPouce cluster is to create a script containing the SGE parameters to use and the commands we want to run.

Here you can find an example of a script for a Hadoop job:

=====================

#!/bin/bash
#$ -N hadoop_run
#$ -pe hadoop 13
#$ -j y
#$ -o output.$JOB_ID
#$ -l h_rt=00:10:00,hadoop=true
#$ -cwd
#$ -q hadoop.q

if [ -f /etc/bashrc ]; then
    . /etc/bashrc
fi
module load hadoop/1.2.1
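#per-user HDFS path and per-job configuration, log and PID directories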
hadir=/local/hadoop/$(whoami)
CONF=`pwd`/config.$JOB_ID
export HADOOP_CONF_DIR=$CONF
export HADOOP_LOG_DIR=$CONF/logs
export HADOOP_PID_DIR=$CONF/pids
echo Configuring HDP $HADOOP_CONF_DIR

#creating required directories for the job
hadoop dfs -mkdir $hadir
hadoop dfs -mkdir $hadir/input
hadoop dfs -mkdir $hadir/conf
hadoop dfs -mkdir $hadir/rdfdata

#copying input data to hdfs
hadoop dfs -copyFromLocal 10Uni.n3 $hadir/input/
hadoop dfs -copyFromLocal qx.xml $hadir/conf/
hadoop dfs -lsr $hadir
echo running lubm import…
hadoop jar cliquesquare-0.1-run.jar load-skewed config.properties 10Uni.n3 $hadir/rdfdata/ 3 50000
echo listing directory
hadoop dfs -lsr $hadir

====================
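
Because /local/hadoop/ (and therefore HDFS) is wiped after each qsub job, copy any results you need out of HDFS before the script ends. A minimal sketch to append at the end of the script above, assuming the output was written under $hadir/rdfdata/ (the local target directory is only an example):

====================

#copying results from the node-local HDFS back to the NFS home before HDFS is wiped
mkdir -p ~/results.$JOB_ID
hadoop dfs -copyToLocal $hadir/rdfdata ~/results.$JOB_ID/

====================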

And here, you can find one for a Stratosphere job:

====================
#!/bin/bash
#$ -N strato_run
#$ -pe stratosphere 24
#$ -j y
#$ -o output.$JOB_ID
#$ -l h_rt=00:10:00,hadoop=true,excl=true
#$ -cwd
#$ -q hadoop.q
export PATH=$PATH:'/cm/shared/apps/hadoop/current/conf/'
export STRATOSPHERE_HOME='/cm/shared/apps/stratosphere/current'
hostname
MASTER=`cat /home/guests/clustervision/current/masters`
# Copy the input file into the HDFS filesystem
hadoop --config /home/guests/clustervision/current/ dfs -copyFromLocal \
    /home/guests/clustervision/tmp /var/hadoop/dfs.name.dir
#Running the Stratosphere task(s) here. I am specifying the jar and run parameters:
$STRATOSPHERE_HOME/bin/pact-client.sh run -j myjob.jar -a 2 \
    hdfs://$MASTER:50040/var/hadoop/dfs.name.dir/inputFile \
    hdfs://$MASTER:50040/var/hadoop/dfs.name.dir/outputFile
# Copying the output files from the HDFS filesystem
hadoop --config /home/guests/clustervision/current/ fs -get \
    /var/hadoop/dfs.name.dir/output

====================

First, the script must include the SGE options to use. Each option is written on a line starting with ‘#$’. In the scripts above, the options used are:

• -N ‘job_name’: gives a name to the job to run (by default SGE uses this name for the output file, jobname.o<job_id>).
• -pe ‘environment’ N: specifies the parallel environment, where ‘environment’ is either ‘hadoop’ or ‘stratosphere’ and N is the number of cores (limited to 180).
• -j y: merges the error output and the standard output into the same file.
• -o output.$JOB_ID: the standard output will be written to a file named output.$JOB_ID, where $JOB_ID is the number SGE automatically assigns to the job.
• -l h_rt=00:10:00,hadoop=true,excl=true: resources are requested with -l name=value. Here h_rt=00:10:00 indicates that the job should be killed after 10 minutes, hadoop=true indicates that the job is a Hadoop job (this is used even for Stratosphere jobs), and excl=true requests exclusive execution.
• -cwd: runs the job from the current directory.
• -q hadoop.q: specifies the Hadoop queue. NOTE: Stratosphere jobs also use the Hadoop queue.

4. Dataset upload.

Use /scratch/ to store your required datasets.

Keep in mind that this storage point does not have any backup.
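
For example, to upload a dataset from your local machine (here the 10Uni.n3 file used in the Hadoop script above; the sub-directory under /scratch/ is only a suggestion, adapt it to the layout you find there):

$ scp 10Uni.n3 username@195.83.212.209:/scratch/username/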

5. Submission of the Hadoop job.

The command for submitting a job with SGE is qsub. To submit a job, just execute qsub followed by the name of the script containing all the commands:

qsub script.qsub

You can find other options and further documentation here: SGE-user documentation.

If you have a reservation, run qsub -ar reservation_id.
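
Once the job is submitted, it can be followed with the standard SGE commands (12345 stands for the job id printed by qsub):

$ qstat                  # list your pending and running jobs
$ qstat -j 12345         # detailed information about a given job
$ qdel 12345             # remove a job from the queue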

6. Logs

In the folder /home/guests/clustervision/, a log named output.$JOB_ID will be generated, containing the output of the job execution.

Logs from Hadoop or Stratosphere will be generated in the folder /home/guests/clustervision/config.$JOB_ID/logs.
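
For example, to follow the job output while it runs and to list the Hadoop/Stratosphere logs afterwards (12345 again stands for the job id):

$ tail -f /home/guests/clustervision/output.12345
$ ls /home/guests/clustervision/config.12345/logs/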

Further documentation:

Hadoop and how to run it on OAK cluster

Stratosphere and how to run it on OAK cluster

Hadoop and Stratosphere on TomPouce cluster

Among the Inria engineers, Marianne Lombard and Emmanuel Penot have been trained on SGE.
