wiki:ssg-mig
Last modified on 06/05/08 12:03:22

This page describes the structure and function of the shell script that runs SSG's R-script.

The original and modified shell scripts are attached below.

The job submission command for the modified script should look like this:

qsub -t 1-100 qsub_sge_job.sh 10000 10:0.1:50

  1. The '-t' option specifies the range of array job indices to SGE. So, in the above example the number of array jobs = 100,
    and the array starts from 1 with an increment of 1,
    i.e., SGE_TASK_FIRST = 1, SGE_TASK_LAST = 100 and SGE_TASK_STEPSIZE = 1.
  2. The first argument after the script file ('qsub_sge_job.sh' in the above example) specifies the total number of iterations.
    So, in the above example, total number of iterations = 10000.
  3. The second argument after the script file specifies the seed value, which has the form
    number_of_imputations:proportion_of_missing_data:no_families, with the three fields separated by colons.
    So, in the above example, seed value = 10:0.1:50.

Passing the number of array jobs, the total number of iterations, and the seed value as command-line arguments to qsub
lets the user change them as needed without editing the script.

For example, a 10,000-iteration job can easily be broken up into:

100 jobs of 100 iterations each => qsub -t 1-100 qsub_sge_job.sh 10000 10:0.1:50 or

10 jobs of 1000 iterations each => qsub -t 1-10 qsub_sge_job.sh 10000 10:0.1:50 or

1000 jobs of 10 iterations each => qsub -t 1-1000 qsub_sge_job.sh 10000 10:0.1:50
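
The effect of each split can be checked before submitting. The following is a minimal sketch and not part of qsub_sge_job.sh; the helper name iterations_per_task is hypothetical. It uses the same integer division as the job script, so it also shows how many iterations would be left unassigned if the total is not a multiple of the number of array jobs.

```shell
# Hypothetical helper (not in the original script): report how many
# iterations each array job receives for a given split.
# Arguments: total iterations, number of array jobs.
iterations_per_task() {
  local total=$1 tasks=$2
  # Integer division, as `expr $iterations / $amax` does in the job script
  local per_task=$(( total / tasks ))
  # Iterations that no array job would pick up
  local leftover=$(( total - per_task * tasks ))
  echo "tasks=$tasks per_task=$per_task leftover=$leftover"
}

# The three splits listed above
for tasks in 10 100 1000; do
  iterations_per_task 10000 "$tasks"
done
```

A non-zero leftover suggests choosing a different '-t' range, since those trailing iterations fall outside every start/end interval.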

The latest shell script with modifications is as follows:

1 #!/bin/bash
2 #$ -S /bin/bash
3 #$ -V
4 #$ -N mig10
5 #$ -cwd
6 # -m beas
7 # -M ppreddy@@uab.edu
8 #$ -l h_rt=1000:00:00
9 #$ -e ./
10 #$ -o ./
11
12
13 iterations=$1
14 seed=$2
15
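16 function parse_seed() {
17   # Split the colon-separated seed into the global array c=(n m f).
18   echo "$seed"
19   b=$( echo "$seed" | awk 'BEGIN{ FS=":" } { print $1 "\n" $2 "\n" $3}' )
20   c=($b)
21   return 0  # 'return' can only pass an exit status (0-255), not the array
22
23 }
24
25 function runR() {
26
27   parse_seed
28   # c now holds the parsed parameters; the exit status is not needed
29   n=${c[0]}
30   m=${c[1]}
31   f=${c[2]}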
32   amax=`expr $SGE_TASK_LAST - $SGE_TASK_FIRST + 1`
33   task_id=`expr $SGE_TASK_ID - 1`
34   range=`expr $iterations / $amax`
35   index=`expr $task_id \* $range`
36   s=`expr $index + 1`
37   e=`expr $index + $range`
38   echo "task_id=$task_id index=$index"
39   echo "n=$n m=$m f=$f st=$s end=$e"
40   R --silent --no-save --no-restore "--args n=$n m=$m f=$f st=$s end=$e" < MigAnalysis.R
41
42 }
43
44 runR
45
46

Lines 9 and 10

9 #$ -e ./
10 #$ -o ./

These options place the standard error and standard output files in the current directory. No specific file names are given, so SGE's default names apply.
As a result, the names of the standard error and output files start with the job name given with the '-N' option ('mig10' here).
The output of the R-script is also written to the same file as the standard output.

Lines 13 and 14

13 iterations=$1
14 seed=$2

The above two shell variables are read from the command-line arguments given to the SGE command 'qsub'.
The total number of iterations is the first argument after the script name.
The seed value is the second argument after the script name.
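
Since both values come straight from the command line, a check before any work starts can catch a mistyped submission. This is only a sketch under assumptions: the function name check_args and the exact rules (a positive whole number of iterations, exactly three non-empty colon-separated seed fields) are illustrative and do not appear in the original script.

```shell
# Hypothetical pre-flight check (not in the original script).
# Arguments: iterations, seed (expected form n:m:f).
check_args() {
  local iterations=$1 seed=$2
  # iterations must consist of digits only and must not be empty or zero
  case $iterations in
    ''|*[!0-9]*|0) echo "bad iterations: '$iterations'" >&2; return 1 ;;
  esac
  # seed must have exactly two colons and three non-empty fields
  case $seed in
    *:*:*:*) echo "bad seed: '$seed'" >&2; return 1 ;;
    ?*:?*:?*) ;;
    *) echo "bad seed: '$seed'" >&2; return 1 ;;
  esac
  return 0
}

check_args 10000 10:0.1:50 && echo "arguments look valid"
```

In a job script, such a check could be called as check_args "$1" "$2" before the R command is built.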

Lines 16-23

These lines define the function parse_seed. It is the same as in the earlier script, except that the seed file is no longer hard-coded.
Instead, the user supplies a particular seed as an argument on the command line (explained above).
The seed value is parsed and the individual parameters are saved in the shell array c. A bash function's return statement can only pass a numeric exit status, so the parameters reach the caller through this global array rather than through the return value.
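
The colon-splitting step described above could also be done with the shell's own field splitting instead of awk. The following is a minimal sketch of that alternative, not the script's actual method; the function name parse_seed_fields is hypothetical, and it sets the variables n, m and f directly rather than going through an array.

```shell
# Hypothetical alternative to parse_seed (not in the original script).
# Splits a seed of the form n:m:f and sets the global variables n, m, f.
parse_seed_fields() {
  # IFS=: applies only to this read; -r keeps backslashes literal
  IFS=: read -r n m f <<EOF
$1
EOF
}

parse_seed_fields "10:0.1:50"
echo "n=$n m=$m f=$f"
```

Passing the seed as a function argument, rather than reading a global, keeps the function usable on its own.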

Lines 25-42

These lines define the function runR. Here, the parse_seed function is called and the individual parameters are extracted.
The parameters are: n=number_of_imputations m=proportion_of_missing_data f=no_families

Lines 32-37 compute the start (s) and end (e) iteration numbers for each array job. The total number of iterations comes from the first command-line argument to the script.
The shell variable amax denotes the number of array jobs this particular script is split into.
This is set with the qsub command option -t (explained above).
Taking the above example, qsub -t 1-100 qsub_sge_job.sh 10000 10:0.1:50:
when total number of iterations = 10000 and number of array jobs = 100, the start and end values for each array job are computed as follows:

Array   Start   End
1       1       100
2       101     200
3       201     300
4       301     400
.       .       .
.       .       .
.       .       .
100     9901    10000
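
The rows above can be reproduced outside SGE with a short sketch of the same arithmetic. The helper name task_range is hypothetical, and the SGE values are passed as plain arguments here instead of being read from SGE_TASK_FIRST, SGE_TASK_LAST and SGE_TASK_ID. Like lines 32-37 of the job script, it assumes the '-t' range starts at 1 with a step of 1.

```shell
# Hypothetical helper mirroring the arithmetic on lines 32-37 of the job script.
# Arguments: total iterations, first task ID, last task ID, current task ID.
# Prints "start end" for the given array job.
task_range() {
  local iterations=$1 first=$2 last=$3 task_id=$4
  local amax=$(( last - first + 1 ))        # number of array jobs
  local range=$(( iterations / amax ))      # iterations per array job
  local index=$(( (task_id - 1) * range ))  # iterations before this job
  echo "$(( index + 1 )) $(( index + range ))"
}

# Rows 1, 2 and 100 of the table for: qsub -t 1-100 qsub_sge_job.sh 10000 10:0.1:50
for id in 1 2 100; do
  echo "task $id: $(task_range 10000 1 100 "$id")"
done
```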

Line 40

40 R --silent --no-save --no-restore "--args n=$n m=$m f=$f st=$s end=$e" < MigAnalysis.R

This line is the actual R command. The arguments to the R-script are given by the "--args" option.
With this command, the output of the R-script is written to the same file as the standard output stream.

Line 44

This line calls the function runR. It runs in each array job once the user has submitted the script with the qsub command.

As a result of the modifications to the shell script, changes were made to the R-script itself. These mainly concern parsing the command-line arguments inside the R-script and removing the hard-coded R variables mpr and noFamilies. The following lines show the parsing of command-line arguments. The complete MigAnalysis.R script is attached below.

# Move toward a production version that will call the appropriate
# functions in a loop.

library(nlme)

## First read in the arguments listed at the command line

args=(commandArgs(TRUE))
print(args)

## args is now a character vector with one element per argument
## First check to see if arguments are passed.
## Then cycle through each element and evaluate it as an R expression.

if(length(args)==0){
    print("No arguments supplied.")
    ## supply default values
}else{
    for(i in 1:length(args)){
         eval(parse(text=args[[i]]))
    }
}

nim<-as.integer(n)
mpr<-m
noFamilies<-as.integer(f)
iterEnd <- as.integer(end)
iterStart <- as.integer(st)
print(nim)
print(mpr)
print(noFamilies)
print(iterEnd)
print(iterStart)
Sys.sleep(iterEnd/50)

source("SimBayes.R")
source("ExtractData.R")
source("EMFull.R")
source("MIFull.R")

Attachments