Accomplished
- SSG's MigAnalysis job-submission script (a shell script) was modified to be portable and to execute faster. The R-script was also modified to reflect the changes in the job-submission script. The details of the modified scripts and how to submit them are explained here [http://projects.uabgrid.uab.edu/r-group/wiki/ssg-mig]
- The workflow logic for the modified scripts is explained, in pseudo-code format, here [http://projects.uabgrid.uab.edu/r-group/wiki/WorkflowLogic]
- The UAB-CIS resource 'olympus' has two versions of R installed, R-2.6.1 and R-2.5.0. After resolving the issues in [http://dev.uabgrid.uab.edu/ticket/60], [http://dev.uabgrid.uab.edu/ticket/68], [http://dev.uabgrid.uab.edu/ticket/72], and [http://dev.uabgrid.uab.edu/ticket/73] concerning the execution of R-jobs on olympus, we are now able to submit R-jobs to olympus successfully.
- Gridway, a meta-scheduler, was used to submit R-jobs to two grid-enabled resources on campus, namely cheaha and olympus. Gridway dynamically discovers resources and submits jobs to those resources that match the job requirements. Gridway offers more abstraction and ease of use when submitting jobs to multiple resources at the same time.
- The performance statistics of executing SSG's MigAnalysis R-job on cheaha and olympus, both individually (standalone) and through Gridway (GW), are given below. The statistics were collected with the following parameters held constant: number of imputations = 10, percentage of missing data = 0.1, number of families = 50. The number of iterations the R-job ran through was varied from 10 to 100, and the run time was recorded for each iteration count.
cheaha
||Iterations||Standalone time (m)||GW to Cheaha time (m)||
||10||5||5||
||20||9||11||
||30||15||15||
||40||20||19||
||50||25||25||
||60||31||28||
||70||33||36||
||80||40||37||
||90||45||45||
||100||52||48||
olympus
||Iterations||Standalone time (m)||GW to Olympus time (m)||
||10||4||5||
||20||8||9||
||30||13||14||
||40||17||18||
||50||21||22||
||60||26||26||
||70||30||30||
||80||34||34||
||90||40||40||
||100||44||43||
- A container-based approach for grid-enabled resources, in which a similar job-directory structure is maintained across the resources, has been put into effect, for now in the $HOME directories on cheaha and olympus. This approach is mainly intended to prepare the container, i.e., the account on the remote resource, by defining a set of environment variables as recommended by SURA grid [http://www.sura.org/programs/SURAgrid_EnvVar.htm]. The environment variables are currently defined under $HOME; eventually they will have to be defined on a system-wide path.
http://projects.uabgrid.uab.edu/r-group/wiki/ContainerManagement documents the efforts towards a container-based approach to adding resources.
- Results of submitting the R-job from Gridway. At the time of submission, the node availability reported by Gridway was as follows:
||HID||PRIO||OS||ARCH||MHZ||%CPU||MEM(F/T)||DISK(F/T)||N(U/F/T)||LRMS||HOSTNAME||
||0||1||NULLNULL||NULL||0||0||0/0||0/0||0/27/496||PBS||altix.asc.edu||
||1||1||Linux2.6.9-55.E||x86_6||1603||200||1580/2007||107524/118867||0/110/118||SGE||cheaha.ac.uab.edu||
||2||1||Linux2.6.9-42.0||x86_6||3200||33||2689/3943||54002/63479||0/127/127||SGE||olympus.cis.uab.edu||
||Total no. of Iterations||Divided into Chunks||Sleep Time (s)||Avg. Run Time of a Single Job (m)||Total Run Time ((sleeptime*chunks)+avg.time) (m)||Resource Run on||
||100||10||4||5.22||6.02||cheaha||
||1000||50||8||13.41||20.08||cheaha||
||1000||100||8||5.33||18.67||olympus||
||10000||100||12||40.35||60.35||olympus||
||10000||200||8||26.78||53.46||olympus-138, cheaha-62||
- Taking the 1000-iteration job divided into 100 chunks as an example: when run through Gridway, it took almost 19 minutes to complete, whereas the original job submitted by SSG with the same number of iterations took almost 1 hour, a reduction in run time of roughly 68%.
- Performance can be improved further (depending on the available bandwidth on the resources) by adjusting the number of chunks and the sleep time between successive jobs.
- Alabama Supercomputing (ASA) resources are being used to execute R-jobs, though they are not currently grid-enabled. R-jobs are being run on ASA resources to collect performance statistics.
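The total run times in the chunking table above follow the formula given in its header, and the gain quoted for the 1000-iteration job is a relative reduction against the original run of about one hour. A minimal Python sketch of that arithmetic is given here. The function names are hypothetical, the sleep time is assumed to be in seconds and the run times in decimal minutes, and the 60-minute and 19-minute figures are the approximate values quoted above.

```python
def total_run_time_min(chunks, sleep_s, avg_job_min):
    """Total run time in minutes: (sleep time * chunks) + average job time."""
    # Sleep time is given in seconds, so convert the accumulated sleep to minutes.
    return (sleep_s * chunks) / 60.0 + avg_job_min


def percent_reduction(baseline_min, new_min):
    """Relative reduction in run time, as a percentage of the baseline."""
    return 100.0 * (baseline_min - new_min) / baseline_min


# Rows taken from the table above: (chunks, sleep time in s, avg. job time in m)
rows = {
    "1000 iterations, 50 chunks, cheaha": (50, 8, 13.41),
    "1000 iterations, 100 chunks, olympus": (100, 8, 5.33),
    "10000 iterations, 100 chunks, olympus": (100, 12, 40.35),
}

for label, (chunks, sleep_s, avg_min) in rows.items():
    print(f"{label}: {total_run_time_min(chunks, sleep_s, avg_min):.2f} m")

# About 19 minutes through Gridway against about 60 minutes for the original job
print(f"reduction: {percent_reduction(60, 19):.2f}%")
```

Under these assumptions the sketch agrees with the tabulated totals for the rows shown to within rounding, and it makes the trade-off explicit: more chunks shorten each job but add sleep time in proportion to the number of chunks.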
Outstanding Issues
- The current version of Gridway being used is 5.2.2.
- Gridway does not execute array jobs of more than 4
- Gridway fails to copy input/output files or create job-directories on remote resources
- Bugs 5718 and 5719 are known in version 5.2.2 of Gridway
- These bugs also exist in the latest stable release of Gridway, 5.4
- We are investigating what causes these failures
- Job and queue policies on cheaha limit a single user to submitting only 25 jobs at a time. As a result, even though cheaha is not fully used most of the time, jobs submitted beyond the 25th will wait in the qw state. This can reduce performance.
- The clock speed of olympus is 3.2GHz, twice that of cheaha, but the performance results shown above for R-job execution on olympus are not twice as good as those on cheaha.
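The last two issues above can be put into numbers. The Python sketch below uses the standalone timings from the cheaha and olympus tables to compute the observed speedup of olympus over cheaha, and estimates how many submission waves a chunked job would need under the 25-job limit. The function names are hypothetical, and the wave estimate assumes, for illustration only, that every chunk is sent to cheaha and that each wave fills the limit before the next one starts.

```python
import math

# Standalone run times in minutes, keyed by iteration count (from the tables above)
cheaha = {10: 5, 20: 9, 30: 15, 40: 20, 50: 25,
          60: 31, 70: 33, 80: 40, 90: 45, 100: 52}
olympus = {10: 4, 20: 8, 30: 13, 40: 17, 50: 21,
           60: 26, 70: 30, 80: 34, 90: 40, 100: 44}


def speedup(slow_min, fast_min):
    """Ratio of run times; 2.0 would mean the faster resource halves the time."""
    return slow_min / fast_min


def submission_waves(chunks, job_limit=25):
    """Number of batches needed when at most job_limit jobs are accepted at once."""
    return math.ceil(chunks / job_limit)


ratios = {n: speedup(cheaha[n], olympus[n]) for n in cheaha}
for n, r in ratios.items():
    print(f"{n:>3} iterations: speedup {r:.2f}x")

print(f"mean speedup: {sum(ratios.values()) / len(ratios):.2f}x")
print(f"waves for 100 chunks: {submission_waves(100)}")
```

On these figures the ratio stays between about 1.1 and 1.25 across all iteration counts, well short of the 2x that the clock speeds alone would suggest.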
Investigations
- Currently, we are in the process of grid-enabling ASA resources so that R-jobs can be submitted to them from Gridway. We are in discussion with Derek Gottlieb of ASA to resolve Globus container issues.
- ASA resources have R versions 2.5.0 and 2.1.1. R-jobs were modified to execute successfully on version 2.5.0. As a result of the discussion with Derek, the latest version of R, 2.7.1, is now being installed on ASA resources.
- We have sent a request to SURA grid to set up a UAB account
- In the future, try to set up an account with TeraGrid to get access to more computing resources
- Develop a cool work-flow monitoring system that makes use of the !CaBIG framework
