wiki:StatusReport-July2008
Last modified on 08/12/08 16:36:03

Accomplished

  • SSG's MigAnalysis job-submission script (a shell script) was modified to be portable and to execute faster. The R script was also modified to reflect the changes in the job-submission script. The modified scripts and how to submit them are described at http://projects.uabgrid.uab.edu/r-group/wiki/ssg-mig
  • The workflow logic for the modified scripts above is explained in pseudo-code at http://projects.uabgrid.uab.edu/r-group/wiki/WorkflowLogic
  • The UAB-CIS resource 'olympus' has two versions of R installed, R-2.6.1 and R-2.5.0. After resolving the issues tracked in http://dev.uabgrid.uab.edu/ticket/60, http://dev.uabgrid.uab.edu/ticket/68, http://dev.uabgrid.uab.edu/ticket/72, and http://dev.uabgrid.uab.edu/ticket/73 concerning the execution of R-jobs on olympus, we are now able to submit R-jobs to olympus successfully.
  • Gridway, a meta-scheduler, was used to submit R-jobs to two grid-enabled resources on campus, cheaha and olympus. Gridway dynamically discovers resources and submits jobs to those that match the job requirements. It offers more abstraction and makes it easier to submit jobs to multiple resources at the same time.
  • The performance statistics for executing SSG's MigAnalysis R-job on cheaha and olympus, both standalone and through Gridway, are given below. The following parameters were held constant: number of imputations = 10, percentage of missing data = 0.1, number of families = 50. The number of iterations the R-job ran through was varied from 10 to 100, and the run time was recorded for each iteration count.

cheaha

Iterations   Standalone time (m)   GW to Cheaha time (m)
10           5                     5
20           9                     11
30           15                    15
40           20                    19
50           25                    25
60           31                    28
70           33                    36
80           40                    37
90           45                    45
100          52                    48

olympus

Iterations   Standalone time (m)   GW to Olympus time (m)
10           4                     5
20           8                     9
30           13                    14
40           17                    18
50           21                    22
60           26                    26
70           30                    30
80           34                    34
90           40                    40
100          44                    43
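The two tables above can be summarized with a short script. This is only a sketch: the timings are copied from the tables, while the helper names `minutes_per_iteration` and `mean_gridway_overhead` are illustrative and not part of the MigAnalysis scripts.

```python
# Timings in minutes copied from the tables above:
# iterations -> (standalone, submitted through Gridway)
CHEAHA = {10: (5, 5), 20: (9, 11), 30: (15, 15), 40: (20, 19), 50: (25, 25),
          60: (31, 28), 70: (33, 36), 80: (40, 37), 90: (45, 45), 100: (52, 48)}
OLYMPUS = {10: (4, 5), 20: (8, 9), 30: (13, 14), 40: (17, 18), 50: (21, 22),
           60: (26, 26), 70: (30, 30), 80: (34, 34), 90: (40, 40), 100: (44, 43)}


def minutes_per_iteration(table, column):
    """Average minutes per iteration over all rows.

    column 0 selects the standalone timings, column 1 the Gridway timings.
    """
    total_minutes = sum(times[column] for times in table.values())
    total_iterations = sum(table.keys())
    return total_minutes / total_iterations


def mean_gridway_overhead(table):
    """Mean of (Gridway time - standalone time) across rows, in minutes."""
    diffs = [gridway - standalone for standalone, gridway in table.values()]
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    for name, table in (("cheaha", CHEAHA), ("olympus", OLYMPUS)):
        print(name,
              "standalone min/iteration:", round(minutes_per_iteration(table, 0), 3),
              "mean Gridway overhead (min):", round(mean_gridway_overhead(table), 2))
```

On these numbers the difference between standalone and Gridway submission stays within a few minutes in either direction, which suggests that Gridway adds little overhead for a single job.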
  • A container-based approach for grid-enabled resources, in which a similar job-directory structure is maintained across the resources, was put into effect, for now in the $HOME directories on cheaha and olympus. This approach is mainly intended to prepare the container, i.e., the account on the remote resource, by defining a set of environment variables as recommended by SURA grid http://www.sura.org/programs/SURAgrid_EnvVar.htm. The environment variables are currently defined under the $HOME path and will eventually have to be defined on a system-wide path.
    http://projects.uabgrid.uab.edu/r-group/wiki/ContainerManagement documents the efforts towards a container-based approach to adding resources.

  • Result of submitting the R-job from Gridway. At the time of submission, the host and node availability reported by Gridway was as follows:
    HID PRIO  OS              ARCH   MHZ %CPU  MEM(F/T)     DISK(F/T)     N(U/F/T) LRMS                 HOSTNAME
    0   1     NULLNULL        NULL     0    0       0/0           0/0     0/27/496 PBS                  altix.asc.edu
    1   1     Linux2.6.9-55.E x86_6 1603  200 1580/2007 107524/118867    0/110/118 SGE                  cheaha.ac.uab.edu
    2   1     Linux2.6.9-42.0 x86_6 3200   33 2689/3943   54002/63479    0/127/127 SGE                  olympus.cis.uab.edu
    
Total iterations   Chunks   Sleep time (s)   Avg. run time of a single job (m)   Total run time (m)   Resource run on
100                10       4                5.22                                6.02                 cheaha
1000               50       8                13.41                               20.08                cheaha
1000               100      8                5.33                                18.67                olympus
10000              100      12               40.35                               60.35                olympus
10000              200      8                26.78                               53.46                olympus: 138, cheaha: 62

Total run time (m) = (sleep time (s) * chunks) / 60 + avg. run time of a single job (m)
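The total run time column can be reproduced with a few lines of Python. This is a sketch under one assumption: the sleep time is given in seconds and has to be converted to minutes before it is added to the average single-job time. The function name `total_run_time_minutes` is illustrative. Under this reading, the last four rows of the table reproduce to within rounding; the first row comes out at about 5.89 rather than 6.02, so it is not used in the check below.

```python
def total_run_time_minutes(chunks, sleep_seconds, avg_job_minutes):
    """(sleep time * chunks), converted to minutes, plus the average job time."""
    return (sleep_seconds * chunks) / 60.0 + avg_job_minutes


# (chunks, sleep time in s, avg. single-job time in m, reported total in m),
# copied from the last four rows of the table above.
ROWS = [
    (50, 8, 13.41, 20.08),
    (100, 8, 5.33, 18.67),
    (100, 12, 40.35, 60.35),
    (200, 8, 26.78, 53.46),
]

if __name__ == "__main__":
    for chunks, sleep_s, avg_m, reported in ROWS:
        computed = total_run_time_minutes(chunks, sleep_s, avg_m)
        print(chunks, sleep_s, avg_m, "computed:", round(computed, 2), "reported:", reported)
```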
  • Taking the 1000-iteration job divided into 100 chunks as an example: when run through Gridway, it took almost 19 minutes in total to complete, whereas the original job submitted by SSG with the same number of iterations took almost 1 hour. This is a reduction in run time of about 68% ((60 - 19) / 60).
  • Performance can be improved further (depending on the available bandwidth on the resources) by adjusting the number of chunks and the sleep time between successive jobs.
  • We are also using Alabama Supercomputing (ASA) resources to execute R-jobs, although they are not currently grid-enabled, in order to collect performance statistics.
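The chunking and the run-time reduction discussed above can be sketched as follows. The even split in `split_iterations` is an assumption made for illustration; this report does not state how the job-submission script actually divides the iterations. Both function names are hypothetical.

```python
def split_iterations(total_iterations, chunks):
    """Return per-chunk iteration counts that sum to total_iterations.

    Assumes an even split; any remainder is spread over the first chunks.
    """
    base, extra = divmod(total_iterations, chunks)
    return [base + 1 if i < extra else base for i in range(chunks)]


def percent_reduction(before_minutes, after_minutes):
    """Reduction in run time, as a percentage of the original run time."""
    return 100.0 * (before_minutes - after_minutes) / before_minutes


if __name__ == "__main__":
    sizes = split_iterations(1000, 100)
    print(len(sizes), "chunks of", sizes[0], "iterations each")
    # Roughly 1 hour for the original job versus almost 19 minutes via Gridway.
    print("reduction (%):", round(percent_reduction(60, 19), 2))
```

Changing the `chunks` argument shows how many iterations each job would carry for a given split, which is one of the two knobs (together with the sleep time) mentioned above.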

Outstanding Issues

  • The current version of Gridway being used is 5.2.2.
    • Gridway does not execute array jobs of size greater than 4
    • Gridway fails to copy input/output files or create job-directories on remote resources
    • There are known bugs, 5718 and 5719, in version 5.2.2 of Gridway
    • These bugs also exist in the latest stable release of Gridway, 5.4
    • We are investigating what causes these failures

  • Job and queue policies on cheaha limit a single user to 25 submitted jobs at a time. As a result, even though cheaha is not fully used most of the time, any jobs submitted beyond the first 25 remain in the qw state. This can reduce performance.
  • The clock speed of olympus is 3.2 GHz, twice that of cheaha (1.6 GHz in the host listing above), but the R-job performance results shown above for olympus are not twice as good as those for cheaha.
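The gap between clock speed and observed performance can be quantified with the figures already in this report. The sketch below takes the MHz values from the Gridway host listing and the standalone 100-iteration timings from the performance tables; the variable names are illustrative.

```python
# MHz values from the Gridway host listing.
CLOCK_MHZ = {"cheaha": 1603, "olympus": 3200}

# Standalone run time in minutes for 100 iterations, from the performance tables.
STANDALONE_100_ITER_MIN = {"cheaha": 52, "olympus": 44}

# How much faster olympus is clocked than cheaha.
clock_ratio = CLOCK_MHZ["olympus"] / CLOCK_MHZ["cheaha"]

# How much faster olympus actually completed the same job.
observed_speedup = STANDALONE_100_ITER_MIN["cheaha"] / STANDALONE_100_ITER_MIN["olympus"]

if __name__ == "__main__":
    print("clock ratio:", round(clock_ratio, 2))
    print("observed speedup:", round(observed_speedup, 2))
```

A clock ratio near 2 against an observed speedup well below 2 is the discrepancy described in the last item above; the numbers alone do not say what causes it.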

Investigations

  • We are currently in the process of grid-enabling ASA resources so that R-jobs can be submitted to them from Gridway. We are in discussion with Derek Gottlieb of ASA to resolve Globus container issues
  • ASA resources have R versions 2.5.0 and 2.1.1. R-jobs were modified to execute successfully on version 2.5.0. As a result of discussions with Derek, the latest version of R, 2.7.1, is now on its way to being installed on ASA resources
  • Have sent requests to SURA grid to set up a UAB account
  • In future, try to set up an account with TeraGrid to get more resource power
  • Develop a workflow monitoring system that makes use of the CaBIG framework