Last modified on 11/04/09 15:03:24

These notes relate to the meeting on 2009-10-27 to determine whether this workflow can be scaled -- increasing scale or decreasing run time.

  1. Boshao Zhang from SSG works on Statistical Genetics, mainly imputation. He uses the Impute statistical analysis software from Oxford.
  1. Boshao's imputation work consists of R scripts and Impute scripts wrapped in a shell script that is submitted to an SGE scheduler.
  1. HapMap data, a genetics database, serves as the input for Boshao's imputation work. HapMap data is in text form, consisting of 0's and 1's.
  1. The workflow builds a tree directory structure with nested for loops (shell sketch below):
       for i in 1 2 3 4; do
         mkdir dir$i
         cd dir$i
         for j in 1 2 3; do
           mkdir dir$j
           cd dir$j
           for k in 1 2 3; do
             mkdir dir$k
             cd dir$k
             ./sim_results.txt   # run the simulation at this leaf
             cd ..
           done
           cd ..
         done
         cd ..
       done
    The above logic results in a simulation submitted as 10 array jobs to the SGE scheduler.
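One way the nested loops above could collapse into a single SGE array job is to derive (i, j, k) from the scheduler-provided SGE_TASK_ID. This is a hypothetical sketch, not Boshao's actual submission script; the 4x3x3 = 36 task count comes from the loop bounds above, not from the 10 array jobs mentioned:

```shell
#!/bin/sh
# Hypothetical: decode SGE_TASK_ID (1..36) into the (i, j, k) loop indices.
# SGE sets SGE_TASK_ID when submitted as an array job: qsub -t 1-36 run.sh
t=$(( ${SGE_TASK_ID:-1} - 1 ))   # zero-based task index; defaults to task 1 outside SGE
i=$(( t / 9 + 1 ))               # i in 1..4, each i covers 9 (j, k) pairs
j=$(( t % 9 / 3 + 1 ))           # j in 1..3
k=$(( t % 3 + 1 ))               # k in 1..3
workdir="dir$i/dir$j/dir$k"
mkdir -p "$workdir"              # mkdir -p replaces the nested mkdir/cd dance
echo "task ${SGE_TASK_ID:-1} -> $workdir"
```

Each array task then runs sim_results.txt inside its own $workdir, and the scheduler handles the fan-out instead of the shell loops.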

Open question: does each i, j, k come from the HapMap database?

  1. Each array job takes 16 hrs on Cheaha and 20 hrs on Coosa to complete. The time taken to complete a single job is linear in population size, i.e., if a job with population size 1000 completes in 16 hrs, a job with population size 10 should complete in about 0.16 hrs (roughly 10 minutes).
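The linear-scaling claim can be sanity-checked with quick arithmetic (a sketch; the 16-hour baseline for population 1000 is from the notes above):

```shell
# Estimate runtime for a reduced population, assuming runtime is
# linear in population size (16 hrs for population 1000).
full_hrs=16; full_pop=1000; pop=10
mins=$(awk -v h="$full_hrs" -v p="$pop" -v fp="$full_pop" \
  'BEGIN { printf "%.1f", h * 60 * p / fp }')
echo "population $pop: about $mins minutes"   # -> about 9.6 minutes
```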
  1. The sim_results.txt script essentially consists of:
    1. an R script consuming each of i, j, and k to generate input files
    1. the above files passed as inputs to Impute -- this consumes the bulk of the execution time
    1. an R script which performs post-analysis on the output of Impute
  1. The Impute job itself can be broken down along the following parameters:
    1. interval (-int) -- caveat: the Impute documentation advises against subdividing below the current interval specification
    1. population size
  1. Action items for Boshao:
    1. Have Tapan review the scientific workflow to see whether it aligns with the MIG work he has been doing, so that Boshao can benefit from structuring the workflow before scaling
    1. Reduce the population size from the current 1000 to 10 and check the accuracy of the results before scaling
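Action item 2 (reducing the population size) could be prototyped by chunking the input before running Impute on each piece. A minimal sketch, assuming a hypothetical input file with one individual per line (population.txt is a placeholder name; the real HapMap-derived input format may differ):

```shell
# Split a 1000-individual input into 100 files of 10 individuals each,
# so each small Impute run can be compared against the full run for
# accuracy loss. 'population.txt' is a hypothetical placeholder.
seq 1 1000 > population.txt              # stand-in for the real input
split -l 10 -a 3 population.txt chunk_   # chunk_aaa, chunk_aab, ...
nchunks=$(ls chunk_* | wc -l | tr -d ' ')
echo "created $nchunks chunks of 10 lines each"
```

Each chunk would then be fed through the sim_results.txt pipeline, and the post-analysis R script compared against the population-1000 baseline.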