These notes relate to the meeting on 2009-10-27 to determine whether this workflow can be scaled -- either increasing its scale or decreasing its run time.
1. Boshao Zhang from SSG works on Statistical Genetics, mainly with imputation. He uses the Impute statistical analysis software from Oxford.
2. Boshao's imputation work consists of R scripts and Impute scripts wrapped in a shell script that is submitted to an SGE scheduler.
3. HapMap data, which comes from a genetics database, serves as the input for Boshao's imputation work. HapMap data is in text form, consisting of 0's and 1's.
4. The workflow builds a directory tree using nested for loops (pseudo code below):
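for i in 1 2 3 4; do
    mkdir -p "dir$i" && cd "dir$i" || exit 1
    for j in 1 2 3; do
        mkdir -p "dir$j" && cd "dir$j" || exit 1
        for k in 1 2 3; do
            mkdir -p "dir$k" && cd "dir$k" || exit 1
            ./sim_results.txt    # run the simulation script for this (i, j, k)
            cd ..                # back to dir$j
        done
        cd ..                    # back to dir$i
    done
    cd ..                        # back to the top level
done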
The result of the above logic is a simulation submitted to the SGE scheduler as 10 array jobs.
Open question: do the values of i, j, and k come from the HapMap database?
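As a rough illustration of how the directory tree could be addressed from an array job, the loops above produce 4 x 3 x 3 = 36 leaf directories. The sketch below maps a 1-based task index to one of them. The helper name task_to_dir and the index ordering are assumptions made for illustration; the notes do not say how the 10 array jobs are actually mapped onto the tree. SGE sets SGE_TASK_ID for jobs submitted with qsub -t.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a 1-based task index (1..36) to a leaf directory
# of the 4 x 3 x 3 tree built by the nested loops.
# Assumption: k varies fastest, then j, then i (same order as the loops).
task_to_dir() {
    local task=$1
    local n_i=4 n_j=3 n_k=3
    if (( task < 1 || task > n_i * n_j * n_k )); then
        echo "task index out of range: $task" >&2
        return 1
    fi
    local idx=$(( task - 1 ))
    local k=$(( idx % n_k + 1 ))
    local j=$(( idx / n_k % n_j + 1 ))
    local i=$(( idx / (n_k * n_j) + 1 ))
    echo "dir$i/dir$j/dir$k"
}

# Inside an SGE array job the index would come from SGE_TASK_ID;
# default to 1 here so the sketch also runs outside the scheduler.
task_to_dir "${SGE_TASK_ID:-1}"
```

With this ordering, task 1 maps to dir1/dir1/dir1 and task 36 maps to dir4/dir3/dir3, so each array task could cd into its own leaf directory instead of walking the whole tree serially.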
5. Each array job takes 16 hrs on Cheaha and 20 hrs on Coosa to complete. The time taken to complete a single job scales linearly with population size, i.e.,
a job with population size = 1000 completes in 16 hrs, so a job with population size = 10 would be expected to complete in about 0.16 hrs (16 hrs x 10 / 1000, roughly 10 minutes).
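The linear estimate can be checked with a few lines of shell. This is a minimal sketch that assumes strict proportionality with no fixed start-up cost, using the Cheaha figures from the notes as the reference point; the function name estimate_hours is hypothetical.

```shell
#!/usr/bin/env bash
# Linear run-time estimate, assuming the figures in the notes
# (population size 1000 -> 16 hrs on Cheaha) and no fixed start-up cost.
estimate_hours() {
    local pop=$1
    local base_pop=${2:-1000}    # reference population size
    local base_hours=${3:-16}    # reference run time in hours
    awk -v p="$pop" -v bp="$base_pop" -v bh="$base_hours" \
        'BEGIN { printf "%.2f\n", bh * p / bp }'
}

estimate_hours 1000   # reference job: prints 16.00
estimate_hours 10     # population size from the action items: prints 0.16
```

Passing 20 as the third argument gives the corresponding estimate for Coosa. Any real job will also carry some fixed overhead (scheduling, input generation, post-analysis), so the small-population figure is a lower bound rather than a prediction.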
6. The sim_results.txt essentially consists of:
- an R-script consuming each of i, j, and k to generate input files
- the above files are passed as inputs to impute - this consumes the bulk of execution time
- an R-script which performs post-analysis on the output of impute
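To confirm where the time goes across these three stages, each stage could be wrapped in a small timing helper. The sketch below is illustrative only: run_stage is a hypothetical function, and placeholder commands stand in for the real R scripts and the impute call, whose names and arguments are not given in the notes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run one stage of the workflow and report
# its wall-clock time and exit status.
run_stage() {
    local name=$1
    shift
    local start end status
    start=$(date +%s)
    "$@"
    status=$?
    end=$(date +%s)
    echo "stage=$name seconds=$(( end - start )) status=$status"
    return "$status"
}

# Placeholder commands stand in for the real stages; in the actual workflow
# these would be the input-generating R script, impute, and the
# post-analysis R script.
run_stage generate_inputs true
run_stage impute sleep 1
run_stage post_analysis true
```

Collecting these lines from each array task's output would show whether impute dominates the run time as expected, and how much fixed cost the two R stages add per job.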
7. The impute job itself can be broken down along the following parameters:
- interval (-int); caveat: Impute advises against breaking this down further than the current interval specification
- population size
8. Action Items for Boshao:
- Ask Tapan to review Boshao's scientific workflow and check whether it is similar to some of the MIG work Tapan has been doing, so that Boshao can benefit from structuring the workflow before scaling
- Break down the population size from the current 1000 to 10 and check the accuracy of the results before scaling
