RNotes

Executing LAM/MPI jobs via GridWay

Looked up Grid Engine and Rocks docs to see how regular MPI could be used as a process launcher. In doing so, came across the Grid Engine user wiki, which has all sorts of good user integration notes for various applications (e.g., MATLAB, FLEXlm).

They also linked to the loose/tight integration docs, which I reread; it is now clear why the MPI_HOME environment variable and PATH are important. In loose integration, SGE doesn't do anything but tell you which slaves are available: there is no SGE-controlled environment on the slave nodes (e.g., no $TMPDIR), the startup method is just ssh, and therefore whatever files are sourced during a non-interactive login determine the environment of the shell on the slaves.

Also interesting: the lam_loose_rsh scripts don't need to be registered with SGE (e.g., you can call them directly from the SGE script, which may be useful in grid setups).

Went back and re-read the SGE jobmanager because the lam_loose_rsh instructions show a command line with -catch-rsh, which is also showing up in the error files of the GridWay output. Is there a misconfiguration of lam_loose_rsh that is causing the master processes to be abandoned? qconf -sp lam_loose_rsh indicates the correct command, though.

In re-reading the SGE jobmanager I finally grokked the section where -pe is specified: if I set SGE_PE or GRID_PE in the job environment, I should be able to select the PE that way.

Attempted to submit a snow job to cheaha via GridWay for 4 nodes, but cheaha reports being full, 60 of 60 nodes used. This is not completely true, since many of the nodes have just a single job running and can accept more work. Verified by submitting a local job to SGE, which ran with 4 slots distributed across 4 distinct compute nodes (there is still a leftover LAM daemon on the head node). The GridWay job remains pending; the bug is that GridWay sees the nodes rather than the available slots. This is also a good example of why having multiple clusters ready is useful, because this job could simply have spilled over to a second cluster.

Some workarounds to enable SGE/LAM integration

Once I understood the problem JPR was having with the SGE/LAM integration, here is my attempt to fix it. If LAM MPI is the default MPI on a cluster, the solution is quite simple: while setting up the SGE parallel environments, create a PE, say lam, and use it instead of the mpi PE. To add a new parallel environment called lam, run qconf -ap lam and complete the various fields. Here is what you should enter:

pe_name           lam
slots             9999
user_lists        NONE
xuser_lists       NONE
start_proc_args   /opt/gridengine/mpi/startlam.sh $pe_hostfile
stop_proc_args    /opt/gridengine/mpi/stoplam.sh
allocation_rule   $fill_up
control_slaves    FALSE
job_is_first_task TRUE
urgency_slots     min
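
If you prefer not to fill in the fields interactively, the same definition can be saved to a file and loaded with qconf -Ap instead (a sketch; the file name is arbitrary):

# put the ten lines above into /tmp/lam_pe.conf, then:
qconf -Ap /tmp/lam_pe.conf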

I am assuming that startlam.sh and stoplam.sh already exist in the /opt/gridengine/mpi directory (they were there on Rocks 4.2.1). If you are using a version of LAM that is not installed in the default Rocks location /opt/lam/gnu, then you have to edit startlam.sh and make sure it points to your version of LAM.

After this step, make sure you add the new PE to a queue. If you are using the default queue all.q, you can do that with the command "qconf -mq all.q" and add lam to the line that starts with pe_list. Here is a sample configuration:

qname                 all.q
hostlist              @allhosts
seq_no                0
load_thresholds       np_load_avg=1.75
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH INTERACTIVE
ckpt_list             NONE
pe_list               exclusive lam make mpi mpich single
rerun                 FALSE
slots                 1,[compute-0-0.local=2],[compute-1-1.local=2], \
[rest deleted]
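
To confirm the change without reopening the editor, print just the PE list for the queue:

qconf -sq all.q | grep pe_list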

That's it. Now you can submit a sample job to the lam PE and see if it works. Here is a sample SGE script:

#!/bin/bash
#
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -V
#
#$ -pe lam 4
# 
EXECUTABLE="./hello"
mpirun -np $NSLOTS $EXECUTABLE 
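
Assuming the script above is saved as, say, lamtest.sh (the name is arbitrary), submit and monitor it in the usual way:

qsub lamtest.sh
qstat -u $USER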

Make sure that the PATH and LD_LIBRARY_PATH variables point to the appropriate LAM directories, so that you get the correct mpicc/mpif77 for compiling and the correct mpirun for starting processes. There is no need to include any LAM-specific commands such as lamboot or lamhalt in the script.
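
For example, with the default Rocks LAM installation assumed above (the exact library path may differ on your system), something like this in your shell startup files, or in the job script itself, should be enough:

export PATH=/opt/lam/gnu/bin:$PATH
export LD_LIBRARY_PATH=/opt/lam/gnu/lib:$LD_LIBRARY_PATH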

Once a separate lam PE is set up, the next step is to set up the Globus/SGE integration (Globus SGE Connector Installation). If LAM MPI is the only version of MPI used, then all we need to do is specify the path to mpirun and set mpi_pe to lam in sge.pm.

However, if you need to support multiple MPI implementations, then one option is to create a separate jobmanager for LAM. Here are the instructions for setting up a new jobmanager (for now these are manual steps; they could be automated as part of the SGE connector installation).

In the directory $GLOBUS_LOCATION/lib/perl/Globus/GRAM/JobManager/ copy the file sge.pm to sgelam.pm. Change the package name from sge to sgelam, change the values for variables $mpirun and $mpi_pe to /opt/lam/gnu/bin/mpirun and lam, respectively, and delete the line that specifies -machinefile. Here is a diff between the two files:

diff sge.pm sgelam.pm 
15c15
< package Globus::GRAM::JobManager::sge;
---
> package Globus::GRAM::JobManager::sgelam;
36c36
<     $mpirun      = '/usr/local/topspin/mpi/mpich/bin/mpirun';
---
>     $mpirun      = '/opt/lam/gnu/bin/mpirun';
38c38
<     $mpi_pe      = 'mpi';
---
>     $mpi_pe      = 'lam';
497d496
<                                    . "-machinefile \$TMPDIR/machines "
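
A minimal sketch of these manual steps (the edits to $mpirun, $mpi_pe, and the -machinefile line are easiest to make by hand in an editor):

cd $GLOBUS_LOCATION/lib/perl/Globus/GRAM/JobManager
cp sge.pm sgelam.pm
# rename the Perl package so it matches the new file name
perl -pi -e 's/JobManager::sge;/JobManager::sgelam;/' sgelam.pm
# then edit sgelam.pm: set $mpirun and $mpi_pe, and remove the -machinefile option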

In the directory $GLOBUS_LOCATION/etc/grid-services, copy jobmanager-sge to jobmanager-sgelam and replace every occurrence of sge with sgelam. Here is what the file jobmanager-sgelam should look like:

stderr_log,local_cred - /usr/local/globus-4.0.5/libexec/globus-job-manager globus-job-manager -conf /usr/local/globus-4.0.5/etc/globus-job-manager.conf -type sgelam -rdn jobmanager-sgelam -machine-type unknown -publish-jobs -seg

Now you can test whether you can submit a job to jobmanager-sgelam using the following command:

[puri@stage ~]$ globusrun -r olympus.cis.uab.edu/jobmanager-sgelam "&(executable=/shared/home/puri/mpi/hello)(stdout=/shared/home/puri/stdout)(stderr=/shared/home/puri/stderr)(jobtype=mpi)(count=4)"
globus_gram_client_callback_allow successful
GRAM Job submission successful
GLOBUS_GRAM_PROTOCOL_JOB_STATE_PENDING
GLOBUS_GRAM_PROTOCOL_JOB_STATE_DONE
[puri@stage ~]$ 
[puri@stage ~]$ globus-url-copy gsiftp://olympus.cis.uab.edu/shared/home/puri/stdout file:///home/puri/stdout
[puri@stage ~]$ 
[puri@stage ~]$ cat stdout
/opt/gridengine/default/spool/compute-4-25/active_jobs/632072.1/pe_hostfile
compute-4-25
compute-4-25
compute-1-10
compute-1-10

LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University

Hello World! I am 0 of 4
Hello World! I am 2 of 4
Hello World! I am 1 of 4
Hello World! I am 3 of 4

LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University

[puri@stage ~]$ 

Alright, now how do we run R using the Globus command-line tools? Here are some simple commands to test it. The first uses jobmanager-sge, setting the environment variable SGE_PE to lam:

[puri@stage ~]$ globusrun -r olympus.cis.uab.edu/jobmanager-sge "&(executable=/shared/apps/R-2.6.1/bin/R)(arguments=CMD BATCH ./TestSnow.R)(environment = (SGE_PE "lam") )"
globus_gram_client_callback_allow successful
GRAM Job submission successful
GLOBUS_GRAM_PROTOCOL_JOB_STATE_PENDING
GLOBUS_GRAM_PROTOCOL_JOB_STATE_DONE
[puri@stage ~]$

Here is the script generated by Globus on olympus:

[puri@olympus ~]$ more .globus/job/olympus.cis.uab.edu/11896.1197486862/sge_job_script.11916
#!/bin/sh
# Grid Engine batch job script built by Globus job manager

#$ -S /bin/sh
#$ -m n
#$ -o /dev/null
#$ -e /dev/null
X509_USER_PROXY=/shared/home/puri/.globus/job/olympus.cis.uab.edu/11896.1197486862/x509_up; export X509_USER_PROXY
GLOBUS_LOCATION=/usr/local/globus-4.0.5; export GLOBUS_LOCATION
GLOBUS_GRAM_JOB_CONTACT=https://olympus.cis.uab.edu:45001/11896/1197486862/; export GLOBUS_GRAM_JOB_CONTACT
GLOBUS_GRAM_MYJOB_CONTACT=URLx-nexus://olympus.cis.uab.edu:45002/; export GLOBUS_GRAM_MYJOB_CONTACT
HOME=/shared/home/puri; export HOME
LOGNAME=puri; export LOGNAME
#$ -pe lam 1
LD_LIBRARY_PATH=;
export LD_LIBRARY_PATH;
. /opt/gridengine/default/common/settings.sh
# Change to directory requested by user
cd /shared/home/puri
/shared/apps/R-2.6.1/bin/R "CMD" "BATCH" "./TestSnow.R"  < /dev/null&
wait

Note that if we try to run this job as an MPI job then it will hang, since jobmanager-sge is set up to use MPICH. Here is the command used and the script generated:

[puri@stage ~]$ globusrun -r olympus.cis.uab.edu/jobmanager-sge "&(executable=/shared/apps/R-2.6.1/bin/R)(arguments=CMD BATCH ./TestSnow.R)(environment = (SGE_PE "lam") )(jobtype=mpi)(count=4)"
globus_gram_client_callback_allow successful
GRAM Job submission successful
GLOBUS_GRAM_PROTOCOL_JOB_STATE_PENDING
GLOBUS_GRAM_PROTOCOL_JOB_STATE_ACTIVE
[HANGS]
[puri@olympus ~]$ more .globus/job/olympus.cis.uab.edu/11980.1197487036/sge_job_script.11999
#!/bin/sh
# Grid Engine batch job script built by Globus job manager

#$ -S /bin/sh
#$ -m n
#$ -o /dev/null
#$ -e /dev/null
X509_USER_PROXY=/shared/home/puri/.globus/job/olympus.cis.uab.edu/11980.1197487036/x509_up; export X509_USER_PROXY
GLOBUS_LOCATION=/usr/local/globus-4.0.5; export GLOBUS_LOCATION
GLOBUS_GRAM_JOB_CONTACT=https://olympus.cis.uab.edu:45001/11980/1197487036/; export GLOBUS_GRAM_JOB_CONTACT
GLOBUS_GRAM_MYJOB_CONTACT=URLx-nexus://olympus.cis.uab.edu:45002/; export GLOBUS_GRAM_MYJOB_CONTACT
HOME=/shared/home/puri; export HOME
LOGNAME=puri; export LOGNAME
LD_LIBRARY_PATH=;
export LD_LIBRARY_PATH;
. /opt/gridengine/default/common/settings.sh
# Change to directory requested by user
cd /shared/home/puri
#$ -pe lam 4
/usr/local/topspin/mpi/mpich/bin/mpirun -np 4 -machinefile $TMPDIR/machines /shared/apps/R-2.6.1/bin/R "CMD" "BATCH" "./TestSnow.R"  < /dev/null

The error message sent to the file TestSnow.Rout on olympus:

[deleted]
> library(snow)
> c1<-makeCluster(as.integer(Sys.getenv("NSLOTS")), type = "MPI")
Loading required package: Rmpi
Error in if (nslaves <= 0) stop("Choose a positive number of slaves.") :
  missing value where TRUE/FALSE needed
Calls: makeCluster -> switch -> makeMPIcluster -> mpi.comm.spawn
Execution halted

The second option is to use the jobmanager-sgelam:

[puri@stage ~]$ globusrun -r olympus.cis.uab.edu/jobmanager-sgelam "&(executable=/shared/apps/R-2.6.1/bin/R)(arguments=CMD BATCH ./TestSnow.R)"
globus_gram_client_callback_allow successful
GRAM Job submission successful
GLOBUS_GRAM_PROTOCOL_JOB_STATE_PENDING
GLOBUS_GRAM_PROTOCOL_JOB_STATE_DONE
[puri@stage ~]$

Here is the corresponding script generated on olympus:

[puri@olympus ~]$ more .globus/job/olympus.cis.uab.edu/12927.1197487330/sge_job_script.12946
#!/bin/sh
# Grid Engine batch job script built by Globus job manager

#$ -S /bin/sh
#$ -m n
#$ -o /dev/null
#$ -e /dev/null
X509_USER_PROXY=/shared/home/puri/.globus/job/olympus.cis.uab.edu/12927.1197487330/x509_up; export X509_USER_PROXY
GLOBUS_LOCATION=/usr/local/globus-4.0.5; export GLOBUS_LOCATION
GLOBUS_GRAM_JOB_CONTACT=https://olympus.cis.uab.edu:45001/12927/1197487330/; export GLOBUS_GRAM_JOB_CONTACT
GLOBUS_GRAM_MYJOB_CONTACT=URLx-nexus://olympus.cis.uab.edu:45002/; export GLOBUS_GRAM_MYJOB_CONTACT
HOME=/shared/home/puri; export HOME
LOGNAME=puri; export LOGNAME
LD_LIBRARY_PATH=;
export LD_LIBRARY_PATH;
. /opt/gridengine/default/common/settings.sh
# Change to directory requested by user
cd /shared/home/puri
/shared/apps/R-2.6.1/bin/R "CMD" "BATCH" "./TestSnow.R"  < /dev/null&
wait

Here is the output in the file TestSnow.Rout:

> library(snow)
> c1<-makeCluster(as.integer(Sys.getenv("NSLOTS")), type = "MPI")
Loading required package: Rmpi
        1 slaves are spawned successfully. 0 failed.
> clusterCall(c1,function() Sys.info()[c("nodename","machine")])
[[1]]
            nodename              machine
"compute-4-11.local"             "x86_64"

> stopCluster(c1)
[1] 1
> print("Hello")
[1] "Hello"
>
>
> proc.time()
   user  system elapsed
  0.723   0.080   2.034

As you can observe, we only ran on one processor since we did not specify multiple processes. However, if we specify multiple processes with the count argument then Globus treats it as an array job, and if we specify the jobtype as mpi then it will try to invoke R from mpirun. Here are the commands, the scripts generated, and the output.

[puri@stage ~]$ globusrun -r olympus.cis.uab.edu/jobmanager-sgelam "&(executable=/shared/apps/R-2.6.1/bin/R)(arguments=CMD BATCH ./TestSnow.R)(count=4)"
globus_gram_client_callback_allow successful
GRAM Job submission successful
GLOBUS_GRAM_PROTOCOL_JOB_STATE_PENDING
GLOBUS_GRAM_PROTOCOL_JOB_STATE_ACTIVE
GLOBUS_GRAM_PROTOCOL_JOB_STATE_DONE
[puri@stage ~]$
[puri@olympus ~]$ more .globus/job/olympus.cis.uab.edu/14075.1197487657/sge_job_script.14094
#!/bin/sh
# Grid Engine batch job script built by Globus job manager

#$ -S /bin/sh
#$ -m n
#$ -o /dev/null
#$ -e /dev/null
X509_USER_PROXY=/shared/home/puri/.globus/job/olympus.cis.uab.edu/14075.1197487657/x509_up; export X509_USER_PROXY
GLOBUS_LOCATION=/usr/local/globus-4.0.5; export GLOBUS_LOCATION
GLOBUS_GRAM_JOB_CONTACT=https://olympus.cis.uab.edu:45001/14075/1197487657/; export GLOBUS_GRAM_JOB_CONTACT
GLOBUS_GRAM_MYJOB_CONTACT=URLx-nexus://olympus.cis.uab.edu:45002/; export GLOBUS_GRAM_MYJOB_CONTACT
HOME=/shared/home/puri; export HOME
LOGNAME=puri; export LOGNAME
LD_LIBRARY_PATH=;
export LD_LIBRARY_PATH;
. /opt/gridengine/default/common/settings.sh
# Change to directory requested by user
cd /shared/home/puri
/shared/apps/R-2.6.1/bin/R "CMD" "BATCH" "./TestSnow.R"  < /dev/null&
/shared/apps/R-2.6.1/bin/R "CMD" "BATCH" "./TestSnow.R"  < /dev/null&
/shared/apps/R-2.6.1/bin/R "CMD" "BATCH" "./TestSnow.R"  < /dev/null&
/shared/apps/R-2.6.1/bin/R "CMD" "BATCH" "./TestSnow.R"  < /dev/null&
wait
> library(snow)
> c1<-makeCluster(as.integer(Sys.getenv("NSLOTS")), type = "MPI")
Loading required package: Rmpi
        1 slaves are spawned successfully. 0 failed.
> clusterCall(c1,function() Sys.info()[c("nodename","machine")])
[[1]]
           nodename             machine
"compute-1-6.local"            "x86_64"

> stopCluster(c1)
[1] 1
> print("Hello")
[1] "Hello"
>
>
> proc.time()
   user  system elapsed
  0.729   0.072  62.509

The job is executed four times and the output file is overwritten by each process; what we see here is the output from the last process to complete.

[puri@stage ~]$ globusrun -r olympus.cis.uab.edu/jobmanager-sgelam "&(executable=/shared/apps/R-2.6.1/bin/R)(arguments=CMD BATCH ./TestSnow.R)(jobtype=mpi)(count=4)"
globus_gram_client_callback_allow successful
GRAM Job submission successful
GLOBUS_GRAM_PROTOCOL_JOB_STATE_PENDING
GLOBUS_GRAM_PROTOCOL_JOB_STATE_DONE
[puri@stage ~]$
[puri@olympus ~]$ more .globus/job/olympus.cis.uab.edu/15229.1197487888/sge_job_script.15248
#!/bin/sh
# Grid Engine batch job script built by Globus job manager

#$ -S /bin/sh
#$ -m n
#$ -o /dev/null
#$ -e /dev/null
X509_USER_PROXY=/shared/home/puri/.globus/job/olympus.cis.uab.edu/15229.1197487888/x509_up; export X509_USER_PROXY
GLOBUS_LOCATION=/usr/local/globus-4.0.5; export GLOBUS_LOCATION
GLOBUS_GRAM_JOB_CONTACT=https://olympus.cis.uab.edu:45001/15229/1197487888/; export GLOBUS_GRAM_JOB_CONTACT
GLOBUS_GRAM_MYJOB_CONTACT=URLx-nexus://olympus.cis.uab.edu:45002/; export GLOBUS_GRAM_MYJOB_CONTACT
HOME=/shared/home/puri; export HOME
LOGNAME=puri; export LOGNAME
LD_LIBRARY_PATH=;
export LD_LIBRARY_PATH;
. /opt/gridengine/default/common/settings.sh
# Change to directory requested by user
cd /shared/home/puri
#$ -pe lam 4
/opt/lam/gnu/bin/mpirun -np 4 /shared/apps/R-2.6.1/bin/R "CMD" "BATCH" "./TestSnow.R"  < /dev/null
> library(snow)
> c1<-makeCluster(as.integer(Sys.getenv("NSLOTS")), type = "MPI")
Loading required package: Rmpi
Error in if (nslaves <= 0) stop("Choose a positive number of slaves.") :
  missing value where TRUE/FALSE needed
Calls: makeCluster -> switch -> makeMPIcluster -> mpi.comm.spawn
Execution halted

If we invoke R from mpirun then, according to the documentation at Simple Network of Workstations for R, we should use c1<-makeCluster(as.integer(Sys.getenv("NSLOTS"))-1) or c1<-getMPIcluster(). However, those changes gave the following error messages:

> library(snow)
> c1<-makeCluster(as.integer(Sys.getenv("NSLOTS"))-1)
Loading required package: Rmpi
Error in if (nslaves <= 0) stop("Choose a positive number of slaves.") :
  missing value where TRUE/FALSE needed
Calls: makeCluster -> switch -> makeMPIcluster -> mpi.comm.spawn
Execution halted
> library(snow)
> c1<-getMPIcluster()
> clusterCall(c1,function() Sys.info()[c("nodename","machine")])
Error in checkCluster(cl) : not a valid cluster
Calls: clusterCall -> checkCluster
Execution halted

However, if we define the environment variable SGE_PE with the jobmanager-sgelam and specify the count without specifying the jobtype as mpi (i.e., jobtype=single), then we see the following:

[puri@stage ~]$ globusrun -r olympus.cis.uab.edu/jobmanager-sgelam "&(executable=/shared/apps/R-2.6.1/bin/R)(arguments=CMD BATCH ./TestSnow.R)(count=4)(environment = (SGE_PE "lam") )"
globus_gram_client_callback_allow successful
GRAM Job submission successful
GLOBUS_GRAM_PROTOCOL_JOB_STATE_PENDING
GLOBUS_GRAM_PROTOCOL_JOB_STATE_DONE
[puri@stage ~]$
[puri@olympus ~]$ more .globus/job/olympus.cis.uab.edu/18955.1197489084/sge_job_script.18974
#!/bin/sh
# Grid Engine batch job script built by Globus job manager

#$ -S /bin/sh
#$ -m n
#$ -o /dev/null
#$ -e /dev/null
X509_USER_PROXY=/shared/home/puri/.globus/job/olympus.cis.uab.edu/18955.1197489084/x509_up; export X509_USER_PROXY
GLOBUS_LOCATION=/usr/local/globus-4.0.5; export GLOBUS_LOCATION
GLOBUS_GRAM_JOB_CONTACT=https://olympus.cis.uab.edu:45001/18955/1197489084/; export GLOBUS_GRAM_JOB_CONTACT
GLOBUS_GRAM_MYJOB_CONTACT=URLx-nexus://olympus.cis.uab.edu:45002/; export GLOBUS_GRAM_MYJOB_CONTACT
HOME=/shared/home/puri; export HOME
LOGNAME=puri; export LOGNAME
#$ -pe lam 4
LD_LIBRARY_PATH=;
export LD_LIBRARY_PATH;
. /opt/gridengine/default/common/settings.sh
# Change to directory requested by user
cd /shared/home/puri
/shared/apps/R-2.6.1/bin/R "CMD" "BATCH" "./TestSnow.R"  < /dev/null&
/shared/apps/R-2.6.1/bin/R "CMD" "BATCH" "./TestSnow.R"  < /dev/null&
/shared/apps/R-2.6.1/bin/R "CMD" "BATCH" "./TestSnow.R"  < /dev/null&
/shared/apps/R-2.6.1/bin/R "CMD" "BATCH" "./TestSnow.R"  < /dev/null&
wait
> library(snow)
> c1<-makeCluster(as.integer(Sys.getenv("NSLOTS")), type = "MPI")
Loading required package: Rmpi
        4 slaves are spawned successfully. 0 failed.
> clusterCall(c1,function() Sys.info()[c("nodename","machine")])
[[1]]
            nodename              machine
"compute-3-25.local"             "x86_64"

[[2]]
            nodename              machine
"compute-3-25.local"             "x86_64"

[[3]]
            nodename              machine
"compute-2-17.local"             "x86_64"

[[4]]
            nodename              machine
"compute-2-17.local"             "x86_64"

> stopCluster(c1)
[1] 1
> print("Hello")
[1] "Hello"
>
>
> proc.time()
   user  system elapsed
  0.749   0.055   4.542

Again note that this was executed four times and the output is from the last instance.

Here are some of my comments on how to go forward:

Option 1: Figure out how to configure and run Rmpi with MPICH. This would solve all the problems we are currently facing. The documentation at Rmpi for R says that Rmpi works with MPICH 1.2, so we should figure out how to do this.

Option 2: We could write an mpirun wrapper on the clusters that figures out which version of MPI to use based on some environment variables; when executing R it would not invoke the real mpirun at all (of course this is a hack). A sketch of such a wrapper follows after Option 3 below.

Option 3: We can support a new jobtype, say Rtype, and edit sge.pm to support this jobtype.
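
To make Option 2 concrete, here is a minimal, untested sketch of such a wrapper. The MPI_FLAVOR variable and the option-stripping logic are assumptions made for illustration; the mpirun paths are the ones used elsewhere on this page.

#!/bin/sh
# Hypothetical mpirun wrapper (the hack described in Option 2 above).
# MPI_FLAVOR selects the implementation; "none" skips mpirun entirely,
# which is what we would want when R manages MPI itself through Rmpi.
case "${MPI_FLAVOR:-mpich}" in
    lam)   exec /opt/lam/gnu/bin/mpirun "$@" ;;
    mpich) exec /usr/local/topspin/mpi/mpich/bin/mpirun "$@" ;;
    none)
        # strip mpirun-style options and run the remaining command directly
        while [ $# -gt 0 ]; do
            case "$1" in
                -np|-machinefile) shift 2 ;;
                -*)               shift   ;;
                *)                break   ;;
            esac
        done
        exec "$@"
        ;;
esac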

Updated SGE Globus Connector

After the above experiment it became clear that we need to (a) support multiple MPI implementations and (b) fix the "single" jobtype so that it executes only once irrespective of what the count argument is. In order to fix these two problems, I have rewritten a good part of the sge.pm script (attached below) to support both requirements, and added a new jobtype "array" to cover the case where count>1 copies of the executable are actually wanted.
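
For example, with the updated script in place, a snow job that needs a 4-slot lam PE but should start R only once would presumably be submitted as follows (an untested sketch based on the commands above; the new array jobtype would be used instead when count>1 copies of the executable are wanted):

globusrun -r olympus.cis.uab.edu/jobmanager-sge "&(executable=/shared/apps/R-2.6.1/bin/R)(arguments=CMD BATCH ./TestSnow.R)(jobtype=single)(count=4)(environment = (SGE_PE "lam") )"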

Attachments

  • sge.pm (20.5 KB) - added by puri@…. SGE Jobmanager - updated to support multiple MPI implementations