Notes on Running an R Job on cheaha
- To run a simple batch script on the SGE ROCKS cluster, I followed the documentation at http://cheaha.ac.uab.edu/rocks-documentation/4.2.1/launching-batch-jobs.html
- The documentation includes a sample shell script that can be downloaded and executed.
- Jobs are submitted with "qsub", and their status is checked with "qstat".
- Next, I followed the instructions on http://me.eng.uab.edu/wiki/index.php?title=R-userinfo to run R jobs on the SGE ROCKS cluster.
- First, I edited my .bashrc file on cheaha (basically a copy and paste from the above-mentioned page).
- To actually run an R job, I copied an example R script from the same page: http://me.eng.uab.edu/wiki/images/8/85/Hpc_test.R
- I saved this script as hpc_test.R in my current working directory.
- Next, I copied the shell script that submits hpc_test.R and saved it as sge-hpc-test.sh. I used the version that passes no external arguments to the R script:
#!/bin/bash
#$ -cwd
#$ -m beas
#$ -N hpcdemo
#$ -M tapan@uab.edu
# requesting 6hrs wall clock time
#$ -l h_rt=6:00:00
#$ -v PATH,R_HOME,R_LIBS,LD_LIBRARY_PATH,CWD
R CMD BATCH --no-save --no-restore hpc_test.R
- I did not change the -M (notification e-mail) option to my own address, but submitted the job anyway with qsub sge-hpc-test.sh
- After submitting the job, I opened its output file (sge-hpc-test.sh.o58346)
- The contents of sge-hpc-test.sh.o58514 were as follows:
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
- The FAQ on the page http://me.eng.uab.edu/wiki/index.php?title=R-userinfo had a solution to the above problem:
This warning can safely be ignored, or you can prevent it by using "-S /bin/sh" in your job script, for example, add this line near the top of the job script:
#$ -S /bin/sh
- After adding the above line to sge-hpc-test.sh, I submitted the script again with qsub sge-hpc-test.sh
- Three files were generated as a result: R-JOB-BATCH-MODE.o58514, R-JOB-BATCH-MODE.e58514, and hpc_test.Rout
- The contents of R-JOB-BATCH-MODE.o58514 were
WARNING: ignoring environment value of R_HOME
- The contents of hpc_test.Rout are
R : Copyright 2006, The R Foundation for Statistical Computing
Version 2.3.1 (2006-06-01)
ISBN 3-900051-07-0

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> invisible(options(echo = TRUE))
> ##First read in the arguments listed at the command line
> args=(commandArgs(TRUE))
Error in commandArgs(TRUE) : unused argument(s) ( ...)
Execution halted
- Replacing sge-hpc-test.sh with the other example R batch scripts given on http://me.eng.uab.edu/wiki/index.php?title=R-userinfo produced the same error in hpc_test.Rout
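The error suggests that the R 2.3.1 build on cheaha does not accept an argument to commandArgs(). A minimal sketch of a guard that could run before the job follows. It assumes that the trailingOnly argument first appeared in R 2.5.0, which should be verified against the R NEWS file. The helper names version_ge and r_version are hypothetical and not part of the cluster documentation.

```shell
#!/bin/bash
# Sketch: decide whether commandArgs(TRUE) can be used with the installed R.
# ASSUMPTION: trailingOnly support starts at R 2.5.0; verify for your site.

# version_ge A B (hypothetical helper): succeed if dotted version A >= B.
# Only the first three numeric components are compared; missing ones count as 0.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        split(a, x, "."); split(b, y, ".")
        for (i = 1; i <= 3; i++) {
            if (x[i] + 0 > y[i] + 0) exit 0
            if (x[i] + 0 < y[i] + 0) exit 1
        }
        exit 0
    }'
}

# r_version (hypothetical helper): print the installed R version number,
# or nothing if R is not on PATH or the banner cannot be parsed.
r_version() {
    command -v R >/dev/null 2>&1 || return 0
    R --version 2>/dev/null |
        sed -n '1s/^R \(version \)\{0,1\}\([0-9][0-9.]*\).*/\2/p'
}

ver=$(r_version)
if [ -z "$ver" ]; then
    echo "R not found on PATH (or version not recognized)"
elif version_ge "$ver" 2.5.0; then
    echo "R $ver: commandArgs(TRUE) should be available"
else
    echo "R $ver: hardcode values in the R script instead of calling commandArgs(TRUE)"
fi
```

Comparing version components numerically, rather than as strings, keeps a version such as 2.10.0 from sorting before 2.5.0.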
Update to the above script
- Because commandArgs(TRUE) is rejected by the R version on cheaha (2.3.1, per the output above), I changed the hpc_test.R script so that it no longer reads arguments and instead hardcodes the variable assignments
- Changed the sge-hpc-test.sh script to the one that does not pass any external command-line arguments
#!/bin/bash
#$ -cwd
#$ -m beas
#$ -N hpcdemo
#$ -M tapan@uab.edu
# requesting 6hrs wall clock time
#$ -l h_rt=6:00:00
#$ -v PATH,R_HOME,R_LIBS,LD_LIBRARY_PATH,CWD
R CMD BATCH --no-save --no-restore hpc_test.R
- Cleared the session cookies and HTTP authentication from the browser
After making the above changes, I was able to submit and run the R job successfully
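The working setup can be collected into one sketch: a helper that writes the job script, including the -S /bin/sh line from the FAQ, followed by a guarded submission step. The #$ directives are copied from the script above. The function name write_job_script, the placeholder address you@example.edu, and the choice to skip submission when qsub is not installed are illustrative assumptions.

```shell
#!/bin/bash
# Sketch: write the SGE job script for hpc_test.R and submit it if possible.

# write_job_script FILE EMAIL (hypothetical helper).
# Directives are taken from the notes, plus "-S /bin/sh" to avoid the
# "no access to tty" warning.
write_job_script() {
    local file=$1 email=$2
    cat > "$file" <<EOF
#!/bin/bash
#\$ -S /bin/sh
#\$ -cwd
#\$ -m beas
#\$ -N hpcdemo
#\$ -M $email
# requesting 6hrs wall clock time
#\$ -l h_rt=6:00:00
#\$ -v PATH,R_HOME,R_LIBS,LD_LIBRARY_PATH,CWD
R CMD BATCH --no-save --no-restore hpc_test.R
EOF
}

job=sge-hpc-test.sh
write_job_script "$job" "you@example.edu"   # placeholder address (assumption)

if command -v qsub >/dev/null 2>&1; then
    qsub "$job"    # submit the job
    qstat          # examine its status
else
    echo "qsub not found; wrote $job without submitting"
fi
```

Taking the e-mail address as a parameter avoids submitting with someone else's address left in the -M line, which is what happened on the first attempt.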
Notes on executing test scripts from r-group on cheaha and stage
- I followed the instructions on http://projects.uabgrid.uab.edu/r-group/wiki/BasicTests
- To set up a UABGrid account, I clicked on the link https://ca.uabgrid.uab.edu/user/custom_request_cert.php
- This prompts you to log in through either UAB Authentication or OpenID
- After logging in through UAB Authentication, the UABGrid Certification Form opens, showing details such as the user's name, e-mail address, and organization
- The first time I hit Submit Request, I got an error in the form.
ERROR(S) IN FORM
Missing e-mail Address. Please register to UABgrid first
- So, clicking on the given link, I registered with UABGrid, which goes to myVocs. There, I clicked on the preferences tab and added my e-mail address
- After adding my e-mail address, I clicked on https://ca.uabgrid.uab.edu/user/custom_request_cert.php again.
- On the second attempt too, my e-mail address was blank.
- So, I closed the certification page and cleared all the session cookies (with the Firefox add-on "Web Developer", you can click on the cookies tab and clear all your session cookies)
- I clicked on the link https://ca.uabgrid.uab.edu/user/custom_request_cert.php a third time, and now my e-mail address appeared
- I then hit Submit Request and got the following information
You are about to create a certificate using the following information:
User's Name: ppreddy@uab.edu
E-mail Address: ppreddy@uab.edu
Organization: University of Alabama at Birmingham
Department/Unit: UABgrid
Locality: Birmingham
State/Province: Alabama
Country: US
Certificate Life: 1 Year
- Created the certificate
- Then downloaded userkey.pem and usercert.pem to the desktop of my local machine
- cheaha.ac.uab.edu did not have a .globus directory in my /home/ppreddy account, so I created one and copied userkey.pem and usercert.pem into it with scp. When I tried to run gridway commands such as gwhost and gwps on cheaha.ac.uab.edu, I got the following message
[ppreddy@cheaha ~]$ gwhost
-bash: gwhost: command not found
- So, I logged onto stage.uabgrid.uab.edu, which already had a .globus directory in my /home/ppreddy folder, and copied userkey.pem and usercert.pem into it with scp. Gridway commands do run on stage.uabgrid.uab.edu
- Created the example job template testjob with the following contents
EXECUTABLE=/bin/uname
ARGUMENTS=-a
RESCHEDULE_ON_FAILURE=no
- When I ran gwsubmit -t testjob, I got the message
[ppreddy@stage ~]$ gwsubmit -t testjob
FAILED: failed could not register user (check proxy)
The above finding has been reported as a ticket: http://dev.uabgrid.uab.edu/ticket/48
Also, when a job I submit hangs, jobs submitted afterwards by any other user hang as well.
The "failed to register user" proxy problem cascaded to other users because of my failed job.
So gwd had to be killed in order to clear the hung job. This required a hard kill (kill -9 <pid>), after which the lock file was removed (rm $GW_LOCATION/var/lock).
gwd was started again after clearing the hung job.
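The recovery steps above can be sketched as a small helper. It assumes the daemon's process ID is already known and takes the lock path as a parameter (the notes give it as $GW_LOCATION/var/lock). The function name reset_gwd is hypothetical, and restarting gwd is left out because the exact start command for this installation is not recorded in these notes.

```shell
#!/bin/bash
# Sketch: hard-kill a hung daemon process and remove its lock file.

# reset_gwd PID LOCKFILE (hypothetical helper)
reset_gwd() {
    local pid=$1 lock=$2 tries=0
    if [ -z "$pid" ] || [ -z "$lock" ]; then
        echo "usage: reset_gwd PID LOCKFILE" >&2
        return 2
    fi
    if kill -0 "$pid" 2>/dev/null; then
        kill -9 "$pid"    # hard kill, as described in the notes
        # Give the process up to five seconds to disappear.
        while kill -0 "$pid" 2>/dev/null && [ "$tries" -lt 5 ]; do
            sleep 1
            tries=$((tries + 1))
        done
    fi
    rm -f "$lock"         # remove the stale lock file
}

# Example use (needs the privileges of the account that runs gwd):
#   reset_gwd <pid> "$GW_LOCATION/var/lock"
```

Removing the lock only after the kill attempt keeps a still-running daemon from losing its lock underneath it.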
- Instead of submitting the template, I copied the testjob lines directly onto the command line and got the following output
[ppreddy@stage ~]$ /bin/uname -a RESCHEDULE_ON_FAILURE=no
Linux stage.uabgrid.uab.edu 2.6.9-55.0.2.ELsmp #1 SMP Tue Jun 26 14:30:58 EDT 2007 i686 i686 i386 GNU/Linux
- After encountering the "failed to register user" proxy problem, I searched the gw-users mailing list and found a discussion thread on the same problem: http://www.globus.org/mail_archive/gridway-user/2007/03/msg00007.html Two suggestions were posted there:
- Check whether your Java version is 1.4. If it is, upgrade to Java 1.5
> The Java version in our environment is 1.5
- Check whether gw was installed in multi-user mode and run gw_em_mad_ws
> I did execute the gw_em_mad_ws command as given in http://dev.uabgrid.uab.edu/uabgrid-stage/wiki/BuildTheStage
sudo -u ppreddy /opt/gw/bin/gw_em_mad_ws
for which I got the message "Not authorized to do so"
John Paul found out that I had not been added to the "gwusers" group, which is why I was not able to submit jobs to gridway.
I was added to the gwusers group with the command usermod -a -G gwusers ppreddy
and now I am able to submit jobs to gridway.
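Since the root cause was a missing group membership, a preflight check before gwsubmit could catch it earlier. The sketch below checks that the user belongs to a given group and that both credential files exist under ~/.globus. The helper names in_group and check_creds are hypothetical. The permission check assumes, as is usual for Globus credentials, that the private key should be accessible only by its owner. The group name gwusers and the file names come from the notes above.

```shell
#!/bin/bash
# Sketch: preflight checks before running gwsubmit.

# in_group USER GROUP (hypothetical helper): succeed if USER is in GROUP.
in_group() {
    id -nG "$1" 2>/dev/null | tr ' ' '\n' | grep -qxF -- "$2"
}

# check_creds DIR (hypothetical helper): succeed if both PEM files exist in
# DIR and the private key has no group or other permission bits set.
check_creds() {
    local dir=$1
    [ -f "$dir/usercert.pem" ] || return 1
    [ -f "$dir/userkey.pem" ] || return 1
    # Characters 5-10 of the mode string are the group and other bits.
    [ "$(ls -lL "$dir/userkey.pem" | cut -c5-10)" = "------" ]
}

user=$(id -un)
if ! in_group "$user" gwusers; then
    echo "$user is not in gwusers; an admin can run: usermod -a -G gwusers $user"
fi
if ! check_creds "$HOME/.globus"; then
    echo "check userkey.pem and usercert.pem under $HOME/.globus"
fi
```

Both checks only report problems; neither changes group membership or file permissions.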
