Introduction
Even though ROCKS has a Globus Roll that supports a web-services version of Globus 4.0.x, we have chosen not to use that roll for now. Some of the reasons are:
- we want to keep up with the latest Globus releases without having to wait for a repackaging cycle
- the Globus roll installs on the head node and each compute node - this is not an ideal configuration unless provisions are made to address the NAT issues when crossing the head node (it's doable, we're just not going to do this quite yet)
- Globus is most useful in our current configuration as a staging interface to the cluster and this requires an install on the head node alone.
Installing Globus involves a number of steps but is fairly simple. Most of the documentation has been collected elsewhere. We will follow the steps in the UABgrid-stage setup, and this page will document the additional tasks or modifications required for cheaha. The goal is to create a reproducible install for cheaha.
Background
Remove existing package
The Cheaha ROCKS 4.2.1 release was installed with the globus roll. The roll and related configuration have been removed from the node configuration database in /home/install; however, the globus packages on the head node still need to be removed. Ideally we would run:
rpm -e globus globus-simple-ca
but this globus install provides Perl libraries to SGE for GRAM and URL, so removing the globus package breaks this dependency. The required libraries will be provided by the new globus install we are performing; however, the rpm record of the dependency will not be satisfied.
Reconsidering install procedure...
We will hold off on this step until the new globus package is built.
Update the globus user
The globus account defined by the ROCKS globus package lists /opt/globus as the home directory for the user. This is OK for a fully packaged configuration, but we will be building globus from source. It is best to build as the user globus, and we shouldn't build from inside the install directory. Our solution is to change the globus account to have a directory under /home like normal users. (As a side benefit, should we later decide to install globus on the compute nodes, this would make it easy to run the make install across the cluster.)
Cheaha already has the globus account defined in /etc/passwd from the ROCKS roll install. We simply need to create a standard home directory for globus, update the automount map, and update the user account record.
cp -rp /etc/skel /export/home/globus
chown -R globus.globus /export/home/globus
echo globus cheaha.local:/export/home/globus >> /etc/auto.home
usermod -d /home/globus globus
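Appending to /etc/auto.home with >> adds a duplicate line every time the step is repeated. The following is a minimal sketch of a re-runnable variant. The function name add_auto_home_entry is hypothetical, and the map file is passed as an argument so the helper can be tried on a copy first.

```shell
# Append "<key> <location>" to an automount map unless the key is already there.
# Hypothetical helper for illustration; not part of ROCKS or autofs.
add_auto_home_entry() {
    map_file=$1
    key=$2
    location=$3
    # The key is the first whitespace-delimited field of a map line.
    if awk -v k="$key" '$1 == k { found = 1 } END { exit !found }' "$map_file" 2>/dev/null; then
        echo "entry for $key already present in $map_file"
    else
        echo "$key $location" >> "$map_file"
    fi
}

# Example (as root on the head node):
# add_auto_home_entry /etc/auto.home globus cheaha.local:/export/home/globus
```

Running the helper twice leaves a single globus line in the map.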
Preparation
Install Latest Java & Ant
Follow the instructions in Getting Started for installing ANT and JAVA.
Verify Existing Install
Java, Ant, and the globus services all seem satisfactory.
Globus WS configuration and setup for rocks can be found here:
http://goc.pragma-grid.net/wiki/index.php/GT4_WSGram_with_SGE_on_ROCKS_4.2
The main thing we need to do is create the uabgrid certs, it seems.
The example /opt/globus/globus-ws.init has two bugs: there is an extra space on the first line, and the comment on the second line starts with a * instead of a #. With both fixed, the script reads:
#!/bin/sh
# configures Globus WS container
export GLOBUS_LOCATION=/opt/globus
export JAVA_HOME=/usr/java/jdk1.5.0_07
export ANT_HOME=/opt/rocks
export GLOBUS_OPTIONS="-Xms256M -Xmx512M"
export SGE_CELL=default
export SGE_ARCH=lx26-amd64
export SGE_EXECD_PORT=537
export SGE_QMASTER_PORT=536
export SGE_ROOT=/opt/gridengine
. $GLOBUS_LOCATION/etc/globus-user-env.sh
cd $GLOBUS_LOCATION
case "$1" in
start)
$GLOBUS_LOCATION/sbin/globus-start-container-detached -p 8443
;;
stop)
$GLOBUS_LOCATION/sbin/globus-stop-container-detached
;;
*)
echo "Usage: globus {start|stop}" >&2
exit 1
;;
esac
exit 0
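Neither header problem is a shell syntax error, so a syntax check alone would not flag them. As a rough sketch, the first two lines can be checked directly. The function name check_init_header is hypothetical.

```shell
# Report the two header problems described above: a first line that is not
# exactly "#!/bin/sh", and a second line that does not start with "#".
# Hypothetical helper for illustration.
check_init_header() {
    script=$1
    status=0
    first=$(sed -n '1p' "$script")
    second=$(sed -n '2p' "$script")
    if [ "$first" != "#!/bin/sh" ]; then
        echo "line 1 is not exactly '#!/bin/sh': '$first'"
        status=1
    fi
    case $second in
        "#"*) ;;
        *) echo "line 2 is not a comment: '$second'"; status=1 ;;
    esac
    return $status
}

# Example:
# check_init_header /opt/globus/globus-ws.init
```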
In the Globus Container setup section you need to execute:
chmod +x /etc/init.d/globus-ws
chmod +x /opt/globus/globus-ws.init
And create a valid /etc/grid-security/grid-mapfile before you can successfully run:
service globus-ws start
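A grid-mapfile maps a quoted certificate distinguished name to one or more local accounts, one mapping per line. The sketch below flags lines that do not have that shape before the container is started. The function name check_grid_mapfile is hypothetical, the DN in the example comment is made up, and the pattern is an approximation rather than the parser Globus itself uses.

```shell
# Print any line of a grid-mapfile that is not of the form
#   "<distinguished name>" <local account>[,<local account>...]
# Blank lines and "#" comment lines are skipped.
# Returns non-zero if any line is flagged. Hypothetical helper.
check_grid_mapfile() {
    awk '
        /^[ \t]*$/ || /^[ \t]*#/ { next }
        !/^"\/[^"]+"[ \t]+[A-Za-z_][A-Za-z0-9_.,-]*[ \t]*$/ {
            print FILENAME ": line " NR ": " $0
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}

# Example of a line that passes (made-up DN and account):
#   "/O=Example/OU=People/CN=Test User" testuser
# check_grid_mapfile /etc/grid-security/grid-mapfile
```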
Enable GSI SSH
In order to support secure remote access for users developing job scripts on cheaha, we need to support GSI-based ssh access. This is fairly easy to configure.
cp /opt/globus/sbin/SXXsshd /etc/init.d/sshd-globus
chmod +x /etc/init.d/sshd-globus
cd /etc/init.d/
patch -b sshd-globus << EOF
10c10
< # Provides: sshd
---
> # Provides: sshd-globus
31c31
< SSHD_ARGS=""
---
> SSHD_ARGS="-p 2222"
EOF
chkconfig --add sshd-globus
chkconfig sshd-globus on
service sshd-globus start
This should take care of all the steps necessary to run the Globus sshd alongside the standard sshd. Grid users can now access cheaha with:
gsissh -p 2222 cheaha.ac.uab.edu
For background on this config see bug #17. Before the Globus sshd can replace the standard sshd on the standard port we will likely have to address PAM support (bug #18).
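The patch in this section depends on the two lines sitting at exactly lines 10 and 31 of SXXsshd, which may not hold for other Globus builds. A possible alternative, sketched here, makes the same two changes by matching on content with sed. The function name retarget_sshd_init is hypothetical, and the patterns assume the original lines read exactly as they appear in the patch.

```shell
# Make the same two changes as the patch, but match on line content rather
# than on line numbers. The result is written to stdout.
# Hypothetical helper for illustration.
retarget_sshd_init() {
    sed -e 's/^# Provides: sshd$/# Provides: sshd-globus/' \
        -e 's/^SSHD_ARGS=""$/SSHD_ARGS="-p 2222"/' "$1"
}

# Example:
# retarget_sshd_init /opt/globus/sbin/SXXsshd > /etc/init.d/sshd-globus
# chmod +x /etc/init.d/sshd-globus
```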
Enable RFT
Note: on ROCKS, running mysql as root still requires the root user's password. It is not configured without a password, as is the default on CentOS4. You'll need cheaha's root password to perform the following test; sudo won't do.
The default instructions reference a $GLOBUS_LOCATION/etc/globus_wsrf_rft/jndi-config.xml that doesn't seem to exist, at least on cheaha. You can create one that looks like the following. Since the mysql driver changes have already been applied, all you have to change is the password:
<?xml version="1.0" encoding="UTF-8"?>
<jndiConfig xmlns="http://wsrf.globus.org/jndi/config">
<service name="ReliableFileTransferFactoryService">
<resource name="home"
type="org.globus.wsrf.impl.ServiceResourceHome">
<resourceParams>
<parameter>
<name>
factory
</name>
<value>
org.globus.wsrf.jndi.BeanFactory
</value>
</parameter>
</resourceParams>
</resource>
<resource name="mdsConfiguration"
type="org.globus.wsrf.impl.servicegroup.client.MDSConfiguration">
<resourceParams>
<parameter> <name>reg</name>
<value>true</value>
</parameter>
<parameter> <name>factory</name>
<value>org.globus.wsrf.jndi.BeanFactory</value>
</parameter>
</resourceParams>
</resource>
</service>
<service name="ReliableFileTransferService">
<resource name="configuration"
type="org.globus.transfer.reliable.service.RFTConfiguration">
<resourceParams>
<parameter>
<name>
factory
</name>
<value>
org.globus.wsrf.jndi.BeanFactory
</value>
</parameter>
<parameter>
<name>
backOff
</name>
<value>
10000
</value>
</parameter>
<parameter>
<name>
maxActiveAllowed
</name>
<value>
100
</value>
</parameter>
</resourceParams>
</resource>
<resource name="dbConfiguration"
type="org.globus.transfer.reliable.service.database.RFTDatabaseOptions">
<resourceParams>
<parameter>
<name>
factory
</name>
<value>
org.globus.wsrf.jndi.BeanFactory
</value>
</parameter>
<parameter>
<name>
driverName
</name>
<value>
com.mysql.jdbc.Driver
</value>
</parameter>
<parameter>
<name>
connectionString
</name>
<value>
jdbc:mysql:///rftDatabase
</value>
</parameter>
<parameter>
<name>
userName
</name>
<value>
globus
</value>
</parameter>
<parameter>
<name>
password
</name>
<value>
foo
</value>
</parameter>
<parameter>
<name>
maxActive
</name>
<value>
20
</value>
</parameter>
<parameter>
<name>
maxIdle
</name>
<value>
10
</value>
</parameter>
<parameter>
<name>
maxWait
</name>
<value>
-1
</value>
</parameter>
</resourceParams>
</resource>
<resource name="home"
type="org.globus.transfer.reliable.service.ReliableFileTransferHome">
<resourceParams>
<parameter>
<name>factory</name>
<value>org.globus.wsrf.jndi.BeanFactory</value>
</parameter>
<parameter>
<name>resourceClass</name>
<value>org.globus.transfer.reliable.service.ReliableFileTransferResource</value>
</parameter>
<parameter>
<name>resourceKeyName</name>
<value>{http://www.globus.org/namespaces/2004/10/rft}TransferKey</value>
</parameter>
<parameter>
<name>resourceKeyType</name>
<value>java.lang.String</value>
</parameter>
<parameter>
<name>sweeperDelay</name>
<value>60000</value>
</parameter>
</resourceParams>
</resource>
</service>
</jndiConfig>
The rft test is easy to perform if you follow these instructions as a normal user with a proxy initialized:
cat > /tmp/rft.xfr << EOF
true
16000
16000
false
1
true
1
null
null
false
10
gsiftp://cheaha.ac.uab.edu:2811/tmp/rftTest.tmp
gsiftp://cheaha.ac.uab.edu:2811/tmp/rftTest_Done.tmp
EOF
rm -rf /tmp/rftTest.tmp /tmp/rftTest_Done.tmp
cp /etc/hosts /tmp/rftTest.tmp
rft -h cheaha.ac.uab.edu -f /tmp/rft.xfr
diff /tmp/rftTest_Done.tmp /etc/hosts
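To repeat the test against another GridFTP host, the transfer file can be generated from a hostname. This is only a sketch: write_rft_transfer is a hypothetical name, the option values are copied from the transfer file above in the same order, and the one-value-per-line layout is an assumption about the format rft expects.

```shell
# Write an RFT transfer file with the same option values, in the same order,
# as the one used in the test above, for a given GridFTP host.
# Hypothetical helper; one value per line is an assumed layout.
write_rft_transfer() {
    out_file=$1
    host=$2
    cat > "$out_file" << EOF
true
16000
16000
false
1
true
1
null
null
false
10
gsiftp://${host}:2811/tmp/rftTest.tmp
gsiftp://${host}:2811/tmp/rftTest_Done.tmp
EOF
}

# Example:
# write_rft_transfer /tmp/rft.xfr cheaha.ac.uab.edu
```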
Sanity Checking
All the sanity checks seem to work except for the ws run:
globusrun-ws -submit -F localhost -s -c /bin/uname -a
It fails with either localhost or the FQDN.
The mds4 query needs to be done with the full hostname because otherwise the certs don't match up. It seems localhost is still identifying itself with the cert set up by simpleca, which was replaced with the uabgrid cert.
wsrf-query -x -s https://cheaha.ac.uab.edu:8443/wsrf/services/DefaultIndexService
Install SGE connectors
Before running any of the rest, change ownership of /opt/globus back to the globus user. It seems other install steps (maybe root config on first login) change some ownership which will interfere with container startup.
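Before changing ownership, it can help to see which files have actually drifted. A small sketch using find follows. The function name not_owned_by is hypothetical.

```shell
# List everything under a directory that is not owned by the given user
# (user name or numeric uid). Prints nothing when ownership is consistent.
# Hypothetical helper for illustration.
not_owned_by() {
    find "$2" ! -user "$1" -print
}

# Example:
# not_owned_by globus /opt/globus
```

An empty result means nothing under the directory needs its ownership changed.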
Install the SGE extensions as described in the uabgrid-stage setup, not as described in the gt4rocks wiki. The wiki references an old (0.12) gram adaptor that is superseded by the 1.1 version.
Some background:
The gt4rocks wiki suggests going to the aist page for the 0.12 gram adaptor. There is not a lot of history recorded with the adaptor versions. It seems the aist 0.12 adaptor is a follow-on to the gt3 adaptor written by the uk folks in 2003. The 1.1 adaptor is a newer version (just for gt4, though). Use this one as instructed.
I followed the path setting to put the mpi path first, then ran gpt-postinstall as globus.
The gpt-postinstall step needs to be run as root, probably because ROCKS installs globus as root. If you try to run it as globus, you get permission denied errors and the process fails. If you run it as root, it succeeds (it does a whole lot, so cross your fingers) and ends with this statement:
WARNING: The following packages were not set up correctly:
globus_gram_job_manager_setup_condor-noflavor-pgm
globus_gram_job_manager_setup_lsf-noflavor-pgm
globus_gram_job_manager_setup_pbs-noflavor-pgm
globus_scheduler_event_generator_condor_setup-noflavor-pgm
globus_scheduler_event_generator_lsf_setup-noflavor-pgm
globus_scheduler_event_generator_pbs_setup-noflavor-pgm
globus_scheduler_provider_setup_condor-noflavor-pgm
globus_scheduler_provider_setup_lsf-noflavor-pgm
globus_scheduler_provider_setup_pbs-noflavor-pgm
globus_wsrf_gram_service_java_setup_condor-noflavor-pgm
globus_wsrf_gram_service_java_setup_lsf-noflavor-pgm
globus_wsrf_gram_service_java_setup_pbs-noflavor-pgm
Check the package documentation or run postinstall -verbose to see what happened
This seems reasonable because we don't have these schedulers active.
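The warning list is long enough that an unexpected entry is easy to miss. As a sketch, saved gpt-postinstall output can be filtered for any package that is not tied to condor, lsf, or pbs. The function name unexpected_setup_failures and the log file name in the example are hypothetical, and the filter assumes package lines start with globus_ as in the output above.

```shell
# Read gpt-postinstall output on stdin and print any globus_* package line
# that is NOT for condor, lsf, or pbs.
# Returns non-zero if such a line is found. Hypothetical helper.
unexpected_setup_failures() {
    awk '
        /^[ \t]*globus_/ && !/_(condor|lsf|pbs)/ { print; bad = 1 }
        END { exit (bad ? 1 : 0) }
    '
}

# Example, with the warning text saved to a (hypothetical) log file:
# unexpected_setup_failures < /tmp/gpt-postinstall.log
```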
Once gpt-postinstall (which runs the post-install setup scripts) has run, it won't run them again. You can force all the scripts to run again, but this should be avoided (there are a lot of them). To find out what has been installed, you can use gpt-query and search for the 4 adaptors installed. They should all be listed with 1.1 versions.
I ran the setup script manually as described in the gt4rocks wiki. It seems to have succeeded. These are different instructions than in the stage wiki, so I'm not sure what the impact is.
Fixed above: There are numerous permission errors after running gpt-postinstall as root. They relate to log files and such. This is fixed with chown globus.globus /opt/globus.
Restarting globus-ws succeeds partially, but the new sge jobmanager is not found. I assume the build steps didn't actually end up building it.
The job manager script was not installed. It needs to be added manually: bug #10. This fix has been superseded by the globus-scheduler-provider-sge script in ticket:20.
The patch to the sge.pm script was a slightly modified form of both the gt4rocks wiki and stage instructions. I followed the first half of the gt4rocks wiki, skipped the $tag removal in the stage wiki, but added the -machines file argument from the stage wiki.
Note: it seems that some step resets both the sge.pm and the jndi config file above (I assume this is gpt-postinstall, but I'm not sure). Just be aware that your rft config might get hosed. If you see rft complaints in the container log about not being able to contact the postmaster, that means you have been switched back to postgres.
Also note: the SGE_QMASTER settings were chosen from the gt4rocks wiki because the test globusrun-ws was reporting that SGE_QMASTER_PORT was not set and sge-master was not found. Adding it to the ENV{} hash fixes that error. I don't know the origin of that error.
The firewall didn't need to be changed because it seems it might not be too strict. We may want to crank it down at some point.
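Since the RFT config can be silently reset, a quick check of which JDBC driver the jndi config names can save a trip through the container log. The sketch below assumes the reset config names the standard PostgreSQL driver class, org.postgresql.Driver. The function name rft_db_driver is hypothetical.

```shell
# Report which database driver an RFT jndi-config.xml refers to.
# Prints "mysql", "postgres", or "unknown". Hypothetical helper.
rft_db_driver() {
    if grep -q 'com\.mysql\.jdbc\.Driver' "$1"; then
        echo mysql
    elif grep -q 'org\.postgresql\.Driver' "$1"; then
        echo postgres
    else
        echo unknown
    fi
}

# Example:
# rft_db_driver "$GLOBUS_LOCATION/etc/globus_wsrf_rft/jndi-config.xml"
```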
Globus Container Needs to Advertise Hostname
Since cheaha is behind a firewall and recognizes its own external IP address as the firewalled network IP address, we need to tell the globus container to advertise itself by name and not by IP. This causes clients outside the firewalled network to get cheaha's FQDN and do a lookup to get the IP instead of relying on the (incorrect) internal IP address that it would ordinarily advertise.
The fix is to add the publishHostName parameter to the globalConfiguration section of $GLOBUS_LOCATION/etc/globus_wsrf_core/server-config.wsdd and $GLOBUS_LOCATION/etc/globus_wsrf_core/client-server-config.wsdd files. Simply add
<parameter name="publishHostName" value="true" />
to the globalConfiguration section, so that it looks like this:
<globalConfiguration>
<parameter name="publishHostName" value="true" />
Then restart the container as root:
/etc/init.d/globus-ws restart
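The same edit has to be made in two files and should not be applied twice. A minimal sketch of a repeatable version follows. The function name add_publish_hostname is hypothetical, and it assumes the opening tag appears as a bare <globalConfiguration> on a line of its own.

```shell
# Print a .wsdd file with the publishHostName parameter added right after the
# first <globalConfiguration> line, unless the parameter is already present.
# Hypothetical helper; assumes the opening tag has no attributes.
add_publish_hostname() {
    if grep -q 'name="publishHostName"' "$1"; then
        cat "$1"
    else
        awk '
            { print }
            /<globalConfiguration>/ && !added {
                print "    <parameter name=\"publishHostName\" value=\"true\" />"
                added = 1
            }
        ' "$1"
    fi
}

# Example (write to a new file, inspect it, then move it into place):
# f=$GLOBUS_LOCATION/etc/globus_wsrf_core/server-config.wsdd
# add_publish_hostname "$f" > "$f.new"
```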
Ganglia Glue for MDS
In order to get the webservices MDS to report full cluster statistics it needs to query a local utilization monitor. By default it is not configured to query one. The mds configuration to use Ganglia is documented in the Globus Quick Start guide. Basically, edit $GLOBUS_LOCATION/etc/globus_wsrf_mds_usefulrp/gluerp.xml and follow the steps therein.
Add the Glue Provider
We need to add this block of configuration to link up Ganglia and MDS (in addition to the edit of gluerp.xml above). Edit the file $GLOBUS_LOCATION/etc/gram-service-SGE/gluerp-config.xml and add the following block after the </ns1:resourcePropertyImpl>.
<ns1:resourcePropertyElementProducers>
<ns1:className>org.globus.mds.usefulrp.glue.GangliaElementProducer</ns1:className>
<ns1:arguments>localhost</ns1:arguments>
<ns1:arguments>8649</ns1:arguments>
<ns1:period>300</ns1:period>
<ns1:transformClass>org.globus.mds.usefulrp.rpprovider.transforms.GLUEComputeElementTransform</ns1:transformClass>
</ns1:resourcePropertyElementProducers>
If patch and XML would play nice together we could do something like this:
patch -d $GLOBUS_LOCATION/etc/gram-service-SGE -b gluerp-config.xml << EOF
10a11,19
>
> <ns1:resourcePropertyElementProducers>
> <ns1:className>org.globus.mds.usefulrp.glue.GangliaElementProducer</ns1:className>
> <ns1:arguments>localhost</ns1:arguments>
> <ns1:arguments>8649</ns1:arguments>
> <ns1:period>300</ns1:period>
> <ns1:transformClass>org.globus.mds.usefulrp.rpprovider.transforms.GLUEComputeElementTransform</ns1:transformClass>
> </ns1:resourcePropertyElementProducers>
>
20a30
>
EOF
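Because the patch is tied to exact line numbers, one alternative is to insert the block after the first line containing </ns1:resourcePropertyImpl>, wherever that line happens to be. This is a sketch under that assumption. The function name add_ganglia_producer is hypothetical.

```shell
# Print gluerp-config.xml with the Ganglia element producer block inserted
# after the first line containing </ns1:resourcePropertyImpl>.
# The file is printed unchanged if a GangliaElementProducer is already there.
# Hypothetical helper for illustration.
add_ganglia_producer() {
    if grep -q 'GangliaElementProducer' "$1"; then
        cat "$1"
        return 0
    fi
    awk '
        { print }
        /<\/ns1:resourcePropertyImpl>/ && !added {
            print "<ns1:resourcePropertyElementProducers>"
            print "<ns1:className>org.globus.mds.usefulrp.glue.GangliaElementProducer</ns1:className>"
            print "<ns1:arguments>localhost</ns1:arguments>"
            print "<ns1:arguments>8649</ns1:arguments>"
            print "<ns1:period>300</ns1:period>"
            print "<ns1:transformClass>org.globus.mds.usefulrp.rpprovider.transforms.GLUEComputeElementTransform</ns1:transformClass>"
            print "</ns1:resourcePropertyElementProducers>"
            added = 1
        }
    ' "$1"
}

# Example (write to a new file, inspect it, then move it into place):
# f=$GLOBUS_LOCATION/etc/gram-service-SGE/gluerp-config.xml
# add_ganglia_producer "$f" > "$f.new"
```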
