OpenPBS Install Notes

OpenPBS 2.3.16 source can be downloaded from the OpenPBS web site provided you complete the request form.  Manuals can be downloaded from the same site once registered via the request form.

MAUI 3.0.7p8 source can be downloaded, along with documentation, from the Maui web site.  A MAUI-PBS integration guide can be found on the Supercluster.org site.



Build/Install Server with SSH
=============================

** You must have tcl8.0 and tk8.0 installed.  (Debian package tk8.0-dev)

> tar -zxvf OpenPBS_2_3_16.tar.gz
> cd OpenPBS_2_3_16
> make clean
> ./configure --prefix=/usr/local/pbs --exec-prefix=/usr/local/pbs --datadir=/var/spool/pbs --set-server_home=/var/spool/pbs --with-scp
> make
> su
> make install
> cd doc
> make install
> vi /etc/init.d/pbs
NOTE:  See startup script section for contents of this file.
> chmod u+x /etc/init.d/pbs

NOTE:  If you have mounted /usr/local/pbs on other machines for access to
the Server/Executable/Client applications, you may need to change permissions
on a few files (if no-root-squash is used on the mount).
    > chmod go+rx /usr/local/pbs/sbin/pbs_mom
    > chmod go+rx /usr/local/pbs/sbin/pbs_sched
    > chmod go+rx /usr/local/pbs/sbin/pbs_server
You will also need to copy across a different /etc/init.d/pbs that starts
only the daemons you want on each machine.  Daemons are normally disabled by
being absent, but if pbs_mom, pbs_sched, and pbs_server are present on all
machines, all of them will try to start.



Build/Install Executable hosts with SSH
=======================================

If the install prefix is shared by the server and all executable hosts,
then this step must be performed before Server installation.  However,
you will need to disable the starting of pbs_sched and pbs_server in
the /etc/init.d/pbs startup script on each of the executable hosts.

> cd OpenPBS_2_3_16
> make clean
> ./configure --prefix=/usr/local/pbs --exec-prefix=/usr/local/pbs --datadir=/var/spool/pbs --set-server_home=/var/spool/pbs --with-scp --disable-server --disable-gui --set-sched=no
> make

NOTE:  If you are working with a group of hosts with slightly different
operating system versions, the most success seems to be obtained by compiling
on the machine with the lowest version OS.

For each executable host (with the same OS and hardware)...
> ssh root@<remote>
> cd ????/OpenPBS_2_3_16
> make install
> cd doc
> make install
> scp roberts:/etc/init.d/pbs /etc/init.d/pbs
> exit

ERRORS:
##### make error:
   make[4]: *** No rule to make target `<built-in>', needed by `attr_atomic.o'.  Stop.
This is caused by a later version of GNU C++.  There is a simple fix, but you
will need to remove the OpenPBS source directory and untar it again.
> \rm -R OpenPBS_2_3_16
> tar -zxvf OpenPBS_2_3_16.tar.gz
> cd ????/OpenPBS_2_3_16
> vi buildutils/makedepend-sh
>>>> modify "eval $CPP..." command at line 576 of 758 to include 'grep -v ">$"'
                eval $CPP $arg_cc $d/$s $errout | \
                  sed -n -e "s;^\# [0-9][0-9 ]*\"\(.*\)\";$f: \1;p" | \
                  grep -v "$s\$" | \
                  grep -v ">$" | \
                  sed -e 's;\([^ :]*: [^ ]*\).*;\1;' \
                  >> $TMP

##### make error:
   ../lib/Liblog/liblog.a(pbs_log.o)(.text+0x389): In function `log_record':
   /root/OpenPBS_2_3_16/src/lib/Liblog/pbs_log.c:306: undefined reference to `errno'
This also seems to be due to a #include dependency that changed in later
versions of the compiler.
> cd ????/OpenPBS_2_3_16
> vi src/lib/Liblog/pbs_log.c
>>>> add this line near the top (say line 96)
#include <errno.h>




Adding a New Executable node:
=============================

NOTE:  Please skip this section if you are installing for the first time.
Steps to follow:
1) Do section "Build/Install Executable hosts with SSH"
2) Do section "Configuring each Executable node."
3) Do section "Using SCP for file transfer"
4) Update the SSH hosts files on all nodes...
    NOTE:  See "Using SCP for file transfer" for SSH2 or OpenSSH settings,
           if the following setup doesn't work.
    > scp existingnode:/etc/ssh/ssh_known_hosts /etc/ssh/ssh_known_hosts
    > scp existingnode:/etc/ssh/shosts.equiv /etc/ssh/shosts.equiv
    > echo newhostname.ph.unimelb.edu.au >> /etc/ssh/shosts.equiv
    Look at the new node's local public key (ignoring the username@host bit)
    > cat /etc/ssh/ssh_host_rsa_key.pub
    > vi /etc/ssh/ssh_known_hosts
    >>>>> add the following "hostname.domain,hostname,ip key" lines
    >>>>> (leave off username@host)...
    newhostname.ph.unimelb.edu.au,newhostname,128.250.0.0   ssh-rsa AAAAB3N...

    Copy these changes off to all other nodes and PBS server...
    > scp /etc/ssh/ssh_known_hosts othernode:/etc/ssh/
    > scp /etc/ssh/shosts.equiv othernode:/etc/
    ...
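The "leave off username@host" trimming above can be scripted.  Here is a small
helper (hypothetical name; assumes the OpenSSH one-line "keytype base64-key
comment" .pub format) that builds a known_hosts line from a hostname, IP, and
key file:

```shell
# known_hosts_line host domain ip keyfile
# Prints a "host.domain,host,ip keytype key" line, dropping the
# trailing user@host comment from the .pub file.
known_hosts_line() {
    host=$1; domain=$2; ip=$3; keyfile=$4
    printf '%s.%s,%s,%s %s\n' "$host" "$domain" "$host" "$ip" \
        "$(cut -d' ' -f1-2 "$keyfile")"
}

# e.g.  known_hosts_line newhostname ph.unimelb.edu.au 128.250.0.0 \
#           /etc/ssh/ssh_host_rsa_key.pub >> /etc/ssh/ssh_known_hosts
```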

5) Register the new node with the server...
    > qmgr
    Qmgr: create node newhostname np=1,ntype=cluster
    Qmgr: quit
NOTE:  np=1 specifies the maximum number of jobs (virtual processors) that
may run on the host.
6) Check the status of the new node.  Give it a few minutes for the status
   to become "free"
    > /usr/local/pbs/bin/pbsnodes -a
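pbsnodes -a prints one multi-line block per node, with the node name flush
left and its attributes indented.  A quick filter (a sketch, assuming that
classic output format) condenses it to one "node: state" line each, handy
when waiting for a new node to go "free":

```shell
# Reduce "pbsnodes -a" output to "node: state" lines.
# Node names start in column 1; attribute lines are indented.
pbsnodes_states() {
    awk '/^[^ \t]/ { node = $1 }
         /state =/ { print node ": " $3 }'
}

# e.g.  /usr/local/pbs/bin/pbsnodes -a | pbsnodes_states
```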




Maintenance of Nodes and Queues:
================================

To stop the submission of jobs to a particular node you need to set
the node state to offline...

> qmgr
Qmgr: set node hostname state=offline

Any existing jobs on the node will complete normally and no further
jobs will be submitted to the node.  You will need to check when
any running jobs are complete, or you can set the following mail
option on a job (a=abort, e=exit)...
> pbsnodes -a
> qstat -f JOBID...
> qalter -m ae -M winton@hostname.ph.unimelb.edu.au[,ownerusername] JOBID...

The node can then be reinstated after the maintenance by setting the state
to free...
> qmgr
Qmgr: set node hostname state=free

To stop the submission of new jobs to a queue...
Qmgr:  set queue defaultq enabled=false

To stop the further execution of jobs in a queue...
Qmgr:  set queue defaultq started=false


If you need to urgently take down a machine, you can set the state to
offline (as above) and rerun any job on the machine.
Set machine to offline...
> qmgr
Qmgr: set node hostname state=offline

Find out all running jobs on the host...
> pbsnodes -a
Re-schedule the jobs...
> qrerun JOBID...
Check if the job's resource list has any problems...
> qstat -f JOBID...
You may need to respecify the resource list settings to run on a different host
(eg. "Resource_List.neednodes = bluefish" looks bad if bluefish is the machine
being taken down).  To do this just reset all the other resource list values
but leave out neednodes...
> qalter -l 'nice=15,nodect=1,nodes=1' JOBID
Check that Maui has given the job a priority...
> diagnose -p
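The urgent-takedown steps above can be wrapped in a small dry-run helper (the
function name and job IDs here are hypothetical).  It only prints the
commands so they can be reviewed first; pipe the output to sh to actually
run them:

```shell
# drain_node node [jobid...] -- print the commands to offline a node
# and re-queue the listed jobs (review the output, then pipe to sh)
drain_node() {
    node=$1; shift
    echo "qmgr -c 'set node $node state=offline'"
    for job in "$@"; do
        echo "qrerun $job"
    done
}

drain_node bluefish 101.roberts 102.roberts
```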




Configuring the Server host:
============================

Add the service ports by name to the system.
> su
> vi /etc/services
>>>>>>>> add these lines
pbs             15001/tcp                       # pbs server (pbs_server)
pbs_mom         15002/tcp                       # mom to/from server
pbs_resmom      15003/tcp                       # mom resource management reqs
pbs_resmom      15003/udp                       # mom resource management reqs
pbs_sched       15004/tcp                       # scheduler

If using inetd or xinetd (which isn't part of this installation) then
you will need to modify hosts.allow...
NOTE:  This is not necessary for this installation.
> vi /etc/hosts.allow
>>>>>>> add these lines
pbs:        128.250.50.0/255.255.255.0, 128.250.51.192/255.255.255.192
pbs_mom:    128.250.50.0/255.255.255.0, 128.250.51.192/255.255.255.192
pbs_resmom: 128.250.50.0/255.255.255.0, 128.250.51.192/255.255.255.192
pbs_sched:  128.250.50.0/255.255.255.0, 128.250.51.192/255.255.255.192

PBS needs to know the name of the PBS server node...
> vi /var/spool/pbs/server_name 
>>>>> change to the correct server name
roberts

You can define an initial set of nodes available in PBS.
> vi /var/spool/pbs/server_priv/nodes
>>>>> add all of the nodes
###
### In reverse order of power for MAUI scheduler.
### MAUI chooses the last nodes first.   LJW 7/8/2002
###
roberts         np=1
redfish         np=2
bluefish        np=2
lem             np=2
lorax           np=2
vangogh         np=2
yooks           np=2
zooks           np=2
eppcluster1     np=1
eppcluster2     np=1

You will need to configure all PBS processing nodes.
NOTE:  If the server node is not going to be used for processing then this is
not necessary.
> vi /var/spool/pbs/mom_priv/config
$logevent 0x1ff
$clienthost roberts
$ideal_load 1.1
$max_load 1.5
$usecp *.ph.unimelb.edu.au:/home/ /home/
$usecp *.ph.unimelb.edu.au:/epp/home/ /home/
$usecp *.ph.unimelb.edu.au:/data/ /data/

** NOTE:  The values $max_load and $ideal_load are set so that any node
   that is in use is taken off the PBS queues.  This is used as secondary
   information to the number of processors (NP) and should only be used
   if jobs can be submitted to the nodes by other methods.  $max_load should
   be set to the maximum acceptable load for a node to remain free.  If this
   is exceeded the node will be marked "busy" and taken out of PBS queues.
   $ideal_load should be set to a level below which a node is definitely
   free for processing (provided all NP processors are not in use).  If the
   node is marked "busy" and the machine load falls below this it will be
   marked as "free" again.  If using this feature, as a rule of thumb,
   set $max_load to NP-0.5 and $ideal_load to NP-0.9 . (eg. a single job/CPU
   node has np=1 $max_load=0.5 $ideal_load=0.1 ; a 4 job/CPU node has
   np=4 $max_load=3.5 $ideal_load=3.1)
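The NP-0.5 / NP-0.9 rule of thumb above can be computed rather than worked
out by hand.  A small helper (hypothetical name) emits the two config lines
for a given NP:

```shell
# mom_load_lines np -- print $max_load/$ideal_load config lines using
# the NP-0.5 / NP-0.9 rule of thumb
mom_load_lines() {
    awk -v np="$1" 'BEGIN {
        printf "$max_load %.1f\n$ideal_load %.1f\n", np - 0.5, np - 0.9
    }'
}

mom_load_lines 4    # lines for an np=4 node
```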
** NOTE:  $usecp lines tell PBS to use a local copy (the unix cp command)
   instead of 'rcp' or 'scp' when staging files/results in/out for the
   specified machines and directories.  That is, the directory is cross-mounted!
   For example...
    * User submits a job from  "frank.ph.unimelb.edu.au"
    * Job ends up running on  "bill.ph.unimelb.edu.au"
    * The job running on "bill" knows that any files it needs to copy from/to
      hosts matching "*.ph.unimelb.edu.au" within directory /home/ can
      just be copied into /home/... on the local system, as these are
      equivalent.

> vi /usr/local/pbs/lib/xpbsmon/xpbsmonrc
>>>> Make sure the default hostname is OK.  It may have the name of the build host.
*sitesInfo:...

> vi /usr/local/pbs/lib/xpbs/xpbsrc 
>>>> Make sure the default hostname is OK.  It may have the name of the build host.
*serverHosts: roberts
>>>>
*selectHosts: roberts

> /usr/local/pbs/sbin/pbs_server -t create
> ps -ef | grep pbs
*Kill the pbs_server process (one should exist!)
> kill ?????
> /etc/init.d/pbs start
> ps -ef | grep pbs

> update-rc.d pbs defaults
  *OR* RedHat
> /sbin/chkconfig --add pbs

> /usr/local/pbs/bin/qmgr roberts.ph.unimelb.edu.au
Qmgr: set server managers=winton@roberts.ph.unimelb.edu.au
Qmgr: set server managers+=glenn@roberts.ph.unimelb.edu.au
Qmgr: create queue defaultq queue_type=e
Qmgr: set queue defaultq resources_default.nodes=1
Qmgr: set queue defaultq resources_default.nodect=1
Qmgr: set queue defaultq resources_default.nice=15
Qmgr: set queue defaultq enabled=true
Qmgr: set queue defaultq started=true
Qmgr: set server default_queue=defaultq
Qmgr: set server acl_hosts=*.ph.unimelb.edu.au
Qmgr: set server acl_host_enable=true
Qmgr: set server scheduling=true
Qmgr: set server query_other_jobs=true
Qmgr: quit

NOTE:  query_other_jobs=true ensures that people can query the status
  of others' jobs.  Otherwise other people's jobs are invisible.

To check the status:
> /usr/local/pbs/bin/pbsnodes -a





PBS Startup Script (Debian Linux)
=================================
#! /bin/bash
#
# pbs		This script will start and stop the PBS daemons
#
# chkconfig: 345 85 85
# description: PBS is a batch system for SMPs and clusters
#
PATH=/bin:/usr/bin:/sbin:/usr/sbin
PBS_SBIN=/usr/local/pbs/sbin

# let's see how we were called
case "$1" in
  start) 
	echo "Starting PBS daemons: "
	if [ -x $PBS_SBIN/pbs_mom ] ; then
		echo -n "Starting pbs_mom: "
		start-stop-daemon --start -b --quiet --exec $PBS_SBIN/pbs_mom
		if [ $? == 0 ]; then
			echo .
		else
			echo failed
		fi
	fi
	if [ -x $PBS_SBIN/pbs_sched ] ; then
		echo -n "Starting pbs_sched: "
		start-stop-daemon --start -b --quiet --exec $PBS_SBIN/pbs_sched
		if [ $? == 0 ]; then
			echo .
		else
			echo failed
		fi
	fi
	if [ -x $PBS_SBIN/pbs_server ] ; then
		echo -n "Starting pbs_server: "
		start-stop-daemon --start -b --quiet --exec $PBS_SBIN/pbs_server
		if [ $? == 0 ]; then
			echo .
		else
			echo failed
		fi
        fi
  ;;
  stop)
	echo "Shutting down PBS: "
        if [ -x $PBS_SBIN/pbs_server ] ; then
		echo -n "Stopping pbs_server: "
		start-stop-daemon --stop --quiet --exec $PBS_SBIN/pbs_server
		if [ $? == 0 ]; then
			echo .
		else
			echo failed
		fi
	fi
        if [ -x $PBS_SBIN/pbs_mom ] ; then
		echo -n "Stopping pbs_mom: "
		start-stop-daemon --stop --quiet --exec $PBS_SBIN/pbs_mom
		if [ $? == 0 ]; then
			echo .
		else
			echo failed
		fi
	fi
        if [ -x $PBS_SBIN/pbs_sched ] ; then
		echo -n "Stopping pbs_sched: "
		start-stop-daemon --stop --quiet --exec $PBS_SBIN/pbs_sched
		if [ $? == 0 ]; then
			echo .
		else
			echo failed
		fi
	fi
	sleep 2
  ;;
  status)
        if [ -x $PBS_SBIN/pbs_server ] ; then
		echo -n "Status pbs_server: "
		start-stop-daemon --stop --test --quiet --exec $PBS_SBIN/pbs_server
		if [ $? == 0 ]; then
			echo OK
		else
			echo failed
		fi
	fi
        if [ -x $PBS_SBIN/pbs_mom ] ; then
		echo -n "Status pbs_mom: "
		start-stop-daemon --stop --test --quiet --exec $PBS_SBIN/pbs_mom
		if [ $? == 0 ]; then
			echo OK
		else
			echo failed
		fi
	fi
        if [ -x $PBS_SBIN/pbs_sched ] ; then
		echo -n "Status pbs_sched: "
		start-stop-daemon --stop --test --quiet --exec $PBS_SBIN/pbs_sched
		if [ $? == 0 ]; then
			echo OK
		else
			echo failed
		fi
	fi
  ;;
  restart)
	echo "Restarting PBS"
	$0 stop
	$0 start
	echo "done."
  ;;
  *)
	echo "Usage: pbs {start|stop|restart|status}"
	exit 1
esac






PBS Startup Script (RedHat Linux)
=================================
#! /bin/bash
#
# pbs           This script will start and stop the PBS daemons
#
# chkconfig: 345 85 85
# description: PBS is a batch system for SMPs and clusters
#
. /etc/rc.d/init.d/functions
PBS_SBIN=/usr/local/pbs/sbin

# let's see how we were called
case "$1" in
  start)
        echo "Starting PBS daemons: "
        if [ -x $PBS_SBIN/pbs_mom ] ; then
                echo -n "Starting pbs_mom: "
                daemon $PBS_SBIN/pbs_mom
                echo
        fi
        if [ -x $PBS_SBIN/pbs_sched ] ; then
                echo -n "Starting pbs_sched: "
                daemon $PBS_SBIN/pbs_sched
                echo
        fi
        if [ -x $PBS_SBIN/pbs_server ] ; then
                echo -n "Starting pbs_server: "
                daemon $PBS_SBIN/pbs_server
                echo
        fi
  ;;
  stop)
        echo "Shutting down PBS: "
        if [ -x $PBS_SBIN/pbs_server ] ; then
                echo -n "Stopping pbs_server: "
                killproc $PBS_SBIN/pbs_server -TERM
                echo
        fi
        if [ -x $PBS_SBIN/pbs_mom ] ; then
                echo -n "Stopping pbs_mom: "
                killproc $PBS_SBIN/pbs_mom -TERM
                echo
        fi

        if [ -x $PBS_SBIN/pbs_sched ] ; then
                echo -n "Stopping pbs_sched: "
                killproc $PBS_SBIN/pbs_sched -TERM
                echo
        fi
        sleep 2
  ;;
  status)
        if [ -x $PBS_SBIN/pbs_server ] ; then
                echo -n "Status pbs_server: "
                status $PBS_SBIN/pbs_server
        fi
        if [ -x $PBS_SBIN/pbs_mom ] ; then
                echo -n "Status pbs_mom: "
                status $PBS_SBIN/pbs_mom
        fi
        if [ -x $PBS_SBIN/pbs_sched ] ; then
                echo -n "Status pbs_sched: "
                status $PBS_SBIN/pbs_sched
        fi
  ;;
  restart)
        echo "Restarting PBS"
        $0 stop
        $0 start
        echo "done."
  ;;
  *)
        echo "Usage: pbs {start|stop|restart|status}"
        exit 1
esac






Configuring each Executable node:
=================================

> su
It might be best to copy a version of mom_priv/config from another host.
If not, you can create a new one, though the cross-mount lines ("$usecp...")
will likely be the same.
> cat << EOF > /var/spool/pbs/mom_priv/config
\$logevent 0x1ff
\$clienthost roberts
\$ideal_load 1.1
\$max_load 1.5
\$usecp *.ph.unimelb.edu.au:/home/ /home/
\$usecp *.ph.unimelb.edu.au:/epp/home/ /home/
\$usecp *.ph.unimelb.edu.au:/data/ /data/
EOF
> cat << EOF > /var/spool/pbs/server_name
roberts
EOF

> vi /etc/services
>>>>>>>> add these lines
pbs             15001/tcp                       # pbs server (pbs_server)
pbs_mom         15002/tcp                       # mom to/from server
pbs_resmom      15003/tcp                       # mom resource management reqs
pbs_resmom      15003/udp                       # mom resource management reqs
pbs_sched       15004/tcp                       # scheduler

> vi /etc/hosts
>>>>>>>> ensure $clienthost exists
128.250.51.211  roberts.ph.unimelb.edu.au       roberts

> /etc/init.d/pbs start 
> update-rc.d pbs defaults
  *OR* RedHat
> /sbin/chkconfig --add pbs

To check the status (it is a good idea to restart the PBS processes on the server first):
> /usr/local/pbs/bin/pbsnodes -a




Testing the nodes:
==================

> qsub sometest.csh
> qstat -n
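The contents of sometest.csh are not shown; a minimal test job (a bash
sketch, hypothetical name test.sh) need only report where it ran.  The #PBS
directives are comments to the shell, so the script can also be executed
directly:

```shell
#!/bin/bash
# Minimal PBS test job: one node, short CPU limit, reports the host.
#PBS -N pbstest
#PBS -l nodes=1,cput=00:02:00

echo "running on $(hostname)"
```

Submit it with "qsub test.sh" and look for the .o/.e output files in the
submission directory.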

> /usr/local/pbs/bin/pbsnodes -a

> qmgr
>> list server
>> list queue @roberts
>> print server
>> print queue @roberts
>> quit





Using SCP for file transfer:
============================

You need to set up HostbasedAuthentication for PBS to work with SCP.
This effectively allows users to SSH/SCP between all nodes, and any hosts
that jobs can be submitted from, without the need for a password.

*** OpenSSH systems
On each host:
> su
> vi /etc/ssh/sshd_config
>>>>> Add or modify HostbasedAuthentication to yes
HostbasedAuthentication yes
> vi /etc/ssh/ssh_config
>>>>> Add or modify HostbasedAuthentication to yes and EnableSSHKeysign to yes
Host *
   HostbasedAuthentication yes
   EnableSSHKeysign yes

> /etc/init.d/ssh reload
> vi /etc/ssh/ssh_known_hosts
>>>>> Add list of hosts where users are allowed to authenticate without password
###
### PBS hosts fully qualified  (format "host.domain,host,IP RSA-public-key")
###
bluefish,bluefish.ph.unimelb.edu.au,128.250.51.213    ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAIEAvqiORxq5TVNlLBjJfeoUvAJVkOWVdrg5FkNueouDqdf2eGsmtFVDdPyFqKp+gV9YgczwGygjKUoeu9Nu3iU/hguTwv7Iq5uQPy2aNAMKZ2WWNBRFGZRFYcf83iLGXLIuuIZRSJAoXMv/INWdRkLlvD6v7SruOKqx7ZfendlQGq0=
lem.ph.unimelb.edu.au,lem,128.250.51.216            ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAIEA5k2+IODpQy44F+bzZAgokSHiT5Etsj25dOoGLiWfosOMAu9PFSw3H3FSDWNpKgRMokmt91btFIvLBGEzUNNU1ECU4nmKloX1z0Mv0PoajBocMEuhndmz9f7BXrnYABVJHrl7nMQtYAOSPoYZBv6smm1yi82oSFxfCPXGIX3kVl0=
roberts.ph.unimelb.edu.au,roberts,128.250.51.211    ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAIEA1QMykldVEd9tKFPhj0bm7ZGi8Yfg56iZwqNIuea5mzgMLmJXAkBr3WBsLcKhsOuZk+durDAQsVUFh5vXT46g3lrXr9e3ZMjpBX5uTJHE51vO2BZyhFwQQmMJBrom1xW7cYK7al7WR5TpZg0LgEsm9/BnJcHqEz/+nBkhl0hVhbc=
redfish.ph.unimelb.edu.au,redfish,128.250.51.212    ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAIEAoj8Zs1LrQoRl7Q3Zn3Zn7qXo+wSrp3liGxgHxnAVJdBOPrxyrthuOPt7ryt+elVsl3UKSr2aUK4xh+mmMdamE6vk8mafM8FFh+XCDa616YFMa/h6/cpWb6CavmDqJzZu/sNeE7cwsQGmXAYWMz2tGlMv/4Ga8uDkYHwVutjvdY8=
yooks.ph.unimelb.edu.au,yooks,128.250.51.225        ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAIEA1QMykldVEd9tKFPhj0bm7ZGi8Yfg56iZwqNIuea5mzgMLmJXAkBr3WBsLcKhsOuZk+durDAQsVUFh5vXT46g3lrXr9e3ZMjpBX5uTJHE51vO2BZyhFwQQmMJBrom1xW7cYK7al7WR5TpZg0LgEsm9/BnJcHqEz/+nBkhl0hVhbc=
zooks.ph.unimelb.edu.au,zooks,128.250.51.224        ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAIEA1QMykldVEd9tKFPhj0bm7ZGi8Yfg56iZwqNIuea5mzgMLmJXAkBr3WBsLcKhsOuZk+durDAQsVUFh5vXT46g3lrXr9e3ZMjpBX5uTJHE51vO2BZyhFwQQmMJBrom1xW7cYK7al7WR5TpZg0LgEsm9/BnJcHqEz/+nBkhl0hVhbc=
vangogh.ph.unimelb.edu.au,vangogh,128.250.51.215    ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAIEA0lPXhQCygf5w758wppn/iufequiHAJop3WFTipKEgR8PqqeYCfECHMUBwUvtWfMmupz51lo5imnlG0gm82+OMojs0FpT8xAsffYi4xJN7UtPS9wg5GTIi18qjLgeNyFXWSo517iRFjQgxwzyiEDITnRIqPw9pvM45i2M6TG6i+0=
lorax.ph.unimelb.edu.au,lorax,128.250.51.214        ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAIEA3B3J7U8PiNYzccpwyzf9eeMTh8ArFKDqiloZkcfEJOhZZGt4dEOTYSLeDUrRkCsW2UiNpAC8gLL7UtoMsy78yvYFN78a1Eb+B2bqg8iAD6U54epJZE07a4P5BFqYa/SEW+LeeX0w5AIOv8o29ISGi69P87NnUH1MAmfyHaqH2l0=
eppcluster1.ph.unimelb.edu.au,eppcluster1,128.250.51.221    1024 35 121220380173868178173920410789468319241798556204819439046856497350416879144687894533689997914158004060856959376973406991956261326237654401569792281647808360651535751684452664982624129941734329746054674303059429791214067100419288990779216416908291086946867822201510524755206919984170989862906174928745817538411
eppcluster1.ph.unimelb.edu.au,eppcluster1,128.250.51.221    ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAIEAxE0gdiUBWSqpzieGYQUI1Id+m+Svm9Esc+r86rUGdxCr7VqIOVc7RrG7l6A50xxeCB5UdJemuDYsVQmfAvwkMhrP/NYdKZulT2x7sdkvnB0mOaT/dJXOV47HiI6vqfrYbZrGIklRgUEGC1aiQd9M1Ps6Xi6QH7nws2VV2a+Yl7U=
eppcluster2.ph.unimelb.edu.au,eppcluster2,128.250.51.222    1024 35 121220380173868178173920410789468319241798556204819439046856497350416879144687894533689997914158004060856959376973406991956261326237654401569792281647808360651535751684452664982624129941734329746054674303059429791214067100419288990779216416908291086946867822201510524755206919984170989862906174928745817538411
eppcluster2.ph.unimelb.edu.au,eppcluster2,128.250.51.222    ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAIEAxE0gdiUBWSqpzieGYQUI1Id+m+Svm9Esc+r86rUGdxCr7VqIOVc7RrG7l6A50xxeCB5UdJemuDYsVQmfAvwkMhrP/NYdKZulT2x7sdkvnB0mOaT/dJXOV47HiI6vqfrYbZrGIklRgUEGC1aiQd9M1Ps6Xi6QH7nws2VV2a+Yl7U=

> vi /etc/ssh/shosts.equiv
>>>>> Add list of hosts where users are allowed to authenticate without password
lorax.ph.unimelb.edu.au
vangogh.ph.unimelb.edu.au
zooks.ph.unimelb.edu.au
yooks.ph.unimelb.edu.au
redfish.ph.unimelb.edu.au
roberts.ph.unimelb.edu.au
lem.ph.unimelb.edu.au
bluefish.ph.unimelb.edu.au
eppcluster1.ph.unimelb.edu.au
eppcluster2.ph.unimelb.edu.au

After this has been done on all hosts you should test this.
As a user try ssh connection from host to host:
> su - someuser
> ssh redfish.ph.unimelb.edu.au
> ssh lem.ph.unimelb.edu.au
> ssh roberts.ph.unimelb.edu.au
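The host-to-host checks can be looped over the whole cluster.  This sketch
only prints the ssh commands (BatchMode makes ssh fail rather than prompt
for a password, which is exactly the failure we want to detect); review the
output, then pipe it to sh as an ordinary user:

```shell
# Print a passwordless-login test command for each cluster host.
for h in redfish lem roberts bluefish lorax vangogh yooks zooks; do
    echo "ssh -o BatchMode=yes $h.ph.unimelb.edu.au true"
done
```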

*** SSH2 systems
> su
> vi /etc/ssh2/sshd2_config
>>>>> Add or modify "AllowedAuthentications" line to include "hostbased"
        AllowedAuthentications          publickey,password,hostbased
> vi /etc/ssh2/ssh2_config  
>>>>> Add or modify "AllowedAuthentications" line to include "hostbased" first
        AllowedAuthentications          hostbased,publickey,password
>>>>> Add default domain
        DefaultDomain                   physics.usyd.edu.au
> /etc/init.d/sshd restart
> mkdir -p /etc/ssh2/knownhosts

You must copy each remote host's public key to every host.  Remember to use
the fully qualified remote hostname as part of the knownhosts filename...

For remote SSH2 machines to connect to SSH2 machines...
> scp host:/etc/ssh2/hostkey.pub /etc/ssh2/knownhosts/fullhost.ssh-dss.pub
For remote OpenSSH machines to connect to SSH2 machines...
> ssh host ssh-keygen -x -f /etc/ssh/ssh_host_dsa_key.pub > /etc/ssh2/knownhosts/fullhost.ssh-dss.pub
NOTE:  For some reason DSA keys are named  *.ssh-dss.pub  and RSA1 keys are
       named  *.ssh-rsa.pub  .

> mkdir -p /etc/ssh2/hostkeys
> cp /etc/ssh2/knownhosts/fullhost.ssh-dss.pub /etc/ssh2/hostkeys/key_22_fullhost.pub
NOTE:  The filename format is   key_[port]_[fullhost].pub

> vi /etc/shosts.equiv
>>>>> Add list of hosts where users are allowed to authenticate without password

*** Notes on using SSH2 and OpenSSH together
On an OpenSSH system you can convert OpenSSH-pubkeys to SSH2-pubkeys...
> ssh-keygen -x -f /etc/ssh/ssh_host_dsa_key.pub
On an OpenSSH system you can convert SSH2-pubkeys to OpenSSH-pubkeys...
> ssh-keygen -X -f hostkey.pub
To check if public keys in different systems match you compare fingerprints...
To get a fingerprint of a key in OpenSSH...
> ssh-keygen -B -f /etc/ssh/ssh_host_dsa_key.pub
To get a fingerprint of a key in SSH2...
> ssh-keygen2 -F /etc/ssh2/knownhosts/cabibbo.physics.usyd.edu.au.ssh-dss.pub


*** PROBLEMS:
  * If you cannot ssh from one host to another with password:
        - Check the hosts public key (/etc/ssh/ssh_host_rsa_key.pub) is OK
          in the /etc/ssh/ssh_known_hosts file.  Check the fully qualified
          hostname and IP address
            > cat /etc/ssh/ssh_host_rsa_key.pub
            > nslookup hostname.domain
        - Check permissions on the ssh and ssh-keysign executables.  These
          should have the SUID bit turned on.
            > ls -l /usr/bin/ssh
            > ls -l /usr/lib/ssh-keysign 
            > chmod u+s /usr/bin/ssh
            > chmod u+s /usr/lib/ssh-keysign
  * Make sure you add any new hosts to the  /etc/ssh/ssh_known_hosts  and
    /etc/ssh/shosts.equiv  files on each machine in the PBS cluster.
    You should also add entries to/for any machine that you can submit
    PBS jobs from.
  * If all else fails you might get an idea of what's happening by using
    ssh's debug output.
        > ssh -vvv lorax.ph.unimelb.edu.au
  * You should also check that the first name given for your host in
    /etc/hosts matches the name in /etc/ssh/ssh_known_hosts .  eg...
        ...
        debug2: we sent a hostbased packet, wait for reply
        debug1: Remote: Accepted for epp.ph.unimelb.edu.au [128.250.51.211] by /etc/ssh/shosts.equiv.
                    CHECK THIS HOST NAME!





Installing MAUI scheduler:
==========================

> gunzip -c maui-3.0.7p8.tar.gz | tar -xvf -
> cd maui-3.0.7
> ./configure
Maui Installation Directory? (Default: /usr/local)
NOTE:  This is where Maui executables will be copied: /usr/local/maui
Maui Home Directory? (Default: /root/maui-3.0.7)
NOTE:  This is where Maui config, log, and checkpoint files are maintained: /var/spool/maui
Compiler? (Default: gcc) 
Checksum Seed? (Any random number between 0 and MAX_INT) 4563456
OPSYS:         LINUX
COMPILER:      gcc
CHECKSUMSEED:  4563456
MAUI_HOME_DIR: /var/spool/maui
MAUI_INST_DIR: /usr/local/maui
PRIMARY ADMIN: root
SERVERHOST:    roberts
Correct? [Y|N] (Default: N) Y    **** NOTE: Answer Y regardless of ADMIN/SERVER
Do you want to use PBS? [Y|N] (Default: Y) Y
PBS Target Directory: (default: /usr/local) /usr/local/pbs

> vi maui.cfg
>>>>> Modify host to full hostname...
SERVERHOST            roberts.ph.unimelb.edu.au
>>>>> Modify administrator list to include ROOT and others...
ADMIN1                root winton glenn
>>>>> Modify the log level...
LOGLEVEL              0
>>>>> Configure the defer job settings...
###
### Defer Status settings - PBS queue configuration
###                                                     17/12/2002 LJW
DEFERCOUNT 20
DEFERSTARTCOUNT 3
DEFERTIME 0:03:00

* NOTE:  LOGLEVEL of 3 is good for debugging.  Anything higher than 0 creates
  too many entries for normal operation.
* NOTE:  DEFER* - If a PBS node goes down, it takes the PBS server a little
  while to register this.  If a job is submitted to an apparently up node which
  is actually down, it will fail to start, and by default Maui marks this
  job as deferred and will not restart the job for another hour.  The above
  DEFER* settings tell Maui to mark the job status=defer after 3 failed
  starts, then wait 3 minutes and attempt to restart it.  If the job is
  deferred more than 20 times it will be placed in Batch-Hold after which
  only an administrator can start/delete the job by hand.

> make
> su
> make install
> ps -ef | grep pbs
*Kill the pbs_sched process if one exists
> kill ????
*Disable the PBS scheduler (init.d file checks for binary before it runs)
> mv /usr/local/pbs/sbin/pbs_sched /usr/local/pbs/sbin/pbs_sched.DISABLED
> cp /etc/init.d/pbs /etc/init.d/maui
> vi /etc/init.d/maui
>>>>> Modify the comments...
>>>>> Delete all but the pbs_sched start,stop,status if blocks.
>>>>> Modify pbs_sched references to /usr/local/maui/bin/maui...
#! /bin/bash
#
# MAUI          This script will start and stop the MAUI scheduler daemon
#
# chkconfig: 345 86 86
# description: MAUI is a scheduler for PBS batch system
#

PATH=/bin:/usr/bin:/sbin:/usr/sbin

# let's see how we were called
case "$1" in
  start)
        if [ -x /usr/local/maui/bin/maui ] ; then
                echo -n "Starting MAUI: "
                start-stop-daemon --start -b --quiet --exec /usr/local/maui/bin/maui
                if [ $? == 0 ]; then
                        echo .
                else
                        echo failed
                fi
        fi
  ;;
  stop)
        if [ -x /usr/local/maui/bin/maui ] ; then
                KILLMESSG=""
                echo -n "Sending MAUI kill:  "
                /usr/local/maui/bin/schedctl -k
                if [ $? == 0 ]; then
                    KILLMESSG="(this is OK if scheduler shutdown complete)"
                fi
                sleep 2
                echo -n "Stopping MAUI:  "
                start-stop-daemon --stop --quiet --exec /usr/local/maui/bin/maui
                if [ $? == 0 ]; then
                        echo .
                else
                        echo failed "$KILLMESSG"
                fi
                sleep 3
        fi
  ;;
  status)
        if [ -x /usr/local/maui/bin/maui ] ; then
                echo -n "Status of MAUI: "
                start-stop-daemon --stop --test --quiet --exec /usr/local/maui/bin/maui
                if [ $? == 0 ]; then
                        echo OK
                else
                        echo failed
                fi
        fi
  ;;
  restart)
        echo "Restarting MAUI"
        $0 stop
        $0 start
        echo "done."
  ;;
  *)
        echo "Usage: maui {start|stop|restart|status}"
        exit 1
esac

> /etc/init.d/maui start
> update-rc.d maui defaults
  *OR* RedHat
> /sbin/chkconfig --add maui

NOTE:  If you've changed the Maui configuration and wish to reload it without
a restart you can run the following command...
    >  /usr/local/maui/bin/schedctl -R



A possible queue configuration:
===============================

Scenario:
One default queue for processing of jobs on all nodes.
One queue for afterhours jobs on all nodes (higher priority).
One queue for short lived test jobs (highest priority).

> /usr/local/pbs/bin/qmgr
** If no default queue already...
Qmgr: create queue defaultq queue_type=e
Qmgr: set queue defaultq enabled=true
Qmgr: set queue defaultq started=true
Qmgr: set queue defaultq resources_default.nodes=1
Qmgr: set queue defaultq resources_default.nodect=1
Qmgr: set queue defaultq resources_default.nice=15
Qmgr: set server default_queue=defaultq
** Setup the afterhours queue (estimate 4hrs with maximum 48hrs over weekend)...
Qmgr: create queue afterhours queue_type=e
Qmgr: set queue afterhours resources_default.nodes=1
Qmgr: set queue afterhours resources_default.nodect=1
Qmgr: set queue afterhours resources_default.nice=15
Qmgr: set queue afterhours resources_default.cput=04:00:00
Qmgr: set queue afterhours resources_max.cput=48:00:00
Qmgr: set queue afterhours enabled=true
Qmgr: set queue afterhours started=true
** Setup the test queue (2 minute jobs)...
Qmgr: create queue testq queue_type=e
Qmgr: set queue testq resources_default.nodes=1
Qmgr: set queue testq resources_default.nodect=1
Qmgr: set queue testq resources_default.nice=15
Qmgr: set queue testq resources_default.cput=00:02:00
Qmgr: set queue testq resources_max.cput=00:02:00
Qmgr: set queue testq enabled=true
Qmgr: set queue testq started=true
Qmgr: quit

> su
> vi /var/spool/maui/maui.cfg
>>>>>> Add the following lines
###
### Standing Reservations - PBS queue configuration
###                                                     8/8/2002 LJW

SRNAME[0]       normal
SRSTARTTIME[0]  09:00:00
SRENDTIME[0]    20:00:00
SRDAYS[0]       Mon Tue Wed Thu Fri
SRCLASSLIST[0]  defaultq testq
SRHOSTLIST[0]   ALL

# After hours queue will run 8pm to 9am Mon-Fri and all day Sat-Sun
SRNAME[1]       afterhours1 
SRSTARTTIME[1]  20:00:00
SRENDTIME[1]    24:00:00
SRPERIOD[1]     DAY
SRDAYS[1]       Mon Tue Wed Thu Fri
SRCLASSLIST[1]  afterhours defaultq testq
SRHOSTLIST[1]   ALL

SRNAME[2]       afterhours2
SRSTARTTIME[2]  00:00:00
SRENDTIME[2]    09:00:00
SRPERIOD[2]     DAY
SRDAYS[2]       Mon Tue Wed Thu Fri
SRCLASSLIST[2]  afterhours defaultq testq
SRHOSTLIST[2]   ALL

SRNAME[3]       afterhours3
SRPERIOD[3]     DAY
SRDAYS[3]       Sat Sun
SRCLASSLIST[3]  afterhours defaultq testq
SRHOSTLIST[3]   ALL

CLASSCFG[afterhours]    PRIORITY=20
CLASSCFG[testq]         PRIORITY=10
CLASSCFG[defaultq]      PRIORITY=50 


> /etc/init.d/maui restart
> /usr/local/maui/bin/showres -v
> /usr/local/maui/bin/diagnose -r

** NOTE:  CLASS refers to the queue name that the job is in.
** NOTE:  Doesn't seem to be a good idea to overlap reservations!  Also, any
   reservation going over midnight requires 2 reservations, one until
   24:00:00 and the other from 00:00:00 .
** NOTE:  "SRHOSTLIST[X] ALL" must be specified or nothing works!
** NOTE:  Standing Reservations will not be visible in "showres" or "diagnose"
   if they are SRPERIOD=DAY and SRDAYS is not the current day.  I think, if
   you set SRPERIOD=WEEK the SRDAYS will be ignored.
** WARNING:  CLASSCFG/PRIORITY does not seem to work!!!  Even after setting
   variables...
        CREDWEIGHT 1   USERWEIGHT 1   GROUPWEIGHT 1   CLASSWEIGHT 5
        CREDCAP 100000   CLASSCAP 100000



Using FairShare in Maui
=======================
> su
> vi /var/spool/maui/maui.cfg
>>>>>> Add the following lines
#### Job queuetime has a factor of 2 weight in queuing priority
QUEUETIMEWEIGHT       2
XFACTORWEIGHT         1
RESOURCEWEIGHT        10

#### FairShare system has a factor of 5 weight in queuing priority
FSWEIGHT              5
FSPOLICY              DEDICATEDPES
#### 24 hour intervals, 14 of these in total, decay of 0.5 foreach interval
FSINTERVAL            24:00:00
FSDEPTH               14
FSDECAY               0.50
#### User FairShare has a factor of 10 weight in queuing priority
FSUSERWEIGHT          10
#### Users are penalised after using 25% of the total system (integrated)
USERCFG[DEFAULT] FSTARGET=25
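To see how quickly old usage stops mattering, note the weight applied to
usage from i intervals (days) ago is FSDECAY^i.  A one-liner (matching the
0.50/14 settings above) prints the full window:

```shell
# Print the fairshare weight for each of the 14 daily intervals
# (weight = FSDECAY ^ interval-age).
awk 'BEGIN {
    decay = 0.50
    for (i = 0; i < 14; i++)
        printf "interval %2d ago: weight %.4f\n", i, decay ^ i
}'
```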


To look at the fairshare configuration...
> /usr/local/maui/bin/diagnose -f
NOTE:  FSInterval periods are displayed from left to right for all values.
  FSWeight is the relative weight of each period.  The "Users" section shows
  their total usage (weighted) and their target percentage.
To look at the actual priority assigned to jobs in the queue, broken down
into fairshare, queuetime, and xfactor (expansion factor) components...
> /usr/local/maui/bin/diagnose -p



Making PBS accessible to all users
==================================

> su
> cd /usr/local/bin
> ln -s ../pbs/bin/* .
> cd /usr/local/man/man1
> ln -s ../../pbs/man/man1/* .




Problems
========

***** TTY or SCP problems *****
Try doing an scp test yourself.  You may echo
something in your .cshrc that will stop scp from working.  Some debug lines...

### This will echo DEBUG1 to stderr and stdout within a PBS job.
if ($?PBS_JOBNAME) then
       echo DEBUG1
       sh -c "echo DEBUG1 1>&2"
endif

### Use this to restrict TTY output/settings to login shells only.
if ( ($?prompt) && ($?tty) && ( "$tty" != "" ) ) then
    stty ???
    echo ???
endif

### Some systems do not have /usr/bin/scp which causes a problem (eg. RedHat6)
    > ln -s /usr/local/bin/scp /usr/bin/scp

***** Complete Failure? *****
There are many reasons why this might happen.  I'll list a few I've
come up against.

1) Some systems have a feature where the "set user ID" flag on files is
   automatically turned off (for security).  This will break PBS.
   The following files require the setUID bit...
        sbin/pbs_iff
        sbin/pbs_rcp

2) The PBS server must exist in the /etc/hosts file.  Do not rely on DNS.
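The setUID requirement in item 1 can be checked with find's permission test.
A quick sketch (hypothetical helper name):

```shell
# List regular files under a directory that have the set-user-ID bit set.
check_setuid() {
    find "$1" -type f -perm -4000
}

# e.g.  check_setuid /usr/local/pbs/sbin
#       should list pbs_iff and pbs_rcp; if either is missing, restore
#       the bit with:  chmod u+s <file>
```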






Created: 13 Aug 2002
Last modified: 13 Aug 2002

Authorised by: Professor Geoff Taylor, School of Physics
Maintained by: Dr Lyle Winton winton@physics.unimelb.edu.au