Limulus Project Software Release Notes

Date: 01/12/14
Release Nickname: Morrison Hotel
Base OS: Scientific Linux 6.x

Contents:

 1. Assistance
 2. Root Password And System Overview
 3. User Accounts
 4. Documentation
 5. Powering Up/Down Nodes
 6. Executing Commands On Nodes
 7. Node Log Files
 8. Changing The System Host Name
 9. Fan Control and Thermal Management
10. SATA Ports Configuration
11. Optional RAID Configuration
12. Ganglia Configuration
13. Installed HPC Software

1. Assistance
=============

Paid support customers may contact Basement Supercomputing as follows:

  Submit questions to: http://basement-supercomputing.com/qa/
  Email: support@basement-supercomputing.com

2. Root Password And System Overview
====================================

Root password is: changeme
Machine name: limulus
Operating System: Scientific Linux 6.6

File system (sizes may vary and /home may be on a RAID array):

  /      50G
  /boot  485M
  /home  62G

The worker nodes do not automatically power up when the machine is booted (see below). Nodes operate using a RAM disk. Both /opt and /home are mounted on the nodes via NFS. Packages and libraries are added in /opt. Root and users can ssh to all nodes.

The IP addresses and node names are:

  10.0.0.10 n0
  10.0.0.11 n1
  10.0.0.12 n2

In the 10.0.0 subnet, the main node is called:

  10.0.0.1 limulus headnode

The second Ethernet port is configured to request a DHCP IP address.

Limulus 200 and 400 models use the Intel i7 processor. These processors have four physical cores and support Intel Hyper-Threading (HT). Intel HT will report 8 cores per processor. For some HPC applications these four extra virtual processors have been shown to slow down floating point operations. Therefore, all Limulus model 200 systems have Intel HT turned off by default (set in the BIOS). Indeed, there are only four physical FPUs per processor, and these can only be effectively utilized by four cores.

3. User Accounts
================

Users are created using the standard "useradd" command. New user names will be propagated to the nodes within 5 minutes. New user ssh keys are automatically generated on first login.

When logging into nodes for the first time, or when logging into newly booted nodes, the following message will be printed:

  /usr/bin/xauth: creating new authority file /root/.Xauthority

This message can be safely ignored.

4. Documentation
================

There is on-line documentation available. There are pointers to reference material in the documentation. Point the browser to:

  # firefox http://localhost/limulus

To see the documentation in text mode without starting a browser enter:

  # links /usr/share/doc/limulus-doc/index.html

The current Limulus reference manual is part of the Cluster Documentation Project:

  http://cdp.clustermonkey.net/index.php/Limulus_Manual

Note: to make the documentation viewable over a local network, edit /etc/httpd/conf.d/limulus.conf as root and add an "Allow" line in place of the commented example:

  # Allow from .example.com

For instance, to allow local access to the Limulus documentation for a local network with 192.168.0.0/255.255.255.0, add the following:

  Allow from 192.168.0.0/24

There may be multiple instances that need replacing. Reload the web server using:

  # service httpd reload

The same can be done for Ganglia using the /etc/httpd/conf.d/ganglia.conf file.
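For reference, after the edit the relevant stanza in /etc/httpd/conf.d/limulus.conf will typically look something like the sketch below. This is an illustration only, assuming the Apache 2.2 Order/Deny/Allow syntax shipped with Scientific Linux 6; the exact directives and directory path in your limulus.conf may differ.

  <Directory /usr/share/doc/limulus-doc>
      Order deny,allow
      Deny from all
      Allow from 127.0.0.1
      Allow from 192.168.0.0/24
  </Directory>

After saving the change, reload the web server (# service httpd reload) so the new access rule takes effect.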
5. Powering Up/Down Nodes
=========================

Low Level Power Control
-----------------------

At the lowest level, node power is applied by using the "relayset" command. The relayset command controls the four power relays. The power relays are mapped as follows:

  relay 1   empty
  relay 2   n0
  relay 3   n1
  relay 4   n2

The relays are initialized on boot by "relayset -init". NOTE: Manual re-initialization will result in a power-cycle for all worker nodes.

For example, after initial power-on of the base system, to turn on node n0 enter:

  # relayset 2 on

The remaining nodes can be powered up as follows (node n1, then node n2):

  # relayset 3 on
  # relayset 4 on

To power off node n0:

  # relayset 2 off

Keep in mind, turning the power off with relayset will not gracefully shut the node down. It is like pulling the plug on the nodes. If the node has an attached disk drive, the drive may not be shut down properly. It is best to use the high level scripts below to power the nodes ON and OFF.

Other options for "relayset" are described below. Note that relayset is designed to be "silent" so that it can be easily used in other scripts.

  To initialize (do first):  relayset init
  To turn a relay on/off:    relayset 1|2|3|4 on|off
  To get status:             relayset 1|2|3|4 status
                             (Returns 1 if on, 0 if off)
  To list the devices found: relayset list
  To print debug messages add "debug" to the command line

  Returns -1 on error, 0 or 1 if successful.

High Level Power Control (Recommended)
--------------------------------------

There are two higher level power control scripts, node-poweron and node-poweroff. These scripts are the preferred way to turn nodes ON or OFF.

node-poweron

  No node arguments turns all nodes ON. If a node is already on, nothing will happen. Node name(s) can be given as argument(s) in the range {n0,...,n6}. For example:

    # node-poweron n0 n2
    # node-poweron -s n1
    # node-poweron -s

  Invalid nodes will be ignored. Default Limulus nodes are {n0,n1,n2}. The script waits until all nodes are started or the process times out. The -s option runs in quiet mode; -h prints a help message.

node-poweroff

  No node arguments turns all nodes OFF. If a node is already off, nothing will happen. Node name(s) can be given as argument(s) in the range {n0,...,n6}. For example:

    # node-poweroff n0 n2
    # node-poweroff -s n1
    # node-poweroff -s

  Invalid nodes will be ignored. Default Limulus nodes are {n0,n1,n2}. A delay is included so nodes can properly shut down before power is removed. Any node-attached drives are placed in stand-by mode. The -s option runs in quiet mode; -h prints a help message.

System Power ON and OFF
-----------------------

The three worker nodes are controlled by the main node and DO NOT start when the system is booted. If you would like the nodes to start when the machine is booted, simply add the following at the end of the /etc/rc.local file:

  node-poweron

When the system is rebooted or halted, the nodes are turned OFF gracefully (i.e. the OS is shut down). If there are any attached disks, they are placed in standby mode (# hdparm -Y /dev/hda). This step is important because the drives remain powered when the nodes are off. On reboot, the drives will wake up and work as expected.

6. Executing Commands On Nodes
==============================

ssh may be used to execute commands on nodes or to login directly to the nodes. You may also use the "pdsh" utility to execute commands on all or some of the nodes. For instance, to run "uptime" on all the nodes:

  # pdsh uptime
  n1:  17:54:46 up 3 min,  0 users,  load average: 0.00, 0.00, 0.00
  n0:  17:54:46 up 3 min,  0 users,  load average: 0.00, 0.00, 0.00
  n2:  17:54:46 up 3 min,  0 users,  load average: 0.00, 0.01, 0.00

Node status is checked every 60 seconds. If a node is active, it will be used by the pdsh command. Recently started or rebooted nodes may not respond to pdsh right away. Recently shut down nodes may cause an ssh time-out. The whatsup package is used to maintain the node list file pointed to by the WCOLL environment variable.
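As a quick check before running pdsh (a sketch only; if WCOLL is not set in your shell it is most likely exported by a profile script, and the file name will depend on your configuration), you can inspect the node list that pdsh will use:

  # echo $WCOLL
  # cat $WCOLL

The file should list the currently responding nodes, one per line.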
Individual nodes or groups of nodes can be addressed with pdsh using the "-w" option:

  # pdsh -w n1 hostname
  n1: n1

  # pdsh -w n[0,2] hostname
  n2: n2
  n0: n0

  # pdsh -w n[0-2] hostname
  n1: n1
  n0: n0
  n2: n2

Please see the pdsh manpage for more information.

7. Node Log Files
=================

Rebooting nodes causes the local logs to be lost. To provide a record of node activity, the node log files are mirrored (using rsyslogd) on the headnode and placed in:

  /var/log/nodes/{n0,n1,n2}.log

These logs are written to the local disk on the headnode. They are also rotated every week and kept for four weeks. Node logs are also written locally on each node (to the RAM disk); these are purged daily and not rotated.

8. Changing The System Host Name
================================

As configured, each Limulus system assumes the hostname is "limulus". The LAN interface is configured for DHCP. If you want to change the hostname to a FQHN, the following steps are needed to ensure Grid Engine works properly. Note: the "headnode" alias is used to provide a consistent name for the login node (or "headnode").

A. Edit /etc/sysconfig/network and provide a new HOSTNAME. In this example we will use "waldo.basement-supercomputing.com".

B. Edit /etc/hosts on the head/main node and change the 10.0.0.1 line to reflect your new hostname; the example below uses the new name "waldo.basement-supercomputing.com". For example, change:

     10.0.0.1 limulus headnode

   to

     10.0.0.1 waldo.basement-supercomputing.com waldo headnode

   Or, if you have a static IP address and want to include it in your hosts file:

     192.168.0.42 waldo.basement-supercomputing.com waldo
     10.0.0.1     waldo headnode

C. Change the /opt/gridengine/default/common/host_aliases to look like:

     headnode waldo.basement-supercomputing.com waldo

D. In order for NFS v4 to work properly you need to provide a "Domain" in the /etc/idmapd.conf file. Edit this file and replace the line (or add after the line):

     #Domain = local.domain.edu

   with your local domain. For example:

     Domain = basement-supercomputing.com

   This file will also be sent to the nodes on boot-up. If this is not set when the host has a FQDN, all NFS-mounted files on the nodes will have the owner "nobody" and will not be usable by their respective owners.

E. Change the /etc/hosts file in the VNFS. This requires several steps. First edit /var/chroots/sl62_base/etc/hosts and change the line:

     10.0.0.1 limulus headnode

   to

     10.0.0.1 waldo.basement-supercomputing.com waldo headnode

   Next, rebuild the VNFS:

     # wwvnfs --chroot=/var/chroots/sl62_base/ --hybridpath=/vnfs

   When asked to "overwrite the Warewulf VNFS Image," enter "yes". The new VNFS with the updated /etc/hosts will be saved in the data store and used on the next reboot.

F. It is advisable to reboot the headnode and restart the worker nodes at this point so that the hostname change can take effect.

9. Fan Control and Thermal Management
=====================================

The front fans are controlled using the fancontrol daemon. If the worker processors become hot, the fan speed is increased. The fancontrol daemon is started on boot using the limfanctl rc script. Two daemons, "fancontrol" and "limulus-node-temp", are started and monitor the temperatures and fan speeds. The configuration file for the fancontrol daemon is in /etc/fancontrol. This file should not be changed.
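To spot-check the temperatures these daemons react to, the lm_sensors "sensors" utility can be used. This is a sketch only; it assumes "sensors" is available on the headnode and inside the node image, which may not be the case on every configuration, and the "Core" labels in the output vary by motherboard:

  # sensors | grep -i core
  # pdsh -w n[0-2] sensors | grep -i core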
On boot and shutdown the fans will run at high speed for a short time because the limfanctl service is not running. If for some reason you need to restart the limfanctl daemons enter:

  # service limfanctl restart

To check that the limfanctl daemons are running, enter:

  # service limfanctl status

Thermal Throttling
------------------

The Intel Haswell line of processors is known to run hot. This is partially due to the "turbo mode" that increases the clock speed while trying to keep the processor within the thermal specification. In addition, if a critical temperature is reached (often about 90C for the Haswell), the processor will lower the clock speed to reduce the temperature.

In order to keep each Limulus system as quiet as possible, the front intake fans are low noise with a high air-flow. This design may under some circumstances result in thermal throttling of the node processors. Each application has its own temperature profile and many can take advantage of Intel turbo mode without hitting the throttling limit. Interestingly, with many parallel applications the use of turbo mode adds little to the performance. This behavior is due to parallel applications running all the cores at full speed; under these conditions there is no extra frequency headroom to bump the clock speeds. The actual thermal profile is application dependent.

As shipped, Intel turbo mode is enabled on all cores. Should you notice that certain applications result in throttling (you can observe throttling events in /var/log/messages), you can disable turbo mode on the nodes and thus lower the temperatures without impacting performance. To turn off turbo mode on the nodes, run the "node-turbo-off" script. For example:

  # sh node-turbo-off
  Turning off turbo mode on nodes: n0 n1 n2
  Node n0 turbo mode is OFF
  Node n1 turbo mode is OFF
  Node n2 turbo mode is OFF

To turn turbo mode back on, use "node-turbo-on". The turbo mode status of the nodes can be checked using the "node-turbo-status" command. There is no need to turn turbo mode off on the main node. Also, it is a good idea to keep the default "powersave" governor setting for the node processors.

10. SATA Ports Configuration
============================

Model 100 and 200 (Limulus HPC):
--------------------------------

Each Limulus HPC has a total of ten SATA (6 Gb/s) ports available on the main node: six on-board SATA ports plus an add-in card with four additional ports. There are seven removable storage slots available on the case. These are as follows:

  1 - DVD slot
  2 - 2.5 inch slots (for SSD)
  4 - 3.5 inch slots (for spinning disk)

The DVD slot and the eSATA port on top of the case are connected to the 4-port SATA card. The two 2.5 inch (SSD) slots and the four 3.5 inch (spinning disk) slots are connected to the main motherboard. Depending on how your system is configured, some or all of these storage slots will have drives in them.

There are two available SATA ports on the add-in card. There are also two 2.5 inch internal slots on the bottom of the case. It is possible to use these extra ports with two additional 2.5 inch drives. These drive slots are not removable.

The three compute nodes operate diskless and have no disks connected. As a reference, each node does have six SATA (6 Gb/s) ports on the motherboard.
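To see which of these ports are currently populated with drives on the main node, a quick check such as the following can be used (informational only; device names and sizes depend on how your particular system was configured):

  # cat /proc/partitions
  # fdisk -l | grep "Disk /dev"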
Model 300 and 400 (Limulus Hadoop):
-----------------------------------

Each Limulus Hadoop has six SATA (6 Gb/s) ports on both the main motherboard and the worker node motherboards. There are a total of ten removable storage slots on the case. These are as follows:

  2 - 3.5 inch slots (for spinning disk)
  8 - 2.5 inch slots (for SSD)

The two 3.5 inch slots and two of the 2.5 inch slots are connected to the main motherboard. The worker nodes each have two 2.5 inch slots connected to their motherboard. The eSATA port on top of the case is connected to a SATA port on the main motherboard. This configuration leaves one open SATA port on the main motherboard that could be used for a 2.5 inch drive mounted on the inside bottom of the case.

The layout of the 2.5 inch SSD slots is as follows:

        top of case
  +---------------------+
  | headnode |    n0    |
  |  SATA 1  |  SATA 1  |
  |---------------------|
  |    n1    |    n2    |
  |  SATA 1  |  SATA 1  |
  |---------------------|
  | headnode |    n0    |
  |  SATA 2  |  SATA 2  |
  |---------------------|
  |    n1    |    n2    |
  |  SATA 2  |  SATA 2  |
  +---------------------+

11. Optional RAID Configuration
===============================

If the system has preconfigured RAID sets, be sure to add a notification email to /etc/mdadm.conf. Monitoring is started in /etc/rc.local by issuing an "mdadm --monitor --scan --daemonize" command.

12. Ganglia Configuration
=========================

The latest version of the Ganglia monitoring system allows much more user configuration than previous versions. In order to allow for customization, Ganglia is installed in "edit" mode, which allows any viewer to change the configuration of the Ganglia web page. The "edit" mode is set by disabling the authorization system. This setting has been made in /etc/ganglia/conf.php by adding (before the final "?>"):

  $conf['auth_system'] = 'disabled';

The full configuration is in /usr/share/ganglia/conf_default.php. Do not edit this file; instead, make changes in /etc/ganglia/conf.php. Once Ganglia is configured, the authorization can be reset by removing the setting in /etc/ganglia/conf.php. The system will then be "readonly."

13. Installed HPC Software
==========================

All Limulus software has been configured to work seamlessly after installation. Adding other packages may require additional configuration.

Environment Modules are used to integrate most HPC software into the cluster environment. For instance, to use the Sun Grid Engine batch scheduler you must enter:

  # module load sge6

before using any of its commands (e.g. qsub). You may view the batch queue and worker nodes using the "userstat" utility (load the sge module first). See the documentation for more information on the "modules" package.

Ganglia should start automatically on the nodes when booted. You can view the Ganglia interface by pointing the browser to:

  http://localhost/ganglia

See Section 4 above for making the Ganglia page viewable on the local LAN.

YUM Repositories:
-----------------

In addition to the Scientific Linux repositories, both the Limulus (limulus.repo) and the EPEL - Extra Packages for Enterprise Linux (epel.repo) repositories are enabled. Be careful when using other repositories as similar, but incompatible, versions of some software may be available.

NOTE: kernel updates are disabled in /etc/yum.conf. Limulus systems require 3.x kernels to support the features of the latest Intel processors. When necessary, these kernels will be placed in the Limulus update repository. Also, YUM will check for updates once a day; see /etc/cron.daily/yum-autoupdate.
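As a quick check that the expected repositories are active and that kernel packages are being held back, the following can be run on the headnode (informational only; the exact exclude entry in /etc/yum.conf may be worded differently on your system):

  # yum repolist
  # grep -i exclude /etc/yum.conf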
Installed HPC Software:
-----------------------

  * Warewulf Cluster Toolkit - Cluster provisioning and administration
  * PDSH - Parallel Distributed Shell for collective administration
  * Open Grid Scheduler - previously Sun Grid Engine Resource Scheduler
  * Ganglia - Cluster Monitoring System
  * GNU Compilers (gcc, g++, g77, gdb) - Standard GNU compiler suite
  * Modules - Manages User Environments
  * MPICH - MPI Library (message passing middleware)
  * OPEN-MPI - MPI Library (message passing middleware)
  * Open-MX - Myrinet Express over Ethernet
  * ATLAS - host tuned BLAS library
  * OpenBLAS - hand tuned BLAS library
  * FFTW - Optimized FFT library
  * FFTPACK - FFT library
  * LAPACK and BLAS - Reference Linear Algebra library
  * SCALAPACK - linear algebra routines for parallel distributed memory machines
  * PETSc - data structures and routines for parallel PDE solvers
  * GNU GSL - GNU Scientific Library (over 1000 functions)
  * PADB - Parallel Application Debugger Inspection Tool
  * Userstat - a "top" like job queue/node monitoring application
  * Beowulf Performance Suite - benchmark and testing suite
  * relayset - power relay control utility and scripts
  * ssmtp - mail forwarder for nodes
  * whatsup - node status using ping
  * Julia - Easy To Use High Performance Parallel Scientific Language
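To see what Environment Modules makes available for the software listed above, and to load a package into your environment, the usual pattern is shown below (sge6 is the module named earlier in these notes; use the names reported by "module avail" for other packages):

  # module avail
  # module load sge6
  # module list

"module list" shows what is currently loaded, and "module unload" removes a package from the environment.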