CIS - Cluster Information Service


Jan Astalos
Department of Parallel & Distributed Computing
Institute of Informatics
Slovak Academy of Sciences

Description

Cluster Information Service (CIS) provides clients with information about the availability of resources in a Linux cluster. This information is necessary for workload allocation and better utilization of resources (CPU, memory and network).
 

Background

Looking at the trends in high performance computing, you can see more and more people switching from expensive supercomputers to much cheaper computer clusters. The reasons are a good price/performance ratio, ease of maintenance and hardware upgrades, and the growing amount of free software available on the web. Clusters are used for both HTC (High Throughput Computing) and HPC (High Performance Computing). If you have a lot of independent tasks and you just want a system for efficient workload distribution, you fall into the first category and you need a good queuing system such as LSF. But if you have a parallel application with highly communicating tasks, you may want to balance not only CPU load but also network load. Unfortunately, there is still a lack of good, freely available monitoring systems with low intrusiveness usable in computer clusters.
 

Purpose

CIS is intended to give a deeper look into a Linux cluster. If you need a monitoring system for resource management, or if you just don't want to keep a number of windows with the top utility open, it may help you. It is designed to have as low an impact on the monitored system as possible, mainly through its:

Structure

CIS consists of servers and monitors. The server (cisd) runs on the head node of the cluster, which is usually not used for computations and acts as the front-end. Monitors (sysmon, procmon, sockmon, netmon, or the integrated monitor cismon) run on the computing nodes and send information about monitored objects to the server at regular time intervals. In addition, they inform cisd when a monitored event occurs (at the moment, object creation and destruction are monitored). CIS clients (xcis or resource management systems) usually do not run on the computing nodes, so the internal cluster network is not loaded by client calls.

Data transfer

After the initial full report, monitors send only changed data. Since messages are sent using the UDP protocol, they are labeled with sequence numbers. When the server detects a message loss, it asks the monitor for full information until the data are consistent again. UDP was selected because it is far better to send up-to-date information than to retransmit obsolete data.
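The loss-detection logic described above can be sketched as follows. This is a minimal illustration, not the actual cisd code; the structure and function names, the message layout, and the exact resynchronization policy are assumptions:

```c
#include <assert.h>
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical per-monitor state kept by the server (cisd). */
struct monitor_state {
    uint32_t expected_seq;   /* next sequence number we expect        */
    bool     need_full;      /* true => ask monitor for a full report */
};

/* Called for every UDP message arriving from one monitor.
 * Returns true when the message may be applied as an incremental
 * update, false when a gap was detected and full information must
 * be requested from the monitor. */
static bool accept_message(struct monitor_state *st, uint32_t seq,
                           bool is_full_report)
{
    if (is_full_report) {
        /* A full report makes the data consistent again. */
        st->expected_seq = seq + 1;
        st->need_full = false;
        return true;
    }
    if (st->need_full || seq != st->expected_seq) {
        /* Lost or reordered message: incremental updates can no
         * longer be trusted, so keep asking for a full report. */
        st->need_full = true;
        return false;
    }
    st->expected_seq = seq + 1;
    return true;
}
```

Note that a single lost datagram invalidates all following incremental updates, which is why the server keeps rejecting them until a full report arrives.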

Kernel instrumentation

Since Linux is the most used OS in computer clusters, only Linux monitors are available at this time. Gathering process information from the standard kernel interface (the /proc filesystem) is rather inefficient. Try running top -d 1 and watch how much CPU time it consumes. Some of the overhead is due to visualization, but the main part is due to multiple system calls for each process, conversions to text format and back, and the large amount of useless information transferred in each call. Moreover, the system call for checking available memory is _very_ inefficient in the 2.2 kernel series (improved in 2.4). The only information that is completely missing is the socket counters.
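For illustration, this is roughly what a /proc-based monitor has to do for every process in every interval: one open, one read, and a text parse per process. This is a sketch of the general technique, not CIS's actual procmon code:

```c
#include <stdio.h>

/* Read the CPU counters of one process from /proc/<pid>/stat.
 * Fields 14 and 15 are utime and stime in clock ticks; all the
 * fields before them must be read and skipped, which is part of
 * the per-process overhead the text describes.
 * Returns 0 on success, -1 on failure. */
static int read_proc_cpu(int pid, unsigned long *utime, unsigned long *stime)
{
    char path[64], comm[64], state;
    int ppid;
    FILE *f;

    snprintf(path, sizeof(path), "/proc/%d/stat", pid);
    f = fopen(path, "r");
    if (!f)
        return -1;
    if (fscanf(f,
               "%*d (%63[^)]) %c %d %*d %*d %*d %*d"
               " %*u %*u %*u %*u %*u %lu %lu",
               comm, &state, &ppid, utime, stime) != 5) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return 0;
}
```

A monitor doing this for every process, every second, pays the cost of formatting the numbers as text in the kernel and parsing them back in user space; CIS's kernel probes avoid exactly this round trip by delivering binary data in one call.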

Therefore, CIS has monitoring probes in the kernel. The probes are implemented as kernel modules and work by wrapping some system calls. They provide monitors with information in binary form (the whole information in one system call). Using a special netlink device, they can inform a monitor whenever a monitored event occurs. Modifications to the original kernel sources are minimal (export of some symbols and some additional information in socket structures), so updating the kernel patch for a new kernel version (in the same series) is quite simple.

You can also run CIS without patching your kernel. Monitors will autodetect the absence of the kernel modules and fall back to /proc compatibility mode. The socket monitor won't work, but if you have embarrassingly parallel code, you won't need it anyway. The overhead of monitoring will be much higher, but if the CPU load on your nodes is relatively stable and the mean time between spawning two processes is high (say, hours), you can decrease the overhead by increasing the monitoring interval.
 

Information provided by CIS

Node information

CPU availability represents an estimated percentage of CPU time that can be allocated to a new process. It is computed from the CPU usage and priorities of the existing processes. Unlike load values, the reaction to changes is very fast, because recently created processes (with unknown CPU usage characteristics) are assumed to have 100% CPU demand.
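The source does not give the exact formula, but the idea can be sketched as follows: sum the CPU shares of known processes, treat newly created processes pessimistically as demanding 100%, and report what remains. The structure and function names are hypothetical, and the priority weighting that CIS also uses is omitted here:

```c
/* Hypothetical per-process sample as a monitor might report it. */
struct proc_sample {
    double cpu_usage;    /* measured CPU share in percent, 0..100  */
    int    is_new;       /* 1 => recently created, usage unknown   */
};

/* Estimate the percentage of CPU time a newly placed process
 * could get.  Recently created processes are assumed to demand
 * 100% of a CPU, so the estimate drops immediately when a process
 * is spawned, unlike load averages which adapt slowly. */
static double cpu_availability(const struct proc_sample *p, int n)
{
    double used = 0.0;
    for (int i = 0; i < n; i++)
        used += p[i].is_new ? 100.0 : p[i].cpu_usage;
    if (used > 100.0)
        used = 100.0;
    return 100.0 - used;
}
```

With two steady processes using 30% and 20%, the estimate is 50%; as soon as a new process appears, the estimate drops to 0% until its real usage is measured.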

Process information

Communication links (sockets)

Network devices

LM sensors

Features

Record/Replay

One of the most important features of CIS is the ability to save monitoring information into record files, so it is possible to get long-term characteristics of parallel applications (CPU, memory and network usage patterns). This can help a resource management system in task placement decisions. Monitoring information (both on-line and archived) can be visualized using the xcis client. CIS also contains utilities for processing record files (cut, merge, print in readable form).

Note: This is not intended for identical reruns of nondeterministic applications. If you need that, you will have to look for an application monitor with this feature.

Low overhead

The overhead depends on the CPU speed, the number of objects, and the number of detected events (object creation and termination) per monitoring interval. On an Intel Pentium III 550 MHz (quite common in today's clusters) with 384 MB of RAM, one 100 Mbit network interface, and the monitoring intervals shown below, the CPU and network overhead was:
state    interval  sysmon             procmon            sockmon            netmon             top
         [s]       %CPU   xfer [Bps]  %CPU   xfer [Bps]  %CPU   xfer [Bps]  %CPU   xfer [Bps]  %CPU   xfer [Bps]
empty    1         0.01   67          0.01   46          <0.01  119         <0.01  46          1.5    1743
loaded   1         0.01   75          0.01   96          <0.01  272         <0.01  45          1.66   1647
empty    0.1       0.11   750         0.13   491         0.05   1289        0.05   402         -      -
loaded   0.1       0.12   750         0.18   904         0.08   2021        0.06   402         -      -

Since the monitors were consuming less than one OS clock tick, correct measurement required patching the kernel for precise time accounting (see the links section). The precision of CPU measurements in CIS is 0.01%, so it was not possible to measure the CPU overhead of sockmon and netmon. In the empty state there were 13 processes (including the monitors) on the node. In the loaded state there were, in addition, 10 tasks computing and communicating in a cycle. For comparison, the overhead of the top utility with the same monitoring interval is included in the table. The node was quite heavily loaded, which is probably why the transfer rate of top is lower (it could not get as much CPU time as in the empty state).

Screenshots from xcis client


 

Availability

The CIS package is distributed under the terms of the GNU General Public License (GPL), except for the CIS library, which is distributed under the LGPL license. It runs on Linux, x86 architecture, kernel versions 2.2 and 2.4.
Download: cis-0.9.5.tar.gz  archive

The xcis client is written using Open Motif/LessTif and requires the Microline Widget Library, which is distributed under the Netscape Public License as part of the Mozilla source tree (mozilla-19981008.tar.gz). It also requires the plot_widget library.
Download: Microline3.0.tar.gz http://www-pat.fnal.gov/nirvana/plot_wid.html

Xcis is distributed under the GNU General Public License with a special exception that allows you to link the executable with the two libraries above.
Download: xcis-0.9.5.tar.gz

A modified resource manager for PVM (Parallel Virtual Machine) with load balancing based on CIS information is also available.
Download: srm-cis.tar.gz
 

Related links

Precise time accounting in Linux
SCMS - SMILE Cluster Management System
NWS - Network Weather Service