An Anatomy of the Gathering Subsystem
The monitoring framework is divided into three distinct parts:
- Gathering Subsystem
- Action Subsystem
- Driver
This blog post describes the rationale behind the code for the gathering subsystem.
How does it work?
Any OSCAR component that wants to be monitored or can generate monitoring data, such as heartbeat, IPMI, or PBS_MOM, provides an XML configuration file that specifies certain properties of that component. This XML file is located in the Gather subdirectory of the monitoring framework. A visual representation of the directory structure is thus:
When the gathering subsystem is initialized, it parses all the “config.xml” files, which contain the configuration data for each component. The parsing is done with the help of the parser. The structure of the XML file is discussed later.
“Gather.pm” receives a hash of values from the parser. From this information, it creates an Oscar_Monitor table in the OSCAR database with a column for each component to be monitored. The naming scheme for these columns is also discussed later. With the Oscar_Monitor table initialized, any component can then call the update method with an XML file and expect its data to be made available through the OSCAR database.
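The table setup described above can be sketched roughly as follows. This is a minimal Python illustration, not the actual Gather.pm code: it uses an in-memory SQLite database as a stand-in for the OSCAR database, and the `hostname` key column, the `TEXT` column type, and the function names are all assumptions made for the example.

```python
import re
import sqlite3

def column_name(component, function):
    # Naming scheme from the post: <component>_<function>
    return f"{component}_{function}"

def create_monitor_table(conn, configs):
    """Create Oscar_Monitor with one column per parsed component config.

    `configs` is a list of dicts standing in for the hash returned by the
    parser. The hostname key column is an assumption of this sketch.
    """
    columns = [column_name(c["component"], c["function"]) for c in configs]
    for name in columns:
        # Only allow plain identifiers, since the names are placed in SQL text.
        if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name):
            raise ValueError(f"invalid column name: {name!r}")
    definitions = ["hostname TEXT PRIMARY KEY"]
    definitions += [f'"{name}" TEXT' for name in columns]
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS Oscar_Monitor ({', '.join(definitions)})"
    )
    return columns

# Two hypothetical component configurations.
configs = [
    {"component": "C3", "function": "diskfree"},
    {"component": "Heartbeat", "function": "nodestatus"},
]
conn = sqlite3.connect(":memory:")
columns = create_monitor_table(conn, configs)
```

Storing every value as text keeps the sketch agnostic about what each component reports.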
While implementing the proposed framework, I found that my implementation could not handle monitoring data from simple commands. For example, if I wanted to monitor the disk space available on the compute nodes, I should be able to write a script that checks this and somehow puts the information in the monitoring database. Putting the information in the Oscar_Monitor table is trivial, but polling the script continuously at a fixed interval would require the script writer to add features that keep the script running indefinitely so that its monitoring information in the Oscar_Monitor table stays up to date. Monitoring 10 different components with the C3 suite of commands would require 10 different scripts running indefinitely, eventually taking their toll on the processor. To solve this problem, I created the oscargather daemon. It is independent of the monitoring framework but can connect to the gathering subsystem to determine which components, if any, need to be polled, and it calls each component's script every n seconds, where n is specified in the component's configuration file.
Starting or stopping the oscargather daemon is done with a simple “service oscargatherd start/stop”. Because each polled component is just a thread of execution in this daemon, all 10 components can be started or terminated cleanly with this one command. The oscargather daemon also plays another vital role: it allows components (polled and non-polled) to be added on the fly by simply restarting the service. Restarting the service causes the monitoring framework to reinitialize itself and therefore take the newly added components into account.
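The one-thread-per-polled-component idea can be sketched as below. This is a hedged Python illustration of the pattern, not the oscargather daemon itself: the dictionary keys (`name`, `script`, `time`, `polled`), the `on_output` callback, and the function names are assumptions made for the example, and the script is given as an argument list rather than a bare script name.

```python
import subprocess
import threading

def poll_component(name, script, interval, stop_event, on_output):
    """Run one component's script every `interval` seconds until stopped."""
    while not stop_event.is_set():
        result = subprocess.run(script, capture_output=True, text=True)
        # Hand the script output to the caller, e.g. to update the table.
        on_output(name, result.stdout.strip())
        # Sleep for the interval, but wake up early if a stop is requested.
        stop_event.wait(interval)

def start_polling(components, on_output):
    """Start one thread per polled component; skip non-polled ones."""
    stop_event = threading.Event()
    threads = []
    for comp in components:
        if not comp["polled"]:
            continue
        thread = threading.Thread(
            target=poll_component,
            args=(comp["name"], comp["script"], comp["time"],
                  stop_event, on_output),
            daemon=True,
        )
        thread.start()
        threads.append(thread)
    return stop_event, threads

def stop_polling(stop_event, threads):
    """Signal every polling thread to stop and wait for them to finish."""
    stop_event.set()
    for thread in threads:
        thread.join()
```

A single shared stop event is what lets one command end every polling thread cleanly. Adding a component then amounts to stopping, re-reading the configuration, and starting again.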
Structure of the “config.xml”
<?xml version='1.0'?>
<monitorconfig>
<component> string </component>
<function> string </function>
<polled> Boolean </polled>
<script> string </script>
<time> integer </time>
</monitorconfig>
Every component needs a config.xml file in the Gather directory.
Component: This specifies which component generates the monitoring data, for example “Heartbeat”, “C3”, “IPMI”…
Function: This is a short description of the nature of the data to be generated, for example nodestatus (for heartbeat) or diskfree (for C3). It is required because one component can generate monitoring data for two different purposes. For example, you can have C3_diskfree (cexec df -h) and C3_memoryfree (cexec free); in this case C3 is used in two different contexts. The naming scheme for each column in the Oscar_Monitor table is <component>_<function>, so each component gets a column named after its component name and function, separated by an underscore.
Polled: A component that wishes to be polled by the gathering daemon should set this value to true.
Script: All polled components must specify a script name in this tag.
Time: This is the polling interval, in seconds.
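To make the fields above concrete, here is a small Python sketch that reads a config.xml of this shape into a dictionary and derives the column name. It only illustrates the format and is not the framework's own parser: the script name `check_disk.sh`, the sample values, and the `parse_config` function are hypothetical.

```python
import xml.etree.ElementTree as ET

# Sample values are made up; check_disk.sh is a hypothetical script name.
SAMPLE_CONFIG = """<?xml version='1.0'?>
<monitorconfig>
  <component> C3 </component>
  <function> diskfree </function>
  <polled> true </polled>
  <script> check_disk.sh </script>
  <time> 60 </time>
</monitorconfig>"""

def parse_config(xml_text):
    """Parse one config.xml document into a plain dictionary."""
    root = ET.fromstring(xml_text)

    def text(tag):
        # Missing or empty tags become an empty string.
        return (root.findtext(tag) or "").strip()

    config = {
        "component": text("component"),
        "function": text("function"),
        "polled": text("polled").lower() == "true",
        "script": text("script"),
        "time": int(text("time") or 0),
    }
    # The post requires a script for every polled component.
    if config["polled"] and not config["script"]:
        raise ValueError("polled components must specify a script")
    # Column naming scheme: <component>_<function>
    config["column"] = f"{config['component']}_{config['function']}"
    return config

config = parse_config(SAMPLE_CONFIG)
```

Stripping whitespace matters here because the template writes values with padding inside the tags.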
Structure of the “data.xml”
<monitordata>
<column>C3_Diskfree</column>
<data>
<hostname>oscarnode2.xperiment</hostname>
<nodedata>diskfree 120GB</nodedata>
</data>
<data>
<hostname>oscarnode1.xperiment</hostname>
<nodedata>diskfree 140GB</nodedata>
</data>
<data>
<hostname>oscarnode3.xperiment</hostname>
<nodedata>diskfree 10GB</nodedata>
</data>
</monitordata>
Column: Specifies which column in the table to update.
Data: Each data tag contains two elements, hostname and nodedata. Hostname is simply the full host name of a compute node. The nodedata tag, on the other hand, is an anonymous data type, meaning it can store data of any type. This allows any generic data to be entered into the monitor table.
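As a rough illustration of how such a data.xml could be consumed, the Python sketch below extracts the column and the per-host values and writes them into a table. It is not the framework's update method: SQLite stands in for the OSCAR database, and the one-row-per-hostname layout and the function names are assumptions of this sketch.

```python
import re
import sqlite3
import xml.etree.ElementTree as ET

SAMPLE_DATA = """<monitordata>
  <column>C3_Diskfree</column>
  <data>
    <hostname>oscarnode2.xperiment</hostname>
    <nodedata>diskfree 120GB</nodedata>
  </data>
  <data>
    <hostname>oscarnode1.xperiment</hostname>
    <nodedata>diskfree 140GB</nodedata>
  </data>
  <data>
    <hostname>oscarnode3.xperiment</hostname>
    <nodedata>diskfree 10GB</nodedata>
  </data>
</monitordata>"""

def parse_data(xml_text):
    """Return (column, {hostname: nodedata}) from a data.xml document."""
    root = ET.fromstring(xml_text)
    column = (root.findtext("column") or "").strip()
    rows = {}
    for data in root.findall("data"):
        host = (data.findtext("hostname") or "").strip()
        rows[host] = (data.findtext("nodedata") or "").strip()
    return column, rows

def update_monitor(conn, column, rows):
    """Store each host's nodedata as text in the named column."""
    # The column name goes into the SQL text, so restrict it to identifiers.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", column):
        raise ValueError(f"invalid column name: {column!r}")
    for host, value in rows.items():
        conn.execute(
            "INSERT OR IGNORE INTO Oscar_Monitor (hostname) VALUES (?)", (host,)
        )
        conn.execute(
            f'UPDATE Oscar_Monitor SET "{column}" = ? WHERE hostname = ?',
            (value, host),
        )

column, rows = parse_data(SAMPLE_DATA)
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE Oscar_Monitor (hostname TEXT PRIMARY KEY, "C3_Diskfree" TEXT)'
)
update_monitor(conn, column, rows)
```

Because nodedata is untyped, the sketch stores it as a plain string and leaves interpretation to whoever reads the table.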
A full diagram of the whole system (without an implemented Action subsystem)
