SMT Monitoring

Silicon detector monitoring is a multi-host system. Its purpose is to collect data describing the current status of
the hardware, and to generate and send alarm messages to the Significant Event System.

The monitoring system consists of the following applications:


                                                                      What to do when

If you need to stop monitoring:

  • log in to 'd0ol28' as 'd0run'
  • cd '/home/d0run/onlineMonitoring'
  • source 'STOP_MONITORING.tcsh'
  • check that no 'SMTDataStore' processes are left: 'ps -aux | grep SMTDataStore'
  • if you see any 'SMTDataStore' processes, kill them: kill -9 'process_number'

If you need to start monitoring:

  • log in to 'd0ol28' as 'd0run'
  • cd '/home/d0run/onlineMonitoring'
  • setup d0online
  • check that no 'SMTDataStore' processes are left: 'ps -aux | grep SMTDataStore'
  • if you see any 'SMTDataStore' processes, kill them: kill -9 'process_number'
  • restart monitoring: source 'START_MONITORING.tcsh'
  • check that you can access 'http://www-d0ol.fnal.gov/smtMonitoring/'
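
Note that 'ps -aux | grep SMTDataStore' also lists the 'grep' command itself, which is not a process to kill. A minimal Python sketch of the check is shown below. It assumes 'ps aux'-style output with the usual eleven columns; the function name 'find_pids' and the sample text are made up for illustration and are not part of the monitoring package.

```python
def find_pids(ps_output, name="SMTDataStore"):
    """Return PIDs of processes whose command line mentions `name`.

    Assumes `ps aux`-style text: a header line, then
    USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND.
    """
    pids = []
    for line in ps_output.splitlines()[1:]:   # skip the header line
        fields = line.split(None, 10)         # 11th field is the full command
        if len(fields) < 11:
            continue
        command = fields[10]
        # The 'grep' process matches the name too; it is not the application.
        if name in command and not command.startswith("grep"):
            pids.append(int(fields[1]))
    return pids


# Made-up sample output, for illustration only.
sample = """USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
d0run 4121 0.3 1.2 50000 12000 ? Sl 10:02 0:07 ./SMTDataStore
d0run 4188 0.0 0.0 1500 500 pts/1 S+ 10:15 0:00 grep SMTDataStore
"""
leftover = find_pids(sample)
print(leftover)
```

Only the PIDs returned here would be candidates for 'kill -9'.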

If a new message has been sent to the Significant Event System:

  • confirm that examine reports the same problem
  • get more information about the problematic HDI/chip:
  • using the GUI on the left-hand side, select the part of the detector you are interested in (crate/VRB/HDI)
  • using the GUI on the right-hand side of 'http://www-d0ol.fnal.gov/smtMonitoring/', select the information you need (signal/hit number/occupancy)
  • look for problems: no events, signal 'zero', high occupancy

If you think that monitoring does not send information about a faulty system to the Significant Event System:

  • on the page 'http://www-d0ol.fnal.gov/smtMonitoring/' check 'IOC DATA UPDATE TIME' and 'SES UPDATE TIME'. Both times should be close to the current time. The first one shows when the last data transfer from the IOC was done, whereas the second one shows when the last message was sent to the Significant Event System.
  • if data is not collected from an IOC and you are not an expert: PAUSE the run, reboot the PowerPC, start monitoring for the IOC from the SMT GUI
  • if data is not collected from an IOC and you are an EXPERT: log in to that IOC, look for suspended processes (VxWorks command 'i'), look into the log file '/home/d0run/onlineMonitoring/IOC_d0olsmtXX_ConnectionLog.txt'
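
The "close to the current time" check can be written down as a simple comparison. The sketch below is only an illustration: the 5-minute tolerance is an assumption (the text only says that data is requested every minute), and the times are passed in as already parsed values because the format shown on the page is not specified here.

```python
from datetime import datetime, timedelta


def is_stale(last_update, now, max_age=timedelta(minutes=5)):
    """Return True if `last_update` is older than `max_age` relative to `now`.

    The 5-minute default is a hypothetical tolerance, not a documented limit.
    """
    return now - last_update > max_age


# Illustrative values only.
now = datetime(2003, 1, 1, 12, 10)
ioc_update = datetime(2003, 1, 1, 12, 0)   # 10 minutes old
print(is_stale(ioc_update, now))
```

A stale 'IOC DATA UPDATE TIME' points at the data transfer from the IOC, which is the case handled by the two items above.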

                                                                        What experts should know

    The Linux box application consists of the following threads:

    When the application starts it creates a data structure to store SMT hardware data. The data structure is created
    based on the current content of the online database. OCI Oracle functions are used to get information about
    all the SMT crates, VRBs, HDIs and chips from the online Oracle database. Each HDI and each chip has tables
    that describe signals, hit count and occupancy.

    The 'OCIConnection' thread waits for incoming connection requests from the front-end processors.
    As soon as these connection requests appear they are accepted, and every 1 minute (currently set) a data
    request is sent to the front-end processors by the 'OCIConnection' thread. Received data is stored in the previously
    created data structure. Connections are socket based.

    The 'shmServer' thread serves data to Java servlet requests. The connection is socket based. A new socket
    connection is opened for each new data request. The connection is closed as soon as the requested data is served.
    Two files are written to disk for each data request: an html file and a php file. The html file
    contains information about how many separate images are in the php file. The php file contains the actual
    data, and the histograms are created from that file on the fly in the WWW browser.
    After a data request has been successfully completed, the names of the existing html files, including the newly created
    one, are sent to the Java servlet client. The Java servlet displays them in the browser.

    The two threads 'shmServer' and 'OCIConnection' share the same data structures. They are synchronized
    using a set of semaphores. There is one semaphore for each SMT crate, which synchronizes access to the data of that crate.
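
    The per-crate synchronization can be sketched as follows. This is an illustration in Python, not the actual C++ code; the class name 'CrateStore', its methods and the crate names are hypothetical.

```python
import threading


class CrateStore:
    """Shared hardware data with one binary semaphore per crate (a sketch)."""

    def __init__(self, crate_names):
        # One semaphore per crate, so work on different crates does not block.
        self._locks = {name: threading.Semaphore(1) for name in crate_names}
        self._data = {name: {} for name in crate_names}

    def update(self, crate, values):
        """Writer side: store freshly received data for one crate."""
        with self._locks[crate]:
            self._data[crate].update(values)

    def snapshot(self, crate):
        """Reader side: copy one crate's data while holding its semaphore."""
        with self._locks[crate]:
            return dict(self._data[crate])


store = CrateStore(["crateA", "crateB"])          # hypothetical crate names
writer = threading.Thread(target=store.update, args=("crateA", {"hits": 12}))
writer.start()
writer.join()
print(store.snapshot("crateA"))
```

    With one semaphore per crate, a reader that is serving data for one crate does not hold up an update of another crate.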
     

    The 'SESConnection' thread checks hardware data every 10,000 events (currently set) and generates
    alarms if occupancy exceeds 25% (currently set) or if an HDI or a chip is dead (signal from all the
    strips is zero). Messages are generated based on the content of the same data structure that is used by
    the 'OCIConnection' thread. A semaphore is used to ensure that the data structure is not overwritten while the check is being done.
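
    The two alarm conditions can be expressed as a short check. The sketch below is an illustration only: the function name 'check_unit', the message texts and the input layout (a list of strip signals plus an occupancy fraction) are assumptions, while the 25% limit is the value quoted above.

```python
OCCUPANCY_LIMIT = 0.25   # 25%, the currently set limit quoted in the text


def check_unit(name, strip_signals, occupancy, limit=OCCUPANCY_LIMIT):
    """Return a list of alarm messages for one HDI or chip.

    `strip_signals` is a list of per-strip signal values and `occupancy`
    is a fraction between 0 and 1; both layouts are assumptions.
    """
    alarms = []
    # Dead unit: the signal from all the strips is zero.
    if strip_signals and all(s == 0 for s in strip_signals):
        alarms.append(f"{name}: dead (signal from all strips is zero)")
    # High occupancy: strictly above the limit.
    if occupancy > limit:
        alarms.append(f"{name}: occupancy {occupancy:.0%} exceeds {limit:.0%}")
    return alarms


print(check_unit("HDI-1", [0, 0, 0], 0.0))
print(check_unit("HDI-2", [3, 7, 5], 0.40))
```

    An empty list means that no message would be generated for that HDI or chip.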
     

    Java servlets (www-d0ol/smtMonitoring) are served by the WWW server and are used to display the data histograms (signal, hit count, occupancy).

    Using the GUI shown on the left-hand side of the WWW page, you can select the parts of the detector you are interested in. Information can be obtained only for those HDIs that are NOT marked as 'disabled'.
    Below the tree-like menu of existing SMT hardware there is the 'IOC DATA UPDATE TIME' list, which tells you when the last data was collected from a particular IOC. Based on that time one can easily
    figure out whether the data store contains updated data for the current run.

    Below the 'IOC DATA UPDATE TIME' list there is the 'SES UPDATE TIME' list for all IOCs.
    Dates and times in that list show when the last hardware check was done and when the last
    message was sent to the Significant Event System.
     

    The GUI on the right-hand side of the WWW page allows you to obtain information about SMT hardware in text and
    graphical formats.

    Meaning of the selection criteria for text-like output:

    Before you push 'DataAction', you must make your selections. The easiest way to fill in the CrateName,
    VRBName and HDIName is to click on a crate, then a VRB and then an HDI in the left-hand WWW window.
    If you want something at the SVX level, use the pull-down menu under SVXNumber.
    After 'DataAction' is pushed, the selected data will appear in a scrollable widget that is drawn below
    that button.
     

    To display data in the form of histograms one needs to fill in the table below 'SMT MONITORING GRAPHICS'.
    Meaning of the selection criteria for graphics-type output:

    After the 'DataAction' button is pushed, the selected data will appear in a new browser window.
     

    Front-end processor server: this is a part of the SMT front-end processor code. As soon as 'SMT Monitoring'
    is turned ON in the CREATOR window, the SDAQ Supervisor passes all the run commands from COOR to the front-end
    processors. At the beginning of each run the commands INIT and START are sent to the SMT front-end
    processor. When the INIT command comes, the initialization procedure is called. When the START command comes, data collection starts.
    As soon as initialization is completed the monitoring server tries to connect to the Linux box application
    (IOCConnection thread). At that point the IOCConnection thread takes over and asks the server for data
    every 1 minute (currently set). The connection is socket based.
    As soon as the run is stopped, the STOP command is passed via the SDAQ Supervisor to the front-end and the monitoring
    server breaks the connection.
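
    The command sequence above can be summarized as a small state machine. The sketch below is an illustration in Python, not the front-end code; the class name and the state names 'idle', 'connected' and 'collecting' are hypothetical.

```python
class MonitoringServer:
    """Reaction of the front-end monitoring server to run commands (a sketch)."""

    def __init__(self):
        self.state = "idle"

    def handle(self, command):
        if command == "INIT":
            # Initialization, followed by the connection to the Linux box.
            self.state = "connected"
        elif command == "START" and self.state == "connected":
            # Data collection starts only after a successful INIT.
            self.state = "collecting"
        elif command == "STOP":
            # The run is stopped and the connection is broken.
            self.state = "idle"
        return self.state


server = MonitoringServer()
for command in ("INIT", "START", "STOP"):
    print(command, "->", server.handle(command))
```

    In this sketch a START that arrives without a preceding INIT is ignored; how the real server treats that case is not stated here.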

    Hardware data for monitoring purposes can be collected from:

    Collected hardware data is unpacked and stored in histograms for each HDI and each chip. On a request
    from the IOCConnection thread that data is copied to a buffer and sent back to the IOCConnection thread.
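
    The histogram-to-buffer step can be sketched as follows. The actual buffer layout is not described here, so the layout used below (a little-endian bin count followed by the bin contents as unsigned 32-bit integers) is purely an assumption, as are the function names and the value range.

```python
import struct


def fill_histogram(values, n_bins, lo, hi):
    """Count `values` into `n_bins` equal bins covering [lo, hi)."""
    counts = [0] * n_bins
    for v in values:
        if lo <= v < hi:                       # ignore out-of-range values
            counts[int((v - lo) / (hi - lo) * n_bins)] += 1
    return counts


def pack_histogram(counts):
    """Serialize a histogram: bin count, then bin contents (assumed layout)."""
    return struct.pack(f"<I{len(counts)}I", len(counts), *counts)


def unpack_histogram(buf):
    """Inverse of pack_histogram, as the receiving thread might apply it."""
    (n_bins,) = struct.unpack_from("<I", buf, 0)
    return list(struct.unpack_from(f"<{n_bins}I", buf, 4))


counts = fill_histogram([0, 1, 1, 3], n_bins=4, lo=0, hi=4)
buffer = pack_histogram(counts)
print(len(buffer), unpack_histogram(buffer))
```

    Sending the bin count first lets the receiver size its read without knowing the histogram length in advance.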
     
     

    Current setup.
    All of the software is stored in the CVS repository in the following directories:


    The SMT online monitoring application runs on the D0 online host 'd0ol28.fnal.gov' under the 'd0run' account.
    The application is located in the directory '~d0run/onlineMonitoring'.

    Since the graphs are created using PHP, the WWW server needs to support PHP.
    The version currently installed is 'php-4.1.2'.

    To create the graphs an additional package, 'jpgraph-1.6.1' (written in PHP), is used. It is installed on a disk
    accessible from the online cluster, in the directory `/d0usr/products/jpgraph'.
    The application contains links to that location.

    The SMT online monitoring application should be started and stopped using the official D0 scripts start_daq/stop_daq.
    Using these scripts ensures that only one instance of the application runs at a time.

    Several log files are created in the directory `~d0run/onlineMonitoring/':

    Every time the application is stopped, the log files are copied to the '/projects/smtMonitoring/log/' directory
    and tagged with the current date and time.

    All 'html' and 'php' files that contain created graphs are in the directory '/projects/smtMonitoring/jpgraph_cache'.

    Files older than 1 month are purged from both directories, '/projects/smtMonitoring/log/' and '/projects/smtMonitoring/jpgraph_cache', by an automatic cron job.
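
    The purge logic can be sketched as follows. This is an illustration only and not the actual cron job: it assumes that "1 month" means 30 days and that file age is judged by modification time, and it runs on a temporary directory rather than on the real ones.

```python
import os
import tempfile
import time


def purge_old_files(directory, max_age_days=30, now=None):
    """Delete regular files in `directory` older than `max_age_days`.

    Age is taken from the modification time (an assumption).
    Returns the sorted names of the removed files.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = []
    for entry in list(os.scandir(directory)):
        if entry.is_file() and entry.stat().st_mtime < cutoff:
            os.remove(entry.path)
            removed.append(entry.name)
    return sorted(removed)


# Demonstration on a temporary directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    old_file = os.path.join(tmp, "old.html")
    new_file = os.path.join(tmp, "new.html")
    for path in (old_file, new_file):
        open(path, "w").close()
    forty_days_ago = time.time() - 40 * 86400
    os.utime(old_file, (forty_days_ago, forty_days_ago))
    removed = purge_old_files(tmp)
    remaining = sorted(os.listdir(tmp))
print(removed, remaining)
```

    Subdirectories are left untouched in this sketch.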
     

    Compilation and Linking.
    In order to compile the C++ code ('onl_smtcalib/ssdaq/smtStore/monitoring') one needs to 'setup D0RunII p13.10.00', or a later version if it is backward compatible.

    In order to compile the Java servlets ('onl_smtcalib/ssdaq/smtStore/servlet') one needs to
    'setup tomcat' and 'setup java v1_3_1_02'. Compilation is done automatically by the 'makefile'.
    The compiled 'class' files are copied to `www/WEB-INF/classes/' and the 'jar' library is copied to 'www/WEB-INF/lib/'.
    After successful compilation one needs to copy the 'class' and 'jar' files to
    the well-known locations of the 'tomcat' server. Currently these are:
    '/projects/elog/jakarta-tomcat-3.2.1/webapps/smtMonitoring/WEB-INF/classes/' and
    '/projects/elog/jakarta-tomcat-3.2.1/webapps/smtMonitoring/WEB-INF/lib/'

    In order to compile the front-end part, change directory to 'onl_smtcalib/ssdaq/smtStore/frondent'
    and use the existing makefile, which does the whole job for you.