MultiTherMon

Project/Tool scope:

MultiTherMon is a scalable, modular, and lightweight monitoring framework for fine-grained, accurate monitoring of power, energy, thermal, and architectural parameters in distributed, large-scale high-performance computing (HPC) installations. It takes advantage of the hardware monitoring support built into today’s processors and accelerators at the heart of the compute nodes. MultiTherMon targets fine-grained, multi-scale resource monitoring for power, energy, and thermal control up to the exascale. To this end, it leverages MQ Telemetry Transport (MQTT), a lightweight machine-to-machine publish/subscribe messaging protocol, as the backbone of a scalable and flexible infrastructure for multi-scale monitoring and control of future exascale supercomputers.

Features

The framework measures:

  • Per-core performance counters:
    • Instructions retired
    • Unhalted core clock cycles at the current frequency
    • Unhalted core clock cycles at the reference frequency
    • Core temperature
    • Time stamp counter
  • Per-CPU/Socket energy/power consumption
  • DRAM energy/power consumption
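
The raw counters above are typically combined into derived metrics on the consumer side. The following is a minimal sketch of that arithmetic, not the framework's own code. The names `CoreSample`, `derive_core_metrics`, and `average_power_w` are hypothetical, and the sketch assumes that the reference-frequency cycle counter ticks at the same rate as the time stamp counter and that energy readings are already expressed in joules.

```python
from dataclasses import dataclass


@dataclass
class CoreSample:
    """One reading of the per-core counters (field names are hypothetical)."""
    tsc: int       # time stamp counter
    instr: int     # instructions retired
    clk_curr: int  # unhalted core cycles at the current frequency
    clk_ref: int   # unhalted core cycles at the reference frequency


def derive_core_metrics(prev, curr, tsc_hz):
    """Derive IPC, load, and effective frequency from two consecutive samples.

    Assumption: clk_ref advances at the TSC rate while the core is unhalted.
    """
    d_tsc = curr.tsc - prev.tsc
    if d_tsc <= 0:
        raise ValueError("samples must be ordered in time")
    d_instr = curr.instr - prev.instr
    d_curr = curr.clk_curr - prev.clk_curr
    d_ref = curr.clk_ref - prev.clk_ref
    return {
        # Instructions per unhalted core cycle.
        "ipc": d_instr / d_curr if d_curr else 0.0,
        # Fraction of the interval the core spent unhalted.
        "load": d_ref / d_tsc,
        # Average clock frequency while the core was unhalted.
        "freq_hz": tsc_hz * d_curr / d_ref if d_ref else 0.0,
    }


def average_power_w(energy_start_j, energy_end_j, interval_s):
    """Average power over an interval from two cumulative energy readings."""
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    return (energy_end_j - energy_start_j) / interval_s
```

For example, two samples taken 2,000,000 TSC ticks apart with 3,000,000 retired instructions and 1,500,000 unhalted core cycles give an IPC of 2. Counter wrap-around is deliberately left out of this sketch and would need handling in practice.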

The measured data are sent over the network using the MQTT protocol (TCP/IP).
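
To illustrate how a measurement might map onto MQTT's publish/subscribe model, here is a minimal sketch. The topic hierarchy (`cluster/.../node/.../core/.../<metric>`), the `value;timestamp` payload, and the names `build_message`, `publish_sample`, and `RecordingClient` are all assumptions made for this example; they are not the framework's documented format.

```python
import time


def build_message(cluster, node, core, metric, value, timestamp=None):
    """Return a (topic, payload) pair for one measurement.

    The topic layout and payload format are hypothetical, chosen only
    to show how a hierarchical MQTT topic can identify a metric.
    """
    if timestamp is None:
        timestamp = time.time()
    topic = f"cluster/{cluster}/node/{node}/core/{core}/{metric}"
    payload = f"{value};{timestamp:.3f}"
    return topic, payload


def publish_sample(client, cluster, node, core, metric, value, timestamp=None):
    """Publish one measurement via any object exposing publish(topic, payload)."""
    topic, payload = build_message(cluster, node, core, metric, value, timestamp)
    client.publish(topic, payload)
    return topic, payload


class RecordingClient:
    """Stand-in for a real MQTT client; it only records what would be sent."""

    def __init__(self):
        self.sent = []

    def publish(self, topic, payload):
        self.sent.append((topic, payload))


if __name__ == "__main__":
    client = RecordingClient()
    publish_sample(client, "galileo", "node001", 3, "temp", 54)
    print(client.sent)
```

With a real broker, a connected client from an MQTT library (for instance the `Client` class of paho-mqtt, which provides a `publish(topic, payload)` method) could be passed in place of `RecordingClient`. A hierarchical layout like this would let subscribers use MQTT's standard `+` and `#` wildcards to select, say, one metric across all cores of a node.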

MultiTherMon is currently used in several HPC installations:

  • Eurora @ CINECA
  • GALILEO @ CINECA