MultiTherMon is a scalable, modular, and lightweight monitoring framework for fine-grained, accurate monitoring of power, energy, thermal, and architectural parameters in distributed, large-scale high-performance computing (HPC) installations. It takes advantage of the built-in hardware monitoring support in today's processors and accelerators at the heart of the computing nodes. MultiTherMon targets fine-grained, multi-scale resource monitoring for power, energy, and thermal control up to the exascale. To this end, it uses MQ Telemetry Transport (MQTT), a lightweight machine-to-machine publish/subscribe messaging protocol, as the backbone of a scalable and flexible infrastructure for multi-scale monitoring and control of future exascale supercomputers.
The framework measures:
- Per-core performance counters:
    - Instructions retired
    - Un-halted core clock cycles at the current frequency
    - Un-halted core clock cycles at the reference frequency
    - Core temperature
    - Time stamp counter
- Per-CPU/Socket energy/power consumption
- DRAM energy/power consumption
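
Raw counters like those listed above are usually consumed as differences between two consecutive samples. The sketch below is not MultiTherMon code: the `CoreSample` type, its field names, and both helper functions are hypothetical. It assumes the reference-cycle counter ticks at the same rate as the time stamp counter and that no counter wrapped between the two samples.

```python
from dataclasses import dataclass


@dataclass
class CoreSample:
    """One reading of the per-core counters (hypothetical field names)."""
    tsc: int            # time stamp counter
    instr_retired: int  # instructions retired
    clk_unhalted: int   # un-halted core cycles at the current frequency
    clk_ref: int        # un-halted core cycles at the reference frequency


def derive_core_metrics(prev, curr, tsc_hz):
    """Derive per-interval metrics from two consecutive samples.

    Assumes the reference cycles tick at tsc_hz and ignores counter wraparound.
    """
    d_tsc = curr.tsc - prev.tsc
    if d_tsc <= 0:
        raise ValueError("samples must be ordered in time")
    d_instr = curr.instr_retired - prev.instr_retired
    d_clk = curr.clk_unhalted - prev.clk_unhalted
    d_ref = curr.clk_ref - prev.clk_ref
    return {
        # wall-clock length of the sampling interval
        "interval_s": d_tsc / tsc_hz,
        # instructions per un-halted core cycle
        "ipc": d_instr / d_clk if d_clk else 0.0,
        # average clock frequency while the core was not halted
        "freq_hz": tsc_hz * d_clk / d_ref if d_ref else 0.0,
        # fraction of the interval the core spent un-halted
        "load": d_ref / d_tsc,
    }


def average_power_w(energy_prev_j, energy_curr_j, interval_s):
    """Average power over an interval from two cumulative energy readings in joules."""
    return (energy_curr_j - energy_prev_j) / interval_s
```

The same difference-over-interval pattern applies to the per-socket and DRAM energy readings: dividing the energy consumed between two samples by the interval length gives the average power.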
The measured data are sent over the network using the MQTT protocol (TCP/IP).
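
To illustrate the publish/subscribe model, the sketch below maps one set of per-core readings to MQTT-style (topic, payload) pairs. The topic hierarchy, the JSON payload layout, the `multithermon` prefix, and the `to_mqtt_messages` helper are all assumptions made for this example and are not taken from MultiTherMon itself. The code only builds the messages; sending them requires an MQTT client library.

```python
import json
import time


def to_mqtt_messages(node, core_id, metrics, timestamp=None, prefix="multithermon"):
    """Build (topic, payload) pairs for one core's readings.

    The topic layout and payload format are hypothetical, chosen for illustration.
    """
    ts = time.time() if timestamp is None else timestamp
    messages = []
    for name, value in sorted(metrics.items()):
        # One topic per metric lets a subscriber select data with MQTT wildcards,
        # for example "multithermon/node/+/core/+/temp_c".
        topic = f"{prefix}/node/{node}/core/{core_id}/{name}"
        payload = json.dumps({"value": value, "timestamp": ts})
        messages.append((topic, payload))
    return messages


if __name__ == "__main__":
    for topic, payload in to_mqtt_messages("node01", 3, {"temp_c": 55, "ipc": 1.5}):
        print(topic, payload)
```

With a client such as Eclipse Paho, each pair could then be handed to the client's `publish(topic, payload)` call. Giving every metric its own topic is one possible design: it keeps payloads small and lets each consumer subscribe only to the nodes, cores, or metrics it needs.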
MultiTherMon is currently used in several HPC installations:
- Eurora @ CINECA
- GALILEO @ CINECA