HPC Environment Management: New Challenges in the Petaflop Era
Jonas Dias (High Performance Computing Center, COPPE/Federal University of Rio de Janeiro)
Albino Aveleda (High Performance Computing Center, COPPE/Federal University of Rio de Janeiro)
Abstract:
High Performance Computing (HPC) is becoming increasingly popular. The largest supercomputers in the world now have hundreds of thousands of processors and are consequently more prone to software and hardware failures. HPC center managers must also deal with multiple clusters from different vendors, each with its own architecture. Because there are not enough HPC experts to manage all the new supercomputers, non-experts are expected to manage these large clusters. In this paper we study the new challenges of managing HPC environments that contain clusters of different sizes and architectures. We review the available tools and present LEMMing [1], an easy-to-use open source tool developed to support high performance computing centers. LEMMing integrates machine resources and the available management and monitoring tools into a single point of management.
Keywords:
Parallel and Distributed Computing, Management Application, Management Tools