Keynote

Keynote Speaker

Professor Anne Benoît, ENS Lyon, France

Optimization problems in the presence of failures on large-scale parallel systems

Abstract: Large-scale parallel systems comprise millions of components, which raises two major challenges: resilience and energy consumption. Resilience is (loosely) defined as the ability to survive failures. Failures are usually handled by adding redundancy, either continuously (replication) or at periodic intervals (migration from a faulty node to a spare node, rollback and recovery). In the latter case, the state of an application must be preserved (checkpointing), and after a failure the system must roll back to the last saved checkpoint. However, the amount of replication and the frequency of checkpointing must be optimized carefully, and we will discuss how to choose the optimal checkpointing interval. We will also discuss the second important challenge, power consumption. Power management is necessary because of both monetary and environmental constraints. Energy is needed to power the individual cores and also to cool the system. Dynamic voltage and frequency scaling (DVFS) is a widely used technique to decrease energy consumption, but it can severely degrade performance and increase execution time.
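
As an illustration of the checkpointing-interval question raised in the abstract, the sketch below uses the classical first-order Young/Daly approximation, in which the period between checkpoints is sqrt(2 * C * mu) for a checkpoint cost C and a platform mean time between failures mu. This is only one well-known answer and not necessarily the approach presented in the talk. The function names and the sample values (a 60-second checkpoint, a one-day MTBF) are assumptions chosen for illustration, and the waste model ignores downtime and recovery time.

```python
import math


def young_daly_period(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order Young/Daly checkpointing period: sqrt(2 * C * mu).

    checkpoint_cost_s: time to write one checkpoint (C), in seconds.
    mtbf_s: platform mean time between failures (mu), in seconds.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def first_order_waste(period_s: float, checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Approximate fraction of time lost with period T: C / T + T / (2 * mu).

    The first term is checkpoint overhead in failure-free execution; the
    second is the expected re-execution after a failure (half a period on
    average). Downtime and recovery time are ignored (simplifying assumption).
    """
    return checkpoint_cost_s / period_s + period_s / (2.0 * mtbf_s)


# Hypothetical sample values: 60 s per checkpoint, MTBF of one day.
C = 60.0
MU = 24 * 3600.0

T_OPT = young_daly_period(C, MU)
print(f"period: {T_OPT:.0f} s, waste: {first_order_waste(T_OPT, C, MU):.2%}")
```

With these sample values the period comes out at roughly 3220 seconds, or about 54 minutes. Checkpointing much more often pays the overhead C too frequently, while checkpointing much less often loses more work per failure; the formula balances the two terms.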
