Keynote Speaker
Professor Anne Benoît, ENS Lyon, France

Abstract: Large-scale parallel systems include millions of components, which raises two major problems: resilience and energy consumption. Resilience is (loosely) defined as surviving failures. Failures are usually handled by adding redundancy, either continuously (replication) or at periodic intervals (migration from a faulty node to a spare node, rollback and recovery). In the latter case, the state of an application must be preserved (checkpointing), and after a failure the system must roll back to the last saved checkpoint. However, the amount of replication and/or the frequency of checkpointing must be optimized carefully, and we will discuss how to optimally decide the checkpointing interval. We will also discuss the second important challenge, power consumption. Power management is necessary because of both monetary and environmental constraints. Energy is needed to power the individual cores and also to cool the system. Dynamic voltage and frequency scaling (DVFS) is a widely used technique to decrease energy consumption, but it can severely degrade performance and increase execution time.
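
To give a concrete feel for the checkpointing trade-off mentioned in the abstract, here is a minimal sketch of the classical first-order approximation usually attributed to Young and Daly. It is an illustration only, not necessarily the model presented in the talk. It assumes that failures follow an exponential distribution, that the checkpoint cost is small compared with the mean time between failures (MTBF), and that downtime and recovery costs are ignored. The function and parameter names are hypothetical.

```python
import math


def young_daly_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order Young/Daly approximation of the checkpointing period (seconds).

    Assumes checkpoint_cost_s is much smaller than mtbf_s; downtime and
    recovery time are ignored in this simplified model.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def expected_waste(period_s: float, checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order fraction of time lost for a given checkpointing period.

    The first term is the overhead of writing checkpoints; the second is the
    expected re-execution after a failure (half a period lost on average).
    """
    if period_s <= 0:
        raise ValueError("period must be positive")
    return checkpoint_cost_s / period_s + period_s / (2.0 * mtbf_s)


if __name__ == "__main__":
    # Hypothetical platform: 10-minute checkpoints, one failure per day on average.
    cost, mtbf = 600.0, 86_400.0
    period = young_daly_interval(cost, mtbf)
    print(f"checkpoint roughly every {period:.0f} s")
    print(f"estimated waste: {expected_waste(period, cost, mtbf):.1%}")
```

Checkpointing more often than this period increases the overhead term, while checkpointing less often increases the expected amount of work lost per failure; under these assumptions the square-root formula balances the two.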