U.S. Department of Energy
Center for Adaptive Supercomputing - Multithreaded Architectures

Compiler and Runtime Enhancements

Experience with application performance on the Cray MTA-2 architecture needs to be refined for the Cray XMT machine. For instance, the XMT inter-node communication architecture differs substantially from that of the MTA-2: it is based on the Cray XT-4’s interconnect, which is designed for distributed-memory systems and applications that exploit locality. In light of these architectural changes, we are researching several enhancements to the MTA programming model and compilation process that should yield productivity and performance benefits. These extensions would increase the Cray compiler’s effectiveness for certain classes of applications through better thread management, locality optimizations, and options for deterministic scheduling of threads.

Research Areas Include:

Compiler thread generation and destruction exposed to the programmer

We believe that giving the programmer mechanisms to guide thread generation and destruction will improve performance in situations where default compiler policies make conservative assumptions that are detrimental to performance. We propose to introduce additional policies, either compiler- or programmer-directed, that lazily destroy thread-stream mappings to avoid overhead penalties. The pay-off is a performance improvement with minimal policy intervention on the programmer’s side.
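As a rough illustration of why lazy destruction can pay off, the following toy cost model counts how many thread-stream mappings are set up across a sequence of parallel regions under an eager and a lazy policy. The `StreamMapper` class, its fields, and the region widths are hypothetical names and values chosen for illustration; they do not describe the Cray compiler or runtime.

```python
class StreamMapper:
    """Toy model: counts thread-stream mapping set-ups under a teardown policy."""

    def __init__(self, lazy):
        self.lazy = lazy    # True: keep mappings alive between regions
        self.mapped = 0     # mappings currently alive
        self.setups = 0     # total set-ups performed (the overhead counted here)

    def parallel_region(self, width):
        # Create only the mappings that are not already alive.
        if width > self.mapped:
            self.setups += width - self.mapped
            self.mapped = width
        # ... the body of the parallel region would run here ...
        if not self.lazy:
            self.mapped = 0  # eager policy: destroy every mapping on exit


def total_setups(widths, lazy):
    """Run a sequence of parallel regions and return the set-up count."""
    mapper = StreamMapper(lazy)
    for width in widths:
        mapper.parallel_region(width)
    return mapper.setups


region_widths = [8, 8, 8]  # three back-to-back regions (hypothetical widths)
eager_cost = total_setups(region_widths, lazy=False)
lazy_cost = total_setups(region_widths, lazy=True)
print(eager_cost, lazy_cost)
```

In this model the lazy policy pays the set-up cost once, at the price of holding streams that a narrower region does not need. That is the kind of trade-off a compiler- or programmer-directed policy would have to weigh.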

Result reproducibility and/or accuracy of transient reductions

Concurrent streaming applications are not fully deterministic. For example, concurrent categorical data analysis on streaming network packets computes intermediate entropy results that are irreproducible from one run to the next. Here, non-deterministic thread activation leads to a deterministic final result but to non-deterministic transient results.

To enable this class of algorithms to produce deterministic transient results we propose to automatically trace the order of thread/input-data activations. The pay-off is better reproducibility and/or accuracy of transient reductions.
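The idea of tracing activations can be sketched as follows, assuming a much simplified setting in which each packet carries a single category and the "trace" is just the list of packet indices in the order they were processed. The function names, the sample packets, and the two orders are hypothetical; a real runtime would record the order itself instead of receiving it as an argument.

```python
import math
from collections import Counter


def entropy(counts, n):
    """Shannon entropy (bits) of the categories seen so far."""
    # Iterate in sorted key order so the summation order is fixed.
    return -sum((c / n) * math.log2(c / n) for _, c in sorted(counts.items()))


def running_entropy(packets, order):
    """Process packets in the given activation order; return every transient result."""
    counts = Counter()
    transients = []
    for n, idx in enumerate(order, start=1):
        counts[packets[idx]] += 1
        transients.append(entropy(counts, n))
    return transients


packets = ["tcp", "udp", "tcp", "icmp", "tcp", "udp"]

# Two runs whose threads happened to activate in different orders.
trace_run1 = [3, 0, 1, 5, 2, 4]   # recorded trace of the first run
order_run2 = [0, 1, 2, 3, 4, 5]   # a second, untraced run

run1 = running_entropy(packets, trace_run1)
run2 = running_entropy(packets, order_run2)
replay = running_entropy(packets, trace_run1)  # replay using the recorded trace

print(run1 != run2)          # transient results differ between the two runs
print(run1[-1] == run2[-1])  # the final result is the same
print(replay == run1)        # replaying the trace reproduces every transient
```

Recording only the activation order keeps the trace small compared with logging every intermediate value, which is one reason order tracing is a plausible mechanism here.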

Similarly, massively multithreaded numerical applications have a very high degree of parallel execution. While this parallelism is the primary source of execution efficiency, it also results in a high degree of variability in the order in which computations may be carried out. For such computations, a means must be provided to guide the compiler and/or runtime so that deterministic results are generated.

Which computations require a fixed order would be fairly difficult for a compiler to determine without guidance. It would be fairly straightforward, however, to add directives specifying that a particular section of code or set of statements be serialized to guarantee determinism. Determining the best mechanisms to employ is the subject of this research.
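The order sensitivity described above comes from the fact that floating-point addition is not associative. The sketch below sums the same four values in two different orders and gets two different answers, then shows a serialized reduction that always uses index order. The function names and the data are illustrative assumptions; the fixed index order merely stands in for what a serialization directive might enforce.

```python
def reduce_in_order(values, order):
    """Sum values following an explicit activation order (a list of indices)."""
    total = 0.0
    for i in order:
        total += values[i]
    return total


def serialized_reduce(values):
    """Deterministic variant: always accumulate in index order."""
    return reduce_in_order(values, range(len(values)))


values = [1e16, 1.0, -1e16, 1.0]

# Two possible thread activation orders over the same data.
sum_a = reduce_in_order(values, [0, 1, 2, 3])  # the first 1.0 is absorbed by 1e16
sum_b = reduce_in_order(values, [0, 2, 1, 3])  # the large terms cancel first

print(sum_a, sum_b)               # 1.0 2.0
print(serialized_reduce(values))  # 1.0 on every run
```

Serializing restores reproducibility but not necessarily accuracy: the index-order sum here is reproducible yet differs from the exact answer of 2, which is why the choice of mechanism remains open.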

Mechanisms for exploiting locality to improve scalability

The attractiveness of the MTA’s programming model comes at the price of forgoing any latency-reduction benefits that exploiting locality would bring. Scaling to a large number of processors with increased execution speeds imposes an even larger burden on the programmer and compiler to hide latency. We propose to enhance the MTA model with optional capabilities that can exploit locality to reduce this burden. The pay-off is a reduced concurrency requirement to achieve program speed-up, thus reducing the impact of Amdahl’s law caused by relying on a single acceleration technique, such as latency hiding. Thanks to the XMT’s programming model, it is always possible to revert to full latency-hiding techniques in cases where thread partitioning proves impractical or inefficient.
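The link between locality and the concurrency requirement can be made concrete with Little's law, which states that the number of operations in flight equals latency times issue rate. The sketch below applies it to a mix of local and remote memory references. The function name and the latencies of 10 and 100 cycles are made-up values for illustration, not measurements of the MTA-2 or XMT.

```python
def required_concurrency(local_fraction, local_latency, remote_latency, issue_rate=1.0):
    """Outstanding memory references needed to keep a processor busy (Little's law).

    local_fraction: share of references served locally (0.0 to 1.0)
    latencies:      in cycles; issue_rate: references issued per cycle
    """
    mean_latency = (local_fraction * local_latency
                    + (1.0 - local_fraction) * remote_latency)
    return mean_latency * issue_rate


# Hypothetical latencies: 10 cycles local, 100 cycles remote.
no_locality = required_concurrency(0.0, local_latency=10, remote_latency=100)
half_local = required_concurrency(0.5, local_latency=10, remote_latency=100)

print(no_locality, half_local)  # 100.0 55.0
```

Under these assumptions, serving half of the references locally nearly halves the concurrency the application must expose, while a local fraction of zero corresponds to relying on latency hiding alone.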
