Research Projects

Mantaro

Mantaro is a communication model that focuses on data movement optimizations for heterogeneous environments with specialized ISAs. It is aware of different execution models and heterogeneous memory hierarchies, and of their implications for the energy and time cost of data movement. In particular, Mantaro explores how existing communication semantics can be mapped to massively parallel processors such as GPUs. The initial focus was on a suitable communication architecture and on the implications of naming and ordering for two-sided communication paradigms. Mantaro builds on specialized communication models such as GGAS (among others) and combines them behind suitable user-level communication abstractions.

  • Current people: Holger Fröning (PI)
  • Previous people: Benjamin Klenk (PhD student), Arthur Kühlwein (graduate student), Daniel Schlegel (graduate student), Günther Schindler (graduate student)
  • Awards: Best paper award at IPDPS 2017, best paper finalist at ISC 2017
  • Collaborations: NVIDIA Research
  • Funding: German Excellence Initiative, NVIDIA
  • Dissemination (selection):
    • GTC 2017 talk: "S7300 - Managed Communication for Multi-GPU Systems" [recording] [pdf]
    • IPDPS 2017 conference paper: [pdf]
    • ISC 2017 conference paper: [pdf]
    • ISPASS 2015 conference paper: [pdf]
    • Invited talk at OMHI 2015 workshop at Euro-Par 2015: [pdf]
  • Since 2014

 

Mekong (formerly GCUDA)

Mekong: BSP-style programming for multi-GPU systems. The main objective of Mekong (formerly GCUDA) is to provide a simplified path for scaling the execution of GPU programs from one GPU to almost any number of GPUs, regardless of whether they are located within one host or distributed at cloud or cluster level. Unlike existing solutions, this work maintains the GPU's native programming model, which relies on bulk-synchronous, thread-collective execution; that is, no hybrid solutions such as OpenCL/CUDA programs combined with message passing are required. As a result, the simplicity and efficiency of GPU computing carry over to the scale-out case, together with high productivity and performance. Mekong employs polyhedral analysis and code generation to automatically generate the communication needed to resolve dependencies. Analyzing memory access patterns allows read and write sets to be identified, and simple set algebra is then used to resolve dependencies among kernels.
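The read/write-set idea can be illustrated with a minimal sketch. It is not Mekong's implementation: plain Python sets stand in for polyhedral sets, and the 3-point stencil, the contiguous partitioning and all function names are assumptions chosen for illustration. Each GPU writes its own block of an array; the data another GPU must receive before the next kernel is the intersection of one GPU's write set with the other GPU's read set.

```python
def partition(n, num_gpus):
    """Split indices 0..n-1 into contiguous blocks, one per GPU (hypothetical scheme)."""
    chunk = (n + num_gpus - 1) // num_gpus
    return [range(g * chunk, min((g + 1) * chunk, n)) for g in range(num_gpus)]


def stencil_read_set(block, n):
    """Read set of an assumed 3-point stencil: element i reads i-1, i and i+1."""
    return {j for i in block for j in (i - 1, i, i + 1) if 0 <= j < n}


def transfers(n, num_gpus):
    """Return {(src, dst): indices} that src must send to dst between two kernels."""
    blocks = partition(n, num_gpus)
    writes = [set(b) for b in blocks]                   # each GPU writes its own block
    reads = [stencil_read_set(b, n) for b in blocks]    # and reads block plus halo
    plan = {}
    for src in range(num_gpus):
        for dst in range(num_gpus):
            if src == dst:
                continue
            needed = writes[src] & reads[dst]           # set intersection = dependency
            if needed:
                plan[(src, dst)] = sorted(needed)
    return plan


if __name__ == "__main__":
    print(transfers(8, 2))
```

With 8 elements on 2 GPUs, only the two boundary elements need to be exchanged; everything else stays local.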

GCUDA received funding from Google in the form of a research award. Mekong has been granted additional BMBF funding. Since 2014.

  • Current people: Holger Fröning (PI), Alexander Matz (PhD student), Lorenz Braun (PhD student)
  • Previous people: Christoph Klein (graduate student), Dominik Sterk (graduate student), Dennis Rieber (graduate student)
  • Awards: Received a Google Faculty Research Award (2014)
  • Collaborations: EMCL lab at Heidelberg University, NVIDIA, ETHZ
  • Funding: Google Faculty Research Award, BMBF
  • Dissemination:
    • GTC 2018 talk: "GPU Mekong: Automated Multi-GPU Programming using Advanced Compilation Techniques" [recording]
    • MULTIPROG workshop paper at HiPEAC 2016: [pdf]
  • Read more: http://sites.google.com/site/gpumekong
  • Since 2015

 

DeepChip: Deep learning on resource-constrained systems

Many processes require the evaluation of complex numerical functions close to the machine or structure of interest, either to avoid the effort of data transfer or to enable short reaction times. Although the computing performance of embedded platforms is increasing, it is often significantly lower than what state-of-the-art algorithms require. With the advent of Deep Neural Networks (DNN), the achievable classification performance has been pushed to new levels. Their high cost of execution, however, renders them unusable for many real-world applications. A possible approach is the use of hybrid processors (ARM+FPGA or similar), but this raises the question of how to auto-generate optimized DNN classifier implementations. In the DeepChip project, we tackle this problem by optimizing deep models in terms of sparsity, asynchrony and reduced precision, and by extending machine learning languages with a hybrid back-end that is responsible for HDL code generation, automated partitioning and integration.

The DeepChip inference architecture is currently a software-only solution and can be applied to various architectures. In this context, we demonstrated that homogeneous ARM processors are actually quite suitable for DL inference, which is a surprising finding (UCHPC2017 paper). The DeepChip architecture is based on extreme forms of quantization in combination with compression and suitable data structures to minimize the computational and memory requirements of deep networks. A similar concept has already been validated for beamforming applications (ICASSP2018 paper).
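To give a feel for what extreme quantization buys, the following minimal sketch binarizes vectors to +1/-1, packs the signs into a bit mask, and evaluates a dot product with XOR and a bit count. This is a generic binarization scheme with a mean-magnitude scale factor, assumed here for illustration; it is not taken from the DeepChip papers, and all names are hypothetical.

```python
def binarize(values):
    """Quantize to +1/-1. Returns (packed sign bits, scale factor).

    Bit i is set when values[i] >= 0. The scale is the mean magnitude,
    an assumed choice that keeps the overall magnitude roughly intact.
    """
    bits = 0
    for i, v in enumerate(values):
        if v >= 0:
            bits |= 1 << i
    scale = sum(abs(v) for v in values) / len(values)
    return bits, scale


def binary_dot(a_bits, b_bits, n):
    """Dot product of two +1/-1 vectors of length n given as packed bits."""
    mask = (1 << n) - 1
    agree = bin(~(a_bits ^ b_bits) & mask).count("1")  # positions with equal sign
    return 2 * agree - n                               # agreements minus disagreements


if __name__ == "__main__":
    w = [0.5, -1.5, 2.0, -1.0]
    x = [1.0, -1.0, 1.0, -1.0]
    w_bits, w_scale = binarize(w)
    x_bits, x_scale = binarize(x)
    approx = w_scale * x_scale * binary_dot(w_bits, x_bits, len(w))
    exact = sum(a * b for a, b in zip(w, x))
    print(approx, exact)
```

Each weight occupies one bit instead of a 32-bit float, and the multiply-accumulate loop collapses into bitwise operations, which is why such schemes reduce both memory and compute requirements.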

This research line was initiated during a research stay as visiting professor at TU Graz. A DACH project, a collaboration with the group of Franz Pernkopf from the Technical University of Graz, has been funded and started in December 2016. In a second project, an FWF COMET flagship project, we collaborate with Manfred Mücke from Materials Center Leoben to extend this work to condition monitoring systems. Since 2015.

  • Current people: Holger Fröning (PI), Günther Schindler (PhD student), Himanshu Tiwari (graduate student), Antsa Andriamboavonjy (graduate student)
  • Previous people: Eugen Rusakov (graduate student), Klaus Neumann (graduate student), Andreas Melzer (graduate student)
  • Collaborations: Franz Pernkopf (Technical University of Graz, Austria), Manfred Mücke (Materials Center Leoben, Austria)
  • Funding: DFG/FWF (DACH project)
  • Dissemination:
    • ICASSP 2018 conference paper: [pdf]
    • UCHPC workshop paper at Euro-Par 2017: [pdf]
  • Read more: http://www.deepchip.org
  • Since 2015

 

Integrated Power Models

A fundamental understanding of power consumption is essential to design and operate future computing systems. Inter-node interconnection networks in particular are a neglected topic in power optimization and modeling. In this project, we collaborate with colleagues from the University of Castilla-La Mancha (Spain) to analyze, optimize and model the energy proportionality of scalable interconnection networks. In particular, we proposed power-saving policies based on sleep states, which, due to the distributed nature of networks, are fundamentally different from policies for processors, for instance. Currently, we are exploring the combination of these policies with network management techniques, including congestion management and adaptive routing. Since 2015.
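As a rough illustration of a sleep-state policy, the sketch below estimates the energy of a single link that goes to sleep after a fixed idle timeout and wakes on the next packet. It is a deliberately simplified toy model, not the policy from the publications: wake-up latency, transmission time and the interaction between neighboring links are ignored, and the power values and names are assumptions.

```python
def link_energy(arrivals, horizon, idle_timeout, p_active=1.0, p_sleep=0.1):
    """Energy of one link over [0, horizon] under an idle-timeout sleep policy.

    arrivals: sorted packet arrival times. The link starts awake at time 0,
    stays awake for idle_timeout after the last activity, then sleeps until
    the next arrival. Power values are hypothetical, in arbitrary units.
    """
    active_time = 0.0
    last = 0.0
    for t in list(arrivals) + [horizon]:
        gap = t - last
        active_time += min(gap, idle_timeout)  # awake until the timeout expires
        last = t
    sleep_time = horizon - active_time
    return p_active * active_time + p_sleep * sleep_time


if __name__ == "__main__":
    trace = [1, 2, 10]
    print(link_energy(trace, horizon=20, idle_timeout=3))
    print(link_energy(trace, horizon=20, idle_timeout=float("inf")))  # always on
```

Comparing the two calls shows how far an always-on link is from energy-proportional behavior on a bursty trace. In a real network, a sleeping link also affects routing and latency along every path that crosses it, which is one reason such policies differ from processor sleep states.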

  • Current people: Holger Fröning (co-PI), Felix Zahn (PhD student), Florian Nowak (undergraduate student)
  • Previous people: Steffen Lammel (graduate student), Armin Schäffer (graduate student)
  • Collaborations: Pedro Garcia (University of Castilla-La Mancha, Spain)
  • Dissemination:
    • Wiley CCPE journal paper 2018: [link]
    • HiPINEB workshop paper at HPCA 2018: [pdf]
    • HiPINEB workshop paper at HPCA 2017: [pdf]
    • ExaComm workshop paper at ISC 2016: [pdf]
    • HiPINEB workshop paper at HPCA 2016
    • Invited talk at HiPEAC Computing Systems Week, Milano, 2015: [pdf]
  • Since 2015

 

Graphite

While common approaches try to optimally support graph computations with dedicated software stacks (e.g. graph engines), in this work we explore how existing columnar databases can be extended to optimally support graph queries. Direct advantages include reduced data movement; in addition, other aspects such as attributes, updates, concurrency and NUMA effects can be addressed much better. Most recently, we have focused on scheduling optimizations and on the unnecessary guarantees of common atomic operations. This is a joint project with SAP and the Technical University of Dresden. Since 2015.

  • Current people: Holger Fröning (PI), Matthias Hauck (PhD student), Romans Kasperovics (SAP), Hannes Rauhe (Innovation Lab Berlin)
  • Previous people: Marcus Paradies (SAP)
  • Collaborations: SAP, Innovation Lab Berlin
  • Dissemination:
    • GRADES workshop at SIGMOD 2017: [pdf]
    • HPEC conference paper 2016: [pdf]
    • EuroDW workshop paper at EuroSys 2016: [pdf]
    • PELGA workshop paper at Euro-Par 2015: [pdf]
  • Since 2015

 

Data acquisition for high-energy physics experiments

For the ATLAS high-energy physics experiment at CERN, we are contributing to the data acquisition system, in particular the data collection manager. In this project, a commodity Ethernet network is used, and upper-level software layers such as the data collection manager guarantee minimal collection latencies through traffic shaping techniques. In addition, a complete data-flow messaging library is designed and optimized for this special application. Most recently, we have focused on modeling next-generation system behavior and on conceptually addressing the tremendous increase in data rate by decoupling buffers inside the data acquisition system. This is a collaboration with colleagues from CERN, which also funds this work, and the University of Castilla-La Mancha, Spain. Since 2013.

  • Current people: Holger Fröning (co-PI), Wainer Vandelli (CERN, co-PI), Alejandro Santos (PhD student)
  • Previous people: Tommaso Colombo (PhD student)
  • Collaborations: Pedro Garcia (University of Castilla-La Mancha, Spain)
  • Dissemination:
    • DEBS 2018 conference paper: [link]
    • TIPP Springer Proceedings in Physics, 2018: [link]
    • MSPDS workshop paper at HPCS 2017: [pdf]
    • Springer Journal of Supercomputing 2016: [pdf]
    • HiPINEB workshop paper at IEEE CLUSTER 2015: [pdf]
  • Since 2013

 

Past Research Projects

GGAS - Global GPU Address Spaces for Clusters

GGAS is a shared GPU address space that spans the device memories of GPUs at cluster level. All threads in the cluster, whether running on the host or on the GPU, can use GGAS as a low-latency direct access path to distributed GDDR memory. This allows for very fast, low-overhead synchronization and data movement between GPUs. Regardless of whether GPUs are local or remote, the CUDA programming model is maintained. Through GPUDirect RDMA technology, data can be transferred directly between the memories of two GPUs on different nodes, without temporary copies to host memory. Furthermore, GGAS allows bypassing the host CPU to transfer data between kernels without returning control flow to the host CPUs. In combination with the recent introduction of Dynamic Parallelism, GPUs can autonomously compute and communicate even in distributed environments.

In summary, GGAS maintains the GPU's bulk-synchronous, massively parallel programming model by relying on thread-collective communication. It allows confining the control flow to the GPU domain, bypassing the CPUs for all computation and communication tasks and avoiding context switches for communication, which are costly in terms of energy and time. GGAS minimizes branch divergence, as opposed to explicit communication layers like message passing. It is a direct, zero-copy communication model that moves data between distributed GPU memories without intermediate copies, again contributing to the minimization of time and energy.

  • Dissemination (selection):
    • Elsevier Journal of Parallel Computing, 2016: [link]
    • International Journal of High Performance Computing Applications, 2015: [link]
    • ASHES workshop at IPDPS 2014: [pdf]
    • E2SC workshop at SC 2014: [pdf]
    • IEEE CLUSTER 2013: [pdf]
    • CCGRID 2014: [pdf]
    • HUCAA workshop at ICPP 2014: [pdf]
    • Invited paper at GPCDP workshop at Green Computing 2014: [pdf]
    • ISPASS: [pdf]

MEMSCALE

MEMSCALE is a new memory architecture for clusters and datacenters, with the objective of overcoming memory capacity constraints and minimizing over-provisioning of scarce resources. The key element is global address spaces (GAS) across physically distributed resources. The scalability problem of coherence is addressed by reverting to highly relaxed consistency models. The goal of this work is to overcome the current static resource partitioning in clusters, and thereby to facilitate a highly dynamic aggregation and disaggregation of resources. This approach can help to dramatically reduce resource over-provisioning and to maximize utilization, thus improving the energy efficiency of clusters and datacenters. First results include the acceleration of in-memory databases, the acceleration of data-intensive applications, and the combination of the scalability of message passing with the ease of programming of shared memory for future multi-/many-core architectures. This is a collaboration with the Parallel Architectures Group led by Professor Jose Duato at the Technical University of Valencia, Spain.

EXTOLL

EXTOLL is a new ultra-low latency cluster interconnect designed from scratch for use in HPC systems. Key properties are high message rates, high scalability and inherent support for multi-core processors through virtualization of the network interface. Current FPGA implementations of this design are able to outperform state-of-the-art silicon for selected applications and benchmarks. This is a collaboration with the Computer Architecture Group led by Professor Ulrich Brüning at the same institute and the EXTOLL company.

ONCILLA

Oncilla is a new project led by Professor Sudhakar Yalamanchili from Georgia Tech that aims to provide a commodity-based non-coherent global address space (GAS) to support efficient data movement between host memory (DRAM) and accelerators (GPUs) using tightly integrated “converged fabrics”. By using a custom NIC, low-latency, non-coherent put/get operations are available to access remote memory and to build a large, non-coherent GAS system. Oncilla relies on the GAS model as a memory management scheme, and is a runtime abstraction that assists the programmer with data movement and placement tasks.

 

 

 
