I'm a founding member and senior software engineer at Katana Graph Inc. I received my CS Ph.D. from the University of Texas at Austin, where I was advised by Prof. Keshav Pingali, and my B.Tech. in Computer Science from the Indian Institute of Technology Roorkee, India, in 2012.
Before joining the Ph.D. program at UT Austin, I worked at IBM Research - India. I have also interned at Georgia Tech (2011), VMware (2013), Facebook (2018), and Microsoft Research (2019).

Research Interests

My research interests are High-Performance Computing, Graph Analytics, Compilers, Runtime Systems, Distributed Computing, and Computer Architecture. Currently, I am working on distributed systems for graph analytics at very large scale, both in terms of graph size and the number of machines in the cluster. I focus on the performance, scalability, productivity, and reliability of these systems.




Publications

Gurbinder Gill (1), Roshan Dathathri (1), Saeed Maleki (2), Madan Musuvathi (2), Todd Mytkowicz (2), Olli Saarikivi (2) ((1) Katana Graph Inc., (2) Microsoft Research). Distributed Training of Embeddings using Graph Analytics, IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2021, May 2021.

Many applications today, such as natural language processing and network and code analysis, rely on semantically embedding objects into low-dimensional fixed-length vectors. Such embeddings naturally provide a way to perform useful downstream tasks, such as identifying relations among objects and predicting objects for a given context. Unfortunately, training accurate embeddings is usually computationally intensive and requires processing large amounts of data. This paper presents a distributed training framework for a class of applications that use Skip-gram-like models to generate embeddings. We call this class Any2Vec, and it includes Word2Vec (Gensim) and Vertex2Vec (DeepWalk and Node2Vec), among others. We first formulate the Any2Vec training algorithm as a graph application. We then adapt the state-of-the-art distributed graph analytics framework, D-Galois, to support dynamic graph generation and re-partitioning, and incorporate novel communication optimizations. We show that on a cluster of 32 48-core hosts, our framework GraphAny2Vec matches the accuracy of the state-of-the-art shared-memory implementations of Word2Vec and Vertex2Vec, and gives geo-mean speedups of 12× and 5×, respectively. Furthermore, GraphAny2Vec is on average 2× faster than DMTK, the state-of-the-art distributed Word2Vec implementation, on 32 hosts, while yielding much better accuracy.
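To make the graph formulation concrete, here is a minimal sketch (illustrative code, not the GraphAny2Vec API) of how a Skip-gram model's training pairs can be viewed as edges between target and context vertices:

```python
def skipgram_edges(tokens, window=2):
    """Generate (target, context) training pairs; in the graph
    formulation, each pair is an edge from a target vertex to a
    context vertex in a dynamically generated graph."""
    edges = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                edges.append((target, tokens[j]))
    return edges

corpus = "the quick brown fox jumps over the lazy dog".split()
edges = skipgram_edges(corpus, window=1)
```

In the distributed setting, these edges are what get partitioned across hosts, and re-partitioning follows the evolving set of pairs as training walks over the data.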

Gurbinder Gill (1), Roshan Dathathri (1), Loc Hoang (1), Ramesh Peri (2), Keshav Pingali (1) ((1) The University of Texas at Austin, (2) Intel Corporation). Single Machine Graph Analytics on Massive Datasets Using Intel Optane DC Persistent Memory, Proceedings of the International Conference on Very Large Data Bases (PVLDB), 2020, August 2020.

Intel Optane DC Persistent Memory is a new kind of byte-addressable memory with higher density and lower cost than DRAM. This enables affordable systems that support up to 6TB of memory. In this paper, we use such a system for massive graph analytics. We discuss how such a system should be deployed for these applications and evaluate three existing shared-memory graph frameworks, Galois, GAP, and GraphIt, on large real-world web-crawls. Using the results of our study, we recommend algorithms and runtime optimizations for getting the best performance on such systems. We also show that for these applications, the Galois framework on Optane DC PMM is on average 2× faster than D-Galois, the state-of-the-art distributed graph framework, on a production cluster with similar compute power. Thus, Optane DC PMM yields benefits in productivity, performance, and cost for massive graph analytics.

Gurbinder Gill, Roshan Dathathri, Loc Hoang, Keshav Pingali. A Study of Partitioning Policies for Graph Analytics on Large-scale Distributed Platforms, Proceedings of the International Conference on Very Large Data Bases (PVLDB), 2019, August 2019.

Distributed-memory clusters are used for in-memory processing of very large graphs with billions of nodes and edges. This requires partitioning the graph among the machines in the cluster. When a graph is partitioned, a node in the graph may be replicated on several machines, and communication is required to keep these replicas synchronized. Good partitioning policies attempt to reduce this synchronization overhead while keeping computational load balanced across machines. A number of recent studies have looked at ways to control replication of nodes, but these studies are not conclusive because they were performed on small clusters with eight to sixteen machines, did not consider work-efficient data-driven algorithms, or did not optimize communication for the partitioning strategies they studied. This paper presents an experimental study of partitioning strategies for work-efficient graph analytics applications on large KNL and Skylake clusters with up to 256 machines, using the Gluon communication runtime, which implements partitioning-specific communication optimizations. Evaluation results show that although simple partitioning strategies like Edge-Cuts perform well on a small number of machines, an alternative partitioning strategy called Cartesian Vertex-Cut (CVC) performs better at scale, even though paradoxically it has a higher replication factor and performs more communication than Edge-Cut partitioning does. Results from communication micro-benchmarks resolve this paradox by showing that communication overhead depends not only on communication volume but also on the communication pattern among the partitions. These experiments suggest that high-performance graph analytics systems should support multiple partitioning strategies, like Gluon does, as no single graph partitioning strategy is best for all cluster sizes. For such systems, a decision tree for selecting a good partitioning strategy based on characteristics of the computation and the cluster is presented.
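The replication factor mentioned above is simply the average number of hosts holding a copy of each vertex; a small sketch (a hypothetical helper, not part of Gluon) of computing it from a partition assignment:

```python
from collections import Counter

def replication_factor(partitions):
    """Average number of hosts holding a copy of each vertex.
    `partitions` maps a host id to the set of vertices (owned or
    replicated) that are present on that host."""
    copies = Counter()
    for verts in partitions.values():
        for v in verts:
            copies[v] += 1
    return sum(copies.values()) / len(copies)

# Two hosts; vertices 1 and 2 are replicated on both hosts,
# so there are 6 total copies of 4 distinct vertices.
parts = {0: {0, 1, 2}, 1: {1, 2, 3}}
rf = replication_factor(parts)  # 6 / 4 = 1.5
```

The study's point is that a lower value of this metric does not automatically mean less communication time: the pattern of which hosts exchange updates matters as much as the volume.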

Loc Hoang, Roshan Dathathri, Gurbinder Gill, Keshav Pingali. CuSP: A Customizable Streaming Edge Partitioner for Distributed Graph Analytics, IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2019, May 2019.

Graph analytics systems must analyze graphs with billions of vertices and edges, which require several terabytes of storage. Distributed-memory clusters are often used for analyzing such large graphs since the main memory of a single machine is usually restricted to a few hundred gigabytes. This requires partitioning the graph among the machines in the cluster. Existing graph analytics systems usually come with a built-in partitioner that incorporates a particular partitioning policy, but the best partitioning policy is dependent on the algorithm, input graph, and platform. Therefore, built-in partitioners are not sufficiently flexible. Stand-alone graph partitioners are available, but they too implement only a small number of partitioning policies. This paper presents CuSP, a fast streaming edge partitioning framework that permits users to specify the desired partitioning policy at a high level of abstraction and quickly generates high-quality graph partitions. For example, it can partition wdc12, the largest publicly available web-crawl graph, with 4 billion vertices and 129 billion edges, in under 2 minutes on clusters with 128 machines. Our experiments show that it can produce quality partitions 6× faster on average than the state-of-the-art stand-alone partitioner in the literature while supporting a wider range of partitioning policies.
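The "policy at a high level of abstraction" idea can be pictured as a user-supplied function that assigns each streamed edge to a host; a toy sketch (illustrative names and a hashed ownership rule, not CuSP's actual API):

```python
def outgoing_edge_cut(src, dst, num_hosts):
    """Edge-cut policy: an edge is assigned to the host that owns its
    source vertex; ownership here is simple hashing (hypothetical)."""
    return src % num_hosts

def assign_edges(edges, policy, num_hosts):
    """Stream edges once, routing each to the shard chosen by the
    user-supplied policy function."""
    shards = {h: [] for h in range(num_hosts)}
    for src, dst in edges:
        shards[policy(src, dst, num_hosts)].append((src, dst))
    return shards

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
shards = assign_edges(edges, outgoing_edge_cut, num_hosts=2)
```

Swapping in a different policy function (e.g., one keyed on the destination, or on a 2D grid of hosts for Cartesian cuts) changes the partitioning without changing the streaming machinery, which is the flexibility the framework aims for.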

Roshan Dathathri*, Gurbinder Gill*, Loc Hoang, Keshav Pingali (* Both authors contributed equally). Phoenix: A Substrate for Resilient Distributed Graph Analytics, ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2019 , April 2019.

This paper presents Phoenix, a communication and synchronization substrate that implements a novel protocol for recovering from fail-stop faults when executing graph analytics applications on distributed-memory machines. The standard recovery technique in this space is checkpointing, which rolls back the state of the entire computation to a state that existed before the fault occurred. The insight behind Phoenix is that this is not necessary, since it is sufficient to continue the computation from a state that will ultimately produce the correct result. We show that for graph analytics applications, the necessary state adjustment can be specified easily by the application programmer using a thin API supported by the Phoenix system. Phoenix has no observable overhead during fault-free execution, and it is resilient to any number of faults while guaranteeing that the correct answer will be produced at the end of the computation. This is in contrast to other systems in this space, which may either have overheads even during fault-free execution or produce only approximate answers when faults occur during execution. We incorporated Phoenix into D-Galois, the state-of-the-art distributed graph analytics system, and evaluated it on two production clusters. Our evaluation shows that in the absence of faults, Phoenix is 24× faster than GraphX, which provides fault-tolerance using the Spark system. Phoenix also outperforms a checkpoint-restart technique implemented in D-Galois: in fault-free execution, Phoenix has no observable overhead, whereas the checkpoint-restart technique has 31% overhead. In addition, Phoenix mostly outperforms checkpointing when faults occur, particularly in the common case when only a small number of hosts fail simultaneously.
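The state-adjustment idea can be illustrated with BFS (a sketch based on the description above, not Phoenix's actual API): because BFS labels only ever decrease during relaxation, re-initializing just the lost vertices to their initial values yields a state from which the computation still converges to the correct answer.

```python
INF = float("inf")

def recover_bfs_state(dist, lost_vertices, source):
    """After a fail-stop fault, re-initialize only the vertices that
    were on the failed hosts: INF everywhere, 0 at the source.
    Surviving hosts keep their labels, and continued relaxation
    converges to the correct distances (hypothetical helper)."""
    for v in lost_vertices:
        dist[v] = 0 if v == source else INF
    return dist

# The host holding vertices {2, 3} failed mid-run; their labels are
# reset, and the computation resumes instead of rolling everyone back.
dist = {0: 0, 1: 1, 2: 2, 3: 3}
dist = recover_bfs_state(dist, lost_vertices={2, 3}, source=0)
```

This is why fault-free execution carries no overhead: nothing is checkpointed, and the adjustment code runs only when a fault actually occurs.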

Loc Hoang, Matteo Pontecorvi, Roshan Dathathri, Gurbinder Gill, Bozhi You, Keshav Pingali, Vijaya Ramachandran. A Round-Efficient Distributed Betweenness Centrality Algorithm, ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), 2019, February 2019.

We present an O(n)-round distributed-memory algorithm called Min-Rounds BC (MRBC) for computing the betweenness centrality (BC) of every vertex in a directed unweighted graph with n vertices. This algorithm can also be used to solve the all-pairs-shortest-paths (APSP) problem in such graphs. Our algorithm is framed in the CONGEST model, and it improves the number of rounds by at least a constant factor over previous results for directed APSP and BC. We implemented MRBC in D-Galois, the state-of-the-art distributed graph analytics system, incorporated additional optimizations enabled by the D-Galois model, and evaluated its performance on a production cluster with up to 256 hosts using power-law and road networks. Compared to the BC algorithm of Brandes, MRBC reduces the number of rounds by 11.8× and the communication time by 3.3× on the average for the graphs in our test suite. As a result, MRBC is 2.6× faster on the average than Brandes BC for real-world web-crawls on 256 hosts.

Gurbinder Gill, Roshan Dathathri, Loc Hoang, Andrew Lenharth, Keshav Pingali. Abelian: A Compiler for Graph Analytics on Distributed, Heterogeneous Platforms, International European Conference on Parallel and Distributed Computing (Euro-Par), 2018, August 2018.

The trend towards processor heterogeneity and distributed-memory has significantly increased the complexity of parallel programming. In addition, the mix of applications that need to run on parallel platforms today is very diverse, and includes graph applications that typically have irregular memory accesses and unpredictable control-flow. To simplify the programming of graph applications on such platforms, we have implemented a compiler called Abelian that translates shared-memory descriptions of graph algorithms written in the Galois programming model into efficient code for distributed-memory platforms with heterogeneous processors. The compiler manages inter-device synchronization and communication while leveraging state-of-the-art compilers for generating device-specific code. The experimental results show that the novel communication optimizations in the Abelian compiler reduce the volume of communication by 23×, enabling the code produced by Abelian to match the performance of handwritten distributed CPU and GPU programs that use the same runtime. The programs produced by Abelian for distributed CPUs are roughly 2.4× faster than those in the Gemini system, a third-party distributed CPU-only system, demonstrating that Abelian can manage heterogeneity and distributed-memory successfully while generating high-performance code.

Roshan Dathathri*, Gurbinder Gill*, Loc Hoang, Hoang-Vu Dang, Alex Brooks, Nikoli Dryden, Marc Snir, Keshav Pingali (* Both authors contributed equally). Gluon: A Communication-Optimizing Substrate for Distributed Heterogeneous Graph Analytics, ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2018, June 2018.

This paper introduces a new approach to building distributed-memory graph analytics systems that exploits heterogeneity in processor types (CPU and GPU), partitioning policies, and programming models. The key to this approach is Gluon, a communication-optimizing substrate. Programmers write applications in a shared-memory programming system of their choice and interface these applications with Gluon using a lightweight API. Gluon enables these programs to run on heterogeneous clusters and optimizes communication in a novel way by exploiting structural and temporal invariants of graph partitioning policies. To demonstrate Gluon’s ability to support different programming models, we interfaced Gluon with the Galois and Ligra shared-memory graph analytics systems to produce distributed-memory versions of these systems named D-Galois and D-Ligra, respectively. To demonstrate Gluon’s ability to support heterogeneous processors, we interfaced Gluon with IrGL, a state-of-the-art single-GPU system for graph analytics, to produce D-IrGL, the first multi-GPU distributed-memory graph analytics system. Our experiments were done on CPU clusters with up to 256 hosts and roughly 70,000 threads and on multi-GPU clusters with up to 64 GPUs. The communication optimizations in Gluon improve end-to-end application execution time by ∼2.6× on the average. D-Galois and D-IrGL scale well and are faster than Gemini, the state-of-the-art distributed CPU graph analytics system, by factors of ∼3.9× and ∼4.9×, respectively, on the average.

Hoang-Vu Dang, Roshan Dathathri, Gurbinder Gill, Alex Brooks, Nikoli Dryden, Andrew Lenharth, Loc Hoang, Keshav Pingali, Marc Snir. A Lightweight Communication Runtime for Distributed Graph Analytics, IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2018, May 2018.

Distributed-memory multi-core clusters enable in-memory processing of very large graphs with billions of nodes and edges. Recent distributed graph analytics systems have been built on top of MPI. However, communication in graph applications is very irregular, and each host exchanges different amounts of non-contiguous data with other hosts. MPI does not support such a communication pattern well, and it has limited ability to integrate communication with serialization, deserialization, and graph computation tasks. In this paper, we describe a lightweight communication runtime called LCI that supports a large number of threads on each host and avoids the semantic mismatches between the requirements of graph computations and the communication library in MPI. The implementation of LCI is informed by lessons learnt from two baseline MPI-based implementations. We have successfully integrated LCI with two state-of-the-art graph analytics systems, Gemini and Abelian. LCI improves latency by up to 3.5× on microbenchmarks compared to MPI-based solutions and improves the end-to-end performance of distributed graph algorithms by up to 2×.

Gurbinder Singh Gill, Vaibhav Saxena, Rashmi Mittal, Thomas George, Yogish Sabharwal, Lalit Dagar. Evaluation and enhancement of weather application performance on Blue Gene/Q, High Performance Computing (HiPC), 2013, December 2013.

Numerical weather prediction (NWP) models use mathematical models of the atmosphere to predict the weather. Ongoing efforts in the weather and climate community continuously try to improve the fidelity of weather models by employing higher order numerical methods suitable for solving model equations at high resolutions. In a realistic weather forecasting scenario, simulating and tracking multiple regions of interest (nests) at fine resolutions is important for understanding the interplay between multiple weather phenomena and for comprehensive predictions. These multiple regions of interest in a simulation can differ significantly in resolution and other modeling parameters. Currently, weather simulations involving these nested regions process them one after the other in a sequential fashion. There is a lot of prior work on performance evaluation and optimization of weather models; however, most of this work is limited to simulations involving either a single domain or multiple nests with the same resolution and model parameters, such as model physics options. In this paper, we evaluate and enhance the performance of the popular WRF model on the IBM Blue Gene/Q system. We consider nested simulations with multiple child domains and study how parameters such as physics options and simulation time steps for child domains affect the computational requirements. We also analyze how such configurations can benefit from parallel execution of the child domains rather than processing them sequentially. We demonstrate that it is important to allocate processors to nested child domains in proportion to the work load associated with them when executing them in parallel. This ensures that the time spent in the different nested simulations is nearly equal, and the nested domains reach the synchronization step with the parent simulation together. Our experimental evaluation using a simple heuristic for allocation of nodes shows that the performance of WRF simulations can be improved by up to 14% by parallel execution of sibling domains with different configurations of domain sizes, temporal resolutions, and physics options.

Durgaprasad Gangodkar, Sachin Gupta, Gurbinder Singh Gill, Padam Kumar, Ankush Mittal. Efficient variable size template matching using fast normalized cross correlation on multicore processors, International Conference on Advanced Computing, Networking and Security (ADCONS), 2011, December 2011.

Normalized Cross Correlation (NCC) is an efficient and robust way of finding the location of a template in a given image. However, NCC is computationally expensive. Fast normalized cross correlation (FNCC) makes use of pre-computed sum-tables to improve the computational efficiency of NCC. In this paper, we propose a strategy for parallel implementation of the FNCC algorithm using NVIDIA's Compute Unified Device Architecture (CUDA) for real-time template matching. We also present an approach to make the proposed method adaptable to variable-size templates, which is an important challenge to tackle. Efficient parallelization strategies, adopted for pre-computing sum-tables and extracting data parallelism by dividing the image into a series of blocks, substantially reduce the required computational time. We show that by optimal utilization of the different memories available on the multicore architecture and by exploiting the asynchronous nature of CUDA kernel calls, we can obtain a speedup of the order of 17× compared to the sequential implementation.
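To show where the sum-tables come in, here is a minimal sequential sketch of the FNCC idea in NumPy (for clarity only, not the paper's CUDA implementation): integral images of the image and its square give each window's mean and energy in O(1), so only the correlation numerator needs a per-window pass.

```python
import numpy as np

def fncc(image, template):
    """Normalized cross-correlation map, with the per-window image
    statistics taken from pre-computed sum-tables (the FNCC idea)."""
    th, tw = template.shape
    t = template - template.mean()
    tnorm = np.sqrt((t * t).sum())

    # Sum-tables (integral images) of the image and its square,
    # padded with a zero row/column so window sums need no branches.
    pad = lambda a: np.pad(a.cumsum(0).cumsum(1), ((1, 0), (1, 0)))
    s, s2 = pad(image), pad(image * image)
    win = lambda tab, y, x: (tab[y + th, x + tw] - tab[y, x + tw]
                             - tab[y + th, x] + tab[y, x])

    n = th * tw
    out = np.zeros((image.shape[0] - th + 1, image.shape[1] - tw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            wsum = win(s, y, x)
            wvar = max(win(s2, y, x) - wsum * wsum / n, 0.0)
            num = (image[y:y + th, x:x + tw] * t).sum()
            denom = np.sqrt(wvar) * tnorm
            out[y, x] = num / denom if denom > 0 else 0.0
    return out
```

Embedding the template in a larger image and taking the argmax of the returned map recovers the template's location, with a score of 1.0 at an exact match. The paper's contribution is parallelizing both the sum-table construction and the per-block correlation on the GPU, and handling templates of varying size.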

Blog Entries


Our work on Intel Optane was featured in the ACM SIGARCH blog: "Using Intel Optane DC Persistent Memory for In-memory Graph Analytics".


NUMA migrations are not always beneficial: see my article in Intel's "The Parallel Universe" magazine, "Measuring the Impact of NUMA Migrations on Performance".



Awards

My poster titled “Abelian: A Compiler and Runtime for Graph Analytics on Distributed, Heterogeneous Platforms” won the IPDPS 2018 Outstanding Poster Presentation Award, 1st Place.


I received the MCD Fellowship from the Graduate School, The University of Texas at Austin, August 2013 to May 2016.