Now showing 1 - 10 of 38
  • Placeholder Image
    Publication
    Multi-granularity Locking in Hierarchies with Synergistic Hierarchical and Fine-Grained Locks
    (01-01-2018)
    Ganesh, K.
    ;
    Kalikar, Saurabh
    ;
    We propose a new locking mechanism for hierarchies wherein the locking requests can be a combination of coarse and fine. Existing protocols such as multiple-granularity locking (MGL) are efficient when all the requests are of the same granularity. MGL is either too coarse or too fine-grained when multiple threads request for various parts of the hierarchy with differing granularity requirements. Simultaneous handling of hierarchical and fine-grained requests poses new challenges in checking for racy requests. We propose a novel indexing technique for hierarchies which uniquely identifies every node as an interval value and effectively captures hierarchical dependencies between nodes even when the hierarchy is a tree, DAG or a cycle. Our experiments with real-world XML hierarchies and synthetic benchmarks show that the proposed locking technique provides a higher degree of concurrency with minimal locking cost resulting in overall performance improvement.
  • Placeholder Image
    Publication
    ParTBC: Faster Estimation of Top-k Betweenness Centrality Vertices on GPU
    (01-03-2022)
    Singh, Somesh
    ;
    Shah, Tejas
    ;
    Betweenness centrality (BC) is a popular centrality measure, based on shortest paths, used to quantify the importance of vertices in networks. It is used in a wide array of applications including social network analysis, community detection, clustering, biological network analysis, and several others. The state-of-the-art Brandes' algorithm for computing BC has time complexities of and for unweighted and weighted graphs, respectively. Brandes' algorithm has been successfully parallelized on multicore and manycore platforms. However, the computation of vertex BC continues to be time-consuming for large real-world graphs. Often, in practical applications, it suffices to identify the most important vertices in a network; that is, those having the highest BC values. Such applications demand only the top vertices in the network as per their BC values but do not demand their actual BC values. In such scenarios, not only is computing the BC of all the vertices unnecessary but also exact BC values need not be computed. In this work, we attempt to marry controlled approximations with parallelization to estimate the k-highest BC vertices faster, without having to compute the exact BC scores of the vertices. We present a host of techniques to determine the top-k vertices faster, with a small inaccuracy, by computing approximate BC scores of the vertices. Aiding our techniques is a novel vertex-renumbering scheme to make the graph layout more structured, which results in faster execution of parallel Brandes' algorithm on GPU. Our experimental results, on a suite of real-world and synthetic graphs, show that our best performing technique computes the top-k vertices with an average speedup of 2.5× compared to the exact parallel Brandes' algorithm on GPU, with an error of less than 6%. Our techniques also exhibit high precision and recall, both in excess of 94%.
  • Placeholder Image
    Publication
    Single-linkage clustering of dynamic data
    (10-01-2023)
    Mongandampulath Akathoott, Anju
    ;
    The surge in data sizes in fluid processing applications necessitates partitioning the data into clusters and studying their representatives instead of studying each voxel data point. In addition, the dynamic nature of these data poses further challenges. Under such circumstances, it becomes essential to develop an approach that can handle the delta data with minimal updates to the underlying data structure, without processing the complete data from scratch on every update. However, this poses synchronization challenges in parallelization. In this article, we propose SLCoDD (single-linkage clustering of dynamic data), a geometric distance based dynamic clustering and its multi-core parallelization using OpenMP. To improve efficiency, SLCoDD exploits geometric properties of the bounding squares. We illustrate trade-offs in various ways of performing point additions to clusters, point deletions, and their batched versions. Using a suite of large inputs, we demonstrate the effectiveness of SLCoDD. SLCoDD's fully dynamic version achieves a substantial geomean speedup of (Formula presented.) over the static parallel version and of (Formula presented.) over the dynamic sequential version.
  • Placeholder Image
    Publication
    BOLD: an ontology-based log debugger for C programs
    (01-05-2022)
    Pattipati, Dileep Kumar
    ;
    ;
    Program debugging is often an ad hoc activity, with a combination of manual and semi-automated processing. A challenge posed by debugging is the lack of standardized procedures for instrumenting, representing, and analyzing the execution trace. Further, debugging is often low level, getting into the nitty-gritty details of variables and their semantics—rather than at a high-level. The presence of libraries only exacerbates these issues. We propose BOLD, an Ontology-based Log Debugger, to unify various activities involved in the debugging of sequential C programs. The syntactical information of programs can be represented as Resource Description Framework (RDF) triples. BOLD automatically instruments programs by querying these triples. It represents the execution trace of the program also as RDF triples called trace triples. BOLD’s novel high-level reasoning abstracts these triples as spans. A span gives a way of examining the values of a particular variable over certain portions of the program execution. The properties of the spans are defined formally as a Web Ontology Language ontology. A developer can debug a given buggy program by querying the trace triples and reasoning with the spans. To empirically assess BOLD, we debugged the programs in standard Software-artifact Infrastructure Repository. Experiments related to debugging through automated reasoning show improvement in conciseness of the developers’ specifications and run time performance of BOLD compared to gdb-Python.
  • Placeholder Image
    Publication
    Domlock: A new multi-granularity locking technique for hierarchies
    (01-10-2017)
    Kalikar, Saurabh
    ;
    We present efficient locking mechanisms for hierarchical data structures. Several applications work on an abstract hierarchy of objects, and a parallel execution on this hierarchy necessitates synchronization across workers operating on different parts of the hierarchy. Existing synchronization mechanisms are too coarse, too inefficient, or too ad hoc, resulting in reduced or unpredictable amount of concurrency.We propose a new locking approach based on the structural properties of the underlying hierarchy.We show that the developed techniques are efficient evenwhen the hierarchy is an arbitrary graph. Theoretically,we present our approach as a locking-cost-minimizing instance of a generic algebraic model of synchronization for hierarchies. Using STMBench7, we illustrate considerable reduction in the locking cost, resulting in an average throughput improvement of 42%.
  • Placeholder Image
    Publication
    A GPU Algorithm for Earliest Arrival Time Problem in Public Transport Networks
    (01-12-2020)
    Haryan, Chirayu Anant
    ;
    Ramakrishna, G.
    ;
    ;
    Reddy, Allam Dinesh
    Given a temporal graph G, a source vertex s, and a departure time at source vertex ts, the earliest arrival time problem (EAT) is to start from s on or after ts and reach all the vertices in G as early as possible. Ni et al. have proposed a parallel algorithm for EAT and obtained a speedup up to 9.5× on real-world graphs with respect to the connection-scan serial algorithm by using multi-core processors. We propose a topology-driven parallel algorithm for EAT on public transport networks and implement using general-purpose programming on the graphics processing unit (GPU). A temporal connection in a temporal graph for a public transport network is associated with a departure time and a duration time, and many connections exist from u to v for an edge (u, v). We propose two pruning techniques connection-type and clustering, and use arithmetic progression technique appropriately to process many connections of an edge, without scanning all of them. In the connection-type technique, the connections of an edge with the same duration are grouped together. In the clustering technique, we follow 24-hour format and the connections of an edge are partitioned into 24 clusters so that the departure time of connections in the ith cluster is at least i-hour and at most i+1-hour. The arithmetic progression technique helps to store a sequence of departure times of various connections in a compact way. We propose a hybrid approach to combine the three techniques (connection-type, clustering and arithmetic progression) in an efficient way. Our techniques achieve an average speedup up to 61× when compared to the existing connection-scan serial algorithm running on CPU. Also, the average speedup of our algorithm is 12.65× against the parallel edge-scan-dependency graph algorithm running on GPU.
  • Placeholder Image
    Publication
    Lighthouse: An automatic code generator for graph algorithms on GPUs
    (01-01-2017)
    Shashidhar, G.
    ;
    We propose LightHouse, a GPU code-generator for a graph language named Green-Marl for which a multicore CPU backend already exists. This allows a user to seamlessly generate both the multicore as well as the GPU backends from the same specification of a graph algorithm. This restriction of not modifying the language poses several challenges as we work with an existing abstract syntax tree of the language, which is not tailored to GPUs. LightHouse overcomes these challenges with various optimizations such as reducing the number of atomics and collapsing loops. We illustrate its effectiveness by generating efficient CUDA codes for four graph analytic algorithms, and comparing performance against their multicore OpenMP versions generated by Green- Marl. In particular, our generated CUDA code performs comparable to 4 to 64-threaded OpenMP versions for different algorithms.
  • Placeholder Image
    Publication
    Efficient parallelization of SPH algorithm on modern multi-core CPUs and massively parallel GPUs
    (01-12-2021)
    Jagtap, Pravin
    ;
    ;
    Sanapala, V. S.
    ;
    Smoothed Particle Hydrodynamics (SPH) is fast emerging as a practically useful computational simulation tool for a wide variety of engineering problems. SPH is also gaining popularity as the back bone for fast and realistic animations in graphics and video games. The Lagrangian and mesh-free nature of the method facilitates fast and accurate simulation of material deformation, interface capture, etc. Typically, particle-based methods would necessitate particle search and locate algorithms to be implemented efficiently, as continuous creation of neighbor particle lists is a computationally expensive step. Hence, it is advantageous to implement SPH, on modern multi-core platforms with the help of High-Performance Computing (HPC) tools. In this work, the computational performance of an SPH algorithm is assessed on multi-core Central Processing Unit (CPU) as well as massively parallel General Purpose Graphical Processing Units (GP-GPU). Parallelizing SPH faces several challenges such as, scalability of the neighbor search process, force calculations, minimizing thread divergence, achieving coalesced memory access patterns, balancing workload, ensuring optimum use of computational resources, etc. While addressing some of these challenges, detailed analysis of performance metrics such as speedup, global load efficiency, global store efficiency, warp execution efficiency, occupancy, etc. is evaluated. The OpenMP and Compute Unified Device Architecture(CUDA) parallel programming models have been used for parallel computing on Intel Xeon(R) E5-2630 multi-core CPU and NVIDIA Quadro M4000 and NVIDIA Tesla p100 massively parallel GPU architectures. Standard benchmark problems from the Computational Fluid Dynamics (CFD) literature are chosen for the validation. The key concern of how to identify a suitable architecture for mesh-less methods which essentially require heavy workload of neighbor search and evaluation of local force fields from neighbor interactions is addressed.
  • Placeholder Image
    Publication
    Custom code generation for a graph DSL
    (23-02-2020)
    Gogoi, Bikash
    ;
    Cheramangalath, Unnikrishnan
    ;
    We present challenges faced in making a domain-specific language (DSL) for graph algorithms adapt to varying requirements of generating a spectrum of efficient parallel codes. Graph algorithms are at the heart of several applications, and achieving high performance on graph applications has become critical due to the tremendous growth of irregular data. However, irregular algorithms are quite challenging to auto-parallelize, due to access patterns influenced by the input graph - which is unavailable until execution. Former research has addressed this issue by designing DSLs for graph algorithms, which restrict generality but allow efficient codegeneration for various backends. Such DSLs are, however, too rigid, and do not adapt to changes. For instance, these DSLs are incapable of changing the way of processing if the underlying graph changes. As another instance, most of the DSLs do not support more than one backends. We narrate our experiences in making an existing DSL, named Falcon, adaptive. The biggest challenge in the process is to retain the DSL code for specifying the underlying algorithm, and still generate different backend codes. We illustrate the effectiveness of our proposal by auto-generating codes for vertex-based versus edge-based graph processing, synchronous versus asynchronous execution, and CPU versus GPU backends from the same specification.
  • Placeholder Image
    Publication
    Optimizing graph processing on GPUs using approximate computing
    (16-02-2019)
    Singh, Somesh
    ;
    Parallelizing graph algorithms on GPUs is challenging due to the irregular memory accesses and control-flow involved in graph traversals. In this work, we tame these challenges by injecting approximations. In particular, we improve memory coalescing by renumbering and replicating nodes, memory latency by adding edges among specific nodes brought into shared memory, and thread-divergence by normalizing degrees across nodes assigned to a warp. Using a suite of graphs with varied characteristics and five popular algorithms, we demonstrate the effectiveness of our proposed techniques. Our approximations for coalescing, memory latency and thread-divergence lead to mean speedups of 1.3×, 1.41× and 1.06× achieving accuracies of 83%, 78% and 84%, respectively.