Now showing 1 - 10 of 24
  • Placeholder Image
    Publication
    The implications of shared data synchronization techniques on multi-core energy efficiency
    (01-01-2012)
    Gautham, Ashok
    ;
    Korgaonkar, Kunal
    ;
    Slpsk, Patanjali
    ;
    ;
    Shared data synchronization is at the heart of the multicore revolution since it is essential for writing concurrent programs. Ideally, a synchronization technique should be able to fully exploit the available cores, leading to improved performance. However, with the growing demand for energy-efficient systems, it also needs to work within the energy and power budget of the system. In this paper, we perform a detailed study of the performance as well as energy efficiency of popular shared-data synchronization techniques on a commodity multicore processor. We show that Software Transactional Memory (STM) systems can perform better than locks for workloads where a significant portion of the running time is spent in the critical sections. We also show how power-conserving techniques available on modern processors like C-states and clock frequency scaling impact energy consumption and performance. Finally, we compare the performance of STMs and locks under similar power budgets.
  • Placeholder Image
    Publication
    Subclasses of baxter permutations based on pattern avoidance
    (01-01-2016) ;
    Koroth, Sajin
    Baxter permutations are a class of permutations which are in bijection with a class of floorplans that arise in chip design called mosaic floorplans. We study a subclass of mosaic floorplans called Hierarchical Floorplans of Order k defined from mosaic floorplans by placing certain geometric restrictions. This naturally leads to studying a subclass of Baxter permutations. This subclass of Baxter permutations are characterized by pattern avoidance. We establish a bijection, between the subclass of floorplans we study and a subclass of Baxter permutations, based on the analogy between decomposition of a floorplan into smaller blocks and block decomposition of permutations. Apart from the characterization, we also answer combinatorial questions on these classes. We give an algebraic generating function (but without a closed form solution) for the number of permutations, an exponential lower bound on growth rate, and a linear time algorithm for deciding membership in each subclass. Based on the recurrence relation describing the class, we also give a polynomial time algorithm for enumeration. We finally prove that Baxter permutations are closed under inverse based on an argument inspired from the geometry of the corresponding mosaic floorplans. This proof also establishes that the subclass of Baxter permutations we study are also closed under inverse. Characterizing permutations instead of the corresponding floorplans can be helpful in reasoning about the solution space and in designing efficient algorithms for floorplanning.
  • Placeholder Image
    Publication
    XStream: Cross-core spatial streaming based MLC prefetchers for parallel applications in CMPs
    (01-01-2014)
    Panda, Biswabandan
    ;
    Hardware prefetchers are commonly used to hide and tolerate off-chip memory latency. Prefetching techniques in the literature are designed for multiple independent sequential applications running on a multicore system. In contrast to multiple independent applications, a single parallel application running on a multicore system exhibits different behavior. In case of a parallel application, cores share and communicate data and code among themselves, and there is commonality in the demand miss streams across multiple cores. This gives an opportunity to predict the demand miss streams and communicate the predicted streams from one core to another, which we refer as cross-core stream communication. We propose cross-core spatial streaming (XStream), a practical and storage-efficient cross-core prefetching technique. XStream detects and predicts the cross-core spatial streams at the private mid level caches (MLCs) and sends the predicted streams in advance to MLC prefetchers of the predicted cores. We compare the effectiveness of XStream with the ideal cross-core spatial streamer. Experimental results demonstrate that, on an average (geomean), compared to the state-of-the-art spatial memory streaming, storage efficient XStream reduces the execution time by 11.3% (as high as 24%) and 9% (as high as 29.09%) for 4-core and 8-core systems respectively. © 2014 ACM.
  • Placeholder Image
    Publication
    Fast-SL: An efficient algorithm to identify synthetic lethal sets in metabolic networks
    (25-03-2015)
    Pratapa, Aditya
    ;
    ;
    Motivation: Synthetic lethal sets are sets of reactions/genes where only the simultaneous removal of all reactions/genes in the set abolishes growth of an organism. Previous approaches to identify synthetic lethal genes in genome-scale metabolic networks have built on the framework of flux balance analysis (FBA), extending it either to exhaustively analyze all possible combinations of genes or formulate the problem as a bi-level mixed integer linear programming (MILP) problem. We here propose an algorithm, Fast-SL, which surmounts the computational complexity of previous approaches by iteratively reducing the search space for synthetic lethals, resulting in a substantial reduction in running time, even for higher order synthetic lethals. Results: We performed synthetic reaction and gene lethality analysis, using Fast-SL, for genome-scale metabolic networks of Escherichia coli, Salmonella enterica Typhimurium and Mycobacterium tuberculosis. Fast-SL also rigorously identifies synthetic lethal gene deletions, uncovering synthetic lethal triplets that were not reported previously. We confirm that the triple lethal gene sets obtained for the three organisms have a precise match with the results obtained through exhaustive enumeration of lethals performed on a computer cluster. We also parallelized our algorithm, enabling the identification of synthetic lethal gene quadruplets for all three organisms in under 6 h. Overall, Fast-SL enables an efficient enumeration of higher order synthetic lethals in metabolic networks, which may help uncover previously unknown genetic interactions and combinatorial drug targets.
  • Placeholder Image
    Publication
    A Recursive Model for Smooth Approximation to Wirelength and Its Impact on Analytical Placement
    (04-02-2015)
    Ray, B. N.B.
    ;
    Analytical placement engines use half-perimeter wire length (HPWL) of the circuit as an objective function to place blocks optimally within a chip. Inspired by popularly used log sum-exp (LSE) wire length model [6], ABS wire length model [5] and weighted average (WA) wire length model [3], we propose a new recursive wire length model for HPWL, providing smooth approximation to the max function. We show that the accuracy of the new model is better than that of LSE, WA and ABS wire length models, both theoretically and experimentally. When deployed inside an analytical engine, we show that our model provides more than 12% reduction in wire length compared to LSE at the expense of 50% more runtime. We also observed that the proposed model and the existing iterative models differ in their impact on the relative effort that has to be put in at the global placement vs. The detailed placement phase.
  • Placeholder Image
    Publication
    An efficient wirelength model for analytical placement
    (01-01-2013)
    Ray, B. N.B.
    ;
    Smooth approximations to half-perimeter wire-length are being investigated actively because of the recent increase in interest in analytical placement. It is necessary to not just provide smooth approximations but also to provide error analysis and convergence properties of these approximations. We present a new approximation scheme which uses a non-recursive approximation to the MAX function. We also show the convergence properties and the error bounds. The accuracy of our proposed scheme is better than those of the popular Logarithm-Sum-Exponential (LSE) wirelength model [7] and the recently proposed Weighted Average(WA) wirelength model[3]. We also experimentally validate the comparison by using global and detail placements produced by NTU Placer [1] on ISPD 2004 benchmark suite. The experimentations on benchmarks validate that the error bounds of our model are lower, with an average of 4% error in the total wirelength. © 2013 EDAA.
  • Placeholder Image
    Publication
    CUPL: A compile-time uncoalesced memory access pattern locator for CUDA
    (11-07-2013)
    Amilkanthwar, Madhur
    ;
    Coalesced memory access patterns in CUDA yields high performance but achieving such patterns in an application can be tedious. We propose a tool, CUPL, which locates uncoalesced access patterns (UCAP) in a given kernel at compile-time. CUPL does static analysis of a given kernel using polyhedral model and reports warnings if the input kernel exhibits UCAP. CUPL has two-fold use 1) It can help the programmer to locate regions of the code to optimize 2) It can help a compiler to perform efficient data layout transformations. Initial experiments show that CUPL reports warnings at appropriate places in kernels from Rodinia benchmark and NVIDIA SDK suites. © 2013 Authors.
  • Placeholder Image
    Publication
    XStat: Statistical X-filling algorithm for peak capture power reduction in scan tests
    (01-03-2014)
    Trinadh, A. Satya
    ;
    Potluri, Seetal
    ;
    ;
    Babu, Ch Sobhan
    ;
    Excessive power dissipation can cause high voltage droop on the power grid, leading to timing failures. Since test power dissipation is typically higher than functional power, test peak power minimization becomes very important in order to avoid test induced timing failures. Test cubes for large designs are usually dominated by don't care bits, making X-leveraging algorithms promising for test power reduction. In this paper, we show that X-bit statistics can be used to reorder test vectors on scan based architectures realized using toggle-masking flip flops. Based on this, the paper also presents an algorithm namely balanced X-filling that when applied to ITC'99 circuits, reduced the peak capture power by 7.4% on the average and 40.3% in the best case. Additionally XStat improved the running time for Test Vector Ordering and X-filling phases compared to the best known techniques.
  • Placeholder Image
    Publication
    CSHARP: Coherence and SHaring Aware cache Replacement Policies for parallel applications
    (01-12-2012)
    Panda, Biswabandan
    ;
    Parallel applications are becoming mainstream and architectural techniques for multicores that target these applications are the need of the hour. Sharing of data by multiple threads and issues due to data coherence are unique to parallel applications. We propose CSHARP, a hardware framework that brings coherence and sharing awareness to any shared last level cache replacement policy. We use the degree of sharing of cache lines and the information present in coherence vectors to make replacement decisions. We apply CSHARP to a state-of-the-art cache replacement policy called TA-DRRIP to show its effectiveness. Our experiments on four core simulated system show that applying CSHARP on TA-DRRIP gives an extra 10% reduction in miss-rate at the LLC. Compared to LRU policy, CSHARP on TA-DRRIP shows a 18% miss-rate reduction and a 7% performance boost. We also show the scalability of our proposal by studying the hardware overhead and performance on a 8-core system. © 2012 IEEE.