  • Publication
    Sparsity-Aware Caches to Accelerate Deep Neural Networks
    (01-03-2020)
    Ganesan, Vinod; Sen, Sanchari; Kumar, Pratyush; Gala, Neel; ; Raghunathan, Anand
    Deep Neural Networks (DNNs) have transformed the field of artificial intelligence and represent the state-of-the-art in many machine learning tasks. There is considerable interest in using DNNs to realize edge intelligence in highly resource-constrained devices such as wearables and IoT sensors. Unfortunately, the high computational requirements of DNNs pose a serious challenge to their deployment in these systems. Moreover, due to tight cost (and hence, area) constraints, these devices are often unable to accommodate hardware accelerators, requiring DNNs to execute on the General Purpose Processor (GPP) cores that they contain. We address this challenge through lightweight micro-architectural extensions to the memory hierarchy of GPPs that exploit a key attribute of DNNs, viz. sparsity, or the prevalence of zero values. We propose SparseCache, an enhanced cache architecture that utilizes a null cache based on a Ternary Content Addressable Memory (TCAM) to compactly store zero-valued cache lines, while storing non-zero lines in a conventional data cache. By storing addresses rather than values for zero-valued cache lines, SparseCache increases the effective cache capacity, thereby reducing the overall miss rate and execution time. SparseCache utilizes a Zero Detector and Approximator (ZDA) and Address Merger (AM) to perform reads and writes to the null cache. We evaluate SparseCache on four state-of-the-art DNNs programmed with the Caffe framework. SparseCache achieves 5-28% reduction in miss rate, which translates to 5-21% reduction in execution time, with only 0.1% area and 3.8% power overhead in comparison to a low-end Intel Atom Z-series processor.
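The core idea of the abstract above can be sketched in software: all-zero cache lines are tracked by address only in a small "null cache", freeing data-cache capacity for non-zero lines. This is an illustrative toy model, not the authors' hardware; the class name, structure sizes, and LRU policy are assumptions.

```python
# Toy model of a sparsity-aware cache: zero-valued lines are stored as
# addresses only (mimicking the TCAM-based null cache), non-zero lines
# go to a conventional data store. Sizes and policy are hypothetical.
from collections import OrderedDict

LINE_WORDS = 8  # words per cache line (assumed)

class SparseCache:
    def __init__(self, data_lines=4, null_lines=16):
        self.data = OrderedDict()   # addr -> line contents (LRU order)
        self.null = OrderedDict()   # addr -> None (address-only storage)
        self.data_lines = data_lines
        self.null_lines = null_lines
        self.hits = self.misses = 0

    def write(self, addr, line):
        # Zero-detector role: route all-zero lines to the null cache.
        if all(w == 0 for w in line):
            self.data.pop(addr, None)
            self.null[addr] = None
            if len(self.null) > self.null_lines:
                self.null.popitem(last=False)   # evict LRU null entry
        else:
            self.null.pop(addr, None)
            self.data[addr] = line
            if len(self.data) > self.data_lines:
                self.data.popitem(last=False)

    def read(self, addr):
        if addr in self.null:        # hit: synthesize a zero line
            self.hits += 1
            self.null.move_to_end(addr)
            return [0] * LINE_WORDS
        if addr in self.data:
            self.hits += 1
            self.data.move_to_end(addr)
            return self.data[addr]
        self.misses += 1
        return None                  # miss: would go to the next level

cache = SparseCache()
cache.write(0x100, [0] * LINE_WORDS)   # zero line -> null cache (address only)
cache.write(0x140, [7] * LINE_WORDS)   # non-zero line -> data cache
```

Because a null-cache entry costs one address instead of a full line, sparse workloads fit more lines overall, which is where the miss-rate reduction comes from.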
  • Publication
    Approximate Error Detection with Stochastic Checkers
    (01-08-2017)
    Gala, Neel; Venkataramani, Swagath; Raghunathan, Anand;
    Designing reliable systems, while eschewing the high overheads of conventional fault tolerance techniques, is a critical challenge in the deeply scaled CMOS and post-CMOS era. To address this challenge, we leverage the intrinsic resilience of application domains such as multimedia, recognition, mining, search, and analytics, where acceptable outputs are produced despite occasional approximate computations. We propose stochastic checkers (checkers designed using stochastic logic) as a new approach to performing error checking in an approximate manner at greatly reduced overheads. Stochastic checkers are inherently inaccurate and require long latencies for computation. To limit the loss in error coverage, as well as false positives (correct outputs flagged as erroneous), caused by the approximate nature of stochastic checkers, we propose input-permuted partial replicas of stochastic logic, which improve their accuracy with minimal increase in overheads. To address the challenge of long error detection latency, we propose progressive checking policies that provide an early decision based on a prefix of the checker's output bitstream. This technique is further enhanced by employing progressively accurate binary-to-stochastic converters. Across a suite of error-resilient applications, we observe that stochastic checkers lead to greatly reduced overheads (29.5% area and 21.5% power, on average) compared with traditional fault tolerance techniques while maintaining high coverage and very low false positives.
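Two ingredients of the abstract above are easy to illustrate: stochastic logic encodes a value in [0, 1] as the probability of a 1 in a bitstream (so a single AND gate multiplies two values), and a progressive policy decides from a prefix of the stream before it completes. The sketch below shows both under assumed stream lengths and seeds; it is not the checker design itself.

```python
# Stochastic-logic multiplication and a prefix-based (progressive) estimate.
# Stream length, prefix length, and the RNG seed are arbitrary assumptions.
import random

random.seed(0)
N = 4096  # bitstream length

def to_stream(p, n=N):
    """Encode probability p as a random bitstream (stochastic number)."""
    return [1 if random.random() < p else 0 for _ in range(n)]

def from_stream(bits):
    """Decode: fraction of 1s approximates the encoded value."""
    return sum(bits) / len(bits)

a, b = to_stream(0.5), to_stream(0.4)
prod = [x & y for x, y in zip(a, b)]   # AND of independent streams multiplies

# Progressive check: an early decision from a short prefix of the stream.
early = from_stream(prod[:512])
full = from_stream(prod)
```

The prefix estimate is noisier than the full-stream one, which is exactly the coverage/latency trade-off the progressive checking policies manage.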
  • Publication
    An accuracy tunable non-boolean co-processor using coupled nano-oscillators
    (01-09-2017)
    Gala, Neel; Krithivasan, Sarada; Tsai, Wei Yu; Li, Xueqing; Narayanan, Vijaykrishnan;
    As we approach the end of Dennard scaling, where further reduction in supply voltage to lower power consumption becomes increasingly challenging in conventional systems, building systems capable of performing large computations with minimal area and power overheads demands a rigorous exploration of alternate computing techniques that can mitigate the limitations of Complementary Metal-Oxide-Semiconductor (CMOS) technology scaling and conventional Boolean systems. Reflecting on these lines of thought, in this article we explore the potential of non-Boolean computing employing nano-oscillators for performing varied functions. We use a pair of coupled nano-oscillators as our basic computational model and propose an architecture for a non-Boolean coupled-oscillator-based co-processor capable of executing certain functions that are commonly used across a variety of approximate application domains. The proposed architecture includes an accuracy-tunable knob, which can be adjusted by the programmer at runtime. The functionality of the proposed co-processor is verified using a soft coupled-oscillator model based on Kuramoto oscillators. The article also demonstrates how real-world applications such as Vector Quantization, Digit Recognition, and Structural Health Monitoring can be deployed on the proposed model. The proposed co-processor architecture is generic in nature and can be implemented using any of the existing modern-day nano-oscillator technologies such as Resonant Body Transistors (RBTs), Spin-Torque Nano-Oscillators (STNOs), and Metal-Insulator Transition (MIT) devices. In this article, we validate the proposed architecture using Hyper Field-Effect Transistor (Hyper-FET) technology-based coupled oscillators, which provide a 3.5× increase in clock speed and up to 10.75× and 14.12× reductions in area and power consumption, respectively, compared to a conventional Boolean CMOS accelerator executing the same functions.
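The "soft" Kuramoto model mentioned above can be simulated in a few lines: each oscillator's phase advances at its natural frequency plus a coupling term, and when the frequencies are close relative to the coupling strength the pair phase-locks with a small steady-state gap. Frequencies, coupling strength, and step size below are arbitrary assumptions for illustration.

```python
# Two coupled Kuramoto oscillators:
#   d(theta_i)/dt = w_i + (K/2) * sin(theta_j - theta_i)
# Forward-Euler integration; parameters are illustrative only.
import math

def phase_gap(w1, w2, K=2.0, dt=0.01, steps=5000):
    th1, th2 = 0.0, 1.0            # arbitrary initial phases
    for _ in range(steps):
        d1 = w1 + (K / 2) * math.sin(th2 - th1)
        d2 = w2 + (K / 2) * math.sin(th1 - th2)
        th1, th2 = th1 + d1 * dt, th2 + d2 * dt
    d = (th1 - th2) % (2 * math.pi)
    return min(d, 2 * math.pi - d)  # circular distance between phases

# |w1 - w2| < K: the pair locks to a small phase gap; that steady-state
# gap is the analog quantity a coupled-oscillator co-processor can read out
# as a "degree of match". With |w1 - w2| > K the phases drift apart.
locked_gap = phase_gap(1.0, 1.2)
drifting_gap = phase_gap(1.0, 5.0)
```

For two oscillators the locking condition is |w1 - w2| ≤ K; inside that range the residual phase gap varies smoothly with the frequency difference, which is what makes the scheme useful for approximate distance computations.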
  • Publication
    ProCA: Progressive configuration aware design methodology for low power stochastic ASICs
    (03-03-2014)
    Gala, Neel; Devanathan, V. R.; Srinivasan, Karthik; Visvanathan, V.;
    With increasing integration of capabilities into mobile application processors, a host of imaging operations that were earlier performed in software are now implemented in hardware. Though imaging applications are inherently error resilient, the complexity of such designs has increased over time, and identifying logic that can be leveraged for energy-quality trade-offs has thus become difficult. The paper proposes a Progressive Configuration Aware (ProCA) criticality analysis framework, 10× faster than the state of the art, to identify logic that is functionally critical to output quality, accounting for the various modes of operation of the design. Through such a framework, we demonstrate how a low-power tunable stochastic design can be derived. The proposed methodology uses layered synthesis and voltage scaling mechanisms as primary tools for power reduction. We demonstrate the proposed methodology on a production-quality imaging IP implemented in 28nm low leakage technology. For the tunable stochastic imaging IP, we achieve up to 10.57% power reduction in exact mode and up to 32.53% power reduction in error tolerant mode (30dB PSNR), with negligible design overhead. © 2014 IEEE.
  • Publication
    Best is the enemy of good: Design techniques for low power tunable approximate application specific integrated chips targeting media-based applications
    (01-06-2015)
    Gala, Neel; Devanathan, V. R.; Visvanathan, V.;
    With the possible end of Moore's Law on the horizon, Approximate Computing has gathered momentum over the past years as a possible alternative for low-power design. Approximate computing trades tolerable inaccuracies at the output of the design for better energy efficiency. In this paper, we propose an application-independent automated flow that converts a given design into an approximate version using either voltage-scaling or power-gating based techniques. The proposed model is shown to be effective for designing low-power media-type IP (Intellectual Property) based ASICs (Application Specific Integrated Chips). The model encompasses various automated techniques to identify logic within a given design which can in turn be leveraged for approximation. Following this identification, the model applies a series of physical optimizations that lead to a tunable approximate circuit capable of operating in both approximate and accurate modes depending on the environment and user constraints. The flow has been demonstrated to provide up to 35% power reduction in ASICs (operating in approximate mode) that implement imaging applications such as video decoding and image restoration IPs.
  • Publication
    ChADD: An ADD Based Chisel Compiler with Reduced Syntactic Variance
    (16-03-2016)
    Chauhan, Vikas; Gala, Neel;
    The need for quick design space exploration and the higher-level abstractions required to design complex circuits have led designers to adopt High-Level Synthesis (HLS) languages for hardware generation. Chisel is one such language, which offers a majority of the abstraction facilities found in today's software languages and also guarantees synthesizability of the generated hardware. However, most HLS languages, including Chisel, suffer from syntactic variance: the hardware inferred by these languages is inconsistent and relies heavily on the description style used by the designer. Thus, semantically equivalent circuit descriptions with different syntax can lead to different hardware utilization. In this paper, we propose the use of ADDs (Assignment Decision Diagrams) as an intermediate representation between Chisel and the target netlist representation. Following this path, we have shown that for a given design, two different styles of Chisel implementation yield the same target netlist, thereby ensuring syntactic invariance. For the same design implementations, the conventional Chisel compiler reports significant syntactic variance. In addition, we show empirically that the netlist generated by the proposed technique is competitive with the most optimal netlist generated by the conventional compiler when targeting an FPGA, implying that different implementations lead to close-to-optimal solutions.
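The intuition behind the ADD-based approach above can be sketched abstractly: every assignment to a signal is normalized into a set of (guard, value) pairs, so two syntactically different descriptions of the same multiplexer reduce to one canonical form. The representation and normalization below are hypothetical simplifications, not ChADD's actual intermediate form.

```python
# Sketch of the assignment-decision idea: normalize guarded assignments to a
# canonical form so that description style stops mattering. Guard and value
# expressions are plain strings here purely for illustration.

def normalize(assignments):
    """Canonicalize a list of (guard, value) pairs for one target signal."""
    return tuple(sorted(assignments))

# Style 1: an if/elif chain.        Style 2: reordered when-clauses.
style1 = [("sel==0", "a"), ("sel==1", "b"), ("sel==2", "c")]
style2 = [("sel==2", "c"), ("sel==0", "a"), ("sel==1", "b")]

# Semantically equivalent descriptions collapse to the same canonical form,
# so downstream netlist generation sees a single representation.
same_hardware = normalize(style1) == normalize(style2)
```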
  • Publication
    Shakti-T: A RISC-V processor with light weight security extensions
    (25-06-2017)
    Menon, Arjun; Murugan, Subadra; ; Gala, Neel;
    With increased usage of compute cores for sensitive applications, including e-commerce, there is a need to provide additional hardware support for securing information from memory-based attacks. This work presents a unified hardware framework for handling spatial and temporal memory attacks. The paper integrates the proposed hardware framework with a RISC-V based microarchitecture with an enhanced application binary interface that enables software layers to use these features to protect sensitive data. We demonstrate the effectiveness of the proposed scheme through practical case studies in addition to taking the design through a VLSI CAD design flow. The proposed processor reduces the metadata storage overhead up to 4× in comparison with the existing solutions, while incurring an area overhead of just 1914 LUTs and 2197 flip-flops on an FPGA, without affecting the critical path delay of the processor.
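The two attack classes named above can be illustrated with a toy software model: spatial safety is a base/bounds check on every access, and temporal safety is a lock-and-key check that fails after the allocation is freed. This is a generic sketch of the technique class, not Shakti-T's hardware scheme; all names and fields are illustrative.

```python
# Toy pointer-metadata model: bounds checks catch spatial violations,
# a lock/key pair catches temporal ones (use-after-free).

class GuardedPointer:
    def __init__(self, base, size, memory, locks, lock_id):
        self.base, self.size = base, size      # spatial metadata
        self.memory, self.locks = memory, locks
        self.lock_id = lock_id
        self.key = locks[lock_id]              # temporal metadata (key copy)

    def load(self, offset):
        if not (0 <= offset < self.size):
            raise MemoryError("spatial violation: out-of-bounds access")
        if self.locks[self.lock_id] != self.key:
            raise MemoryError("temporal violation: use-after-free")
        return self.memory[self.base + offset]

memory = [0] * 16
locks = {1: 42}                                # per-allocation lock
p = GuardedPointer(base=4, size=4, memory=memory, locks=locks, lock_id=1)
p.load(2)                                      # in-bounds, lock matches: ok
locks[1] = 43                                  # "free": lock rotated
```

A hardware realization keeps such metadata in dedicated structures checked in parallel with the access, which is why the overhead is measured in LUTs and flip-flops rather than extra instructions.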
  • Publication
    A Programmable Event-driven Architecture for Evaluating Spiking Neural Networks
    (11-08-2017)
    Roy, Arnab; Venkataramani, Swagath; Gala, Neel; Sen, Sanchari; ; Raghunathan, Anand
    Spiking neural networks (SNNs) represent the third generation of neural networks and are expected to enable new classes of machine learning applications. However, evaluating large-scale SNNs (e.g., of the scale of the visual cortex) on power-constrained systems requires significant improvements in computing efficiency. A unique attribute of SNNs is their event-driven nature: information is encoded as a series of spikes, and work is dynamically generated as spikes propagate through the network. Therefore, parallel implementations of SNNs on multi-cores and GPGPUs are severely limited by communication and synchronization overheads. Recent years have seen great interest in deep-learning accelerators for non-spiking neural networks; however, these architectures are not well suited to the dynamic, irregular parallelism in SNNs. Prior efforts on specialized SNN hardware utilize spatial architectures, wherein each neuron is allocated a dedicated processing element, and large networks are realized by connecting multiple chips into a system. While suitable for large-scale systems, this approach is not a good match for size- or cost-constrained mobile devices. We propose PEASE, a Programmable Event-driven processor Architecture for SNN Evaluation. PEASE comprises Spike Processing Units (SPUs) that are dynamically scheduled to execute computations triggered by a spike. Instructions to the SPUs are dynamically generated by Spike Schedulers (SSs) that utilize event queues to track unprocessed spikes and identify neurons that need to be evaluated. The memory hierarchy in PEASE is fully software managed, and the processing elements are interconnected using a two-tiered bus-ring topology matching the communication characteristics of SNNs. We propose a method to map any given SNN to PEASE such that the workload is balanced across SPUs and SPU clusters, while pipelining across layers of the network to improve performance. We implemented PEASE at the RTL level and synthesized it to IBM 45nm technology. Across 6 SNN benchmarks, our 64-SPU configuration of PEASE achieves 7.1×-17.5× and 2.6×-5.8× speedups, respectively, over software implementations on an Intel Xeon E5-2680 CPU and NVIDIA Tesla K40C GPU. The energy reductions over the CPU and GPU are 71×-179× and 198×-467×, respectively.
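The event-driven evaluation style described above (work is generated only when spikes occur, tracked in an event queue) can be shown with a minimal integrate-and-fire model. This is a sketch of the general technique only; PEASE's scheduling, software-managed memory, and bus-ring interconnect are hardware mechanisms not modeled here, and the weights and threshold are arbitrary.

```python
# Minimal event-driven SNN evaluation: spikes sit in an event queue, and
# each dequeued spike generates work only at its fan-out targets.
from collections import deque

THRESHOLD = 1.0
weights = {0: [(2, 0.6)], 1: [(2, 0.5)], 2: []}   # neuron -> [(target, w)]
potential = {n: 0.0 for n in weights}             # membrane potentials

def run(initial_spikes):
    fired = []
    queue = deque(initial_spikes)     # event queue of unprocessed spikes
    while queue:
        src = queue.popleft()
        fired.append(src)
        for dst, w in weights[src]:   # work exists only where spikes land
            potential[dst] += w
            if potential[dst] >= THRESHOLD:
                potential[dst] = 0.0  # reset and emit a new spike event
                queue.append(dst)
    return fired

events = run([0, 1])   # input spikes at neurons 0 and 1 trigger neuron 2
```

Note that neuron 2 fires only after both input spikes arrive; nothing is computed for quiescent neurons, which is the efficiency argument for event-driven SNN hardware.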
  • Publication
    Tunable stochastic computing using layered synthesis and temperature adaptive voltage scaling
    (01-01-2013)
    Gala, Neel; Devanathan, V. R.; Visvanathan, V.; Gandhi, Virat;
    With increasing computing power in mobile devices, conserving battery power (or extending battery life) has become crucial. This, together with the fact that most applications running on these devices are increasingly error tolerant, has created immense interest in stochastic (or inexact) computing. In this paper, we present a framework wherein devices can operate at varying error-tolerant modes while significantly reducing the power dissipated. Further, in very deep sub-micron technologies, temperature plays a crucial role in both performance and power. The proposed framework presents a novel layered synthesis optimization coupled with temperature-aware supply and body-bias voltage scaling to operate the design at various 'tunable' error-tolerant modes. We implement the proposed technique on an H.264 decoder block in an industrial 28nm low leakage technology node, and demonstrate reductions in total power varying from 30% to 45% while changing the operating mode from exact computing to inaccurate/error-tolerant computing. © 2013 IEEE.
  • Publication
    PERI: A Configurable Posit Enabled RISC-V Core
    (01-06-2021)
    Tiwari, Sugandha; Gala, Neel; ;
    Owing to the failure of Dennard scaling, the past decade has seen a steep growth of new paradigms leveraging opportunities in computer architecture. Two technologies of interest are Posit and RISC-V. Posit was introduced in mid-2017 as a viable alternative to IEEE-754, and RISC-V provides a commercial-grade open-source Instruction Set Architecture (ISA). In this article, we bring these two technologies together and propose a Configurable Posit Enabled RISC-V Core called PERI. The article provides insights on how the Single-Precision Floating Point ("F") extension of RISC-V can be leveraged to support posit arithmetic. We also present the implementation details of a parameterized and feature-complete posit Floating Point Unit (FPU). The configurability and parameterization features of this unit generate optimal hardware, which caters to the accuracy and energy/area trade-offs imposed by the application, a feature not possible with an IEEE-754 implementation. The posit FPU has been integrated with the RISC-V compliant SHAKTI C-class core as an execution unit. To further leverage the potential of posit, we enhance our posit FPU to support two different exponent sizes (with posit-size being 32 bits), thereby enabling multiple precisions at runtime. To enable the compilation and execution of C programs on PERI, we have made minimal modifications to the GNU C Compiler (GCC), targeting the "F" extension of RISC-V. We compare posit with IEEE-754 in terms of hardware area, application accuracy, and runtime. We also present an alternate methodology of integrating the posit FPU with the RISC-V core as an accelerator using the custom opcode space of RISC-V.
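The configurability the abstract above describes hinges on the posit format itself: a sign bit, a variable-length regime field, es exponent bits, and the remaining bits as fraction. The decoder below follows the general posit definition, parameterized by width n and exponent size es (mirroring the FPU's two runtime exponent sizes); it is a reference sketch, not PERI's RTL, and uses small 8-bit examples for readability.

```python
# Reference posit decoder: sign | regime (run-length) | exponent (es bits)
# | fraction. value = (-1)^s * 2^(k*2^es + e) * (1 + f/2^fbits).

def decode_posit(bits, n, es):
    if bits == 0:
        return 0.0
    if bits == 1 << (n - 1):
        return float("nan")              # NaR (Not a Real)
    sign = -1.0 if bits >> (n - 1) else 1.0
    if sign < 0:
        bits = (1 << n) - bits           # negatives are two's complement
    body = bits & ((1 << (n - 1)) - 1)   # strip the sign bit
    # Regime: run of identical bits ended by the opposite bit.
    rbit = (body >> (n - 2)) & 1
    run, pos = 0, n - 2
    while pos >= 0 and ((body >> pos) & 1) == rbit:
        run += 1
        pos -= 1
    k = run - 1 if rbit else -run
    pos -= 1                             # skip the regime terminator bit
    # Exponent: next es bits; bits cut off at the word edge read as 0.
    e = 0
    for _ in range(es):
        e <<= 1
        if pos >= 0:
            e |= (body >> pos) & 1
            pos -= 1
    # Fraction: remaining bits, with a hidden leading 1.
    frac, fbits = 0, 0
    while pos >= 0:
        frac = (frac << 1) | ((body >> pos) & 1)
        fbits += 1
        pos -= 1
    value = sign * (2.0 ** (k * (1 << es) + e))
    if fbits:
        value *= 1 + frac / (1 << fbits)
    return value

one = decode_posit(0b01000000, n=8, es=0)   # regime 10 -> k=0 -> 1.0
```

Because the regime is run-length encoded, precision tapers away from 1.0; changing es redistributes dynamic range versus precision, which is the trade-off the configurable FPU exposes.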