Veezhinathan Kamakoti
Preferred name: Veezhinathan Kamakoti
Official name: Veezhinathan Kamakoti
Alternative names: Veezhinathan, Kamakoti; Kamakoti, V.; Kamakoti, Veezhinathan
7 results
- Publication: Sparsity-Aware Caches to Accelerate Deep Neural Networks (01-03-2020)
  Ganesan, Vinod; Sen, Sanchari; Kumar, Pratyush; Gala, Neel; Raghunathan, Anand
  Deep Neural Networks (DNNs) have transformed the field of artificial intelligence and represent the state of the art in many machine learning tasks. There is considerable interest in using DNNs to realize edge intelligence in highly resource-constrained devices such as wearables and IoT sensors. Unfortunately, the high computational requirements of DNNs pose a serious challenge to their deployment in these systems. Moreover, due to tight cost (and hence area) constraints, these devices are often unable to accommodate hardware accelerators, requiring DNNs to execute on the General Purpose Processor (GPP) cores that they contain. We address this challenge through lightweight micro-architectural extensions to the memory hierarchy of GPPs that exploit a key attribute of DNNs, viz. sparsity, or the prevalence of zero values. We propose SparseCache, an enhanced cache architecture that utilizes a null cache based on a Ternary Content Addressable Memory (TCAM) to compactly store zero-valued cache lines, while storing non-zero lines in a conventional data cache. By storing addresses rather than values for zero-valued cache lines, SparseCache increases the effective cache capacity, thereby reducing the overall miss rate and execution time. SparseCache utilizes a Zero Detector and Approximator (ZDA) and an Address Merger (AM) to perform reads and writes to the null cache. We evaluate SparseCache on four state-of-the-art DNNs programmed with the Caffe framework. SparseCache achieves a 5-28% reduction in miss rate, which translates to a 5-21% reduction in execution time, with only 0.1% area and 3.8% power overhead in comparison to a low-end Intel Atom Z-series processor.
- Publication: ProCA: Progressive configuration aware design methodology for low power stochastic ASICs (03-03-2014)
  Gala, Neel; Devanathan, V. R.; Srinivasan, Karthik; Visvanathan, V.
  With increasing integration of capabilities into mobile application processors, a host of imaging operations that were earlier performed in software are now implemented in hardware. Though imaging applications are inherently error resilient, the complexity of such designs has increased over time, and identifying logic that can be leveraged for energy-quality trade-offs has become difficult. This paper proposes a Progressive Configuration Aware (ProCA) criticality analysis framework, 10X faster than the state of the art, that identifies logic which is functionally critical to output quality while accounting for the various modes of operation of the design. Using this framework, we demonstrate how a low-power tunable stochastic design can be derived. The proposed methodology uses layered synthesis and voltage scaling mechanisms as its primary tools for power reduction. We demonstrate the methodology on a production-quality imaging IP implemented in 28nm low leakage technology. For the tunable stochastic imaging IP, we gain up to 10.57% power reduction in exact mode and up to 32.53% power reduction in error-tolerant mode (30dB PSNR), with negligible design overhead. © 2014 IEEE.
- Publication: ChADD: An ADD Based Chisel Compiler with Reduced Syntactic Variance (16-03-2016)
  Chauhan, Vikas; Gala, Neel
  The need for quick design space exploration and the higher-level abstractions required to design complex circuits have led designers to adopt High Level Synthesis (HLS) languages for hardware generation. Chisel is one such language: it offers a majority of the abstraction facilities found in today's software languages and also guarantees synthesizability of the generated hardware. However, most HLS languages, including Chisel, suffer from syntactic variance: the hardware inferred by these languages is inconsistent and relies heavily on the description style used by the designer. Thus, semantically equivalent circuit descriptions with different syntax can lead to different hardware utilization. In this paper, we propose the use of Assignment Decision Diagrams (ADDs) as an intermediate representation between Chisel and the target netlist representation. Following this path, we show that for a given design, two different styles of Chisel implementation yield the same target netlist, thereby ensuring syntactic invariance. For the same design implementations, the conventional Chisel compiler exhibits significant syntactic variance. In addition, we show empirically that the netlist generated by the proposed technique is competitive with the most optimized netlist generated by the conventional compiler when targeting an FPGA, implying that different implementations lead to close-to-optimal solutions.
- Publication: Shakti-T: A RISC-V processor with light weight security extensions (25-06-2017)
  Menon, Arjun; Murugan, Subadra; Gala, Neel
  With the increased use of compute cores for sensitive applications, including e-commerce, there is a need for additional hardware support to secure information against memory-based attacks. This work presents a unified hardware framework for handling spatial and temporal memory attacks. The paper integrates the proposed hardware framework with a RISC-V based micro-architecture and an enhanced application binary interface that enables software layers to use these features to protect sensitive data. We demonstrate the effectiveness of the proposed scheme through practical case studies, in addition to taking the design through a VLSI CAD design flow. The proposed processor reduces the metadata storage overhead by up to 4x in comparison with existing solutions, while incurring an area overhead of just 1914 LUTs and 2197 flip-flops on an FPGA, without affecting the critical path delay of the processor.
- Publication: A Programmable Event-driven Architecture for Evaluating Spiking Neural Networks (11-08-2017)
  Roy, Arnab; Venkataramani, Swagath; Gala, Neel; Sen, Sanchari; Raghunathan, Anand
  Spiking neural networks (SNNs) represent the third generation of neural networks and are expected to enable new classes of machine learning applications. However, evaluating large-scale SNNs (e.g., of the scale of the visual cortex) on power-constrained systems requires significant improvements in computing efficiency. A unique attribute of SNNs is their event-driven nature: information is encoded as a series of spikes, and work is dynamically generated as spikes propagate through the network. Therefore, parallel implementations of SNNs on multi-cores and GPGPUs are severely limited by communication and synchronization overheads. Recent years have seen great interest in deep learning accelerators for non-spiking neural networks; however, these architectures are not well suited to the dynamic, irregular parallelism in SNNs. Prior efforts on specialized SNN hardware utilize spatial architectures, wherein each neuron is allocated a dedicated processing element and large networks are realized by connecting multiple chips into a system. While suitable for large-scale systems, this approach is not a good match for size- or cost-constrained mobile devices. We propose PEASE, a Programmable Event-driven processor Architecture for SNN Evaluation. PEASE comprises Spike Processing Units (SPUs) that are dynamically scheduled to execute computations triggered by a spike. Instructions to the SPUs are dynamically generated by Spike Schedulers (SSs) that utilize event queues to track unprocessed spikes and identify neurons that need to be evaluated. The memory hierarchy in PEASE is fully software managed, and the processing elements are interconnected using a two-tiered bus-ring topology matching the communication characteristics of SNNs. We propose a method to map any given SNN to PEASE such that the workload is balanced across SPUs and SPU clusters, while pipelining across layers of the network to improve performance. We implemented PEASE at the RTL level and synthesized it to IBM 45 nm technology. Across 6 SNN benchmarks, our 64-SPU configuration of PEASE achieves 7.1×-17.5× and 2.6×-5.8× speedups, respectively, over software implementations on an Intel Xeon E5-2680 CPU and an NVIDIA Tesla K40C GPU. The energy reductions over the CPU and GPU are 71×-179× and 198×-467×, respectively.
- Publication: Tunable stochastic computing using layered synthesis and temperature adaptive voltage scaling (01-01-2013)
  Gala, Neel; Devanathan, V. R.; Visvanathan, V.; Gandhi, Virat
  With increasing computing power in mobile devices, conserving battery power (or extending battery life) has become crucial. This, together with the fact that most applications running on these mobile devices are increasingly error tolerant, has created immense interest in stochastic (or inexact) computing. In this paper, we present a framework wherein devices can operate at varying error-tolerant modes while significantly reducing the power dissipated. Further, in very deep sub-micron technologies, temperature plays a crucial role in both performance and power. The proposed framework combines a novel layered synthesis optimization with temperature-aware supply and body bias voltage scaling to operate the design at various 'tunable' error-tolerant modes. We implement the proposed technique on an H.264 decoder block in an industrial 28nm low leakage technology node, and demonstrate reductions in total power varying from 30% to 45% while changing the operating mode from exact computing to inaccurate/error-tolerant computing. © 2013 IEEE.
- Publication: SHAKTI-F: A Fault Tolerant Microprocessor Architecture (28-02-2015)
  Gupta, Sukrat; Gala, Neel; Madhusudan, G. S.
  Deeply scaled CMOS circuits are vulnerable to soft and hard errors. These errors pose reliability concerns, especially for systems used in radiation-prone environments such as space and nuclear applications. This paper presents SHAKTI-F, a RISC-V based SEE-tolerant microprocessor architecture that addresses these reliability issues. The proposed architecture uses error correcting codes (ECC) to tolerate errors in registers and memories, while it employs a combination of space- and time-redundancy based techniques to tolerate errors in the ALU. Two novel re-computation techniques for detecting errors in the addition/subtraction and multiplication modules are proposed. The scheme also identifies the parts of the circuitry that need to be radiation hardened, thus providing total protection against SEEs. The proposed scheme provides fine-grained error detection that helps localize an error to a specific functional unit and isolate that unit, rather than the entire processor or a large module within it. This gives the processor graceful degradation and/or fail-safe shutdown capability. The HDL model of the processor was validated by simulating it with randomly induced SEEs. The proposed scheme adds a penalty of only 20% on core area and 25% on performance compared with conventional systems, far lower than the penalty incurred by double and triple modular redundancy schemes. Interestingly, there is a 45% reduction in power consumption due to the introduction of fault tolerance. The resulting system runs at 330 MHz on a 55nm technology node, which is sufficient for the class of applications these cores target.
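The SHAKTI-F abstract does not spell out its two recomputation schemes, but the classic time-redundancy technique in this family is RESO (Recomputing with Shifted Operands): perform the operation twice, the second time with both operands shifted left by one bit, and compare the two results. The sketch below is a behavioral Python illustration of that generic idea, not the paper's specific mechanisms; the function names and the stuck-at fault model are hypothetical.

```python
def checked_add(a: int, b: int, adder, width: int = 32) -> int:
    """RESO-style time-redundancy check: run `adder` twice, the second
    time with both operands shifted left by one bit, and compare.
    A fault pinned to one output bit position affects the two runs
    at different bit significances, so the comparison flags it."""
    mask = (1 << width) - 1
    primary = adder(a, b) & mask
    # The shifted recomputation needs two extra bits of headroom so the
    # carry-out is not lost before the comparison.
    recomputed = adder(a << 1, b << 1) & ((1 << (width + 2)) - 1)
    if (recomputed >> 1) & mask != primary:
        raise RuntimeError("fault detected in adder")
    return primary


def stuck_at_adder(x: int, y: int, stuck_bit: int = 3) -> int:
    """Hypothetical model of a defective adder whose output bit
    `stuck_bit` is stuck at 1."""
    return (x + y) | (1 << stuck_bit)
```

For example, `checked_add(3, 5, lambda x, y: x + y)` returns 8, while `checked_add(0, 0, stuck_at_adder)` raises, because the stuck bit lands at different significances in the two runs. Note that a single RESO check catches a given stuck-at fault for most but not all operand pairs, which is one reason hardware schemes pair such checks with ECC-protected state, as the abstract describes.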