Madhu Mutyam
Preferred name
Madhu Mutyam
Official Name
Madhu Mutyam
Alternative Name
Mutyam, Madhu
50 results (showing 1-10)
- Publication: Way sharing set associative cache architecture (24-04-2012)
  Authors: Janraj, C. J.; Kalyan, T. Venkata; Warrier, Tripti
  In order to minimize the conflict miss rate, cache memories can be organized in a set-associative manner. The downside of increasing associativity is an increase in per-access energy consumption. In a conventional n-way set-associative cache, each set has n cache ways at its disposal irrespective of set-wise demand, yet cache sets may exhibit nonuniform demand for these ways. Exploiting this property, we propose a novel cache architecture, called the way sharing cache, wherein, by allowing a pair of cache sets to share cache ways, we obtain dynamic energy savings as high as 41% in the DL1 cache with negligible performance penalty. © 2012 IEEE.
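The way-sharing idea above lends itself to a small illustration. The Python sketch below pairs two sets over a shared pool of ways and lets one set borrow ways from its partner on demand; the pairing, victim choice, and class names are illustrative assumptions, not the paper's actual allocation policy or energy model.

```python
# Minimal sketch of way sharing: two paired sets draw from a combined pool of
# 2*n ways, so a set under pressure can borrow ways its partner is not using.

class WaySharingPair:
    """Two cache sets that share a combined pool of 2*n ways (illustrative)."""

    def __init__(self, ways_per_set):
        self.ways_per_set = ways_per_set
        self.pool = []  # each entry: (owner_set, tag); owner_set is 0 or 1

    def lookup(self, set_id, tag):
        return any(o == set_id and t == tag for o, t in self.pool)

    def insert(self, set_id, tag):
        if self.lookup(set_id, tag):
            return
        if len(self.pool) < 2 * self.ways_per_set:
            self.pool.append((set_id, tag))
            return
        # Evict from whichever set currently holds at least its fair share,
        # so ways lent to the partner are reclaimed first.
        counts = {0: 0, 1: 0}
        for owner, _ in self.pool:
            counts[owner] += 1
        victim_owner = set_id if counts[set_id] >= self.ways_per_set else 1 - set_id
        for i, (owner, _) in enumerate(self.pool):
            if owner == victim_owner:
                del self.pool[i]
                break
        self.pool.append((set_id, tag))
```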
- Publication: An application-aware cache replacement policy for last-level caches (27-02-2013)
  Authors: Warrier, Tripti S.; Anupama, B.
  Present-day multicore processors employ a multi-level cache hierarchy with one or two levels of private caches and a shared last-level cache (LLC). Efficient cache replacement policies at the LLC are essential for reducing off-chip memory traffic as well as contention for memory bandwidth. Cache replacement techniques designed for unicore LLCs may not be efficient for multicore LLCs, which can be shared by simultaneously running applications with varying access behavior. One application may dominate another by flooding the cache with requests and evicting the other application's useful data. This paper proposes a new cache replacement policy for shared LLCs called Application-aware Cache Replacement (ACR). The ACR policy prevents a high-access-rate application from victimizing a low-access-rate one. It dynamically tracks the maximum lifetime of cache lines in the shared LLC for each concurrent application, which helps utilize the cache space efficiently. Experimental evaluation of the ACR technique for 2-core and 4-core systems using the SPEC CPU 2000 and 2006 benchmark suites shows significant speed-up over the least recently used and thread-aware dynamic re-reference interval prediction techniques. © 2013 Springer-Verlag.
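To make the application-aware idea concrete, the sketch below keeps a per-application estimate of maximum line lifetime and prefers victims that have outlived that estimate; the data structures and the lifetime heuristic are simplified assumptions, not the published ACR policy.

```python
# Rough sketch in the spirit of ACR: each line remembers which application
# inserted it, and a per-application lifetime estimate guides victim choice so
# a high-access-rate application cannot simply flush out another's lines.

class ACRSet:
    def __init__(self, ways):
        self.ways = ways
        self.lines = []         # each: {"tag", "app", "age"}
        self.max_lifetime = {}  # per-application observed maximum lifetime

    def access(self, tag, app):
        for line in self.lines:
            line["age"] += 1
        for line in self.lines:
            if line["tag"] == tag:
                # Hit: record how long this application's line lived before reuse.
                self.max_lifetime[app] = max(self.max_lifetime.get(app, 0), line["age"])
                line["age"] = 0
                return True
        self._insert(tag, app)
        return False

    def _insert(self, tag, app):
        if len(self.lines) < self.ways:
            self.lines.append({"tag": tag, "app": app, "age": 0})
            return
        # Prefer victims that have outlived their application's max lifetime;
        # otherwise fall back to the oldest line.
        expired = [l for l in self.lines if l["age"] > self.max_lifetime.get(l["app"], 0)]
        victim = max(expired or self.lines, key=lambda l: l["age"])
        self.lines.remove(victim)
        self.lines.append({"tag": tag, "app": app, "age": 0})
```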
- Publication: Fuzzy fairness controller for NVMe SSDs (29-06-2020)
  Authors: Tripathy, Shivani; Sahoo, Debiprasanna; Satpathy, Manoranjan
  Modern NVMe SSDs are widely deployed in diverse domains owing to their high performance, robustness, and energy efficiency. It has been observed that the impact of interference among concurrently running workloads on their overall response time differs significantly in these devices, which leads to unfairness. Workload intensity is a dominant factor influencing this interference. Prior works use a threshold value to characterize a workload as high- or low-intensity; such a characterization is limited because it carries no information about the degree of low or high intensity. The data cache in an SSD controller, usually DRAM-based, plays a crucial role in improving device throughput and lifetime. However, the degree of parallelism at this level is limited compared to the SSD back-end, which consists of several channels, chips, and planes, so the impact of interference can be more pronounced at the data cache. To the best of our knowledge, no prior work has addressed the fairness issue at the data cache level. In this work, we address this issue by proposing a fuzzy-logic-based fairness control mechanism. A fuzzy fairness controller characterizes the degree of flow intensity (i.e., the rate at which requests are generated) of a workload and assigns priorities to the workloads accordingly. We implement the proposed mechanism in the MQSim framework and observe that our technique improves the fairness, weighted speedup, and harmonic speedup of the SSD by 29.84%, 11.24%, and 24.90% on average, respectively, over the state of the art. The peak gains in fairness, weighted speedup, and harmonic speedup are 2.02x, 29.44%, and 56.30%, respectively.
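A toy illustration of fuzzy, rather than threshold-based, workload characterization follows: a workload's request rate receives a degree of membership in "low" and "high" intensity, and a priority is derived from those degrees. The membership shape, breakpoints, and priority rule are assumptions for the example, not the controller evaluated in the paper.

```python
# Instead of a hard high/low cutoff, compute a degree of low-intensity
# membership and derive a data-cache priority from it.

def membership_low(requests_per_ms, low=50.0, high=200.0):
    """Degree (0..1) to which a request rate counts as 'low intensity'."""
    if requests_per_ms <= low:
        return 1.0
    if requests_per_ms >= high:
        return 0.0
    return (high - requests_per_ms) / (high - low)

def priority(requests_per_ms):
    mu_low = membership_low(requests_per_ms)
    mu_high = 1.0 - mu_low
    # Defuzzify: low-intensity flows get higher priority so they are not
    # starved by high-intensity flows sharing the data cache.
    return mu_low * 1.0 + mu_high * 0.2

for rate in (20, 120, 400):
    print(rate, round(priority(rate), 2))
```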
- Publication: Formal modeling and verification of controllers for a family of DRAM caches (01-11-2018)
  Authors: Sahoo, Debiprasanna; Sha, Swaraj; Satpathy, Manoranjan; Ramesh, S.; Roop, Partha
  Die-stacking technology enables the use of high-density DRAM as a cache. Major processor vendors have recently started using such stacked DRAM modules as the last-level cache of their products. These stacked DRAM modules provide high bandwidth with relatively low latency compared to off-package DRAM modules. Recent studies on DRAM caches propose several variants to optimize system performance and power. However, none of the existing works discusses their design and verification aspects. A DRAM cache controller (DCC) is significantly more complex than a conventional DRAM-based main memory controller, because it must handle both the timing behavior of the DRAM system and the functional behavior of a cache. Therefore, without rigorous modeling and verification of such designs, it is difficult to ensure correctness. In this work, we focus on the design and verification issues of DCCs. We select a common variant of the DRAM cache and build a formal model of its controller in terms of interacting state machines; we call this common variant the baseline and its model the base model. We then verify safety, liveness, and timing properties of this variant using model checking. Next, we demonstrate how the formal models and the associated properties of other DCC variants can be derived from the base model in a systematic way. Analyzing the individual DRAM cache variations, we observe that most of the variants exhibit product-line characteristics.
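The following sketch conveys only the flavor of modeling a controller as a state machine and exhaustively checking a safety property, here a trivial activate/read/precharge protocol for one abstract bank. A real DCC model, as described above, involves many interacting machines and timing constraints; the states, commands, and depth bound are illustrative assumptions.

```python
# Enumerate command sequences against a tiny bank state machine and flag any
# sequence that issues an illegal command (an explicit-state safety check).

from itertools import product

COMMANDS = ("ACT", "READ", "PRE")

def step(state, cmd):
    """Return the next state, or None if the command is illegal in this state."""
    table = {
        ("IDLE", "ACT"): "ACTIVE",
        ("ACTIVE", "READ"): "ACTIVE",
        ("ACTIVE", "PRE"): "IDLE",
    }
    return table.get((state, cmd))

def explore(depth=4):
    """Check every command sequence of the given length from the IDLE state."""
    violations = []
    for seq in product(COMMANDS, repeat=depth):
        state = "IDLE"
        for cmd in seq:
            nxt = step(state, cmd)
            if nxt is None:
                violations.append((seq, state, cmd))
                break
            state = nxt
    return violations

print(len(explore()), "command sequences violate the protocol at depth 4")
```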
- Publication: PBC: Prefetched Blocks Compaction (01-08-2016)
  Authors: Raghavendra, K.; Panda, Biswabandan
  Cache compression improves the performance of a multi-core system by storing more cache blocks in a compressed format. Compression is achieved by exploiting data patterns present within a block. For a given cache space, compression increases the effective cache capacity; however, this increase is limited by the number of tags that can be accommodated in the cache. Prefetching is another technique that improves system performance, fetching blocks into the cache ahead of time and hiding the off-chip latency. Commonly used hardware prefetchers, such as stream and stride prefetchers, fetch multiple contiguous blocks into the cache. In this paper we propose prefetched blocks compaction (PBC), which exploits the data patterns present across these prefetched blocks. PBC compacts the prefetched blocks into a single block with a single tag, effectively increasing the cache capacity. We also modify the cache organization so that the multiple cache blocks residing in a single block can be accessed without extra tag look-ups. PBC improves system performance by 11.1 percent, with a maximum of 43.4 percent, on a four-core system.
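As a rough illustration of compacting prefetched blocks under one tag, the sketch below accepts a group of contiguous prefetched blocks only if all their words fit within a narrow delta of a common base; the compressibility test and entry layout are assumptions, not PBC's actual compression scheme or cache organization.

```python
# If blocks fetched by a stream prefetcher share a data pattern (here, a
# base-plus-small-delta test), store them as one compacted entry with one tag.

def compactable(blocks, delta_bits=16):
    """True if every word in every block fits within a signed delta of a base."""
    words = [w for block in blocks for w in block]
    base = words[0]
    limit = 1 << (delta_bits - 1)
    return all(-limit <= w - base < limit for w in words)

def compact(tag, blocks):
    """Return a single cache entry covering all prefetched blocks, or None."""
    if compactable(blocks):
        return {"tag": tag, "blocks": blocks, "compacted": True}
    return None

prefetched = [[1000 + i for i in range(8)], [1008 + i for i in range(8)]]
print(compact(0x40, prefetched))
```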
- Publication: EFGR: An enhanced Fine Granularity Refresh feature for high-performance DDR4 DRAM devices (27-10-2014)
  Authors: Tawa, Venkata Kalyan; Kasha, Ravi
  High-density DRAM devices spend significant time refreshing the DRAM cells, leading to a performance drop. The JEDEC DDR4 standard provides a Fine Granularity Refresh (FGR) feature to tackle refresh overhead. Motivated by the observation that only a few banks are involved in FGR mode, we propose an Enhanced FGR (EFGR) feature that introduces three optimizations to the basic FGR feature and exposes bank-level parallelism within the rank even during refresh. The first optimization decouples the non-refreshing banks. The second and third optimizations determine the maximum number of non-refreshing banks that can remain active during refresh and selectively precharge banks before refresh, respectively. Our simulation results show that the EFGR feature recovers almost 56.6% of the performance loss incurred due to refresh operations.
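The core observation behind EFGR can be sketched as a simple admissibility check: while a fine-granularity refresh occupies a subset of banks, requests to the other banks may still be serviced, up to a limit on concurrently open non-refreshing banks. The bank sets and the limit below are example values, not the maximum derived in the paper.

```python
# Decide whether a request can proceed while some banks are busy with a
# fine-granularity refresh pulse.

def serviceable(request_bank, refreshing_banks, active_banks, max_active=4):
    """Can this request be serviced during the ongoing refresh?"""
    if request_bank in refreshing_banks:
        return False                        # bank is busy refreshing
    if request_bank in active_banks:
        return True                         # bank already open, keep using it
    return len(active_banks) < max_active   # budget for additional open banks

refreshing = {0, 1, 2, 3}                   # banks covered by this refresh pulse
active = {5}
for bank in (2, 5, 9, 12):
    print(bank, serviceable(bank, refreshing, active))
```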
- Publication: Endurance enhancement of write-optimized STT-RAM caches (30-09-2019)
  Authors: Saraf, Puneet
  The low density and high leakage power of SRAM are major setbacks for its scalability. A non-volatile memory (NVM) such as spin-transfer torque RAM (STT-RAM) is a suitable replacement for SRAM at the last-level cache (LLC). NVMs offer high density and near-zero leakage, which are highly desirable for on-chip caches. A few drawbacks of STT-RAM, such as high write latency and limited endurance, have to be addressed before it can replace SRAM at the LLC. Prior works have tried to optimize either the write latency or the endurance. In this paper, considering write-optimized STT-RAM, we propose endurance improvement techniques that reduce the maximum number of writes, the global write variation, and the average number of writes. We handle the low retention time of write-optimized STT-RAM using a refresh mechanism. We employ refresh-aware cache replacement policies wherein cache blocks that are about to expire are preferred as victims over recently refreshed cache blocks. This refresh-aware policy, when combined with recency information of the cache blocks, enhances both the performance and the endurance of an STT-RAM LLC. We show that our refresh-aware policy achieves a maximum lifetime improvement of 32.5% for single-core and 70.7% for multi-core systems compared to STT-RAM with no wear leveling. When we combine recency information with our refresh-aware policy, performance also improves slightly.
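A minimal sketch of a refresh-aware victim choice follows: blocks whose remaining retention time is nearly exhausted are preferred as victims over recently refreshed blocks, with recency as the tie-breaker. The retention value, expiry window, and combination rule are assumptions for illustration, not the paper's exact policy.

```python
# Pick a victim for a low-retention STT-RAM cache set: prefer blocks about to
# expire; among candidates, evict the least recently used one.

RETENTION = 100  # cycles a block survives after its last write/refresh (assumed)

def pick_victim(lines, now, expiry_window=10):
    """lines: list of dicts with 'last_refresh' and 'last_use' timestamps."""
    about_to_expire = [l for l in lines
                       if RETENTION - (now - l["last_refresh"]) <= expiry_window]
    candidates = about_to_expire or lines
    return min(candidates, key=lambda l: l["last_use"])

lines = [{"id": 0, "last_refresh": 5,  "last_use": 90},
         {"id": 1, "last_refresh": 80, "last_use": 40},
         {"id": 2, "last_refresh": 60, "last_use": 95}]
print(pick_victim(lines, now=100)["id"])   # block 0 is near expiry, so it is chosen
```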
- Publication: P systems with membrane creation: Universality and efficiency (01-01-2001)
  Authors: Krithivasan, Kamala
  P systems, introduced by Gh. Păun, form a new class of distributed computing models. Several variants of P systems have already been shown to be computationally universal. In this paper, we propose a new variant, P systems with membrane creation, in which some objects are productive and create membranes. This new variant is capable of solving the Hamiltonian Path Problem in linear time. We show that P systems with membrane creation are computationally complete. © Springer-Verlag Berlin Heidelberg 2001.
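A membrane-creation rule can be illustrated with a tiny data structure: a productive object, when consumed, spawns a child membrane containing the produced objects. Real P systems apply rules non-deterministically and in a maximally parallel manner; this sequential toy, with invented labels, shows only the creation step itself.

```python
# One membrane-creation step: consuming a productive object creates a child
# membrane holding the objects the rule produces.

class Membrane:
    def __init__(self, label, objects):
        self.label = label
        self.objects = list(objects)
        self.children = []

def apply_creation_rule(membrane, trigger, new_label, new_objects):
    """Rule of the form: trigger -> [ new_label : new_objects ]."""
    if trigger in membrane.objects:
        membrane.objects.remove(trigger)
        membrane.children.append(Membrane(new_label, new_objects))
        return True
    return False

skin = Membrane("skin", ["a", "b"])
apply_creation_rule(skin, "a", "m1", ["c", "c"])
print([child.label for child in skin.children], skin.objects)   # ['m1'] ['b']
```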
- Publication: SFFMap: Set-First Fill mapping for an energy efficient pipelined data cache (03-12-2014)
  Authors: Majumder, Pritam; Kalyan T., Venkata
  Conventionally, consecutively addressed blocks are mapped onto different cache sets. In this work, we propose a new block address mapping, Set-First Fill mapping (SFFMap), for pipelined L1 data caches, wherein consecutively addressed data blocks are mapped onto the same set. This increases the inter-block spatial locality within a cache set. To exploit SFFMap, we propose to store, and if possible access, the most recently used set in the cache's pipeline registers. Further, selective access (SSA) and selective update (SSU) techniques are proposed for the set buffer to increase the effectiveness of SFFMap. Our experimental evaluation of in-order and out-of-order processors with an 8-way set-associative data cache shows that SFFMap, together with SSA and SSU, achieves around a 27% reduction in dynamic energy and a 4-5% performance improvement. The proposed techniques need only minor modifications to the existing hardware, making the design easy to adopt.
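The indexing idea behind SFFMap can be sketched by comparing set-index computations: a conventional cache takes the address bits just above the block offset, so consecutive blocks spread across sets, while a set-first fill style mapping skips enough low-order bits that a run of consecutive blocks fills one set first. The bit widths below are example parameters, not the configuration evaluated in the paper.

```python
# Compare conventional set indexing with a set-first-fill style index that
# keeps consecutive blocks in the same set.

BLOCK_BITS = 6   # 64-byte blocks (example)
SET_BITS = 7     # 128 sets (example)
WAY_BITS = 3     # 8 consecutive blocks share a set under the SFF-style mapping

def conventional_set(addr):
    return (addr >> BLOCK_BITS) & ((1 << SET_BITS) - 1)

def sff_style_set(addr):
    # Skip the bits that select among consecutive blocks within a set,
    # so 2**WAY_BITS consecutive blocks map to the same set.
    return (addr >> (BLOCK_BITS + WAY_BITS)) & ((1 << SET_BITS) - 1)

base = 0x1234000
for i in range(4):
    addr = base + i * 64
    print(hex(addr), conventional_set(addr), sff_style_set(addr))
```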
- Publication: DeBAR: Deflection based adaptive router with minimal buffering (01-01-2013)
  Authors: Jose, John; Nayak, Bhawna; Kumar, Kranthi
  The energy efficiency of the underlying communication framework plays a major role in the performance of multicore systems. NoCs with buffer-less routing are gaining popularity due to the simplicity of the router design, low power consumption, and load-balancing capability. With a minimal number of buffers, deflection routers evenly distribute traffic across links. In this paper, we propose an adaptive deflection router, DeBAR, that uses a minimal set of central buffers to accommodate a fraction of the mis-routed flits. DeBAR incorporates a hybrid flit ejection mechanism that gives the effect of dual ejection with a single ejection port, an innovative adaptive routing algorithm, and selective flit buffering based on flit marking. Our proposed router design reduces the average flit latency and the deflection rate, and improves throughput with respect to existing minimally buffered deflection routers, without any change in the critical path. © 2013 EDAA.
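A toy sketch of minimally buffered deflection routing follows: flits contend for their preferred output ports each cycle, a small central buffer absorbs a few of the losers, and any remaining losers are deflected to whatever ports are still free. The arbitration order, buffer capacity, and function names are assumptions; DeBAR's actual ejection, prioritization, and routing logic are more involved.

```python
# One router cycle: winners take their preferred ports, a few losers are
# parked in a small central buffer, the rest are deflected to free ports.

def route_cycle(flits, ports, central_buffer, buffer_capacity=4):
    """flits: list of (flit_id, preferred_port). Returns port assignments."""
    assigned = {}
    free_ports = set(ports)
    losers = []
    for flit_id, wanted in flits:
        if wanted in free_ports:
            assigned[flit_id] = wanted
            free_ports.remove(wanted)
        else:
            losers.append(flit_id)
    for flit_id in losers:
        if len(central_buffer) < buffer_capacity:
            central_buffer.append(flit_id)        # buffer instead of deflecting
        elif free_ports:
            assigned[flit_id] = free_ports.pop()  # deflect to any free port
    return assigned

buf = []
print(route_cycle([(1, "N"), (2, "N"), (3, "E")], ["N", "E", "S", "W"], buf), buf)
```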