Harish Guruprasad Ramaswamy
Preferred name: Harish Guruprasad Ramaswamy
Official Name: Harish Guruprasad Ramaswamy
Alternative Name: Ramaswamy, Harish G.
3 results
- Publication: On knowledge distillation from complex networks for response prediction (01-01-2019)
  Arora, Siddhartha; et al.
  Recent advances in Question Answering have led to the development of very complex models which compute rich representations for query and documents by capturing all pairwise interactions between query and document words. This makes these models expensive in space and time, and in practice one has to restrict the length of the documents that can be fed to these models. Such models have also been recently employed for the task of predicting dialog responses from available background documents (e.g., the Holl-E dataset). However, here the documents are longer, thereby rendering these complex models infeasible except in select restricted settings. In order to overcome this, we use standard simple models which do not capture all pairwise interactions, but learn to emulate certain characteristics of a complex teacher network. Specifically, we first investigate the conicity of representations learned by a complex model and observe that it is significantly lower than that of simpler models. Based on this insight, we modify the simple architecture to mimic this characteristic. We go further by using knowledge distillation approaches, where the simple model acts as a student and learns to match the output from the complex teacher network. We experiment with the Holl-E dialog dataset and show that by mimicking characteristics and matching outputs from a teacher, even a simple network can give improved performance.
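The distillation step described in this abstract amounts to training the small network to match the teacher's output distribution. Below is a minimal, illustrative NumPy sketch of such a matching loss; the temperature value, the function names, and the conicity definition used (mean cosine similarity of representations to their mean vector) are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def softmax(z, temperature=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = np.asarray(z, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)      # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between softened teacher and student output
    distributions, averaged over the batch: the student is trained to
    match the teacher's outputs (hyperparameters are illustrative)."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=-1)
    return kl.mean()

def conicity(vectors):
    """One common definition of conicity: average cosine similarity of each
    representation to the mean representation (assumed here, not quoted)."""
    V = np.asarray(vectors, dtype=float)
    mean = V.mean(axis=0)
    cos = V @ mean / (np.linalg.norm(V, axis=1) * np.linalg.norm(mean) + 1e-12)
    return cos.mean()

# Toy usage: a batch of 4 contexts with 5 candidate responses each.
rng = np.random.default_rng(0)
print(distillation_loss(rng.normal(size=(4, 5)), rng.normal(size=(4, 5))))
print(conicity(rng.normal(size=(10, 8))))
```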
- Publication: On controllable sparse alternatives to softmax (01-01-2018)
  Laha, Anirban; Chemmengath, Saneem A.; Agrawal, Priyanka; Sankaranarayanan, Karthik; et al.
  Converting an n-dimensional vector to a probability distribution over n objects is a commonly used component in many machine learning tasks like multiclass classification, multilabel classification, attention mechanisms, etc. For this, several probability mapping functions have been proposed and employed in the literature, such as softmax, sum-normalization, spherical softmax, and sparsemax, but there is very little understanding of how they relate to each other. Further, none of the above formulations offers an explicit control over the degree of sparsity. To address this, we develop a unified framework that encompasses all these formulations as special cases. This framework ensures simple closed-form solutions and existence of sub-gradients suitable for learning via backpropagation. Within this framework, we propose two novel sparse formulations, sparsegen-lin and sparsehourglass, that seek to provide a control over the degree of desired sparsity. We further develop novel convex loss functions that help induce the behavior of the aforementioned formulations in the multilabel classification setting, showing improved performance. We also demonstrate empirically that the proposed formulations, when used to compute attention weights, achieve better or comparable performance on standard seq2seq tasks like neural machine translation and abstractive summarization.
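For context, two of the mapping functions the abstract contrasts can be sketched directly: softmax is always dense, while sparsemax (Martins & Astudillo, 2016; a Euclidean projection onto the probability simplex) can assign exact zeros. The sparsegen_lin function below is only a hedged reading of how an explicit sparsity control could look, namely sparsemax applied to a rescaled input; its name and the rescaling should be treated as assumptions rather than the paper's exact formulation.

```python
import numpy as np

def softmax(z):
    """Dense probability map: every coordinate gets non-zero mass."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def sparsemax(z):
    """Sparsemax: Euclidean projection of z onto the probability simplex;
    coordinates below a data-dependent threshold become exactly zero."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]
    k = np.arange(1, z.size + 1)
    cumsum = np.cumsum(z_sorted)
    support = 1 + k * z_sorted > cumsum           # coordinates kept in the support
    k_z = k[support][-1]
    tau = (cumsum[support][-1] - 1.0) / k_z       # threshold subtracted from z
    return np.maximum(z - tau, 0.0)

def sparsegen_lin(z, lam=0.5):
    """Hedged sketch of a controllable-sparsity map: sparsemax on a rescaled
    input, where lam < 1 tunes sparsity (larger lam -> sparser output).
    Illustrative only; see the paper for the exact sparsegen-lin definition."""
    return sparsemax(np.asarray(z, dtype=float) / (1.0 - lam))

z = np.array([1.2, 0.9, 0.1, -0.5])
print("softmax      :", np.round(softmax(z), 3))
print("sparsemax    :", np.round(sparsemax(z), 3))   # -> [0.65, 0.35, 0., 0.]
print("sparsegen_lin:", np.round(sparsegen_lin(z, lam=0.5), 3))
```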
- Publication: Consistent algorithms for multiclass classification with an abstain option (01-01-2018)
  Tewari, Ambuj; Agarwal, Shivani; et al.
  We consider the problem of n-class classification (n ≥ 2), where the classifier can choose to abstain from making predictions at a given cost, say, a factor α of the cost of misclassification. Our goal is to design consistent algorithms for such n-class classification problems with a ‘reject option’; while such algorithms are known for the binary (n = 2) case, little has been understood for the general multiclass case. We show that the well known Crammer-Singer surrogate and the one-vs-all hinge loss, albeit with a different predictor than the standard argmax, yield consistent algorithms for this problem when α = ½. More interestingly, we design a new convex surrogate, which we call the binary encoded predictions surrogate, that is also consistent for this problem when α = ½ and operates on a much lower dimensional space (log(n) as opposed to n). We also construct modified versions of all these three surrogates to be consistent for any given α ∈ [0, ½].
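The cost structure in this abstract (misclassification cost 1, abstention cost α) admits a simple plug-in decision rule when class probabilities are available: predict the top class unless its probability falls below 1 − α, otherwise abstain. The sketch below shows only that generic rule; it is not the surrogate-based algorithms (Crammer-Singer, one-vs-all, or the binary encoded predictions surrogate) designed in the paper, and the names used are illustrative.

```python
import numpy as np

ABSTAIN = -1  # sentinel label for the reject option

def predict_with_reject(prob, alpha=0.5):
    """Plug-in rule for classification with an abstain option: with
    misclassification cost 1 and abstention cost alpha, abstaining is the
    better decision whenever the top class probability is below 1 - alpha."""
    prob = np.asarray(prob, dtype=float)
    top = prob.argmax(axis=-1)                    # most likely class per row
    conf = prob.max(axis=-1)                      # its probability
    return np.where(conf >= 1.0 - alpha, top, ABSTAIN)

# Toy usage: three examples over n = 4 classes, abstain cost alpha = 0.25,
# so the classifier abstains whenever the top probability is below 0.75.
P = np.array([
    [0.80, 0.10, 0.05, 0.05],   # confident        -> predict class 0
    [0.40, 0.30, 0.20, 0.10],   # too uncertain    -> abstain
    [0.76, 0.08, 0.08, 0.08],   # just above 0.75  -> predict class 0
])
print(predict_with_reject(P, alpha=0.25))          # [0, -1, 0]
```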