  • Publication
    On knowledge distillation from complex networks for response prediction
    (01-01-2019)
    Arora, Siddhartha
    Recent advances in Question Answering have led to the development of very complex models which compute rich representations for query and documents by capturing all pairwise interactions between query and document words. This makes these models expensive in space and time, and in practice one has to restrict the length of the documents that can be fed to these models. Such models have also been recently employed for the task of predicting dialog responses from available background documents (e.g., the Holl-E dataset). However, here the documents are longer, thereby rendering these complex models infeasible except in select restricted settings. In order to overcome this, we use standard simple models which do not capture all pairwise interactions, but learn to emulate certain characteristics of a complex teacher network. Specifically, we first investigate the conicity of representations learned by a complex model and observe that it is significantly lower than that of simpler models. Based on this insight, we modify the simple architecture to mimic this characteristic. We go further by using knowledge distillation approaches, where the simple model acts as a student and learns to match the output from the complex teacher network. We experiment with the Holl-E dialog dataset and show that by mimicking characteristics and matching outputs from a teacher, even a simple network can give improved performance.
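    The two ingredients the abstract describes, mimicking the low conicity of the teacher's representations and matching the teacher's outputs, can be sketched roughly as follows. This is a minimal NumPy illustration; the temperature value and the exact conicity definition (average cosine similarity to the mean representation) are assumptions on my part, not taken from the paper.

```python
import numpy as np

def conicity(vectors):
    # Average cosine similarity of each representation to the mean representation
    # (the usual "alignment to mean" reading of conicity; an assumption here).
    V = np.asarray(vectors, dtype=float)
    mean = V.mean(axis=0)
    cos = (V @ mean) / (np.linalg.norm(V, axis=1) * np.linalg.norm(mean) + 1e-12)
    return float(cos.mean())

def softmax(z, temperature=1.0):
    z = np.asarray(z, dtype=float) / temperature
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # KL divergence between the teacher's and the student's softened output
    # distributions: the standard response-matching distillation objective.
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    return float(np.sum(p_t * (np.log(p_t) - np.log(p_s))))

rng = np.random.default_rng(0)
print(conicity(rng.normal(size=(100, 16))))              # small for isotropic vectors
print(distillation_loss([2.0, 0.5, -1.0], [2.5, 0.1, -0.8]))
```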
  • Publication
    On controllable sparse alternatives to softmax
    (01-01-2018)
    Laha, Anirban; Chemmengath, Saneem A.; Agrawal, Priyanka; Sankaranarayanan, Karthik
    Converting an n-dimensional vector to a probability distribution over n objects is a commonly used component in many machine learning tasks like multiclass classification, multilabel classification, attention mechanisms, etc. For this, several probability mapping functions have been proposed and employed in the literature, such as softmax, sum-normalization, spherical softmax, and sparsemax, but there is very little understanding of how they relate to each other. Further, none of the above formulations offers explicit control over the degree of sparsity. To address this, we develop a unified framework that encompasses all these formulations as special cases. This framework ensures simple closed-form solutions and the existence of sub-gradients suitable for learning via backpropagation. Within this framework, we propose two novel sparse formulations, sparsegen-lin and sparsehourglass, that seek to provide control over the degree of desired sparsity. We further develop novel convex loss functions that help induce the behavior of the aforementioned formulations in the multilabel classification setting, showing improved performance. We also demonstrate empirically that the proposed formulations, when used to compute attention weights, achieve better or comparable performance on standard seq2seq tasks like neural machine translation and abstractive summarization.
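    To make the idea of a controllable sparse mapping concrete, here is a small NumPy sketch of sparsemax (the simplex projection listed in the abstract) together with one way to expose a sparsity knob by rescaling the logits before projection. Treat the `lam` parameterization as illustrative only, not as the paper's exact sparsegen-lin formulation.

```python
import numpy as np

def sparsemax(z):
    # Euclidean projection of z onto the probability simplex; unlike softmax
    # it can assign exact zeros to low-scoring coordinates.
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    ks = np.arange(1, z.size + 1)
    support = 1.0 + ks * z_sorted > cumsum        # coordinates kept in the support
    k = ks[support][-1]
    tau = (cumsum[k - 1] - 1.0) / k               # threshold so the output sums to 1
    return np.maximum(z - tau, 0.0)

def scaled_sparsemax(z, lam=0.0):
    # Illustrative sparsity control: scaling the logits before the projection makes
    # the output progressively sparser as lam -> 1 (lam = 0 recovers sparsemax).
    assert 0.0 <= lam < 1.0
    return sparsemax(np.asarray(z, dtype=float) / (1.0 - lam))

z = np.array([1.0, 0.8, 0.1])
print(sparsemax(z))                  # [0.6 0.4 0. ]
print(scaled_sparsemax(z, lam=0.9))  # [1. 0. 0.]  -- fully sparse
```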
  • Publication
    Ablation-CAM: Visual explanations for deep convolutional network via gradient-free localization
    (01-03-2020)
    Desai, Saurabh
    In response to recent criticism of gradient-based visualization techniques, we propose a new methodology to generate visual explanations for deep Convolutional Neural Network (CNN)-based models. Our approach, Ablation-based Class Activation Mapping (Ablation-CAM), uses ablation analysis to determine the importance (weights) of individual feature map units with respect to a class. Further, this is used to produce a coarse localization map highlighting the important regions in the image for predicting the concept. Our objective and subjective evaluations show that this gradient-free approach works better than the state-of-the-art Grad-CAM technique. Moreover, further experiments are carried out to show that Ablation-CAM is class-discriminative and can also be used to evaluate trust in a model.
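    The ablation-analysis step is simple enough to sketch: zero out each feature map, measure the drop in the class score, and use the normalized drops as weights for a rectified weighted sum of the maps. In the sketch below, `score_fn` stands in for the rest of the network's forward pass and the linear "head" in the demo is invented purely for illustration.

```python
import numpy as np

def ablation_cam(feature_maps, score_fn):
    # feature_maps: (K, H, W) activations from the chosen conv layer;
    # score_fn: maps feature maps to the class score y_c (the network's tail).
    y_c = score_fn(feature_maps)
    weights = np.zeros(feature_maps.shape[0])
    for k in range(feature_maps.shape[0]):
        ablated = feature_maps.copy()
        ablated[k] = 0.0                                       # ablate unit k
        weights[k] = (y_c - score_fn(ablated)) / (y_c + 1e-8)  # relative score drop
    cam = (weights[:, None, None] * feature_maps).sum(axis=0)
    return np.maximum(cam, 0.0)                                # keep positive evidence only

# Toy demo: a made-up linear "head" standing in for the real network tail.
rng = np.random.default_rng(0)
maps = rng.random((4, 7, 7))
head = np.array([0.5, 1.5, 0.1, 0.9])
score_fn = lambda fm: float(head @ fm.mean(axis=(1, 2)))
print(ablation_cam(maps, score_fn).shape)                      # (7, 7) localization map
```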
  • Publication
    Convex calibrated surrogates for the multi-label f-measure
    (01-01-2020)
    Zhang, Mingyuan; Agarwal, Shivani
    The F-measure is a widely used performance measure for multi-label classification, where multiple labels can be active in an instance simultaneously (e.g. in image tagging, multiple tags can be active in any image). In particular, the F-measure explicitly balances recall (fraction of active labels predicted to be active) and precision (fraction of labels predicted to be active that are actually so), both of which are important in evaluating the overall performance of a multi-label classifier. As with most discrete prediction problems, however, directly optimizing the F-measure is computationally hard. In this paper, we explore the question of designing convex surrogate losses that are calibrated for the F-measure, i.e. losses with the property that minimizing the surrogate loss yields (in the limit of sufficient data) a Bayes optimal multi-label classifier for the F-measure. We show that the F-measure for an s-label problem, when viewed as a 2^s × 2^s loss matrix, has rank at most s² + 1, and apply a result of Ramaswamy et al. (2014) to design a family of convex calibrated surrogates for the F-measure. The resulting surrogate risk minimization algorithms can be viewed as decomposing the multi-label F-measure learning problem into s² + 1 binary class probability estimation problems. We also provide a quantitative regret transfer bound for our surrogates, which allows any regret guarantees for the binary problems to be transferred to regret guarantees for the overall F-measure problem, and discuss a connection with the algorithm of Dembczynski et al. (2013). Our experiments confirm our theoretical findings.
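    For reference, the quantity being optimized is the instance-wise multi-label F-measure, the harmonic mean of the recall and precision described in the abstract. A minimal computation, assuming binary indicator vectors over the s labels:

```python
import numpy as np

def multilabel_f1(y_true, y_pred):
    # y_true, y_pred: binary indicator vectors over the s labels of one instance.
    tp = float(np.sum(y_true * y_pred))
    if tp == 0.0:
        return 0.0
    recall = tp / np.sum(y_true)      # fraction of active labels predicted active
    precision = tp / np.sum(y_pred)   # fraction of predicted-active labels truly active
    return 2.0 * precision * recall / (precision + recall)

print(multilabel_f1(np.array([1, 0, 1, 1]), np.array([1, 1, 1, 0])))  # ~0.667
```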
  • Publication
    Inductive Bias of Gradient Descent for Weight Normalized Smooth Homogeneous Neural Nets
    (01-01-2022)
    Morwani, Depen
    We analyze the inductive bias of gradient descent for weight-normalized smooth homogeneous neural nets, when trained on exponential or cross-entropy loss. We analyze both standard weight normalization (SWN) and exponential weight normalization (EWN), and show that the gradient flow path with EWN is equivalent to gradient flow on standard networks with an adaptive learning rate. We extend these results to gradient descent, and establish asymptotic relations between weights and gradients for both SWN and EWN. We also show that EWN causes weights to be updated in a way that prefers asymptotic relative sparsity. For EWN, we provide a finite-time convergence rate of the loss with gradient flow and a tight asymptotic convergence rate with gradient descent. We demonstrate our results for SWN and EWN on synthetic datasets. Experimental results on simple datasets support our claim that EWN yields sparse solutions, even with SGD. This demonstrates its potential application in learning neural networks amenable to pruning.
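    As a point of reference for the two parameterizations (my reading of the abbreviations, so treat the exact form as an assumption): standard weight normalization learns a direction and a scale separately, while exponential weight normalization learns the scale through an exponent, which is what links its gradient flow to an adaptive learning rate in the abstract.

```python
import numpy as np

def swn(v, gamma):
    # Standard weight normalization: w = gamma * v / ||v||.
    return gamma * v / np.linalg.norm(v)

def ewn(v, gamma):
    # Exponential weight normalization as read from the abstract (an assumption):
    # w = exp(gamma) * v / ||v||, i.e. the scale lives in log space.
    return np.exp(gamma) * v / np.linalg.norm(v)

v = np.array([3.0, 4.0])
print(swn(v, 2.0))            # [1.2 1.6]
print(ewn(v, np.log(2.0)))    # same effective weights, different parameterization
```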
  • Publication
    Consistent algorithms for multiclass classification with an abstain option
    (01-01-2018)
    Tewari, Ambuj; Agarwal, Shivani
    We consider the problem of n-class classification (n ≥ 2), where the classifier can choose to abstain from making predictions at a given cost, say, a factor α of the cost of misclassification. Our goal is to design consistent algorithms for such n-class classification problems with a ‘reject option’; while such algorithms are known for the binary (n = 2) case, little has been understood for the general multiclass case. We show that the well-known Crammer-Singer surrogate and the one-vs-all hinge loss, albeit with a different predictor than the standard argmax, yield consistent algorithms for this problem when α = ½. More interestingly, we design a new convex surrogate, which we call the binary encoded predictions surrogate, that is also consistent for this problem when α = ½ and operates on a much lower-dimensional space (log(n) as opposed to n). We also construct modified versions of all three of these surrogates that are consistent for any given α ∈ [0, ½].
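    The behaviour such consistent algorithms target is Chow's rule, the Bayes-optimal decision with a reject option: predict the most likely class unless its conditional probability falls below 1 − α, in which case abstaining is cheaper in expectation. A plug-in illustration of that rule (the paper's surrogates reach the same behaviour without estimating class probabilities):

```python
import numpy as np

ABSTAIN = -1

def chow_rule(class_probs, alpha):
    # Predict the argmax class if its probability is at least 1 - alpha
    # (expected misclassification cost <= abstention cost alpha), else abstain.
    probs = np.asarray(class_probs, dtype=float)
    top = int(probs.argmax())
    return top if probs[top] >= 1.0 - alpha else ABSTAIN

print(chow_rule([0.55, 0.25, 0.20], alpha=0.5))   # 0  -> confident enough to predict
print(chow_rule([0.55, 0.25, 0.20], alpha=0.25))  # -1 -> cheaper to abstain
```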
  • Publication
    Consistent plug-in classifiers for complex objectives and constraints
    (01-01-2020)
    Tavker, Shiv Kumar; Narasimhan, Harikrishna
    We present a consistent algorithm for constrained classification problems where the objective (e.g. F-measure, G-mean) and the constraints (e.g. demographic parity fairness, coverage) are defined by general functions of the confusion matrix. Our approach reduces the problem to a sequence of plug-in classifier learning tasks. The reduction is achieved by posing the learning problem as an optimization over the intersection of two sets: the set of confusion matrices that are achievable and those that are feasible. This decoupling of the constraint space then allows us to solve the problem by applying Frank-Wolfe style optimization over the individual sets. For objectives and constraints that are convex functions of the confusion matrix, our algorithm requires O(1/ε²) calls to the plug-in subroutine, which improves on the O(1/ε³) calls needed by the reduction-based algorithm of Narasimhan (2018) [29]. We show empirically that our algorithm is competitive with prior methods, while being more robust to choices of hyper-parameters.
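    Structurally, a Frank-Wolfe style reduction of this kind repeatedly linearizes the objective at the current confusion matrix, calls a plug-in classifier as the linear minimization oracle, and mixes the result into the iterate. The sketch below is schematic, not the paper's algorithm: `grad_obj` and `plugin_oracle` are hypothetical callables standing in for its components, and the toy demo treats "confusion matrices" as 2-d points so the loop is runnable end to end.

```python
import numpy as np

def frank_wolfe_plugin(grad_obj, plugin_oracle, C0, T=100):
    # grad_obj(C): gradient of the convex objective at confusion matrix C.
    # plugin_oracle(L): confusion matrix achieved by a cost-weighted plug-in
    # classifier trained for the linear loss L (the linear minimization oracle).
    C = np.asarray(C0, dtype=float)
    for t in range(T):
        L = grad_obj(C)                        # linearize the objective at the iterate
        C_new = plugin_oracle(L)               # best achievable confusion matrix for L
        step = 2.0 / (t + 2.0)
        C = (1.0 - step) * C + step * C_new    # the iterate is a classifier mixture
    return C

# Toy demo: the "achievable set" is the convex hull of three fixed points and the
# objective is the squared distance to a target point inside that hull.
verts = np.array([[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]])
target = np.array([0.6, 0.4])
grad_obj = lambda C: 2.0 * (C - target)
plugin_oracle = lambda L: verts[np.argmin(verts @ L)]
print(frank_wolfe_plugin(grad_obj, plugin_oracle, verts[0]))  # approaches [0.6, 0.4]
```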