Anwesha: A Tool for Semantic Search in Bangla
Date Issued
01-01-2022
Author(s)
Das, Arup
Kundu, Bibekananda
Ghorai, Lokasis
Gupta, Arjun Kumar
Indian Institute of Technology, Madras
Abstract
Bangla is a low-resource, highly agglutinative language, which makes designing effective search and information retrieval systems for it challenging. This paper presents our explorations toward building Anwesha, a prototype search engine for Bangla. To the best of our knowledge, it is the first such initiative in Bangla to retrieve semantically related documents by drawing on diverse knowledge sources: WordNet, statistical co-occurrences (by way of Latent Semantic Analysis (LSA)), and external knowledge sources like Wikipedia (by way of Explicit Semantic Analysis (ESA)). We also present our efforts to overcome the limitations of existing spell-check and lemmatization approaches in Bangla and to integrate them into Anwesha. In addition, we present methods to explain search results by highlighting keywords that LSA or ESA deems semantically related to the query. Since no gold-standard dataset is available for evaluating Bangla information retrieval systems, we have created a dataset of query-document relevance pairs in two distinct domains. We analyze the system’s performance on queries of different difficulty levels. Our technique could be adapted to facilitate effective semantic search in other low-resource, highly inflected languages.
Volume
3315
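As a rough illustration of the ESA-style matching mentioned in the abstract, the sketch below maps texts onto weighted "concept" vectors and compares them by cosine similarity. It is not Anwesha's implementation: the three toy concept articles, the English tokens, the whitespace tokenizer, the smoothed IDF formula, and all function names are assumptions made for the example, standing in for Wikipedia articles and Bangla preprocessing.

```python
import math
from collections import Counter

# Toy "concept" articles standing in for Wikipedia pages (hypothetical data).
CONCEPTS = {
    "Cricket": "bat ball wicket bowler batsman run match",
    "Football": "ball goal striker keeper match penalty",
    "Monsoon": "rain cloud flood river season wind",
}

def build_index(concepts):
    """Build an inverted index: term -> {concept name: TF-IDF weight}."""
    n = len(concepts)
    tf = {name: Counter(text.split()) for name, text in concepts.items()}
    df = Counter(term for counts in tf.values() for term in counts)
    index = {}
    for name, counts in tf.items():
        for term, freq in counts.items():
            # Smoothed IDF (an assumption) so terms shared by concepts keep some weight.
            idf = math.log(n / df[term]) + 1.0
            index.setdefault(term, {})[name] = freq * idf
    return index

def concept_vector(text, index):
    """Represent a text as the sum of the concept weights of its known terms."""
    vec = Counter()
    for term in text.lower().split():
        for concept, weight in index.get(term, {}).items():
            vec[concept] += weight
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse concept vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0

INDEX = build_index(CONCEPTS)
query = concept_vector("bowler", INDEX)
doc_a = concept_vector("the batsman hit the wicket", INDEX)
doc_b = concept_vector("heavy rain caused a flood", INDEX)

# doc_a shares no surface word with the query, yet both map to "Cricket".
print(cosine(query, doc_a) > cosine(query, doc_b))
```

In this toy setup the query and the first document have no word in common but land on the same concept, which is the kind of semantic relatedness the abstract attributes to ESA. The terms that contribute weight to the shared concept ("batsman", "wicket") are natural candidates for the keyword highlighting the abstract describes.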