Anwesha: A Tool for Semantic Search in Bangla

Das, Arup; Kundu, Bibekananda; Ghorai, Lokasis; Gupta, Arjun Kumar; Chakraborti, Sutanu

Date Issued

01-01-2022

Author(s)

Das, Arup

Kundu, Bibekananda

Ghorai, Lokasis

Gupta, Arjun Kumar

Chakraborti, Sutanu

Indian Institute of Technology, Madras

Abstract

Bangla is a low-resource, highly agglutinative language, which makes designing effective search and information retrieval systems for it challenging. This paper presents our explorations toward building Anwesha, a prototype search engine for Bangla. To the best of our knowledge, it is the first such initiative in Bangla that retrieves semantically related documents by drawing on diverse knowledge sources: WordNet, statistical co-occurrences (via Latent Semantic Analysis (LSA)), and external knowledge sources such as Wikipedia (via Explicit Semantic Analysis (ESA)). We also present our efforts to overcome the limitations of existing spell-check and lemmatization approaches for Bangla and to integrate them into Anwesha. In addition, we present methods to explain search results by highlighting keywords that LSA or ESA deems semantically related to the query. Since no gold-standard dataset is available for evaluating Bangla information retrieval systems, we have created a dataset of query-document relevance pairs in two distinct domains. We analyze the system’s performance on queries of different difficulty levels. Our technique could be adapted to facilitate effective semantic search in other low-resource, highly inflected languages.
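
The ESA-style retrieval and keyword highlighting described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the three-concept English corpus is an invented stand-in for Wikipedia, and the function names (`build_word_vectors`, `text_vector`, `rank`, `explain`) and the 0.5 highlighting threshold are hypothetical choices made for this example. Spell-checking and lemmatization, which the paper integrates, are omitted.

```python
import math
from collections import Counter, defaultdict

# Toy stand-in for a concept corpus. ESA uses Wikipedia articles as concepts;
# these three "articles" are invented purely for illustration.
CONCEPTS = {
    "Cricket": "bat ball wicket bowler batsman run match",
    "Football": "ball goal striker match penalty",
    "Monsoon": "rain cloud flood river season",
}


def build_word_vectors(concepts):
    """Map each word to a sparse TF-IDF vector indexed by concept name."""
    n = len(concepts)
    tf = {name: Counter(text.split()) for name, text in concepts.items()}
    df = Counter(word for counts in tf.values() for word in counts)
    vectors = defaultdict(dict)
    for name, counts in tf.items():
        for word, count in counts.items():
            # Smoothed IDF (an assumption of this sketch) keeps every weight positive.
            vectors[word][name] = count * math.log(1 + n / df[word])
    return dict(vectors)


def text_vector(tokens, word_vectors):
    """Represent a text as the sum of its words' concept vectors."""
    vec = defaultdict(float)
    for token in tokens:
        for concept, weight in word_vectors.get(token, {}).items():
            vec[concept] += weight
    return dict(vec)


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(key, 0.0) for key, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, docs, word_vectors):
    """Return (score, document) pairs sorted by similarity in concept space."""
    q = text_vector(query.split(), word_vectors)
    scored = [(cosine(q, text_vector(d.split(), word_vectors)), d) for d in docs]
    return sorted(scored, reverse=True)


def explain(query, doc, word_vectors, threshold=0.5):
    """List the document words whose concept vectors lie close to the query."""
    q = text_vector(query.split(), word_vectors)
    return [t for t in doc.split()
            if cosine(q, word_vectors.get(t, {})) >= threshold]


if __name__ == "__main__":
    wv = build_word_vectors(CONCEPTS)
    docs = ["rain flood river", "batsman wicket run"]
    for score, doc in rank("bowler", docs, wv):
        print(f"{score:.2f}  {doc}")
    print(explain("bowler", "batsman scored in the match", wv))
```

In this toy setup the query "bowler" shares no term with "batsman wicket run", yet that document ranks first because both map to the same concept. The `explain` function then picks out "batsman" and "match" as the words to highlight.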

Volume

3315
