Text studies towards multi-lingual content mining for web communication
Date Issued
01-12-2010
Author(s)
Prakash, Kolla Bhanu
Dorai Rangaswamy, M. A.
Raman, Arun Raja
Abstract
Communication through the web is becoming increasingly popular thanks to wireless and cellular networks. As its use spreads across different countries, significant complexities arise in the languages and means of communication involved in extracting information from the web. This is particularly true in India, where there are more than fifteen officially recognized languages and still more variation in local dialects. An example is Tamilnadu, where Tamizh, the native language with its own variations such as the Chennai, Madurai and Coimbatore dialects, is combined effectively and easily with Telugu, Kannada and Malayalam from nearby states, and with English and Hindi from global and national perspectives. A web document here could therefore be in any one of these languages, or in a mixture of words from different languages used to avoid translation; for example, the English word 'computer' has no translation in Tamizh. There are several aspects to this variation in usage for language protagonists and communication engineers. However, the complexity these variations introduce into web documents makes conventional data mining approaches difficult to apply. The present study focuses on this problem, beginning with text variations and moving on to words and documents. Typical characters with similar usage, such as 'a' in English and its counterparts in Tamizh and Telugu, are taken and their pixel-maps are compared for similarities and contrasts. This is later extended to more complex characters, such as unknown sign in Telugu, which is a single character whereas its English equivalent is 'kO', making representation difficult. At the word level, complexity increases: 'temple' in English is rendered as 'unknown sign' in Telugu, or as 'mandiram' when written in English script. Similarities in pixel-maps are examined and their characteristics are expressed as matrices, so that content mining of such words or letters extracted from a web document can be cast in a probabilistic form, with predictions based on correlations. 
Typical histograms highlighting these aspects are presented, and an experiment with a document page dealing with magnetism is then used as a model for predicting content. ©2010 IEEE.
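
The pixel-map comparison described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the similarity measure, so Pearson correlation over flattened binary pixel maps is an assumption here, as are the toy 5x5 glyphs and the names `pixel_correlation` and `row_histogram`; none of them are taken from the paper.

```python
from math import sqrt

# Hypothetical 5x5 binary pixel maps (1 = ink, 0 = background).
# Toy glyphs for illustration only; they are not data from the paper.
GLYPH_A = ["01110", "10001", "11111", "10001", "10001"]
GLYPH_B = ["01110", "10001", "11111", "10001", "10011"]  # one pixel differs from A
GLYPH_C = ["11111", "00100", "00100", "00100", "00100"]  # a differently shaped glyph


def to_matrix(rows):
    """Convert rows of '0'/'1' characters into a matrix of ints."""
    return [[int(ch) for ch in row] for row in rows]


def pixel_correlation(a, b):
    """Pearson correlation between two equally sized pixel maps (assumed measure)."""
    xs = [v for row in a for v in row]
    ys = [v for row in b for v in row]
    if len(xs) != len(ys):
        raise ValueError("pixel maps must have the same size")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a blank or fully inked map carries no shape information
    return cov / sqrt(var_x * var_y)


def row_histogram(matrix):
    """Count of ink pixels in each row, a simple histogram feature."""
    return [sum(row) for row in matrix]


a, b, c = to_matrix(GLYPH_A), to_matrix(GLYPH_B), to_matrix(GLYPH_C)
print("A vs B:", round(pixel_correlation(a, b), 3))
print("A vs C:", round(pixel_correlation(a, c), 3))
print("Row histogram of A:", row_histogram(a))
```

In this sketch, glyphs that differ by a single pixel correlate strongly, while differently shaped glyphs do not; a real system would work on rendered character images of a common size rather than hand-written grids.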