Attribute based content mining for regional web documents
Date Issued
01-01-2013
Author(s)
Prakash, Kolla Bhanu
Rangaswamy, M. A. Dorai
Raman, Arun Raja
Abstract
The rapid expansion of the Internet has made the WWW a popular place for disseminating and collecting information, so extracting useful information from Web pages has become an important task. Apart from the main content blocks, Web pages usually contain blocks such as navigation bars, copyright and privacy notices, relevant hyperlinks, and advertisements, which are called noisy blocks. Although these items are functionally useful for human viewers and necessary for Web site owners, they often hamper Web page clustering, classification, information retrieval, and information extraction. Today, people use the Web for a wide variety of activities, including travel planning, comparison shopping, entertainment, and research. However, the tools available for collecting, organizing, and sharing Web content have not kept pace with the rapid growth in information. The difficulty is greater still when Web documents or information are in regional languages. Extracting the content of such a document and then communicating it through oral or textual means is quite involved, because both syntax and semantics are needed. The task becomes harder depending on the form and structure of the Web document, and this is the area the paper addresses through a novel approach based on pixel maps, showing how content can be extracted and how knowledge can be created in the mind of an illiterate user. The paper first presents how letters and words, which form the basis of text-based communication, can be used for content extraction. The objective is to achieve concept-based term analysis at the sentence and document levels, rather than single-term analysis over the document set only. The paper outlines the use of attributes for content extraction using basic pixel attributes and pattern matching, a statistical model and pattern matching, and Artificial Neural Network training.
Volume
2013