A Layout Based Detachment Approach for Extracting Content from Webpages

Deepa Chandran; Anna Saro Vijendran

doi:10.3844/ajassp.2015.411.420

Research Article Open Access

Deepa Chandran¹ and Anna Saro Vijendran²

¹ Department of Information Technology, SNR Sons College, Coimbatore, India
² MCA, SNR Sons College, Coimbatore, India

Abstract

An enormous amount of useful information on the Internet is formatted for human readers, which makes extracting the relevant data from diverse web sources a complex task. Various approaches for extracting data from webpages have recently been proposed. This study presents a simple but effective approach named the Layout Based Detachment Approach (LBDA). The approach extracts the main content from a webpage by removing irrelevant information such as header and footer content, navigation bars, advertisements and other noisy images. The methodology uses three techniques: tag tree parsing to obtain the analysis structure, a block-acquiring page segmentation method to remove unwanted tags, and data extraction to retrieve the necessary content. LBDA eliminates noise, extracts the main content blocks from the webpage and displays the essential content to users. Its performance is evaluated using accuracy, precision, recall, execution time and memory usage. The implementation results show that LBDA performs better than the existing heuristic approach.
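
The three stages named in the abstract (tag tree parsing, removal of unwanted blocks, data extraction) can be illustrated with a minimal sketch. This is not the authors' LBDA implementation: the abstract does not specify the segmentation rules, so the sketch assumes that noise blocks can be identified purely by tag name. The names `NOISE_TAGS`, `VOID_TAGS`, `Node`, `TagTreeBuilder`, `prune`, `extract_text` and `main_content` are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser

# Assumption: these tag names stand in for "noise" blocks. The paper's actual
# block-acquiring segmentation rules are not given in the abstract.
NOISE_TAGS = {"header", "footer", "nav", "aside", "script", "style"}
# Elements that never have a closing tag, so the builder must not descend into them.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class Node:
    """One element of the tag tree; children are Node objects or text strings."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TagTreeBuilder(HTMLParser):
    """Stage 1: parse HTML into a simple tag tree."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.current.children.append(text)


def prune(node):
    """Stage 2: remove noise blocks from the tree in place."""
    node.children = [c for c in node.children
                     if isinstance(c, str) or c.tag not in NOISE_TAGS]
    for child in node.children:
        if isinstance(child, Node):
            prune(child)


def extract_text(node):
    """Stage 3: collect the remaining text fragments in document order."""
    parts = []
    for child in node.children:
        parts.extend([child] if isinstance(child, str) else extract_text(child))
    return parts


def main_content(html):
    """Run the three stages and return the surviving text as one string."""
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    prune(builder.root)
    return " ".join(extract_text(builder.root))
```

Calling `main_content` on a page whose body holds a `header`, a `nav`, a `div` with article text and a `footer` returns only the text inside the `div`. A tag-name filter like this is far cruder than a layout-based segmentation; it is meant only to make the pipeline shape concrete.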

American Journal of Applied Sciences

Volume 12 No. 6, 2015, 411-420

DOI: https://doi.org/10.3844/ajassp.2015.411.420

Submitted On: 17 August 2013
Published On: 25 July 2015

How to Cite: Chandran, D. & Vijendran, A. S. (2015). A Layout Based Detachment Approach for Extracting Content from Webpages. American Journal of Applied Sciences, 12(6), 411-420. https://doi.org/10.3844/ajassp.2015.411.420

Copyright: © 2015 Deepa Chandran and Anna Saro Vijendran. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Keywords

Webpage Content Extraction
Web Mining
DOM Tree Analysis
Web Structure Mining