A Layout Based Detachment Approach for Extracting Content from Webpages
Deepa Chandran and Anna Saro Vijendran
DOI : 10.3844/ajassp.2015.411.420
American Journal of Applied Sciences
Volume 12, Issue 6
Enormous amount of useful information presented in Internet is usually formatted for the web users. But it is a really complex task to extract the relevant data from various web sources. Recently, various approaches for the extraction of data from the webpages were proposed. This study provides a simple but effective approach, named Layout Based Detachment Approach (LBDA). The proposed approach extracts the main content from the webpage by removing the irrelevant information like header-footer contents, navigation bars, advertisements and other noisy images. The proposed methodology uses the following techniques: Tag tree parsing to get the analysis structure, block acquiring page segmentation method to remove unwanted tags and data extraction to retrieve the necessary contents. The proposed approach eliminates noise and perform effective extraction of the main content blocks from the webpage and display of the essential content to the users. The performance of the proposed approach is evaluated using the performance metrics such as accuracy, precision, recall, execution time and memory usage. The implementation results obviously show that our proposed LBDA approach exhibits better performance than the existing heuristic approach.
© 2015 Deepa Chandran and Anna Saro Vijendran. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.