WebSelF: A Web Scraping Framework

Research output: Contribution to journal/Conference contribution in journal/Contribution to newspaper › Conference article › Research › peer-review

  • Jakob Grauenkjær Thomsen
  • Erik Ernst
  • Claus Brabrand, Programming, Logic and Semantics, Denmark
  • Michael I. Schwartzbach
We present WebSelF, a framework for web scraping that models the scraping process and decomposes it into four conceptually independent, reusable, and composable constituents. We have validated the framework through a full parameterized implementation that is flexible enough to capture previous work on web scraping. We conducted an experiment that evaluated several qualitatively different web scraping constituents (including previous work and combinations thereof) on about 11,000 HTML pages from daily versions of 17 web sites over a period of more than one year. Our framework solves three concrete problems with current web scraping, and our experimental results indicate that compositions of previous techniques and our new ones achieve a higher degree of accuracy, precision, and specificity than existing techniques alone.
Original language: English
Book series: Lecture Notes in Computer Science
Volume: 7387
Pages (from-to): 347-361
Number of pages: 14
ISSN: 0302-9743
Publication status: Published - 2012
Event: International Conference on Web Engineering - Berlin, Germany
Duration: 23 Jul 2012 to 27 Jul 2012
Conference number: 12

Conference

Conference: International Conference on Web Engineering
Number: 12
Country: Germany
City: Berlin
Period: 23/07/2012 to 27/07/2012

Bibliographical note

Title of the vol.: Web Engineering. 12th International Conference, ICWE 2012
Berlin, Germany, July 23-27, 2012. Proceedings / ed. by Marco Brambilla, Takehiro Tokuda, Robert Tolksdorf
ISBN: 978-3-642-31752-1. e-ISBN 978-3-642-31753-8


ID: 48837291