WebSelF: A Web Scraping Framework

Publication: Contribution to journal/Conference contribution in journal/Contribution to newspaper, Conference article, Research, peer-reviewed

  • Jakob Grauenkjær Thomsen
  • Erik Ernst, Denmark
  • Claus Brabrand, Programming, Logic and Semantics, Denmark
  • Michael I. Schwartzbach
We present WebSelF, a framework for the selection of information from
a series of similar web pages. For instance, it could extract the top
news story every day from a news web site. WebSelF offers a high level
of flexibility due to the use of plugins that may themselves be
composed of other plugins, and its properties are well-analyzed due to
a formal model of the core of the framework, along with proofs of a
progress and a preservation theorem. We have implemented WebSelF in
Java, and validated it through a substantial experiment extracting
information from about 11,000 HTML pages on daily versions of 17 web
sites over a period of more than one year. In the experiment we
evaluate 60 different selection plugins and 24 qualitatively different
validation plugins. The experiment shows that WebSelF achieves a higher
degree of accuracy, precision and specificity than existing techniques.
Original language: English
Book series: Lecture Notes in Computer Science
Volume: 7387
Pages (from-to): 347-361
Number of pages: 14
ISSN: 0302-9743
DOI
Status: Published - 2012
Event: International Conference on Web Engineering - Berlin, Germany
Duration: 23 Jul 2012 to 27 Jul 2012
Conference number: 12

Conference

Conference: International Conference on Web Engineering
Number: 12
Country: Germany
City: Berlin
Period: 23/07/2012 to 27/07/2012


ID: 48837291