Advantages and Disadvantages in the Use of the Internet as a Corpus: The Case of the Online Dictionaries of Spanish “University of Valladolid”

Activity: Talk or presentation types: Lecture and oral contribution


There is no doubt that large electronic text corpora can be of great value to lexicographers in a range of tasks connected with the compilation of dictionaries. This has been documented by various scholars engaged in practical lexicography, among them Bergenholtz (1996), Atkins & Rundell (2008), and Hanks (2012). Various high-quality dictionaries have been compiled on this empirical basis (see e.g. Sinclair 1997). Today, corpora composed of texts containing hundreds of millions of words are available to dictionary compilers, and they seem never to stop growing. In this respect, big data is already a reality, but it should not be forgotten that no corpus, however big, can match the enormous collection of texts and words accessible through the Internet. Developing methods that allow the use of this almost unlimited empirical basis is undoubtedly a challenge of growing relevance to lexicography, which is why it is the topic of this paper. The paper will start with a brief discussion of the complex relationships between the different kinds of empirical basis that can be used in a lexicographical project, i.e. introspection, multispection, corpora, the Internet, existing dictionaries, textbooks or other information sources, or a combination of these. After that, the paper will discuss some of the most important advantages and disadvantages of using the Internet as a corpus, in comparison with “traditional” corpora, in which the texts have been selected according to specific criteria relevant to the tasks to be accomplished. Among the disadvantages, the paper will discuss the problems related to the dubious origin and quality of many texts; among the advantages, it will discuss the time factor, the large number of texts available, and the continuous updating with the most recent words and expressions.
In spite of the undeniable disadvantages, the paper will conclude that it is perfectly possible, and even beneficial, to use the Internet as the main empirical source, without resorting to “traditional” corpora, when the objective is to produce dictionaries of still higher quality. As an example, the paper will then show how the Internet has been used as the main empirical source for selecting lemmata and meaning units (senses) in a Spanish online project, i.e. the Online Dictionaries of Spanish “University of Valladolid” (Diccionarios Valladolid-Uva), a project which is currently in an advanced phase of compilation. The various methods applied to accomplish these two tasks will be discussed, and some examples of dictionary articles (translated into English) will be presented, focusing especially on the selected lemmata and senses already stored in the database underlying the seven dictionaries included in the project. Furthermore, a comparison will be made between these lexicographical data and those appearing in other general online dictionaries of Spanish (among them, the one edited by the Royal Spanish Academy), and some examples will be provided which show how the chosen methodology frequently permits the selection of a larger amount of lexicographical data relevant to the intended user group. Finally, the paper will provide some general conclusions as well as some recommendations and suggestions for future lexicographical projects.

Literature

Atkins, B.T.S. and M. Rundell (2008): The Oxford Guide to Practical Lexicography. Oxford, New York: Oxford University Press.
Bergenholtz, H. (1996): Korpusbaseret leksikografi. LexicoNordica, 3, 1-15.
Fuertes-Olivera, P.A. and H. Bergenholtz (2015): Los Diccionarios en Línea de Español “Universidad de Valladolid”. Estudios de Lexicografía, 4, 71-98.
Gudmann, H.R. (2015): Lagunas de significado en los diccionarios españoles. Estudios de Lexicografía, 4, 161-184.
Hanks, P. (2012): The Corpus Revolution in Lexicography. International Journal of Lexicography, 25 (4), 398-436.
Sinclair, J.M. (1997): Introduction. In: Collins Cobuild English Language Dictionary. London: HarperCollins Publishers, xv-xxi.
Period: 5 Jul 2016
Event title: 20th Afrilex International Conference on Lexicography
Event type: Conference
Conference number: 20
Location: Tzaneen, South Africa