Using the web as corpus

Translators, terminologists and even subject matter experts (SMEs) are called upon to define, differentiate and translate terminology on a daily basis. Many questions have to be answered before we arrive at the A => B translation equivalent (or at least at a prevailing one). All of us can, and usually do, turn to traditional printed dictionaries or dictionaries on CD.

Monolingual thesauri and dictionaries are invaluable, but in most cases they can’t keep up with the technological advances and the newly coined terms that we face in our line of work. The best resource, one that in most cases makes our lives easier, is the World Wide Web. It is true that there is so much information on the Web that we have come to talk about information overload, but how can we use that information to do our job faster, better and more efficiently?

The web can be used as a corpus for our work. What is a corpus? A corpus is “a large and structured set of texts (now usually electronically stored and processed)… A corpus may contain texts in a single language (monolingual corpus) or text data in multiple languages (multilingual corpus). Multilingual corpora that have been specially formatted for side-by-side comparison are called aligned parallel corpora”[1]. Since this article explores the web as corpus, I would narrow that definition to “a collection of electronic texts (in a standardized format with certain conventions relating to the content), which is selected in a principled way in order to furnish a store of linguistic data for lexicographical work.”[2]

There is no unified attitude towards the web used as a corpus. We can see four senses in the use of the web as corpus, two of which are useful for our work purposes:

Web as Surrogate: The web stands in for a corpus, either because there is no corresponding subject-matter corpus or because there is no tool to analyze the corpus at hand, and,
Web as a one stop shop: Traditional search methods are used to aggregate the search results and download them locally, thus creating an offline corpus that matches specific terminological needs.
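
As a rough illustration of the "one stop shop" sense, the sketch below runs a simple keyword-in-context (KWIC) lookup over texts that have already been saved locally. It is only a minimal sketch: the `kwic` function, the file names and the sample sentences are hypothetical stand-ins, not part of any particular tool.

```python
import re


def kwic(corpus, term, width=30):
    """Return (source, left context, hit, right context) for each occurrence of term."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    hits = []
    for name, text in corpus.items():
        for match in pattern.finditer(text):
            left = text[max(0, match.start() - width):match.start()]
            right = text[match.end():match.end() + width]
            hits.append((name, left.strip(), match.group(0), right.strip()))
    return hits


# Hypothetical stand-in for pages downloaded from search results.
corpus = {
    "page1.txt": "The general ledger records every posting made in the system.",
    "page2.txt": "Each posting to the General Ledger is balanced by an offsetting entry.",
}

for name, left, hit, right in kwic(corpus, "general ledger"):
    print(f"{name}: {left} [{hit}] {right}")
```

Because the texts sit on your own disk, the same lookup returns the same contexts every time it is run.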

The remaining two senses relate to the use of the web as a corpus proper, i.e. when we want to take a snapshot in time of the web and try to analyze the web per se (as for example the BNC [3] does for given periods of the English language) and the web as mega corpus i.e. the creation of a mini web that would have both traditional corpus features as well as web features[4].

In order to aggregate this data we use search engines and/or specialized desktop software that runs the same searches offline. The problems that we all face when using either online or offline search engines/software can be summed up as follows:

Not all desktop software supports languages other than the traditional English, French, German, Italian and Spanish.
Not all pages on the internet contain search engine friendly metatags.
Since the web is an ever-changing source, its results can be neither replicated nor controlled. What we find on one search might not appear again on the same search at a later time.
Instead of us training the software to do more semantically oriented searches[5], the software trains us, the users, to formulate our searches in such a way that we find what we are looking for. As Kilgarriff put it, “Working with commercial search engines makes us develop workarounds. We become experts in the syntax and constraints of Google, Yahoo!, Altavista[6], and so on. We become ‘googleologists’.”[7]

I can add one last, more generic problem: the query language we have to use is unsophisticated and does not permit more advanced or complicated querying. We don’t have the ability to search only for stems; we can’t use the NEAR operator; and the wildcard operator (*) has been reduced to meaning “find word X in the same sentence or paragraph as word Y”.
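
Once texts have been collected locally, queries of this kind become possible again. The sketch below shows one possible way to approximate a NEAR operator and a stem search over a piece of text. The function names are hypothetical, and the stem search is a crude prefix match, not real linguistic stemming.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def near(text, word_a, word_b, max_distance=5):
    """True if word_a and word_b occur within max_distance tokens of each other."""
    tokens = tokenize(text)
    positions_a = [i for i, t in enumerate(tokens) if t == word_a.lower()]
    positions_b = [i for i, t in enumerate(tokens) if t == word_b.lower()]
    return any(abs(a - b) <= max_distance for a in positions_a for b in positions_b)


def stem_matches(text, stem):
    """Return the distinct tokens that start with stem (prefix match only)."""
    return sorted({t for t in tokenize(text) if t.startswith(stem.lower())})


# Hypothetical sample sentence used only for illustration.
sample = "Translators translate terminology; a good translation depends on the termbase."

print(near(sample, "translation", "termbase", max_distance=4))
print(stem_matches(sample, "translat"))
```

A real project would more likely rely on a dedicated concordancer or a proper stemmer for the language in question; the point here is only that the local copy, unlike the search engine, lets you define the query yourself.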

Traditional Corpus                         The Web
Results can be replicated                  Results are always up to date
There is control over the content          Content varies wildly
Supports linguistically complex queries    Doesn’t support complex queries

Comparison of a traditional corpus and the web as corpus

But before you decide to rule out this effective tool, let me remind you that it is exactly that up-to-dateness that we are looking for in our line of work. Besides, the web’s advantages outweigh the issues mentioned above:

It’s convenient,
It’s easy to use,
It can accommodate web-specific material,
It presents an easy way to collect results,
It’s of such a size that it will always win over any traditional corpus,
It can accommodate searches in your language,
It can help you restrict the types of pages you get as results and their origin (country, type, source etc.),
It offers access to ready-made bilingual sources such as Eurolex, Eurodic and other European sites that have all (or almost all) of their content in more than one language, and,
It works with any given language (provided this language has some sort of presence on the Web).
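
Restricting results by origin or page type usually comes down to adding operators to the query. The sketch below builds such a query string, assuming the engine you use supports the widely available `site:` and `filetype:` operators. The helper names, the `europa.eu` restriction and the `search.example` address are placeholders chosen for illustration only.

```python
from urllib.parse import urlencode


def build_query(phrase, site=None, filetype=None):
    """Build an exact-phrase query, optionally restricted by site and file type."""
    parts = [f'"{phrase}"']
    if site:
        parts.append(f"site:{site}")
    if filetype:
        parts.append(f"filetype:{filetype}")
    return " ".join(parts)


def search_url(base_url, query):
    """Attach the query to a search address as a URL-encoded 'q' parameter."""
    return base_url + "?" + urlencode({"q": query})


# Placeholder values: substitute your own term, domain and search engine address.
query = build_query("general ledger", site="europa.eu", filetype="pdf")
print(query)
print(search_url("https://search.example/search", query))
```

Operator support differs from one engine to the next, so check the engine's own help pages before relying on any of them.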

So how can we use this tool to make our work easier and also to support our associates and vendors in their work? There are a couple of things we can do. First of all, if our language already has an open-access corpus, use it! It might not contain terminology that is particular to those dwelling in ERP, but in most cases it will contain articles from newspapers and white papers that are open to the public. Combine its use with a traditional corpus.

The Web is like the Land of Oz; we are sure the information is out there… all it takes is the right strategy to mine it. So next time you, your friends or your colleagues are requesting terminology, don’t be afraid to open up to the possibilities of finding an answer outside your normal toolbox.

[1] Wikipedia entry “Text Corpus”

[2] B.T.S. Atkins, OUP 2008, “Theoretical Lexicography and its relation to dictionary-making”, in Practical Lexicography – a Reader, Thierry Fontenelle (ed.), p. 38

[3] BNC = British National Corpus. It can be found at http://www.natcorp.ox.ac.uk/

[4] For extensive explanation on the four senses of a web corpus, see Baroni, Marco and Bernardini, Silvia (eds.) 2006. Wacky! Working papers on the Web as Corpus. Bologna: GEDIT, p. 10-15

[5] Only Wolfram Alpha does semantically oriented queries but these are limited to certain subjects such as geography, math, statistics etc and they are limited in the sense that the user can search only in English.

[6] Altavista no longer exists as a search engine. RIP Altavista!

[7] Kilgarriff 2007, “Googleology is bad science”, in Computational Linguistics, 33 (1): 147-151.