English-Corpora.org: An introduction

A gentle introduction to http://english-corpora.org, a clearinghouse of monitor corpora from Brigham Young University

What kinds of analyses does English-Corpora offer?

English-Corpora.org provides a platform for running a variety of queries on its established corpora. Any logged-in user can search for specific words or phrases in each corpus, observe their frequency over regular periods of time (months, years, or decades), measure collocation, and perform other quantitative tasks for tracing language in use over time. More advanced users can also download a corpus to work with it locally.

Understanding the corpus landing page

Upon choosing a corpus, you'll arrive on a landing page that looks like this. There are two main boxes in the body of the site: on the left-hand side there is a search apparatus with some display options, and on the right-hand side there is a description of the corpus. The top bar holds some additional search options.

sample corpus landing page showing the various options 

Each of these features is labelled with a corresponding number and described below.

1. Search box - Type in the character string you want to search for in the corpus. The search box supports alphanumeric characters and special characters (such as * or ?); spaces also count as characters. Since we are looking at linguistic data, we can also add part-of-speech markers to identify specific grammatical features (nouns, verbs, prepositions, etc.).

By default, the search lists the results that match your string. You can also clear the search by using the 'reset' button.

2. Other search options - The search capacity allows users to perform other kinds of searches; these are listed as links above the search box. Selecting one of these options will change the search apparatus' output to reflect the kind of search you are performing. (These operate somewhat like 'advanced' search features.)

  • CHART will produce a chart of how often your search term occurs, divided up by period of time.
  • COLLOCATES lets you identify terms that are more likely to appear around the initial search term than by simple chance.
  • COMPARE lets you compare two words in the corpus.
  • KWIC (keyword in context) searching shows your search term inside short snippets of surrounding text, so you can see in aggregate how it is used.

3. Description of corpus / Help box - This section provides information on the corpus selected, what it can be used for, and other information that might be relevant for the user (including the option to create a virtual sub-corpus from material within the corpus). The links provided here give examples of the kinds of searches you might want to perform and how to conduct them. It is possible to download the whole corpus through this informational box, but the web interface is more than adequate for running most searches. Clicking on a search feature (e.g. CHART) brings up a description of what that kind of search does.

4. Header of different activities - Once you perform a search, your results will appear under these tabs using the web interface. This allows users to move across different features of the corpus interface without having to run new searches each time.
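To make the COLLOCATES idea of "more likely than by simple chance" concrete, here is a minimal sketch of one common association measure, pointwise mutual information. It is an illustration only: the counts are invented, the function name is hypothetical, and nothing here is meant to reproduce the exact score that English-Corpora.org calculates.

```python
import math


def pointwise_mutual_information(pair_count, node_count, collocate_count, corpus_size):
    """Compare how often two words co-occur with how often chance alone would predict.

    A score above 0 means the pair appears together more often than expected.
    """
    # Co-occurrences expected if the two words were spread independently.
    expected = node_count * collocate_count / corpus_size
    return math.log2(pair_count / expected)


# Invented counts for a hypothetical 1,000,000-word corpus.
score = pointwise_mutual_information(
    pair_count=64,         # times the two words appear near each other
    node_count=1_000,      # occurrences of the search term
    collocate_count=2_000, # occurrences of the candidate collocate
    corpus_size=1_000_000,
)
print(round(score, 2))
```

With these made-up numbers, chance alone would predict 2 co-occurrences, so 64 observed co-occurrences give a clearly positive score.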

Understanding the various search options with the sample search of "Wuhan":

With the List option, the corpus will give you all the results for your search (helpful if you are looking for multiple words or forms with a * in the search query) and let you click through to the Keyword in Context View (KWIC).

With Keyword in Context View (KWIC), your search term will be highlighted in light green and shown with some of the surrounding text. Click on an example to get even more context. This is helpful for observing patterns of use in aggregate.
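For readers working with a downloaded text, a keyword-in-context display can be approximated locally. The sketch below is a rough illustration, not the site's implementation: the kwic function, its window parameter, and the sample sentence are all invented for this example.

```python
def kwic(text, keyword, window=3):
    """Return (left context, keyword, right context) for each hit in the text."""
    tokens = text.split()
    hits = []
    for i, token in enumerate(tokens):
        # Ignore case and surrounding punctuation when comparing tokens.
        if token.strip(".,;:!?\"'").lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, token, right))
    return hits


# Invented sample text, used only to show the output shape.
sample = (
    "Officials in Wuhan reported new cases. "
    "Flights from Wuhan were suspended on Thursday."
)
for left, node, right in kwic(sample, "Wuhan"):
    # Right-align the left context so the keyword lines up in a column.
    print(f"{left:>25}  [{node}]  {right}")
```

Lining the keyword up in a single column is what makes recurring patterns to its left and right easy to spot.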

With the Chart option, the corpus will produce a bar graph of how often "Wuhan" is mentioned in news articles per year, by both raw and normalized (per million words) frequency. "Wuhan" is not used very often in most news discourse until 2020. (You can also opt to see frequency by country and get to a KWIC search for all articles from Hong Kong, for example. Hong Kong is more likely than other countries to report on news from Wuhan prior to the Covid-19 pandemic.)
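The difference between raw and normalized frequency is simple arithmetic: the raw count is divided by the size of the section it came from and scaled to one million words. The sketch below uses invented counts and section sizes (not figures from any real corpus) and a hypothetical helper name to show why the two views can disagree.

```python
def per_million(raw_count, section_words):
    """Scale a raw count to occurrences per million words of text."""
    return raw_count * 1_000_000 / section_words


# Invented figures: (raw hits, total words in that year's section).
sections = {
    "year A": (120, 400_000_000),
    "year B": (90, 150_000_000),
}

for label, (hits, words) in sections.items():
    print(label, hits, round(per_million(hits, words), 2))
```

In this made-up case year A has more raw hits, but year B has the higher per-million rate because its section is much smaller. This is why normalized frequency is the safer figure for comparing periods of unequal size.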

What's going on with the search query language?

Building out different kinds of searches (including thinking about prefixes, suffixes, and plurals) allows the user to search for increasingly complex phenomena. Grammatical parts of speech help to differentiate words that look the same but have different meanings; the search apparatus has a drop-down menu next to the general search box, which opens if you click on [POS] (part of speech). As an example, this POS feature allows users to specify "will" as the noun and not the verb (will, then [POS] > noun.all), removing all the irrelevant verb uses. Being intentional about developing searches means excluding irrelevant examples, saving the user from unnecessary cleanup in the analysis stage. High-frequency words will require narrowing your search somewhat (e.g. by year) to conduct analyses.
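Prefix, suffix, and plural searches rely on the special characters the search box accepts. As a rough local illustration, the sketch below assumes that * stands for any run of characters and ? for exactly one character, as in common glob-style patterns; check the site's own help pages for its exact rules. The match_wildcard function and the word list are invented for this example.

```python
from fnmatch import fnmatchcase


def match_wildcard(pattern, words):
    """Return the words that match a glob-style pattern, ignoring case."""
    pattern = pattern.lower()
    return [w for w in words if fnmatchcase(w.lower(), pattern)]


# Invented word list standing in for a corpus vocabulary.
words = [
    "vaccine", "vaccines", "vaccinate", "vaccinated",
    "virus", "sing", "sang", "song", "string",
]

print(match_wildcard("vaccin*", words))  # every form sharing the stem
print(match_wildcard("s?ng", words))     # exactly one character between s and ng
```

A pattern such as vaccin* gathers the singular, plural, and verb forms in one search, while s?ng stays narrow enough to leave out longer words like "string".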