English-Corpora.org: An introduction

A gentle introduction to http://english-corpora.org, a clearinghouse of monitor corpora from Brigham Young University

Different kinds of searching

There are a lot of search options available. How do I know which one is right for me?

The search interface at English-Corpora.org is designed to help researchers observe language in use, and it offers several different ways to do so. Moving from List results to the Keyword in Context view is the most intuitive route, but it is not always the most helpful way to approach a corpus. This page discusses how best to use the different search options, with some examples.

List

If your search is designed to return a range of results and you want to explore several of them, you might want to start with the LIST feature. This is helpful if you want to find:

  • Singular and plural examples of the same word (woman, women; football, footballs)
  • Words with the same prefix or suffix (all words ending in ING, all words beginning with intra-)
  • All forms of a specific verb (have and had)
  • Alternative spellings for the same word (color vs. colour); this one is especially useful if you are looking at multiple English dialects!
  • A specific string match, to check whether it is in the corpus at all

Each of these searches lists a frequency for every hit; you might be surprised by which words are used more often than others! Clicking on the overall frequency will take you straight to the Keyword in Context view. Importantly, all of these search queries also work with the other search types described below; only the presentation of the results changes.

Chart 

The Chart feature displays the data from the LIST feature as a heatmap, showing in which periods (usually years, but possibly months, depending on your corpus) your word or words were used most. The darker the box containing a number, the more frequently the word was used in that period. This visualization helps you pick out periods of time on which to focus your keyword-in-context searching.
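To illustrate how a heatmap turns counts into shades, here is a minimal Python sketch. It is not how English-Corpora.org draws its charts; the function name `shade_levels`, the five shade levels, and the sample counts are all assumptions made for illustration. It also works on raw counts, whereas a real tool may adjust for how much text each period contains.

```python
def shade_levels(counts_by_period, levels=5):
    """Map each period's count to a shade from 0 (lightest) to levels - 1 (darkest).

    Hypothetical helper: shades are scaled relative to the busiest period.
    """
    peak = max(counts_by_period.values(), default=0)
    if peak == 0:
        # No hits anywhere: every box gets the lightest shade.
        return {period: 0 for period in counts_by_period}
    return {
        # Integer scaling; the busiest period is capped at the darkest shade.
        period: min(levels - 1, count * levels // peak)
        for period, count in counts_by_period.items()
    }


# Made-up counts per year, purely for illustration.
sample_counts = {2019: 10, 2020: 50, 2021: 100, 2022: 0}
shades = shade_levels(sample_counts)
```

The period with the highest shade value is the one you would probably want to look at first in the keyword-in-context view.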

Collocates

Which words appear within a specific window around your target search term more often than chance would predict? This feature identifies them for you and leads you to a keyword-in-context view.
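To make "more often than chance" a little more concrete, here is a minimal Python sketch. It is an illustration only, not the site's implementation: it assumes a short list of tokens, a window of a fixed number of words on each side, and pointwise mutual information (PMI) as the association score. The function name `collocates` and the sample sentence are hypothetical, and English-Corpora.org may score collocates differently.

```python
import math
from collections import Counter


def collocates(tokens, target, window=2):
    """Score words found near `target` by pointwise mutual information (PMI).

    A positive score means the word turns up near the target more often
    than its overall frequency in the text would predict.
    """
    total = len(tokens)
    word_freq = Counter(tokens)
    near = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        # Count every word inside the window, excluding the target itself.
        start = max(0, i - window)
        end = min(total, i + window + 1)
        for j in range(start, end):
            if j != i:
                near[tokens[j]] += 1
    window_total = sum(near.values())
    scores = {}
    for word, observed in near.items():
        # Expected count if `word` were spread evenly through the text.
        expected = word_freq[word] / total * window_total
        scores[word] = math.log2(observed / expected)
    return scores


# Made-up sample text, purely for illustration.
sample = "strong tea and strong coffee but weak tea and powerful computers".split()
tea_scores = collocates(sample, "tea", window=1)
```

On real corpus data you would normally also set a minimum frequency, since very rare words can receive high scores from a single co-occurrence.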

Compare

If you want to look at two word forms side by side and compare how they are used, this is the feature to use; it also gives you the option to explore the collocates of each search term.

Keyword in Context (KWIC)

This option is best if you know exactly what you want to look for and want to go straight to a set of results. Otherwise, you could use the LIST function to confirm the word form you are searching for is available in the corpus and click once more to go straight to KWIC results.  
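If you have not seen a keyword-in-context display before, this small Python sketch shows the basic idea: every occurrence of the search word is lined up in a column with a few words of context on either side. It is a simplified illustration that assumes the text is already split into words; the function names `kwic` and `format_kwic` and the sample sentence are hypothetical and are not part of English-Corpora.org.

```python
def kwic(tokens, target, width=3):
    """Return one (left, keyword, right) row per occurrence of `target`."""
    rows = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target.lower():
            # Up to `width` words of context on each side of the keyword.
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, tok, right))
    return rows


def format_kwic(rows):
    """Right-align the left context so the keywords line up in one column."""
    pad = max((len(left) for left, _, _ in rows), default=0)
    return [f"{left:>{pad}}  {kw}  {right}" for left, kw, right in rows]


# Made-up sample text, purely for illustration.
sample = "the cat sat on the mat and the dog sat too".split()
for line in format_kwic(kwic(sample, "sat", width=2)):
    print(line)
```

Reading down the keyword column makes it easy to spot recurring patterns to the left and right of the word.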

 

Search Syntax

I know which search outcome I want to use but I want to make my search query really good!

Because the interface is grammatically driven, it is built to help you surface particular parts of speech (verbs, adverbs, etc.) and other linguistic features. Here is a short list of the kinds of searches English-Corpora.org can support.

Single word:      mysterious, skew
Phrase:           make up, on the other hand
Any word:         more * than, * bit
Wildcard:         *icity, *break*, b?t?er
Lemma (forms):    DECIDE, CURVE_n
Part of speech:   rough NOUN, VERB money
Alternants:       fast|slow, fast|slow rate
NOT:              pretty -NOUN (compare pretty NOUN)
Synonyms:         =beautiful, =strong ARGUMENT
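
If you are wondering how the Wildcard examples above (*icity, *break*, b?t?er) behave, this small Python sketch mimics them on a word list. It assumes the common convention that `*` stands for any run of characters and `?` for exactly one character, and it only handles single-word patterns (not a standalone `*` meaning "any word"). The function names and the word list are hypothetical; this is not the site's own search engine.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a single-word wildcard pattern into a compiled regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(r"\w*")  # any run of word characters, possibly empty
        elif ch == "?":
            parts.append(r"\w")   # exactly one word character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)


def matching_words(pattern, words):
    """Keep the words that match the whole pattern, not just part of it."""
    regex = wildcard_to_regex(pattern)
    return [w for w in words if regex.fullmatch(w)]


# Made-up word list, purely for illustration.
words = ["electricity", "breakfast", "outbreak", "better", "bitter", "butler", "city"]
ends_in_icity = matching_words("*icity", words)
contains_break = matching_words("*break*", words)
b_t_er = matching_words("b?t?er", words)
```

Trying a pattern on a handful of words like this is a quick way to check that it is neither too broad nor too narrow before running it against a whole corpus.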