English-Corpora.org: An introduction

A gentle introduction to http://english-corpora.org, a clearinghouse of monitor corpora from Brigham Young University

What is English-Corpora.org?

English-Corpora.org is a collection of highly curated corpora from Mark Davies at Brigham Young University. These corpora (or collections of text) are designed for searching text from a range of sources to observe language variation and change, for specific items and between specified dates. Many of these are considered to be large monitor corpora, used to observe practices that are either emerging or falling out of favor in contemporary use. English-Corpora.org offers 19 discrete corpora, representing a range of different kinds of language in use (generalized news discourse online, more specific news, Wikipedia, American soap operas, historical English) as well as two national corpora (which observe a specific form of English: in this case, historical Canadian and British English). Most of the corpora cover at least one dialect of English, and some cover several (such as British, American, and Canadian English). Two corpora, News on the Web (NOW) and the Coronavirus Corpus, continue to be updated daily to reflect ongoing linguistic practices.

English-Corpora.org is a free-to-use resource, though you must create your own login for access. Signing up is straightforward: you will need an email address and will be asked to provide some information about yourself. This is primarily so that those maintaining the collection can show impact and benefit for funding purposes.

The chart below outlines the details of 17 core corpora provided by English-Corpora.org.
Corpus | # words | Dialect | Time period | Genre(s)
iWeb: The Intelligent Web-based Corpus | 14 billion | 6 countries | 2017 | Web
News on the Web (NOW) | 12.3 billion+ | 20 countries | 2010-yesterday | Web: News
Global Web-Based English (GloWbE) | 1.9 billion | 20 countries | 2012-13 | Web (incl blogs)
Wikipedia Corpus | 1.9 billion | (Various) | 2014 | Wikipedia
Corpus of Contemporary American English (COCA) | 1.0 billion | American | 1990-2019 | Balanced
Coronavirus Corpus | 956 million+ | 20 countries | Jan 2020-yesterday | Web: News
Corpus of Historical American English (COHA) | 475 million | American | 1820-2019 | Balanced
The TV Corpus | 325 million | 6 countries | 1950-2018 | TV shows
The Movie Corpus | 200 million | 6 countries | 1930-2018 | Movies
Corpus of American Soap Operas | 100 million | American | 2001-2012 | TV shows
Hansard Corpus | 1.6 billion | British | 1803-2005 | Parliament
Early English Books Online | 755 million | British | 1470s-1690s | (Various)
Corpus of US Supreme Court Opinions | 130 million | American | 1790s-present | Legal opinions
TIME Magazine Corpus | 100 million | American | 1923-2006 | Magazine
British National Corpus (BNC) * | 100 million | British | 1980s-1993 | Balanced
Strathy Corpus (Canada) | 50 million | Canadian | 1970s-2000s | Balanced
CORE Corpus | 50 million | 6 countries | 2014 | Web

These corpora all represent different, discrete collections of text, although some of them are complementary. For example, the Corpus of Historical American English (COHA) covers a longer range of time than the Corpus of Contemporary American English (COCA), while COCA focuses on recent usage rather than long-term change. News on the Web (NOW) is solely focused on news discourse in 20 countries, but Global Web-Based English (GloWbE) widens its scope beyond news. Some corpora (such as the British National Corpus) are labelled 'balanced', meaning they are built from a deliberate mix of genres (in COCA, roughly equal parts of each) rather than a single text type.
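When deciding which corpus suits a question, it can help to treat the chart above as data and filter it by time span and dialect. The following Python sketch is only an illustration: the `Corpus` class, the `covering` function, and the handful of rows are assumptions made for this example (copied by hand from the chart, with sizes expressed in millions of words), and it does not use any English-Corpora.org interface.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Corpus:
    """One row of the chart (hypothetical structure for illustration)."""
    name: str
    words_millions: int
    dialect: str
    start: int  # first year covered
    end: int    # last year covered
    genre: str


# A few rows copied by hand from the chart; not the full list of corpora.
CORPORA = [
    Corpus("COCA", 1000, "American", 1990, 2019, "Balanced"),
    Corpus("COHA", 475, "American", 1820, 2019, "Balanced"),
    Corpus("Hansard Corpus", 1600, "British", 1803, 2005, "Parliament"),
    Corpus("TIME Magazine Corpus", 100, "American", 1923, 2006, "Magazine"),
    Corpus("Corpus of American Soap Operas", 100, "American", 2001, 2012, "TV shows"),
]


def covering(corpora: List[Corpus], first_year: int, last_year: int,
             dialect: Optional[str] = None) -> List[Corpus]:
    """Return corpora whose time span includes all of first_year..last_year,
    optionally restricted to a single dialect."""
    return [
        c for c in corpora
        if c.start <= first_year and c.end >= last_year
        and (dialect is None or c.dialect == dialect)
    ]


if __name__ == "__main__":
    # Which of the sample corpora could support a study of
    # American English across 1900-1950?
    for corpus in covering(CORPORA, 1900, 1950, dialect="American"):
        print(corpus.name)
```

With these sample rows, only COHA spans all of 1900-1950 for American English, which mirrors the point above that COHA is the choice for long-term change while COCA is the choice for recent usage.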

English-Corpora.org also has two corpora derived from the Google Books NGram project

The Google Books NGram project was a large-scale digitization effort of specific books from libraries around the world. These materials underwent optical character recognition and are best known from the Google Books Ngram graph visualization. The BYU version of these texts allows for more robust linguistic searching of the American and British English corpora than the Google interface provides.

Derived from Google Books n-grams
Corpus | # words | Dialect | Time period | Genre(s)
American English | 155 billion | American | 1500s-2000s | (Various)
British English | 34 billion | British | 1500s-2000 | (Various)