English-Corpora.org is a collection of highly curated corpora from Mark Davies at Brigham Young University. These corpora (collections of text) are designed for searching text from a range of sources to observe language variation and change, for specific items and between specified dates. Many of them are considered large monitor corpora, used to observe practices that are emerging as well as those falling out of favor in contemporary use. English-Corpora.org offers 19 discrete corpora representing a range of different kinds of language in use (generalized online news discourse, more specific news, Wikipedia, American soap operas, historical English), as well as two national corpora, each of which represents a specific form of English (in this case, historical Canadian and British English). Most of the corpora cover one or more dialects of English (such as British, American, or Canadian English). Two corpora, News on the Web (NOW) and the Coronavirus corpus, continue to be updated daily to reflect ongoing linguistic practices.
English-Corpora.org is a free-to-use resource, though you must create your own login for access. Signing up is straightforward: you will need an email address and to provide some information about yourself. This is primarily so that those maintaining the collection can demonstrate impact and benefit for funding purposes.
Corpus | # words | Dialect | Time period | Genre(s) |
---|---|---|---|---|
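iWeb: The Intelligent Web-based Corpus | 14 billion | 6 countries | 2017 | Web |
News on the Web (NOW) | 12.3 billion+ | 20 countries | 2010-yesterday | Web: News |
Global Web-Based English (GloWbE) | 1.9 billion | 20 countries | 2012-13 | Web (incl blogs) |
Wikipedia Corpus | 1.9 billion | (Various) | 2014 | Wikipedia |
Corpus of Contemporary American English (COCA) | 1.0 billion | American | 1990-2019 | Balanced |
Coronavirus Corpus | 956 million+ | 20 countries | Jan 2020-yesterday | Web: News |
Corpus of Historical American English (COHA) | 475 million | American | 1820-2019 | Balanced |
The TV Corpus | 325 million | 6 countries | 1950-2018 | TV shows |
The Movie Corpus | 200 million | 6 countries | 1930-2018 | Movies |
Corpus of American Soap Operas | 100 million | American | 2001-2012 | TV shows |
Hansard Corpus | 1.6 billion | British | 1803-2005 | Parliament |
Early English Books Online (EEBO) | 755 million | British | 1470s-1690s | (Various) |
Corpus of US Supreme Court Opinions | 130 million | American | 1790s-present | Legal opinions |
TIME Magazine Corpus | 100 million | American | 1923-2006 | Magazine |
British National Corpus (BNC) | 100 million | British | 1980s-1993 | Balanced |
Strathy Corpus (Canada) | 50 million | Canadian | 1970s-2000s | Balanced |
CORE Corpus | 50 million | 6 countries | 2014 | Web |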
These corpora all represent different, discrete collections of text, although some of them are complementary. For example, the Corpus of Historical American English (COHA) covers a longer span of time than the Corpus of Contemporary American English (COCA), while COCA focuses on recent usage rather than long-term change. News on the Web (NOW) is solely focused on news discourse in 20 countries, whereas Global Web-Based English (GloWbE) widens its scope beyond news. Some corpora (such as the British National Corpus) are labelled 'balanced', meaning they sample a range of genres in planned proportions rather than drawing on a single text type.
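As a rough illustration of how the time-period and dialect columns can be used to shortlist a corpus, the sketch below checks which corpora cover a date range of interest. It is not part of English-Corpora.org: the `Corpus` class and the `covering` function are hypothetical names, the figures are copied by hand from the table above for three of the corpora named in the text, and treating the BNC's "1980s" start as 1980 is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Corpus:
    """One row of the table above (hypothetical helper, not a site API)."""
    name: str
    words_millions: int
    dialect: str
    start_year: int
    end_year: int


# Figures copied by hand from the table; "1980s" is approximated as 1980.
CORPORA = [
    Corpus("COHA", 475, "American", 1820, 2019),
    Corpus("COCA", 1000, "American", 1990, 2019),
    Corpus("BNC", 100, "British", 1980, 1993),
]


def covering(corpora, dialect, first_year, last_year):
    """Return names of corpora in `dialect` that span the whole date range."""
    return [
        c.name
        for c in corpora
        if c.dialect == dialect
        and c.start_year <= first_year
        and c.end_year >= last_year
    ]


print(covering(CORPORA, "American", 1900, 2000))  # ['COHA']
print(covering(CORPORA, "American", 1995, 2015))  # ['COHA', 'COCA']
```

For a question about change across the twentieth century, only COHA qualifies; for recent usage both COHA and COCA cover the range, and the larger word count of COCA may then make it the better choice.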
The Google Books Ngram data come from a large-scale effort to digitize books from libraries around the world. These materials underwent optical character recognition (OCR) and are best known through the Google Books Ngram graph visualization. The BYU versions of the American and British English corpora allow more robust linguistic searching than the Google interface provides.
Corpus | # words | Dialect | Time period | Genre(s) |
---|---|---|---|---|
American English | 155 billion | American | 1500s-2000s | (Various) |
British English | 34 billion | British | 1500s-2000s | (Various) |