
English-Corpora.org: An introduction

A gentle introduction to http://english-corpora.org, a clearinghouse of monitor corpora from Brigham Young University

What is English-Corpora.org?

English-Corpora.org is a collection of highly curated corpora created by Mark Davies at Brigham Young University. These corpora (or collections of text) are designed for searching text from a range of sources to observe language use, variation, and change over specified periods. Many of them are large monitor corpora, used to observe practices that are emerging or falling out of favor in contemporary use. English-Corpora.org offers 19 discrete corpora representing a range of different kinds of language in use (generalized news discourse online, more specific news, Wikipedia, American soap operas, historical English) as well as two national corpora, which document a specific national variety of English - in this case, historical Canadian and British English. Most of the corpora cover one or more dialects of English (such as British, American, or Canadian English). Two corpora, News on the Web (NOW) and the Coronavirus Corpus, continue to be updated daily to reflect ongoing linguistic practices.

English-Corpora.org is a free-to-use resource, though you must create your own login for access. Signing up is easy and straightforward - you will need an email address and to provide some information about yourself. This is primarily so those maintaining the collection can show impact and benefit for funding purposes.

The chart below outlines the details of 17 core corpora provided by English-Corpora.org.
| Corpus | # words | Dialect | Time period | Genre(s) |
| --- | --- | --- | --- | --- |
| iWeb: The Intelligent Web-based Corpus | 14 billion | 6 countries | 2017 | Web |
| News on the Web (NOW) | 12.3 billion+ | 20 countries | 2010-yesterday | Web: News |
| Global Web-Based English (GloWbE) | 1.9 billion | 20 countries | 2012-13 | Web (incl. blogs) |
| Wikipedia Corpus | 1.9 billion | (Various) | 2014 | Wikipedia |
| Corpus of Contemporary American English (COCA) | 1.0 billion | American | 1990-2019 | Balanced |
| Coronavirus Corpus | 956 million+ | 20 countries | Jan 2020-yesterday | Web: News |
| Corpus of Historical American English (COHA) | 475 million | American | 1820-2019 | Balanced |
| The TV Corpus | 325 million | 6 countries | 1950-2018 | TV shows |
| The Movie Corpus | 200 million | 6 countries | 1930-2018 | Movies |
| Corpus of American Soap Operas | 100 million | American | 2001-2012 | TV shows |
| Hansard Corpus | 1.6 billion | British | 1803-2005 | Parliament |
| Early English Books Online | 755 million | British | 1470s-1690s | (Various) |
| Corpus of US Supreme Court Opinions | 130 million | American | 1790s-present | Legal opinions |
| TIME Magazine Corpus | 100 million | American | 1923-2006 | Magazine |
| British National Corpus (BNC) * | 100 million | British | 1980s-1993 | Balanced |
| Strathy Corpus (Canada) | 50 million | Canadian | 1970s-2000s | Balanced |
| CORE Corpus | 50 million | 6 countries | 2014 | Web |

These corpora all represent different, discrete collections of text, although some of them are complementary. For example, the Corpus of Historical American English (COHA) covers a longer range of time than the Corpus of Contemporary American English (COCA), but COCA is more focused on recent usage than on long-term change over time. News on the Web (NOW) is solely focused on news discourse in 20 countries, while Global Web-Based English widens its scope beyond news. Some corpora (such as the British National Corpus) are labelled 'balanced', meaning their text is divided evenly across the genres included.

English-Corpora.org also has two corpora derived from the Google Books NGram project.

The Google Books NGram project was a large-scale digitization effort covering selected books from libraries around the world. These materials underwent optical character recognition and are best known through the Google Books Ngram graph visualization. The BYU version of these texts allows more robust linguistic searching of the American and British English corpora than the Google interface provides.

Derived from Google Books n-grams (compare)

| Corpus | # words | Dialect | Time period | Genre(s) |
| --- | --- | --- | --- | --- |
| American English | 155 billion | American | 1500s-2000s | (Various) |
| British English | 34 billion | British | 1500s-2000s | (Various) |