
English-Corpora.org: An introduction

A gentle introduction to http://english-corpora.org, a clearinghouse of monitor corpora from Brigham Young University

Downloading corpora from English-Corpora.org

English-Corpora.org provides free, complete access to its data through a robust web-based platform. However, the web interface does not meet everyone's needs, and some users want local copies of the corpora for more advanced text and data mining tasks. For that purpose, English-Corpora.org offers its data under two licenses, Academic and Non-Academic; the chart below shows the cost of access to one or more corpora. You cannot buy all of the corpora, but you can buy eleven of the largest ones (see https://www.corpusdata.org/corpora.asp for details).

Penn State researchers seeking to download any corpus would fall under the license category "ACAD" (academic). 

Cost to get access to one or more corpora

License | Explanation | One corpus | Two corpora | 3+ corpora (see example)
ACAD | For use by university or college personnel (professors, teachers, students). | $375 | $595 | $200 each additional corpus
NON-ACAD | Any other use*, including commercial. | $795 | $1,395 | $400 each additional corpus
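The pricing in the chart can be worked through with a short calculation. The sketch below is illustrative only: it assumes that "each additional corpus" means the two-corpora price plus the per-corpus rate for every corpus beyond the second, which is one reading of the chart and is not confirmed here. The names `PRICES` and `license_cost` are hypothetical. Check the vendor's own example before budgeting.

```python
# Prices taken from the chart above: (one corpus, two corpora, each additional corpus).
PRICES = {
    "ACAD": (375, 595, 200),
    "NON-ACAD": (795, 1395, 400),
}


def license_cost(license_type: str, n_corpora: int) -> int:
    """Estimate the total cost in US dollars for n_corpora under a license type.

    ASSUMPTION: for three or more corpora, the total is the two-corpora price
    plus the 'each additional corpus' rate for every corpus beyond the second.
    """
    if n_corpora < 1:
        raise ValueError("n_corpora must be at least 1")
    one, two, additional = PRICES[license_type]
    if n_corpora == 1:
        return one
    return two + additional * (n_corpora - 2)


if __name__ == "__main__":
    for n in (1, 2, 3, 5):
        print(f"ACAD, {n} corpora: ${license_cost('ACAD', n):,}")
```

Under this reading, a Penn State researcher buying three corpora on an academic license would pay $595 + $200 = $795.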

 

You can purchase the data from English-Corpora.org in three different formats: relational databases, word/lemma/PoS, and words (paragraph format). Purchasing the data also gives you the rights to all three formats, so you can use whichever suits your work. Read more about these formats and their affordances at https://www.corpusdata.org/formats.asp

For legal and copyright reasons, they cannot distribute 100% of a corpus to users, even paid users. When you purchase the data, you are purchasing 95% of the full-text data. The remaining 5% is removed for reasons of copyright. Read more about these limitations: https://www.corpusdata.org/limitations.asp. This is a common practice to conform with US copyright law and will not affect the validity of any results or output.

There are several restrictions you must agree to before purchasing the data.

(These are all copied from their Restrictions page, which you must complete to initiate purchase: https://www.corpusdata.org/restrictions.asp)

1. In no case can substantial amounts of the full-text data (typically, a total of 50,000 words or more) be distributed outside the organization listed on the license agreement. For example, you cannot create a large word list or set of n-grams, and then distribute this to others, and you could not copy 70,000 words from different texts and then place this on a website where users from outside your organization would have access to the data.

2. The data cannot be placed on a network (including the Internet), unless access to the data is limited (via restricted login, password, etc) just to those from the organization listed on the license agreement.

3. In addition to the full-text data itself, #2 also applies to derived frequency, collocates, n-grams, concordance and similar data that is based on the corpus.

4. If portions of the derived data are made available to others, they cannot include substantial portions of the raw frequency of words (e.g. the word occurs 3,403 times in the corpus) or the rank order (e.g. it is the 304th most common word). (Note: it is acceptable to use the frequency data to place words and phrases in "frequency bands", e.g. words 1-1000, 1001-3000, 3001-10,000, etc. However, there should not be more than about 20 frequency bands in your application.)

5. Academic licenses: are only valid for one campus. So if you are part of a research group, for example, with members at universities X, Y, and Z, they all need to purchase the data separately.

6. Academic licenses: you can not use the data to create software or products that will be sold to others.

7. Academic licenses: students in your undergraduate classes cannot have access to substantial portions of the data (e.g. 50,000 words or more). Graduate students can have access to the data for work on theses and dissertations. The data is primarily intended for use in research, not teaching. If you need corpus data for undergraduate classes, please use the standard web interface for the corpora.

8. Academic and Commercial licenses: supervisors will make best efforts to ensure that other employees or students who have access to the data are aware of these restrictions.

9. Commercial license: large companies with employees at several different sites (especially different countries) may need to contact us for a special license.

10. Any publications or products that are based on the data should contain a reference to the source of the data: https://www.corpusdata.org.
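Restriction 4 allows derived data to be shared as "frequency bands" rather than raw counts or exact ranks. The sketch below shows one way that could look in practice. It is only an illustration: the band boundaries are the example values from the restriction, the word counts are made up, and the names `BAND_UPPER_BOUNDS`, `frequency_band`, and `band_words` are hypothetical. It is not guidance from English-Corpora.org on what counts as compliant.

```python
from bisect import bisect_left

# Upper rank of each band, using the example bands from restriction 4:
# 1-1000, 1001-3000, 3001-10000, plus an open-ended final band.
BAND_UPPER_BOUNDS = [1000, 3000, 10000]

# The restriction suggests no more than about 20 bands.
assert len(BAND_UPPER_BOUNDS) + 1 <= 20


def frequency_band(rank: int) -> str:
    """Return a band label such as '1-1000' for a 1-based frequency rank."""
    if rank < 1:
        raise ValueError("rank must be at least 1")
    i = bisect_left(BAND_UPPER_BOUNDS, rank)
    lower = 1 if i == 0 else BAND_UPPER_BOUNDS[i - 1] + 1
    if i == len(BAND_UPPER_BOUNDS):
        return f"{lower}+"
    return f"{lower}-{BAND_UPPER_BOUNDS[i]}"


def band_words(counts: dict[str, int]) -> dict[str, str]:
    """Map each word to its band, dropping the raw count and the exact rank."""
    # Rank by descending count; ties are broken alphabetically.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {w: frequency_band(r) for r, w in enumerate(ranked, start=1)}


if __name__ == "__main__":
    # Made-up counts for illustration only; not real corpus data.
    toy_counts = {"the": 500, "of": 300, "corpus": 12}
    print(band_words(toy_counts))
```

Only the band labels leave this function, so a shared word list built this way would not expose the counts or rank order that the restriction describes.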