Large monitor corpora compiled by Prof. Mark Davies, Linguistics, at Brigham Young University. Free registration is required after 4 searches. The corpora focus on contemporary and historical Englishes, Spanish, and Portuguese, as well as various forms of news discourse. A paid tier offers the ability to download the corpora for your own use.
Created by Prof. Mark Davies, Linguistics, at Brigham Young University, this compares The Corpus of Historical American English (COHA), Google Books (Standard), and the Google Books (BYU / Advanced) corpus in NGrams. A paid tier offers the ability to download the corpora for your own use.
Includes, but is not limited to, digitized and born-digital text, audio, images, moving images, and the metadata that describes them, with particular strength in text and audio.
The Natural Language Toolkit (NLTK) is a Python package for machine annotation and study of syntactic features. It includes a range of free corpora (see section 1.6).
The first producer of free electronic books (ebooks). Its catalog includes nearly 30,000 free books and over 100,000 titles. Some minimal clean-up of the texts is required.
Multiple text mining tools to analyze not only scholarly publications, but also other types of biomedical resources, such as Electronic Health Records.
Large-scale collaborative repository of digital content from research libraries including content digitized via the Google Books project and Internet Archive digitization initiatives. See the HathiTrust Research Center page for text analysis options.
Wide-ranging collection of linguistic data, following the discipline's best practices.
Specific Library Databases Allowing Text Mining
Most of the libraries' databases do not allow text or data mining (TDM) because of our license agreements with the vendors. However, some do include permission to do this, and the following providers have kindly offered some text and data mining options for our users. We will continue to work with database vendors to include TDM in future license agreements.
Unless someone is specified as a point of contact, please contact your subject specialist to initiate the process. Some of these links are to specific text mining platforms, rather than full-text access. Any unauthorized web scraping of these databases can result in the vendor cutting off access to the entire campus!
Below are the databases we have negotiated TDM for:
offers mining on all databases Penn State subscribes to. Contact your subject specialist to get started.
Adam Matthew publishes unique primary source collections from archives around the world. The collections cover a broad range of topics in the humanities and social sciences. The stated themes are Area Studies, Cultural Studies, Empire and Globalism, Ethnic Studies, Gender and Sexuality, History, Literature, Politics, Theatre, War and Conflict. Search across all of Penn State's collections via the AMexplorer search box, or browse the list of links.
The Text Creation Partnership creates standardized, accurate XML/SGML-encoded electronic text editions of early print books.
Phase I content (25,000 titles) is freely available and searchable as of January 1, 2015. Penn State was a partner in Phase II of the transcription, so we have access to all transcriptions from the text repository. Phase II will enter the public domain in 2020. Contact Heather Froehlich, Literary Informatics Librarian, for help.
A selection of digitized and encoded texts chosen from images in the Early English Books Online Project (works printed in the British Isles or in English from 1473 to 1700). Works chosen must be associated with an author whose name appears in the New Cambridge Bibliography of English Literature, or be named by title in the Bibliography.
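Because these editions are XML-encoded, a first step in mining them is usually to pull the running text out of the markup. The following is a minimal sketch using only Python's standard library. The sample document and its element names (`TEI`, `text`, `body`, `p`, `hi`) are simplified assumptions for illustration; actual TCP files are larger, may use namespaces, and should be checked against their own schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified sample. Real TCP files are larger and may use
# namespaces and different element names.
SAMPLE_XML = """<TEI>
  <teiHeader><title>A Sample Tract</title></teiHeader>
  <text>
    <body>
      <p>The first paragraph of the <hi>tract</hi>.</p>
      <p>The second paragraph.</p>
    </body>
  </text>
</TEI>"""


def extract_paragraphs(xml_string):
    """Return the plain text of every <p> element found under <text>."""
    root = ET.fromstring(xml_string)
    text_element = root.find("text")
    if text_element is None:
        return []
    paragraphs = []
    for p in text_element.iter("p"):
        # itertext() also collects text inside nested inline elements such as <hi>.
        raw = "".join(p.itertext())
        # Collapse runs of whitespace left over from the markup layout.
        paragraphs.append(" ".join(raw.split()))
    return paragraphs


paragraphs = extract_paragraphs(SAMPLE_XML)
```

Skipping the header and reading only under `<text>` keeps bibliographic metadata out of the word counts, which is usually what a frequency analysis wants.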
Searches across 23 of our Gale primary source databases covering 1500-2012, including a Term Frequency search option and Term Clusters viewer (available from the articles results list).
Gale Artemis: Primary Sources is Gale's platform featuring a seamless research environment for multiple collections. Starting with Eighteenth Century Collections Online (ECCO), Nineteenth Century Collections Online (NCCO), and Making of Modern Law (MOML), Gale will be incorporating the majority of its primary source collections, including Archives Unbound and the Historical Newspapers Collections, into Artemis Primary Sources. This will enable researchers, teachers, and students to cross-search these collections and to discover and analyze content in entirely new ways.
Includes access to a Data For Research portal called Constellate providing a localized, self-service system for text mining. Download metadata, word frequencies, citations, key terms, and N-grams of up to 1,000 documents, and run Jupyter Notebooks for analysis on the web.
JSTOR is a not-for-profit organization that provides a trusted archive of important scholarly journals and a selection of scholarly books. Content in JSTOR spans many disciplines, primarily in the humanities and social sciences. While indexing for JSTOR articles is covered in LionSearch, the full text of the articles is not searched in LionSearch. Search JSTOR itself to ensure detailed coverage of full texts.
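For readers unfamiliar with the word frequencies and N-grams mentioned above, the sketch below shows how such counts can be derived from plain text using only Python's standard library. It is an illustration of the general technique, not the portal's own implementation; the tokenizer and the sample sentence are assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    # zip over n staggered copies of the token list yields each n-gram once.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


# Hypothetical sample text for illustration.
sample = "The cat sat on the mat. The cat slept."
tokens = tokenize(sample)
word_frequencies = Counter(tokens)
bigram_frequencies = ngram_counts(tokens, 2)
```

Note that this simple tokenizer ignores sentence boundaries, so an N-gram can span two sentences (here, "mat the"). Whether that matters depends on the research question.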
Provided it is for non-commercial purposes, ScienceDirect offers mining on all databases Penn State has access to. Mining is done through Elsevier's ScienceDirect APIs, which you must register to use.
This system provides access to the electronic versions of the Elsevier journals and books that we subscribe to. Current issues and back files are included. Currently, it includes more than 1,200 journals. The full text collection contains over 1.5 million articles and book chapters from 1995 to present across all fields of science.
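As a rough idea of what working with the APIs involves, the sketch below builds (but does not send) a request for one article's full text. The base URL, the `X-ELS-APIKey` header, and the `Accept` value reflect our understanding of Elsevier's article retrieval service and should be treated as assumptions to confirm in Elsevier's developer documentation. The DOI and key shown are placeholders.

```python
import urllib.parse
import urllib.request

# Assumed endpoint for retrieving an article by DOI; confirm in Elsevier's docs.
ARTICLE_BASE_URL = "https://api.elsevier.com/content/article/doi/"


def build_article_request(doi, api_key):
    """Build a GET request for one article; nothing is sent over the network."""
    # Keep the slash inside the DOI, percent-encode anything unsafe.
    url = ARTICLE_BASE_URL + urllib.parse.quote(doi, safe="/")
    headers = {
        "X-ELS-APIKey": api_key,  # key issued when you register (assumed header name)
        "Accept": "text/xml",     # ask for the XML representation
    }
    return urllib.request.Request(url, headers=headers)


# Placeholder DOI and key for illustration only.
request = build_article_request("10.1016/j.example.2020.01.001", "YOUR_API_KEY")

# To actually fetch the article, pass the request to urllib.request.urlopen().
```

Going through the registered API rather than scraping the website is what keeps this kind of access within the license terms.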
Oxford University Press grants research access to the Corpus for academic projects that can demonstrate a strong practical need for this data. See their Developer page for details. Researchers are not required to request permission for non-commercial text-mining of OUP content. However, OUP offers a consultation service with a technical project manager to assist in planning your TDM project, including how to avoid triggering any technical safeguards OUP has in place to protect the stability and security of its websites. To request a consultant for your TDM project, please e-mail Data.Mining@oup.com.