Text Mining Resources

Open Access Resources

AntCorGen
Quickly download and process whole or sections of PLOS articles by discipline for use in other software programs. (See PLOS below)
ArXiv (Cornell University)
Open access to 1,153,908 e-prints in Physics, Mathematics, Computer Science, Quantitative Biology, Quantitative Finance and Statistics. Bulk Access
BioMed Central (Springer Science+Business Media)
Over 250,000 full-text, peer-reviewed articles are available for text and data mining.
Chronicling America (Library of Congress)
Digitized newspapers from all 50 states; includes an API for JSON, linked data, and bulk download options.
Corpus.byu.edu (Brigham Young University)
Large monitor corpora compiled by Prof. Mark Davies, Linguistics, at Brigham Young University, with free registration required after 4 searches. Focus on Contemporary and Historical Englishes, Spanish, Portuguese; various forms of news discourse. A paid tier offers the ability to download the corpora for your own use.
Corpus Resource Database (University of Helsinki)
Collections of (primarily) English-language corpora, covering both historical and modern varieties of language
Digital Public Library of America Data (DPLA)
Data is available for bulk download in JSON files.
Documenting the Now (University of Maryland/UC Riverside/Washington University in St. Louis)
A collection of materials based on publicly available social media (Twitter, etc.) for chronicling historically significant events.
Google Books BYU View (Brigham Young University)
Created by Prof. Mark Davies, Linguistics, at Brigham Young University, this compares The Corpus of Historical American English (COHA), Google Books (Standard), and the Google Books (BYU / Advanced) corpus in NGrams. A paid tier offers the ability to download the corpora for your own use.
GovDocs (U.S. Government Publishing Office)
Bulk data documents from the U.S. government in an XML format.
Internet Archive & Open Library (Internet Archive)
Offers over 10,000,000 fully accessible books and texts. Instructions for downloading in bulk.
MSU Libraries Humanities Data (Michigan State University)
Includes but is not limited to digitized and born digital text, audio, images, moving images, and the metadata that describes them, with particular strength in text and audio
Natural Language Toolkit corpora (University of Pennsylvania)
The Natural Language Toolkit (NLTK) is a Python package for machine annotation and study of syntactic features. It includes a range of free corpora (see section 1.6).
PLOS (Public Library of Science)
Provides access to its peer-reviewed articles.
Project Gutenberg
Project Gutenberg was the first producer of free electronic books (ebooks); its catalog includes nearly 30,000 free books and over 100,000 titles. Some minimal clean-up of the texts is required.
PubMed Central Databases and Text Mining Tools (NCBI/U.S. National Institutes of Health's National Library of Medicine)
Multiple text mining tools to analyze not only scholarly publications, but also other types of biomedical resources, such as Electronic Health Records.
University of Oxford Text Archive (University of Oxford)
A repository of digital literary and linguistic resources for research and teaching in higher education
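
The "minimal clean-up" noted for Project Gutenberg texts usually means removing the license header and footer that wrap each plain-text file. The following is a minimal sketch of that step, not an official Project Gutenberg tool. It assumes the file uses the common "*** START OF THE/THIS PROJECT GUTENBERG EBOOK" and "*** END OF ..." marker lines; the function name and the sample text are hypothetical, and some older files use different wording that this pattern will not catch.

```python
import re

# Marker wording varies between files ("THE" vs "THIS"), so match loosely.
# Assumption: each marker sits on its own line and starts with "***".
START_RE = re.compile(
    r"^\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)
END_RE = re.compile(
    r"^\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)


def strip_gutenberg_boilerplate(raw: str) -> str:
    """Return the text between the START and END marker lines.

    If either marker is missing (or they appear out of order), the whole
    input is returned unchanged apart from trimmed whitespace, so no text
    is silently dropped.
    """
    start = START_RE.search(raw)
    end = END_RE.search(raw)
    if not start or not end or end.start() < start.end():
        return raw.strip()
    return raw[start.end():end.start()].strip()


# Hypothetical sample standing in for a downloaded plain-text file.
sample = """The Project Gutenberg eBook of An Example
*** START OF THE PROJECT GUTENBERG EBOOK AN EXAMPLE ***

Chapter 1
It was a quiet morning.

*** END OF THE PROJECT GUTENBERG EBOOK AN EXAMPLE ***
License text follows here.
"""

cleaned = strip_gutenberg_boilerplate(sample)
print(cleaned)
```

Falling back to the full text when no markers are found is a deliberate choice: for a corpus build it is usually safer to keep some boilerplate than to lose a book.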

Login Required, but not licensed by the Libraries

The following resources are available to anyone in the Penn State community using their @psu.edu access account.

CQPWeb (Lancaster University)
Includes modern European, historical, and modern East Asian language corpora, with a focus on historical British English corpora
(incl. Early English Books Online Text Creation Partnership full-text search!)
HathiTrust Digital Library (HathiTrust)
Large-scale collaborative repository of digital content from research libraries including content digitized via the Google Books project and Internet Archive digitization initiatives. See the HathiTrust Research Center page for text analysis options.
Linguistic Data Consortium (UPenn)
Wide-ranging collection of linguistic data, following the discipline's best practices

Specific Library Databases Allowing Text Mining

Most of the Libraries' databases do not allow text or data mining (TDM) because of our license agreements with the vendors. However, some do include permission to do this, and the following providers have kindly offered some text and data mining options for our users. We will continue to work with database vendors to include TDM in future license agreements.

Unless someone is specified as a point of contact, please contact your subject specialist to initiate the process. Some of these links are to specific text mining platforms, rather than full-text access. Any unauthorized web scraping of these databases can result in the vendor cutting off access to the entire campus!

Below are the databases we have negotiated TDM for:

Adam Matthew Digital Penn State Portal
Adam Matthew offers mining on all databases Penn State subscribes to. Contact your subject specialist to get started.
Adam Matthew publishes unique primary source collections from archives around the world. The collections cover a broad range of topics in the humanities and social sciences. The stated themes are Area Studies, Cultural Studies, Empire and Globalism, Ethnic Studies, Gender and Sexuality, History, Literature, Politics, Theatre, War and Conflict. Search across all of Penn State's collections via the AMexplorer search box, or browse the list of links.
Early English Books Online - Text Creation Partnership
The Text Creation Partnership creates standardized, accurate XML/SGML encoded electronic text editions of early print books.
Phase I content (25,000 titles) is freely available/searchable as of January 1, 2015. Penn State was a partner in Phase II of the transcription and has access to all transcriptions from the text repository. Phase II will enter the public domain in 2020. Contact Heather Froehlich, Literary Informatics Librarian, for help.
A selection of digitized and encoded texts chosen from images in the Early English Books Online Project (works printed in the British Isles or in English from 1473 to 1700). Works chosen must be associated with an author whose name appears in the New Cambridge Bibliography of English Literature, or be named by title in the Bibliography.
Gale Primary Sources
Searches across 23 of our Gale primary source databases covering 1500-2012, including a Term Frequency search option and Term Clusters viewer (available from the articles results list).

Gale Artemis: Primary Sources is Gale's platform featuring a seamless research environment for multiple collections. Starting with Eighteenth Century Collections Online (ECCO), Nineteenth Century Collections Online (NCCO), and Making of Modern Law (MOML), Gale will be incorporating the majority of our primary source collections, including Archives Unbound and the Historical Newspapers Collections, into Artemis Primary Sources, enabling researchers, teachers, and students to cross-search these collections and discover and analyze content in entirely new ways.
JSTOR
Includes access to a Data For Research portal called Constellate providing a localized, self-service system for text mining. Download metadata, word frequencies, citations, key terms, and N-grams of up to 1,000 documents, and run Jupyter Notebooks for analysis on the web.
JSTOR is a not-for-profit organization that provides a trusted archive of important scholarly journals and a selection of scholarly books. Content in JSTOR spans many disciplines, primarily in the humanities and social sciences. While indexing for JSTOR articles is covered in LionSearch, the full text of the articles is not searched in LionSearch. Search JSTOR itself to ensure detailed coverage of full texts.
ScienceDirect (Elsevier full text journal articles and electronic books)
Provided it is for non-commercial purposes, ScienceDirect offers mining on all databases Penn State has access to. You do this via Elsevier's ScienceDirect APIs, which you must register to use.
This system provides access to the electronic versions of the Elsevier journals and books that we subscribe to. Current issues and back files are included. Currently, it includes more than 1,200 journals. The full text collection contains over 1.5 million articles and book chapters from 1995 to present across all fields of science.
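
Several of the platforms above (for example Gale's Term Frequency search and the word frequencies and N-grams offered through JSTOR's Constellate) return derived counts rather than full text. To make clear what those counts represent, here is a minimal sketch of computing word frequencies and N-grams from a short text with the Python standard library. It is an illustration only, not the method any vendor uses; the function names, the simple tokenizer, and the sample sentence are all assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens.

    Assumption: a token is a run of the letters a-z and apostrophes.
    Real projects often need a more careful tokenizer.
    """
    return re.findall(r"[a-z']+", text.lower())


def ngram_counts(tokens: list[str], n: int) -> Counter:
    """Count every run of n consecutive tokens (an N-gram)."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


# Hypothetical sample text standing in for a document from a database.
text = "The cat sat on the mat. The cat slept."
tokens = tokenize(text)

word_freq = Counter(tokens)        # single-word (unigram) frequencies
bigrams = ngram_counts(tokens, 2)  # two-word sequences

print(word_freq.most_common(2))  # [('the', 3), ('cat', 2)]
print(bigrams.most_common(1))    # [('the cat', 2)]
```

Note that this sketch counts N-grams across sentence boundaries (for example "mat the"); whether that is acceptable depends on the research question.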

Oxford Dictionaries
Oxford University Press grants research access to the Corpus for academic projects that can demonstrate a strong practical need for this data. See their Developer page for details. Researchers are not required to request permission for non-commercial text mining of OUP content. However, OUP offers a consultation service with a technical project manager to assist in planning your TDM project, including how to avoid triggering the technical safeguards OUP has in place to protect the stability and security of its websites. To request a consultant for your TDM project, please e-mail Data.Mining@oup.com.
LexisNexis Web Services API
The LexisNexis API provides easy, batch-downloading access to LexisNexis materials Penn State subscribes to, in JSON format.