Text Analysis: Introduction

Where do I find corpora?

Corpora can be found in lots of places. Before running off to assemble your own, it is always worth seeing if something similar already exists. Even if it is not exactly what you want, it is worth seeing what's already out there and thinking about how you could adapt the existing corpus to support your research questions.

Always begin with corpus repositories. Corpus repositories follow a set of agreed-upon standards, including that contributors provide metadata and documentation about their corpus, which in turn makes the corpus clearer to someone coming to it for the first time. The major corpus repositories are:

Web-based corpus query systems are also rich resources.

In addition, websites such as Project Gutenberg, along with other sources such as the full-text newspaper articles that PSU offers access to, are ripe for building up corpora.

For more information on finding corpora to use, please see the Text Mining Resources guide.

What do I do with my corpora once I've got them?

Before you can think about analyzing your corpora, you may have to do some preprocessing.

This often takes the form of removing recurring boilerplate text (such as bylines, outbound social media links, or Project Gutenberg disclaimers).
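As one illustration of this kind of cleanup, here is a minimal Python sketch that keeps only the text between the START and END marker lines commonly found in Project Gutenberg plain-text files. The function name strip_gutenberg_boilerplate and the sample text are made up for this example, and the marker wording varies between files, so the patterns are an assumption to check against your own downloads.

```python
import re

# Marker lines as they commonly appear in Project Gutenberg plain-text files.
# Assumption: wording varies between files, so the patterns are kept loose.
START_MARKER = re.compile(
    r"^\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)
END_MARKER = re.compile(
    r"^\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)


def strip_gutenberg_boilerplate(text: str) -> str:
    """Return the text between the START and END marker lines.

    If a marker is missing, that side of the text is left untouched.
    """
    start = START_MARKER.search(text)
    begin = start.end() if start else 0
    # Only look for the END marker after the point where the body begins.
    end = END_MARKER.search(text, begin)
    finish = end.start() if end else len(text)
    return text[begin:finish].strip()


# Hypothetical sample standing in for a downloaded file.
sample = (
    "The Project Gutenberg eBook of An Example\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK AN EXAMPLE ***\n"
    "Call me Ishmael.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK AN EXAMPLE ***\n"
    "Updated editions will replace the previous one.\n"
)
print(strip_gutenberg_boilerplate(sample))
```

Bylines and social media links usually need patterns tailored to each source, so it is worth spot-checking a few cleaned files by hand before processing a whole corpus.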

Another common step is to annotate features such as parts of speech and speaker information using XML. There is debate about what level of annotation is appropriate for the kind of work you are planning to do, and it is always possible to go back and add more annotation later. Any annotation you include should specifically help answer the research question you plan to use the corpus to tackle. Hardie (2014, open access) offers a survey of what could be the most skeletal but still informative form of corpus annotation.
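To make the idea of XML annotation concrete, the following sketch uses Python's standard xml.etree.ElementTree module to wrap each utterance in an element recording the speaker and each word in an element recording its part of speech. The element and attribute names (text, u, w, who, pos), the tag labels, and the sample data are illustrative assumptions, not a scheme prescribed by this guide.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: (speaker, [(word, part-of-speech tag), ...]) pairs.
utterances = [
    ("A", [("I", "PRON"), ("like", "VERB"), ("tea", "NOUN")]),
    ("B", [("Me", "PRON"), ("too", "ADV")]),
]


def build_annotated_xml(utterances):
    """Wrap each utterance in <u who="..."> and each word in <w pos="...">."""
    root = ET.Element("text")
    for speaker, tokens in utterances:
        u = ET.SubElement(root, "u", who=speaker)
        for word, pos in tokens:
            w = ET.SubElement(u, "w", pos=pos)
            w.text = word
    return root


root = build_annotated_xml(utterances)
print(ET.tostring(root, encoding="unicode"))

# Annotation pays off at query time: every noun spoken by speaker A.
nouns_by_a = [w.text for w in root.findall("./u[@who='A']/w[@pos='NOUN']")]
print(nouns_by_a)
```

Keeping the markup this small fits the advice above: add only the annotation your research question needs, and extend it later if necessary.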

Starting small and manageable is usually wiser than going big and missing the forest for the trees.

To analyze a corpus, you need to choose a piece of software.

There are a variety of software packages for this kind of work, and no one-size-fits-all option. At its core, text analysis is a way of counting individual lexical items to identify patterns. This can be done on the command line using Unix tools, Python, or R, but many beginners find it easier to use software that looks more like other programs they are familiar with. Those without computer programming experience are encouraged to start with software designed and built by other people.
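For readers curious what "counting individual lexical items" looks like in code, here is a minimal Python sketch using only the standard library. The function name word_frequencies, the simple rule for what counts as a word, and the sample sentence are assumptions made for illustration; real projects usually need more careful tokenization.

```python
import re
from collections import Counter


def word_frequencies(text: str) -> Counter:
    """Count how often each word form occurs in the text."""
    # Lowercase, then treat runs of letters (with internal apostrophes) as words.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    return Counter(tokens)


# Hypothetical sample text.
sample = "The cat sat on the mat. The dog didn't."
freqs = word_frequencies(sample)

# Print the three most frequent word forms with their counts.
for word, count in freqs.most_common(3):
    print(word, count)
```

Dedicated text analysis software performs this same kind of counting behind a graphical interface, which is why it is a reasonable starting point for those who do not program.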

It is important to emphasize that no one piece of software is better than another overall, though some are better at certain tasks than others. Much here comes down to personal taste, as with Firefox vs. Chrome or Android vs. iPhone.

A variety of options for concordance software are available:

(Note that some of these may require a licensing fee.)
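To show roughly what concordance software does, here is a minimal keyword-in-context (KWIC) sketch in Python. It is not how any particular concordancer is implemented; the function name kwic, the width parameter, and the sample text are assumptions made for illustration.

```python
import re


def kwic(text: str, keyword: str, width: int = 3):
    """Return (left context, keyword, right context) for each hit.

    Context is measured in words, with `width` words on each side.
    """
    tokens = re.findall(r"\w+", text.lower())
    target = keyword.lower()
    hits = []
    for i, token in enumerate(tokens):
        if token == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


# Hypothetical sample text.
sample = "The cat sat on the mat. A cat is a small animal."
for left, node, right in kwic(sample, "cat", width=2):
    # Right-align the left context so the keyword lines up in a column.
    print(f"{left:>15} | {node} | {right}")
```

Lining up every occurrence of a word with its neighbors makes recurring patterns of use easier to spot than reading the texts one after another.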


Additional Information

Natural language processing (sometimes also called NLP) includes tasks like grammatical parsing, tokenization, and stemming. These are all hugely helpful for automating some annotation processes. NLP packages offer a suite of command-line concordance functions, and these are available for both Python and R users.
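As a rough picture of two of these tasks, the sketch below tokenizes a string and then applies a deliberately crude suffix stripper. This is a toy written for this guide, not the Porter algorithm or any NLP package's stemmer; the suffix list and the function names tokenize and naive_stem are assumptions, and an established package would be the better choice for real work.

```python
import re

# Toy suffix list (an assumption for illustration), checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Split text into lowercase runs of letters."""
    return re.findall(r"[a-z]+", text.lower())


def naive_stem(token):
    """Strip the first matching suffix, leaving short words alone."""
    for suffix in SUFFIXES:
        # Require at least three remaining characters after stripping.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[:-len(suffix)]
    return token


tokens = tokenize("Parsing parsed texts")
stems = [naive_stem(t) for t in tokens]
print(stems)
```

Stemming groups related word forms under one label, which can make counts more meaningful, though a rule set this simple will mis-handle many English words.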

Python packages are available that complete tasks like part-of-speech tagging and named entity recognition.

Other packages include aspects of the natural language processing tasks described above, though they lean more towards machine learning applications for quantitative text analysis (see the Text Analysis Methods page for methods).