Text Analysis: Introduction

A note on methods

In its most basic form, text analysis is just about counting words. However, counting is easy. Choosing what to count and why is much harder.

When we talk about text analysis, we are usually discussing a collection of methods that count words and observe their relationships to each other. There are two major ways to approach this. The first is to specify the individual lexical items (words, phrases) under investigation and observe their patterns. The second is more hands-off, allowing the computer to sort out relationships for you and present its findings. There are pros and cons to both methods, and the most interesting studies tend to involve moving between multiple modes of inquiry. This is an iterative process, not one that will automagically solve your problems for you. Be prepared to take a trial-and-error approach to your methods and allow yourself to move beyond the most immediately obvious findings. Computers are good at keeping track of things that we as readers are bad at noticing: if you didn't need the computer to find it, it probably isn't a very meaningful discovery.

Unsupervised machine learning and similar hands-off methods aim to create a top-level view of language, rather than looking for specific uses across a corpus. This approach is great for getting a sense of what large corpora are doing, and for offering new directions for close reading. A more supervised approach, in which you define terms and observe their specific patterns, offers more opportunities for close reading, but makes the big picture harder to grasp. In both cases, it is worth remembering that text analysis methods work best when the level of detail you are looking for exceeds the amount of data you can realistically sort through as a human, linear reader. The task is then to explain why the text does the thing it does.


Common methods

This box describes several ways to perform these counts and what their strengths are. Different software will have different implementations of these methods, so your choice of platform may affect the kinds of analyses you can run.

Keyword-in-Context (KWIC) Analysis: lists every occurrence of a specific word or phrase together with its surrounding context (up to 7 words in each direction is common). Best for pattern identification and close reading.
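
As an illustration of the idea, here is a minimal KWIC sketch in plain Python. The function name `kwic`, the simple letters-and-apostrophes tokenizer, and the sample sentence are all assumptions made for this example; dedicated concordance software handles punctuation, phrases, and sorting far more carefully.

```python
import re

def kwic(text, keyword, window=7):
    """Return (left context, keyword, right context) for each hit."""
    # Assumption: lowercase the text and treat runs of letters/apostrophes as words.
    tokens = re.findall(r"[a-z']+", text.lower())
    target = keyword.lower()
    hits = []
    for i, token in enumerate(tokens):
        if token == target:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, token, right))
    return hits

# Made-up sample text, used only to show the output shape.
sample = "The whale surfaced at dawn. Nobody aboard had seen the whale before."
for left, word, right in kwic(sample, "whale", window=3):
    # Right-align the left context so the keyword lines up in a column.
    print(f"{left:>25} [{word}] {right}")
```

Lining the keyword up in a single column is what makes recurring patterns on either side easy to spot by eye.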

Lexical co-occurrence or collocation: Observes clusters of terms which are likely to appear together in a given population, based on statistical relationships. This is good for getting a sense of 'aboutness' for a specific term or population, or for detecting specific word associations. There are lots of different methods for doing this (see Pecina 2009 on collocation metrics for details). Topic modeling is based on this principle.

Word Vector modeling: Like lexical co-occurrence, this looks for terms which are likely to appear together in a given population, and projects terms into multi-dimensional space to model semantic relationships between words at scale. This is especially good for discovering how texts' uses of words relate to each other.
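
To give a feel for what "projecting terms into multi-dimensional space" means, the sketch below builds simple count-based vectors, where each dimension is a neighboring word, and compares them with cosine similarity. This is a deliberately simplified stand-in and not how trained word-embedding models work internally; the function names, window size, and toy sentences are assumptions for illustration.

```python
import math
import re
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy corpus, invented for this example.
sentences = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases mice",
    "the dog chases cars",
]
vectors = cooccurrence_vectors(sentences)
print("cat ~ dog :", cosine(vectors["cat"], vectors["dog"]))
print("cat ~ milk:", cosine(vectors["cat"], vectors["milk"]))
```

Words that are used in similar contexts ("cat" and "dog") end up closer together than words that merely appear in the same sentence ("cat" and "milk").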

N-grams: Observes sequences of terms which definitely appear together, one after another, in a given population. This is very good for identifying common phrases in a particular genre and stylistic features which are unique to a specific author.
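
Counting n-grams takes only a few lines of code. The following is a minimal sketch: the function name `ngram_counts`, the simple tokenizer, and the sample sentence are assumptions for illustration.

```python
import re
from collections import Counter

def ngram_counts(text, n=2):
    """Count every run of n consecutive words in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    # zip over n staggered copies of the token list yields consecutive n-tuples.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)

sample = "It was the best of times, it was the worst of times."
trigrams = ngram_counts(sample, n=3)
for gram, count in trigrams.most_common(3):
    print(" ".join(gram), count)
```

Changing `n` moves between bigrams, trigrams, and longer sequences; longer sequences are rarer, so they need larger corpora to show meaningful repetition.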

Keyness: Sets up a comparison between Set A and Set B: does this word appear MORE or LESS frequently in Set A when compared with Set B, using a statistical measure called Log-likelihood (
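
The comparison described above can be sketched as follows. This uses a two-term form of the log-likelihood (G2) statistic that is common in corpus linguistics; whether a given tool uses exactly this form is an assumption, and the function names and example counts are invented for illustration.

```python
import math

def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Log-likelihood for one word, given its count and the total
    number of words in Set A and in Set B."""
    total = size_a + size_b
    # Expected counts if the word were spread evenly across both sets.
    expected_a = size_a * (freq_a + freq_b) / total
    expected_b = size_b * (freq_a + freq_b) / total
    score = 0.0
    if freq_a > 0:
        score += freq_a * math.log(freq_a / expected_a)
    if freq_b > 0:
        score += freq_b * math.log(freq_b / expected_b)
    return 2 * score

def keyness(freq_a, size_a, freq_b, size_b):
    """Return the score and whether the word is relatively more or less
    frequent in Set A than in Set B."""
    rate_a, rate_b = freq_a / size_a, freq_b / size_b
    if rate_a > rate_b:
        direction = "more"
    elif rate_a < rate_b:
        direction = "less"
    else:
        direction = "same"
    return log_likelihood(freq_a, size_a, freq_b, size_b), direction

# Invented counts: a word appears 50 times in 1,000 words of Set A
# and 10 times in 1,000 words of Set B.
score, direction = keyness(50, 1000, 10, 1000)
print(round(score, 2), direction)
```

The larger the score, the less likely it is that the difference in frequency between the two sets is due to chance. The score alone does not say which set uses the word more, which is why the direction is reported separately.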

Most Frequent Word Analysis: finds the most frequent terms of a given population, and highlights the small function words which make up the bulk of language. Good for identifying unique stylistic fingerprints and for keeping track of presence and absence.
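
A most-frequent-word list is the simplest of these counts to produce. The sketch below assumes a basic tokenizer and an invented sample sentence; the function name `most_frequent_words` is hypothetical.

```python
import re
from collections import Counter

def most_frequent_words(text, top_n=10):
    """Return (word, count, share of all words) for the top_n words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    counts = Counter(tokens)
    return [(word, n, n / total) for word, n in counts.most_common(top_n)]

sample = "The cat sat on the mat and the dog sat on the rug."
for word, count, share in most_frequent_words(sample, top_n=3):
    print(f"{word:<5} {count:>3} {share:.1%}")
```

Reporting the share of all words alongside the raw count makes it possible to compare texts of different lengths.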