Week 5: Textual Analysis

This week we took a look at some neat tools for analyzing text, one of them being the Google Ngram Viewer. This tool from Google lets you compare trends by graphing how often a word or phrase (an n-gram) appeared in a corpus of books. You can select your time frame and the language of the books you're searching through, and you can even view multiple n-grams on the graph at once. The example below shows a graph comparing Frankenstein, Sherlock Holmes, and Albert Einstein between 1800 and 2000.
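Under the hood, an n-gram is just a sliding window of n consecutive words, and the viewer is plotting how often each window occurs. Here's a minimal sketch of that counting idea in Python (a toy illustration with a made-up sentence, not Google's actual pipeline):

```python
from collections import Counter

def ngrams(text, n):
    """Split text into words and return each run of n consecutive words."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

# Toy "corpus" standing in for millions of digitized books.
corpus = "it was a dark and stormy night and the night was long"
bigram_counts = Counter(ngrams(corpus, 2))
print(bigram_counts.most_common(3))
```

Google's version does this at a massive scale and bins the counts by publication year, which is what produces the curves on the graph.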

N-gram viewer


A downside of the Ngram Viewer is that you can't actually see which books the phrases are coming from. For example, Sherlock Holmes spiked in the 1930s; if you were doing research on this topic, you might find it useful to view a list of the books that mentioned Sherlock Holmes. A site that does let you see a list of the articles containing a given phrase is Mining the Dispatch, a project that looks at articles in the Richmond Daily Dispatch to explore the social and political life of Richmond, Virginia during the Civil War. Once you've chosen a topic (e.g. "Fugitive Slave Ads"), you can see a list of the articles that best exemplify it. The author of Mining the Dispatch, Robert K. Nelson, has also given his own interpretation of each trend.

I think the Ngram Viewer and Mining the Dispatch are both useful, and each has its own pros and cons. A pro of the Ngram Viewer is that your results come from the millions of books in Google Books. A pro of Mining the Dispatch is that it doesn't just give you a graph, but also an interpretation of that graph and ways to explore further. I can't really hold the small scale of Mining the Dispatch's sources against it, since the whole purpose of the site is to focus on one newspaper. Still, a next step for that project would be to expand and search through other resources from the Civil War era, just as a next step for the Ngram Viewer would be letting you view the sources behind its analysis.


This week we also looked at an article from Science, "Quantitative Analysis of Culture Using Millions of Digitized Books." The authors constructed a corpus of texts that they claim amounts to about 4% of all books ever printed, and they used it to analyze cultural changes over the years. I think the concept is really cool, and they have a good point: if you looked at all the books written last year, you would get an idea of what was important and what was going on.
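The basic measurement behind this kind of analysis is simple: for each year, divide how often a word appears by the total number of words published that year, so that years with more books don't automatically look more interested in everything. A toy sketch of that idea (the yearly "corpora" here are made-up one-liners, not the authors' actual data):

```python
# Made-up stand-ins for everything published in each year.
corpus_by_year = {
    1900: "the telegraph was the wonder of the age",
    1950: "the television changed the american home",
    2000: "the internet changed everything once again",
}

def relative_frequency(word, text):
    """Fraction of all words in `text` that are exactly `word`."""
    words = text.lower().split()
    return words.count(word.lower()) / len(words)

# Plotting these values against the years gives a trend line
# like the ones in the paper's graphs.
for year, text in sorted(corpus_by_year.items()):
    print(year, relative_frequency("television", text))
```

Even this toy version shows the limitation discussed below: the number tells you a word rose or fell, but nothing about why.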

My problem with this project is that they are only looking at 4% of the books, and I'm not sure how representative that sample can be. I know it's practically impossible to have every single book digitized, but one day we'll be there. Even when that day comes, though, we'll still face a lack of context. These graphs are cool, but they really just show us which words were used; they don't give us much insight into the context in which they were used. The authors claim that studying these graphs gives us a new kind of science, culturomics, through which they will be able to reconstruct the past. I agree we'll get a good sense of how trends changed, but will we know why cultural trends changed? Will we know the significance of those changes? I don't think we will, and we won't be able to properly understand the past without knowing both the what and the why.
