Data Mining

Data mining is one of those buzzword concepts that doesn’t seem to immediately apply to the humanities, let alone History.  Typically, references to data mining pertain to corporations like Facebook or Google using personal information for commercial gain, or occasionally to political groups wielding it for semi-nefarious purposes.  But just as atomic energy can be put towards both productive and destructive ends, data mining can be put to good use, and the humanities have found uses for this incredibly powerful tool.

A marquee digital humanities project that engages in data mining is the one that catalogued the tattoos of convicted persons in 19th Century Britain.  A truly vast endeavor within the Digital Panopticon, the project saw Zoe Alker and her team use criminal justice records to create a database of tattoos.  They employed pattern-recognition and learning algorithms to “chunk” and categorize the data into a machine-readable format to which they could apply human interpretation.  Alker has stated that the portion of the project requiring the greatest interpretation was categorizing the tattoos into themes such as love, religion, national identity, sex, pleasure, and names/initials.  Simple (but most certainly not valueless) analyses can be applied to the data in this state: the prevalence of tattoos in the total population, the absolute numbers and relative proportions of each theme, whether men or women were more likely to get a tattoo pertaining to “love,” etc.  Next, the team began to take into account the collocation of tattoos; an example to which Alker returns in both her writing and her conference presentation (full video below) is a glut of designs on men commemorating Buffalo Bill’s exposition in London that sit immediately alongside romantic symbols or the names of sweethearts.
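To make the "simple analyses" above concrete, here is a minimal sketch in Python. It is not the Digital Panopticon team's method or data: the records are invented for illustration, and the field names (`sex`, `themes`) and function names are assumptions. It shows prevalence, per-theme counts, and which themes appear together on the same person.

```python
from collections import Counter
from itertools import combinations

# Hypothetical, hand-made records. NOT data from the Digital Panopticon.
records = [
    {"sex": "m", "themes": ["love", "names/initials"]},
    {"sex": "f", "themes": ["love"]},
    {"sex": "m", "themes": ["religion", "love"]},
    {"sex": "m", "themes": []},  # a person with no recorded tattoos
    {"sex": "f", "themes": ["national identity", "names/initials"]},
]

def tattoo_prevalence(records):
    """Share of people with at least one recorded tattoo theme."""
    return sum(1 for r in records if r["themes"]) / len(records)

def theme_counts(records):
    """Number of people bearing each theme (each person counted once per theme)."""
    return Counter(theme for r in records for theme in set(r["themes"]))

def theme_pairs(records):
    """Number of people on whom each pair of themes appears together."""
    pairs = Counter()
    for r in records:
        for a, b in combinations(sorted(set(r["themes"])), 2):
            pairs[(a, b)] += 1
    return pairs

print(tattoo_prevalence(records))
print(theme_counts(records).most_common())
print(theme_pairs(records).most_common())
```

Note that this counts themes appearing on the same person, which is a cruder measure than designs sitting next to each other on the body; the latter would need placement information in each record.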

Good, qualitative analysis can now be applied.  The team looked at who got tattoos (everybody), where they were placed on the body (“public” areas), the most prevalent themes (love, loved ones, and simple pleasures), and how all of the above changed over a century (they increased in popularity).  From these, we learn that tattoos seem to have been relatively well-accepted, meant to be seen, and commemorative of “positive” themes.  More importantly, however, we are able to gain a window into the lives of ordinary folk, a class sadly underrepresented in archives, literature, and traditional scholarship.  In this necessarily transitory pictorial record of culture, the researchers found the ordinary concerns of working folk.  They “appeared to wear tattoos for much the same purpose as today — commemorating their loved ones and family and the pleasures of working-class everyday life.”[1]

A project which involved less organizing of the data and more playing with it was that of a Harvard-led team working with an early version of Google’s Ngram Viewer.  In searching a database of five million digitized books, the team found simple points of interest like the linguistic flip from “throve” to “thrived” as the past tense of “thrive,” the relative fame of certain people, and how much people write about various years.  As above, however, the interesting part happens when one starts asking questions of the data.  For instance, sudden otherwise-unexplained drops or surges in popularity of an author can indicate government suppression or promotion and it seems that we stop writing about past years sooner nowadays than we used to.  These kinds of insights serve as starting points or tools for additional research.
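The core measurement behind a comparison like “throve” versus “thrived” is a word's relative frequency per year. The following is a rough sketch of that idea only, not Google's implementation: the four-sentence corpus is made up, and the function name `relative_frequency` is an assumption.

```python
from collections import Counter

# Hypothetical toy corpus of (year, text) pairs, invented for illustration.
corpus = [
    (1850, "the colony throve and the town throve"),
    (1850, "trade thrived"),
    (1950, "the business thrived and the city thrived"),
    (1950, "nothing throve here at all"),
]

def relative_frequency(corpus, word):
    """Occurrences of `word` divided by all tokens, computed per year."""
    hits = Counter()
    totals = Counter()
    for year, text in corpus:
        tokens = text.lower().split()  # naive whitespace tokenization
        totals[year] += len(tokens)
        hits[year] += tokens.count(word)
    return {year: hits[year] / totals[year] for year in sorted(totals)}

for word in ("throve", "thrived"):
    print(word, relative_frequency(corpus, word))
```

Dividing by the total number of tokens in each year matters because far more text survives from some years than from others; raw counts alone would mostly track how much was printed.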

Data mining is well suited to finding and describing networks.  The project “Six Degrees of Francis Bacon” is one that does just that.  The initial visual when viewing this project is an overwhelming mess of connected lines hopping betwixt various notables, but after reading the tutorial, the power of the tool is revealed.  One can query it for everything from shared acquaintances to how people were related to each other, filter connections by profession, and view timelines of correspondence.  Analyses looking at how groups interact (or not) with other in- or out-groups, the extent to which various figures may have known one another, and a variety of other such research projects are enabled by tools such as this.
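Queries such as “shared acquaintances” and “degrees of separation” come down to simple operations on a graph of people. Here is a minimal sketch under stated assumptions: the people and links are placeholders rather than anything from Six Degrees of Francis Bacon, and the function names are hypothetical, not that project's API.

```python
from collections import defaultdict, deque

# Hypothetical acquaintance links between placeholder people.
edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("D", "E")]

def build_graph(edges):
    """Undirected graph: each person maps to the set of people they knew."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph

def shared_acquaintances(graph, a, b):
    """People directly linked to both a and b."""
    return graph.get(a, set()) & graph.get(b, set())

def degrees_of_separation(graph, start, goal):
    """Fewest links between two people (breadth-first search), or None."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        person, distance = queue.popleft()
        if person == goal:
            return distance
        for neighbour in graph.get(person, set()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, distance + 1))
    return None  # no chain of acquaintances connects the two

graph = build_graph(edges)
print(shared_acquaintances(graph, "A", "D"))
print(degrees_of_separation(graph, "A", "E"))
```

A real historical network would also attach dates, relationship types, and some measure of confidence to each link, which is where the human interpretation comes back in.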

Other projects require supercomputers.  In order to adequately analyze the 800,000 documents in the HathiTrust and JSTOR databases to create a picture of the experience of black women from the Eighteenth to Twentieth Centuries, researchers pulled a selection of 20,000 and created a model of them using the Blacklight supercomputer.  As with other projects, this yielded more data to be sorted through.  Researchers found connections between the Black Women’s Club and the New Negro movement along with inroads to several other questions.  The project required collaboration between scholars of different disciplines, and one team member’s realization was particularly gratifying (at least, from the humanist perspective):

“Humanities and social science researchers have to be worried about not just what the numbers mean at a surface level. They have a whole theory behind how you go about interpreting things as they relate to the larger society”

-Mark Van Moer, quoted in Ken Chiacchia and Aaron Dubrow, “Rescued History,” News, National Science Foundation, accessed November 30, 2020, https://www.nsf.gov/discoveries/disc_summ.jsp?cntn_id=137797&org=NSF.

Organized data is rarely the final answer.  It is, however, frequently the avenue to new answers.  We are fortunate to be in an era where the tools at our disposal are keeping pace with our desire to ask new questions of the past.  Where our traditional sources for producing history have failed us, Big Data seems to offer ways to fill some of those gaps.  Data mining is one of the more useful additions to historical methodology; in that spirit, hi-ho, hi-ho, off to work we go!


[1] Zoe Alker and Robert Shoemaker, “Convict Tattoos,” Digital Panopticon, accessed November 29, 2020, https://www.digitalpanopticon.org/Convict_tattoos.

3 Replies to “Data Mining”

  1. Lol, “semi-nefarious” is my new favorite term. Excellent and in-depth survey of this week’s projects, well written and thorough. I’m very skeptical of Google’s N-Gram tool, not least because the presenter casually mentioned discarding a few million scans that were low-quality. So yes, a potential indicator for future research, but not much of a systematic effort. (certainly it pales next to the Digital Panopticon or some other databases) I think this is the issue alluded to in the Minsky article: historians want to be multi-disciplinarian, but most tech tools (and funding) remain future-oriented and easily misapplied when it comes to history (looking at you, semi-nefarious State Dept.)

  2. “For instance, sudden otherwise-unexplained drops or surges in popularity of an author can indicate government suppression or promotion and it seems that we stop writing about past years sooner nowadays than we used to. ”

    I had questions about this conclusion. Having spent the better part of my adult life writing for public consumption (both as a journalist and a marketer), I’m not at all sure “suppression” is the only explanation for a significant differential between the expected trajectory and the actual one. What about, for example, changes in terminology used (check out global warming vs. climate change or GLBT vs. LGBT), public frenzies that come in waves (literally who gives a f— about Ben Affleck except when he dates someone hot), or just what’s relevant when (e.g.”world war”). The speakers are obviously right in the context they point out, but that doesn’t necessarily mean the same cause can be assumed for every other case.

  3. As always, Neil, you absolutely killed this blog post! Great attention to detail. You have a great narrative to your posts, very well-structured and I love this background for your website. What theme did you use? Just curious. The Snow White video at the end by the way is awesome. I like the light-hearted touch to your post. With these terrible times, and it being Finals and all…every little light-hearted post and thing I see is very uplifting. I really like your intro, it’s very engaging:
    “Data mining is one of those buzzword concepts that doesn’t seem to immediately apply to the humanities, let alone History. ”

    Great job on the post!
