Header Ads

Python, NLTK, and the Digital Humanities: Finding Patterns in Gothic Literature

If someone says “gothic” to you, do you think of the lush rolling countryside or a sunny day?

Chances are you don’t. Most people – myself included – associate that word with the dark, mysterious, and even frightening. Maybe you picture ornate stone architecture of a castle with gargoyles. Or perhaps foreboding skies rolling over said castle. Or very morose, pale people wearing black capes and veils. Or vampires with all of the above.

About a year ago, Caroline Winter, a PhD student at the University of Victoria, emailed me with a question. She had assembled a corpus of 134 works of European Gothic literature that had been written or translated into English, ranging from the 18th century to the early 20th. Caroline had a hunch that gothic literature was more vibrant than most people thought, but lacked the quantitative background to analyze her large data set. Could I write a short script to count and analyze color words within her corpus? This post details my first experience with the digital humanities — applying simple computational tools programmers use every day to the data relevant to traditional humanities disciplines.

Originally a quick Python project for a Sunday afternoon, my journey attempting to answer Caroline’s question eventually turned into a talk at PyCon. Through some pretty straightforward counting and matching techniques, we were able to find several interesting patterns that challenged my gloomy picture of “gothic”. To follow along using Phantom of the Opera as an example text, take a look at the companion Jupyter Notebook on Github.

Beyond black & white

The first step in the project was to define which color words we were looking for. The challenge here was that both the vocabulary used to describe color and the actual coloring of objects themselves were different in the gothic era than in present day.

Rather than guess about historical color words, we turned to the Oxford English Dictionary’s Historical Thesaurus (hereafter the Historical Thesaurus). It lists color words used in English and primarily in Europe, the year of each one’s first recorded use, and its color family.

After adding some html color names based on color grouping to our csv file of the original data set I read a csv file with the Historical Thesaurus data into a short function and eliminated everything that came into usage after 1914, since it’s not clear from the data when words fell out of usage.
def id_color_words(): """ Gets color words from the csv file and puts them into a dict where key = word and value = (hex value, color family); removes pre-1914 color words. """ color_word_dict = {} modern_color_words = [] color_data = csv.DictReader(open('./color_names.csv'), delimiter=",", quotechar='"') for row in color_data: name = row['Colour Name'].lower() year = int(row['First Usage']) if ' ' not in name: if year < 1914: family = row['Colour Family'].lower() hex_value = row['hex'].lower() color_word_dict[name] = (hex_value, family) else: modern_color_words.append((year, name)) return color_word_dict, modern_color_words

This gave us a dictionary of 980 pre-WWI color words ranging from the familiar, like blue (first usage in 1300), crimson (1416), or jet (1607), to the uncommon, like corbeau (1810, dark green), damask (1598, pink) or ochroid (1897, pale yellow). There were also some instances where the way words were categorized reflected a historical state of familiar things. For example, ‘glass’ is categorized as a greyish green, not pale blue or clear as we may think of it today.

Now we knew what we were looking for, but generating an accurate analysis was about more than simply counting these color words.

‘rose’ != ‘rose’ != ‘rose’

English is a tricky language, with many words that sound the same meaning different things and many words that look the same meaning different things depending on their context. ‘Rose’ is a great example: it can be a noun, adjective, or verb, as demonstrated in the gif below.

So which words should we count? Should every word on the list be included?

To make this decision, we needed to write more code to parse our corpus and look at the results.

I used the function below to get the text ready for analysis. It does three things. First, it reads in the .txt file for the work we’re analyzing. Then, the function removes the “gristle” of stop words, punctuation, and upper case letters. Finally, it leverages the pos_tag function from the Natural Language Toolkit (NLTK) to tag each remaining word as a part of speech (noun, verb, adjective, etc.).

def process_text(filename): """ This function generates a list of tokens with punctuation stopwords, and spaces removed for the whole text. It also applies NLTK's part of speech tagging function to determine if words are nouns, adjectives, verbs, etc. """ text_tokens = [] # create a list of the words and punctuation we want to remove before analyzing stop_punc = set(stopwords.words('english') + [*string.punctuation] + ["''", '``']) with open(filename) as text: for row in text: # puts everything in lowercase, postion tags for token in pos_tag(word_tokenize(row.lower())): #removes tokens if contains punctuation or stopwords if token and token[0] not in stop_punc: text_tokens.append(token) return text_tokens

This function outputs the whole text that looks like this – as you can see the NLTK pos_tag doesn’t look like it gets the part of speech correct every time, but it’s pretty close.

[('dying', 'JJ'),
('injunction', 'NN'),
('forbidden', 'VBN'),
('uncle', 'NN'),
('allow', 'VB'),
('embark', 'VB'),
('seafaring', 'JJ'),
('life', 'NN'),
('visions', 'NNS'),
('faded', 'VBD'),
('perused', 'VBD'),
('first', 'JJ'),
('time', 'NN'),
('poets', 'NNS'),
('whose', 'WP$'),
('effusions', 'NNS'),
('entranced', 'VBD'),
('soul', 'NN'),
('lifted', 'VBD'),
('heaven', 'VB')]

Next, we needed to isolate the color words from the text and do some analysis of the context to make sure there weren’t any glaring issues in the data we were generating. Here Caroline’s literature background was extremely helpful in identifying what looked inaccurate, and I went off to pull out the context of the suspicious words so she could make a final call.

  • Isabella, a yellowish color that was also the name of a couple of characters in our corpus;
  • Imperial, a purple color that in the texts actually meant the political structure, not the color; and
  • Angry, sometimes used to describe a red-pink flushed color, but was used more often as an emotion word than a color word.

At this stage, I also experimented with stemming and lemmatizing the color words in our master list and in the texts themselves to see if that changed how many color words we were finding, rather than looking for exact matches. What this means, for example, is transforming the word “whitish” from the Historical Thesaurus to its root, or stem (“white”), and doing the same to the words in the text we were analyzing. However, because the Historical Thesaurus is so comprehensive and already included many forms of each word, the results didn’t change much and we decided to leave this step out.

Looking at the preliminary data, we also found that we got some combinations of color words, like “rose” followed by “red” or “milky” followed by “white”. While the Historical Thesaurus covers common combinations of these when they’re joined with a “-” (e.g. “rose-red”) we decided to isolate those examples in the output of the find_color_words to help us determine if we wanted to exclude those samples from the final analysis.

Analysis & Visualization – the (really) fun part

With adjustments made to the color word list, we can run the tagged text through the find_color_words function below and see both the concurrent color words and the full list. To do this, the code below leverages Python’s itertools with a couple of helper functions: pairwise and is_color_word.
def pairwise(iterable): """ Returns a zip object, consisting of tuple pairs of words where the second word of tuple #1 will be the first word of tuple #2, e.g. [('room', 'NN'), ('perfume', 'NN'), ('lady', 'NN'), ('black', 'JJ')] from our `processed` variable becomes: [(('room', 'NN'), ('perfume', 'NN')) (('perfume', 'NN'), ('lady', 'NN')) (('lady', 'NN'), ('black', 'JJ')) (('black', 'JJ'), ('contents', 'NNS')] """ a, b = tee(iterable) next(b, None) return zip(a, b) def is_color_word(word, color_dict): """ Compares at each word, tag tuple in `processed` to both the color dict and the allowed tags for adjectives and nouns """ color, tag = word tags = {'JJ', 'NN'} # JJ = adjectives, NN = nouns return tag in tags and color in color_dict def find_color_words(t, color_dict): """ Leverages the previous two functions to identify the color words in the text and look for concurrent color words (e.g. 'white marble'), then returns each in a separate list. """ color_words = [] concurrent_color_words = [] for o in t: if is_color_word(o, color_dict): color_words.append(o) for p, n in pairwise(t): if is_color_word(p, color_dict) and is_color_word(n, color_dict): concurrent_color_words.append((p, n)) return color_words, concurrent_color_words


Here’s what we get from this function.

First, a list of all of the identified color words in the text and their tag, like this:

[('yellow', 'JJ'), ('black', 'JJ'), ('mourning', 'NN'), ('rose-red', 'JJ'), ('lily-white', 'JJ'), ('black', 'JJ'), ('black', 'JJ'), ('black', 'JJ'), ('white', 'JJ'), ('yellow', 'NN'), ('plum', 'NN'), ('glass', 'NN'), ('red', 'JJ'), ('coral', 'JJ'), ('pink', 'NN'), ('iron', 'NN'), ('glass', 'NN'), ('pink', 'JJ'), ('candid', 'JJ'), ('blue', 'JJ')]

Second, we get a list of tuples containing the color words that were adjectives or nouns closely followed by another adjective or noun in the original text. From The Phantom of the Opera, we get examples like:

(('glass', 'NN'), ('champagne', 'NN')) (('pink', 'NN'), ('white', 'JJ')) (('gold', 'NN'), ('purple', 'NN')) (('water', 'NN'), ('bluey', 'NN'))

In most cases we didn’t think one of these took anything away from or obscured the other; in fact their close association often painted a clearer picture of color texture. So we left both words in.

From this you can get some summary stats, like what percentage of all uncommon words in the text were color words (Phantom is 0.9%), and what proportion are nouns versus adjectives (Phantom is 52-47).

But the really fun part is using those HTML color groups to plot the use of color in the text.

The Jupyter Notebook contains a couple of examples with matplotlib that are really straightforward to implement, like this bar chart showing the colors used in The Phantom of the Opera. Kite created a Github repository here where you can access the code from this and other posts on their blog.

There are many of interesting options for visualizing this data. The original talk included a website, built with the Django framework, ChartJS, and lots of CSS – online here – where we visualized each
book as a series of color blocks in their order of appearance.

Even with the limitations of HTML color applied to a broad palette,you’ll see that a lot of the books are not as dark and gloomy as their “gothic” label might lead you to believe. This makes sense: the supernatural is a strong theme in Gothic literature, but so is contrasting it with the beauty of the natural world that was considered both a haven and a dwindling reality during the dawn of the industrial revolution.

Note: This Article Originally Appeared On Kite

1 comment:


    We have been having recent complains from individuals about how they lost money 💵 to SPAMMERS who call themselves HACKERS or BANK LOAN OFFERS. They are all over the internet sharing false testimonies. Please do not fall for their lies for this is just a way to LURE you to them.

    They say lies in the likes of such-:
    ▪️Bitcoin Auctioning ▪️Western Union Hack
    ▪️Blank Credit Card ▪️Clearing Criminal Records
    ▪️Loan Offers. ▪️Bank Account Loading
    ▪️Changing University Grades & so on.
    These are all lies and you shouldn’t fall for them.

    🏵GLOBAL PLUGGERS🏵 is here to help you Recover all your Money 💵 that you have been Ripped of.
    We are a group of Computer💻 Experts who are memebers of the “HACKERONE” Forum. We have dedicated ourselves to help Victims of these SCAM(s) recover all the Money that has been taken falsely from them.

    If you have been a victim of thes Thieves, then you need to contact us as soon as possible so you can get your money back.
    Email-: globalpluggers@gmail.com
    No. +1 (808) 600 0773 ( Number also available on WhatsApp)

    Please know that we do not charge you for Fund Recovery Service, Our Funds Recovery Service is to help and so it’s Free.

    We also provide Legit Hacking Services such as-:
    🔸Phone Hacking/Cloning
    🔸Email Hacking & Password Recovery
    🔸Social Media Hacking & Passowrd Recovery
    🔸Deleted Files Recovery 🔸Mobil Tracking
    🔸Virus detection & Elimination.

    Email globalpluggers@gmail.com
    No. +1 (808) 600 0773 (number also available on WhatsApp)


Powered by Blogger.