Python, NLTK, and the Digital Humanities: Finding Patterns in Gothic Literature



If someone says “gothic” to you, do you think of the lush rolling countryside or a sunny day?


Chances are you don’t. Most people – myself included – associate that word with the dark, mysterious, and even frightening. Maybe you picture ornate stone architecture of a castle with gargoyles. Or perhaps foreboding skies rolling over said castle. Or very morose, pale people wearing black capes and veils. Or vampires with all of the above.

About a year ago, Caroline Winter, a PhD student at the University of Victoria, emailed me with a question. She had assembled a corpus of 134 works of European Gothic literature that had been written or translated into English, ranging from the 18th century to the early 20th. Caroline had a hunch that gothic literature was more vibrant than most people thought, but lacked the quantitative background to analyze her large data set. Could I write a short script to count and analyze color words within her corpus? This post details my first experience with the digital humanities — applying simple computational tools programmers use every day to the data relevant to traditional humanities disciplines.

Originally a quick Python project for a Sunday afternoon, my journey attempting to answer Caroline’s question eventually turned into a talk at PyCon. Through some pretty straightforward counting and matching techniques, we were able to find several interesting patterns that challenged my gloomy picture of “gothic”. To follow along using The Phantom of the Opera as an example text, take a look at the companion Jupyter Notebook on GitHub.


Beyond black & white


The first step in the project was to define which color words we were looking for. The challenge here was that both the vocabulary used to describe color and the actual coloring of objects themselves were different in the gothic era than in present day.


Rather than guess about historical color words, we turned to the Oxford English Dictionary’s Historical Thesaurus (hereafter the Historical Thesaurus). It lists color words used in English and primarily in Europe, the year of each one’s first recorded use, and its color family.


After adding some HTML color names, based on color grouping, to our CSV file of the original data set, I read the CSV file with the Historical Thesaurus data into a short function and eliminated every word first used in 1914 or later. We filtered on first usage only, since it’s not clear from the data when words fell out of usage.

import csv


def id_color_words():
    """
    Gets color words from the csv file and puts them into a dict where
    key = word and value = (hex value, color family); leaves out color
    words first used in 1914 or later.
    """
    color_word_dict = {}
    modern_color_words = []
    # open the file in a with block so the handle is closed when we finish
    with open('./color_names.csv', newline='') as csv_file:
        color_data = csv.DictReader(csv_file, delimiter=",", quotechar='"')
        for row in color_data:
            name = row['Colour Name'].lower()
            year = int(row['First Usage'])
            # skip multi-word names, since we match one token at a time
            if ' ' not in name:
                if year < 1914:
                    family = row['Colour Family'].lower()
                    hex_value = row['hex'].lower()
                    color_word_dict[name] = (hex_value, family)
                else:
                    modern_color_words.append((year, name))
    return color_word_dict, modern_color_words


This gave us a dictionary of 980 pre-WWI color words ranging from the familiar, like blue (first usage in 1300), crimson (1416), or jet (1607), to the uncommon, like corbeau (1810, dark green), damask (1598, pink) or ochroid (1897, pale yellow). There were also some instances where the way words were categorized reflected a historical state of familiar things. For example, ‘glass’ is categorized as a greyish green, not pale blue or clear as we may think of it today.


Now we knew what we were looking for, but generating an accurate analysis was about more than simply counting these color words.


‘rose’ != ‘rose’ != ‘rose’

English is a tricky language, with many words that sound the same meaning different things and many words that look the same meaning different things depending on their context. ‘Rose’ is a great example: it can be a noun, adjective, or verb, as demonstrated in the gif below.






So which words should we count? Should every word on the list be included?


To make this decision, we needed to write more code to parse our corpus and look at the results.


I used the function below to get the text ready for analysis. It does three things. First, it reads in the .txt file for the work we’re analyzing. Then, it lowercases and tokenizes each line and leverages the pos_tag function from the Natural Language Toolkit (NLTK) to tag each word as a part of speech (noun, verb, adjective, etc.). Finally, the function removes the “gristle” of stop words and punctuation.

import string

# NLTK's tokenizer, tagger, and stopword data must be downloaded first with nltk.download()
from nltk import pos_tag, word_tokenize
from nltk.corpus import stopwords


def process_text(filename):
    """
    This function generates a list of tokens with punctuation, stopwords,
    and spaces removed for the whole text. It also applies NLTK's part of
    speech tagging function to determine if words are nouns, adjectives,
    verbs, etc.
    """
    text_tokens = []
    # create a set of the words and punctuation we want to remove before analyzing
    stop_punc = set(stopwords.words('english') + [*string.punctuation] + ["''", '``'])
    with open(filename) as text:
        for row in text:
            # puts everything in lowercase, then applies part-of-speech tags
            for token in pos_tag(word_tokenize(row.lower())):
                # drops tokens that are punctuation or stopwords
                if token and token[0] not in stop_punc:
                    text_tokens.append(token)
    return text_tokens

This function returns the whole text as a list of (word, tag) tuples that looks like this. As you can see, NLTK’s pos_tag doesn’t get the part of speech correct every time, but it’s pretty close.

[('dying', 'JJ'),
('injunction', 'NN'),
('forbidden', 'VBN'),
('uncle', 'NN'),
('allow', 'VB'),
('embark', 'VB'),
('seafaring', 'JJ'),
('life', 'NN'),
('visions', 'NNS'),
('faded', 'VBD'),
('perused', 'VBD'),
('first', 'JJ'),
('time', 'NN'),
('poets', 'NNS'),
('whose', 'WP$'),
('effusions', 'NNS'),
('entranced', 'VBD'),
('soul', 'NN'),
('lifted', 'VBD'),
('heaven', 'VB')]


Next, we needed to isolate the color words from the text and do some analysis of the context to make sure there weren’t any glaring issues in the data we were generating. Here Caroline’s literature background was extremely helpful in identifying what looked inaccurate, and I went off to pull out the context of the suspicious words so she could make a final call. The words in question included:

  • Isabella, a yellowish color that was also the name of a couple of characters in our corpus;
  • Imperial, a purple color that in the texts actually meant the political structure, not the color; and
  • Angry, sometimes used to describe a red-pink flushed color, but was used more often as an emotion word than a color word.
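
Pulling out the context of a suspicious word can be done with a simple keyword-in-context lookup. What follows is a minimal sketch, assuming the text has already been split into a list of lowercase tokens. The function name `keyword_in_context`, the window size, and the sample sentence are all illustrative and not taken from the original notebook.

```python
def keyword_in_context(tokens, keyword, window=3):
    """Return each occurrence of `keyword` with up to `window` tokens on either side."""
    contexts = []
    for i, token in enumerate(tokens):
        if token == keyword:
            start = max(0, i - window)
            contexts.append(" ".join(tokens[start:i + window + 1]))
    return contexts


# Made-up sample sentence, not taken from the corpus
sample = "the imperial court sent isabella to the imperial palace".split()
for line in keyword_in_context(sample, "imperial", window=2):
    print(line)
```

Reading a handful of these snippets per word is usually enough for a domain expert to decide whether the word is being used as a color or as something else.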

At this stage, I also experimented with stemming and lemmatizing the color words in our master list and in the texts themselves to see if that changed how many color words we were finding, rather than looking for exact matches. What this means, for example, is transforming the word “whitish” from the Historical Thesaurus to its root, or stem (“white”), and doing the same to the words in the text we were analyzing. However, because the Historical Thesaurus is so comprehensive and already included many forms of each word, the results didn’t change much and we decided to leave this step out.
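
The comparison described above can be sketched as counting matches with and without normalization. This is an illustration only: `toy_stem` is a deliberately crude, hypothetical suffix stripper standing in for a real stemmer or lemmatizer (such as the ones NLTK provides), and the word lists are made up.

```python
def toy_stem(word):
    """Crude stand-in for a real stemmer: strips a few suffixes. Hypothetical helper."""
    for suffix in ("ish", "ness", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def count_matches(tokens, color_words, normalize=None):
    """Count tokens that match the color list, optionally normalizing both sides."""
    if normalize is None:
        normalize = lambda w: w  # exact matching
    targets = {normalize(w) for w in color_words}
    return sum(1 for t in tokens if normalize(t) in targets)


color_words = {"white", "whitish", "red", "reddish"}  # made-up mini list
tokens = ["whitish", "reds", "redness", "castle", "white"]

exact = count_matches(tokens, color_words)
stemmed = count_matches(tokens, color_words, normalize=toy_stem)
print(exact, stemmed)
```

On this contrived input the normalized count is higher because the mini list is missing forms like “reds”. With a list as thorough as the Historical Thesaurus, the two counts end up much closer, which is why the step was dropped.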


Looking at the preliminary data, we also found that we got some combinations of color words, like “rose” followed by “red” or “milky” followed by “white”. While the Historical Thesaurus covers common combinations of these when they’re joined with a “-” (e.g. “rose-red”), we decided to isolate those examples in the output of the find_color_words function to help us determine if we wanted to exclude those samples from the final analysis.

Analysis & Visualization – the (really) fun part



With adjustments made to the color word list, we can run the tagged text through the find_color_words function below and see both the concurrent color words and the full list. To do this, the code below leverages Python’s itertools with a couple of helper functions: pairwise and is_color_word.

from itertools import tee


def pairwise(iterable):
    """
    Returns a zip object, consisting of tuple pairs of words where the
    second word of tuple #1 will be the first word of tuple #2, e.g.
    [('room', 'NN'), ('perfume', 'NN'), ('lady', 'NN'), ('black', 'JJ'), ('contents', 'NNS')]
    from our `processed` variable becomes:
    [(('room', 'NN'), ('perfume', 'NN')),
     (('perfume', 'NN'), ('lady', 'NN')),
     (('lady', 'NN'), ('black', 'JJ')),
     (('black', 'JJ'), ('contents', 'NNS'))]
    """
    a, b = tee(iterable)
    next(b, None)  # advance the second iterator by one so the pairs overlap
    return zip(a, b)


def is_color_word(word, color_dict):
    """
    Compares each (word, tag) tuple in `processed` to both the color dict
    and the allowed tags for adjectives and nouns
    """
    color, tag = word
    tags = {'JJ', 'NN'}  # JJ = adjectives, NN = nouns
    return tag in tags and color in color_dict


def find_color_words(t, color_dict):
    """
    Leverages the previous two functions to identify the color words in
    the text and look for concurrent color words (e.g. 'white marble'),
    then returns each in a separate list.
    """
    color_words = []
    concurrent_color_words = []
    # first pass: every individual color word
    for o in t:
        if is_color_word(o, color_dict):
            color_words.append(o)
    # second pass: adjacent pairs where both tokens are color words
    for p, n in pairwise(t):
        if is_color_word(p, color_dict) and is_color_word(n, color_dict):
            concurrent_color_words.append((p, n))
    return color_words, concurrent_color_words


Here’s what we get from this function.


First, a list of all of the identified color words in the text and their tag, like this:

[('yellow', 'JJ'), ('black', 'JJ'), ('mourning', 'NN'), ('rose-red', 'JJ'), ('lily-white', 'JJ'), ('black', 'JJ'), ('black', 'JJ'), ('black', 'JJ'), ('white', 'JJ'), ('yellow', 'NN'), ('plum', 'NN'), ('glass', 'NN'), ('red', 'JJ'), ('coral', 'JJ'), ('pink', 'NN'), ('iron', 'NN'), ('glass', 'NN'), ('pink', 'JJ'), ('candid', 'JJ'), ('blue', 'JJ')]

Second, we get a list of tuples pairing each color word (tagged as an adjective or noun) with the color word that immediately follows it in the processed text, that is, once stop words and punctuation have been removed. From The Phantom of the Opera, we get examples like:

(('glass', 'NN'), ('champagne', 'NN')) (('pink', 'NN'), ('white', 'JJ')) (('gold', 'NN'), ('purple', 'NN')) (('water', 'NN'), ('bluey', 'NN'))


In most cases we didn’t think one of these took anything away from or obscured the other; in fact their close association often painted a clearer picture of color texture. So we left both words in.


From this you can get some summary stats, like what percentage of all uncommon words in the text (the words left once stop words are removed) were color words (Phantom is 0.9%), and what proportion are nouns versus adjectives (Phantom is 52-47).
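
Those summary stats take only a few lines to compute. Here is a rough sketch, assuming you have the full list of tagged tokens and the list of color words in the same (word, tag) shape shown earlier. The helper name `color_summary` and the sample data are made up for illustration.

```python
from collections import Counter


def color_summary(tagged_tokens, color_words):
    """Return (percent of tokens that are color words, Counter of their tags)."""
    tag_counts = Counter(tag for _, tag in color_words)
    percent = 100 * len(color_words) / len(tagged_tokens) if tagged_tokens else 0.0
    return percent, tag_counts


# Made-up tagged tokens, not real output from the corpus
tagged = [("lady", "NN"), ("black", "JJ"), ("veil", "NN"), ("pink", "NN"),
          ("walked", "VBD"), ("white", "JJ"), ("marble", "NN"), ("hall", "NN")]
colors = [("black", "JJ"), ("pink", "NN"), ("white", "JJ")]

percent, tag_counts = color_summary(tagged, colors)
print(f"{percent:.1f}% color words; tags: {dict(tag_counts)}")
```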


But the really fun part is using those HTML color groups to plot the use of color in the text.


The Jupyter Notebook contains a couple of examples with matplotlib that are really straightforward to implement, like this bar chart showing the colors used in The Phantom of the Opera. Kite created a GitHub repository here where you can access the code from this and other posts on their blog.
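
The counting behind a chart like that can be sketched as follows. This is not the notebook's code: `count_by_family` and `plot_families` are hypothetical helpers, the sample entries are illustrative rather than Historical Thesaurus data, and the plotting step assumes every color family name is also a color name matplotlib recognizes.

```python
from collections import Counter


def count_by_family(color_words, color_dict):
    """Tally color words by family; color_dict maps word -> (hex value, family)."""
    return Counter(color_dict[word][1] for word, _ in color_words)


def plot_families(family_counts):
    """Draw one bar per color family. Requires matplotlib."""
    import matplotlib.pyplot as plt  # imported here so the counting works without it

    families, counts = zip(*family_counts.most_common())
    # Assumes each family name doubles as a valid matplotlib color name
    plt.bar(families, counts, color=list(families), edgecolor="black")
    plt.ylabel("occurrences")
    plt.show()


# Illustrative entries only, in the same shape id_color_words builds
color_dict = {"black": ("#000000", "black"), "crimson": ("#dc143c", "red"),
              "red": ("#ff0000", "red"), "blue": ("#0000ff", "blue")}
found = [("black", "JJ"), ("crimson", "JJ"), ("red", "NN"), ("black", "JJ"), ("red", "JJ")]

family_counts = count_by_family(found, color_dict)
print(family_counts.most_common())
```

Calling `plot_families(family_counts)` would then draw the chart. Grouping by family rather than by individual word keeps the chart readable when the word list runs to hundreds of entries.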





There are many interesting options for visualizing this data. The original talk included a website, built with the Django framework, ChartJS, and lots of CSS – online here – where we visualized each book as a series of color blocks in their order of appearance.
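
The color-block idea can be approximated without a web framework. The sketch below is not the site's actual Django or ChartJS code. It is a hypothetical helper, `color_blocks_html`, that turns the ordered list of color words into a string of small colored HTML elements, using illustrative hex values.

```python
from html import escape


def color_blocks_html(color_words, color_dict, size=12):
    """Build one small colored <span> per color word, in order of appearance."""
    blocks = []
    for word, _tag in color_words:
        hex_value = color_dict[word][0]  # color_dict maps word -> (hex value, family)
        blocks.append(
            f'<span title="{escape(word)}" style="display:inline-block;'
            f'width:{size}px;height:{size}px;background:{hex_value}"></span>'
        )
    return "".join(blocks)


# Illustrative data only
color_dict = {"black": ("#000000", "black"), "red": ("#ff0000", "red")}
found = [("black", "JJ"), ("red", "NN"), ("black", "JJ")]

blocks_markup = color_blocks_html(found, color_dict)
print(blocks_markup)
```

Saving the returned string into an .html file and opening it in a browser shows the text's colors as a strip, left to right in reading order.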





Even with the limitations of HTML color applied to a broad palette, you’ll see that a lot of the books are not as dark and gloomy as their “gothic” label might lead you to believe. This makes sense: the supernatural is a strong theme in Gothic literature, but so is contrasting it with the beauty of the natural world that was considered both a haven and a dwindling reality during the dawn of the industrial revolution.


Note: This Article Originally Appeared On Kite
