Scraping Wikileaks for Hillary Clinton Emails related to Bangladesh

Mon Aug 08 2016

5 min read

Cumilla, Bangladesh

I wanted to get my feet wet in the scraping world. So, I decided to scrape wikileaks. I am from Bangladesh so I thought let's see what wikileaks has in store for Bangladesh. Hillary Clinton is the next probable president in the US. So, I decided to get the Hillary Clinton emails that are linked to Bangladesh. These emails were made public by wikileaks. With the scraped data, I wish to do some document modeling some other day inshaAllah. So, lets get to work!

Importing necessary libraries

We will import some libraries here to make our life easier along the way.

import requests, os, bs4, webbrowser, re, json
import pandas as pd

After looking at some emails in the wikileaks website, I decided to save the data in the following format.

data = {
    'title' : [] ,
    'from' : [] ,
    'to': [] ,
    'date' : [] ,
    'subject': [],
    'body' : []

}

Scraping wikileaks

There are 8 pages of search results for Hillary Clinton emails related to Bangladesh. So, we open the emails from each page, scrape them, store them in our dictionary then move on to the next page. The following code snippet has comments for all parts, so it's quite self-explanatory. I will leave it at that.

page = 1
base_url = "https://search.wikileaks.org/"
url = 'https://search.wikileaks.org/?query=bangladesh&exact_phrase=&any_of=&exclude_words=&document_date_start=&document_date_end=&released_date_start=&released_date_end=&publication_type%5B%5D=42&new_search=False&order_by=newest_released_date#results'
# There are 8 result pages, so we use this loop to go through each page and scrape.
while (page < 9):
    #set the url and go to it
    print "going to wikileaks for searching, page:", page
    res = requests.get(url)
    res.raise_for_status()


    # get the search result page
    soup = bs4.BeautifulSoup(res.text)
    linkElems = soup.select('.info a')

    #open each page and scrape data
    for i in range(len(linkElems)):

        # get a search result's content
        result_url = linkElems[i].get('href')
        result_html = requests.get(result_url)
        result_html.raise_for_status()
        result_soup = bs4.BeautifulSoup(result_html.text)

        #extract the data
        #title of the document
        title = result_soup.select('.tab-content h2')[0].get_text()
        #body of the mail
        content = result_soup.select('.email-content')[0].get_text()
        content = content.encode('utf-8')
        #Strip unnecessary white spaces
        content = re.sub(r'\s+ ', ' ', content)

        #get the from, to and subject data from header
        header = result_soup.select('#header')[0].get_text()
        # break down the header to from, to, date and subject fields to fit our dictionary format
        #get the sender
        sender = header.splitlines()[1]
        sender = sender.split(':')
        sender = sender[1].encode('utf-8')

        #get the receiver
        receiver = header.splitlines()[2]
        receiver = receiver.split(':')
        receiver = receiver[1].encode('utf-8')

        # get the date time
        dt = header.splitlines()[3]
        dt = dt.split(':')
        dt = dt[1].encode('utf-8') + dt[2].encode('utf-8')

        # get the subject
        subject = header.splitlines()[4]
        subject = subject.split(':')
        subject = subject[1].encode('utf-8')

        #add all the data to our dictionary
        data['title'].append(title)
        data['from'].append(sender)
        data['to'].append(receiver)
        data['date'].append(dt)
        data['subject'].append(subject)
        data['body'].append(content)


    #go to next page
    page += 1
    if page < 9:
        #get the next page's link
        next_page = soup.select('.next a')
        next_page_url = next_page[0].get('href')
        #set url
        url = base_url + next_page_url

#get the data into a pandas dataframe
email_leaks = pd.DataFrame(data)

going to wikileaks for searching, page: 1
going to wikileaks for searching, page: 2
going to wikileaks for searching, page: 3
going to wikileaks for searching, page: 4
going to wikileaks for searching, page: 5
going to wikileaks for searching, page: 6
going to wikileaks for searching, page: 7
going to wikileaks for searching, page: 8

All done! Let's take a look at our collected data.

email_leaks.shape

(364, 6)

email_leaks

... ... ...

	body	date	from	subject	title	to
0	\nUNCLASSIFIED U.S. Department of State Case N...	2010-09-01 0200	Akbar Zaidi	PAKISTAN'S ROLLER-COASTER ECONOMY	PAKISTAN'S ROLLER-COASTER ECONOMY: TAX EVASION...
1	\nUNCLASSIFIED U.S. Department of State Case N...	2010-10-24 0445	Hillary Clinton	TRIP READING - - BANGLADESH IS FOLLOWING THE ...	TRIP READING - - BANGLADESH IS FOLLOWING THE L...	Lauren Jiloty
2	\nUNCLASSIFIED U.S. Department of State Case N...	2001-01-01 0300	Daily Sun	EVOLVING DIPLOMATIC ECO-SYSTEM AND BANGLADESH...	EVOLVING DIPLOMATIC ECO-SYSTEM AND BANGLADESH ...
3	\nUNCLASSIFIED U.S. Department of State Case N...	2010-12-05 2139	Hillary Clinton			Lauren Jiloty
4	\nUNCLASSIFIED U.S. Department of State Case N...	2010-12-06 0007	Hillary Clinton	MORE	MORE	Lauren Jiloty
5	\nUNCLASSIFIED U.S. Department of State Case N...	2010-12-06 0010	Hillary Clinton	MORE	MORE	Melanne Verveer
6	\nUNCLASSIFIED U.S. Department of State Case N...	2010-12-05 2140	Hillary Clinton			Michael Fuchs
7	\nUNCLASSIFIED U.S. Department of State Case N...	2010-12-08 2253	Hillary Clinton	LATEST	LATEST	Melanne Verveer
362	\nUNCLASSIFIED U.S. Department of State Case N...	2012-06-07 0653		-	-
363	\nUNCLASSIFIED U.S. Department of State Case N...	2010-04-27 0902		-	-

364 rows × 6 columns

Saving the data in different file formats

We can save this dataframe in various format, thanks to pandas. This particular dataset has some encoding related problems with csv or excel format, so we save them in json and text file format. Here, I have shown two other formats just for demonstration, you can try these out too.

#Saving the data in different formats
#Latex
email_leaks.to_latex('leaks.tex')
#HTML
email_leaks.to_html('lix.html')
#JSON
email_leaks.to_json('mails.json')
#Text File
json.dump(data, open("mails.txt",'w'))

So, Scraping was pretty easy, thanks to the wonderful libraries available. Let's call it a day and grab some coffee. Adios!

Importing necessary libraries

Scraping wikileaks

Saving the data in different file formats

Scraping Wikileaks for Hillary Clinton Emails related to Bangladesh

# Importing necessary libraries

# Scraping wikileaks

# Saving the data in different file formats

Importing necessary libraries

Scraping wikileaks

Saving the data in different file formats