Scraping WikiLeaks for Hillary Clinton Emails Related to Bangladesh
I wanted to get my feet wet in the scraping world, so I decided to scrape WikiLeaks. I am from Bangladesh, so I thought, let's see what WikiLeaks has in store for Bangladesh. Hillary Clinton is the probable next president of the US, so I decided to collect the Hillary Clinton emails that are linked to Bangladesh. These emails were made public by WikiLeaks. With the scraped data, I wish to do some document modeling some other day, inshaAllah. So, let's get to work!
Importing necessary libraries
We will import some libraries here to make our life easier along the way.
import requests, bs4, re, json
import pandas as pd
After looking at some emails on the WikiLeaks website, I decided to save the data in the following format.
data = {
    'title': [],
    'from': [],
    'to': [],
    'date': [],
    'subject': [],
    'body': [],
}
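This layout only works if every email appends exactly one value to each list, so that position i in every list describes the same email. Here is a minimal sketch of that invariant. The helper append_record and the placeholder values are made up for illustration and are not part of the original notebook.

```python
# Same dictionary-of-lists layout as `data` above.
records = {'title': [], 'from': [], 'to': [], 'date': [], 'subject': [], 'body': []}


def append_record(store, fields):
    """Append one email to every list; hypothetical helper, not from the notebook."""
    missing = set(store) - set(fields)
    if missing:
        # Refuse partial records so the lists never get out of step.
        raise ValueError('missing fields: {}'.format(sorted(missing)))
    for key in store:
        store[key].append(fields[key])


# Placeholder values, not real scraped data.
append_record(records, {
    'title': 'PLACEHOLDER TITLE',
    'from': 'Sender Name',
    'to': 'Receiver Name',
    'date': '2010-01-01 0000',
    'subject': 'PLACEHOLDER SUBJECT',
    'body': 'Placeholder body text.',
})

# Every list has the same length, so row 0 is one complete email.
lengths = {len(values) for values in records.values()}
print(lengths)
```

The fields are passed as a dictionary rather than keyword arguments because from is a reserved word in Python.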
Scraping WikiLeaks
There are 8 pages of search results for Hillary Clinton emails related to Bangladesh. So, we open the emails from each page, scrape them, store them in our dictionary, then move on to the next page. The following code snippet has comments for all parts, so it's quite self-explanatory. I will leave it at that.
page = 1
base_url = "https://search.wikileaks.org/"
url = 'https://search.wikileaks.org/?query=bangladesh&exact_phrase=&any_of=&exclude_words=&document_date_start=&document_date_end=&released_date_start=&released_date_end=&publication_type%5B%5D=42&new_search=False&order_by=newest_released_date#results'
# There are 8 result pages, so we use this loop to go through each page and scrape.
while page < 9:
    # set the url and go to it
    print("going to wikileaks for searching, page:", page)
    res = requests.get(url)
    res.raise_for_status()
    # get the search result page (name the parser explicitly to avoid a bs4 warning)
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    linkElems = soup.select('.info a')
    # open each page and scrape data
    for link in linkElems:
        # get a search result's content
        result_url = link.get('href')
        result_html = requests.get(result_url)
        result_html.raise_for_status()
        result_soup = bs4.BeautifulSoup(result_html.text, 'html.parser')
        # extract the data
        # title of the document
        title = result_soup.select('.tab-content h2')[0].get_text()
        # body of the mail
        content = result_soup.select('.email-content')[0].get_text()
        # strip unnecessary white spaces
        content = re.sub(r'\s+ ', ' ', content)
        # get the from, to, date and subject data from header
        header = result_soup.select('#header')[0].get_text()
        header_lines = header.splitlines()
        # break down the header to from, to, date and subject fields to fit our dictionary format
        # get the sender
        sender = header_lines[1].split(':')[1]
        # get the receiver
        receiver = header_lines[2].split(':')[1]
        # get the date time (the time itself contains a colon, so join both pieces)
        dt = header_lines[3].split(':')
        dt = dt[1] + dt[2]
        # get the subject
        subject = header_lines[4].split(':')[1]
        # add all the data to our dictionary
        data['title'].append(title)
        data['from'].append(sender)
        data['to'].append(receiver)
        data['date'].append(dt)
        data['subject'].append(subject)
        data['body'].append(content)
    # go to next page
    page += 1
    if page < 9:
        # get the next page's link
        next_page = soup.select('.next a')
        next_page_url = next_page[0].get('href')
        # set url
        url = base_url + next_page_url
# get the data into a pandas dataframe
email_leaks = pd.DataFrame(data)
going to wikileaks for searching, page: 1
going to wikileaks for searching, page: 2
going to wikileaks for searching, page: 3
going to wikileaks for searching, page: 4
going to wikileaks for searching, page: 5
going to wikileaks for searching, page: 6
going to wikileaks for searching, page: 7
going to wikileaks for searching, page: 8
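The header parsing in the loop splits each line on every colon and keeps only the piece at index 1, so a subject that itself contains a colon is cut at that colon. One alternative is to split each line at its first colon only. The sketch below assumes the header text has the line layout the loop relies on (a first line that is skipped, then From, To, Date and Subject). The function parse_header and the sample text are made up for illustration.

```python
def parse_header(header_text):
    """Return a dict of header fields, splitting each line at its first colon only."""
    fields = {}
    for line in header_text.splitlines():
        name, sep, value = line.partition(':')
        if sep:  # skip lines without a colon, such as a blank first line
            fields[name.strip().lower()] = value.strip()
    return fields


# Made-up header text in the assumed layout, not a real email.
sample_header = (
    "\n"
    "From: Sender Name\n"
    "To: Receiver Name\n"
    "Date: 2010-09-01 02:00\n"
    "Subject: EXAMPLE SUBJECT: WITH A COLON\n"
)

parsed = parse_header(sample_header)
print(parsed['subject'])  # EXAMPLE SUBJECT: WITH A COLON
print(parsed['date'])     # 2010-09-01 02:00
```

Unlike the loop, this keeps the colon inside the time part of the date and looks fields up by name instead of by line position.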
All done! Let's take a look at our collected data.
email_leaks.shape
(364, 6)
email_leaks
| | body | date | from | subject | title | to |
|---|---|---|---|---|---|---|
| 0 | \nUNCLASSIFIED U.S. Department of State Case N... | 2010-09-01 0200 | Akbar Zaidi | PAKISTAN'S ROLLER-COASTER ECONOMY | PAKISTAN'S ROLLER-COASTER ECONOMY: TAX EVASION... | |
| 1 | \nUNCLASSIFIED U.S. Department of State Case N... | 2010-10-24 0445 | Hillary Clinton | TRIP READING - - BANGLADESH IS FOLLOWING THE ... | TRIP READING - - BANGLADESH IS FOLLOWING THE L... | Lauren Jiloty |
| 2 | \nUNCLASSIFIED U.S. Department of State Case N... | 2001-01-01 0300 | Daily Sun | EVOLVING DIPLOMATIC ECO-SYSTEM AND BANGLADESH... | EVOLVING DIPLOMATIC ECO-SYSTEM AND BANGLADESH ... | |
| 3 | \nUNCLASSIFIED U.S. Department of State Case N... | 2010-12-05 2139 | Hillary Clinton | | | Lauren Jiloty |
| 4 | \nUNCLASSIFIED U.S. Department of State Case N... | 2010-12-06 0007 | Hillary Clinton | MORE | MORE | Lauren Jiloty |
| 5 | \nUNCLASSIFIED U.S. Department of State Case N... | 2010-12-06 0010 | Hillary Clinton | MORE | MORE | Melanne Verveer |
| 6 | \nUNCLASSIFIED U.S. Department of State Case N... | 2010-12-05 2140 | Hillary Clinton | | | Michael Fuchs |
| 7 | \nUNCLASSIFIED U.S. Department of State Case N... | 2010-12-08 2253 | Hillary Clinton | LATEST | LATEST | Melanne Verveer |
| 362 | \nUNCLASSIFIED U.S. Department of State Case N... | 2012-06-07 0653 | - | - | | |
| 363 | \nUNCLASSIFIED U.S. Department of State Case N... | 2010-04-27 0902 | - | - | | |
364 rows × 6 columns
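The body column still starts with a newline because the pattern r'\s+ ' used in the scraping loop only matches one or more whitespace characters followed by a space. If fully collapsed whitespace is preferred before document modeling, one option is sketched below. The helper normalize_whitespace is hypothetical and the sample string is made up.

```python
import re


def normalize_whitespace(text):
    """Collapse every run of whitespace to one space and trim both ends."""
    return re.sub(r'\s+', ' ', text).strip()


# Made-up text with a leading newline and uneven spacing.
sample_body = "\nUNCLASSIFIED\n\n  Example   body text \n"
cleaned = normalize_whitespace(sample_body)
print(cleaned)  # UNCLASSIFIED Example body text
```

With pandas, the same function could be applied to the whole column, for example with email_leaks['body'].apply(normalize_whitespace).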
Saving the data in different file formats
We can save this dataframe in various formats, thanks to pandas. This particular dataset has some encoding-related problems with the CSV and Excel formats, so we save it as JSON and as a text file. The LaTeX and HTML exports are shown just for demonstration; you can try these out too.
# Saving the data in different formats
# LaTeX
email_leaks.to_latex('leaks.tex')
# HTML
email_leaks.to_html('lix.html')
# JSON
email_leaks.to_json('mails.json')
# Text file (the raw dictionary, serialized as JSON)
with open('mails.txt', 'w') as f:
    json.dump(data, f)
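To check that the text file can be loaded again later, the dictionary can be read back with json.load. Here is a minimal sketch of that round trip. It uses a temporary directory and a made-up one-row dictionary in place of the real scraped data.

```python
import json
import os
import tempfile

# Made-up one-row dictionary in the same layout as `data`.
sample = {
    'title': ['EXAMPLE'],
    'from': ['Sender Name'],
    'to': ['Receiver Name'],
    'date': ['2010-01-01 0000'],
    'subject': ['EXAMPLE'],
    'body': ['Example body.'],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'mails.txt')
    # Write the dictionary as JSON text, then read it straight back.
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(sample, f)
    with open(path, encoding='utf-8') as f:
        restored = json.load(f)

print(restored == sample)  # True
```

The restored dictionary can then be passed to pd.DataFrame again, just like the original data.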
So, scraping was pretty easy, thanks to the wonderful libraries available. Let's call it a day and grab some coffee. Adios!
