Scraping Sentiment from Reddit

We can use Python to automatically analyze the sentiment of Reddit posts (sentiment analysis, also known as "opinion mining" or "emotion AI"). This may have practical applications for cryptocurrency traders: machine learning could be used to look for correlations between price movements and sentiment. Today, we are going to "scrape" the top posts from Reddit's r/ethtrader Ethereum trading community, because the links there are already scored for relevance by humans through Reddit's voting system. Here's the full script, with a play-by-play below:

import requests
from bs4 import BeautifulSoup
# Imports the Google Cloud client library
from google.cloud import language

# Instantiates a client
language_client = language.Client()

url = 'https://www.reddit.com/r/ethtrader/top/?sort=top&t=all'

# Use a fresh user-agent string, since Reddit rejects the default Python one
r = requests.get(url, headers={'User-agent': 'youllneverguess'})
print(r.status_code)  # a status code of 200 means everything is okay

soup = BeautifulSoup(r.content, 'html.parser')
siteTable = soup.find("div", {"id": "siteTable"})
hits = siteTable.find_all("div", {"class": "thing"})

i = 0

for hit in hits:
    i = i + 1
    print("-------------------------------------------")
    username = hit.find("a", {"class": "author"})
    datetime = hit.find("time")['datetime']
    score = hit.find("div", {"class": "score unvoted"}).text
    for link in hit.find_all('a', href=True):
        if "https://www.reddit.com/r/" in link['href']:
            href = link['href']
            followed = requests.get(href, headers={'User-agent': 'ethscraper 1.0'})
            linksoup = BeautifulSoup(followed.content, 'html.parser')
            content = linksoup.find("div", {"class": "content"})
            paragraphs = content.find_all("p")
            text = ""
            for paragraph in paragraphs:
                text = text + " " + paragraph.text
            document = language_client.document_from_text(text)
            sentiment = document.analyze_sentiment().sentiment
    print(i, "| Username:", username.string, "| Date & Time:", datetime, "| Votes:", score, "|",
          'Sentiment: {}, Magnitude: {}'.format(sentiment.score, sentiment.magnitude))

And now for the play-by-play:

import requests
from bs4 import BeautifulSoup
# Imports the Google Cloud client library
from google.cloud import language

We start off by importing the requests library (to fetch the pages), Beautiful Soup 4 (to parse the HTML), and the Google Cloud client library, which provides the Natural Language sentiment analyzer. All three are installable from PyPI (requests, beautifulsoup4, and google-cloud-language).

# Instantiates a client
language_client = language.Client()

This instantiates the client object that all of the Google Cloud sentiment-analysis calls go through.
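One assumption worth flagging: the client only works if your Google Cloud credentials are already set up. A common way to do that is to point the standard GOOGLE_APPLICATION_CREDENTIALS environment variable at a service-account key file before the client is created; the key path below is a placeholder, not something from the original post:

import os

# Hypothetical path to a service-account JSON key downloaded from the
# Google Cloud console; set this (or export the variable in your shell)
# before calling language.Client().
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/path/to/your-key.json'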

url = 'https://www.reddit.com/r/ethtrader/top/?sort=top&t=all'

# Use a fresh user-agent string, since Reddit rejects the default Python one
r = requests.get(url, headers={'User-agent': 'youllneverguess'})
print(r.status_code)  # a status code of 200 means everything is okay

soup = BeautifulSoup(r.content, 'html.parser')
siteTable = soup.find("div", {"id": "siteTable"})
hits = siteTable.find_all("div", {"class": "thing"})

This "scrapes" the target website and them parses it using Beautiful soup. Any children divs of the siteTable div are posts regarding our topic, i.e. "hits". You have to go through the source code of the target website to find these bits of code to scrape based on. We have to change the name of the user agent since Reddit will reject the default one (they must get hit with requests using the default user-agent name all the time).

for hit in hits:
    i = i + 1
    print("-------------------------------------------")
    username = hit.find("a", {"class": "author"})
    datetime = hit.find("time")['datetime']
    score = hit.find("div", {"class": "score unvoted"}).text

Above, we scrape the name of the user who made the submission, the date and time of submission (for later comparison against the subsequent change in price), and the number of Reddit votes (so we have a human-determined score that's hard to fake).
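One caveat, with a hedged sketch of how to handle it: some posts may be missing an author link or a visible score (deleted accounts, hidden scores), in which case the find() calls above return None and the later print would crash. A defensive version might look like this (not part of the original script):

username = hit.find("a", {"class": "author"})
username_text = username.string if username is not None else "[deleted]"

score_div = hit.find("div", {"class": "score unvoted"})
score = score_div.text if score_div is not None else "0"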

Wait, there's more:

    for link in hit.find_all('a', href=True):
        if "https://www.reddit.com/r/" in link['href']:
            href = link['href']
            followed = requests.get(href, headers={'User-agent': 'ethscraper 1.0'})
            linksoup = BeautifulSoup(followed.content, 'html.parser')
            content = linksoup.find("div", {"class": "content"})
            paragraphs = content.find_all("p")
            text = ""
            for paragraph in paragraphs:
                text = text + " " + paragraph.text
            document = language_client.document_from_text(text)
            sentiment = document.analyze_sentiment().sentiment
    print(i, "| Username:", username.string, "| Date & Time:", datetime, "| Votes:", score, "|",
          'Sentiment: {}, Magnitude: {}'.format(sentiment.score, sentiment.magnitude))

Here we follow each front-page link to its discussion page and use the sentiment analyzer to score the attitude of the post text on a scale from -1 (negative) to 1 (positive). It also reports a magnitude, i.e. the overall strength of emotion, on an unbounded scale starting at 0.
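To make those numbers easier to eyeball, here's a small helper (hypothetical, not part of the script above) that buckets a score into a coarse label; the 0.25 cutoff is an arbitrary choice for illustration:

def label_sentiment(score, threshold=0.25):
    # Bucket a sentiment score in [-1, 1] into a coarse label.
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"

print(label_sentiment(0.8))   # positive
print(label_sentiment(-0.1))  # neutral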

Voilà. Each post comes out as one line: username, timestamp, vote count, sentiment score, and magnitude.

This data will later be fed into machine learning to look for correlations, if any, between sentiment and market price.
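As a sketch of where that could go (nothing below is in the script above, and the file name is made up), the per-post fields could be appended to a list of rows inside the loop and written out to a CSV, which can then be joined against historical price data:

import csv

rows = []  # inside the loop: rows.append((datetime, score, sentiment.score, sentiment.magnitude))

def save_rows(rows, path="ethtrader_sentiment.csv"):
    # Write the scraped fields to a CSV for later analysis,
    # e.g. correlating sentiment with subsequent price moves.
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["datetime", "votes", "sentiment", "magnitude"])
        writer.writerows(rows)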
