Pandas, BeautifulSoup, Requests, Plotly, Dash, nltk, Sklearn, Keras
Regression, PCA, Time Series, Trees, RFs, SVMs, Boosting, Neural Networks
Project Description:
Using the Pushshift API for Reddit, I scrape Reddit posts and comments from two similar subreddits, /r/pcgaming and /r/boardgames, with the goal of applying NLP techniques to classify posts into the correct subreddit. Here, I explore the text content and functional features of subreddit activity.
Data Collection:
Reddit provides a well-documented API for collecting data. I write two simple functions that extract posts and comments given a subreddit name, a desired post count, and a UTC datetime to scrape backwards from. These can be viewed as a Python script with my parameters here.
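As a rough illustration of the backwards-paging idea, here is a minimal sketch rather than my actual script. The endpoint URL, the `fetch_page` and `collect_posts` names, and the page size are assumptions made for the example, and the endpoint may not be reachable in every environment.

```python
import time

# Assumed Pushshift submission-search endpoint (illustrative only).
PUSHSHIFT_URL = "https://api.pushshift.io/reddit/search/submission/"


def fetch_page(params):
    """Fetch one page of results; requests is imported lazily so the
    paging logic below can be exercised without network access."""
    import requests

    resp = requests.get(PUSHSHIFT_URL, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()["data"]


def collect_posts(subreddit, n_posts, before_utc, fetch=fetch_page,
                  page_size=100, pause=0.0):
    """Collect up to n_posts posts created before before_utc (epoch seconds),
    walking backwards in time one page at a time."""
    posts = []
    while len(posts) < n_posts:
        batch = fetch({
            "subreddit": subreddit,
            "size": min(page_size, n_posts - len(posts)),
            "before": before_utc,
        })
        if not batch:
            break  # nothing older is available
        posts.extend(batch)
        # Continue from the oldest post seen in this page.
        before_utc = min(p["created_utc"] for p in batch)
        time.sleep(pause)  # optional politeness delay between requests
    return posts[:n_posts]
```

Passing the fetching function as a parameter keeps the paging loop separate from the HTTP call, so the same loop could be reused for comments by swapping in a different fetcher.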
Before analyzing the text content of these posts, I explore which features describing the structure of a post can be used in a model in addition to the text. I explore post and comment length, activity, time of posting, and domain of hyperlinked content in the notebook here.
Insight 1: Both subreddits have comparable text lengths, but /r/boardgames has more extremely long posts while /r/pcgaming has more posts with minimal text.
Insight 2: /r/pcgaming is more active late at night and in the early morning, while /r/boardgames is more active in the late afternoon. Thus time of day could be a useful predictive feature.
Insight 3: The type and domain of hyperlinked content predict the subreddit. Image content in posts exclusively indicates /r/boardgames membership, while many linked domains beyond YouTube strongly indicate /r/pcgaming membership.
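The structural features behind Insights 1 to 3 could be derived from each post roughly as follows. This is a sketch under assumptions: the `structural_features` helper and the image-extension list are hypothetical, and the `selftext`, `url`, and `created_utc` keys are assumed to be present in the scraped records.

```python
from datetime import datetime, timezone
from urllib.parse import urlparse

# Hypothetical list of extensions treated as direct image links.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif")


def structural_features(post):
    """Derive length, posting hour, link domain, and an image flag from one post."""
    text = post.get("selftext") or ""
    parsed = urlparse(post.get("url") or "")

    # Normalize the domain so "www.youtube.com" and "youtube.com" match.
    domain = parsed.netloc.lower()
    if domain.startswith("www."):
        domain = domain[4:]

    # created_utc is in epoch seconds; the hour is reported in UTC, not local time.
    created = datetime.fromtimestamp(post["created_utc"], tz=timezone.utc)

    return {
        "text_length": len(text),
        "hour_utc": created.hour,
        "domain": domain,
        "is_image": parsed.path.lower().endswith(IMAGE_EXTENSIONS),
    }
```

A list of these dictionaries can be passed straight to `pandas.DataFrame` for plotting or for joining with text features before modeling.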
Utilizing the nltk library, I tokenize and lemmatize the text content of titles, posts, and comments.
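A minimal sketch of this preprocessing step is shown below. It is not necessarily the exact pipeline from the notebook: the `preprocess` helper, the regular-expression tokenizer pattern, and the choice of the WordNet lemmatizer are assumptions for illustration. The default path requires nltk and its WordNet corpus, and some nltk versions may need additional corpora.

```python
from functools import lru_cache


@lru_cache(maxsize=1)
def _nltk_tools():
    """Build the default tokenizer and lemmatizer once, importing nltk lazily."""
    import nltk
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import RegexpTokenizer

    nltk.download("wordnet", quiet=True)  # corpus needed by the lemmatizer
    tokenizer = RegexpTokenizer(r"[a-z']+")  # assumed pattern: lowercase words
    return tokenizer.tokenize, WordNetLemmatizer().lemmatize


def preprocess(text, tokenize=None, lemmatize=None):
    """Lowercase, tokenize, and lemmatize a title, post, or comment.

    Custom tokenize/lemmatize callables can be supplied; otherwise the
    nltk-based defaults are used.
    """
    if tokenize is None or lemmatize is None:
        default_tokenize, default_lemmatize = _nltk_tools()
        tokenize = tokenize or default_tokenize
        lemmatize = lemmatize or default_lemmatize
    return [lemmatize(token) for token in tokenize(text.lower())]
```

Calling `preprocess(post_text)` with no extra arguments uses the nltk defaults; the optional arguments make it easy to swap in a different tokenizer or lemmatizer when comparing approaches.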
Insight 4: We find that words such as dice, card, and board, which reference concrete objects, are associated with /r/boardgames.
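One simple way to surface words associated with a subreddit is to compare smoothed relative token frequencies between the two corpora. The sketch below is an assumed approach for illustration, not necessarily the method used in the notebook; `distinctive_tokens` and its `alpha` smoothing parameter are hypothetical names.

```python
import math
from collections import Counter


def distinctive_tokens(docs_a, docs_b, top_n=10, alpha=1.0):
    """Rank tokens by the log-ratio of their smoothed frequency in corpus A
    versus corpus B. Each corpus is a list of token lists."""
    counts_a = Counter(tok for doc in docs_a for tok in doc)
    counts_b = Counter(tok for doc in docs_b for tok in doc)
    vocab = set(counts_a) | set(counts_b)

    # Additive smoothing so tokens absent from one corpus get a finite score.
    total_a = sum(counts_a.values()) + alpha * len(vocab)
    total_b = sum(counts_b.values()) + alpha * len(vocab)

    scores = {
        tok: math.log((counts_a[tok] + alpha) / total_a)
        - math.log((counts_b[tok] + alpha) / total_b)
        for tok in vocab
    }
    # Highest scores are the tokens most characteristic of corpus A.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
```

Passing the lemmatized /r/boardgames documents as `docs_a` and the /r/pcgaming documents as `docs_b` would rank tokens most characteristic of /r/boardgames first; swapping the arguments gives the reverse view.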