Smulemates: A Karaoke Singing Style Recommendation System

I’ve grown up singing karaoke for as long as I can remember – at home, at family parties, and at get-togethers with friends.  It is not only one of my favorite hobbies, but also a big part of my culture.  So naturally, I decided to focus my final project on this great pastime.


The aspect of karaoke that I love the most is that it brings people together.  In fact, there is a whole social app built around this concept, called Smule.  As of 2018, this popular karaoke app had over 50 million monthly active users.  It unites people from all over the world to sing solo, duet, or group performances.  Users may also view other users’ performances, and see what’s popular and trending.  However, I noticed that the app doesn’t recommend other singers based on a user’s particular singing style.  For myself, I know that I am unable to belt out Mariah Carey and Whitney Houston songs from the ’90s, so I prefer to stay within my own vocal range.  So, with millions of other users on the app, how can we make it easier for users to choose who to sing with?  With this project, I propose that Smule add a feature called Smulemates, a recommendation system that suggests other users with similar singing styles.


As I was looking for data for my project, I came across the Stanford Digital Archive of Mobile Performances (DAMP) Dataset, consisting of several datasets of actual Smule user performances.  The dataset required access approval, and since I had a very limited amount of time to work on the project, I tried to expedite the process by contacting the actual owner of the dataset (in addition to emailing the specified alias).  My initial attempts consisted of calling a phone number listed on the parent site that hosted the dataset, along with sending a message through the Center for Computer Research in Music and Acoustics (CCRMA) Facebook page.  Although the administrator was willing to help me figure out the contact, I was redirected back to the alias I had originally e-mailed.  Since I was back at square one, I decided to take a bolder step and reach out to an individual on LinkedIn who had written a research paper using the DAMP dataset.  She was a former intern at Smule, and had worked on creating one of the datasets.  A few hours later, I received a very kind response that provided an e-mail address for the owner of the dataset.  The next morning, I e-mailed the owner directly and received access within the hour.

The dataset that I chose was called the Smule “Balanced” Vocal Performances, which consisted of 24,874 performances across 14 songs and 5,429 singers.  The data was also balanced between male and female singers.

Feature Extraction

Once I collected all my data, I loaded the audio into Librosa to extract mel-frequency cepstral coefficient (MFCC) features.  MFCCs approximate how the human auditory system perceives sound and are often used when comparing sounds for similarity.  Initially, I ran into some errors and noticed that about 250 files either did not contain any singing or were cut short.  I worked around this issue by wrapping the extraction in a “try-except” block.  If a file raised an error, its name was added to a “bad_files” list and the for-loop moved on to the next file.  The names of files that processed without errors were added to a “good_files” list.  I set the number of MFCC features to 12, and once the code had run for all the files, I had a matrix of 12 features by 216 frames for each performance.  I then flattened this matrix to 1 x 2,592, so that I had only one row for each performance.
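The loop described above can be sketched roughly as follows.  This is a minimal illustration, not the project's actual code: collect_features, flatten, and fake_extract are hypothetical names, and the extractor is passed in as a function so the sketch runs without any audio files.  With Librosa, the extractor would presumably wrap librosa.load and librosa.feature.mfcc with n_mfcc=12.

```python
def flatten(matrix):
    """Flatten a features-by-frames matrix (a list of rows) into one row."""
    return [value for row in matrix for value in row]


def collect_features(file_names, extract):
    """Run `extract` on each file, setting aside files that raise errors.

    `extract` is any callable returning a 2-D matrix for one file. With
    Librosa it might wrap librosa.load() and
    librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12) (an assumption here).
    """
    rows, good_files, bad_files = [], [], []
    for name in file_names:
        try:
            matrix = extract(name)
        except Exception:
            bad_files.append(name)  # remember the failure and move on
            continue
        good_files.append(name)
        rows.append(flatten(matrix))
    return rows, good_files, bad_files


def fake_extract(name):
    """Stand-in extractor: 12 coefficients x 216 frames, or an error."""
    if name.startswith("bad"):
        raise ValueError("no usable audio in " + name)
    return [[0.0] * 216 for _ in range(12)]


rows, good_files, bad_files = collect_features(
    ["perf_1.m4a", "bad_2.m4a", "perf_3.m4a"], fake_extract
)
print(len(rows), len(rows[0]), bad_files)
```

Keeping good_files in the same order as rows is what makes it possible to attach a performance key to each feature row afterward.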

After creating 3 separate dataframes (mfcc_all, good_files, and bad_files), I had to merge the mfcc_all and good_files dataframes in order to add a “perf_key” column to the end of my row of MFCC features.  Lastly, I removed null values from this dataframe and re-indexed.

Dimensionality Reduction

Since the MFCC dataframe had 2,592 features, I had to perform dimensionality reduction to reduce the number of features.  I first standardized my data using StandardScaler, so that each feature had a mean of 0 and a standard deviation of 1.  I then used principal component analysis (PCA) to reduce my dimensions.  After plotting the number of components against the variance explained, I chose to reduce it to 1,000 components.
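To make the two steps concrete, here is a small standard-library sketch.  The standardize function mirrors what StandardScaler does to a single column (zero mean, unit variance, using the population standard deviation), and components_for_variance shows one common way to read a variance plot: pick the smallest number of components whose cumulative explained variance reaches a target.  The function names, the example ratios, and the 0.8 target are all assumptions made for illustration.

```python
from itertools import accumulate
from statistics import fmean, pstdev


def standardize(column):
    """Rescale one feature column to zero mean and unit variance.

    Assumes the column is not constant (a zero std would divide by zero).
    """
    mean = fmean(column)
    std = pstdev(column)  # population standard deviation
    return [(value - mean) / std for value in column]


def components_for_variance(explained_ratios, target):
    """Smallest number of leading components whose cumulative explained
    variance ratio reaches `target`."""
    for count, total in enumerate(accumulate(explained_ratios), start=1):
        if total >= target:
            return count
    return len(explained_ratios)


scaled = standardize([2.0, 4.0, 6.0, 8.0])
ratios = [0.5, 0.2, 0.15, 0.1, 0.05]  # made-up explained variance ratios
k = components_for_variance(ratios, target=0.8)
print(scaled, k)
```

Note that standardized values are centered on zero, so some of them are negative; they are not confined to a fixed range.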

Content-Based Recommendation System

Lastly, I created my content-based recommendation system by comparing each performance against all the other 25K performances.  I chose cosine similarity as the measure of how alike two singing styles are.  The recommendation system then outputs the top 3 performances with the highest cosine similarity.  Here is an example of a top recommendation based on a particular user’s performance:
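The ranking step can be sketched as follows.  This is only an illustration of the idea, using short made-up vectors in place of the reduced MFCC features; top_matches and the perf_* keys are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_matches(query_key, features, k=3):
    """Return the k performance keys most similar to `query_key`."""
    query = features[query_key]
    scores = [
        (cosine_similarity(query, vector), key)
        for key, vector in features.items()
        if key != query_key  # never recommend the query itself
    ]
    scores.sort(reverse=True)
    return [key for _, key in scores[:k]]


features = {
    "perf_a": [1.0, 0.0, 1.0],
    "perf_b": [0.9, 0.1, 1.1],
    "perf_c": [-1.0, 0.5, 0.0],
    "perf_d": [1.0, 1.0, 1.0],
    "perf_e": [0.0, 1.0, 0.0],
}
recommended = top_matches("perf_a", features)
print(recommended)
```

Cosine similarity compares the direction of two feature vectors rather than their magnitude, which is one reason it is a common choice for content-based recommenders.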

User (Singing “Lost Boy” by Ruth B)
Top Recommendation (Singing “Lost Boy” by Ruth B)

The recommended user’s voice sounds very close to the user’s, as they both have a raspy and mellow style of singing.  Some may even think it is the same person!  Now, if the two users decide to sing a duet, their voices go pretty well together:

Duet between User and Top Recommendation


With singer recommendations built into the app, a user may:

  • Explore singers with similar singing styles.
  • Choose to collaborate with the singer.
  • View other performances by the singer and get ideas of songs they may be interested in performing as well.

Going forward, I would like to explore how different singing styles work well together based on popularity.  Since not everyone wants to only perform with singers like themselves, I would like to create an additional recommendation system for other singing styles that would work well with the user.

Connect with me on LinkedIn!

Mexican Restaurant Yelp Reviews: Sentiment Analysis and Topic Modeling

Yelp provides a wealth of information that can be used by both consumers and business owners to understand the business, as well as customer experiences.  One important aspect of a Yelp page is the customer reviews.  The more popular a business becomes, the more reviews it receives.  However, as the number of reviews grows, it becomes difficult to read through them all and to get a full picture of the customer experience.  By performing natural language processing on Yelp reviews, and providing sentiment analysis and topic modeling, this project has two goals:

  1. To make it easier for customers to make restaurant decisions.
  2. To allow restaurants to act on customer reviews.


The data for this project was retrieved from the Yelp Dataset Challenge, using the business.json and review.json files.  Since it came in JSON format, I first converted the files into CSVs and combined the two files into one dataframe.  With the original dataset consisting of 6 million reviews, I filtered the data down to include only Las Vegas Mexican restaurants.  This resulted in a dataset that included 1,019 restaurants and 129K+ reviews from 2013-2018.

Sentiment Analysis

Performing sentiment analysis required using both unsupervised and supervised learning.  First, I pre-processed the text data by removing punctuation, stop words, and numbers, and by converting everything to lowercase.  I then converted the text documents into a matrix of token counts using CountVectorizer.
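As a rough picture of what these steps produce, here is a standard-library sketch of the same idea.  It is not CountVectorizer itself: preprocess and count_matrix are hypothetical helpers, the stop word list is a tiny made-up sample, and the regular expression keeps only ASCII letters, which is a simplification.

```python
import re
from collections import Counter

# A tiny illustrative stop word list; real lists are much longer.
STOP_WORDS = {"the", "a", "and", "was", "were", "i", "it", "to", "of"}


def preprocess(text):
    """Lowercase, drop punctuation and digits, and remove stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # keep ASCII letters and whitespace
    return [token for token in text.split() if token not in STOP_WORDS]


def count_matrix(documents):
    """Build a sorted vocabulary and one row of token counts per document."""
    tokenized = [preprocess(doc) for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows


vocabulary, rows = count_matrix([
    "The tacos were AMAZING, 10/10!",
    "Tacos and salsa... the salsa was bland.",
])
print(vocabulary)
print(rows)
```

Each row has one column per vocabulary word, so every review ends up as a fixed-length vector that a classifier can consume.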

Once my data was ready for supervised learning, I trained the model only on the 5 and 1 star ratings, which I treated as positive and negative reviews, respectively.  Since the data consisted of 76% 5 stars and 24% 1 stars, I had to balance it using random oversampling.  After validating across several different machine learning models, Multinomial Naïve Bayes gave me the best results, with F1 scores of 0.92 for 1 star ratings and 0.97 for 5 star ratings.  I then applied this final model to my entire dataset to predict positive and negative sentiment.
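Random oversampling simply duplicates randomly chosen examples from the smaller class until the classes are the same size.  The sketch below shows the idea with the standard library; random_oversample is a hypothetical helper, and the 76/24 toy labels only mimic the class proportions mentioned above.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until every class
    is as large as the biggest one."""
    rng = random.Random(seed)
    by_label = {}
    for row, label in zip(rows, labels):
        by_label.setdefault(label, []).append(row)
    target = max(len(group) for group in by_label.values())
    out_rows, out_labels = [], []
    for label, group in by_label.items():
        # Sample with replacement to fill the gap for smaller classes.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


labels = [5] * 76 + [1] * 24          # toy star ratings, 76% vs 24%
rows = [[i] for i in range(100)]      # placeholder feature rows
balanced_rows, balanced_labels = random_oversample(rows, labels)
counts = Counter(balanced_labels)
print(counts)
```

A common precaution is to oversample only the training split, so that duplicated reviews cannot appear in both the training and validation data.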

Topic Modeling

My next step was to perform topic modeling based on sentiment, so I split the text data into the positive and negative groups predicted by the model.  I pre-processed, tokenized, and vectorized the text the same way I did for the sentiment analysis, and then performed dimensionality reduction using LSA, LDA, and NMF.  NMF (non-negative matrix factorization) gave the best results.  The 6 positive topics included: ordering, meat, service, sides, atmosphere, and high praise.  The 5 negative topics included: service, quality, sides, wait, and meat.  A few of the same topics appeared for both the positive and negative sets.

Positive Topics:

Topic | Top 10 Words
Ordering | like, one, ordered, chicken, time, salsa, would, also, menu, restaurant
Meat | tacos, taco, asada, carne, pastor, meat, al, get, el, best
Service | food, service, restaurant, best, amazing, always, delicious, fast, friendly, staff
Sides | good, really, pretty, salsa, service, chips, nice, got, also, little
Atmosphere | great, service, back, amazing, friendly, awesome, definitely, atmosphere, staff, drinks
High Praise | place, love, go, try, like, get, always, amazing, burrito, best

Negative Topics:

Topic | Top 10 Words
Service | us, came, table, asked, said, service, would, server, one, minutes
Quality | food, service, place, good, great, restaurant, bad, better, eat, quality
Sides | good, like, place, really, chicken, salsa, ordered, burrito, chips, one
Wait | order, time, get, back, go, one, location, said, got, never
Meat | tacos, taco, asada, meat, carne, ordered, fish, line, bell, one


Based on the results, I would recommend that:

  • Customers/restaurant owners review sentiment over time and during specific periods.
  • Restaurant owners create action plans based on customer comments by topic.

In addition, since it is helpful to be able to see the verbatim comments for each of the sentiment and topic groupings, I created a Tableau dashboard that would easily allow a user to do this.  Click on the video below to play a demo of the dashboard.

Future Work

Future improvements of this work would include:

  • Additional cleaning of data (e.g. adding more stop words).
  • Summarizing results with representative sentences, since reading verbatims still requires extensive reading.
  • Creating a recommendation system for restaurants with similar positive comments.

Predicting Patient Appointment No-Shows

Patient no-shows and last-minute cancellations come at a very high cost.  On average, health care practices experience a no-show rate of about 19%.  With an unused appointment slot costing about $200 on average, no-shows total about $150 billion per year for the US healthcare system.  More importantly, missed appointments have a high negative impact on the hospital staff and patients.  The time wasted on an unused appointment could have been spent treating another patient and improving quality of care.  The goal of this project was to create a model that predicted whether or not a patient would show up to an appointment, and to provide a recommended course of action to reduce the number of missed appointments.


The data I used for the project was retrieved from a Kaggle dataset.  This dataset consists of 111K appointments for 62K distinct patients across 81 different neighborhoods.  The time period spanned from April to June 2016. 

The target of the model was whether a patient would be a show or a no-show.  My final dataset consisted of 13 features, as the original dataset required a bit of feature engineering.  This included:

  • Mapping the No-Show field to 0/1
  • Calculating the number of days between the appointment schedule date and actual appointment date
  • Calculating the cumulative number of appointments each patient had prior to the appointment
  • Calculating the cumulative number of missed appointments each patient had prior to the appointment
  • Changing the handicap field to whether or not the patient has a handicap
  • Removing outliers
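The lead-time and history features in the list above can be sketched as follows.  This is a minimal standard-library illustration with made-up appointments; the dictionary keys and the add_history_features helper are hypothetical and are not the column names of the Kaggle dataset.

```python
from collections import defaultdict
from datetime import date


def add_history_features(appointments):
    """Add lead time and per-patient history to each appointment.

    Each appointment is a dict with hypothetical keys: patient_id,
    scheduled (date), appointment (date), and no_show (0 or 1).
    """
    prior_count = defaultdict(int)
    prior_missed = defaultdict(int)
    enriched = []
    # Process in appointment-date order so history only looks backward.
    for appt in sorted(appointments, key=lambda a: a["appointment"]):
        pid = appt["patient_id"]
        enriched.append({
            **appt,
            "days_waited": (appt["appointment"] - appt["scheduled"]).days,
            "prior_appointments": prior_count[pid],
            "prior_no_shows": prior_missed[pid],
        })
        # Update the running totals only after recording this row.
        prior_count[pid] += 1
        prior_missed[pid] += appt["no_show"]
    return enriched


appointments = [
    {"patient_id": 1, "scheduled": date(2016, 4, 29),
     "appointment": date(2016, 5, 2), "no_show": 1},
    {"patient_id": 1, "scheduled": date(2016, 5, 10),
     "appointment": date(2016, 5, 10), "no_show": 0},
    {"patient_id": 1, "scheduled": date(2016, 5, 20),
     "appointment": date(2016, 6, 1), "no_show": 0},
]
features = add_history_features(appointments)
for row in features:
    print(row["days_waited"], row["prior_appointments"], row["prior_no_shows"])
```

Updating the counters after each row is recorded keeps the current appointment's own outcome out of its history features, which would otherwise leak the target into the inputs.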

In addition, the features were standardized using StandardScaler.  Since the data consisted of roughly an 80/20 split between shows and no-shows, I balanced the data using random oversampling.

Model Selection

As I worked on choosing the best model for predicting patient no-shows, I ran the data through 7 different statistical models and compared the results using the balanced accuracy metric.  The XGBoost model gave the highest balanced accuracy of 0.95, while Support Vector Machines came very close at 0.94. 

Since balanced accuracy is calculated by averaging the recall and specificity, this score reflects a recall of 99% (the share of no-shows the model correctly identified) and a specificity of 90% (the share of patients who showed up that the model correctly identified).  I preferred to maximize my recall in order to capture almost all of the patient no-shows.
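The calculation itself is short enough to write out.  In the sketch below, no-shows are treated as the positive class, and the confusion-matrix counts are illustrative values chosen only to reproduce 99% recall and 90% specificity; they are not the counts from the project.

```python
def balanced_accuracy(tp, fn, tn, fp):
    """Average of recall (true positive rate) and specificity (true
    negative rate), with no-shows treated as the positive class."""
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (recall + specificity) / 2


# Illustrative counts only: 99 of 100 no-shows caught, 90 of 100 shows caught.
score = balanced_accuracy(tp=99, fn=1, tn=90, fp=10)
print(round(score, 3))
```

Averaging 0.99 and 0.90 gives 0.945, which is consistent with the reported balanced accuracy of about 0.95.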

Looking at feature importance, the most important feature across all models was the number of appointments missed in the past.  This shows that past behavior is a big indicator of whether or not someone will show up to their appointment.  Other important features included the number of previous appointments, the number of days between scheduling an appointment and the actual appointment, and the patient’s age.


Based on my results, I would suggest the following recommendations:

  • Fee Policy – Since the number of missed appointments was the most important feature of the model, I would recommend instituting a fee policy.  This will help with reducing the number of repeat no-shows.
  • Automated reminder system with two-way communication – In order to minimize the amount of manual labor required from the staff, I’d suggest creating an automated reminder system that allows the patient to confirm, cancel, re-schedule, or ask any questions.
  • Extra reminders for those predicted to not show up – Using the model, I would recommend sending extra reminders to predicted “no-show” patients in order to give them additional opportunities to confirm their appointment attendance.
  • Waitlist – Have a waitlist as back-up to fill appointment slots that open up at the last minute.

Future Work

In order to improve my model, I would look at additional data including neighborhood demographics, accessibility to public transportation, and a longer time period.  I would also look into modeling the most effective form of reminder communication.

Predicting Disney World Ride Wait Times

I’m a huge Disney fan.  I try to visit a Disney park once a year, or at least every other year, with the goal of riding all the rides at least once.  However, with the popularity of these parks come long ride lines that may last anywhere from a few minutes to a few hours.  With my experience over the years, I’ve been able to come up with some of my own techniques to try to maximize my time.  However, I’ve always wondered, is there a science to this?

Data Acquisition

As I scoured the internet for data that might help me answer my question, I decided to use two websites:  Weather Underground and Touring Plans.

Since I was interested in how weather affects wait times, I web scraped hourly data from Weather Underground, using both Selenium and Beautiful Soup.  I wanted to focus the analysis on the busy summer vacation months, so I chose to scrape data from May 1, 2018 to August 31, 2018.  The pickling technique came in handy here, as unexpected errors would sometimes occur as I scraped through each time period.  Instead of having to run the entire period repeatedly, I ran the web scraper for half a month at a time, pickled each period, and combined them all at the end.  The resulting data had to be cleaned up a bit, which required stripping unwanted characters, renaming the columns, and converting the data types.
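The checkpointing pattern described above can be sketched like this.  It is a minimal illustration under stated assumptions: scrape_with_checkpoints and fake_scrape are hypothetical names, the period labels are made up, and the real Selenium and Beautiful Soup scraper is replaced by a stand-in so the example runs offline.

```python
import pickle
import tempfile
from pathlib import Path


def scrape_with_checkpoints(periods, scrape, cache_dir):
    """Scrape each period once, pickling results so reruns skip finished
    periods. `scrape` is a hypothetical callable returning a list of rows."""
    cache_dir = Path(cache_dir)
    all_rows = []
    for period in periods:
        cache_file = cache_dir / f"{period}.pkl"
        if cache_file.exists():
            with cache_file.open("rb") as handle:
                rows = pickle.load(handle)
        else:
            rows = scrape(period)  # may raise; earlier periods stay cached
            with cache_file.open("wb") as handle:
                pickle.dump(rows, handle)
        all_rows.extend(rows)
    return all_rows


calls = []


def fake_scrape(period):
    """Stand-in for the real scraper; records which periods were fetched."""
    calls.append(period)
    return [(period, hour) for hour in range(2)]


periods = ["2018-05-a", "2018-05-b"]
with tempfile.TemporaryDirectory() as folder:
    first = scrape_with_checkpoints(periods, fake_scrape, folder)
    second = scrape_with_checkpoints(periods, fake_scrape, folder)
print(len(first), calls)
```

On the second call every period is loaded from its pickle file, so the scraper is not invoked again.  If a scrape fails partway through, only the unfinished periods need to be rerun.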

Touring Plans provides wait time data for 14 rides across the 4 parks in Walt Disney World, along with related metadata pertaining to each particular day (these all came in separate files).  For each ride, they provide the posted wait time every few minutes that the park is open.  In addition, Touring Plans employees actually wait in line for rides and record the actual wait times.  I decided to combine the posted and actual times into one Wait Time column.  I also focused on only one ride for this project – Splash Mountain in Magic Kingdom.  Since my weather data only included May-August, I limited this data to the same dates.

Once all the data was retrieved, I merged it all into one dataframe.  I merged the weather data into the Touring Plans Wait Time data on a new column called “Date and Hour,” which combined both the date and the hour of the day.  I also merged the daily metadata based on the date column.  Through exploratory data analysis, I also removed any outliers from the data, leaving me with ~18K observations.
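The “Date and Hour” join can be pictured with a small standard-library sketch.  The field names (time, wait_minutes, temperature_f) and the helper names are hypothetical, and the rows are made up; the point is only that wait times recorded every few minutes and weather recorded hourly line up once both are truncated to the hour.

```python
from datetime import datetime


def date_and_hour(timestamp):
    """Build the join key by truncating a timestamp to the hour."""
    return timestamp.strftime("%Y-%m-%d %H")


def merge_weather(wait_rows, weather_rows):
    """Left-join hourly weather onto wait time rows via the shared key.
    Field names here are hypothetical."""
    weather_by_key = {date_and_hour(w["time"]): w for w in weather_rows}
    merged = []
    for row in wait_rows:
        key = date_and_hour(row["time"])
        weather = weather_by_key.get(key, {})  # empty if no weather that hour
        merged.append({
            "date_and_hour": key,
            "wait_minutes": row["wait_minutes"],
            "temperature_f": weather.get("temperature_f"),
        })
    return merged


wait_rows = [
    {"time": datetime(2018, 5, 1, 9, 7), "wait_minutes": 15},
    {"time": datetime(2018, 5, 1, 9, 42), "wait_minutes": 25},
    {"time": datetime(2018, 5, 1, 13, 5), "wait_minutes": 60},
]
weather_rows = [
    {"time": datetime(2018, 5, 1, 9, 53), "temperature_f": 78},
]
merged = merge_weather(wait_rows, weather_rows)
for row in merged:
    print(row)
```

Rows with no matching weather reading keep a missing temperature, which is the same behavior a left merge in a dataframe library would give.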

Multivariate Linear Regression

The multivariate linear regression model formula is:

$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dots + \beta_n x_n$$

Here, $\hat{y}$ is the predicted Wait Time, the $x_i$ are the features or columns in the data, $\beta_0$ is the intercept, and $\beta_1$, $\beta_2$, etc. are the coefficients of each feature.

Using Ordinary Least Squares Regression, and the train/test split method (80% train, 20% test), I first focused on only using the weather features in my model: Temperature, Dew Point, Humidity, Wind Speed, Wind Gust, Pressure, and Precipitation.  This resulted in an R² of 0.23 on my train set, and an R² of 0.20 on my test set, meaning the weather features alone explain 20% of the variance in wait times at Splash Mountain.  My mean squared error (MSE) was 709.8 and my root mean squared error (RMSE) was 26.6 minutes.
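For reference, the three metrics quoted here are related as in the sketch below.  The wait times passed to regression_metrics are made-up numbers used only to exercise the function; the last line checks that the reported RMSE is the square root of the reported MSE.

```python
import math


def regression_metrics(actual, predicted):
    """Return (R^2, MSE, RMSE) for paired actual and predicted values."""
    n = len(actual)
    mean_actual = sum(actual) / n
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    mse = ss_res / n
    return 1 - ss_res / ss_tot, mse, math.sqrt(mse)


# Made-up wait times in minutes, only to exercise the function.
r2, mse, rmse = regression_metrics([30, 45, 60, 90], [35, 40, 70, 80])
print(round(r2, 3), mse, round(rmse, 2))

# RMSE is the square root of MSE, so the reported pair can be checked.
reported_rmse = round(math.sqrt(709.8), 1)
print(reported_rmse)
```

Because RMSE is in the same units as the target, an RMSE of 26.6 can be read as a typical prediction error of roughly 27 minutes.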

I then proceeded to add in my time features of Month, Day of the Week, and Hour of the Day.  Since these were categorical features, I had to add them in as dummy variables.  This resulted in an R² of 0.60, three times the R² of my first model.  My MSE was halved to 354.0, and my RMSE was reduced to 18.8 minutes.
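Dummy variables turn each category into its own 0/1 column.  The sketch below shows the idea with the standard library; dummy_encode is a hypothetical helper, and dropping the first category is an assumption (a common convention for regressions with an intercept), not something stated above.

```python
def dummy_encode(values, drop_first=True):
    """One-hot encode a categorical column as 0/1 indicator columns.

    Dropping the first category avoids perfectly collinear columns when
    the regression also has an intercept.
    """
    categories = sorted(set(values))
    if drop_first:
        categories = categories[1:]
    columns = [f"is_{category}" for category in categories]
    rows = [[int(value == category) for category in categories]
            for value in values]
    return columns, rows


columns, rows = dummy_encode(["Sun", "Wed", "Sat", "Wed"])
print(columns)
print(rows)
```

With the first category dropped, a row of all zeros stands for that baseline category, and each coefficient is read as the difference in wait time relative to it.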

In my final model, I added the Ticket Season feature, which indicates if the One-Day ticket season is Peak, Regular, or Value.  I also removed the Temperature and Pressure features from the model, since they both had a p-value greater than 0.05 and were therefore not statistically significant.  This further increased my R² to 0.64, with my final MSE being 319.7 and my RMSE at 17.9 minutes.  In addition, my residuals were normally distributed, supporting my decision to use linear regression.


After examining the results, my regression model suggests the following guidelines for this time period for Splash Mountain:

Feature | Lower Wait Times | Higher Wait Times
Month (during this period) | May | June
Day of the Week | Sundays | Wednesdays
Hour | Before 10AM, After 8PM | 12PM-5PM
Ticket Season | Value | Peak
Weather | More humid, rain | Windy conditions

Future Work

Future work to improve the model may include incorporating:

  • Disney events – Several events occur throughout the day, such as parades, fireworks, and other shows.  Since many park visitors are watching these shows during these times, this reduces the number of people riding rides.
  • New rides/attractions – Many rides and new lands have been opening at the Disney parks.
  • Other rides and locations – Since this model was only based on Splash Mountain, I’d like to do the same for the other rides, as well as Disney World vs Disneyland, since each location brings different types of visitors.
  • Full year/multiple years – I’d extend the time period to a full year to get the full annual picture.