Build an Artist Recommender App in Only a Few Steps
I recently graduated from the 9-week intensive Le Wagon Data Science boot camp and wanted to round out my portfolio with an app that is simple but still shows my understanding of data models and how to work with them.
TL;DR
Creating an Artist Recommender app using the KNeighborsRegressor. Docs and app (in production) can be found here:
The Data
I decided to use a public Kaggle data set containing about 29k artists to create an artist recommender app which takes in an artist's name and returns the three artists most similar to your input. So follow along to see how far I got.
My Jupyter Notebook on the project contains a preliminary feature analysis, some minor data cleaning and a very simple Scikit-learn KNeighborsRegressor. Lastly I pushed my model to production and deployed a Streamlit one-pager on Heroku.
Feature analysis
The data set stems from a Spotify song database of over 1M songs and is grouped by unique artists. It provides the following features:
'artists', 'genres', 'acousticness', 'danceability', 'duration_ms', 'energy', 'instrumentalness', 'liveness', 'loudness', 'speechiness', 'tempo', 'valence', 'popularity', 'key', 'mode', 'count'
The features are mostly average values of the aggregated song database. ‘genres’ is a list; ‘key’ and ‘mode’ are the most frequent values of the aggregation.
Most features are evenly distributed. Some have outliers or 0-values:
# Plotting distributions
import matplotlib.pyplot as plt

df.hist(bins=50, figsize=(25, 15))
plt.show()
- liveness is skewed to the right => most songs don't show much liveness
- loudness is skewed to the left => songs are usually on the quieter side
- key: 'G' is used the most among all the given artists => this may derive from its unique harmony and easy playability (on a guitar 🎸, e.g.)
- instrumentalness seems to have a lot of non-instrumental artists => could be audio books?
Feature preprocessing
Looking at the genres feature, I see some potential to incorporate it into the data set. However, the genres are stored as lists (arrays) which have to be unpacked and put into separate columns. I decided to take the 50 most popular genres and one-hot encode them over the data set. The NULL values were then imputed with the most common genre, ‘Rock’.
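The genre encoding described above can be sketched roughly as follows. This is a minimal illustration on a toy frame, not the notebook's exact code; the column names and the `genre_` prefix are my own assumptions.

```python
import pandas as pd

# Toy frame standing in for the Kaggle set (column names are assumptions)
df = pd.DataFrame({
    "artists": ["A", "B", "C"],
    "genres": [["rock", "pop"], ["rock", "jazz"], []],
})

# Count genre frequency across all artists and keep the top N (50 in the post)
top_genres = df["genres"].explode().value_counts().head(50).index

# One-hot encode: one indicator column per popular genre
for g in top_genres:
    df[f"genre_{g}"] = df["genres"].apply(lambda lst, g=g: int(g in lst))

# Artists with none of the top genres get imputed with the most common one
indicator_cols = [f"genre_{g}" for g in top_genres]
no_genre = df[indicator_cols].sum(axis=1) == 0
df.loc[no_genre, f"genre_{top_genres[0]}"] = 1
```

`explode` flattens the lists so `value_counts` can rank genres across the whole set; the final step mirrors the imputation with the most common genre.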
Although we have a lot of missing popularity entries, I would like to keep the feature in the data set. The recommender should be able to recommend artists which are similarly popular. Missing values will be imputed with the median value.
# replacing with the median (distribution is quite normal)
median_pop = df_enc[df_enc['popularity'] > 0].popularity.median()
df_enc['popularity'] = df_enc.popularity.replace(0.0, median_pop)
About 65% of the instrumentalness values are close to 0 (<0.1). The feature will be dropped; any imputation attempt would lead to a biased model.
The Model
As mentioned, the regression model is kept rather simple. Since there is no target to predict, I only aimed to find the nearest neighbors in the set.
from sklearn.neighbors import KNeighborsRegressor
from joblib import dump

# Define X and y
X = df_enc.drop(columns=['artists', 'genres', 'target'])  # Remove non-numerical features
y = df_enc['target']  # filled with '0' values

knn_model = KNeighborsRegressor().fit(X, y)  # Instantiate and train model

# save the model for production
dump(knn_model, '../model.joblib')
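On the production side, the saved model only needs to be loaded back with joblib before querying it for neighbors. A self-contained sketch of that round trip, using random stand-in data instead of the real feature frame (the file path is my assumption):

```python
import numpy as np
from joblib import dump, load
from sklearn.neighbors import KNeighborsRegressor

# Fit a stand-in model (the real X comes from the preprocessed data set)
X = np.random.rand(20, 5)
y = np.zeros(20)  # dummy target, as in the notebook
dump(KNeighborsRegressor().fit(X, y), "model.joblib")

# ... later, inside the deployed app, load it back and query it
model = load("model.joblib")
dist, idx = model.kneighbors(X[:1], n_neighbors=4)
```

Since the query point is part of the training set, the closest neighbor is the point itself at distance 0 — which is exactly why the recommender later asks for one neighbor more than it returns.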
The Input
I created a small search engine with regex and Python string operations to make it easier for the user to pass an artist to the recommender app. The search engine removes punctuation, trims whitespace and replaces special characters with their nearest ASCII relatives, both in the user's input term and in all values of the artists column of the data set. It then looks for direct matches in the data set. If no matches are found, it iterates through each word in the search term and tries to match it with each word of each artist string.
If over 50% of the searched words match with an artist, the search engine will assemble a list of alternatives and return it to the user. For example, the user searches for ‘beatles’ but the data set uses the string ‘the beatles’. 50% of the words will still match and ‘the beatles’ will be returned as an alternative. This method is flawed in that the search must not consist only of articles, pronouns or other frequently used terms. To tackle this I would have to remove all stop words from the set (and from the search term).
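The normalization and word-overlap matching described above could look roughly like this. The function names and the exact regexes are my own; the notebook's implementation may differ in detail.

```python
import re
import unicodedata

def normalize(text: str) -> str:
    """Strip punctuation, trim whitespace, and map special characters
    to their closest ASCII relatives (e.g. 'é' -> 'e')."""
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode()
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def match_artist(query: str, artists: list[str], threshold: float = 0.5):
    """Return direct matches, or artists sharing over 50% of the query's words."""
    q = normalize(query)
    # 1) try direct matches first
    direct = [a for a in artists if normalize(a) == q]
    if direct:
        return direct
    # 2) fall back to word-by-word matching
    q_words = set(q.split())
    hits = []
    for a in artists:
        overlap = q_words & set(normalize(a).split())
        if len(overlap) / len(q_words) > threshold:
            hits.append(a)
    return hits
```

For example, `match_artist("beatles", ["the beatles", "queen"])` finds no direct match, but the word "beatles" overlaps with "the beatles", so that artist comes back as an alternative.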
To see exactly how I did that, check out the section in the documenting notebook HERE.
The Output
If the search eventually matches the user's input, I simply call the .kneighbors() method on the fitted model and look for the 4 (!) closest neighbors to the input. As a result I get 3 alternative artists and their distances to the input artist.
def recommend_artist(artist, model=knn_model, df=df_enc, neighbors=3):
    """
    Find the nearest neighbors of the desired artist.
    Pass the artist's name, the fitted model, the number of neighbors
    and the pd.DataFrame (df) the model was fitted on.
    Returns a list of recommended artists similar to the input artist.
    """
    inpt = finder(artist, data=df)
    if isinstance(inpt, pd.DataFrame):
        X = inpt.drop(columns=['artists', 'genres', 'target'])
        # distances and indices of the neighbors+1 closest points
        # (the closest point is always the input artist itself)
        dist, nearest = model.kneighbors(X, n_neighbors=neighbors + 1)
        indexes = nearest[0]
        dist = dist[0][1:]  # drop the input artist's own (zero) distance
        dist_fin = 1 - dist / (dist.max() + dist.max() / .2)  # rescale into a nicer range for display
        return np.array(df.artists[indexes[1:]]), dist_fin
    else:
        print(inpt)
The App
To make this wonderful app available to indecisive listeners, I decided to deploy it on Heroku. I used Streamlit as the framework, which is very intuitive and perfect for smaller projects like this.
You can find it here and play around with it yourself: