I love listening to music. I also love making playlists. Typically, when I make a playlist, I try to make it as cohesive as possible. In other words, I want all the songs to sound fairly similar, or to have a similar feel or vibe. On the other hand, I strive to have an eclectic taste in music: I want to listen to a wide breadth of music, and to continue broadening my musical horizons.
So... cool... I want to have cohesive playlists, and eclectic listening overall (overall listening could probably be approximated by all my liked songs). But those are qualitative descriptions, and might be considered subjective.
Or are they?
What if there was a way to quantify how cohesive or eclectic a playlist is?
Enter: The Playlist Cohesometer
I listen to music on Spotify. Every year in December, Spotify gives users a report of their listening from that year (they call it \<year> Wrapped (eg, 2020 Wrapped)). One of the metrics they provide is the number of genres listened to that year, along with a list of top genres. I wouldn't be surprised if most Spotify users couldn't name more than 30 genres, so naturally, seeing that they listened to hundreds of genres, or that their top genre is something they've never heard of (take "escape room" for example), might be confusing or intriguing to many people. I've been among those intrigued and confused for a few years running. This year I looked into what data was available through the Spotify API (which includes artists' genres), and was fascinated by what I found.
This will be our entry point for The Playlist Cohesometer.
My goal with this project is to find a way to score a Spotify playlist on a scale from eclectic to cohesive.
Along the way, however, I aim to guide the reader through the data science pipeline, and to leave the reader feeling with a better understanding of something, whether that be the data science pipeline at large, a specific data science tool, musical genres, or the importance of multithreading. I'll be linking to various data sources, libraries, and miscellaneous knowledge sources throughout, so if you are interested in something I urge you to click some links and get a deeper understanding of a topic.
Without further ado, let's get started!
As mentioned, we'll start with Spotify's Web API to get data about some playlists. The first thing we'll need to do here is get an access token so Spotify will let us get the data we want.
Spotify's API is not the most straightforward I've seen, and their authorization process is a big part of that. There are four ways to get authorized (they call them Authorization Flows). Since we only need to examine publicly available data, we're going to use the Client Credentials Authorization Flow. For more information about this Authorization Flow and the other three, take a look here.
We're going to need a way to make HTTP requests (requests), and a way to convert strings to base 64 (base64).
import base64
import requests
To get the all-important authorization credentials, you'll have to register an app. Instructions can be found here. Then you can get a client ID and a client secret. I've copied my client ID below, and I put the client secret in a text file called spot_api_secret.txt.
client_id = 'b9a089193ceb44b1b4c22c1b09e9dd6d'
with open('spot_api_secret.txt', 'r') as secretfile:
    secret = secretfile.read().strip()
Here's what I consider the tricky part, just because it isn't explained very clearly in the documentation. The client ID must be joined with the client secret, with a colon as the delimiter, then it must be encoded as a base 64 string. Python's base 64 conversion likes to do its converting in bytes, so we'll have to do some extra converting.
credentials = client_id + ':' + secret
creds_bytes = credentials.encode('ascii')
base64_bytes = base64.b64encode(creds_bytes)
base64_creds = base64_bytes.decode('ascii')
Now we can make our request. We specify what Authorization Flow we're using in the body of the request, and we put the base 64 converted credentials in a header (with "Basic " preceding it). If all goes well, our access token should be in the response.
body = {'grant_type': 'client_credentials'}
headers = {'Authorization': 'Basic ' + base64_creds}
r = requests.post('https://accounts.spotify.com/api/token', data=body, headers=headers)
if not r.ok:
    print(r.json())
oauth_tok = r.json()['access_token']
With our access token, we can now explore all the publicly available data from the Spotify API.
We're interested in looking at the genres in a playlist. Internally, Spotify's hierarchy is as follows: playlists have songs, which have artists, which have genres. So in order to get a playlist's genres, we'll first have to get their tracks, then we'll have to get those tracks' artists, then we'll have to get those artists' genres.
Spotify playlists (as well as tracks, artists, and albums) are identified by Spotify IDs (information here). These can be found by right clicking a playlist in the Spotify desktop client (specifics here). I'll grab a number of my playlists, as well as a few playlists made by The Sounds of Spotify. In the following dictionary, these are the ones preceded by TSo (for "The Sound of..."). They are playlists made entirely of one genre of music, for the most part. The Sound of Everything is an exception, as it has one song from each of the 5437 genres Spotify acknowledges. This could probably be considered the most eclectic playlist possible.
In order to get information about these playlists, we'll make requests to the playlist tracks endpoint, so we'll transform this dictionary of names and IDs to a dictionary of names and API endpoints, using a dictionary comprehension (more info on dict comprehensions here and here).
playlist_ids = {'the original source': '1phYgoNC1JfaPCYjeGG3qs',
'test': '23HGYoppjJmcdc7wu8cnCU',
'hodge podge': '2B2rzS6EltNSUGB3ydhEvX',
'finds': '4hqF0FM2RNSJbIWt9os18B',
'Trance': '1g2LDmft1WbH5Tdj5bQOHU',
'trancee': '7xX0QsUDpqTbV8Sv5OkkuX',
'jazze': '3ltMQDT0SUEazFuKLsQ8ST',
'lofi.': '6ZZKPNj04Dkdcaa56NW8VG',
'alt 1 :)': '4AfSnWDVWMeA6NeNjJjYsS',
'technical': '0oTpkoyG96RUUkycqAb7O4',
'deep synth': '6ffhOVyOSyMaZzgyNqE8gf',
'synth (and words)': '3UbXvzFXN4tXRklderuxMb',
'TSo Chamber Psych': '6rirvdbul7rDunT5SP5F4m',
'My Liked Songs': '6TyaE4jhfivkgIymnH7URL',
'TSo Everything': '69fEt9DN5r4JQATi52sRtq',
'TSo Dream Pop': '2A5zN7OTP4n64gEtsFEO2Z'
}
url_fmt = 'https://api.spotify.com/v1/playlists/{}/tracks?market=US&fields=next%2Citems(is_local%2Ctrack.artists)'
playlists = {name:url_fmt.format(pl_id) for (name, pl_id) in playlist_ids.items()}
Coding for data science is almost never something that will be written and run successfully the first time. For that reason, modularizing sections of code is a great way to avoid frustration. Jupyter Notebooks are one great way to do this. Defining functions is another great way. I like to do both. The first function we'll define will take a properly formatted url to the playlist tracks Spotify API endpoint, and return a list of the track data Spotify returns. This endpoint only returns 100 tracks at a time, so for playlists longer than 100 songs, multiple requests will have to be made, and their results concatenated together.
import time
def get_pl_tracks(url):
    headers = {'Accept': 'application/json',
               'Authorization': 'Bearer ' + oauth_tok,
               'Content-Type': 'application/json'}
    r = requests.get(url, headers=headers)
    if not r.ok:
        print(r)
        code = r.json()['error']['status']
        if code == 429:  # rate limited: wait as instructed, then retry
            print('Retry After: ', r.headers['retry-after'])
            time.sleep(int(r.headers['retry-after']))
            r = requests.get(url, headers=headers)
    r = r.json()
    tracks = r['items']
    if r['next']:  # this endpoint pages 100 tracks at a time; follow the next link
        tracks = tracks + get_pl_tracks(r['next'])
    return tracks
As a quick test, I'll run this function with one playlist and compare the length of the result with the number of songs in the playlist (1458).
r = get_pl_tracks(playlists['alt 1 :)'])
len(r)
1458
At the very least, it has the right number of tracks! I'd also recommend looking through the results you get if you try something similar. The json for each track can be navigated like a dictionary in python. It is important to see what kinds of things are available to you when you're collecting data. I did a solid amount of poking around, but removed it because it was rather aimless and had a lot of long output.
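To make that poking around concrete, here's a tiny sketch of what navigating one track's json looks like. The sample data below is made up for illustration; the field names (`is_local`, `track`, `artists`) are the ones the next function relies on.

```python
# A made-up track dict shaped like what the playlist tracks endpoint
# returns with the fields filter used above (sample values are fictional).
track = {
    'is_local': False,
    'track': {'artists': [{'id': '5BvJzeQpmsdsFp4HGUYUEx',
                           'name': 'Some Artist'}]}
}

# Each track's json can be navigated like nested dictionaries:
artists = track['track']['artists']
first_artist_id = artists[0]['id']
print(first_artist_id)  # 5BvJzeQpmsdsFp4HGUYUEx
```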
Next, we'll define a function to get the artists from the songs in a playlist, and we'll keep track of how many times an artist occurs across all tracks in a playlist (using the Counter collection). Here, we're presented with our first major decision: how do we deal with songs with multiple artists? This is closely related to another decision we'll have to make soon: how do we deal with artists with multiple genres? We could just add all the genres for all the artists on a song, but consider a case where a song has two artists that are identified as Dutch R&B: would that song be twice as "Dutch R&B" as a song with only one such artist? No. Additionally, if a song is a collaboration between two artists from vastly different genres, is that song both of those genres? Yes. But it probably sounds like a mix between the two, not fully either one (this is based purely off my listening experience). For this reason, when we encounter a song with multiple artists, we will normalize the count by the number of artists. In this code, I will use the modifier "adjusted" (or "adj") instead of "normalized", because we will do some more blatant normalization later.
from collections import Counter
def get_pl_artists(url):
    tracks = get_pl_tracks(url)
    artists_adj = Counter()
    for track in tracks:
        if not track['is_local']:
            for artist in track['track']['artists']:
                artists_adj[artist['id']] += 1.0/len(track['track']['artists'])
    return artists_adj, len(tracks)
We'll test it out again:
plartists, num_tracks = get_pl_artists(playlists['alt 1 :)'])
print(list(plartists.items())[27])
('5BvJzeQpmsdsFp4HGUYUEx', 14.5)
I've printed out the Spotify ID and count of an artist in this playlist whose count is not a whole number, as an example of the normalizing we did as we were counting.
Finally, we'll define a function to get all the genres in the playlist, using the function we made to get a playlist's artists. When we count genres, we will weight the count by how many times a particular artist with that genre appears in the playlist. We will not adjust by the number of genres an artist has (having 2 genres versus 1 does not make an artist less of each, based on my experience). We will, however, normalize the final counts by the number of tracks in the playlist. Ideally, our analysis should be irrespective of the size of the playlist, as a small playlist can be eclectic and a large playlist can be cohesive, and vice versa. In the code, I will identify this as "track normalized", or "tnormd". In the end, the value this function returns for a given genre should be the percent of songs in the playlist that may be identified as that genre. We'll also return our data in descending order of frequency, using Counter's most_common method.
def get_pl_genres(pl_url):
    artist_cts, num_tracks = get_pl_artists(pl_url)
    artist_ids = list(artist_cts.keys())
    request_list = []
    artists_info = []
    genre_cts = Counter()
    # the artists endpoint accepts at most 50 IDs per request
    while artist_ids:
        request_list.append(','.join(artist_ids[:50]))
        artist_ids = artist_ids[50:]
    for ids in request_list:
        artists_url = 'https://api.spotify.com/v1/artists?ids={}'.format(ids)
        headers = {'Accept': 'application/json',
                   'Authorization': 'Bearer ' + oauth_tok,
                   'Content-Type': 'application/json'}
        r = requests.get(artists_url, headers=headers)
        if not r.ok:
            print(r)
            code = r.status_code
            while code == 429:
                print('Retry After: ', r.headers['retry-after'])
                time.sleep(int(r.headers['retry-after']))
                r = requests.get(artists_url, headers=headers)
                code = r.status_code
        r = r.json()
        artists_info = artists_info + r['artists']
    for artist in artists_info:
        genres = artist['genres']
        for genre in genres:
            # weight each genre by the artist's adjusted count in the playlist
            genre_cts[genre] += artist_cts[artist['id']]
    tnormd_genres = Counter({k:(v/num_tracks) for k, v in genre_cts.most_common()})
    return tnormd_genres, num_tracks
Once again, we'll test out our function. I've made a playlist with only instrumental rock, so this should return a Counter (which is a subclass of dict) with one item, whose value is 100% (1.0).
genres, num_tracks = get_pl_genres(playlists['test'])
print(list(genres.items())[0])
('instrumental rock', 1.0)
Alright, we have some data! Let's see what we can glean from it!
After getting some data, typically you'll want to familiarize yourself with it (even if you think you know about it already). Some great ways to do that are to find ways to visualize it, and to find statistics to summarize it.
The meat of the data we have is just one metric, but we have it across a range of categories. This sounds like a job for a bar chart. We'll be using matplotlib to make some bar charts for our data. Specifically, we'll use the Axes.bar method. We'll write another function to do our plotting. On the x-axis, we'll put our genres, and on the y-axis, we'll put the relative genre frequency we found earlier. Since our data is already sorted by descending frequency, we can expect to see some sort of decreasing curve if we follow the tops of the bars. I believe this shape could be interesting and illuminating, and the heights (relative frequencies) could be interesting as well.
import matplotlib.pyplot as plt
def plot_genre_distributions(pl_names=playlists.keys()):
    all_genre_distributions = {}
    pls_to_plot = {k:v for k, v in playlists.items() if k in pl_names}
    nrows = int((len(pls_to_plot)+1)/2)
    ncols = 2 if len(pls_to_plot) > 1 else 1
    fig, axs = plt.subplots(nrows=nrows, ncols=ncols, figsize=(ncols*7, 4*nrows))
    for i, (name, url) in enumerate(pls_to_plot.items()):
        genres, num_tracks = get_pl_genres(url)
        all_genre_distributions[name] = genres
        # plt.subplots returns a bare Axes, a 1D array, or a 2D array
        # depending on nrows/ncols, so index accordingly
        if len(pls_to_plot) == 1: ax = axs
        elif len(pls_to_plot) == 2: ax = axs[i]
        else: ax = axs[int(i/2)][i%2]
        labels = list(genres.keys())
        counts = list(genres.values())
        ax.bar(range(len(genres)), counts, width=1, tick_label=labels)
        title = 'Genre Distribution in ' + name
        ax.set_title(title)
        ax.set_ylabel('Genre Relative Frequency')
        plt.setp(ax.get_xticklabels(), rotation=-55, horizontalalignment='left')
        if len(genres) > 40: ax.get_xaxis().set_visible(False)
    fig.tight_layout()
    plt.show()
    return all_genre_distributions
As I mentioned earlier, modularization can save a lot of frustration when coding for data science. This is an excellent example of that. When this function is run on all playlists in the playlists dict (this is the default), it takes about 40 seconds to run. When you are just changing how charts look, though, sometimes you want to make a quick change and see how it looks. I've left, commented out, an example of a call I used when adjusting how the graphs looked. Looking at only 3 playlists took the time down significantly.
# plot_genre_distributions(['lofi.', 'alt 1 :)', 'technical'])
genre_distributions = plot_genre_distributions()
These look pretty good! It seems for the most part that they follow a sort of exponential decay shape, which makes sense to me intuitively. We also see top genre relative frequencies that are all over the map. The genre-specific playlists (The Sound of Chamber Psych, The Sound of Dream Pop) have top genre relative frequencies between 90% and 100%, which makes sense. It's interesting that even the most eclectic playlist possible (The Sound of Everything) shares the same general shape as the rest of the playlists' curves, just scaled down significantly (its top genre relative frequency seems to be about 0.85%). Unfortunately, most of these playlists have too many genres to be legibly labeled, but among the few that can be, there are some interesting things to note. Looking at "technical," the top 3 genres are instrumental math rock, instrumental rock, and math rock, all with relative genre frequencies between 40% and 70%. Even someone unfamiliar with these genres could tell by the names that they're very similar.
These graphs are fun to look at, but let's stop trying to guess the values for things like the top genre's relative frequency. Next, we'll gather some statistics for these playlists.
In order to keep track of our statistics, we'll use pandas dataframes (pandas docs, dataframes). We'll keep track of a number of things we think may be cool to look at. I'll define them below.
- `top_genre`: the genre with the highest relative frequency.
- `num_genres`: the number of distinct genres that occur in the playlist. Generally, we might expect this to be higher for more eclectic playlists and lower for more cohesive playlists, but it may be more strongly correlated to the size of the playlist.
- `num_tracks`: the number of songs in the playlist. This will be interesting for us to look at, but ideally we'll want our analysis to be independent of this column.
- `tracks:genres`: the number of tracks in the playlist divided by the number of genres. Generally, we might expect this to be lower for more eclectic playlists and higher for more cohesive playlists.
- `genres:tracks`: the inverse of the previous metric. Generally, we might expect this to be higher for more eclectic playlists and lower for more cohesive playlists.
- `avg_relfreq`: the arithmetic mean of the relative frequency values across genres. Generally, we might expect this to be lower for more eclectic playlists and higher for more cohesive playlists.
- `top_gen_relfreq`: the relative frequency of the top genre. Generally, we might expect this to be lower for more eclectic playlists and higher for more cohesive playlists.
- `top:tot_gen`: the top genre relative frequency normalized by the number of genres. Generally, we might expect this to be lower for more eclectic playlists and higher for more cohesive playlists.

Once again, we'll define a function to perform this task, to allow us to test our function quickly as we create it.
import pandas as pd
def get_playlist_stats(pl_names=playlists.keys()):
    pls_to_summarize = {k:v for k, v in playlists.items() if k in pl_names}
    results = pd.DataFrame(columns=['name', 'top_genre',
                                    'num_genres', 'num_tracks', 'tracks:genres',
                                    'genres:tracks', 'avg_relfreq',
                                    'top_gen_relfreq', 'top:tot_gen'])
    for name, url in pls_to_summarize.items():
        genres, num_tracks = get_pl_genres(url)
        num_genres = len(genres)
        top_genre = genres.most_common(1)[0]
        result = {'name': name,
                  'num_genres': float(num_genres), # casting as floats to be able to examine correlation later
                  'num_tracks': float(num_tracks), #
                  'top_genre': top_genre[0],
                  'tracks:genres': num_tracks/num_genres,
                  'genres:tracks': num_genres/num_tracks,
                  'top:tot_gen': top_genre[1]/num_genres,
                  'top_gen_relfreq': top_genre[1],
                  'avg_relfreq': sum(genres.values())/num_genres,
                  }
        results = results.append(result, ignore_index=True)
    return results
pl_stats = get_playlist_stats()
pl_stats
<Response [429]> Retry After: 1
|   | name | top_genre | num_genres | num_tracks | tracks:genres | genres:tracks | avg_relfreq | top_gen_relfreq | top:tot_gen |
|---|------|-----------|------------|------------|---------------|---------------|-------------|-----------------|-------------|
0 | the original source | edm | 58.0 | 42.0 | 0.724138 | 1.380952 | 0.090722 | 0.682540 | 0.011768 |
1 | test | instrumental rock | 1.0 | 7.0 | 7.000000 | 0.142857 | 1.000000 | 1.000000 | 1.000000 |
2 | hodge podge | pop | 265.0 | 224.0 | 0.845283 | 1.183036 | 0.012930 | 0.166295 | 0.000628 |
3 | finds | modern rock | 356.0 | 174.0 | 0.488764 | 2.045977 | 0.009346 | 0.080460 | 0.000226 |
4 | Trance | electronica | 59.0 | 67.0 | 1.135593 | 0.880597 | 0.064255 | 0.656716 | 0.011131 |
5 | trancee | electronica | 54.0 | 38.0 | 0.703704 | 1.421053 | 0.072531 | 0.785088 | 0.014539 |
6 | jazze | jazz | 89.0 | 72.0 | 0.808989 | 1.236111 | 0.057220 | 0.413194 | 0.004643 |
7 | lofi. | lo-fi beats | 66.0 | 541.0 | 8.196970 | 0.121996 | 0.027540 | 0.688848 | 0.010437 |
8 | alt 1 :) | modern rock | 497.0 | 1458.0 | 2.933602 | 0.340878 | 0.008769 | 0.326360 | 0.000657 |
9 | technical | instrumental math rock | 40.0 | 50.0 | 1.250000 | 0.800000 | 0.094583 | 0.656667 | 0.016417 |
10 | deep synth | spacewave | 38.0 | 21.0 | 0.552632 | 1.809524 | 0.082707 | 0.238095 | 0.006266 |
11 | synth (and words) | chillwave | 140.0 | 140.0 | 1.000000 | 1.000000 | 0.026480 | 0.164286 | 0.001173 |
12 | TSo Chamber Psych | chamber psych | 209.0 | 388.0 | 1.856459 | 0.538660 | 0.020480 | 0.919029 | 0.004397 |
13 | My Liked Songs | modern rock | 1058.0 | 5415.0 | 5.118147 | 0.195383 | 0.003901 | 0.149554 | 0.000141 |
14 | TSo Everything | classical | 5431.0 | 5437.0 | 1.001105 | 0.998896 | 0.000583 | 0.011669 | 0.000002 |
15 | TSo Dream Pop | dream pop | 218.0 | 434.0 | 1.990826 | 0.502304 | 0.033873 | 0.975806 | 0.004476 |
This is interesting to parse through. Let's look at how some of these things met or did not meet our expectations.
`genres:tracks` is a very interesting column. We expected this to be higher for more eclectic playlists. The Sound of Everything, which we noted earlier is probably the most eclectic playlist possible, has a value of 0.998896. The other two "The Sound of..." playlists (which might be considered some of the most cohesive) have values close to 0.5. Finds, trancee, and deep synth, on the other hand, have values that are significantly above 1. This indicates that there's probably more to this statistic than simply "higher = more eclectic" and "lower = more cohesive". It could be a much more complex relationship between multiple factors, or it could just be irrelevant to what we're considering. Results like this are great, in my opinion. Sure, it's fun to have your expectations met, but there's a lot to learn about our own assumptions when we find that our expectations were wrong. In data science, it's important to address these situations rather than trying to cover them up. Nobody could be expected to guess everything correctly; otherwise, there would be no reason to do data science.
As `tracks:genres` is the mathematical inverse of `genres:tracks`, it too could be considered more complex (or, alternatively, less important) than we anticipated.
`avg_relfreq` is interesting too. With the exception of the highly contrived test playlist, this tends to be very low for most playlists. It is very, very low for The Sound of Everything (0.000583), but most are one to two orders of magnitude greater. The part I find interesting is in comparing The Sound of [genre] playlists and playlists like trancee, jazze, technical, and deep synth. While these are playlists I'd like to call very cohesive, I probably wouldn't say they're more cohesive than playlists designed to showcase single genres. What this statistic seems to be more correlated with, in general, is the number of tracks: the more tracks there are, the lower the average relative frequency. This might suggest we shouldn't focus too hard on it, or at least that we should be strongly aware of its correlation with the number of tracks.
`top:tot_gen` falls into the same category as `avg_relfreq` here: it seems to be most strongly correlated with `num_tracks`.
`top_gen_relfreq` seems to align the most with our expectations. It may be interesting to focus mostly on the ends of the graphs where relative frequency is higher.
Let's confirm (or disprove) our speculations about correlations with pandas and seaborn, a library that works well for making heatmaps (docs, tutorial). We'll be looking for correlations between `num_tracks` and any other column, and it will also be interesting to note any other particularly strong correlations.
import seaborn as sbn
correlations = pl_stats.corr()
sbn.heatmap(correlations, annot=True)
plt.show()
`num_tracks` is most strongly correlated with `num_genres` (0.79) and, interestingly, `top_gen_relfreq` (-0.48). Once again, my expectations have been broken by the latter of those strong correlations. We also see weaker (but still present) correlations between `num_tracks` and both `genres:tracks` and `avg_relfreq`. What this tells us is that perhaps these columns will still be meaningful to consider: perhaps they are influenced by the number of tracks, but also by something else. That something else is what is particularly interesting to me.
I'm not surprised to see a very strong negative correlation between `tracks:genres` and `genres:tracks`, seeing as they are mathematically inversely related. Perhaps most interesting, outside of the `num_tracks` correlations, is the almost direct correlation between `top:tot_gen` and `avg_relfreq`. After some consideration, I believe this is because the sum of all the genre relative frequencies is dominated by the top genre. Once this connection is made, it seems obvious that [the sum of the relative frequencies divided by the number of genres] is almost directly correlated with [the top genre relative frequency divided by the number of genres]. This further leads me to believe it may be most rewarding to consider only the left side of our plots, where the relative frequencies are higher.
So far we've gotten some pretty interesting metrics, and we're on our way to being able to determine how cohesive or eclectic playlists are. However, there's still a big hole in our plan so far.
Consider two playlists. They each have high concentrations of three separate genres.
The first:
The second:
Clearly we would describe the first playlist as more cohesive and the second as more eclectic. Unfortunately, with only genre counts and distributions, and no information about the genres themselves, there's only so far this analysis will get. Obviously, there's a lot more to what makes two songs sound similar or different than what genre the artists are. Two songs from different genres may sound remarkably similar (see: instrumental rock, instrumental math rock, math rock). Unfortunately, the exact characteristics that may indicate whether songs sound similar would be incredibly hard to parse through. There is some interesting data available from the Spotify API for track audio features (`danceability`, `energy`, `key`, `loudness`, `mode`, `speechiness`, `acousticness`, `instrumentalness`, `liveness`, `valence`, `tempo`, `duration_ms`, and `time_signature` are available for all tracks), but surely someone has already done that...
everynoise.com to the rescue! Here's a description of Every Noise at Once from the website:
Every Noise at Once is an ongoing attempt at an algorithmically-generated, readability-adjusted scatter-plot of the musical genre-space, based on data tracked and analyzed for 5,437 genre-shaped distinctions by Spotify as of 2021-05-15. The calibration is fuzzy, but in general down is more organic, up is more mechanical and electric; left is denser and more atmospheric, right is spikier and bouncier.
(Retrieved 5/15/2021)
Every Noise at Once is made by Glenn McDonald, who is a "data alchemist" at Spotify. His work at Spotify includes creating people's Daily Mixes, and populating artists' Related Artists tabs (algorithmically, of course) [source]. If anyone could be called an expert at determining genres' relationships with each other, it's him.
The whole site is worth exploring if you're interested in music, but some parts are particularly useful for our purposes. Specifically, the genre list has a beautiful gem hidden in the html when sorted by "similarity to <genre>". Here are the first few rows of the list when sorted by similarity to trap:
Note the title of the first table data cell in each row (the lines where `class="note"`). Acoustic distance between genres is exactly what we need to be able to tell that a playlist with a lot of instrumental rock, instrumental math rock, and math rock is different from a playlist with a lot of reggaeton, shoegaze, and hard techno! The former should have low acoustic distances between its genres, and the latter should have much greater distances. The exact definition of acoustic distance is not entirely clear, but I am confident that there is a meaningful way Spotify/Glenn McDonald/everynoise.com calculates it. For reference, the distance between a genre and its closest genre is usually below 1, and the distance between a genre and its farthest genre is usually about 20 (the highest I've seen is around 34). Another important property is that these distances are bi-directional (the distance between russian synthpop and new rave is the same as the distance between new rave and russian synthpop). This will allow us to trim down how many times we have to look up a distance.
There is another metric listed with acoustic distance as well, called overlap. I will not use this because it quickly goes to zero for most genres (leaving thousands of genres tied with overlap of 0).
Before we get into scraping this data, it will be important to figure out how we're going to store it.
We could think of this as a weighted, undirected graph where every vertex is a genre, and the edge weights are the acoustic distances between two genres. This could be stored in an adjacency matrix as a numpy ndarray, or in a sql database... but this is not a trivial case when it comes to storage: in the worst case (a playlist with all the genres, cough cough, The Sound of Everything, cough cough), we have distances between all 5400-ish genres to store. An adjacency matrix would have 5400 * 5400 = over 29 million entries, which might just fit in memory, but it would be cutting it close. Since this is not a trivial case, we'll use a dedicated graph library. I've chosen NetworkX (docs).
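Here's a minimal sketch of that storage scheme; the genre pairs and distances below are made up for illustration:

```python
import networkx as nx

# Genres as nodes, acoustic distances as edge weights (fictional values).
G = nx.Graph()  # undirected, so distance(a, b) == distance(b, a) for free
G.add_edge('instrumental rock', 'instrumental math rock', weight=0.4)
G.add_edge('instrumental rock', 'hard techno', weight=18.2)

# Looking up a distance works in either direction:
print(G['hard techno']['instrumental rock']['weight'])  # 18.2
```

Using an undirected graph also bakes in the bi-directionality we noted earlier: each pair of genres is stored once, no matter which direction we query it from.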
Now that we've decided how we'll take care of the data we get, we can look at how to get the data. We'll scrape the data using BeautifulSoup (docs).
from bs4 import BeautifulSoup
r = requests.get('https://everynoise.com/everynoise1d.cgi?scope=all')
soup = BeautifulSoup(r.content, 'html.parser')
Our first goal will be to get a sense of how we can access the pertinent information. For this, we'll just take a look at the table we get from our soup.
table_soup = soup.find('table').prettify()
print(table_soup[:675]) # just showing the first row to try not to clutter the output too much
<table border="0" cellpadding="0" cellspacing="0">
 <tr class="" valign="top">
  <td align="right" class="note" style="font-size: 20px; line-height: 24px">
   1
  </td>
  <td style="font-size: 20px; line-height: 24px">
   <a class="note" href="https://embed.spotify.com/?uri=spotify:playlist:6gS3HhOiI17QNojjPuPzqc" onclick="linksync('https://embed.spotify.com/?uri=spotify:playlist:6gS3HhOiI17QNojjPuPzqc');" target="spotify" title="See this playlist">
    ☊
   </a>
  </td>
  <td class="note" style="font-size: 20px; line-height: 24px">
   <a href="?root=pop&scope=all" style="color: #AB890D" title="Re-sort the list starting from here.">
    pop
   </a>
  </td>
 </tr>
We can see that we'll need to go into each \<tr> and then into the third \<td> to get the genre names. I followed a similar process for the lists sorted by genre to find how to get the acoustic distance. I won't show that code or its output, since it's basically the same as above, and the results are already basically shown in the screenshot above. The final code I used to extract the acoustic distances will come later.
We'll look at all the rows of the table, and snag the genre names from each row, and append them to a list.
rows = soup.find("table").find_all("tr")
all_genres = []
for row in rows:
    cells = row.find_all("td")
    genrename = cells[2].get_text()
    all_genres.append(genrename)
print(len(all_genres))
all_genres[:10]
5437
['pop', 'dance pop', 'post-teen pop', 'rap', 'pop rap', 'rock', 'latin', 'hip hop', 'modern rock', 'trap latino']
This list is quite long (well, 5437 elements long, to be exact), so I printed only the first 10, along with the length, just to make sure we got everything. Now, we can begin getting the acoustic distances.
I learned an incredibly important lesson here, and I'd like to share it with you. We're going to need to make 5437 html requests. This is no trivial task. It will take a long time. Things will go wrong.
Remember when I said at the beginning that one of the things I want a reader to take away from this project is the importance of multithreading? Well, this is where that comes in. I tried collecting this data with a single process and a single thread at first, and wasted about 17 hours of my life. Over that time span, about 25% of the genres had data collected for them, and every graph insertion was taking longer than the last. Things were not looking good.
When I returned to my computer after 17 hours to find that only a fraction of the genres had been processed, I realized I needed a better solution.
Threading was that better solution. Python has a threading library built in (here's a useful guide to threading). Threading lets us perform tasks concurrently, making better use of the time the processor would otherwise spend idle waiting on network responses. Basically, we define a function for all the threads to run, then set them loose chipping away at the remaining genres to process.
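The project code below manages the shared work list with an explicit Lock, but it's worth knowing that Python's queue.Queue handles the locking of the work list internally. Here's a minimal sketch of the same worker-pool pattern with made-up items standing in for the genre list:

```python
import queue
import threading

# hypothetical work items standing in for the genre list
work = queue.Queue()
for genre in ['pop', 'rock', 'jazz', 'rap']:
    work.put(genre)

results = []
results_lock = threading.Lock()

def worker():
    while True:
        try:
            genre = work.get_nowait()  # Queue does its own locking
        except queue.Empty:
            return
        with results_lock:  # guard the shared results list
            results.append(genre.upper())

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))  # ['JAZZ', 'POP', 'RAP', 'ROCK']
```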
To find and extract the acoustic distance from the "title" attributes, we'll use regular expressions (docs, tutorial). I recommend a tool like regex101 for testing regular expressions if you're not totally comfortable with them.
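As a quick illustration, here's the extraction regex in isolation. The title string below is a hypothetical example in the shape the everynoise pages use:

```python
import re

# the pattern used throughout this section
dist_getter = re.compile(r'.*acoustic distance: (\d+\.\d+)')

# hypothetical "title" attribute in the shape the scraper expects
title = 'nearest genre, acoustic distance: 2.71'
m = dist_getter.search(title)
print(float(m.group(1)))  # 2.71
```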
The tricky part of threading is something called a data race (learn more about it and how to avoid it here). When a thread needs to write to shared data, it must be the only one accessing that data. To ensure this, we use a Lock (essentially a mutex, a special case of a semaphore). In our case, we need one lock for the list of genres left to process, and another for the graph where we're storing our distance relationships.
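Here's a minimal, self-contained sketch of the problem a Lock solves: several threads incrementing a shared counter. The increment is a read-modify-write, so without the lock two threads can read the same value and lose an update.

```python
import threading

counter = 0
counter_lock = threading.Lock()

def add_many(n):
    global counter
    for _ in range(n):
        # counter += 1 is a read-modify-write; the lock makes
        # sure only one thread runs it at a time
        with counter_lock:
            counter += 1

threads = [threading.Thread(target=add_many, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 40000
```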
I also implemented a timer, run by the main thread, to predict when the job would be complete. Information on the function I used for this can be found here.
Finally, I implemented a feature where the graph is backed up to a file every 15 minutes, so if something unexpected happened, I would still have the progress so far. I also save to this file at the end, so we only have to run this code block once (in the future we can just read the graph from the file). I did this with NetworkX's write_weighted_edgelist function. Note that by default this function delimits fields with whitespace, which is a problem if your node names contain whitespace. I used ** as the delimiter because no genre has that in its name (verifiable with ctrl-f on everynoise).
I will not be showing the output of this function because running it takes over 4 hours (even with 16 threads!).
```python
import networkx as nx
import re
import threading
import time
import requests
from bs4 import BeautifulSoup

# make graph where we'll store genre distances
genredists = nx.Graph()
# start the timer
start_time = time.time()
# make a copy of the genre list so we can remove from it
genreq = all_genres.copy()

def thread_task(list_lock, remaining, graph_lock, gr):
    dist_getter = re.compile(r'.*acoustic distance: (\d+\.\d+)')
    more_to_process = True
    # we will continue getting data until there is nothing left in genreq
    while more_to_process:
        # get the lock before editing the remaining genre list
        list_lock.acquire()
        if remaining:
            rootgenre = remaining.pop(0)  # critical section
            list_lock.release()
            # make the request
            url = 'https://everynoise.com/everynoise1d.cgi?root={}&scope=all'.format(rootgenre)
            r = requests.get(url)
            # troubleshooting faulty requests
            while not r.ok:
                print(r)
                r = requests.get(url)  # maybe another try will help
            soup = BeautifulSoup(r.content, 'html.parser')
            # get all rows
            rows = soup.find("table").find_all("tr")
            # in each row, find the name of the target genre and the acoustic distance
            for row in rows:
                cells = row.find_all("td")
                targetgenre = cells[2].find('a').get_text()
                distelem = cells[0]
                if not distelem.has_attr('title'):
                    print(str(distelem) + ' ' + rootgenre + ' ' + targetgenre)
                else:
                    title = distelem['title']
                    dist = float(dist_getter.search(title).group(1))
                    # get the lock for the graph before inserting an edge
                    graph_lock.acquire()
                    gr.add_edge(rootgenre, targetgenre, weight=dist)  # critical section
                    graph_lock.release()
        else:
            list_lock.release()
            more_to_process = False

# create the locks
list_lock = threading.Lock()
graph_lock = threading.Lock()
# we'll keep track of threads here
threadlist = []
for i in range(16):
    curthread = threading.Thread(target=thread_task, args=(list_lock, genreq, graph_lock, genredists))
    curthread.start()
    threadlist.append(curthread)

# the main thread will keep track of progress, display estimated time remaining,
# and back up the graph
more_to_process = True
backup_counter = 0
while more_to_process:
    list_lock.acquire()
    progress = len(all_genres) - len(genreq)  # critical section
    list_lock.release()
    print(str(100*progress/len(all_genres)) + '%')
    # back up every 15 minutes
    if backup_counter == 15:
        print('backing up...')
        graph_lock.acquire()
        nx.write_weighted_edgelist(genredists, 'genredistsfromthreads', delimiter='**')  # critical section
        graph_lock.release()
        print('backed up.')
        backup_counter = 0
    else:
        backup_counter += 1
    # progress report, then wait
    if progress < len(all_genres):
        elapsed_time = time.time() - start_time
        predicted_end_elapsed_time = len(all_genres)*elapsed_time/progress
        predicted_end_time = predicted_end_elapsed_time + start_time
        print('Predicted completion at ' + time.ctime(predicted_end_time))
        time.sleep(60)
    else:
        more_to_process = False

# when done, wait for all the threads
for t in threadlist:
    t.join()
# final write to filesystem
nx.write_weighted_edgelist(genredists, 'genredistsfromthreads', delimiter='**')
print('done')
```
The two things I pointed out when discussing making 5437 HTTP requests were that it would take a long time and that things would go wrong. These are expected. I've covered the first; now I'll cover the second. There's one fatal flaw in the above code that I only discovered a couple hours after setting the threads loose: I'm putting genre names right into a URL! In retrospect, I should have known not to do this, but it's a great learning experience for everyone.
Consider the genre "r&b" and its many variations. '&' is a special character in URLs (and so is '+'). Every time a thread requested a genre with a '&' or a '+' in it, everynoise just returned the main genre list page (with no distances). This was problematic for our purposes, but luckily it didn't break anything too badly (I expected things to go wrong, and I tried my best to prepare for them).
This had a rather easy fix: escape the problematic genre names and recollect the distances between problematic genres. I used urllib's quote function to do this (docs). I just added the missing edges to the graph and saved it to file again (only one thread needed this time).
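To see what quote does with the characters that bit us (the genre strings here are just examples):

```python
from urllib.parse import quote

# '&' and spaces both need escaping before going into a URL
print(quote('r&b'))        # r%26b
print(quote('dance pop'))  # dance%20pop
```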
```python
import urllib.parse

# genres whose names contain URL special characters
problematic_genres = list(filter(lambda x: ('&' in x) or ('+' in x), all_genres))
dist_getter = re.compile(r'.*acoustic distance: (\d+\.\d+)')
while problematic_genres:
    rootgenre = problematic_genres.pop(0)
    escapedrootgenre = urllib.parse.quote(rootgenre)
    url = 'https://everynoise.com/everynoise1d.cgi?root={}&scope=all'.format(escapedrootgenre)
    r = requests.get(url)
    while not r.ok:
        print(r)
        r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    rows = soup.find("table").find_all("tr")
    for row in rows:
        cells = row.find_all("td")
        targetgenre = cells[2].find('a').get_text()
        # only pairs of problematic genres are missing from the graph
        if targetgenre in problematic_genres:
            distelem = cells[0]
            if not distelem.has_attr('title'):
                print(str(distelem) + ' ' + rootgenre + ' ' + targetgenre)
            else:
                title = distelem['title']
                dist = float(dist_getter.search(title).group(1))
                genredists.add_edge(rootgenre, targetgenre, weight=dist)
nx.write_weighted_edgelist(genredists, 'genredistsfromthreadsfull', delimiter='**')
print('done')
```
Now that our graph is saved to the filesystem, we can load it back into a NetworkX graph with write_weighted_edgelist's counterpart, read_weighted_edgelist.
```python
import networkx as nx

genredists = nx.read_weighted_edgelist('genredistsfromthreadsfull', delimiter='**')
genredists.size()
```
```
14783175
```
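As a sanity check on the ** delimiter, here's the same write/read roundtrip on a toy graph (the filename toy_edges.txt is just for illustration):

```python
import networkx as nx

# toy graph with a whitespace-containing node name
g = nx.Graph()
g.add_edge('pop', 'dance pop', weight=0.12)

# '**' as the delimiter, since genre names can contain spaces
nx.write_weighted_edgelist(g, 'toy_edges.txt', delimiter='**')
print(open('toy_edges.txt').read().strip())  # pop**dance pop**0.12

# round-trip: read it back and recover the weight
g2 = nx.read_weighted_edgelist('toy_edges.txt', delimiter='**')
print(g2['pop']['dance pop']['weight'])
```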
Okay, so we have our graph of all the genre distances. Now we need to turn this into something useful regarding our playlists. To say something about the acoustic distance between a playlist's genres, we will combine our genre frequency data with our genre distance data. If one genre is particularly far from the others, but that genre occurs only rarely, I would still consider the playlist mostly cohesive. To combine these numbers, we will multiply the acoustic distance between two genres by the average relative frequency of the two genres in the playlist. Using the average relative frequency weights distances involving more prevalent genres higher, and distances between uncommon genres lower. (Imagine a scenario where the top genre has low-ish distance to all other genres, but some less common genres are far from each other. They're still close-ish to the bulk of the playlist, and the playlist would therefore still be considered fairly cohesive.)
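Concretely, with hypothetical numbers for two genres:

```python
# hypothetical relative frequencies of two genres in a playlist
relfreq = {'pop': 0.25, 'jazz': 0.125}
raw_dist = 2.0  # hypothetical acoustic distance between them

avg_relfreq = (relfreq['pop'] + relfreq['jazz']) / 2  # 0.1875
adjusted = raw_dist * avg_relfreq
print(adjusted)  # 0.375
```

A large raw distance between two rare genres thus contributes far less than the same distance between two prevalent ones.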
```python
def count_adjusted_distancegraph(genre_distributions, all_genre_distances):
    all_count_adj_dists = {}
    for name, genre_distr in genre_distributions.items():
        print('working on', name)
        count_adj_dists = nx.subgraph(all_genre_distances, genre_distr.keys()).copy()
        for from_gen, to_gen, data in count_adj_dists.edges(data=True):
            raw_dist = data['weight']
            avg_relfreq = (genre_distr[from_gen] + genre_distr[to_gen])/2
            adj_dist = raw_dist * avg_relfreq
            count_adj_dists[from_gen][to_gen]['weight'] = adj_dist
            count_adj_dists[from_gen][to_gen]['weight2'] = raw_dist
        all_count_adj_dists[name] = count_adj_dists
    return all_count_adj_dists
```
Now that we have a way to summarize how acoustic distance factors into a playlist's genres, we can incorporate this into our playlist-wise data. We'll create a new column for the average adjusted distance between all genres in the playlist. In this column, we can expect cohesive playlists to have smaller adjusted distances and eclectic playlists to have larger adjusted distances.
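The averages in the new columns rely on NetworkX's Graph.size, which returns the edge count by default and the sum of edge weights when given a weight key. A toy example:

```python
import networkx as nx

g = nx.Graph()
g.add_edge('a', 'b', weight=0.5)
g.add_edge('b', 'c', weight=1.5)

print(g.size())                 # 2   (number of edges)
print(g.size(weight='weight'))  # 2.0 (sum of edge weights)
# average edge weight = weighted size / edge count
print(g.size(weight='weight') / g.size())  # 1.0
```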
```python
d = {k: v for (k, v) in genre_distributions.items() if k in ['technical']}
a = count_adjusted_distancegraph(genre_distributions, genredists)
```
```
working on the original source
working on test
working on hodge podge
working on finds
working on Trance
working on trancee
working on jazze
working on lofi.
working on alt 1 :)
working on technical
working on deep synth
working on synth (and words)
working on TSo Chamber Psych
working on My Liked Songs
working on TSo Everything
working on TSo Dream Pop
```
```python
pl_stats['adjusted_distances'] = pl_stats['name'].map(lambda x: a[x].size(weight='weight')/a[x].size())
pl_stats['raw_distances'] = pl_stats['name'].map(lambda x: a[x].size(weight='weight2')/a[x].size())
pl_stats
```
  | name | top_genre | num_genres | num_tracks | tracks:genres | genres:tracks | avg_relfreq | top_gen_relfreq | top:tot_gen | adjusted_distances | raw_distances |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | the original source | edm | 58.0 | 42.0 | 0.724138 | 1.380952 | 0.090722 | 0.682540 | 0.011768 | 0.180952 | 2.224092 |
1 | test | instrumental rock | 1.0 | 7.0 | 7.000000 | 0.142857 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 |
2 | hodge podge | pop | 265.0 | 224.0 | 0.845283 | 1.183036 | 0.012930 | 0.166295 | 0.000628 | 0.035859 | 3.057757 |
3 | finds | modern rock | 356.0 | 174.0 | 0.488764 | 2.045977 | 0.009346 | 0.080460 | 0.000226 | 0.031185 | 3.658833 |
4 | Trance | electronica | 59.0 | 67.0 | 1.135593 | 0.880597 | 0.064255 | 0.656716 | 0.011131 | 0.158337 | 2.757615 |
5 | trancee | electronica | 54.0 | 38.0 | 0.703704 | 1.421053 | 0.072531 | 0.785088 | 0.014539 | 0.193318 | 2.859429 |
6 | jazze | jazz | 89.0 | 72.0 | 0.808989 | 1.236111 | 0.057220 | 0.413194 | 0.004643 | 0.166335 | 3.265737 |
7 | lofi. | lo-fi beats | 66.0 | 541.0 | 8.196970 | 0.121996 | 0.027540 | 0.688848 | 0.010437 | 0.065826 | 2.461002 |
8 | alt 1 :) | modern rock | 497.0 | 1458.0 | 2.933602 | 0.340878 | 0.008769 | 0.326360 | 0.000657 | 0.019832 | 2.631990 |
9 | technical | instrumental math rock | 40.0 | 50.0 | 1.250000 | 0.800000 | 0.094583 | 0.656667 | 0.016417 | 0.206686 | 2.472184 |
10 | deep synth | spacewave | 38.0 | 21.0 | 0.552632 | 1.809524 | 0.082707 | 0.238095 | 0.006266 | 0.206647 | 2.452852 |
11 | synth (and words) | chillwave | 140.0 | 140.0 | 1.000000 | 1.000000 | 0.026480 | 0.164286 | 0.001173 | 0.056141 | 2.372341 |
12 | TSo Chamber Psych | chamber psych | 209.0 | 388.0 | 1.856459 | 0.538660 | 0.020480 | 0.919029 | 0.004397 | 0.046829 | 2.747842 |
13 | My Liked Songs | modern rock | 1058.0 | 5415.0 | 5.118147 | 0.195383 | 0.003901 | 0.149554 | 0.000141 | 0.012819 | 3.799406 |
14 | TSo Everything | classical | 5431.0 | 5437.0 | 1.001105 | 0.998896 | 0.000583 | 0.011669 | 0.000002 | 0.002508 | 4.400012 |
15 | TSo Dream Pop | dream pop | 218.0 | 434.0 | 1.990826 | 0.502304 | 0.033873 | 0.975806 | 0.004476 | 0.068606 | 2.284304 |
Once again, our expectations have been contradicted, though this time I believe it is due less to the underlying data and more to our method of adjusting the distances. It seems the relative frequency term of our adjustment overpowered the distance term, and it also seems the playlist size plays a significant part in the adjusted distance. We will look at the correlations heatmap again to see what we can learn.
```python
correlations = pl_stats.corr()
sbn.heatmap(correlations, annot=True)
plt.show()
```
It seems that the strongest correlations with the adjusted distances are indeed the number of genres, the number of tracks, and also, interestingly, the ratio between them.
In an ideal world with infinite time and resources, I would like to explore other ways to incorporate genre acoustic distance into this analysis, ideally with less dependence on the size of the playlist and more dependence on the distance. Alas, this is not the case, and I must continue forward.
Hypothesis testing generally assumes a known result, or some way to verify your model. Unfortunately, I have no way of knowing more than qualitatively and subjectively how cohesive a playlist is. Therefore, we'll turn to unsupervised machine learning. This does not require data with anything known about it -- just data that may have some underlying quantity that is unknown (this is called a latent variable).
The method we will use is called Principal Component Analysis (PCA). We'll use the machine learning library scikit-learn; in particular, its implementation of PCA (docs). The goal of PCA is to find linear combinations of features that reduce the dimensionality of the data while explaining as much of its variance as possible. In other words, we are looking for a smaller number of underlying metrics that can describe the data (hopefully) nearly as well as all of the original features.
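To see what the explained variance ratio means, here's PCA on a tiny made-up dataset in which the second feature is exactly twice the first, so a single direction captures all of the variance:

```python
import numpy as np
from sklearn.decomposition import PCA

# toy data: column 2 is exactly 2x column 1
X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]])

pca = PCA(n_components=1)
pca.fit(X)
# one component explains (essentially) 100% of the variance
print(pca.explained_variance_ratio_)
```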
As we said earlier, we do not want our analysis to be dependent on the number of songs or genres in a playlist. For this reason, we will not consider either when we train our model. We will use the rest of the statistics and computations we've made so far, though.
```python
import numpy as np
from sklearn.decomposition import PCA

features = pl_stats[['tracks:genres', 'genres:tracks', 'avg_relfreq', 'top_gen_relfreq',
                     'top:tot_gen', 'adjusted_distances', 'raw_distances']]
pca = PCA(n_components=2)
pca.fit(features)
print(pca.explained_variance_ratio_)
new_cols = np.transpose(pca.transform(features))
pl_stats_analyzed = pl_stats.copy()
pl_stats_analyzed['principal_component_1'] = new_cols[0]
pl_stats_analyzed['principal_component_2'] = new_cols[1]
pl_stats_analyzed
```
```
[0.8587668 0.1091602]
```
  | name | top_genre | num_genres | num_tracks | tracks:genres | genres:tracks | avg_relfreq | top_gen_relfreq | top:tot_gen | adjusted_distances | raw_distances | principal_component_1 | principal_component_2 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | the original source | edm | 58.0 | 42.0 | 0.724138 | 1.380952 | 0.090722 | 0.682540 | 0.011768 | 0.180952 | 2.224092 | -1.431518 | 0.809935 |
1 | test | instrumental rock | 1.0 | 7.0 | 7.000000 | 0.142857 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 5.362862 | 1.920514 |
2 | hodge podge | pop | 265.0 | 224.0 | 0.845283 | 1.183036 | 0.012930 | 0.166295 | 0.000628 | 0.035859 | 3.057757 | -1.463473 | -0.127194 |
3 | finds | modern rock | 356.0 | 174.0 | 0.488764 | 2.045977 | 0.009346 | 0.080460 | 0.000226 | 0.031185 | 3.658833 | -2.083773 | -0.595777 |
4 | Trance | electronica | 59.0 | 67.0 | 1.135593 | 0.880597 | 0.064255 | 0.656716 | 0.011131 | 0.158337 | 2.757615 | -1.046092 | 0.205160 |
5 | trancee | electronica | 54.0 | 38.0 | 0.703704 | 1.421053 | 0.072531 | 0.785088 | 0.014539 | 0.193318 | 2.859429 | -1.573893 | 0.252113 |
6 | jazze | jazz | 89.0 | 72.0 | 0.808989 | 1.236111 | 0.057220 | 0.413194 | 0.004643 | 0.166335 | 3.265737 | -1.534280 | -0.240433 |
7 | lofi. | lo-fi beats | 66.0 | 541.0 | 8.196970 | 0.121996 | 0.027540 | 0.688848 | 0.010437 | 0.065826 | 2.461002 | 5.930100 | -0.989048 |
8 | alt 1 :) | modern rock | 497.0 | 1458.0 | 2.933602 | 0.340878 | 0.008769 | 0.326360 | 0.000657 | 0.019832 | 2.631990 | 0.785922 | -0.156508 |
9 | technical | instrumental math rock | 40.0 | 50.0 | 1.250000 | 0.800000 | 0.094583 | 0.656667 | 0.016417 | 0.206686 | 2.472184 | -0.866403 | 0.449045 |
10 | deep synth | spacewave | 38.0 | 21.0 | 0.552632 | 1.809524 | 0.082707 | 0.238095 | 0.006266 | 0.206647 | 2.452852 | -1.742142 | 0.547040 |
11 | synth (and words) | chillwave | 140.0 | 140.0 | 1.000000 | 1.000000 | 0.026480 | 0.164286 | 0.001173 | 0.056141 | 2.372341 | -1.151669 | 0.467716 |
12 | TSo Chamber Psych | chamber psych | 209.0 | 388.0 | 1.856459 | 0.538660 | 0.020480 | 0.919029 | 0.004397 | 0.046829 | 2.747842 | -0.276692 | 0.102760 |
13 | My Liked Songs | modern rock | 1058.0 | 5415.0 | 5.118147 | 0.195383 | 0.003901 | 0.149554 | 0.000141 | 0.012819 | 3.799406 | 2.681505 | -1.722328 |
14 | TSo Everything | classical | 5431.0 | 5437.0 | 1.001105 | 0.998896 | 0.000583 | 0.011669 | 0.000002 | 0.002508 | 4.400012 | -1.539881 | -1.439574 |
15 | TSo Dream Pop | dream pop | 218.0 | 434.0 | 1.990826 | 0.502304 | 0.033873 | 0.975806 | 0.004476 | 0.068606 | 2.284304 | -0.050574 | 0.516579 |
As printed before the dataframe, we were able to find a component that explains 85% of the variance in the data. That's great, as long as it's not simply a manifestation of track or genre count. To check, we will examine the correlations in this data again.
```python
correlations = pl_stats_analyzed.corr()
sbn.heatmap(correlations, annot=True)
plt.show()
```
As it turns out, the component that explains 85% of the variance is almost perfectly correlated with the track-to-genre ratio! On top of this, it is only weakly correlated with the total number of tracks and genres.
What have we learned from our exploration?
We discovered that it is possible to mostly differentiate between playlists based on a factor that is unrelated to the size of the playlist. Furthermore, that factor closely tracks data that is (relatively) easy to obtain: simply divide the number of tracks in a playlist by the number of genres. What we did not obtain was a way to score playlists on their cohesiveness. This metric takes seemingly arbitrary values when comparing relatively cohesive playlists with one another, and similarly for relatively eclectic playlists.
I still believe that with more time and effort, a way to score a playlist's cohesiveness could be determined. If I were to continue with this project, I would: