Analysis of Spotify Playlists

The Playlist Cohesometer

(aka... Playlist Eclectometer)

Christopher Baillie Olin


Background

I love listening to music. I also love making playlists. Typically, when I make a playlist, I try to make the playlist as cohesive as possible. In other words, I want all the songs to sound fairly similar, or have a similar feel or vibe. On the other hand, I strive to have an eclectic taste in music. In other words, I want to listen to a wide breadth of music, and to continue broadening my musical horizons.

So... cool... I want to have cohesive playlists, and eclectic listening overall (overall listening could probably be approximated by all my liked songs). But those are qualitative descriptions, and might be considered subjective.

Or are they?

What if there was a way to quantify how cohesive or eclectic a playlist is?

Enter: The Playlist Cohesometer

I listen to music on Spotify. Every year in December, Spotify gives users a report of their listening from that year (they call it \<year> Wrapped (eg, 2020 Wrapped)). One of the metrics they provide is the number of genres listened to that year, and a list of top genres. I wouldn't be surprised if most Spotify users couldn't name more than 30 genres, so naturally, seeing that they listened to hundreds of genres, or that their top genre is something they've never heard of (take "escape room" for example), might be confusing or intriguing to many people. I've been among those intrigued and confused for a few years running. This year I looked into what data was available through the Spotify API (which includes artists' genres), and was fascinated by what I found.

This will be our entry point for The Playlist Cohesometer.

This Project

My goal with this project is to find a way to score a Spotify playlist on a scale from eclectic to cohesive.

Along the way, however, I aim to guide the reader through the data science pipeline, and to leave the reader with a better understanding of something, whether that be the data science pipeline at large, a specific data science tool, musical genres, or the importance of multithreading. I'll be linking to various data sources, libraries, and miscellaneous knowledge sources throughout, so if you are interested in something I urge you to click some links and get a deeper understanding of a topic.

Without further ado, let's get started!


Gathering Data

As mentioned, we'll start with Spotify's Web API to get data about some playlists. The first thing we'll need to do here is get an access token so Spotify will let us get the data we want.

Spotify API Authorization

Spotify's API is not the most straightforward I've seen, and their authorization process is a big part of that. There are four ways to get authorized (they call them Authorization Flows). Since we only need to examine publicly available data, we're going to use the Client Credentials Authorization Flow. For more information about this Authorization Flow and the other three, take a look here.

We're going to need a way to make HTTP requests (requests), and a way to convert strings to base 64 (base64).

To get the all-important authorization credentials, you'll have to register an app. Instructions can be found here. Then you can get a client ID and a client secret. I've copied my client ID below, and I put the client secret in a text file called spot_api_secret.txt.

Here's what I consider the tricky part, just because it isn't explained very clearly in the documentation. The client ID must be joined with the client secret, with a colon as the delimiter, then it must be encoded as a base 64 string. Python's base 64 conversion likes to do its converting in bytes, so we'll have to do some extra converting.

Now we can make our request. We specify what Authorization Flow we're using in the body of the request, and we put the base 64 converted credentials in a header (with "Basic " preceding it). If all goes well, our access token should be in the response.
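A sketch of those two steps, the base 64 encoding and the token request (the client ID and secret here are placeholders; the endpoint and grant type follow Spotify's Client Credentials documentation):

```python
import base64

import requests


def encode_creds(client_id: str, client_secret: str) -> str:
    """Join the client ID and secret with a colon, then base64-encode.

    Python's base64 works on bytes, so we encode the string to bytes
    first and decode the result back to a string.
    """
    return base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()


def get_access_token(client_id: str, client_secret: str) -> str:
    """Request an access token via the Client Credentials Flow."""
    resp = requests.post(
        "https://accounts.spotify.com/api/token",
        data={"grant_type": "client_credentials"},  # the Flow goes in the body
        headers={"Authorization": "Basic " + encode_creds(client_id, client_secret)},
    )
    resp.raise_for_status()
    return resp.json()["access_token"]
```

With real credentials, `get_access_token(client_id, client_secret)` returns the token string to put in a Bearer header on later requests.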

With our access token, we can now explore all the publicly available data from the Spotify API.

Getting Playlist Data

We're interested in looking at the genres in a playlist. Internally, Spotify's hierarchy is as follows: playlists have songs, which have artists, which have genres. So in order to get a playlist's genres, we'll first have to get their tracks, then we'll have to get those tracks' artists, then we'll have to get those artists' genres.

Spotify playlists (as well as tracks and artists and albums) are identified by Spotify IDs (information here). These can be found by right-clicking a playlist in the Spotify desktop client (specifics here). I'll grab a number of my playlists, as well as a few playlists made by The Sounds of Spotify. In the following dictionary, these are the ones preceded by TSo (for "The Sound of..."). They are playlists made entirely of one genre of music, for the most part. The Sound of Everything is an exception to this, as it has one song from each of the 5437 genres Spotify acknowledges. This could probably be considered the most eclectic playlist possible.

In order to get information about these playlists, we'll make requests to the playlist tracks endpoint, so we'll transform this dictionary of names and IDs to a dictionary of names and API endpoints, using a dictionary comprehension (more info on dict comprehensions here and here).

Coding for data science is almost never something that will be written and run successfully the first time. For that reason, modularizing sections of code is a great way to avoid frustration. Jupyter Notebooks are one great way to do this. Defining functions is another great way. I like to do both. The first function we'll define will take a properly formatted url to the playlist tracks Spotify API endpoint, and return a list of the track data Spotify returns. This endpoint only returns 100 tracks at a time, so for playlists longer than 100 songs, multiple requests will have to be made, and their results concatenated together.
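A minimal sketch of that pagination loop. The page fetcher is injected as a function so the loop itself stays easy to test; in practice it would be something like `lambda u: requests.get(u, headers={"Authorization": "Bearer " + token}).json()`:

```python
def get_playlist_tracks(url, fetch):
    """Collect every track from a playlist, 100 at a time.

    `fetch` maps a url to Spotify's decoded JSON paging object; the
    paging object's "next" field holds the next page's url, or None
    when there are no more pages.
    """
    tracks = []
    while url is not None:
        page = fetch(url)
        tracks.extend(page["items"])  # up to 100 tracks per page
        url = page["next"]
    return tracks
```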

As a quick test, I'll run this function with one playlist and compare the length of the result with the number of songs in the playlist (1458).

At the very least, it has the right number of tracks! I'd also recommend looking through the results you get if you try something similar. The json for each track can be navigated like a dictionary in python. It is important to see what kinds of things are available to you when you're collecting data. I did a solid amount of poking around, but removed it because it was rather aimless and had a lot of long output.

Next, we'll define a function to get the artists from the songs in a playlist, and we'll keep track of how many times an artist occurs across all tracks in a playlist (using the Counter collection). Here, we're presented with our first major decision: how do we deal with songs with multiple artists? This is closely related to another decision we'll have to make soon: how do we deal with artists with multiple genres? We could just add all the genres for all the artists on a song, but consider a case where a song has two artists that are identified as Dutch R&B: would that song be twice as "Dutch R&B" as a song with only one such artist? No. Additionally, if a song is a collaboration between two artists from vastly different genres, is that song both of those genres? Yes. But it probably sounds like a mix between the two, not fully either one (this is based purely on my listening experience). For this reason, when we encounter a song with multiple artists, we will normalize the count by the number of artists. In this code, I will use the modifier "adjusted" (or "adj") instead of "normalized", because we will do some more blatant normalization later.
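Here's a sketch of that adjusted counting (the shape of the track JSON follows Spotify's playlist-items response, where each item's track carries a list of its artists):

```python
from collections import Counter


def count_artists_adj(tracks):
    """Adjusted artist counts: each track contributes 1 in total,
    split evenly among the artists on that track."""
    counts = Counter()
    for item in tracks:
        artists = item["track"]["artists"]
        for artist in artists:
            counts[artist["id"]] += 1 / len(artists)  # normalize by artist count
    return counts
```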

We'll test it out again:

I've printed out the Spotify ID and count of an artist in this playlist with a count that is not a whole number, as an example of the normalizing we did as we were counting.

Finally, we'll define a function to get all the genres in the playlist, using the function we made to get a playlist's artists. When we count genres, we will weight the count by how many times a particular artist with that genre appears in the playlist. We will not adjust by the number of genres an artist has (having 2 genres versus 1 does not make an artist less of each, based on my experience). We will, however, normalize the final counts by the number of tracks in the playlist. Ideally, our analysis should be irrespective of the size of the playlist, as a small playlist can be eclectic and a large playlist can be cohesive, and vice versa. In the code, I will identify this as "track normalized", or "tnormd". In the end, the value this function returns for a given genre should be the percent of songs in the playlist that may be identified as that genre. We'll also return our data in descending order of frequency, using Counter's most_common method.
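A sketch of that genre counting, under the assumption that we've already fetched a mapping from artist ID to that artist's genre list (`artist_genres` below stands in for that lookup):

```python
from collections import Counter


def count_genres_tnormd(artist_counts, artist_genres, num_tracks):
    """Track-normalized genre frequencies.

    Each genre is weighted by its artists' adjusted counts (no adjustment
    for multi-genre artists), then divided by the playlist length, so a
    value of 0.5 means half the playlist can be identified as that genre.
    """
    counts = Counter()
    for artist_id, count in artist_counts.items():
        for genre in artist_genres[artist_id]:
            counts[genre] += count / num_tracks
    return counts.most_common()  # descending by frequency
```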

Once again, we'll test out our function. I've made a playlist with only instrumental rock, so this should return a Counter (which is a subclass of dict) with one item, with the value being 100%.

Alright, we have some data! Let's see what we can glean from it!

Data Exploration

After getting some data, typically you'll want to familiarize yourself with it (even if you think you know about it already). Some great ways to do that are to find ways to visualize it, and to find statistics to summarize it.

Visualization

The meat of the data we have is just one metric, but we have it across a range of categories. This sounds like a job for a bar chart. We'll be using matplotlib to make some bar charts for our data. Specifically, we'll use the Axes.bar method. We'll write another function to do our plotting. On the x-axis, we'll put our genres, and on the y-axis, we'll put the relative genre frequency we found earlier. Since our data is already sorted by descending frequency, we can expect to see some sort of decreasing curve if we follow the tops of the bars. I believe this shape could be interesting and illuminating, and the heights (relative frequencies) could be interesting as well.
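A stripped-down version of such a plotting function might look like this (the figure size and label styling here are my guesses, not the original settings):

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt


def plot_genre_freqs(name, tnormd_counts):
    """Bar chart of a playlist's genre frequencies, already sorted descending."""
    genres, freqs = zip(*tnormd_counts)
    fig, ax = plt.subplots(figsize=(10, 4))
    ax.bar(range(len(genres)), freqs)
    ax.set_xticks(range(len(genres)))
    ax.set_xticklabels(genres, rotation=90, fontsize=6)
    ax.set_ylabel("relative genre frequency")
    ax.set_title(name)
    fig.tight_layout()
    return fig
```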

As I mentioned earlier, modularization can save a lot of frustration when coding for data science. This is an excellent example of that. When this function is run on all playlists in the playlists dict (this is the default), it takes about 40 seconds to run. When you are just changing how charts look, though, sometimes you want to make a quick change and see how it looks. I've left, commented out, an example of a call I used when adjusting how the graphs looked. Looking at only 3 playlists took the time down significantly.

These look pretty good! It seems for the most part that they follow a sort of exponential decay shape, which makes sense to me intuitively. We also see relative frequencies for top genres that are all over the board. The genre-specific playlists (The Sound of Chamber Psych, The Sound of Dream Pop) have top genre relative frequencies between 90% and 100% which makes sense. It's interesting that even the most eclectic playlist possible (The Sound of Everything) shares the same general shape as the rest of the playlists' curves, just scaled down significantly (top genre relative frequency seems to be about 0.85%). Unfortunately, most of these playlists have too many genres to be legibly labeled, but of the few that can be labeled, there are some interesting things to note. Looking at "technical," the top 3 genres are instrumental math rock, instrumental rock, and math rock, all with relative genre frequencies between 70% and 40%. Even someone unfamiliar with these genres could tell by the names that they're very similar.

These graphs are fun to look at, but let's stop trying to guess the values for things like the top genre's relative frequency. Next, we'll gather some statistics for these playlists.

Playlist Statistics

In order to keep track of our statistics, we'll use pandas dataframes (pandas docs, dataframes). We'll keep track of a number of things we think may be cool to look at. I'll define them below.

Once again we'll define a function to perform this task, to allow us to test our function quickly as we create it.
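A sketch of the per-playlist statistics, using the column names that appear in the discussion (the "demo" row is fabricated data for illustration):

```python
import pandas as pd


def stats_row(num_tracks, tnormd_genres):
    """Statistics for one playlist, from its track count and its
    descending (genre, relative frequency) list."""
    num_genres = len(tnormd_genres)
    freqs = [freq for _, freq in tnormd_genres]
    top = freqs[0]
    return {
        "num_tracks": num_tracks,
        "num_genres": num_genres,
        "genres:tracks": num_genres / num_tracks,
        "tracks:genres": num_tracks / num_genres,
        "avg_relfreq": sum(freqs) / num_genres,
        "top_gen_relfreq": top,
        "top:tot_gen": top / num_genres,
    }


# one row per playlist
stats_df = pd.DataFrame.from_dict(
    {"demo": stats_row(4, [("rock", 0.75), ("pop", 0.25)])}, orient="index"
)
```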

This is interesting to parse through. Let's look at how some of these things met or did not meet our expectations.

genres:tracks is a very interesting column. We expected this to be higher for more eclectic playlists. The Sound of Everything, which we noted earlier is probably the most eclectic playlist possible, has a value of 0.998896. The other two "The Sound of..." playlists (which might be considered some of the most cohesive) have values close to 0.5. Finds, trancee and deep synth, on the other hand, have values that are significantly above 1. This indicates that there's probably more to this statistic than simply "higher = more eclectic" and "lower = more cohesive". It could be a much more complex relationship between multiple factors, or it could just be irrelevant to what we're considering. Results like this are great in my opinion. Sure, it's fun to have your expectations met, but there's a lot to learn about our own assumptions when we find that our expectations were wrong. In data science, it's important to address these situations, rather than trying to cover them up. Nobody could be expected to guess everything correctly, otherwise there would be no reason to do data science.

As tracks:genres is the mathematical inverse of genres:tracks, it too could be considered more complex (or, alternatively, less important) than we anticipated.

avg_relfreq is interesting too. With the exception of the highly contrived test playlist, this tends to be very low for most playlists. It is very very low for The Sound of Everything (0.000583), but most are one to two orders of magnitude greater. The part I find interesting is in comparing The Sound of [genre] and playlists like trancee, jazze, technical, and deep synth. While these are playlists I'd like to call very cohesive, I probably wouldn't say they're more cohesive than playlists designed to showcase single genres. What this statistic seems to be more correlated with, in general, is the number of tracks. The more tracks there are, the lower the average relative frequency is. This might suggest we shouldn't focus too hard on it, or that we should at least be strongly aware of its correlation with number of tracks.

top:tot_gen falls into the same category as avg_relfreq here: it seems to be most strongly correlated with num_tracks.

top_gen_relfreq seems to align the most with our expectations. It may be interesting to focus mostly on the ends of the graphs where relative frequency is higher.

Let's confirm (or disprove) our speculations about correlations with pandas and seaborn, a library that works well for making heatmaps (docs, tutorial). We'll be looking for correlations between num_tracks and any other column, and it will also be interesting to note any other particularly strong correlations.
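The heatmap itself is a short bit of code; `stats_df` below is a tiny stand-in frame (the real one has all the columns discussed above):

```python
import matplotlib

matplotlib.use("Agg")  # off-screen rendering; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# stand-in for the real per-playlist stats frame
stats_df = pd.DataFrame({
    "num_tracks": [100, 1458, 5437],
    "num_genres": [40, 300, 5400],
    "top_gen_relfreq": [0.9, 0.3, 0.0085],
})

# pairwise Pearson correlations between all columns, drawn as a heatmap
fig, ax = plt.subplots()
sns.heatmap(stats_df.corr(), annot=True, fmt=".2f", cmap="coolwarm", center=0, ax=ax)
```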

num_tracks is most strongly correlated with num_genres (0.79), and top_gen_relfreq (-0.48), interestingly. Once again, my expectations have been broken by the latter of those strong correlations. We also see weaker (but still existent) correlations between num_tracks and genres:tracks and avg_relfreq. What this tells us is that perhaps these columns will still be meaningful to consider -- perhaps they are influenced by number of tracks, but also something else. That something else is what is particularly interesting to me.

I'm not surprised to see a very strong negative correlation between tracks:genres and genres:tracks, seeing as they are mathematically inversely related. Perhaps most interesting, outside of the num_tracks correlations, is the almost direct correlation between top:tot_gen and avg_relfreq. After some consideration, I believe this is because the sum of all the genre relative frequencies is dominated by the top genre. Once this connection is made, it seems obvious that [the sum of the relative frequencies divided by the number of genres] is almost directly correlated with [the top genre relative frequency divided by the number of genres]. This further leads me to believe it may be most rewarding to consider only the left side of our plots, where the relative frequencies are higher.

More Data Collection

The case for collecting more data

So far we've gotten some pretty interesting metrics, and we're on our way to being able to determine how cohesive or eclectic playlists are. However, there's still a big hole in our plan so far.

Consider two playlists. They each have high concentrations of three separate genres.

The first:

The second:

Clearly we would describe the first playlist as more cohesive and the second as more eclectic. Unfortunately, with only genre counts and distributions, and no information about the genres themselves, there's only so far this analysis will get. Obviously, there's a lot more to what makes two songs sound similar or different than their artists' genres. Two songs from different genres may sound remarkably similar (see: instrumental rock, instrumental math rock, math rock). Unfortunately, the exact characteristics that may indicate whether songs sound similar would be incredibly hard to parse through. There is some interesting data available from the Spotify API for track audio features (danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo, duration_ms, and time_signature are available for all tracks), but surely someone has already done that...

everynoise.com to the rescue! Here's a description of Every Noise at Once from the website:

Every Noise at Once is an ongoing attempt at an algorithmically-generated, readability-adjusted scatter-plot of the musical genre-space, based on data tracked and analyzed for 5,437 genre-shaped distinctions by Spotify as of 2021-05-15. The calibration is fuzzy, but in general down is more organic, up is more mechanical and electric; left is denser and more atmospheric, right is spikier and bouncier.

(Retrieved 5/15/2021)

Every Noise at Once is made by Glenn McDonald, who is a "data alchemist" at Spotify. His work at Spotify includes creating people's Daily Mixes, and populating artists' Related Artists tabs (algorithmically, of course) [source]. If anyone could be called an expert at determining genres' relationships with each other, it's him.

The whole site is worth exploring if you're interested in music, but some parts are particularly useful for our purposes. Specifically, the genre list has a beautiful gem hidden in the html when sorted by "similarity to <genre>". Here are the first few rows of the list when sorted by similarity to trap:

[Image: the first few rows of the everynoise genre list, sorted by similarity to trap]

Note the title of the first table data cell in each row (the lines where class="note"). Acoustic distance between genres is exactly what we need to be able to tell if a playlist with a lot of instrumental rock, instrumental math rock, and math rock is different from a playlist with a lot of reggaeton, shoegaze, and hard techno! The former should have a low acoustic distance between the genres and the latter should have a much greater distance between the genres. The exact definition of what acoustic distance means is not entirely clear, but I am confident that there is a meaningful way Spotify/Glenn McDonald/everynoise.com calculates it. For reference, the distance between a genre and its closest genre is usually below 1, and the distance between a genre and its farthest genre is usually about 20 (the highest I've seen is around 34). Another important property is that these distances are bi-directional (the distance between russian synthpop and new rave is the same as the distance between new rave and russian synthpop). This will allow us to trim down how many times we have to look up a distance.

There is another metric listed with acoustic distance as well, called overlap. I will not use this because it quickly goes to zero for most genres (leaving thousands of genres tied with overlap of 0).

Scraping Acoustic Distance Between Genres

Before we get into scraping this data, it will be important to figure out how we're going to store it.

We could think of this as a weighted, undirected graph where every vertex is a genre, and the edge weights are the acoustic distances between two genres. This could be stored in an adjacency matrix as a numpy ndarray, or in a SQL database... but this is not a trivial case when it comes to storage: in the worst case (a playlist with all the genres, cough cough The Sound of Everything cough cough), we have distances between all 5400-ish genres to store. An adjacency matrix would have 5400 * 5400 = over 29 million entries, which might just fit in memory, but it would be cutting it close. Since this is not a trivial case, we'll just use a dedicated graph library. I've chosen NetworkX (docs).
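In NetworkX, the graph and its weighted edges look like this (the weights here are illustrative, not real everynoise values):

```python
import networkx as nx

# weighted, undirected graph: vertices are genres, edge weights are distances
G = nx.Graph()
G.add_edge("instrumental rock", "instrumental math rock", weight=0.7)  # illustrative
G.add_edge("instrumental rock", "hard techno", weight=18.2)            # illustrative

# because the graph is undirected, lookups work in either direction,
# matching the bi-directional property of the distances
dist = G["hard techno"]["instrumental rock"]["weight"]
```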

Now that we've decided how we'll take care of the data we get, we can look at how to get the data. We'll scrape the data using BeautifulSoup (docs).

Our first goal will be to get a sense of how we can access the pertinent information. For this, we'll just take a look at the table we get from our soup.

We can see that we'll need to go into each \<tr> and then into the third \<td> to get the genre names. I followed a similar process for the lists sorted by genre to find how to get the acoustic distance. I won't show the code or output for that, since it's basically the same as above, and the results are already shown in the screenshot above. The final code I used to extract the acoustic distances will come later.

We'll look at all the rows of the table, and snag the genre names from each row, and append them to a list.
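In BeautifulSoup terms, that looks roughly like this (the HTML fragment is a simplified stand-in for everynoise's table, shaped so the genre name sits in each row's third cell):

```python
from bs4 import BeautifulSoup

# simplified stand-in for the everynoise genre-list table
html = """
<table>
  <tr><td class="note"></td><td></td><td><a>pop</a></td></tr>
  <tr><td class="note"></td><td></td><td><a>trap</a></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

genre_names = []
for row in soup.find_all("tr"):
    cells = row.find_all("td")
    genre_names.append(cells[2].get_text())  # genre name is in the third <td>
```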

This list is quite long (5437 elements, to be exact), so I output only the first 10, and also printed the length, just to make sure we got everything. Now, we can begin getting the acoustic distances.

I learned an incredibly important lesson here, and I'd like to share it with you. We're going to need to make 5437 HTTP requests. This is no trivial task. It will take a long time. Things will go wrong.

Remember when I said at the beginning one of the things I want a reader to be able to take away from this project is the importance of threading? Well this is where this comes in. I tried collecting this data with a single process and a single thread at first, and wasted about 17 hours of my life. Over that time span, about 25% of the genres had data collected for them, and with every graph insertion it was taking longer. Things were not looking good.

When I returned to my computer after 17 hours to find that only a fraction of the genres had been processed, I realized I needed a better solution.

Threading was that better solution. Python has a library for threading built in (here's a useful guide to threading). Threading allows us to perform tasks concurrently, to make better use of time the processor spends idle. Basically, we are going to define a function we want all threads to work on, then set them all loose chipping away at the remaining genres to process.

To find and extract the acoustic distance from the "title" attributes, we'll use regular expressions (docs, tutorial). I recommend a tool like regex101 to test regular expressions if you're not totally comfortable with them.
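For example (the exact wording of the title text is my assumption; the pattern would be adjusted to whatever the page actually contains):

```python
import re

# pull the first decimal number that follows the word "distance"
dist_re = re.compile(r"distance:?\s*([\d.]+)")


def extract_distance(title):
    """Return the acoustic distance found in a title string, or None."""
    match = dist_re.search(title)
    return float(match.group(1)) if match else None
```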

The tricky part of threading is something called a data race (learn more about it and how to avoid it here). When a thread needs to write to shared data, it must be the only one accessing it. To ensure this, we use a Lock (essentially a binary semaphore, also known as a mutex). In our case, we're going to need a lock for the list of genres to process, and a lock for the graph where we're storing our distance relationships.
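Here's the skeleton of that worker pattern, with the scraping replaced by a stand-in computation (string length) so the locking is the focus:

```python
import threading

genres_to_process = ["pop", "trap", "shoegaze", "math rock"]  # stand-in work queue
list_lock = threading.Lock()    # guards the work list
graph_lock = threading.Lock()   # guards the shared results (the graph, in our case)
results = {}


def worker():
    while True:
        with list_lock:  # only one thread may take work at a time
            if not genres_to_process:
                return
            genre = genres_to_process.pop()
        value = len(genre)  # stand-in for "request the page, parse the distances"
        with graph_lock:
            results[genre] = value


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Note that the slow part (the HTTP request) happens outside both locks, so threads spend most of their time working in parallel.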

I also implemented a timer to predict when the job would be complete, which was done by the main thread. Information on the function I used for this can be found here.

Finally, I implemented a feature where the graph would be backed up to a file every 15 minutes, so in case something unexpected happened, I would still have the progress so far. I also save to this file at the end, so we only have to run this code block once (in the future we can just read the graph from the file). I did this with NetworkX's write_weighted_edgelist function. It is important to note that by default, this function delimits nodes using whitespace, which is an issue if your nodes have whitespace. I used ** as a delimiter because no genre has that in its name (verifiable with ctrl-f on everynoise).
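The round trip looks like this (toy graph and weight; the ** delimiter keeps multi-word genre names intact):

```python
import os
import tempfile

import networkx as nx

G = nx.Graph()
G.add_edge("math rock", "instrumental rock", weight=0.7)  # illustrative weight

path = os.path.join(tempfile.gettempdir(), "genre_distances.edgelist")
# the default delimiter is whitespace, which would split "math rock" in two
nx.write_weighted_edgelist(G, path, delimiter="**")
H = nx.read_weighted_edgelist(path, delimiter="**")
```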

I will not be showing the output of this function because running it takes over 4 hours (even with 16 threads!).

The two things I pointed out when discussing making 5437 HTTP requests were that it would take a long time and that things would go wrong. These are expected. I've covered the first; now I'll cover the second. There's one fatal flaw in the above code that I only discovered a couple hours after setting the threads loose: I'm just putting genre names right into a url! In retrospect, I should have known not to do this, but it's a great learning experience for everyone.

Consider the genre "r&b", and its many variations. '&' is a special character when it comes to urls (and so is '+'). Every time a thread requested a genre with a '&' or a '+' in it, everynoise just returned the main genre list page (with no distances). This was problematic for our purposes, but luckily it didn't break anything too badly (I expected things to go wrong, and I tried my best to prepare for them).

This had a rather easy fix. I just had to escape the problematic genre names, and recompile the distances between problematic genres. I used urllib's quote function to do this (docs). I just added the edges to the graph and saved it to file again (only one thread needed this time).
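For example (the url shape below is my reading of how everynoise's per-genre pages are addressed):

```python
from urllib.parse import quote

# '&' and '+' are special characters in urls, so escape the genre name first
genre = "r&b"
url = "https://everynoise.com/everynoise1d.cgi?root=" + quote(genre)
```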

Now that our graph is saved to the filesystem, we can re-extract it into a NetworkX graph with write_weighted_edgelist's counterpart, read_weighted_edgelist.

Okay, so we have our graph of all the genre distances. Now we need to turn this into something useful regarding our playlists. In order to account for the acoustic distance between genres, we will combine our genre frequency data with our genre distance data. If one genre is particularly far from the others, but that genre occurs only rarely, I would still consider the playlist mostly cohesive. In order to combine these numbers, we will multiply the acoustic distance between two genres by the average relative frequency of the two genres in the playlist. We will use the average relative frequency between two genres because this will weight distances involving more prevalent genres higher, and distances between uncommon genres lower. (Imagine a scenario where the top genre has low-ish distance to all other genres, but some less common genres are far from each other. They're still close-ish to the bulk of the playlist, and therefore the playlist would be considered more cohesive.)
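As a sketch, averaging those frequency-weighted distances over every pair of genres in a playlist (`distance` stands in for a lookup into the NetworkX graph):

```python
from itertools import combinations


def avg_adjusted_distance(tnormd_genres, distance):
    """Average pairwise acoustic distance, each pair weighted by the mean
    relative frequency of its two genres.

    `tnormd_genres` maps genre -> relative frequency; `distance(a, b)`
    returns the acoustic distance between two genres.
    """
    total, pairs = 0.0, 0
    for a, b in combinations(tnormd_genres, 2):
        total += distance(a, b) * (tnormd_genres[a] + tnormd_genres[b]) / 2
        pairs += 1
    return total / pairs if pairs else 0.0
```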

Now that we have a way to summarize how acoustic distance factors into a playlist's genres, we can incorporate this into our playlist-wise data. We'll create a new column for the average adjusted distance between all genres in the playlist. In this column, we can expect cohesive playlists to have smaller adjusted distances and eclectic playlists to have larger adjusted distances.

Once again, our expectations have been contradicted, though this time I believe it is due less to the underlying data and more to our method of adjusting the distances. It seems as though the relative frequency term of our adjustment overpowered the distance, and it also seems as though the playlist size plays a significant part in the adjusted distance. We will look at the correlations heatmap again to see what we can learn.

It seems that the strongest correlations with the adjusted distances are indeed the number of genres, the number of tracks, and also, interestingly, the ratio between them.

In an ideal world with infinite time and resources, I would like to explore other ways to incorporate genre acoustic distance into this analysis, ideally with less dependence on the size of the playlist and more dependence on the distance. Alas, this is not the case, and I must continue forward.

Hypothesis Testing and Machine Learning

Hypothesis testing generally assumes a known result, or an ability to verify your model somehow. Unfortunately, I have no way of knowing more than just qualitatively and subjectively how cohesive a playlist is. Therefore, we'll resort to unsupervised machine learning. This does not require data with anything known about it -- just data that may have some underlying quantity that is unknown (this is called a latent variable).

The method we will use is called Principal Component Analysis. We'll use the machine learning library scikit-learn; in particular, we'll use their implementation of PCA (docs). The goal of PCA is to find a linear combination of features that can reduce the dimensionality of the data while explaining as much variance in the data as possible. What this means is that we are looking for a smaller number of underlying metrics that can describe the data (hopefully) just as well as all the data.

As we said earlier, we do not want our analysis to be dependent on the number of songs or genres in a playlist. For this reason, we will not consider either when we train our model. We will use the rest of the statistics and computations we've made so far, though.
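A minimal sketch of fitting PCA on the remaining features (the matrix here is made-up data; standardizing first is my addition, since PCA is sensitive to feature scale):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# made-up rows standing in for the per-playlist stats (sizes excluded)
X = np.array([
    [0.90, 0.50, 0.45],
    [0.30, 1.80, 0.02],
    [0.10, 2.50, 0.01],
    [0.95, 0.40, 0.50],
])

X_scaled = StandardScaler().fit_transform(X)  # zero mean, unit variance per column
pca = PCA(n_components=1)
component = pca.fit_transform(X_scaled)       # one value per playlist
print(pca.explained_variance_ratio_)          # share of variance the factor explains
```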

As printed before the dataframe, we were able to find a factor that explains 85% of the variance in the data. That's great, as long as it's not simply a manifestation of track or genre count. To discover this, we will examine the correlations in this data again.

As it turns out, the factor that can explain 85% of the variance in the data is directly proportional to the track to genre ratio! On top of this, it is only weakly correlated to the total number of tracks and genres.

Insights

What have we learned from our exploration?

We discovered that it is possible to mostly differentiate between playlists based on a factor that is unrelated to the size of the playlist. Furthermore, that factor is directly proportional to data that is (relatively) easy to obtain. Simply divide the number of tracks in a playlist by the number of genres. What we did not obtain was a way to score playlists based on their cohesiveness. Looking at this metric, it seems to have arbitrary values when looking at relatively cohesive playlists and comparing them, and similarly for relatively eclectic playlists.

I still believe that with more time and effort, a way to score a playlist's cohesiveness could be determined. If I were to continue with this project, I would: