‘The Concert Project’: So, What Kind of Music Do You Listen To?

In this post I thought I’d show you a bit more of what goes on under the hood when building the data set. I dive into the theory and methodology behind acquiring my genre specifications and do a little cluster analysis to see how the sub-genres fit into the larger, more familiar genres the general listener is likely accustomed to. This one may run a bit longer than average.

Genre is an ephemeral subject. Two people might draw the line between two genres in different places, and a third might not draw one at all. Someone might place a band into box_A while another argues that the band belongs in box_B. And though Billboard, the company I have been using for much of my artist data, places artists into broad categories for its charts and awards, it does not attach a specific genre tag to any individual artist. The whole idea behind genres is really one big fuzzy classification system, relying on what the general consensus says about any given artist or genre at the moment. What might have been ‘Rock & Roll’ in the ’70s is now ‘Classic Rock’ or maybe even ‘Blues Rock’. So I think a fairly good working definition of genre is “the class of music most people can agree an artist falls into at the moment”, and that is my starting point.

The website last.fm is a community of music lovers who share what they are listening to and often add their own new items – songs, artists, genres – to the constantly evolving data set. For each artist page there are up to 6 crowd-sourced ‘tags’ that can be seen as genre and sub-genre specifications. This is a great starting place for lots of reasons. First, since the classification is crowd-sourced it fits the definition of genre stated earlier. Second, because there are only up to 6 genre tags assigned to an artist, it is easy to think that the 6 that are listed are the most appropriate for the artist. There shouldn’t be much difference between the 1st and the 6th tag, whereas if there were 50 tags there might be a large divide between the descriptive power of the 1st and the 37th. And lastly, because there are only up to 6 tags it makes the standardization of the data more intuitive.

The artists I scraped last.fm for are the ones I extracted from the event data set and validated through Billboard. For each artist, I scraped their last.fm page for the ‘tag’ HTML element, which contains the genre tags. Each artist’s tags were exported into their own one-row .csv file, and the files were then combined into a master .csv file.
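
As a rough illustration of the combining step, here is a minimal Python sketch using only the standard library. The original notebook code is not shown here, so the function name `combine_artist_csvs`, the file names, and the toy artists are all hypothetical; only the `g1` to `g6` column naming follows the table described in this post.

```python
import csv
import tempfile
from pathlib import Path


def combine_artist_csvs(folder, master_path, max_tags=6):
    """Merge one-row artist files into one master CSV, padded to max_tags columns."""
    header = ["artist"] + [f"g{i}" for i in range(1, max_tags + 1)]
    rows = []
    # Sorting the file names gives an alphabetical master table.
    for path in sorted(Path(folder).glob("*.csv")):
        with open(path, newline="", encoding="utf-8") as fh:
            for row in csv.reader(fh):
                if not row:
                    continue
                artist, tags = row[0], row[1:1 + max_tags]
                # Pad short rows so every artist has exactly max_tags tag columns.
                rows.append([artist] + tags + [""] * (max_tags - len(tags)))
    with open(master_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(header)
        writer.writerows(rows)
    return rows


# Toy demonstration with two made-up artists in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    artist_dir = Path(tmp) / "artists"
    artist_dir.mkdir()
    (artist_dir / "beta_band.csv").write_text("Beta Band,indie,rock\n", encoding="utf-8")
    (artist_dir / "alpha_act.csv").write_text("Alpha Act,pop,dance,electronic\n", encoding="utf-8")
    combined = combine_artist_csvs(artist_dir, Path(tmp) / "master.csv")

print(combined)
```

Padding every row to six tag columns keeps the master table rectangular even when an artist has fewer than six tags.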



[Embedded notebook: Blog_5_1.ipynb]

The table above shows just the ‘head’, the first 6 rows of the data. These data were combined alphabetically, so you get to see some obscure number-identified artists near the top. After importing the genre table, the first thing I do is strip off the artist names and check how many unique genres and how many total tags I have. There are 2,701 different tags in the data and 23,772 tags in total. On average, a tag would appear about 9 times in the data; however, that is most definitely not how the distribution of genres works out.
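
The counting step can be sketched in a few lines of Python. This is only an illustration on a made-up three-artist table, not the notebook code; the names `genre_table`, `all_tags`, and `tag_counts` are assumptions. With the real figures the same arithmetic gives 23,772 / 2,701, or roughly 8.8 occurrences per tag.

```python
from collections import Counter

# Toy stand-in for the genre table: artist -> up to 6 tags (made-up values).
genre_table = {
    "Artist A": ["rock", "indie", "alternative"],
    "Artist B": ["rock", "pop"],
    "Artist C": ["pop", "dance", "electronic", "rock"],
}

# Drop the artist names and flatten every non-empty tag into one list.
all_tags = [tag for tags in genre_table.values() for tag in tags if tag]

tag_counts = Counter(all_tags)
total_tags = len(all_tags)        # every tag occurrence
unique_tags = len(tag_counts)     # distinct tags
mean_per_tag = total_tags / unique_tags

print(unique_tags, total_tags, round(mean_per_tag, 2))
print(tag_counts.most_common(2))  # the most frequent tags first
```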



[Embedded notebook: Blog_5_2.ipynb]

The list of the top 5 genres should be no real surprise. The top genre (rock) has 1,140 artists under its tag, far more than the average of 9. Many of the tags might be one-of-a-kind or so specific that only a few artists would fall under their class. Often artists are tagged with their own name or maybe the name of the lead singer. My first instinct is to trim down the number of genres in the analysis. I have around 26,000 events, so using all 2,700 genres would not really be feasible without overfitting a predictive model.

The next section takes a nice, round number just to see how much of the genre data sits in the top genres. The top 50 genres make up 57% of the total genre tags. I don’t have any statistical reasoning beyond ease of use, but 50 genres covering 57% of the tags seems like a reasonable base from which to build a model. I guess ideally I could take all genres within the first standard deviation of the distribution, but in order to keep moving forward I’ll use 50. Each artist would then have, on average, 4 tags and I won’t have to deal with the other 2,650 genres. Win/win.
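
A minimal sketch of that coverage check, assuming the tag frequencies are held in a `collections.Counter`. The helper name `top_n_share` and the toy counts are hypothetical; on the real data the equivalent call with n = 50 is what produces the 57% figure.

```python
from collections import Counter


def top_n_share(tag_counts, n):
    """Fraction of all tag occurrences covered by the n most frequent tags."""
    total = sum(tag_counts.values())
    covered = sum(count for _, count in tag_counts.most_common(n))
    return covered / total


# Made-up frequencies: two common tags and two one-off tags.
toy_counts = Counter({"rock": 5, "pop": 3, "indie": 1, "shoegaze": 1})

share = top_n_share(toy_counts, 2)
top_genres = [tag for tag, _ in toy_counts.most_common(2)]

print(f"top 2 tags cover {share:.0%} of all tags: {top_genres}")
```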

The next step is building a better version of the genre table. Right now the table is really just a series of vectors all pasted together. Each column is just ‘gX’ indicating genre 1-6, but what I need is 50 columns, one for each genre, with a ‘1’ or a ‘0’ indicating whether or not that artist falls into the genre. The code below accomplishes this.



[Embedded notebook: Blog_5_3.ipynb]

The first thing I do is make a blank version of my table, with each artist listed and all genres set to ‘0’. Next, going through each of the 6 tags for each artist for each genre, I assign a ‘1’ if there is a match and a ‘0’ otherwise. This builds out the genre table using what are called ‘dummy variables’.
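
The same two steps (blank table, then fill in matches) might look like this in plain Python. It is a sketch on toy data rather than the notebook code, and `build_dummy_table`, `genre_table`, and `top_genres` are hypothetical names.

```python
def build_dummy_table(genre_table, top_genres):
    """Return artist -> {genre: 0 or 1} for the chosen genres."""
    # Step 1: a blank table with every genre set to 0 for every artist.
    dummies = {artist: {genre: 0 for genre in top_genres} for artist in genre_table}
    # Step 2: flip a cell to 1 whenever one of the artist's tags matches a genre.
    for artist, tags in genre_table.items():
        for tag in tags:
            if tag in dummies[artist]:
                dummies[artist][tag] = 1
    return dummies


# Made-up artists and tags; 'shoegaze' is not a kept genre, so it is ignored.
genre_table = {
    "Artist A": ["rock", "indie", "shoegaze"],
    "Artist B": ["pop", "rock"],
    "Artist C": ["dance"],
}
top_genres = ["rock", "pop", "indie"]

dummy_table = build_dummy_table(genre_table, top_genres)
for artist, row in dummy_table.items():
    print(artist, row)
```

An artist whose tags all fall outside the kept genres, like Artist C here, simply ends up with a row of zeros.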

Once the field of dummies is built I can check out the correlations between the genres. My first instinct is to see if there are clusters of genres that belong together. I can use these if I decide to do a broader-scoped genre analysis.
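
To make the correlation step concrete, here is a small sketch that computes Pearson correlations between 0/1 genre columns. The columns are made up, and the names `pearson` and `corr` are assumptions; I am also assuming plain Pearson correlation, which is what most default correlation functions compute.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric columns."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # a constant column has no defined correlation
    return cov / sqrt(var_x * var_y)


# One 0/1 column per genre, one entry per artist (toy values).
columns = {
    "rock":  [1, 1, 0, 0],
    "indie": [1, 1, 0, 0],
    "pop":   [0, 0, 1, 1],
    "dance": [0, 1, 1, 1],
}

# Full genre-by-genre correlation matrix as a nested dict.
corr = {a: {b: pearson(columns[a], columns[b]) for b in columns} for a in columns}

for genre, row in corr.items():
    print(genre, {other: round(value, 2) for other, value in row.items()})
```

Genres that always appear together on the same artists score near 1, and genres that never share an artist score near -1.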



[Embedded notebook: Blog_5_4.ipynb]

From the first correlation matrix I can quickly see that some genres correlate very highly with each other. After applying Ward clustering, a form of hierarchical clustering analysis, genre groups begin to reveal themselves in the data. In hierarchical clustering you start with the individual units as sub-classes and merge them layer by layer into super-classes. This type of clustering is a natural fit for genre analysis, as we normally think of genres in terms of sub-genres of larger, more prevalent super-genres. By raising or lowering the number of classes I can divide or agglomerate the different genres. I decided to use 13 classes because this is the number at which each sub-genre found a home within a larger group that made sense. More classes yield sub-groups within already well-defined and cogent genres, while fewer classes lump distinct genres into too-broadly defined categories.
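
For readers who want to see the merging mechanics, below is a deliberately naive pure-Python sketch of agglomerative clustering with Ward's minimum-variance criterion. It is not the notebook code: the function name `ward_clusters` and the toy genre vectors are hypothetical, and I am assuming each genre is represented by its 0/1 column across artists. In practice a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` with `method="ward"` would be the usual choice.

```python
def ward_clusters(points, labels, n_clusters):
    """Merge clusters bottom-up until n_clusters remain, using Ward's criterion."""
    clusters = [[i] for i in range(len(points))]  # start: every genre on its own
    dim = len(points[0])

    def centroid(members):
        return [sum(points[i][d] for i in members) / len(members) for d in range(dim)]

    def merge_cost(a, b):
        # Increase in within-cluster sum of squares if clusters a and b merge.
        ca, cb = centroid(a), centroid(b)
        sq_dist = sum((u - v) ** 2 for u, v in zip(ca, cb))
        return len(a) * len(b) / (len(a) + len(b)) * sq_dist

    while len(clusters) > n_clusters:
        # Pick the pair whose merge adds the least variance (first pair wins ties).
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: merge_cost(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return [sorted(labels[k] for k in members) for members in clusters]


# Toy data: one 0/1 vector per genre, one entry per artist.
labels = ["rock", "classic rock", "pop", "dance", "electronic"]
points = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1, 1],
]

two_groups = ward_clusters(points, labels, 2)
three_groups = ward_clusters(points, labels, 3)
print(two_groups)
print(three_groups)
```

Asking for three groups instead of two splits ‘electronic’ away from ‘pop’ and ‘dance’, which mirrors how raising the number of classes divides a super-genre into sub-groups.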

The next blog will likely involve delving into the source of our popularity variables and possible interplay between popularity, genre, and location.

Cheers!