The year was 2014. I was dedicating half of my day to the analysis of consumer surveys for Southwick Associates and the other half to the ponderous task of writing my thesis. Most of the analysis was done, and I was dragging my heels putting everything I wanted to say into so many words. I had mostly recovered from the disheartening revelation that my data source had changed on me halfway through the collection process. The Billboard website, from which I was gathering some of my artist descriptors, had gone through a huge format and structure change, and the data on how many ‘Hot 100’ songs and ‘Top 200’ albums any given artist had were completely inaccessible. Alternative sources proved futile. This meant that many of my explanatory variables would be severely limited, thus limiting the scope of the project. I was initially crushed, but I recovered and adapted the project in ways that compensated for the loss. Luckily, I had collected enough to produce some results. “However,” I thought, “I’ll probably never come back to this.” I pressed on and didn’t intend to look back.
Fast forward four years and, instead of figuring out how to ask simple questions of a small pile of data, I am now swimming in what feels like a lake of data by comparison, all of which must be cleaned, filtered, joined, re-configured, and analysed. The same data that proved so painstakingly hard to gather is now mine with a few keystrokes.
The reason for the difference is twofold. Since leaving my thesis behind, I have acquired new skills that have raised my technical ceiling. Also, the platforms available to me now are worlds apart from the ones I was working with in grad school. Plus, Billboard finally finished their website! All of these contribute to the dawn of THE CONCERT PROJECT.
The goal of the project is simple: explore and play with a data set that interests me in a way that grows me as an analyst.
Before we dig into the details I think I will outline the data sources and my reasoning behind them:
- Billboard Boxscores: These are the bragging boards of the music industry. This is the meat and potatoes of the data set I’m building. Included are the names of all of the acts performing, the dates of the shows, where they took place, how many shows are included, total revenue, number of tickets sold, prices of the tickets, and the promoter behind the concert.
- Billboard Artist Data: This is the data that eluded me while I was completing my thesis. I want to measure an artist’s popularity and how that might affect different things. I’m thinking I’ll primarily use the ‘Top 200’ list, which ranks albums rather than songs (songs are the ‘Hot 100’ list), because its extra positions let it reach much deeper into the swath of artists and allow more of them to enjoy a spot on it.
- Last.fm Artist Data: These guys are great. They use user-generated data to measure things and assign categories to artists. They have their own metrics, such as who is spiking in popularity, which genres are interesting and why, and even who is listening where. I plan on using their genre assignments in my analysis. Billboard does not assign genres on their artist page, though they do give out awards like “Best Country Album of 2017”. Plus, genre is such a subjective thing, with many layers and sub-genres. Last.fm solves that problem because the genres are assigned by popular vote from their users, leading to a robust and nuanced categorization.
- MusicBrainz: As stated on their homepage, “MusicBrainz is an open music encyclopedia that collects music metadata and makes it available to the public.” There is so much here to tap into, but my main use of the MusicBrainz data is to identify the years of album releases for my artists. Billboard does not hold this data, and though it is potentially accessible from many different sources, MusicBrainz is the best option.
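To make the plan a little more concrete, here is a minimal sketch of how a Boxscore row might be paired with a Last.fm genre. Everything in it is an assumption for illustration: the `Boxscore` class, its field names, the `normalize` and `attach_genres` helpers, and the sample values are all hypothetical placeholders, not the actual structure of either source or the code used in this project.

```python
from dataclasses import dataclass


@dataclass
class Boxscore:
    """One Boxscore row; field names are hypothetical, loosely following the list above."""
    artist: str
    venue: str
    shows: int
    gross_revenue: float
    tickets_sold: int
    promoter: str

    def average_ticket_price(self) -> float:
        # Revenue per ticket; returns 0.0 for a row with no tickets sold.
        return self.gross_revenue / self.tickets_sold if self.tickets_sold else 0.0


def normalize(name: str) -> str:
    # Crude join key: lowercase and collapse runs of whitespace.
    return " ".join(name.lower().split())


def attach_genres(boxscores, genres_by_artist):
    """Pair each boxscore with a genre looked up by normalized artist name.

    Artists with no match get None instead of being dropped.
    """
    lookup = {normalize(k): v for k, v in genres_by_artist.items()}
    return [(b, lookup.get(normalize(b.artist))) for b in boxscores]


# Placeholder values, invented purely for illustration.
rows = [
    Boxscore("Example Artist", "Example Arena", 2, 100000.0, 2000, "Example Promoter"),
    Boxscore("Unknown Act", "Example Hall", 1, 5000.0, 0, "Example Promoter"),
]
genres = {"example  ARTIST": "folk"}

joined = attach_genres(rows, genres)
for box, genre in joined:
    print(box.artist, box.average_ticket_price(), genre)
```

Matching on a normalized artist name is the simplest possible join key and will likely miss alternate spellings, so a real version would probably need something sturdier, such as a shared identifier across sources.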
There are other data sources that are just ideas without a structured plan of attack, such as Metacritic to identify artists with upcoming albums and Google search data for everything it can do. There is also the boring data, such as the BLS and other government sources.
All these things together form the foundation for The Concert Project and will be used to build my models as well as this blog. Please be sure to check in for more content in the near future. I will likely put up a bunch of work quickly and then taper down to a more reasonable schedule as time progresses.