‘The Concert Project’: Event Setup (1) 

Like any project, the most important indicator of how your end results will turn out is how you get started. Cleaning the data is how you really get to know it, and that matters. In this first technical post I’m going to take you through my initial ‘meet and greet’ with my events data.

Up to this point I have collected and collated hundreds of Boxscores. Now it is time to bring them into my R-kernel Jupyter Notebook and get to work. I have included excerpts from the code here as examples, and I’ll talk through each of the steps in narrative form as well.

Using the ‘read.csv’ function, I bring in my data and immediately strip off the first variable, ‘Rank’ – important to concert promoters, but not important to me. I then rename my columns to names that will prove useful and convert my ‘Revenue’ variable to numeric.

I then use the ‘distinct’ function to make sure none of my data was duplicated in the collection process. A quick ‘nrow’ and I can easily see how many events I’m dealing with: 38,786.
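A minimal sketch of those loading steps, using a toy stand-in for the raw Boxscore export (the column names and values here are assumptions for illustration, not the real file):

```r
library(dplyr)

# Toy stand-in for the raw Boxscore export (structure assumed for illustration)
raw <- data.frame(
  Rank  = c(1, 2, 2),
  Acts  = c("Artist A / Artist B", "Artist C", "Artist C"),
  Gross = c("$9,494,668", "$5,121,380", "$5,121,380"),
  stringsAsFactors = TRUE
)

# Drop the 'Rank' column -- useful to promoters, not to this analysis
events <- raw[, -1]

# Rename columns to something workable
colnames(events) <- c("Artist_List", "Revenue")

# 'Revenue' arrives as a string like "$9,494,668"; strip symbols, convert to numeric
events$Revenue <- as.numeric(gsub("[$,]", "", as.character(events$Revenue)))

# Remove rows duplicated during collection, then count events
events <- distinct(events)
nrow(events)  # -> 2 for this toy frame (one duplicate removed)
```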

After the initial prep I can take a peek at the data in table form. Note that, due to how the GitHub reader interprets the data in the file, I have had to omit the ‘Prices’ and ‘Promoter’ columns from this display.



(Code excerpt: Blog_1_1.ipynb, hosted on GitHub)

Based on the data I am viewing, it looks like these are the highest-grossing concerts from the July–August 2013 time period. These high-revenue concerts all have far more than one act. This makes sense, as these are likely the super-concerts or festivals. The highest-grossing of them all banked nearly $9.5 million in revenue over two days!

Besides the excitement of seeing my data for the first time, I can also start to visualize the hurdles coming ahead.

  • All of those artists – how am I going to deal with so many artists? Especially since in many cases there is likely one main act and other supporting acts.
  • Those state abbreviations. What even are those?!
  • The dates are also a mess. (Spoiler! There are several different formats hidden within.)
  • Numeric data hidden within string variables.

The first thing I do is split up the artists. I think that I only want to examine the headliner for any given concert, even if it is a large-grossing festival. If that analysis doesn’t end up working out, I would rather treat festivals differently than try to separate them into their constituent acts.

After the headliners are picked out and assigned to the ‘Artist’ variable I have to clean them up. I get rid of dollars signs, apostrophes, dashes, and any other type of punctuation. This will come in handy when I have to standardize artist names later.
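A sketch of the headliner extraction and clean-up; the "/" separator between acts is an assumption for illustration:

```r
# Toy artist lists; the "/" separator between acts is assumed for illustration
artist_list <- c("Artist A / Artist B / Artist C",
                 "Ke$ha",
                 "Panic! at the Disco / Someone Else")

# Keep only the headliner: everything before the first "/"
Artist <- trimws(sub("/.*$", "", artist_list))

# Strip dollar signs, apostrophes, dashes, and other punctuation
# so that standardizing artist names later is easier
Artist <- gsub("[[:punct:]]", "", Artist)

Artist  # -> "Artist A"  "Keha"  "Panic at the Disco"
```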



(Code excerpt: Blog_1_2.ipynb, hosted on GitHub)

My first five artists have been isolated and we can see them on the output above. None of them had any strange punctuation (like Ke$ha or Panic! at the Disco), but we can see they have been pulled off of the larger list they once belonged to.

As well, a bit of code was added at the end to create another variable, ‘bb_Artist’. This is the variable I will use to guide my search of the Billboard website. There are two very sticky problems, however: ‘&’ and ‘and’. I have come to loathe these. Here is why:

  1. Sometimes an artist is ‘this and that’, but can be listed as ‘this & that’
  2. Sometimes two separate artists are listed together as ‘this-guy and/& the-other-guy’
  3. Sometimes an artist is ‘this and/& that’, but Billboard decides to have them in its database as ‘this that’
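
The behind-the-scenes validation worked roughly like this sketch: generate the plausible ‘and’/‘&’ variants of a name and test which one the Billboard site recognizes. The variant rules below are illustrative, not the exact ones I used:

```r
# Generate candidate name variants for the 'and'/'&' ambiguity
# (rules are illustrative; real validation was done against the Billboard site)
name_variants <- function(artist) {
  unique(c(
    artist,
    gsub(" and ", " & ", artist, fixed = TRUE),  # 'and' -> '&'
    gsub(" & ", " and ", artist, fixed = TRUE),  # '&' -> 'and'
    gsub(" (and|&) ", " ", artist)               # drop the connector entirely
  ))
}

name_variants("this and that")
# -> "this and that"  "this & that"  "this that"
```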

Because of this I have – behind the scenes – gone through and validated which version of the possible iterations of the artist names work best with the Billboard website. I may introduce that tangent as a separate blog post at a later date. Much like any good cooking show, I have done my prep ahead of time and the turkey is roasted and loaded below as my ‘Joined Data Tables – clean’ file.



(Code excerpt: Blog_1_3.ipynb, hosted on GitHub)

The first thing I do after loading the data is look at the structure of the file. I still have my 38,786 observations and 11 variables. The second column tells me which type of data I am looking at. Unsurprisingly, the only non-Factor data is my revenue data, which I already touched up. The number of levels indicates the number of different values within each variable. The most interesting to me are the ‘Artist’ and ‘bb_Artist’ levels. Out of the 38,786 observations, I have only 7,806 distinct headliners, and of those headliners I was able to create a Billboard-optimized version for 7,578. The other ~250 were probably artists doubling up with an ‘and’ between them, or artists who were listed multiple ways and are now consolidated.
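The structure check itself is just base R; on a toy frame (variable names assumed for illustration) it looks like this:

```r
# Toy frame standing in for the joined data (names assumed for illustration)
df <- data.frame(
  Artist    = factor(c("Artist A", "Artist B", "Artist A")),
  bb_Artist = factor(c("Artist A", "Artist B", "Artist A")),
  Revenue   = c(100000, 250000, 90000)
)

str(df)              # observation count, each variable's type, factor levels
nlevels(df$Artist)   # distinct headliners -> 2 in this toy frame
sapply(df, class)    # quick check of which columns are still factors
```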

Now, not all of those ‘Artists’ are actually artists I will be able to find and match to any data. They may be small, obscure, mis-entered, not really an artist, or any number of other things. However, for the time being, now that my artist data is Billboard friendly, I can work on re-configuring the other pieces of my data.



(Code excerpt: Blog_1_4.ipynb, hosted on GitHub)

I know there was a lot in there, but I think most of what I was doing should make intuitive sense even if it doesn’t make technical sense. The numbers output after each code block are how I check that the values I am getting make sense. At any point I can scroll back up to my first five entries in the table to make sure the code is doing what I want it to.

The biggest takeaway from this whole process was that only 68% of the events collected were from the United States. This is important because I want to limit the scope of this project to just US concerts (for now). After removing the non-US shows I am left with 26,479 events. This is still a fantastically large number compared to what I was working with four years ago.
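The US filter can be sketched with R’s built-in `state.abb` vector; the `State` column name and toy rows are assumptions for illustration:

```r
# Toy events with mixed US and Canadian province abbreviations
events <- data.frame(
  State   = c("NY", "ON", "CA", "QC"),
  Revenue = c(100, 200, 300, 400)
)

# Keep only rows whose abbreviation is a US state (plus DC)
us_events <- events[events$State %in% c(state.abb, "DC"), ]

nrow(us_events) / nrow(events)  # share of events that are US-based -> 0.5 here
```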

But I’m still not done. I haven’t touched on the hardest part – the dates data. I’ll pick it up from there next time.