‘The Concert Project’: Going Further

I left the blog last time with the promise that I would join the venue data with the event data. This post will accomplish that, but the join itself is surprisingly uneventful. Because of that, the second half of this post will be a sort of checkpoint and reflection on the project so far and where I see it going.

But first – I’ll join the venue data to the event data with a left join:



[Embedded notebook: Blog_4_1.ipynb (hosted on GitHub)]
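The notebook's code is not reproduced in this text, so what follows is only a minimal sketch of a left join, not the code that was actually run. It uses base R's `merge` with `all.x = TRUE`, and every table, column name, and value in it is a hypothetical stand-in for the real event and venue data. The coordinates are rough and purely illustrative.

```r
# Toy stand-ins for the real tables; all names and values are hypothetical.
events <- data.frame(
  event_id = 1:4,
  venue    = c("Red Rocks", "The Joint", "Red Rocks", "Las Vgeas Arena"),
  shows    = c(1, 3, 2, 1),
  stringsAsFactors = FALSE
)
venues <- data.frame(
  venue = factor(c("Red Rocks", "The Joint")),  # stored as a factor on purpose
  lat   = c(39.67, 36.11),
  lon   = c(-105.21, -115.15)
)

# Convert the factor key to character so both sides share one type.
venues$venue <- as.character(venues$venue)

# all.x = TRUE makes this a left join: every event row is kept, and the
# venue columns are NA wherever no matching venue exists.
joined <- merge(events, venues, by = "venue", all.x = TRUE)

# Count the event rows that found no venue.
n_unmatched <- sum(is.na(joined$lat))
print(joined)
print(n_unmatched)
```

Converting the key explicitly is one way to avoid relying on automatic type coercion during the join.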

You can ignore the warnings. The two data sets stored the variables being matched as different (though similar) data types: character and factor. A character variable is your standard string variable, while a factor is a categorical variable that treats each distinct string as one of a fixed set of levels (it only carries an order if it is explicitly created as an ordered factor). In this case R automatically converted them to characters and matched them together.

I now have 28 variables attached to 26,490 rows of data. Unfortunately there were 261 rows that I could not match with my venue data. These are likely data entry errors rather than errors in my process (I recall seeing a ‘Las Vgeas’ at one point, for example). A 1% data loss is probably fine considering the amount of work that would be involved to find and fix those rows. This project is really about moving forward with a solid first run-through to build my basic functionality in R. If this were a project for a client or employer I would likely take the time to investigate those 261 rows.
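If those unmatched rows ever do need investigating, one possible starting point is to list the keys that found no match and look for the closest valid spelling. The sketch below assumes the match is on a city-like string and uses base R's `adist` (edit distance). The vectors are made-up examples, not the project's data.

```r
# Hypothetical key values; only 'Las Vgeas' echoes the typo mentioned above.
event_cities <- c("Las Vegas", "Denver", "Las Vgeas", "Austin", "Denver")
venue_cities <- c("Las Vegas", "Denver", "Austin")

# Keys present in the event data but absent from the venue data.
unmatched <- unique(event_cities[!event_cities %in% venue_cities])

# Share of event rows that would fail to match.
share_unmatched <- mean(!event_cities %in% venue_cities)

# For each unmatched key, pick the venue key with the smallest edit distance.
d <- adist(unmatched, venue_cities)
closest <- venue_cities[apply(d, 1, which.min)]

print(data.frame(unmatched, closest))
print(share_unmatched)
```

A suggestion like this would still need a manual check before overwriting anything, since the nearest string is not always the intended one.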

With the data joined, I can run the same map I ran previously, but now I can weight the venues by the number of shows that took place at each venue.



[Embedded notebook: Blog_4_2.ipynb (hosted on GitHub)]

The big reveal here is that Las Vegas has events with a lot of shows! Here is how you can interpret the map. The larger a circle is, the more shows were tied to that specific event (a row in the data). The darker a spot is, the more events took place at that location. I can see the big cities are still very much a draw for lots of shows and events, and I can also start to see some patterns I’d expect. I think I see Red Rocks, the famous venue in Colorado, lighting up. I also see some larger transparent circles in fairly rural areas, meaning there was likely a single event there that happened to have a lot of shows. County fairs perhaps?
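The reading of the map described above (size for shows per event, darkness for the number of overlapping events) can be reproduced in a rough way with translucent points. This is a sketch in base R graphics under assumed column names (`lon`, `lat`, `shows`) and made-up rows. It is not the plotting code from the notebook, which may use a different mapping library.

```r
# Made-up event rows: three events at one location, two elsewhere.
events <- data.frame(
  lon   = c(-115.15, -115.15, -115.15, -105.21, -98.50),
  lat   = c(36.11, 36.11, 36.11, 39.67, 40.90),
  shows = c(2, 5, 1, 1, 12)
)

# One translucent colour for every point: overlapping events stack up
# and make a location look darker.
pt_col <- adjustcolor("steelblue", alpha.f = 0.3)

# Scale the symbol by the square root of the show count so that circle
# area, rather than radius, grows roughly in proportion to shows.
pt_cex <- sqrt(events$shows)

plot(events$lon, events$lat, pch = 16, col = pt_col, cex = pt_cex,
     xlab = "Longitude", ylab = "Latitude")
```

In this toy data the single 12-show event draws as one large, pale circle, while the three stacked events draw as a smaller but darker spot.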

Like most every exercise in this exploratory analysis, the map raises more questions than it answers, which leads me to the second part of this blog post: what should I be asking of this data set?

This project is a way for me to hone some new data analysis skills. Up to this point I’ve focused much more on the data processing (spoiler: more to come), but that is standard. I’ve heard estimates that between 50% and 95% of this type of work is data processing. That would depend on the exact job or project, but from my experience I’m inclined to agree that around 80% of any project is going to be data collection and processing. I’ve tried to work in exploratory analysis as I’ve ground through the processing, but I haven’t lingered too much on any piece of analysis because, for now, each piece is more a proof of concept than a true analysis: just a quick glimpse into the data before moving forward.

But after the next blog post I will be in a much better position to begin true analysis of this data, so I think it is time to start thinking about what I’d like to ask of the data and also what skills I’d like to test with it. My initial thoughts are:

  1. Are certain genres favored in different regions of the US? Or rather, are there different ‘musical’ regions in the US?
  2. Can the variables at hand be used to estimate the revenue from a show – given that you know the location and artist performing?
  3. Can I build a hypothetical ‘tour route’ that would maximize revenue for the artist given certain constraints?

These are questions that could be asked of the data. But which analysis techniques could the data be used to demonstrate?

  • Linear regression or other ML predictive analytics (#2)
  • K-means or other clustering (#1)
  • Interactive visualizations (shiny/Tableau Public)
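To make the first two techniques concrete, here is a minimal sketch of each using functions from base R's `stats` package (`kmeans` and `lm`). All numbers are invented for illustration, and the variable names (`genre_share`, `capacity`, `revenue`) are assumptions about what the prepared data might eventually contain.

```r
set.seed(42)  # kmeans uses random starting centres

# Question 1: cluster states by genre mix. Invented shares; rows sum to 1.
genre_share <- matrix(
  c(0.70, 0.20, 0.10,
    0.65, 0.25, 0.10,
    0.75, 0.15, 0.10,
    0.10, 0.40, 0.50,
    0.15, 0.40, 0.45,
    0.10, 0.45, 0.45),
  ncol = 3, byrow = TRUE,
  dimnames = list(c("TN", "TX", "OK", "NY", "CA", "IL"),
                  c("country", "rock", "hiphop"))
)
km <- kmeans(genre_share, centers = 2, nstart = 10)
print(km$cluster)  # cluster label assigned to each state

# Question 2: estimate revenue from a single predictor. Invented values.
past_shows <- data.frame(
  capacity = c(500, 1200, 3000, 8000, 15000),
  revenue  = c(11000, 26000, 58000, 165000, 310000)
)
fit <- lm(revenue ~ capacity, data = past_shows)

# Predicted revenue for a hypothetical 5,000-capacity venue.
predicted <- predict(fit, newdata = data.frame(capacity = 5000))
print(predicted)
```

A real model would need more predictors (artist, location, date) and a held-out set to check how well the estimates generalize.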

These seem like good building blocks, but I feel that they are a bit uninspired so I have a question for anyone reading this…

What questions would you ask of the data or what techniques would you test against the set?

Leave me a comment with your suggestion and I’ll see if I can work it into the analysis!

 

 
