The Cutting Room Floor

(Or interesting bits not included in my Flatiron data science project)

6 min read · Jun 5, 2020

Over the past few weeks, I have been working on a project (for my Flatiron data science course) where the objective was to advise budding movie executives from a well-known tech company keen on establishing a movie studio. Working from data drawn from a variety of sources (e.g. IMDb, TMDB, Rotten Tomatoes), I had to ascertain:

what type of movies were doing well at the box office
what type of movies the new studio should be creating

The intention behind this post is not to regurgitate either the technical or non-technical aspects of the project. The goals of the project have already been met by the technical analysis contained in a Jupyter notebook that is in the GitHub repository here:

hsinhinlim/Flatiron-Module-1-Final-Project

Microsoft sees all the big companies creating original video content, and they want to get in on the fun. They have…

github.com

The non-technical aspects of the project are already covered by my video presentation.

Instead, I intend to use this blog post to share three things I learnt while exploring the data prior to conducting the analysis, some of which failed to feature in the project itself (but could prove useful later on).

Waffle Charts

Now, this did make it into the final version of the project. As part of my analysis, films were categorised by their production budget into indie movies, low budget films, medium budget films and blockbusters. At one point in the analysis, it was useful to see the composition of films categorised by their production budget. Enter, the waffle chart (so named, I assume, because it resembles the eponymous waffle from Belgium, only with more colours and fewer toppings):

This was an effective pair of subplots comparing the relative composition of two different subsets which I had carved out of the overall dataset. At a glance, the viewer can easily work out the predominance of blockbuster movies in a subset of top grossing films (where the gross is defined in my analysis as the absolute amount of money that a film made over and above its production budget). It is also possible to quickly see that indie movies dominate the subset of movies which have achieved a high return on investment (defined in my analysis as ROI, which is the return of a film expressed as a percentage of its production budget). The inference may then be drawn that there is usually a trade-off when deciding whether to make indie films or blockbusters: the former will provide a high percentage return, the latter will provide a high absolute return.
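To make the two measures concrete, here is a minimal sketch using two made-up films (the titles and figures are hypothetical and do not come from the project data). It simply applies the two definitions given above.

```python
# Illustrative figures only: hypothetical films, amounts in millions of USD.
films = [
    {"title": "Hypothetical Indie", "budget": 1.0, "gross": 6.0},
    {"title": "Hypothetical Blockbuster", "budget": 200.0, "gross": 500.0},
]


def profit(film):
    """Absolute return: money made over and above the production budget."""
    return film["gross"] - film["budget"]


def roi(film):
    """Return expressed as a percentage of the production budget."""
    return 100.0 * profit(film) / film["budget"]


for film in films:
    print(f"{film['title']}: profit = {profit(film):.1f}m, ROI = {roi(film):.0f}%")
```

With these numbers the indie film has the far higher percentage return while the blockbuster has the far higher absolute return, which is the trade-off described above.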

Without deviating too much from the purpose of this blog, a high absolute return is not all it seems — a film can make a lot of money, but it can still be a poor investment if:

that same amount of money could have made a higher percentage return in a different investment
investment in the movie was funded with borrowed money which attracts interest and the absolute return from a film cannot cover both the repayment of the principal and interest
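Both conditions can be checked with simple arithmetic. The sketch below is only illustrative: the function names are my own, the figures are invented, and simple (non-compounding) interest is assumed to keep the sums easy to follow.

```python
def net_after_financing(budget, gross, annual_interest_rate, years):
    """Money left once the borrowed budget and its simple interest are repaid."""
    interest = budget * annual_interest_rate * years
    return gross - budget - interest


def alternative_return(budget, annual_rate, years):
    """What the same budget would have earned elsewhere at simple interest."""
    return budget * annual_rate * years


# Hypothetical film: costs 100m, takes 110m, so it makes 10m in absolute terms.
film_profit = 110.0 - 100.0

# Funded entirely by a two-year loan at 8% a year, the film ends up in the red.
print(net_after_financing(100.0, 110.0, 0.08, 2))

# A different investment paying 7% a year would have beaten the film's 10m.
print(alternative_return(100.0, 0.07, 2) > film_profit)
```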

A package called PyWaffle was used to create the waffle chart; the documentation can be found here:

PyWaffle

PyWaffle is an open source, MIT-licensed Python package for plotting waffle charts. A Figure constructor class Waffle…

pywaffle.readthedocs.io

The code to create the waffle chart is as follows:
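A minimal sketch of such a pair of waffle charts, not the notebook's original code, might look like the following. The category labels, the counts and the helper names `composition` and `plot_waffles` are all hypothetical; the plotting call assumes a recent PyWaffle version, where `plots` accepts integer subplot positions and a dict of values supplies the legend labels.

```python
from collections import Counter

# Hypothetical budget categories for two subsets of 20 films each; in the
# project these would be derived from the production budgets in the dataset.
top_grossing = ["blockbuster"] * 14 + ["medium"] * 4 + ["low"] * 1 + ["indie"] * 1
top_roi = ["indie"] * 11 + ["low"] * 5 + ["medium"] * 3 + ["blockbuster"] * 1

CATEGORIES = ["indie", "low", "medium", "blockbuster"]


def composition(labels):
    """Count films per category, in a fixed order so colours match across plots."""
    counts = Counter(labels)
    return {category: counts.get(category, 0) for category in CATEGORIES}


def plot_waffles(first, second):
    """Draw two stacked waffle charts; needs matplotlib and pywaffle installed."""
    import matplotlib.pyplot as plt
    from pywaffle import Waffle

    legend = {"loc": "upper left", "bbox_to_anchor": (1.05, 1)}
    return plt.figure(
        FigureClass=Waffle,
        plots={
            211: {"values": first, "title": {"label": "Top grossing films"}, "legend": legend},
            212: {"values": second, "title": {"label": "Highest ROI films"}, "legend": legend},
        },
        rows=2,  # applied to both subplots; columns are worked out automatically
        figsize=(9, 5),
    )


print(composition(top_grossing))
print(composition(top_roi))

# In a notebook, with pywaffle installed:
# fig = plot_waffles(composition(top_grossing), composition(top_roi))
```

Each block in the chart stands for one film, so keeping both subsets the same size makes the two waffles directly comparable.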

A waffle chart is worth considering whenever the composition of a dataset needs to be shown; it is an effective and attractive alternative to a pie chart, which can mislead because the relative sizes of its slices are not easy to judge at a glance.

TMDB API and YouTube Videos in a Jupyter Notebook

While conducting exploratory data analysis, I wanted to find out how else one could obtain information about movies if the data had not been provided with the project. It turns out that The Movie Database (“TMDB”) has a pretty handy API for doing just that:

API Docs

Hosted API documentation for every OAS (Swagger) and RAML spec out there. Powered by Stoplight.io. Document, mock…

developers.themoviedb.org

One could query the API for specific movies, each of which is assigned a unique ID number by TMDB. While exploring the JSON response obtained from making a request to the API, I noticed an interesting datapoint: a key that could lead to the YouTube video for the trailer associated with a specific movie. It then transpired that it was actually possible to play the trailer within the Jupyter Notebook itself! At this point, I must confess I spent quite a lot of time watching trailers of my favourite movies this way. See below for the code implementing this (without, of course, my API key; you will need to obtain your own from TMDB to make the code below work):
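A minimal sketch of the idea, which may differ from the original notebook code, is shown here. It assumes TMDB's version 3 `/movie/{movie_id}/videos` endpoint and an API key passed as the `api_key` query parameter; the helper names and the sample response fragment are hypothetical.

```python
import json
import urllib.parse
import urllib.request

TMDB_VIDEOS_URL = "https://api.themoviedb.org/3/movie/{movie_id}/videos"


def fetch_videos(movie_id, api_key):
    """Request the videos TMDB lists for one movie (needs network access)."""
    query = urllib.parse.urlencode({"api_key": api_key})
    url = TMDB_VIDEOS_URL.format(movie_id=movie_id) + "?" + query
    with urllib.request.urlopen(url) as response:
        return json.load(response)["results"]


def trailer_key(videos):
    """Return the YouTube key of the first trailer, or None if there is none."""
    for video in videos:
        if video.get("site") == "YouTube" and video.get("type") == "Trailer":
            return video.get("key")
    return None


def show_trailer(movie_id, api_key):
    """Build an embedded player; a notebook cell that returns it displays it."""
    from IPython.display import YouTubeVideo

    key = trailer_key(fetch_videos(movie_id, api_key))
    return YouTubeVideo(key) if key else None


# Hypothetical fragment shaped like the "results" list in the JSON response.
sample = [
    {"site": "YouTube", "type": "Featurette", "key": "abc123"},
    {"site": "YouTube", "type": "Trailer", "key": "xyz789"},
]
print(trailer_key(sample))
```

In a notebook, ending a cell with `show_trailer(55931, "YOUR_API_KEY")` (with your own key substituted) should render the player inline.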

As you can see from the code snippet above, it is the YouTubeVideo class from IPython.display which makes this happen. As an aside, I would recommend running the code to show the trailer from the film with TMDB ID 55931 as per the snippet above; this will give you an idea of why I spent so much time playing around with this feature.

A requirement of the project was to upload a video of a non-technical presentation to YouTube. I toyed with embedding that video in the Jupyter Notebook itself but, as things turned out, my notebook felt very busy after I had completed my analysis and I was reluctant to add any further bells and whistles. This feature may, however, come in handy in the future.

Scraping Data from The Numbers

While looking into what other movie data might be out there, I also came across a website called “The Numbers”:

The Numbers - Where Data and the Movie Business Meet

June 4, 2020 The Invisible Man came out on DVD / Blu-ray / 4K last week, but that wasn’t enough to overtake Sonic the…

www.the-numbers.com

It contained a lot of data about the movie industry and, in particular, very useful financial data. In fact, it seemed entirely focused on the financial aspects of the movie industry. I initially thought it might be helpful to scrape some data from it to use in my analysis, but again, I opted for brevity rather than include too much at the expense of providing focused recommendations as part of my project.

One interesting question that could have come within the scope of the project (and which I could have answered with more time) was how much revenue accrued to movies of different ratings. That would have provided a unique angle on the question of what type of movie to produce. I really enjoyed the process of learning to scrape data and so I thought this would be a fun exercise regardless. This was the result in a pandas dataframe:

Data scraped from “The Numbers” describing the revenue share of different ratings of films in their dataset

And the code snippet below is how I scraped the data above:
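A minimal sketch of the parsing step, not the original snippet, is given here using only the standard library. The class and function names are my own, and the sample markup and figures are made up purely to exercise the parser; real pages are messier and may need extra handling.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def parse_rows(html):
    """Return every table row in the HTML as a list of cell strings."""
    parser = TableRows()
    parser.feed(html)
    return parser.rows


# Made-up markup and figures, only to exercise the parser.
sample_html = """
<table>
  <tr><th>Rating</th><th>Movies</th><th>Share</th></tr>
  <tr><td>PG-13</td><td><a href="#">10</a></td><td>50%</td></tr>
  <tr><td>R</td><td>30</td><td>25%</td></tr>
</table>
"""
header, *body = parse_rows(sample_html)
print(header)
print(body)
```

With pandas installed, `pandas.DataFrame(body, columns=header)` turns the rows into a dataframe. Fetching the live page is left out here; the site may reject requests that lack a browser-like User-Agent header, and `pandas.read_html` is a shorter alternative when its parser dependencies are available.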

Conclusion

Sometimes the outtakes which did not make a movie's final cut can make for interesting viewing. I hope you found some of the features I have documented above interesting and perhaps useful going forward.