Data Science: Movie Dataset Analysis

Jan 6, 2021

In my previous blog post, I shared my interest in data science and why it is important. In this post, I would like to share how to analyze a dataset, present the outcome of the analysis, and draw recommendations from it.

Reason for analysis:

First of all, we need to think about what we want to find in the dataset and how to go about it. Suppose one of the leading companies wants to make a movie and wants to know what type of movie is most likely to succeed. Our focus is to find out what type of movie that is. The success of a movie can depend on many criteria, such as audience ratings, gross revenue, budget, and profit.

Source for a dataset:

We could browse the internet for freely available movie data or fetch it through an API. Many websites provide movie reviews and ratings, such as IMDb, Rotten Tomatoes, Film.com, and Flixster. Here, I will analyze the IMDb dataset.

Explore the dataset:

Here I am using Python and Jupyter Notebook. We need to set up our Jupyter notebook and import the necessary packages, then load the CSV files using pandas.
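As a rough sketch of the loading step, the snippet below reads three small in-memory CSVs with pandas and takes a first look at them. The sample rows and every column name except tconst and release_date are made up for illustration. With real files you would pass a file path to pd.read_csv instead of a StringIO object.

```python
import io
import pandas as pd

# Stand-ins for the three CSV files (hypothetical sample rows and column names).
title_csv = io.StringIO(
    'tconst,primary_title,start_year,runtime_minutes,genres\n'
    'tt001,Movie A,2015,90,"Action,Drama"\n'
    'tt002,Movie B,2016,,Documentary\n'
)
rating_csv = io.StringIO(
    'tconst,averagerating,numvotes\n'
    'tt001,7.1,1200\n'
    'tt002,8.3,45\n'
)
budget_csv = io.StringIO(
    'release_date,movie,production_budget,domestic_gross,worldwide_gross\n'
    '"May 1, 2015",Movie A,"$1,000,000","$2,500,000","$4,000,000"\n'
)

title = pd.read_csv(title_csv)
rating = pd.read_csv(rating_csv)
budget = pd.read_csv(budget_csv)

# First look at each dataset: size, column types, and missing values per column.
for name, df in [("title", title), ("rating", rating), ("budget", budget)]:
    print(name, df.shape)
    print(df.dtypes)
    print(df.isnull().sum())
```

The isnull().sum() call is what reveals which columns have missing values before any cleaning is done.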

For this scenario, I have taken three datasets.

Title: contains all the information about the movie title and genre
Budget: contains all the information about the movie budget and gross data
Rating: contains the ratings of all the movies

We have to explore and understand the data before working on it. Looking at the three datasets, the title dataset contains null values. In the budget dataset, the release_date column is stored as an object (string) type and needs to be converted to datetime format, and the domestic and worldwide gross columns have to be converted from object to float values.
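The conversions described above could look roughly like this. It is a minimal sketch on made-up rows. It assumes the dates look like "May 1, 2015", that the gross values carry "$" and "," characters, and that the gross columns are named domestic_gross and worldwide_gross. Adjust these to whatever your data actually contains.

```python
import pandas as pd

# Hypothetical sample of the budget dataset before cleaning.
budget = pd.DataFrame({
    "release_date": ["May 1, 2015", "Jun 12, 2016"],
    "movie": ["Movie A", "Movie B"],
    "domestic_gross": ["$2,500,000", "$750,000"],
    "worldwide_gross": ["$4,000,000", "$1,250,000"],
})

# Parse the text dates into pandas datetime values.
budget["release_date"] = pd.to_datetime(budget["release_date"], format="%b %d, %Y")

# Strip "$" and "," from the gross columns, then convert them to float.
for col in ["domestic_gross", "worldwide_gross"]:
    budget[col] = budget[col].str.replace(r"[$,]", "", regex=True).astype(float)

print(budget.dtypes)
```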

Merging Datasets:

After cleaning the datasets, we merge them for analysis. To merge datasets, look for columns they have in common. The ‘title’ and ‘rating’ datasets both have the ‘tconst’ column. Hence I joined the ‘title’ and ‘rating’ datasets using an inner join on ‘tconst’, and then joined the result to the ‘budget’ dataset using a left join on the movie title column, which they share.
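The two joins can be sketched as follows. The rows are made up, and the column names primary_title and movie are assumptions. The sketch also assumes the combined title and rating table is the left side of the second join, so rated movies without budget data are kept.

```python
import pandas as pd

# Hypothetical miniature versions of the three cleaned datasets.
title = pd.DataFrame({
    "tconst": ["tt001", "tt002", "tt003"],
    "primary_title": ["Movie A", "Movie B", "Movie C"],
    "genres": ["Action,Drama", "Documentary", "Comedy"],
})
rating = pd.DataFrame({
    "tconst": ["tt001", "tt002"],
    "averagerating": [7.1, 8.3],
})
budget = pd.DataFrame({
    "movie": ["Movie A"],
    "domestic_gross": [2500000.0],
})

# Inner join: keep only the titles that also have a rating.
title_rating = title.merge(rating, on="tconst", how="inner")

# Left join: keep every rated title and attach budget data where the title matches.
movies = title_rating.merge(
    budget, left_on="primary_title", right_on="movie", how="left"
)
print(movies)
```

Joining on a title string is fragile, since two different movies can share a name, so it is worth checking the row count before and after the merge.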

Analysis:

Once we are done with the cleaning and merging, we can start our analysis. I made my analysis by answering three questions from this dataset.

What type of movie did people watch or rate in the past year?
Does movie length have any impact on the audience?
When is the best time to release the movie to be successful?

What type of movie did people watch or rate in the past year?

Let’s have a look at the genre column. A movie can belong to several genres, and they are combined into a single value, so how can we count each one of them?

There are many ways to do this, but one way is to create a separate column for each genre and assign the value ‘1’ if the genre is found in that particular row, and ‘0’ otherwise.

We can then sum each genre column to get the count of each genre, keeping the genre names as the index.
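One possible way to build these 0/1 columns is pandas' str.get_dummies, shown here on made-up rows. It assumes the genres are stored as a comma-separated string in a column named genres. It is not necessarily how the original notebook did it.

```python
import pandas as pd

# Hypothetical rows with comma-separated genres.
movies = pd.DataFrame({
    "primary_title": ["Movie A", "Movie B", "Movie C"],
    "genres": ["Action,Drama", "Documentary", "Drama"],
})

# One column per genre: 1 if the movie has that genre, 0 otherwise.
genre_flags = movies["genres"].str.get_dummies(sep=",")
movies = pd.concat([movies, genre_flags], axis=1)

# Summing each flag column gives the number of movies per genre.
genre_counts = genre_flags.sum()
print(movies)
print(genre_counts)
```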

Since we want to know the top genre in the past years, we group the data by ‘year.’

Finally, we add up the count of each genre per year from 2010 to 2017. I am using data only up to 2017 because I couldn’t find enough data for more recent years.
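The grouping and totalling steps might look like the sketch below. The data is made up, and the column name start_year is an assumption standing in for the ‘year’ column mentioned above.

```python
import pandas as pd

# Hypothetical rows: release year plus comma-separated genres.
movies = pd.DataFrame({
    "start_year": [2010, 2010, 2011, 2011, 2011],
    "genres": ["Documentary", "Action,Drama", "Documentary,Drama",
               "Comedy", "Documentary"],
})

# 0/1 flag columns per genre, then the number of movies per genre for each year.
flags = movies["genres"].str.get_dummies(sep=",").astype(int)
genres_per_year = flags.groupby(movies["start_year"]).sum()

# Restrict to 2010-2017 and total each genre across those years.
totals = genres_per_year.loc[2010:2017].sum().sort_values(ascending=False)
top5 = totals.head(5)
print(genres_per_year)
print(top5)
```

Here top5 holds the genres with the highest counts, which is the selection a top 5 plot would be drawn from.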

Now let’s plot our findings on the genre data. Here I have plotted the top 5 genres.

The visualization shows that Documentary is the genre most often rated by the audience in the past years.

Does movie length have any impact on the audience?

For this analysis, we need movie runtime data. So I grouped the movies by average rating and took the mean runtime for each rating value.

Now let’s plot a bar graph for the top 30 ratings. Usually people look at a top 10, but here I have taken the top 30 to see the trend across ratings.
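The grouping behind this comparison could be computed roughly as below, before any plotting. The numbers are made up, and averagerating and runtime_minutes are assumed column names.

```python
import pandas as pd

# Hypothetical ratings and runtimes (in minutes).
movies = pd.DataFrame({
    "averagerating": [9.0, 9.0, 8.5, 3.0, 2.5, 2.5],
    "runtime_minutes": [88, 92, 95, 91, 85, 93],
})

# Mean runtime for each distinct rating value.
runtime_by_rating = movies.groupby("averagerating")["runtime_minutes"].mean()

# Highest 30 and lowest 30 rating values (fewer if the data has fewer).
top = runtime_by_rating.sort_index(ascending=False).head(30)
bottom = runtime_by_rating.sort_index(ascending=True).head(30)
print(top)
print(bottom)
```

Either series can then be drawn as a bar chart, for example with its plot(kind="bar") method if matplotlib is installed.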

The graph below shows that most of the top-rated movies have an average movie run time of around 90 minutes.

In the same way, we can look at the lowest 30 ratings.

Again, the bar graph below shows that the lowest-rated movies also have an average runtime of around 90 minutes.

When is the best time to release the movie in order to be successful?

Finally, let’s find the best time to release a movie, since releasing at the right time might help make it successful.

For that, I have plotted a bar graph of the month in which past movies were released against their average domestic gross.

The graph shows that movies released in May and June have the highest average domestic gross.
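The values behind such a graph can be computed as in this sketch, which uses made-up numbers and assumes release_date is already a datetime column and the gross column is named domestic_gross.

```python
import pandas as pd

# Hypothetical release dates and domestic gross values.
movies = pd.DataFrame({
    "release_date": pd.to_datetime(
        ["2015-05-01", "2016-05-20", "2015-06-12", "2016-01-08"]
    ),
    "domestic_gross": [300.0, 500.0, 350.0, 100.0],
})

# Extract the release month (1-12) and average the gross per month.
movies["release_month"] = movies["release_date"].dt.month
avg_gross_by_month = movies.groupby("release_month")["domestic_gross"].mean()

# Months ordered from highest to lowest average gross.
best_months = avg_gross_by_month.sort_values(ascending=False)
print(best_months)
```

Because this is a mean, a single blockbuster can pull a month's value up, so the median is a useful cross-check.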

Success! Finally, we created our sample movie analysis!!!! Yesssss!!!!

Result:

The genre analysis shows that Documentary is currently the best-performing genre, so a Documentary looks like the most promising type of movie to make.
Most movies have an average runtime of around 90 minutes, and both the highest-rated and lowest-rated movies fall in that range, which suggests that runtime has no real impact on how viewers rate a movie.
Finally, movies released in May and June have made the highest average gross. This could be for a number of reasons, such as good weather or the holiday season.

Conclusion:

In this blog, we have completed the movie analysis and made recommendations for making a successful movie, all using Python. We saw the basic steps of analyzing data against a set of requirements, and how to present and interpret the results.

I hope you enjoyed this and learned something interesting. As a next step, I am working on further analysis of the movie budget and its impact on audience ratings, to check whether the budget is just a number or whether it really shapes opinion.