Get New Netflix Movies by a Recommendation System

I started this recommendation system during my Data Science training. Besides the data exploration below, I’m also going to build a new application to show these recommendations to the end user in the near future.

Photo by Thibault Penin on Unsplash

Dataset

The Netflix dataset is the “Netflix Prize data” available on Kaggle, released for a competition Netflix held to improve its recommendation algorithm. Despite being an old competition, we can still learn a lot from it. You can get the dataset from here.

For this exploration, we are going to use just the .txt files (combined_data..) and the movie_titles.csv file.

Data Exploration

The full exploration is in a Jupyter Notebook that I uploaded to my GitHub profile.
With that said, we will start by looping over each .txt file and combining them all into a single dataset, as in the code below.

But be patient, because the files have many rows and the processing time will depend on the power of your machine.
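
As a minimal sketch of the combining step, assuming each .txt file follows the Netflix Prize layout (a line such as `1:` opens the block for a movie, followed by `user,rating,date` rows), a parser could look like the one below. The function name `parse_combined`, the column names and the sample rows are my own illustration, not necessarily what the notebook uses.

```python
import io

import pandas as pd


def parse_combined(lines):
    """Yield (movie, user, rating, date) tuples from Netflix-Prize-style lines."""
    movie_id = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.endswith(":"):
            # A line such as "1:" starts the block of ratings for movie 1.
            movie_id = int(line[:-1])
        else:
            user, rating, date = line.split(",")
            yield movie_id, int(user), int(rating), date


# Tiny in-memory stand-in for one of the .txt files (illustrative rows only).
sample_file = io.StringIO(
    "1:\n"
    "1488844,3,2005-09-06\n"
    "822109,5,2005-05-13\n"
    "2:\n"
    "2059652,4,2005-09-05\n"
)

df = pd.DataFrame(
    list(parse_combined(sample_file)),
    columns=["movie", "user", "rating", "date"],
)
df["date"] = pd.to_datetime(df["date"])
print(df)
```

With the real data you would open each .txt file in a loop, pass the file object to `parse_combined`, and concatenate the results into one DataFrame.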

Let’s see a summary for our data:

Data summary 
--------------------------------------------------
Movie total count: 17770
Users total count: 480189
Rating total count: 100480507

Let’s split the data into training and testing sets before proceeding with the exploratory analysis, as some analyses only make sense on training data. We will use an 80/20 ratio for training/testing.

For that, we will save the training data as a dataset on disk. That way we don’t need to run the entire loading process again every time we run the notebook.

For train data the file is called train_data.csv, and for test data the file is called test_data.csv.
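
The article does not say how the rows are assigned to each side, so the sketch below assumes a chronological split: sort by date, keep the first 80% for training and the rest for testing. The helper `split_and_save` and the sample rows are hypothetical; only the two file names come from the text above.

```python
import os
import tempfile

import pandas as pd


def split_and_save(df, out_dir, train_frac=0.8):
    """Sort by date, split train/test, and write both parts to CSV files."""
    df = df.sort_values("date").reset_index(drop=True)
    cut = int(len(df) * train_frac)
    train, test = df.iloc[:cut], df.iloc[cut:]
    train.to_csv(os.path.join(out_dir, "train_data.csv"), index=False)
    test.to_csv(os.path.join(out_dir, "test_data.csv"), index=False)
    return train, test


# Ten made-up ratings, one per day, just to exercise the function.
ratings = pd.DataFrame({
    "movie": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "user": [10, 11, 10, 12, 11, 12, 13, 10, 13, 11],
    "rating": [3, 5, 4, 2, 5, 1, 4, 3, 2, 5],
    "date": pd.date_range("2005-01-01", periods=10, freq="D"),
})

with tempfile.TemporaryDirectory() as tmp:
    train_df, test_df = split_and_save(ratings, tmp)
    # On later runs, reading the CSV back replaces the whole loading step.
    reloaded = pd.read_csv(os.path.join(tmp, "train_data.csv"), parse_dates=["date"])

print(len(train_df), len(test_df))
```

A random split would work too; the chronological one just mimics predicting future ratings from past ones.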

Using the training data, let’s look at the distribution of ratings:

Image by Author

Now let’s check whether the day of the week has an influence on the users’ ratings. To find out, we add a column with the day of the week.

Creating this column takes a few minutes, so be patient again.

Elapsed time: 0:07:08.215719
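
A weekday column like the one described above can be derived from a datetime column with the pandas `.dt.day_name()` accessor. This is only a sketch on made-up rows, assuming the date column is called `date`; it is not necessarily how the notebook builds the column, and a vectorized call like this one usually runs much faster than a row-by-row loop.

```python
import pandas as pd

# Made-up ratings with known dates (2000-01-01 was a Saturday).
train_df = pd.DataFrame({
    "rating": [4, 3, 5, 2],
    "date": pd.to_datetime(["2000-01-01", "2000-01-02", "2000-01-03", "2000-01-08"]),
})

# Vectorized conversion from date to weekday name.
train_df["weekday"] = train_df["date"].dt.day_name()

# Number of ratings per weekday, the quantity shown in the bar plot.
ratings_per_day = train_df["weekday"].value_counts()
print(ratings_per_day)
```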

So now that we have the column called “weekday”, we can generate our plot:

Image by Author

It’s curious that Tuesday has more ratings than the other days. Why this day? To dig into that, let’s calculate the average rating per day of the week.

Average Ratings
------------------------------
weekday
Friday 3.585274
Monday 3.577250
Saturday 3.591791
Sunday 3.594144
Thursday 3.582463
Tuesday 3.574438
Wednesday 3.583751
Name: rating, dtype: float64

According to the result, the average is almost the same for every weekday (between roughly 3.57 and 3.59), so the day of the week does not seem to have an influence on the users’ ratings.
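
The per-weekday averages can be obtained with a simple `groupby`. The sketch below uses made-up rows and assumes the columns are named `weekday` and `rating`, which matches the printed output above.

```python
import pandas as pd

# Made-up rows; the real frame would be the training data with its weekday column.
train_df = pd.DataFrame({
    "weekday": ["Monday", "Monday", "Tuesday", "Sunday"],
    "rating": [4, 2, 5, 3],
})

# Mean rating per weekday; groupby sorts the index alphabetically by default.
avg_by_day = train_df.groupby("weekday")["rating"].mean()
print(avg_by_day)
```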

Now let’s look at user ratings over time:

Image by Author

We can see that users rated more and more movies as the years went by, maybe because they were learning how to use the feature, because it was becoming more popular among users, or simply because the number of users on the platform increased.

Now let’s check which users rated the most:

user
305344 17112
2439493 15896
387418 15402
1639792 9767
1461435 9447
Name: rating, dtype: int64

Let’s create a plot to see that better:

Image by Author

We can already see that the vast majority of users have fewer than 1,000 ratings.

Now let’s look at the quantiles of the number of ratings per user. How many ratings do the top 5% of users have?

0.00 1
0.05 7
0.10 15
0.15 21
0.20 27
0.25 34
0.30 41
0.35 50
0.40 60
0.45 73
0.50 89
0.55 109
0.60 133
0.65 163
0.70 199
0.75 245
0.80 307
0.85 392
0.90 520
0.95 749
1.00 17112
Name: rating, dtype: int64
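
Both the “users who rated the most” list and the quantile table can be computed from a per-user count. The sketch below uses made-up data where user 101 has one rating, user 102 has two, and so on. The `interpolation="higher"` argument is my assumption to keep the quantiles as whole counts like in the output above.

```python
import numpy as np
import pandas as pd

# Made-up data: user 101 rated once, 102 twice, ... 105 five times.
users = np.repeat([101, 102, 103, 104, 105], [1, 2, 3, 4, 5])
train_df = pd.DataFrame({"user": users, "rating": 3})

# Number of ratings per user, most active users first.
ratings_per_user = (
    train_df.groupby("user")["rating"].count().sort_values(ascending=False)
)
print(ratings_per_user.head())

# Quantiles from 0% to 100% in steps of 5%.
quantiles = ratings_per_user.quantile(np.linspace(0, 1, 21), interpolation="higher")
print(quantiles)
```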

Let’s create a plot to see that better:

Image by Author

What we can notice from the plot above is that some movies (the very popular ones) are rated by a large number of users, but most movies (around 90%) have only a few hundred ratings.

Sparse Matrix

Let’s imagine that we are browsing a website such as Netflix or Amazon, and suddenly new product recommendations start to appear, things that you may like. These sites have data that tells them which products a specific user has bought, what that user did on the website, where the user clicked, and so on.
They compare this information between users, and to do that they compute the similarity between them.

To reproduce the same idea, we are now going to create a sparse matrix like the one in the image below.

Photo by cmdline on cmdlinetips

In this matrix, the users are on the y axis (rows) and the movies on the x axis (columns), and each cell holds the rating that user gave to that movie. If the user has not rated a movie, we use the number 0 in that place (but we could use any other number). Since each user rates only a small fraction of all movies, almost every cell is 0, and that is why we use a sparse matrix instead of a dense one: it stores only the non-zero entries while still representing the relationship between every user and every movie.

We are going to save the matrices for both train and test to disk, so we don’t need to rebuild them every time we run the notebook.

Here we do exactly what I explained before: we use the rating values, the users themselves and the movies to build the sparse matrix.

Sparse Matrix created. Shape is: (user, movie):  (2649430, 17771)
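
A minimal sketch of this step with SciPy, assuming the training data has `user`, `movie` and `rating` columns. The ids are used directly as row and column indices, which would explain why the shape above is one larger than the biggest movie id (index 0 stays empty). The file name `train_sparse_matrix.npz` is a placeholder of mine.

```python
import os
import tempfile

import pandas as pd
from scipy import sparse

# Made-up ratings; ids double as row/column positions in the matrix.
train_df = pd.DataFrame({
    "user": [1, 1, 2, 3],
    "movie": [1, 2, 2, 3],
    "rating": [4, 5, 3, 2],
})

# csr_matrix((values, (rows, cols))) stores only the ratings that exist.
train_sparse = sparse.csr_matrix(
    (train_df["rating"].values, (train_df["user"].values, train_df["movie"].values))
)
print("Shape is (user, movie):", train_sparse.shape)

# Persist the matrix so later runs can load it instead of rebuilding it.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train_sparse_matrix.npz")
    sparse.save_npz(path, train_sparse)
    reloaded = sparse.load_npz(path)
```

Note that `csr_matrix` sums duplicate (user, movie) pairs, so the input should contain at most one rating per pair.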

After both are created, we will calculate the global average of all ratings, the average rating per user and the average rating per movie.

First we will check the global average rating across all movies, as below.

{'global': 3.582890686321557}

To get the other average ratings, we created a function to do the calculation, and then we built a dict to store those values.

Then we can print the average rating for user id 149.

User Rating Average: 4.25

Below we also calculate the average rating per movie, and print it for movie id 32.

Movie Rating Average 32: 3.9922680412371134
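
These averages can be computed straight from the sparse matrix by dividing the sum of ratings by the number of stored ratings, so the zeros of unrated cells are not counted. The sketch below is one possible version; the helper `average_ratings` and the dict keys other than `global` are my own naming.

```python
import numpy as np
from scipy import sparse


def average_ratings(sparse_matrix, of_users=True):
    """Average rating per user (rows) or per movie (columns), skipping empty ones."""
    axis = 1 if of_users else 0
    sums = np.asarray(sparse_matrix.sum(axis=axis)).ravel()
    counts = sparse_matrix.getnnz(axis=axis)  # number of stored ratings
    return {int(i): sums[i] / counts[i] for i in np.flatnonzero(counts)}


# Made-up (user, movie) -> rating entries.
rows = np.array([1, 1, 2, 3])
cols = np.array([1, 2, 2, 3])
vals = np.array([4, 5, 3, 2])
train_sparse = sparse.csr_matrix((vals, (rows, cols)))

train_averages = {"global": train_sparse.sum() / train_sparse.nnz}
train_averages["user"] = average_ratings(train_sparse, of_users=True)
train_averages["movie"] = average_ratings(train_sparse, of_users=False)

print("Global average:", train_averages["global"])
print("User 1 average:", train_averages["user"][1])
print("Movie 2 average:", train_averages["movie"][2])
```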

Cold Start Problem

I also learned that in Machine Learning we can run into a problem called Cold Start. In our case it means that, since we have divided our dataset into training and test (80/20), there may be some users who do not appear in the training data at all.

If we subtract the users in the training data from the total users, we get the answer below.

Users total count: 480189
Total training users count: 405041
Total users not count in training: 75148 (15.65%)

75,148 users are not part of the training data, that is, we cannot learn the rating pattern of these users! This is the cold start problem.

This also happens for movies.

Movies total count: 17770
Total training movies count: 17424
Total movies not count in training: 346 (1.95%)
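
The counts above boil down to a set difference between all ids and the ids seen in training. A small sketch on made-up data, with a hypothetical helper `count_unseen`:

```python
import pandas as pd


def count_unseen(all_ids, train_ids):
    """Return how many ids never show up in training, and their share in percent."""
    all_set = set(all_ids)
    missing = all_set - set(train_ids)
    return len(missing), 100 * len(missing) / len(all_set)


# Made-up full dataset; the first three rows play the role of the training split.
full = pd.DataFrame({"user": [1, 2, 3, 4, 4], "movie": [10, 10, 11, 12, 13]})
train = full.iloc[:3]

missing_users, pct_users = count_unseen(full["user"], train["user"])
missing_movies, pct_movies = count_unseen(full["movie"], train["movie"])

print(f"Users not in training: {missing_users} ({pct_users:.2f}%)")
print(f"Movies not in training: {missing_movies} ({pct_movies:.2f}%)")
```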

346 movies do not appear in the training data. We will have to deal with this later, especially when we work on the Machine Learning model.

If this turns out to be a real problem, we need to change how the data is split.

Movie Similarity Matrix

We are going to create our movie similarity matrix, which will let us find the most similar movies according to the users’ rating patterns.

# Shape
movie_sparse_matrix.shape
(17771, 17771)
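
The text does not name the similarity measure, so the sketch below assumes cosine similarity between movie columns of the user x movie matrix, computed with SciPy only: each movie vector is scaled to unit length, and the dot products between them are the similarities. The ratings are made up; only the name `movie_sparse_matrix` comes from the output above.

```python
import numpy as np
from scipy import sparse

# Made-up user x movie ratings: movies 1 and 2 are rated identically.
rows = np.array([1, 1, 2, 2, 3])
cols = np.array([1, 2, 1, 2, 3])
vals = np.array([4, 4, 2, 2, 5], dtype=np.float64)
train_sparse = sparse.csr_matrix((vals, (rows, cols)))

# One row per movie, one column per user.
item_user = train_sparse.T.tocsr()

# Length of each movie vector; unrated movies keep norm 1 to avoid dividing by zero.
norms = np.sqrt(np.asarray(item_user.multiply(item_user).sum(axis=1)).ravel())
norms[norms == 0] = 1.0

# Scale every movie vector to unit length, then take all pairwise dot products.
normalized = sparse.diags(1.0 / norms) @ item_user
movie_sparse_matrix = (normalized @ normalized.T).tocsr()

print(movie_sparse_matrix.shape)
print(movie_sparse_matrix[1, 2])
```

Movies 1 and 2 end up with a similarity close to 1 because the same users gave them the same ratings, while movie 3 shares no users with them and gets 0.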

Recommendation

Now let’s load all the movies from the file movie_titles.csv, which is also available on Kaggle.

This dataset contains:

  • Movie id
  • Movie release Year
  • Movie title

movie_titles.head()
Image by Author

Now that we have all the movies loaded, let’s see which movies are similar to movie id 13673.

Movie: Toy Story
Total User Ratings = 4785.
We find 17342 movies that are similar to this one and we'll print the most similar ones.

Finding all the similarities for one movie.

We can also see this in a plot:
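
Looking up the most similar movies then amounts to taking one row of the similarity matrix, dropping the movie itself and sorting the rest. A sketch under my own assumptions: the helper `top_similar`, the tiny similarity matrix and the titles are all made up, and the titles frame is indexed by movie id.

```python
import numpy as np
import pandas as pd
from scipy import sparse


def top_similar(sim_matrix, movie_id, titles, n=10):
    """Return the n movies with the highest non-zero similarity to movie_id."""
    scores = sim_matrix[movie_id].toarray().ravel()
    scores[movie_id] = 0.0  # a movie should not recommend itself
    ranked = [int(i) for i in np.argsort(scores)[::-1] if scores[i] > 0][:n]
    return titles.loc[ranked].assign(similarity=scores[ranked])


# Made-up similarity matrix for movie ids 1 to 3 (row/column 0 is unused).
sim = sparse.csr_matrix(np.array([
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.9, 0.2],
    [0.0, 0.9, 1.0, 0.5],
    [0.0, 0.2, 0.5, 1.0],
]))

# Made-up titles, indexed by movie id.
movie_titles = pd.DataFrame(
    {"year": [1995, 1999, 2001], "title": ["Movie A", "Movie B", "Movie C"]},
    index=[1, 2, 3],
)

result = top_similar(sim, movie_id=1, titles=movie_titles, n=2)
print(result)
```

Every id returned must exist in the titles index, otherwise `.loc` raises a `KeyError`.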

Image by Author

Here are the top 10 movies most similar to movie id 13673, chosen previously.

Image by Author

Conclusion

From now on we are able to find the most recommended movies based on the one we chose from the Netflix list. For me at least, it was very pleasant to see the result.

Gif by Giphy on Giphy

Building on the recommendation system we studied above, we are now able to develop a new application that shows the results in a more user-friendly way and lets the user pick a movie from the list, using Flask or Django for example.

I hope you enjoyed this article and the step by step. I encourage you to play around with this dataset on your own as I did, since the dataset is already available to us. The script for this article can be found in my GitHub repository: guimatheus92/Get-New-Netflix-Movies-by-a-Recommendation-System (github.com)

You can also find me on my LinkedIn profile or on my website: Guilherme Matheus | Back-end Developer, Front-end Developer & Data Scientist (guimatheus92.github.io)

Mechanical Engineer and Business Intelligence developer, passionate about technology. I have the knowledge and experience to create a BI architecture and much more 📚.