Anime recommendation (Big Data Certification#6)

Executive Summary
• Problem Statement
• Business Values
• Project Requirements

Problem Statement
• What rating would a user give an anime if we showed it to them?
• How popular is each anime, based on its followers?
• How many groups do the anime form, based on their genres?
• Which anime should be recommended to a user, based on their preferences?

Business Values
• Able to choose anime to match current viewer
• Able to push advertisement to potential viewer
• Able to upsell similar products for each anime
• Able to accurately predict anime rating and popularity for license
acquisition

Requirements
• Anime data and user rating
• Recommendation algorithm using ALS
• Clustering algorithm using K-Means
• Model evaluation using RMSE

Data
• From https://www.kaggle.com/CooperUnion/anime-recommendations-database
• Contains information on user preference data from 73,516 users on
12,294 anime
• Each user is able to add anime to their completed list and give it a
rating and this data set is a compilation of those ratings

Data
• 2 files, anime.csv and rating.csv
• Data volume
• 12,294 rows for anime.csv
• 7,813,737 rows for rating.csv

Schema
• anime.csv
• anime_id: myanimelist.net's unique id identifying an anime
• name: full name of anime
• genre: comma separated list of genres for this anime
• type: movie, TV, OVA, etc.
• episodes: how many episodes in this show. (1 if movie)
• rating: average rating out of 10 for this anime
• members: number of community members that are in this anime's "group"

Schema
• rating.csv
• user_id: non identifiable randomly generated user id
• anime_id: the anime that this user has rated
• rating: rating out of 10 this user has assigned (-1 if the user watched it but
didn't assign a rating)
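
The rating.csv layout above can be read with a few lines of Python. This is a minimal sketch: the sample rows are made up for illustration, and `read_ratings` is a hypothetical helper name. It only separates the -1 sentinel from real ratings; the slides do not say whether those rows were dropped before training.

```python
import csv
import io

# Made-up rows in the rating.csv layout (illustrative values only).
SAMPLE = """user_id,anime_id,rating
1,20,-1
1,24,8
2,20,10
"""

def read_ratings(handle):
    """Yield (user_id, anime_id, rating); rating is None where the file has -1."""
    for row in csv.DictReader(handle):
        rating = int(row["rating"])
        # -1 means "watched but not rated", so it is not a real score.
        yield int(row["user_id"]), int(row["anime_id"]), (None if rating == -1 else rating)

rows = list(read_ratings(io.StringIO(SAMPLE)))
rated = [r for r in rows if r[2] is not None]
print(len(rows), "rows,", len(rated), "with an explicit rating")
```

For the real file, `io.StringIO(SAMPLE)` would be replaced by an open file handle on rating.csv.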

Feature
• Use the original dataset to build the recommendation model
• Extract the unique genres from the genre column in anime.csv to build the clustering model

Feature
• Recommendation
• anime_id
• rating, also used as target
• user_id

Feature
• Clustering
• anime_id
• Pivoted genres (Action, Adventure, Comedy, Drama, …)
• type
• episodes
• rating
• members
• Use predicted cluster as target
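
Pivoting the comma-separated genre column into one 0/1 column per genre could look like the sketch below. The sample rows and the helper name `pivot_genres` are assumptions for illustration, not taken from the original project.

```python
def pivot_genres(anime_rows):
    """Return (sorted genre names, {anime_id: 0/1 vector in that order})."""
    # Split each comma-separated genre string and trim the surrounding spaces.
    split = {
        anime_id: [g.strip() for g in genre.split(",") if g.strip()]
        for anime_id, genre in anime_rows
    }
    genres = sorted({g for gs in split.values() for g in gs})
    vectors = {
        anime_id: [1 if g in gs else 0 for g in genres]
        for anime_id, gs in split.items()
    }
    return genres, vectors

# Made-up (anime_id, genre) pairs in the anime.csv style.
sample = [(1, "Action, Adventure, Comedy"), (2, "Drama"), (3, "Action, Drama")]
genres, vectors = pivot_genres(sample)
print(genres)
print(vectors[3])
```

The remaining clustering features (type, episodes, rating, members) would be appended to each vector, with type encoded numerically first.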

Running Prototyping Experiment
• Get data
• Data pre-processing
• Feature engineering
• Train the model
• Model evaluation

Get Data
• Dataset was downloaded from https://www.kaggle.com/CooperUnion/anime-recommendations-database
• Data is in comma-separated values (CSV) file format
• See data information in the "Data" section

Data Pre-Processing
• Retrieved data is well-formed
• Some NULL values were found in rating
• Unknown episode counts are represented as "Unknown"
• Rows with NULL and/or Unknown values were filtered out
• About 500 rows were filtered out in total
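
The filtering step can be sketched as follows. It assumes the rows come from anime.csv (the file that has an episodes column) and that a NULL rating shows up as an empty field; both are assumptions, and the sample rows and the name `clean_anime` are made up.

```python
import csv
import io

# Made-up rows in the anime.csv layout (illustrative values only).
SAMPLE = """anime_id,name,genre,type,episodes,rating,members
1,Show A,"Action, Comedy",TV,24,8.1,1000
2,Show B,Drama,TV,Unknown,7.5,500
3,Show C,Comedy,Movie,1,,200
"""

def clean_anime(handle):
    """Drop rows with a missing rating or an "Unknown" episode count."""
    kept = []
    for row in csv.DictReader(handle):
        if row["rating"].strip() == "" or row["episodes"].strip() == "Unknown":
            continue  # filtered out, as described in the bullets above
        row["episodes"] = int(row["episodes"])
        row["rating"] = float(row["rating"])
        kept.append(row)
    return kept

cleaned = clean_anime(io.StringIO(SAMPLE))
print(len(cleaned), "row(s) kept")
```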

Feature Engineering
• Use original data schema
• Processed only data in rating.csv
• Use anime_id, user_id and rating as features
• rating also used as target

Train the Model
• Processed only data in rating.csv
• Ratio of train-to-test data is 80:20
• Use ALS algorithm to build rating predictive model
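
To make the training step concrete, here is a small NumPy sketch of an 80:20 split followed by alternating least squares on explicit ratings. It is not the implementation the project used: the rank, regularization, iteration count, toy ratings, and all function names are assumptions chosen for illustration.

```python
import numpy as np

def split_ratings(ratings, train_fraction=0.8, seed=42):
    """Shuffle (user, item, rating) triples and split them into train/test."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(ratings))
    cut = int(len(ratings) * train_fraction)
    return [ratings[i] for i in order[:cut]], [ratings[i] for i in order[cut:]]

def solve_side(fixed, groups, rank, reg):
    """Regularized least-squares update for one side while the other is fixed."""
    updated = {}
    for key, pairs in groups.items():
        idx = [other for other, _ in pairs]
        target = np.array([r for _, r in pairs], dtype=float)
        f = fixed[idx]
        updated[key] = np.linalg.solve(f.T @ f + reg * np.eye(rank), f.T @ target)
    return updated

def train_als(train, n_users, n_items, rank=2, reg=0.1, iterations=15, seed=0):
    """Alternate between solving for user factors and item factors."""
    rng = np.random.default_rng(seed)
    user_f = rng.normal(scale=0.1, size=(n_users, rank))
    item_f = rng.normal(scale=0.1, size=(n_items, rank))
    by_user, by_item = {}, {}
    for u, i, r in train:
        by_user.setdefault(u, []).append((i, r))
        by_item.setdefault(i, []).append((u, r))
    for _ in range(iterations):
        for u, vec in solve_side(item_f, by_user, rank, reg).items():
            user_f[u] = vec
        for i, vec in solve_side(user_f, by_item, rank, reg).items():
            item_f[i] = vec
    return user_f, item_f

# Toy (user_id, anime_id, rating) triples with ids already mapped to 0..n-1.
ratings = [(0, 0, 8), (0, 1, 6), (0, 2, 7), (1, 0, 9), (1, 2, 7),
           (1, 3, 5), (2, 1, 5), (2, 2, 6), (3, 0, 10), (3, 3, 6)]
train, test = split_ratings(ratings)
user_f, item_f = train_als(train, n_users=4, n_items=4)

def predict(u, i):
    return float(user_f[u] @ item_f[i])

train_rmse = float(np.sqrt(np.mean([(r - predict(u, i)) ** 2 for u, i, r in train])))
zero_rmse = float(np.sqrt(np.mean([r ** 2 for _, _, r in train])))
print(len(train), len(test), round(train_rmse, 3))
```

A user or anime that appears only in the test split keeps its random starting factors here; a library implementation would typically report such a prediction as missing instead.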

Model Evaluation
• Data in anime.csv is used to map anime_id to a human-readable name
• Predicted ratings are floating-point values
• RMSE is used as the evaluation metric
• Some rows of test data cannot be predicted, so the result is "NaN"
• NaN (Not-a-Number) predictions were filtered out
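
The evaluation described above, RMSE over the test rows after dropping NaN predictions, can be sketched in plain Python. The function name and the sample pairs are illustrative assumptions.

```python
import math

def rmse_ignoring_nan(pairs):
    """RMSE over (actual, predicted) pairs, skipping pairs whose prediction is NaN."""
    valid = [(actual, pred) for actual, pred in pairs if not math.isnan(pred)]
    if not valid:
        return math.nan  # nothing could be predicted
    squared_error = sum((actual - pred) ** 2 for actual, pred in valid)
    return math.sqrt(squared_error / len(valid))

# Made-up (actual rating, predicted rating) pairs; the last one has no prediction.
pairs = [(8, 7.0), (6, 8.0), (9, math.nan)]
score = rmse_ignoring_nan(pairs)
print(score)
```

Dropping NaN rows means the score only covers rows the model could predict, so the number of dropped rows is worth reporting next to the RMSE.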

Contents
• Clustering model with K-Means
• Real-time data processing
• Visualization

Environment
• CRAN R 3.4.2
• Anime data file (anime.csv)
• Genres distance file (distance.csv)

Build a Clustering Model
• Try building with 5 to 10 clusters
• Use the distance.csv file to determine the distance
• Visualize the clusters
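
Trying 5 to 10 clusters can be sketched as below. The original work was done in R with a distance.csv file whose contents are not shown in the slides, so this Python version swaps in plain Euclidean K-Means on random stand-in features; the data, the function name, and the parameters are all assumptions.

```python
import numpy as np

def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm: returns (labels, centers, within-cluster squared error)."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(iterations):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            members = points[labels == c]
            if len(members):  # keep the old center if a cluster becomes empty
                centers[c] = members.mean(axis=0)
    # Final assignment against the last set of centers.
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    inertia = float((dists[np.arange(len(points)), labels] ** 2).sum())
    return labels, centers, inertia

# Random stand-in for the pivoted genre features (60 anime, 4 features).
points = np.random.default_rng(1).random((60, 4))
results = {k: kmeans(points, k) for k in range(5, 11)}
for k, (_, _, inertia) in results.items():
    print(k, round(inertia, 3))
```

Comparing the within-cluster error across k is one common way to pick a cluster count; the slides do not say which criterion was used.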

Discussion
• Distance values can be determined as indicated in the "How to produce a pretty plot of the results of k-means cluster analysis?" discussion (https://stats.stackexchange.com/questions/31083/how-to-produce-a-pretty-plot-of-the-results-of-k-means-cluster-analysis)
• Distance values in anime clustering should be normalized
• They could be the percentage of running time devoted to each genre
• Example: action scenes run for 12 of 24 minutes, so the distance for action is 50%
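
The normalization in the example above is a simple ratio. A minimal sketch, with a hypothetical helper name and made-up minutes per genre:

```python
def genre_shares(minutes_by_genre, episode_minutes):
    """Return each genre's share of the running time as a fraction between 0 and 1."""
    if episode_minutes <= 0:
        raise ValueError("episode_minutes must be positive")
    return {genre: minutes / episode_minutes
            for genre, minutes in minutes_by_genre.items()}

# The slide's example: 12 minutes of action in a 24-minute episode.
shares = genre_shares({"Action": 12, "Comedy": 6}, 24)
print(shares["Action"])  # 0.5, i.e. 50%
```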

Environment
• Web API
• Kafka
• Spark Streaming

Environment
[Architecture diagram: three Clients, Request, Response, Producer, Consumer]
