3. Problem Statement
• What rating would a user give an anime if we recommend it to them?
• How popular is each anime, based on its number of followers?
• How many groups do anime form, based on their genres?
• Which anime should be recommended to a user, based on their
preferences?
4. Business Values
• Able to choose anime that match current viewers
• Able to push advertisements to potential viewers
• Able to upsell similar products for each anime
• Able to accurately predict anime rating and popularity for license
acquisition
5. Requirements
• Anime data and user rating
• Recommendation algorithm using ALS
• Clustering algorithm using K-NN
• Model evaluation using RMSE
7. Data
• Two files: anime.csv and rating.csv
• Data volume
  • 12,294 rows in anime.csv
  • 7,813,737 rows in rating.csv
8. Schema
• anime.csv
• anime_id: myanimelist.net's unique id identifying an anime
• name: full name of anime
• genre: comma separated list of genres for this anime
• type: movie, TV, OVA, etc.
• episodes: number of episodes in this show (1 if movie)
• rating: average rating out of 10 for this anime
• members: number of community members that are in this anime's "group"
10. Schema
• rating.csv
• user_id: non-identifiable, randomly generated user id
• anime_id: the anime that this user has rated
• rating: rating out of 10 this user has assigned (-1 if the user watched it but
didn't assign a rating)
16. Get Data
• Dataset was downloaded from
https://www.kaggle.com/CooperUnion/anime-recommendations-database
• Data is in comma-separated values (CSV) file format
• See the “Data” section for details about the data
17. Data Pre-Processing
• Retrieved data is well-formed
• Some NULL values were found in rating
• Unknown episode counts are represented as “Unknown”
• Rows with NULL and/or Unknown values were filtered out
• About 500 rows were filtered out in total
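The filtering step above can be sketched as follows. This is a minimal illustration, not the deck's actual code. It assumes that a NULL shows up as an empty CSV field, and the sample rows and the helper names `is_complete` and `load_clean_rows` are hypothetical.

```python
import csv
import io

# Hypothetical sample in the anime.csv schema; values are made up for illustration.
SAMPLE = """anime_id,name,genre,type,episodes,rating,members
1,Example A,"Action, Drama",TV,24,8.5,1000
2,Example B,Comedy,TV,Unknown,7.1,500
3,Example C,Drama,Movie,1,,200
"""


def is_complete(row):
    """Return True if no field is empty (assumed NULL) or equal to 'Unknown'."""
    return all(value not in ("", "Unknown") for value in row.values())


def load_clean_rows(handle):
    """Read CSV rows from a file-like object and drop incomplete ones."""
    return [row for row in csv.DictReader(handle) if is_complete(row)]


clean_rows = load_clean_rows(io.StringIO(SAMPLE))
kept_ids = [row["anime_id"] for row in clean_rows]
print(kept_ids)
```

With a real file, the same function could be called as `load_clean_rows(open("anime.csv", newline="", encoding="utf-8"))`.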
18. Feature Engineering
• Use the original data schema
• Only data in rating.csv was processed
• Use anime_id, user_id and rating as features
• rating is also used as the target
19. Train the Model
• Processed only data in rating.csv
• Ratio of train-to-test data is 80:20
• Use the ALS algorithm to build a rating prediction model
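The 80:20 split can be illustrated with a short sketch. It covers only the split, not the ALS fit. The function name `train_test_split`, the fixed seed, and the synthetic ratings are assumptions made for the example; the deck does not say how the split was performed.

```python
import random


def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of the rows and cut it into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Synthetic (user_id, anime_id, rating) triples standing in for rating.csv.
ratings = [
    (user_id, anime_id, 5 + (user_id + anime_id) % 5)
    for user_id in range(1, 11)
    for anime_id in range(1, 11)
]

train, test = train_test_split(ratings)
print(len(train), len(test))
```

Fixing the seed keeps the split reproducible, so RMSE values from different runs can be compared.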
20. Model Evaluation
• Data in anime.csv is used to map anime_id to a human-readable
name
• Predicted ratings were of type “floating point”
• RMSE is used as the evaluation metric
• Some rows of test data cannot be predicted; they get “NaN” as a result
• NaN (Not-a-Number) results were filtered out
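As a rough illustration of the evaluation, the sketch below computes RMSE after dropping predictions that came back as NaN. The function name `rmse_ignoring_nan` and the sample numbers are hypothetical; the deck's own evaluation code is not shown.

```python
import math


def rmse_ignoring_nan(actual, predicted):
    """Root mean squared error over pairs whose prediction is not NaN."""
    pairs = [(a, p) for a, p in zip(actual, predicted) if not math.isnan(p)]
    if not pairs:
        raise ValueError("no valid predictions to evaluate")
    squared_error = sum((a - p) ** 2 for a, p in pairs)
    return math.sqrt(squared_error / len(pairs))


# Made-up ratings; the third prediction is NaN and is skipped.
actual = [8.0, 6.0, 9.0, 7.0]
predicted = [7.0, 6.0, float("nan"), 9.0]

score = rmse_ignoring_nan(actual, predicted)
print(score)
```

Dropping NaN rows means the score covers only the test rows for which a prediction could be produced, so it is worth reporting how many rows were dropped alongside the RMSE.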
24. Environment
• CRAN R 3.4.2
• Anime data file (anime.csv)
• Genres distance file (distance.csv)
25. Build a Clustering Model
• Try building with 5 to 10 clusters
• Use the distance.csv file to determine the distance
• Visualize the clusters
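Trying several cluster counts can be sketched as below. This is written in Python rather than R, and it assumes a plain k-means on random 2-D points. The deck does not state which clustering routine was run or what distance.csv contains, so the function `kmeans` and the data here are purely illustrative.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20, seed=0):
    """Basic k-means; returns labels and within-cluster sum of squares."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda i: squared_distance(point, centroids[i]))
            clusters[nearest].append(point)
        # Recompute each centroid; keep the old one if its cluster is empty.
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    labels = [min(range(k), key=lambda i: squared_distance(p, centroids[i])) for p in points]
    wcss = sum(squared_distance(p, centroids[label]) for p, label in zip(points, labels))
    return labels, wcss


# Random stand-in data; real input would be derived from the genre distances.
data_rng = random.Random(1)
points = [(data_rng.random(), data_rng.random()) for _ in range(200)]

results = {k: kmeans(points, k)[1] for k in range(5, 11)}
for k, wcss in results.items():
    print(k, round(wcss, 3))
```

Comparing the within-cluster sum of squares across k is one common way to choose among 5 to 10 clusters; other criteria may suit the data better.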
26. Discussion
• Distance values can be determined as indicated in the “How to produce a
pretty plot of the results of k-means cluster analysis?” discussion
(https://stats.stackexchange.com/questions/31083/how-to-produce-a-pretty-plot-of-the-results-of-k-means-cluster-analysis)
• Distance values in anime clustering should be normalized
• They can be the percentage of running time devoted to each genre
• Example: action scenes run for 12 minutes out of 24 minutes, so the
distance for action is 50%
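The normalization idea can be sketched as follows: per-genre minutes are turned into shares of the total running time, and two shows are then compared on those shares. The function names, the second show, and the use of Euclidean distance are assumptions for illustration. Only the 12-of-24-minutes action example comes from the slide.

```python
import math


def genre_shares(minutes_by_genre, total_minutes):
    """Convert minutes per genre into fractions of the total running time."""
    return {genre: minutes / total_minutes for genre, minutes in minutes_by_genre.items()}


def euclidean_distance(shares_a, shares_b):
    """Distance between two share vectors; a missing genre counts as 0."""
    genres = set(shares_a) | set(shares_b)
    return math.sqrt(
        sum((shares_a.get(g, 0.0) - shares_b.get(g, 0.0)) ** 2 for g in genres)
    )


# The slide's example: 12 minutes of action in a 24-minute episode.
episode_a = genre_shares({"action": 12, "drama": 12}, total_minutes=24)
# A second, made-up episode for comparison.
episode_b = genre_shares({"action": 6, "comedy": 18}, total_minutes=24)

print(episode_a["action"])  # 0.5, i.e. 50%
distance = euclidean_distance(episode_a, episode_b)
print(distance)
```

Because every share lies between 0 and 1, no single genre dominates the distance purely because of episode length.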