2. About me: Hans
● A Solutions Architect for Cloud Computing
○ AWS Certified 7/11, GCP Certified Professional Cloud Architect
○ Well-Architected Ambassador, improving clients’ workloads on AWS
○ High Traffic Volume Website (CDN Solutions) and Big Data Solutions
● A Contributor to an Open Source Project
○ Moto: a library that makes it easy to mock AWS infrastructure in tests
● An ML Amateur
○ AWS DeepRacer League ranking 11th/89 in Taipei Summit, 2019
○ Trend Micro Malware Detection Competition ranking 50th/424, 2018
○ The Analytics Edge Kaggle competition ranking 20th/2923, 2015
6. Introduction
● We all know that logs are important, but…
● Some people don’t care about them. The truth is not
important to them; the responsibility is
● Let’s talk about the challenges
9. Organization
● Political correctness is the most important thing
○ I don’t like it, but it is true
● Example
○ Developer Team Versus Operation Team
○ The boss hates the agent on the servers
○ Lazy
10. Budget
● Financial Policy
○ Purchasing or Finance Department
● Example
○ The fiscal year ends tomorrow and you
ask to start a project now
○ The finance team doesn’t understand what you
are doing, so it rejects your proposal
11. Technology Stack
● Technology Stack = Technical Debt
We always care about the business logic
but forget the logs
● Teams never use the logging system
● Legacy systems cannot be integrated with the
logging system
12. Operation & Development
● Operation Complexity
○ The more components you maintain, the
more operational tasks you have to do
● Development Complexity
○ If the logging system lacks a feature,
you need to develop it yourself
20. Where to Store Logs
● Hot
○ Streaming System
● Warm
○ Logging Server, CloudWatch Logs
● Cold
○ S3, HDFS...
21. How to Store
● Partition
● Compression
○ Gzip, Snappy…
● Storage Format
○ Txt, Parquet, Avro...
● Aggregation, Rotation and Archive Time
○ Daily, Hourly, Monthly...
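The three choices above (partition, compression, rotation interval) can be sketched together. This is a minimal sketch using only the Python standard library, assuming hourly rotation and Gzip; the file name `app.log.gz` and the helper names are hypothetical, not part of the talk.

```python
import gzip
from datetime import datetime
from pathlib import Path


def partition_path(root: Path, ts: datetime) -> Path:
    """Build an hourly, Hive-style partition directory for a timestamp."""
    return root / f"year={ts:%Y}" / f"month={ts:%m}" / f"day={ts:%d}" / f"hour={ts:%H}"


def write_compressed(root: Path, ts: datetime, lines: list[str]) -> Path:
    """Gzip one batch of log lines into the partition that matches ts."""
    target_dir = partition_path(root, ts)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "app.log.gz"  # hypothetical file name
    with gzip.open(target, "wt", encoding="utf-8") as fh:
        fh.write("\n".join(lines) + "\n")
    return target
```

Calling `write_compressed(Path("logs"), datetime(2019, 7, 24, 13), ["line 1", "line 2"])` would place the batch under `logs/year=2019/month=07/day=24/hour=13/`. Swapping Gzip for Snappy, or text for Parquet or Avro, would need third-party libraries and is left out here.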
22. How to Analyze Logs
● Human Eyes
○ Notepad++ is enough
● Program
○ GoAccess, Python (Pandas), R (Dplyr)...
● Database
○ MySQL, Elasticsearch...
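For the "Program" option, even a few lines of code go a long way. The sketch below counts HTTP status codes with the standard library only; the log layout (status code as the last whitespace-separated field) and the sample lines are assumptions made up for illustration. The same aggregation could be done in Pandas or Dplyr.

```python
from collections import Counter


def count_status(lines):
    """Count status codes, assuming the status is the last field of each line."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            counts[fields[-1]] += 1
    return counts


# Hypothetical sample records
sample = [
    "2019-07-24T00:00:01Z GET /index.html 200",
    "2019-07-24T00:00:02Z GET /missing 404",
    "2019-07-24T00:00:03Z GET /index.html 200",
]
print(count_status(sample).most_common())
```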
46. Practice
● My Tool Box
○ Human Eyes (Notepad++ is awesome!)
○ Athena (I love SQL!)
○ Python (Pandas), R (Dplyr)
○ Elasticsearch + Logstash + Kibana + Beats
○ EMR or Glue
53. WAF Case (Real-Time)
● 1 record ≒ 1.5 KB (uncompressed)
● WAF delivers log to Kinesis Firehose
● 100 million requests = 100 million records
● Do the math:
1.5 KB * 100,000,000 ≒ 143.05 GB
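The estimate can be reproduced in a few lines. This sketch assumes 100 million records, as stated on the slide, and 1 GB = 1024 × 1024 KB, which is the convention that yields 143.05; the function name is hypothetical.

```python
def total_size_gb(record_kb: float, records: int) -> float:
    """Total log volume in GB, taking 1 GB = 1024 * 1024 KB."""
    return record_kb * records / 1024 ** 2


print(round(total_size_gb(1.5, 100_000_000), 2))  # 143.05
```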
55. WAF Case (Real-Time)
● The blog solution is good!
● However… something is not right
● Write heavy!
○ Assume Worst Case: C10K
● HeadShot!
○ Elasticsearch Service
Recover soon~
https://www.slideshare.net/AmazonWebServices/elasticsearch-5-in-amazon-elasticsearch-service
56. CloudFront Case (Batch)
● 1 record ≒ 0.5 KB (uncompressed)
● CloudFront delivers log to S3 in Gzip format
● 100 million requests = 100 million records
● Compression Savings ≒ 70%
● Do the math:
0.5 KB * 100,000,000 * (1 - 70%) ≒ 14 GB
● This is a daily estimate...
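The 14 GB figure only works out if the 70% is read as the share of space saved by Gzip, so 30% of the raw size remains. A minimal sketch under that assumption, again with 1 GB = 1024 × 1024 KB and a hypothetical function name:

```python
def compressed_size_gb(record_kb: float, records: int, savings: float) -> float:
    """Stored volume in GB after compression removes `savings` of the raw size."""
    raw_kb = record_kb * records
    return raw_kb * (1 - savings) / 1024 ** 2


print(round(compressed_size_gb(0.5, 100_000_000, 0.70), 1))  # 14.3
```

Without compression the same 100 million records would take about 47.7 GB.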
58. CloudFront Case (Batch)
● Before December 2018, you had to implement the
best practices by hand
○ Partition (year=2019/month=07/day=24)
○ Compression (Snappy)
○ Columnar Format (Parquet)
● Now you have a CloudFormation solution!
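Partitioning pays off only when queries filter on the partition keys. The sketch below builds an Athena-style SQL string that restricts the scan to one daily partition. The keys `year`, `month` and `day` follow the layout on this slide; the table name `cloudfront_logs`, the `status` column and string-typed partition keys are assumptions for illustration.

```python
from datetime import date


def partition_predicate(day: date) -> str:
    """WHERE-clause fragment that limits a scan to one daily partition."""
    return f"year = '{day:%Y}' AND month = '{day:%m}' AND day = '{day:%d}'"


def daily_status_query(table: str, day: date) -> str:
    """SQL text counting requests per status code for a single day."""
    return (
        f"SELECT status, COUNT(*) AS requests\n"
        f"FROM {table}\n"
        f"WHERE {partition_predicate(day)}\n"
        f"GROUP BY status"
    )


# Hypothetical table name
print(daily_status_query("cloudfront_logs", date(2019, 7, 24)))
```

The string is only assembled here; how it is submitted (console, SDK, JDBC) is outside this sketch.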
59. CloudFront Case (Batch)
Analyze your Amazon CloudFront Access Logs at Scale with Amazon Athena
https://aws.amazon.com/tw/blogs/big-data/analyze-your-amazon-cloudfront-access-logs-at-scale/
62. Athena Benchmark CSV vs Parquet
https://aws.amazon.com/tw/blogs/big-data/analyzing-data-in-s3-using-amazon-athena/
63. CloudFront Case (Batch)
● Benefits
○ Using Partition, Compression and Parquet is
fast and cheap
○ Data scanned drops from 14 GB to 40 MB
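To put the last number in perspective, here is a small sketch of the reduction factor, assuming 1 GB = 1024 MB; the function name is hypothetical.

```python
def scan_reduction(before_gb: float, after_mb: float) -> float:
    """How many times less data is scanned, taking 1 GB = 1024 MB."""
    return before_gb * 1024 / after_mb


print(scan_reduction(14, 40))  # 358.4
```

Because Athena bills by the amount of data scanned, the query cost shrinks by roughly the same factor.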