SlideShare ist ein Scribd-Unternehmen logo
1 von 27
Downloaden Sie, um offline zu lesen
Using Apache Spark and MySQL
for Data Analysis
Alexander Rubin, Sveta Smirnova
Percona
February, 4, 2017
www.percona.com
Agenda
• Why Spark?
• Spark Examples
– Wikistats analysis with Spark
www.percona.com
Data /
SQL / Protocol
SQL/
App
What is Spark anyway?
Nodes
Parallel Compute only
Local
FS
?
www.percona.com
• In memory processing with caching
• Massively Parallel
• Direct access to data sources (i.e.MySQL)
>>> df = sqlContext.load(source="jdbc",
url="jdbc:mysql://localhost?user=root",
dbtable="ontime.ontime_sm”)
• Can store data in Hadoop HDFS / S3 /
local Filesystem
• Native Python and R integration
Why Spark?
www.percona.com
Spark vs MySQL
www.percona.com
Spark vs. MySQL for BigData
Indexes
Partitioning
“Sharding”
Full table scan
Partitioning
Map/Reduce
www.percona.com
Spark (vs. MySQL)
• No indexes
• All processing is full scan
• BUT: distributed and parallel
• No transactions
• High latency (usually)
MySQL:
1 query = 1 CPU core
www.percona.com
Indexes (BTree) for Big Data
challenge
• Creating an index for Petabytes of data?
• Updating an index for Petabytes of data?
• Reading a terabyte index?
• Random read of Petabyte?
Full scan in parallel is better for big data
www.percona.com
ETL / Pipeline
1. Extract data from
external source
2. Transform before
loading
3. Load data into
MySQL
1. Extract data from
external source
2. Load data or rsync to
all spark nodes
3. Transform
data/Analyze
data/Visualize data;
Parallelism
www.percona.com
Schema on Read
Schema on Write
• Load data infile will
verify the input (validate)
• … indirect data
conversion
• ... or fail if number of
cols is wrong
Schema on Read
• No “load data” per se,
nothing to validate here
• … Create external table or
read csv
• ... will validate on “read”/
select
www.percona.com
Example:
Loading wikistat into MySQL
1. Extract data
from external
source and
uncompress!
2. Load data into
MySQL and
Transform
Wikipedia page counts –
download, >10TB
load data local infile '$file'
into table wikistats.wikistats_full
CHARACTER SET latin1
FIELDS TERMINATED BY ' '
(project_name, title, num_requests,
content_size)
set request_date =
STR_TO_DATE('$datestr',
'%Y%m%d %H%i%S'),
title_md5=unhex(md5(title));
http://dumps.wikimedia.org/other/pagecounts-raw/
www.percona.com
Load timing per hour of wikistat
• InnoDB: 52.34 sec
• MyISAM: 11.08 sec (+ indexes)
• 1 hour of wikistats =1 minute
• 1 year will load in 6 days
– (8765.81 hours in 1 year)
• 6 year = > 1 month to load
Not even counting
the insert time
degradation…
www.percona.com
Loading wikistat as is into
Spark
• Just copy files to storage (AWS S3 / local /
etc)…
– And create SQL structure
• Or read csv, aggregate/filter in Spark and
– load the aggregated data into MySQL
www.percona.com
Loading wikistat as is into
Spark
• How fast to search?
– Depends upon the number of nodes
• 1000 nodes spark cluster
– 4.5 TB, 104 Billion records
– Exec time: 45 sec
– Scanning 4.5TB of data
• http://spark-summit.org/wp-content/uploads/2014/07/Building-
1000-node-Spark-Cluster-on-EMR.pdf
www.percona.com
Pipelines: MySQL vs Spark
www.percona.com
Spark and WikiStats: load pipeline
Row(project=p[0],
url=urllib.unquote(p[1]).lower(),
num_requests=int(p[2]),
content_size=int(p[3])))
www.percona.com
Save results to MySQL
group_res = sqlContext.sql(
"SELECT '"+ mydate + "' as mydate,
url,
count(*) as cnt,
sum(num_requests) as tot_visits
FROM wikistats
GROUP BY url")
# Save to MySQL
mysql_url="jdbc:mysql://localhost?user=wikistats&password=
wikistats”
group_res.write.jdbc(url=mysql_url,
table="wikistats.wikistats_by_day_spark",
mode="append")
www.percona.com
Multi-Threaded Inserts
www.percona.com
PySpark: CPU
Cpu0 : 94.4%us, 0.0%sy, 0.0%ni, 5.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 5.7%us, 0.0%sy, 0.0%ni, 92.4%id, 0.0%wa, 0.0%hi, 1.9%si, 0.0%st
Cpu2 : 95.0%us, 0.0%sy, 0.0%ni, 5.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu3 : 94.9%us, 0.0%sy, 0.0%ni, 5.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu4 : 0.6%us, 0.0%sy, 0.0%ni, 99.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu5 : 94.3%us, 0.0%sy, 0.0%ni, 5.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu6 : 94.3%us, 0.0%sy, 0.0%ni, 5.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu7 : 95.0%us, 0.0%sy, 0.0%ni, 5.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu8 : 94.4%us, 0.0%sy, 0.0%ni, 5.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
...
Cpu17 : 94.3%us, 0.0%sy, 0.0%ni, 5.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu18 : 94.3%us, 0.0%sy, 0.0%ni, 5.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu19 : 94.9%us, 0.0%sy, 0.0%ni, 5.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu20 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu21 : 94.9%us, 0.0%sy, 0.0%ni, 5.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu22 : 94.9%us, 0.0%sy, 0.0%ni, 5.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu23 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 49454372k total, 40479496k used, 8974876k free, 357360k buffers
www.percona.com
Monitoring your jobs
www.percona.com
www.percona.com
mysql> SELECT lower(url) as lurl, sum(tot_visits) as max_visits , count(*) FROM
wikistats_by_day_spark where lower(url) not like '%special%' and lower(url) not like
'%page%' and lower(url) not like '%test%' and lower(url) not like '%wiki%' group by
lower(url) order by max_visits desc limit 10;
+--------------------------------------------------------+------------+----------+
| lurl | max_visits | count(*) |
+--------------------------------------------------------+------------+----------+
| heath_ledger | 4247338 | 131 |
| cloverfield | 3846404 | 131 |
| barack_obama | 2238406 | 153 |
| 1925_in_baseball#negro_league_baseball_final_standings | 1791341 | 11 |
| the_dark_knight_(film) | 1417186 | 64 |
| martin_luther_king,_jr. | 1394934 | 136 |
| deaths_in_2008 | 1372510 | 67 |
| united_states | 1357253 | 167 |
| scientology | 1349654 | 108 |
| portal:current_events | 1261538 | 125 |
+--------------------------------------------------------+------------+----------+
10 rows in set (1 hour 22 min 10.02 sec)
Search the WikiStats in MySQL
10 most frequently queried wiki pages in January 2008
www.percona.com
Search the WikiStats in SparkSQL
spark-sql> CREATE TEMPORARY TABLE wikistats_parquet
USING org.apache.spark.sql.parquet
OPTIONS (
path "/ssd/wikistats_parquet_bydate"
);
Time taken: 3.466 seconds
spark-sql> SELECT lower(url) as lurl, sum(tot_visits) as max_visits , count(*) FROM
wikistats_parquet where lower(url) not like '%special%' and lower(url) not like '%page%'
and lower(url) not like '%test%' and lower(url) not like '%wiki%' group by lower(url)
order by max_visits desc limit 10;
heath_ledger 4247335 42
cloverfield 3846400 42
barack_obama 2238402 53
1925_in_baseball#negro_league_baseball_final_standings 1791341 11
the_dark_knight_(film) 1417183 36
martin_luther_king,_jr. 1394934 46
deaths_in_2008 1372510 38
united_states 1357251 55
scientology 1349650 44
portal:current_events 1261305 44
Time taken: 1239.014 seconds, Fetched 10 row(s)
10 most frequently queried wiki pages in January 2008
20 min
www.percona.com
Apache Drill
Treat any datasource
as a table (even it is
not)
Querying MongoDB
with SQL
www.percona.com
Magic?
!=
www.percona.com
Recap…
1. Search full dataset
• May be pre-filtered
• Not aggregated
2. No parallelism
3. Based on index?
4. InnoDB<> Columnar
5. Partitioning?
1. Dataset is already
– Filtered (only site=“en”)
– Aggregated (group by url)
2. Parallelism (+)
3. Not Based on index
4. Columnar (+)
5. Partitioning (+)
www.percona.com
Thank you!
https://www.linkedin.com/in/alexanderrubin
Alexander Rubin

Weitere ähnliche Inhalte

Was ist angesagt?

Chef for DevOps - an Introduction
Chef for DevOps - an IntroductionChef for DevOps - an Introduction
Chef for DevOps - an IntroductionSanjeev Sharma
 
Migrating Data and Databases to Azure
Migrating Data and Databases to AzureMigrating Data and Databases to Azure
Migrating Data and Databases to AzureKaren Lopez
 
MySQL GTID 시작하기
MySQL GTID 시작하기MySQL GTID 시작하기
MySQL GTID 시작하기I Goo Lee
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversScyllaDB
 
Mastering PostgreSQL Administration
Mastering PostgreSQL AdministrationMastering PostgreSQL Administration
Mastering PostgreSQL AdministrationEDB
 
Amazon Aurora 성능 향상 및 마이그레이션 모범 사례 - AWS Summit Seoul 2017
Amazon Aurora 성능 향상 및 마이그레이션 모범 사례 - AWS Summit Seoul 2017Amazon Aurora 성능 향상 및 마이그레이션 모범 사례 - AWS Summit Seoul 2017
Amazon Aurora 성능 향상 및 마이그레이션 모범 사례 - AWS Summit Seoul 2017Amazon Web Services Korea
 
autonomous-database-100.pdf
autonomous-database-100.pdfautonomous-database-100.pdf
autonomous-database-100.pdfTrLuNguyn
 
Azure Data Explorer deep dive - review 04.2020
Azure Data Explorer deep dive - review 04.2020Azure Data Explorer deep dive - review 04.2020
Azure Data Explorer deep dive - review 04.2020Riccardo Zamana
 
Terraform을 기반한 AWS 기반 대규모 마이크로서비스 인프라 운영 노하우 - 이용욱, 삼성전자 :: AWS Summit Seoul ...
Terraform을 기반한 AWS 기반 대규모 마이크로서비스 인프라 운영 노하우 - 이용욱, 삼성전자 :: AWS Summit Seoul ...Terraform을 기반한 AWS 기반 대규모 마이크로서비스 인프라 운영 노하우 - 이용욱, 삼성전자 :: AWS Summit Seoul ...
Terraform을 기반한 AWS 기반 대규모 마이크로서비스 인프라 운영 노하우 - 이용욱, 삼성전자 :: AWS Summit Seoul ...Amazon Web Services Korea
 
MySQL Enterprise Edition
MySQL Enterprise EditionMySQL Enterprise Edition
MySQL Enterprise EditionMySQL Brasil
 
ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!Guido Schmutz
 
Practical Elasticsearch - real world use cases
Practical Elasticsearch - real world use casesPractical Elasticsearch - real world use cases
Practical Elasticsearch - real world use casesItamar
 
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Databricks
 
202201 AWS Black Belt Online Seminar Apache Spark Performnace Tuning for AWS ...
202201 AWS Black Belt Online Seminar Apache Spark Performnace Tuning for AWS ...202201 AWS Black Belt Online Seminar Apache Spark Performnace Tuning for AWS ...
202201 AWS Black Belt Online Seminar Apache Spark Performnace Tuning for AWS ...Amazon Web Services Japan
 
Sonatype nexus 로 docker registry 관리하기
Sonatype nexus 로 docker registry 관리하기Sonatype nexus 로 docker registry 관리하기
Sonatype nexus 로 docker registry 관리하기KwangSeob Jeong
 
Stream Processing using Apache Flink in Zalando's World of Microservices - Re...
Stream Processing using Apache Flink in Zalando's World of Microservices - Re...Stream Processing using Apache Flink in Zalando's World of Microservices - Re...
Stream Processing using Apache Flink in Zalando's World of Microservices - Re...Zalando Technology
 
From Postgres to Event-Driven: using docker-compose to build CDC pipelines in...
From Postgres to Event-Driven: using docker-compose to build CDC pipelines in...From Postgres to Event-Driven: using docker-compose to build CDC pipelines in...
From Postgres to Event-Driven: using docker-compose to build CDC pipelines in...confluent
 
Iceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsIceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsAlluxio, Inc.
 

Was ist angesagt? (20)

Chef for DevOps - an Introduction
Chef for DevOps - an IntroductionChef for DevOps - an Introduction
Chef for DevOps - an Introduction
 
Migrating Data and Databases to Azure
Migrating Data and Databases to AzureMigrating Data and Databases to Azure
Migrating Data and Databases to Azure
 
MySQL GTID 시작하기
MySQL GTID 시작하기MySQL GTID 시작하기
MySQL GTID 시작하기
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
 
Mastering PostgreSQL Administration
Mastering PostgreSQL AdministrationMastering PostgreSQL Administration
Mastering PostgreSQL Administration
 
Amazon Aurora 성능 향상 및 마이그레이션 모범 사례 - AWS Summit Seoul 2017
Amazon Aurora 성능 향상 및 마이그레이션 모범 사례 - AWS Summit Seoul 2017Amazon Aurora 성능 향상 및 마이그레이션 모범 사례 - AWS Summit Seoul 2017
Amazon Aurora 성능 향상 및 마이그레이션 모범 사례 - AWS Summit Seoul 2017
 
autonomous-database-100.pdf
autonomous-database-100.pdfautonomous-database-100.pdf
autonomous-database-100.pdf
 
Azure Data Explorer deep dive - review 04.2020
Azure Data Explorer deep dive - review 04.2020Azure Data Explorer deep dive - review 04.2020
Azure Data Explorer deep dive - review 04.2020
 
Terraform을 기반한 AWS 기반 대규모 마이크로서비스 인프라 운영 노하우 - 이용욱, 삼성전자 :: AWS Summit Seoul ...
Terraform을 기반한 AWS 기반 대규모 마이크로서비스 인프라 운영 노하우 - 이용욱, 삼성전자 :: AWS Summit Seoul ...Terraform을 기반한 AWS 기반 대규모 마이크로서비스 인프라 운영 노하우 - 이용욱, 삼성전자 :: AWS Summit Seoul ...
Terraform을 기반한 AWS 기반 대규모 마이크로서비스 인프라 운영 노하우 - 이용욱, 삼성전자 :: AWS Summit Seoul ...
 
MySQL Enterprise Edition
MySQL Enterprise EditionMySQL Enterprise Edition
MySQL Enterprise Edition
 
ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!
 
Practical Elasticsearch - real world use cases
Practical Elasticsearch - real world use casesPractical Elasticsearch - real world use cases
Practical Elasticsearch - real world use cases
 
[AWS Builders] Effective AWS Glue
[AWS Builders] Effective AWS Glue[AWS Builders] Effective AWS Glue
[AWS Builders] Effective AWS Glue
 
Jenkins-CI
Jenkins-CIJenkins-CI
Jenkins-CI
 
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
 
202201 AWS Black Belt Online Seminar Apache Spark Performnace Tuning for AWS ...
202201 AWS Black Belt Online Seminar Apache Spark Performnace Tuning for AWS ...202201 AWS Black Belt Online Seminar Apache Spark Performnace Tuning for AWS ...
202201 AWS Black Belt Online Seminar Apache Spark Performnace Tuning for AWS ...
 
Sonatype nexus 로 docker registry 관리하기
Sonatype nexus 로 docker registry 관리하기Sonatype nexus 로 docker registry 관리하기
Sonatype nexus 로 docker registry 관리하기
 
Stream Processing using Apache Flink in Zalando's World of Microservices - Re...
Stream Processing using Apache Flink in Zalando's World of Microservices - Re...Stream Processing using Apache Flink in Zalando's World of Microservices - Re...
Stream Processing using Apache Flink in Zalando's World of Microservices - Re...
 
From Postgres to Event-Driven: using docker-compose to build CDC pipelines in...
From Postgres to Event-Driven: using docker-compose to build CDC pipelines in...From Postgres to Event-Driven: using docker-compose to build CDC pipelines in...
From Postgres to Event-Driven: using docker-compose to build CDC pipelines in...
 
Iceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsIceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data Analytics
 

Andere mochten auch

Эффективная отладка репликации MySQL
Эффективная отладка репликации MySQLЭффективная отладка репликации MySQL
Эффективная отладка репликации MySQLSveta Smirnova
 
MySQL High-Availability and Scale-Out architectures
MySQL High-Availability and Scale-Out architecturesMySQL High-Availability and Scale-Out architectures
MySQL High-Availability and Scale-Out architecturesFromDual GmbH
 
Galera cluster for high availability
Galera cluster for high availability Galera cluster for high availability
Galera cluster for high availability Mydbops
 
What you wanted to know about MySQL, but could not find using inernal instrum...
What you wanted to know about MySQL, but could not find using inernal instrum...What you wanted to know about MySQL, but could not find using inernal instrum...
What you wanted to know about MySQL, but could not find using inernal instrum...Sveta Smirnova
 
Hbase源码初探
Hbase源码初探Hbase源码初探
Hbase源码初探zhaolinjnu
 
MySQL High Availability Deep Dive
MySQL High Availability Deep DiveMySQL High Availability Deep Dive
MySQL High Availability Deep Divehastexo
 
2010丹臣的思考
2010丹臣的思考2010丹臣的思考
2010丹臣的思考zhaolinjnu
 
MySQL InnoDB 源码实现分析(一)
MySQL InnoDB 源码实现分析(一)MySQL InnoDB 源码实现分析(一)
MySQL InnoDB 源码实现分析(一)frogd
 
MySQL High Availability Solutions
MySQL High Availability SolutionsMySQL High Availability Solutions
MySQL High Availability SolutionsLenz Grimmer
 
Why MySQL High Availability Matters
Why MySQL High Availability MattersWhy MySQL High Availability Matters
Why MySQL High Availability MattersMatt Lord
 
The nightmare of locking, blocking and isolation levels!
The nightmare of locking, blocking and isolation levels!The nightmare of locking, blocking and isolation levels!
The nightmare of locking, blocking and isolation levels!Boris Hristov
 
Advanced Percona XtraDB Cluster in a nutshell... la suite
Advanced Percona XtraDB Cluster in a nutshell... la suiteAdvanced Percona XtraDB Cluster in a nutshell... la suite
Advanced Percona XtraDB Cluster in a nutshell... la suiteKenny Gryp
 
Lessons Learned: Troubleshooting Replication
Lessons Learned: Troubleshooting ReplicationLessons Learned: Troubleshooting Replication
Lessons Learned: Troubleshooting ReplicationSveta Smirnova
 
Advanced mysql replication techniques
Advanced mysql replication techniquesAdvanced mysql replication techniques
Advanced mysql replication techniquesGiuseppe Maxia
 

Andere mochten auch (20)

Эффективная отладка репликации MySQL
Эффективная отладка репликации MySQLЭффективная отладка репликации MySQL
Эффективная отладка репликации MySQL
 
MySQL High-Availability and Scale-Out architectures
MySQL High-Availability and Scale-Out architecturesMySQL High-Availability and Scale-Out architectures
MySQL High-Availability and Scale-Out architectures
 
Galera cluster for high availability
Galera cluster for high availability Galera cluster for high availability
Galera cluster for high availability
 
What you wanted to know about MySQL, but could not find using inernal instrum...
What you wanted to know about MySQL, but could not find using inernal instrum...What you wanted to know about MySQL, but could not find using inernal instrum...
What you wanted to know about MySQL, but could not find using inernal instrum...
 
SQL Outer Joins for Fun and Profit
SQL Outer Joins for Fun and ProfitSQL Outer Joins for Fun and Profit
SQL Outer Joins for Fun and Profit
 
Hbase源码初探
Hbase源码初探Hbase源码初探
Hbase源码初探
 
MySQL High Availability Deep Dive
MySQL High Availability Deep DiveMySQL High Availability Deep Dive
MySQL High Availability Deep Dive
 
2010丹臣的思考
2010丹臣的思考2010丹臣的思考
2010丹臣的思考
 
Requirements the Last Bottleneck
Requirements the Last BottleneckRequirements the Last Bottleneck
Requirements the Last Bottleneck
 
MySQL InnoDB 源码实现分析(一)
MySQL InnoDB 源码实现分析(一)MySQL InnoDB 源码实现分析(一)
MySQL InnoDB 源码实现分析(一)
 
Extensible Data Modeling
Extensible Data ModelingExtensible Data Modeling
Extensible Data Modeling
 
MySQL High Availability Solutions
MySQL High Availability SolutionsMySQL High Availability Solutions
MySQL High Availability Solutions
 
Why MySQL High Availability Matters
Why MySQL High Availability MattersWhy MySQL High Availability Matters
Why MySQL High Availability Matters
 
Mysql For Developers
Mysql For DevelopersMysql For Developers
Mysql For Developers
 
Redis介绍
Redis介绍Redis介绍
Redis介绍
 
The nightmare of locking, blocking and isolation levels!
The nightmare of locking, blocking and isolation levels!The nightmare of locking, blocking and isolation levels!
The nightmare of locking, blocking and isolation levels!
 
Advanced Percona XtraDB Cluster in a nutshell... la suite
Advanced Percona XtraDB Cluster in a nutshell... la suiteAdvanced Percona XtraDB Cluster in a nutshell... la suite
Advanced Percona XtraDB Cluster in a nutshell... la suite
 
Lessons Learned: Troubleshooting Replication
Lessons Learned: Troubleshooting ReplicationLessons Learned: Troubleshooting Replication
Lessons Learned: Troubleshooting Replication
 
Explain
ExplainExplain
Explain
 
Advanced mysql replication techniques
Advanced mysql replication techniquesAdvanced mysql replication techniques
Advanced mysql replication techniques
 

Ähnlich wie Using Apache Spark and MySQL for Data Analysis

Fact-Based Monitoring
Fact-Based MonitoringFact-Based Monitoring
Fact-Based MonitoringDatadog
 
Fact based monitoring
Fact based monitoringFact based monitoring
Fact based monitoringDatadog
 
Cassandra
CassandraCassandra
Cassandraexsuns
 
Real time analytics using Hadoop and Elasticsearch
Real time analytics using Hadoop and ElasticsearchReal time analytics using Hadoop and Elasticsearch
Real time analytics using Hadoop and ElasticsearchAbhishek Andhavarapu
 
ClickHouse 2018. How to stop waiting for your queries to complete and start ...
ClickHouse 2018.  How to stop waiting for your queries to complete and start ...ClickHouse 2018.  How to stop waiting for your queries to complete and start ...
ClickHouse 2018. How to stop waiting for your queries to complete and start ...Altinity Ltd
 
ClickHouse Introduction by Alexander Zaitsev, Altinity CTO
ClickHouse Introduction by Alexander Zaitsev, Altinity CTOClickHouse Introduction by Alexander Zaitsev, Altinity CTO
ClickHouse Introduction by Alexander Zaitsev, Altinity CTOAltinity Ltd
 
Don’t Forget About Your Past—Optimizing Apache Druid Performance With Neil Bu...
Don’t Forget About Your Past—Optimizing Apache Druid Performance With Neil Bu...Don’t Forget About Your Past—Optimizing Apache Druid Performance With Neil Bu...
Don’t Forget About Your Past—Optimizing Apache Druid Performance With Neil Bu...HostedbyConfluent
 
Developing on SQL Azure
Developing on SQL AzureDeveloping on SQL Azure
Developing on SQL AzureIke Ellis
 
Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RACPerformance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RACKristofferson A
 
Managing Your Content with Elasticsearch
Managing Your Content with ElasticsearchManaging Your Content with Elasticsearch
Managing Your Content with ElasticsearchSamantha Quiñones
 
Benchmarking Solr Performance at Scale
Benchmarking Solr Performance at ScaleBenchmarking Solr Performance at Scale
Benchmarking Solr Performance at Scalethelabdude
 
Spark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike PercySpark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike PercySpark Summit
 
Amazon Athena (April 2017)
Amazon Athena (April 2017)Amazon Athena (April 2017)
Amazon Athena (April 2017)Julien SIMON
 
Keeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLKeeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLDatabricks
 
backgroundcommunicationandwaitevents-180124221026.pdf
backgroundcommunicationandwaitevents-180124221026.pdfbackgroundcommunicationandwaitevents-180124221026.pdf
backgroundcommunicationandwaitevents-180124221026.pdfssuser785ce21
 
ElasticSearch AJUG 2013
ElasticSearch AJUG 2013ElasticSearch AJUG 2013
ElasticSearch AJUG 2013Roy Russo
 
MYSQL Query Anti-Patterns That Can Be Moved to Sphinx
MYSQL Query Anti-Patterns That Can Be Moved to SphinxMYSQL Query Anti-Patterns That Can Be Moved to Sphinx
MYSQL Query Anti-Patterns That Can Be Moved to SphinxPythian
 
Owning time series with team apache Strata San Jose 2015
Owning time series with team apache   Strata San Jose 2015Owning time series with team apache   Strata San Jose 2015
Owning time series with team apache Strata San Jose 2015Patrick McFadin
 

Ähnlich wie Using Apache Spark and MySQL for Data Analysis (20)

Fact-Based Monitoring
Fact-Based MonitoringFact-Based Monitoring
Fact-Based Monitoring
 
Fact based monitoring
Fact based monitoringFact based monitoring
Fact based monitoring
 
Apache Spark v3.0.0
Apache Spark v3.0.0Apache Spark v3.0.0
Apache Spark v3.0.0
 
Cassandra
CassandraCassandra
Cassandra
 
Real time analytics using Hadoop and Elasticsearch
Real time analytics using Hadoop and ElasticsearchReal time analytics using Hadoop and Elasticsearch
Real time analytics using Hadoop and Elasticsearch
 
ClickHouse 2018. How to stop waiting for your queries to complete and start ...
ClickHouse 2018.  How to stop waiting for your queries to complete and start ...ClickHouse 2018.  How to stop waiting for your queries to complete and start ...
ClickHouse 2018. How to stop waiting for your queries to complete and start ...
 
ClickHouse Introduction by Alexander Zaitsev, Altinity CTO
ClickHouse Introduction by Alexander Zaitsev, Altinity CTOClickHouse Introduction by Alexander Zaitsev, Altinity CTO
ClickHouse Introduction by Alexander Zaitsev, Altinity CTO
 
Don’t Forget About Your Past—Optimizing Apache Druid Performance With Neil Bu...
Don’t Forget About Your Past—Optimizing Apache Druid Performance With Neil Bu...Don’t Forget About Your Past—Optimizing Apache Druid Performance With Neil Bu...
Don’t Forget About Your Past—Optimizing Apache Druid Performance With Neil Bu...
 
Developing on SQL Azure
Developing on SQL AzureDeveloping on SQL Azure
Developing on SQL Azure
 
Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RACPerformance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
 
Managing Your Content with Elasticsearch
Managing Your Content with ElasticsearchManaging Your Content with Elasticsearch
Managing Your Content with Elasticsearch
 
Benchmarking Solr Performance at Scale
Benchmarking Solr Performance at ScaleBenchmarking Solr Performance at Scale
Benchmarking Solr Performance at Scale
 
Spark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike PercySpark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike Percy
 
Amazon Athena (April 2017)
Amazon Athena (April 2017)Amazon Athena (April 2017)
Amazon Athena (April 2017)
 
Keeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLKeeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETL
 
Quick Wins
Quick WinsQuick Wins
Quick Wins
 
backgroundcommunicationandwaitevents-180124221026.pdf
backgroundcommunicationandwaitevents-180124221026.pdfbackgroundcommunicationandwaitevents-180124221026.pdf
backgroundcommunicationandwaitevents-180124221026.pdf
 
ElasticSearch AJUG 2013
ElasticSearch AJUG 2013ElasticSearch AJUG 2013
ElasticSearch AJUG 2013
 
MYSQL Query Anti-Patterns That Can Be Moved to Sphinx
MYSQL Query Anti-Patterns That Can Be Moved to SphinxMYSQL Query Anti-Patterns That Can Be Moved to Sphinx
MYSQL Query Anti-Patterns That Can Be Moved to Sphinx
 
Owning time series with team apache Strata San Jose 2015
Owning time series with team apache   Strata San Jose 2015Owning time series with team apache   Strata San Jose 2015
Owning time series with team apache Strata San Jose 2015
 

Mehr von Sveta Smirnova

MySQL 2024: Зачем переходить на MySQL 8, если в 5.х всё устраивает?
MySQL 2024: Зачем переходить на MySQL 8, если в 5.х всё устраивает?MySQL 2024: Зачем переходить на MySQL 8, если в 5.х всё устраивает?
MySQL 2024: Зачем переходить на MySQL 8, если в 5.х всё устраивает?Sveta Smirnova
 
Database in Kubernetes: Diagnostics and Monitoring
Database in Kubernetes: Diagnostics and MonitoringDatabase in Kubernetes: Diagnostics and Monitoring
Database in Kubernetes: Diagnostics and MonitoringSveta Smirnova
 
MySQL Database Monitoring: Must, Good and Nice to Have
MySQL Database Monitoring: Must, Good and Nice to HaveMySQL Database Monitoring: Must, Good and Nice to Have
MySQL Database Monitoring: Must, Good and Nice to HaveSveta Smirnova
 
MySQL Cookbook: Recipes for Developers
MySQL Cookbook: Recipes for DevelopersMySQL Cookbook: Recipes for Developers
MySQL Cookbook: Recipes for DevelopersSveta Smirnova
 
MySQL Performance for DevOps
MySQL Performance for DevOpsMySQL Performance for DevOps
MySQL Performance for DevOpsSveta Smirnova
 
MySQL Test Framework для поддержки клиентов и верификации багов
MySQL Test Framework для поддержки клиентов и верификации баговMySQL Test Framework для поддержки клиентов и верификации багов
MySQL Test Framework для поддержки клиентов и верификации баговSveta Smirnova
 
MySQL Cookbook: Recipes for Your Business
MySQL Cookbook: Recipes for Your BusinessMySQL Cookbook: Recipes for Your Business
MySQL Cookbook: Recipes for Your BusinessSveta Smirnova
 
Introduction into MySQL Query Tuning for Dev[Op]s
Introduction into MySQL Query Tuning for Dev[Op]sIntroduction into MySQL Query Tuning for Dev[Op]s
Introduction into MySQL Query Tuning for Dev[Op]sSveta Smirnova
 
Производительность MySQL для DevOps
 Производительность MySQL для DevOps Производительность MySQL для DevOps
Производительность MySQL для DevOpsSveta Smirnova
 
MySQL Performance for DevOps
MySQL Performance for DevOpsMySQL Performance for DevOps
MySQL Performance for DevOpsSveta Smirnova
 
How to Avoid Pitfalls in Schema Upgrade with Percona XtraDB Cluster
How to Avoid Pitfalls in Schema Upgrade with Percona XtraDB ClusterHow to Avoid Pitfalls in Schema Upgrade with Percona XtraDB Cluster
How to Avoid Pitfalls in Schema Upgrade with Percona XtraDB ClusterSveta Smirnova
 
How to migrate from MySQL to MariaDB without tears
How to migrate from MySQL to MariaDB without tearsHow to migrate from MySQL to MariaDB without tears
How to migrate from MySQL to MariaDB without tearsSveta Smirnova
 
Modern solutions for modern database load: improvements in the latest MariaDB...
Modern solutions for modern database load: improvements in the latest MariaDB...Modern solutions for modern database load: improvements in the latest MariaDB...
Modern solutions for modern database load: improvements in the latest MariaDB...Sveta Smirnova
 
How Safe is Asynchronous Master-Master Setup?
How Safe is Asynchronous Master-Master Setup?How Safe is Asynchronous Master-Master Setup?
How Safe is Asynchronous Master-Master Setup?Sveta Smirnova
 
Современному хайлоду - современные решения: MySQL 8.0 и улучшения Percona
Современному хайлоду - современные решения: MySQL 8.0 и улучшения PerconaСовременному хайлоду - современные решения: MySQL 8.0 и улучшения Percona
Современному хайлоду - современные решения: MySQL 8.0 и улучшения PerconaSveta Smirnova
 
How to Avoid Pitfalls in Schema Upgrade with Galera
How to Avoid Pitfalls in Schema Upgrade with GaleraHow to Avoid Pitfalls in Schema Upgrade with Galera
How to Avoid Pitfalls in Schema Upgrade with GaleraSveta Smirnova
 
How Safe is Asynchronous Master-Master Setup?
 How Safe is Asynchronous Master-Master Setup? How Safe is Asynchronous Master-Master Setup?
How Safe is Asynchronous Master-Master Setup?Sveta Smirnova
 
Introduction to MySQL Query Tuning for Dev[Op]s
Introduction to MySQL Query Tuning for Dev[Op]sIntroduction to MySQL Query Tuning for Dev[Op]s
Introduction to MySQL Query Tuning for Dev[Op]sSveta Smirnova
 
Billion Goods in Few Categories: How Histograms Save a Life?
Billion Goods in Few Categories: How Histograms Save a Life?Billion Goods in Few Categories: How Histograms Save a Life?
Billion Goods in Few Categories: How Histograms Save a Life?Sveta Smirnova
 
A Billion Goods in a Few Categories: When Optimizer Histograms Help and When ...
A Billion Goods in a Few Categories: When Optimizer Histograms Help and When ...A Billion Goods in a Few Categories: When Optimizer Histograms Help and When ...
A Billion Goods in a Few Categories: When Optimizer Histograms Help and When ...Sveta Smirnova
 

Mehr von Sveta Smirnova (20)

MySQL 2024: Зачем переходить на MySQL 8, если в 5.х всё устраивает?
MySQL 2024: Зачем переходить на MySQL 8, если в 5.х всё устраивает?MySQL 2024: Зачем переходить на MySQL 8, если в 5.х всё устраивает?
MySQL 2024: Зачем переходить на MySQL 8, если в 5.х всё устраивает?
 
Database in Kubernetes: Diagnostics and Monitoring
Database in Kubernetes: Diagnostics and MonitoringDatabase in Kubernetes: Diagnostics and Monitoring
Database in Kubernetes: Diagnostics and Monitoring
 
MySQL Database Monitoring: Must, Good and Nice to Have
MySQL Database Monitoring: Must, Good and Nice to HaveMySQL Database Monitoring: Must, Good and Nice to Have
MySQL Database Monitoring: Must, Good and Nice to Have
 
MySQL Cookbook: Recipes for Developers
MySQL Cookbook: Recipes for DevelopersMySQL Cookbook: Recipes for Developers
MySQL Cookbook: Recipes for Developers
 
MySQL Performance for DevOps
MySQL Performance for DevOpsMySQL Performance for DevOps
MySQL Performance for DevOps
 
MySQL Test Framework для поддержки клиентов и верификации багов
MySQL Test Framework для поддержки клиентов и верификации баговMySQL Test Framework для поддержки клиентов и верификации багов
MySQL Test Framework для поддержки клиентов и верификации багов
 
MySQL Cookbook: Recipes for Your Business
MySQL Cookbook: Recipes for Your BusinessMySQL Cookbook: Recipes for Your Business
MySQL Cookbook: Recipes for Your Business
 
Introduction into MySQL Query Tuning for Dev[Op]s
Introduction into MySQL Query Tuning for Dev[Op]sIntroduction into MySQL Query Tuning for Dev[Op]s
Introduction into MySQL Query Tuning for Dev[Op]s
 
Производительность MySQL для DevOps
 Производительность MySQL для DevOps Производительность MySQL для DevOps
Производительность MySQL для DevOps
 
MySQL Performance for DevOps
MySQL Performance for DevOpsMySQL Performance for DevOps
MySQL Performance for DevOps
 
How to Avoid Pitfalls in Schema Upgrade with Percona XtraDB Cluster
How to Avoid Pitfalls in Schema Upgrade with Percona XtraDB ClusterHow to Avoid Pitfalls in Schema Upgrade with Percona XtraDB Cluster
How to Avoid Pitfalls in Schema Upgrade with Percona XtraDB Cluster
 
How to migrate from MySQL to MariaDB without tears
How to migrate from MySQL to MariaDB without tearsHow to migrate from MySQL to MariaDB without tears
How to migrate from MySQL to MariaDB without tears
 
Modern solutions for modern database load: improvements in the latest MariaDB...
Modern solutions for modern database load: improvements in the latest MariaDB...Modern solutions for modern database load: improvements in the latest MariaDB...
Modern solutions for modern database load: improvements in the latest MariaDB...
 
How Safe is Asynchronous Master-Master Setup?
How Safe is Asynchronous Master-Master Setup?How Safe is Asynchronous Master-Master Setup?
How Safe is Asynchronous Master-Master Setup?
 
Современному хайлоду - современные решения: MySQL 8.0 и улучшения Percona
Современному хайлоду - современные решения: MySQL 8.0 и улучшения PerconaСовременному хайлоду - современные решения: MySQL 8.0 и улучшения Percona
Современному хайлоду - современные решения: MySQL 8.0 и улучшения Percona
 
How to Avoid Pitfalls in Schema Upgrade with Galera
How to Avoid Pitfalls in Schema Upgrade with GaleraHow to Avoid Pitfalls in Schema Upgrade with Galera
How to Avoid Pitfalls in Schema Upgrade with Galera
 
How Safe is Asynchronous Master-Master Setup?
 How Safe is Asynchronous Master-Master Setup? How Safe is Asynchronous Master-Master Setup?
How Safe is Asynchronous Master-Master Setup?
 
Introduction to MySQL Query Tuning for Dev[Op]s
Introduction to MySQL Query Tuning for Dev[Op]sIntroduction to MySQL Query Tuning for Dev[Op]s
Introduction to MySQL Query Tuning for Dev[Op]s
 
Billion Goods in Few Categories: How Histograms Save a Life?
Billion Goods in Few Categories: How Histograms Save a Life?Billion Goods in Few Categories: How Histograms Save a Life?
Billion Goods in Few Categories: How Histograms Save a Life?
 
A Billion Goods in a Few Categories: When Optimizer Histograms Help and When ...
A Billion Goods in a Few Categories: When Optimizer Histograms Help and When ...A Billion Goods in a Few Categories: When Optimizer Histograms Help and When ...
A Billion Goods in a Few Categories: When Optimizer Histograms Help and When ...
 

Kürzlich hochgeladen

Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) SolutionIntroducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) SolutionOnePlan Solutions
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...Health
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrandmasabamasaba
 
%in Lydenburg+277-882-255-28 abortion pills for sale in Lydenburg
%in Lydenburg+277-882-255-28 abortion pills for sale in Lydenburg%in Lydenburg+277-882-255-28 abortion pills for sale in Lydenburg
%in Lydenburg+277-882-255-28 abortion pills for sale in Lydenburgmasabamasaba
 
%in Durban+277-882-255-28 abortion pills for sale in Durban
%in Durban+277-882-255-28 abortion pills for sale in Durban%in Durban+277-882-255-28 abortion pills for sale in Durban
%in Durban+277-882-255-28 abortion pills for sale in Durbanmasabamasaba
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park masabamasaba
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...masabamasaba
 
Define the academic and professional writing..pdf
Define the academic and professional writing..pdfDefine the academic and professional writing..pdf
Define the academic and professional writing..pdfPearlKirahMaeRagusta1
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️Delhi Call girls
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension AidPhilip Schwarz
 
Architecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastArchitecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastPapp Krisztián
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...panagenda
 
Exploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdfExploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdfproinshot.com
 
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyviewmasabamasaba
 
The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is insideshinachiaurasa2
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisamasabamasaba
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnAmarnathKambale
 

Kürzlich hochgeladen (20)

Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) SolutionIntroducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand
 
%in Lydenburg+277-882-255-28 abortion pills for sale in Lydenburg
%in Lydenburg+277-882-255-28 abortion pills for sale in Lydenburg%in Lydenburg+277-882-255-28 abortion pills for sale in Lydenburg
%in Lydenburg+277-882-255-28 abortion pills for sale in Lydenburg
 
%in Durban+277-882-255-28 abortion pills for sale in Durban
%in Durban+277-882-255-28 abortion pills for sale in Durban%in Durban+277-882-255-28 abortion pills for sale in Durban
%in Durban+277-882-255-28 abortion pills for sale in Durban
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
 
Define the academic and professional writing..pdf
Define the academic and professional writing..pdfDefine the academic and professional writing..pdf
Define the academic and professional writing..pdf
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
 
Architecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastArchitecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the past
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
Exploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdfExploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdf
 
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is inside
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learn
 

Using Apache Spark and MySQL for Data Analysis

  • 1. Using Apache Spark and MySQL for Data Analysis Alexander Rubin, Sveta Smirnova Percona February, 4, 2017
  • 2. www.percona.com Agenda • Why Spark? • Spark Examples – Wikistats analysis with Spark
  • 3. www.percona.com Data / SQL / Protocol SQL/ App What is Spark anyway? Nodes Parallel Compute only Local FS ?
  • 4. www.percona.com • In memory processing with caching • Massively Parallel • Direct access to data sources (i.e.MySQL) >>> df = sqlContext.load(source="jdbc", url="jdbc:mysql://localhost?user=root", dbtable="ontime.ontime_sm”) • Can store data in Hadoop HDFS / S3 / local Filesystem • Native Python and R integration Why Spark?
  • 6. www.percona.com Spark vs. MySQL for BigData Indexes Partitioning “Sharding” Full table scan Partitioning Map/Reduce
  • 7. www.percona.com Spark (vs. MySQL) • No indexes • All processing is full scan • BUT: distributed and parallel • No transactions • High latency (usually) MySQL: 1 query = 1 CPU core
  • 8. www.percona.com Indexes (BTree) for Big Data challenge • Creating an index for Petabytes of data? • Updating an index for Petabytes of data? • Reading a terabyte index? • Random read of Petabyte? Full scan in parallel is better for big data
  • 9. www.percona.com ETL / Pipeline 1. Extract data from external source 2. Transform before loading 3. Load data into MySQL 1. Extract data from external source 2. Load data or rsync to all spark nodes 3. Transform data/Analyze data/Visualize data; Parallelism
  • 10. www.percona.com Schema on Read Schema on Write • Load data infile will verify the input (validate) • … indirect data conversion • ... or fail if number of cols is wrong Schema on Read • No “load data” per se, nothing to validate here • … Create external table or read csv • ... will validate on “read”/ select
  • 11. www.percona.com Example: Loading wikistat into MySQL 1. Extract data from external source and uncompress! 2. Load data into MySQL and Transform Wikipedia page counts – download, >10TB load data local infile '$file' into table wikistats.wikistats_full CHARACTER SET latin1 FIELDS TERMINATED BY ' ' (project_name, title, num_requests, content_size) set request_date = STR_TO_DATE('$datestr', '%Y%m%d %H%i%S'), title_md5=unhex(md5(title)); http://dumps.wikimedia.org/other/pagecounts-raw/
  • 12. www.percona.com Load timing per hour of wikistat • InnoDB: 52.34 sec • MyISAM: 11.08 sec (+ indexes) • 1 hour of wikistats =1 minute • 1 year will load in 6 days – (8765.81 hours in 1 year) • 6 year = > 1 month to load Not even counting the insert time degradation…
  • 13. www.percona.com Loading wikistat as is into Spark • Just copy files to storage (AWS S3 / local / etc)… – And create SQL structure • Or read csv, aggregate/filter in Spark and – load the aggregated data into MySQL
  • 14. www.percona.com Loading wikistat as is into Spark • How fast to search? – Depends upon the number of nodes • 1000 nodes spark cluster – 4.5 TB, 104 Billion records – Exec time: 45 sec – Scanning 4.5TB of data • http://spark-summit.org/wp-content/uploads/2014/07/Building- 1000-node-Spark-Cluster-on-EMR.pdf
  • 16. www.percona.com Spark and WikiStats: load pipeline Row(project=p[0], url=urllib.unquote(p[1]).lower(), num_requests=int(p[2]), content_size=int(p[3])))
  • 17. www.percona.com Save results to MySQL group_res = sqlContext.sql( "SELECT '"+ mydate + "' as mydate, url, count(*) as cnt, sum(num_requests) as tot_visits FROM wikistats GROUP BY url") # Save to MySQL mysql_url="jdbc:mysql://localhost?user=wikistats&password= wikistats” group_res.write.jdbc(url=mysql_url, table="wikistats.wikistats_by_day_spark", mode="append")
  • 19. www.percona.com PySpark: CPU Cpu0 : 94.4%us, 0.0%sy, 0.0%ni, 5.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu1 : 5.7%us, 0.0%sy, 0.0%ni, 92.4%id, 0.0%wa, 0.0%hi, 1.9%si, 0.0%st Cpu2 : 95.0%us, 0.0%sy, 0.0%ni, 5.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu3 : 94.9%us, 0.0%sy, 0.0%ni, 5.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu4 : 0.6%us, 0.0%sy, 0.0%ni, 99.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu5 : 94.3%us, 0.0%sy, 0.0%ni, 5.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu6 : 94.3%us, 0.0%sy, 0.0%ni, 5.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu7 : 95.0%us, 0.0%sy, 0.0%ni, 5.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu8 : 94.4%us, 0.0%sy, 0.0%ni, 5.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st ... Cpu17 : 94.3%us, 0.0%sy, 0.0%ni, 5.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu18 : 94.3%us, 0.0%sy, 0.0%ni, 5.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu19 : 94.9%us, 0.0%sy, 0.0%ni, 5.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu20 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu21 : 94.9%us, 0.0%sy, 0.0%ni, 5.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu22 : 94.9%us, 0.0%sy, 0.0%ni, 5.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu23 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Mem: 49454372k total, 40479496k used, 8974876k free, 357360k buffers
  • 22. www.percona.com mysql> SELECT lower(url) as lurl, sum(tot_visits) as max_visits , count(*) FROM wikistats_by_day_spark where lower(url) not like '%special%' and lower(url) not like '%page%' and lower(url) not like '%test%' and lower(url) not like '%wiki%' group by lower(url) order by max_visits desc limit 10; +--------------------------------------------------------+------------+----------+ | lurl | max_visits | count(*) | +--------------------------------------------------------+------------+----------+ | heath_ledger | 4247338 | 131 | | cloverfield | 3846404 | 131 | | barack_obama | 2238406 | 153 | | 1925_in_baseball#negro_league_baseball_final_standings | 1791341 | 11 | | the_dark_knight_(film) | 1417186 | 64 | | martin_luther_king,_jr. | 1394934 | 136 | | deaths_in_2008 | 1372510 | 67 | | united_states | 1357253 | 167 | | scientology | 1349654 | 108 | | portal:current_events | 1261538 | 125 | +--------------------------------------------------------+------------+----------+ 10 rows in set (1 hour 22 min 10.02 sec) Search the WikiStats in MySQL 10 most frequently queried wiki pages in January 2008
  • 23. www.percona.com Search the WikiStats in SparkSQL spark-sql> CREATE TEMPORARY TABLE wikistats_parquet USING org.apache.spark.sql.parquet OPTIONS ( path "/ssd/wikistats_parquet_bydate" ); Time taken: 3.466 seconds spark-sql> SELECT lower(url) as lurl, sum(tot_visits) as max_visits , count(*) FROM wikistats_parquet where lower(url) not like '%special%' and lower(url) not like '%page%' and lower(url) not like '%test%' and lower(url) not like '%wiki%' group by lower(url) order by max_visits desc limit 10; heath_ledger 4247335 42 cloverfield 3846400 42 barack_obama 2238402 53 1925_in_baseball#negro_league_baseball_final_standings 1791341 11 the_dark_knight_(film) 1417183 36 martin_luther_king,_jr. 1394934 46 deaths_in_2008 1372510 38 united_states 1357251 55 scientology 1349650 44 portal:current_events 1261305 44 Time taken: 1239.014 seconds, Fetched 10 row(s) 10 most frequently queried wiki pages in January 2008 20 min
  • 24. www.percona.com Apache Drill Treat any datasource as a table (even it is not) Querying MongoDB with SQL
  • 26. www.percona.com Recap… 1. Search full dataset • May be pre-filtered • Not aggregated 2. No parallelism 3. Based on index? 4. InnoDB<> Columnar 5. Partitioning? 1. Dataset is already – Filtered (only site=“en”) – Aggregated (group by url) 2. Parallelism (+) 3. Not Based on index 4. Columnar (+) 5. Partitioning (+)