SlideShare ist ein Scribd-Unternehmen logo
1 von 23
Downloaden Sie, um offline zu lesen
Š Verizon 2017 All RightsReserved
Informationcontainedhereinis providedASIS andsubject to changewithout notice. All trademarks usedhereinare property of their respective owners.
Real-time analytical query processing and
predictive model building on
high dimensional document datasets with
timestamps
Debasish Das
Data & Artificial Intelligence, Verizon
Contributors
Algorithm: Santanu Das,Zhengming Xing
Platform: PonramaJegan
Frontend: AltaffShaik,Jon Leonhardt
Š Verizon 2017 All RightsReserved
Informationcontainedhereinis providedASIS andsubject to changewithout notice. All trademarks usedhereinare property of their respective owners.
Data Overview
• Location data
• Each srcip defined as unique row key
• Provides approximate location of each srcip
• Timeseries containing latitude, longitude, error bound, duration, timezone for
each srcip
• Clickstream data
• Contains clickstream data of each row key
• Contains startTime, duration, httphost, httpuri, upload/download bytes,
httpmethod
• Compatible with IPFIX/Netflow formats
Š Verizon 2017 All RightsReserved
Informationcontainedhereinis providedASIS andsubject to changewithout notice. All trademarks usedhereinare property of their respective owners.
Marketing Analytics
Lookalike modeling Competitive analysis
• Aggregate Anonymous analysis for insights
• Spark Summit Europe 2016
• Spark Summit East 2017
Demand Prediction
?
?
Location Clustering
Š Verizon 2017 All RightsReserved
Informationcontainedhereinis providedASIS andsubject to changewithout notice. All trademarks usedhereinare property of their respective owners.
Data Model
• Schema:srcip, timestmap, tld, zip, tldvisits, zipvisits
• Dense dimension,dense measure
– Data: 10.1.13.120,d1H2,company1.com,94555,2, 4
• Sparse dimension, dense measure
– Data: 10.1.13.120,d1, {company1.com,company2.com},{94555,
94301}, 10, 15
• Sparse dimension, sparse measure
– Data: 10.1.13.120,d1, {company1.com,company2.com},{94555,
94301}, {company1.com:4,company2.com:6},{94555:8,94301:7}
• Timestamp optional
• Competing technologies:PowerDrill, Druid, LinkedIn Pinot, Essbase
Š Verizon 2017 All RightsReserved
Informationcontainedhereinis providedASIS andsubject to changewithout notice. All trademarks usedhereinare property of their respective owners.
Lucene Document Mapping
• Example
Schema: srcip, timestamp, tld, zip, tldvisits, zipvisits
Data: 10.1.13.120, d1, {company1.com, company2.com}, 94555, 10, 15
Data: 10.1.13.120, d4, {company1.com, company3.com}, 94301, 12, 8
• DataFrame Row to Lucene Document mapping
schema Row Document OLAP
srcip StringType Stored Measure
timestamp TimestampType Stored Dimension
tld ArrayType[StringType] Indexed + Stored Dimension
zip StringType Indexed + Stored Dimension
tld/zipvisits IntegerType Stored Measure
Š Verizon 2017 All RightsReserved
Informationcontainedhereinis providedASIS andsubject to changewithout notice. All trademarks usedhereinare property of their respective owners.
Lucene Storage
• Row storage: Spark Summit Europe 2016
– 2 indirect disk seeks for retrieval
Reference:
http://www.slideshare.net/lucenerevoluti
on/willnauer-simon-doc-values-column-
stride-fields-in-lucene
Š Verizon 2017 All RightsReserved
Informationcontainedhereinis providedASIS andsubject to changewithout notice. All trademarks usedhereinare property of their respective owners.
Lucene Column Store
• Column storage: Spark Summit
East 2017
– References: LUCENE-3108,
LUCENE-2935, LUCENE-2168,
LUCENE-1231
– Cache friendly column retrieval: 1
direct disk seek
– Integer column: Min-Max
encoding
– Numeric column: Uncompressed
– Binary column: Referenced
– Complex Type: Binary + Kryo
Integer Binary
Š Verizon 2017 All RightsReserved
Informationcontainedhereinis providedASIS andsubject to changewithout notice. All trademarks usedhereinare property of their respective owners.
DeviceAnalyzer
• Goals
– srcip/visits as dense measure
– Real-Time queries
• Aggregate
• Group
• Timeseries
– Real-Time Timeseries forecast
Š Verizon 2017 All RightsReserved
Informationcontainedhereinis providedASIS andsubject to changewithout notice. All trademarks usedhereinare property of their respective owners.
Trapezium
DAIS Open Source framework to build batch, streaming and API services
https://github.com/Verizon/trapezium
Š Verizon 2017 All RightsReserved
Informationcontainedhereinis providedASIS andsubject to changewithout notice. All trademarks usedhereinare property of their respective owners.
Trapezium LuceneDAO
• SparkSQL optimized for full scan
– Column indexing not supported
• Fulfills Real-Time requirements for OLAP queries
• Lucene for indexing + storage per executor
• Spark operators for distributed aggregation
– treeAggregate
– mapPartition + treeReduce
• Features
• Build Distributed Lucene Shards from Dataframe
• Access saved shards through LuceneDAO for Analytics + ML pipelines
• Save shards to HDFS for QueryProcessor like SolrCloud
Š Verizon 2017 All RightsReserved
Informationcontainedhereinis providedASIS andsubject to changewithout notice. All trademarks usedhereinare property of their respective owners.
LuceneDAO Indexing
/?ref=1108&?url=http://www.macys.c
om&id=5
www.walmart.com%2Fc%2Fep%2Frange-
hood-filters&sellermemid=459
http%3A%2F%2Fm.macys.com%2Fshop%2F
product%2Fjockey-elance-cotton
/?ref=1108&?url=http://www.macys.c
om&id=5
m.amazon.com%2Fshop%2Fproduct%2Fjo
ckey-elance-cotton
https://www.walmart.com/ip/Women-
Pant-Suit-Roundtree
walmart://ip/?veh=dsn&wmlspartner
m.macys.com%2Fshop%2Fsearch%3Fkeyw
ord%3DDress
ip1, macys.com, 2
ip1, walmart.com, 1
ip1, macys.com: 1
ip2, walmart.com: 1
ip1, amazon.com: 1
ip1, macys.com : 2
ip2, walmart.com: 1
Macys, 0
Walmart, 1
Amazon, 2
srcip tld visits
ip1 [0,1,2] 7
ip2 [1] 2
tld doc
0 [ip1]
1 [ip1, ip2]
2 [ip1]
reverse-index
column-store
measure: [srcip,visits]
dimension: [tld]
Š Verizon 2017 All RightsReserved
Informationcontainedhereinis providedASIS andsubject to changewithout notice. All trademarks usedhereinare property of their respective owners.
LuceneDAO API
import trapezium.dal.lucene._
import org.apache.spark.sql.types._
object DeviceIndexer extend BatchTransaction {
process(dfs: Map[String, DataFrame], batchTime: Time): {
df = dfs(“DeviceStore”)
olapDf= rollup(df)
}
persist(df: DataFrame, batchTime: Time): {
val dimensions = Set(“tld”, “zip”)
val types = Map(“tld” -> LuceneType(true, StringType),
“srcip” -> LuceneType(false, StringType),
“visits” -> LuceneType(false,IntegerType))
val dao = new LuceneDAO(“path”, dimension,types)
dao.index(df,new Time(batchTime))
}
Index Creation
import trapezium.dal.lucene._
import org.apache.spark.sql.types._
Load:
val dimensions = Set(“tld”, “zip”)
val types = Map(“tld” -> LuceneType(true,StringType),
“srcip” -> LuceneType(false,StringType),
“visits” -> LuceneType(false,IntegerType))
val dao = new LuceneDAO(“path”, dimension,types)
dao.load(sc)
Queries:
dao.aggregate(query:String, measure: String, aggregator:String)
dao.group(query:String,dimension:String, measure: String,
aggregator:String)
dao.timeseries(query:String, minTime: Long, maxTime: Long,
rollup:Long, measure: String, aggregator:
String)
dao.search(query:String, columns: Seq[String]): DataFrame
Query Processing
Š Verizon 2017 All RightsReserved
Informationcontainedhereinis providedASIS andsubject to changewithout notice. All trademarks usedhereinare property of their respective owners.
LuceneDAO Internals
• Retrieve documents with/without relevance
• ColumnAccessor over dimension + measures
• Disk / In-Memory ColumnAccessor
• C-store style while loops over dimension
• Spark ML style aggregators
• treeAggregate for distributed aggregation
Š Verizon 2017 All RightsReserved
Informationcontainedhereinis providedASIS andsubject to changewithout notice. All trademarks usedhereinare property of their respective owners.
Aggregation Architecture
Š Verizon 2017 All RightsReserved
Informationcontainedhereinis providedASIS andsubject to changewithout notice. All trademarks usedhereinare property of their respective owners.
Index Generation
• Dataset details:
57M devices, 4.2B docs
• Parquet: 79 GB
• Lucene Reverse Index: 16 GB
• Lucene DocValues: 59.6 GB
• Global Dictionary Size: 5.5 MB
• Executors: 20 Cores: 8
• RAM Driver: 16g Executor: 16g
• Runtime
– Parquet:
• 1831.87 s
– Dictionary:
• 213.7 s
– Index + Stored:
• 360 s
Š Verizon 2017 All RightsReserved
Informationcontainedhereinis providedASIS andsubject to changewithout notice. All trademarks usedhereinare property of their respective owners.
Aggregate Queries
• HashSet aggregation
• SparkSQL
df.select(”srcip”,"tld")
.where(array_contains(df("tld"),
”company1.com"))
.agg(countDistinct(”srcip") as "visits")
.collect()
• LuceneDAO
dao.aggregate("tld:company1.com”,
"srcip", "count")
3.82 6.65 14.25 20.64
0
100
200
300
400
500
600
700
1 5 10 20
Runtime(s)
qps
spark-sql1.6
spark-sql2.0
lucene-dao
Š Verizon 2017 All RightsReserved
Informationcontainedhereinis providedASIS andsubject to changewithout notice. All trademarks usedhereinare property of their respective owners.
Group Queries
• HLL aggregation
• SparkSQL
df.select(”srcip”,"tld”, “zip”)
.where(array_contains(df("tld"),
"company1.com"))
.select(“zip”, “srcip”).groupBy(“zip”)
.agg(approxCountDistinct(”srcip") as
"visits")
.collect()
• LuceneDAO
dao.aggregate("tld:company1.com", "srcip",
"count")
6.52 11.92 12.72 20.29
0
100
200
300
400
500
600
700
800
1 5 10 20
Runtime(s)
qps
spark-sql1.6
spark-sql2.0
lucene-dao
Š Verizon 2017 All RightsReserved
Informationcontainedhereinis providedASIS andsubject to changewithout notice. All trademarks usedhereinare property of their respective owners.
Device Heat-Map
company1.com
Š Verizon 2017 All RightsReserved
Informationcontainedhereinis providedASIS andsubject to changewithout notice. All trademarks usedhereinare property of their respective owners.
Timeseries Queries
• HLL aggregation
• SparkSQL
df.select(“time”,”srcip”,"tld”)
.where(array_contains(df("tld"),
”company1.com"))
.select(“time”, “srcip”).groupBy(“time”)
.agg(approxCountDistinct(”srcip") as "visits")
.collect()
• LuceneDAO
dao.aggregate("tld:company1.com", "srcip",
"count")
1.99 4.59 7.31 13.34
0
100
200
300
400
500
600
700
1 2 3 4
spark-sql1.6
spark-sql2.0
lucene-dao
Š Verizon 2017 All RightsReserved
Informationcontainedhereinis providedASIS andsubject to changewithout notice. All trademarks usedhereinare property of their respective owners.
TimeSeries Forecast
• Given a query:
select
timestamp, (srcip) as deviceCount
where
tld=‘company1.com’AND state=‘CA’
• Predict deviceCount for next
timestamp
• Forecast deviceCount for next N
timestamps
TimeSeriesKNNRegression.predict
Input:
timeseries: Array[Double]
topk: Int
featureDim: Int
normalize: Boolean
multiStep: Int
metric: KernelType=Euclidean
Output:
predicted values: Array[Double]
Trapezium ML
Š Verizon 2017 All RightsReserved
Informationcontainedhereinis providedASIS andsubject to changewithout notice. All trademarks usedhereinare property of their respective owners.
Forecast Service
httpServer = {
provider = "akka"
hostname = "localhost"
port = 19999
contextPath = "/"
endPoints = [{
path = “analyzer-api"
className =
”TimeseriesEndPoint"
}]
}
Powered by Trapezium API
class TimeseriesEndPoint(sc: SparkContext)
extends SparkServiceEndPoint(sc) {
override def route : timeseriesRoute
val types = Map(“tld” -> LuceneType(true, StringType),
“srcip” -> LuceneType(false, StringType),
“visits” -> LuceneType(false, IntegerType))
val dao = new LuceneDAO(“path”, dimension, types)
dao.load(sc)
def timeseriesRoute : {
post { request => {
ts = dao.timeseries(request, minTime, maxTime, rollup,
“srcip”, “count_approx”)
predicted = TimeseriesKNNRegression.predict(ts, topk=5,
featureDim=3, normalize=false, multiStep=5,
metric=Euclidean)
generateResponse(ts, predicted)
}
}
Š Verizon 2017 All RightsReserved
Informationcontainedhereinis providedASIS andsubject to changewithout notice. All trademarks usedhereinare property of their respective owners.
Device-Count Forecast
5 step prediction
company1.com
Š Verizon 2017 All RightsReserved
Informationcontainedhereinis providedASIS andsubject to changewithout notice. All trademarks usedhereinare property of their respective owners.
Thank You.
Q&A
Join us and make machines intelligent
Data & Artificial Intelligence Systems
499 Hamilton Ave, Palo Alto
California

Weitere ähnliche Inhalte

Was ist angesagt?

Hadoop Infrastructure @Uber Past, Present and Future
Hadoop Infrastructure @Uber Past, Present and FutureHadoop Infrastructure @Uber Past, Present and Future
Hadoop Infrastructure @Uber Past, Present and Future
DataWorks Summit
 
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBaseHBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
Michael Stack
 

Was ist angesagt? (20)

Apache Arrow: In Theory, In Practice
Apache Arrow: In Theory, In PracticeApache Arrow: In Theory, In Practice
Apache Arrow: In Theory, In Practice
 
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
 
Building a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache ArrowBuilding a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache Arrow
 
Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...
Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...
Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...
 
High Performance Data Lake with Apache Hudi and Alluxio at T3Go
High Performance Data Lake with Apache Hudi and Alluxio at T3GoHigh Performance Data Lake with Apache Hudi and Alluxio at T3Go
High Performance Data Lake with Apache Hudi and Alluxio at T3Go
 
HUG_Ireland_Apache_Arrow_Tomer_Shiran
HUG_Ireland_Apache_Arrow_Tomer_Shiran HUG_Ireland_Apache_Arrow_Tomer_Shiran
HUG_Ireland_Apache_Arrow_Tomer_Shiran
 
Hadoop Infrastructure @Uber Past, Present and Future
Hadoop Infrastructure @Uber Past, Present and FutureHadoop Infrastructure @Uber Past, Present and Future
Hadoop Infrastructure @Uber Past, Present and Future
 
Video Analysis in Hadoop
Video Analysis in HadoopVideo Analysis in Hadoop
Video Analysis in Hadoop
 
From limited Hadoop compute capacity to increased data scientist efficiency
From limited Hadoop compute capacity to increased data scientist efficiencyFrom limited Hadoop compute capacity to increased data scientist efficiency
From limited Hadoop compute capacity to increased data scientist efficiency
 
Empower Data-Driven Organizations
Empower Data-Driven OrganizationsEmpower Data-Driven Organizations
Empower Data-Driven Organizations
 
Build Big Data Enterprise solutions faster on Azure HDInsight
Build Big Data Enterprise solutions faster on Azure HDInsightBuild Big Data Enterprise solutions faster on Azure HDInsight
Build Big Data Enterprise solutions faster on Azure HDInsight
 
Dancing elephants - efficiently working with object stores from Apache Spark ...
Dancing elephants - efficiently working with object stores from Apache Spark ...Dancing elephants - efficiently working with object stores from Apache Spark ...
Dancing elephants - efficiently working with object stores from Apache Spark ...
 
Tame that Beast
Tame that BeastTame that Beast
Tame that Beast
 
Dremio introduction
Dremio introductionDremio introduction
Dremio introduction
 
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBaseHBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Improving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityImproving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and Interoperability
 
RedisConf17 - Redis Enterprise on IBM Power Systems
RedisConf17 - Redis Enterprise on IBM Power SystemsRedisConf17 - Redis Enterprise on IBM Power Systems
RedisConf17 - Redis Enterprise on IBM Power Systems
 
Leveraging docker for hadoop build automation and big data stack provisioning
Leveraging docker for hadoop build automation and big data stack provisioningLeveraging docker for hadoop build automation and big data stack provisioning
Leveraging docker for hadoop build automation and big data stack provisioning
 
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheUsing Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
 

Andere mochten auch

Realtime Analytical Query Processing and Predictive Model Building on High Di...
Realtime Analytical Query Processing and Predictive Model Building on High Di...Realtime Analytical Query Processing and Predictive Model Building on High Di...
Realtime Analytical Query Processing and Predictive Model Building on High Di...
Spark Summit
 

Andere mochten auch (20)

Realtime Analytical Query Processing and Predictive Model Building on High Di...
Realtime Analytical Query Processing and Predictive Model Building on High Di...Realtime Analytical Query Processing and Predictive Model Building on High Di...
Realtime Analytical Query Processing and Predictive Model Building on High Di...
 
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
 
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
 
Cornami Accelerates Performance on SPARK: Spark Summit East talk by Paul Master
Cornami Accelerates Performance on SPARK: Spark Summit East talk by Paul MasterCornami Accelerates Performance on SPARK: Spark Summit East talk by Paul Master
Cornami Accelerates Performance on SPARK: Spark Summit East talk by Paul Master
 
Solving Real Problems with Apache Spark: Archiving, E-Discovery, and Supervis...
Solving Real Problems with Apache Spark: Archiving, E-Discovery, and Supervis...Solving Real Problems with Apache Spark: Archiving, E-Discovery, and Supervis...
Solving Real Problems with Apache Spark: Archiving, E-Discovery, and Supervis...
 
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-Malla
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-MallaKerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-Malla
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-Malla
 
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...
 
Effective Spark with Alluxio: Spark Summit East talk by Gene Pang and Haoyuan...
Effective Spark with Alluxio: Spark Summit East talk by Gene Pang and Haoyuan...Effective Spark with Alluxio: Spark Summit East talk by Gene Pang and Haoyuan...
Effective Spark with Alluxio: Spark Summit East talk by Gene Pang and Haoyuan...
 
Using Spark and Riak for IoT Apps—Patterns and Anti-Patterns: Spark Summit Ea...
Using Spark and Riak for IoT Apps—Patterns and Anti-Patterns: Spark Summit Ea...Using Spark and Riak for IoT Apps—Patterns and Anti-Patterns: Spark Summit Ea...
Using Spark and Riak for IoT Apps—Patterns and Anti-Patterns: Spark Summit Ea...
 
Building Deep Learning Powered Big Data: Spark Summit East talk by Jiao Wang ...
Building Deep Learning Powered Big Data: Spark Summit East talk by Jiao Wang ...Building Deep Learning Powered Big Data: Spark Summit East talk by Jiao Wang ...
Building Deep Learning Powered Big Data: Spark Summit East talk by Jiao Wang ...
 
Auto Scaling Systems With Elastic Spark Streaming: Spark Summit East talk by ...
Auto Scaling Systems With Elastic Spark Streaming: Spark Summit East talk by ...Auto Scaling Systems With Elastic Spark Streaming: Spark Summit East talk by ...
Auto Scaling Systems With Elastic Spark Streaming: Spark Summit East talk by ...
 
Improving Python and Spark Performance and Interoperability: Spark Summit Eas...
Improving Python and Spark Performance and Interoperability: Spark Summit Eas...Improving Python and Spark Performance and Interoperability: Spark Summit Eas...
Improving Python and Spark Performance and Interoperability: Spark Summit Eas...
 
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
 
Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...
Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...
Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...
 
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
 
BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...
BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...
BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...
 
Hadoop Playlist (Ignite talks at Strata + Hadoop World 2013)
Hadoop Playlist (Ignite talks at Strata + Hadoop World 2013)Hadoop Playlist (Ignite talks at Strata + Hadoop World 2013)
Hadoop Playlist (Ignite talks at Strata + Hadoop World 2013)
 
HDFS Federation
HDFS FederationHDFS Federation
HDFS Federation
 
Introduction To Elastic MapReduce at WHUG
Introduction To Elastic MapReduce at WHUGIntroduction To Elastic MapReduce at WHUG
Introduction To Elastic MapReduce at WHUG
 
Data Aggregation At Scale Using Apache Flume
Data Aggregation At Scale Using Apache FlumeData Aggregation At Scale Using Apache Flume
Data Aggregation At Scale Using Apache Flume
 

Ähnlich wie Realtime Analytical Query Processing and Predictive Model Building on High Dimensional Document Datasets with Timestamps: Spark Summit East talk by Debasish Das

Ähnlich wie Realtime Analytical Query Processing and Predictive Model Building on High Dimensional Document Datasets with Timestamps: Spark Summit East talk by Debasish Das (20)

Spark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod NarasimhaSpark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod Narasimha
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
DAT332_How Verizon is Adopting Amazon Aurora PostgreSQL for Enterprise Workloads
DAT332_How Verizon is Adopting Amazon Aurora PostgreSQL for Enterprise WorkloadsDAT332_How Verizon is Adopting Amazon Aurora PostgreSQL for Enterprise Workloads
DAT332_How Verizon is Adopting Amazon Aurora PostgreSQL for Enterprise Workloads
 
Migrating your traditional Data Warehouse to a Modern Data Lake
Migrating your traditional Data Warehouse to a Modern Data LakeMigrating your traditional Data Warehouse to a Modern Data Lake
Migrating your traditional Data Warehouse to a Modern Data Lake
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
Verizon Centralizes Data into a Data Lake in Real Time for Analytics
Verizon Centralizes Data into a Data Lake in Real Time for AnalyticsVerizon Centralizes Data into a Data Lake in Real Time for Analytics
Verizon Centralizes Data into a Data Lake in Real Time for Analytics
 
Using Data Lakes
Using Data Lakes Using Data Lakes
Using Data Lakes
 
Spark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod NarasimhaSpark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod Narasimha
 
Spark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod NarasimhaSpark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod Narasimha
 
Using Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFUsing Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SF
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
From Mainframe to Microservices: Vanguard’s Move to the Cloud - ENT331 - re:I...
From Mainframe to Microservices: Vanguard’s Move to the Cloud - ENT331 - re:I...From Mainframe to Microservices: Vanguard’s Move to the Cloud - ENT331 - re:I...
From Mainframe to Microservices: Vanguard’s Move to the Cloud - ENT331 - re:I...
 
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
 
FINRA's Managed Data Lake: Next-Gen Analytics in the Cloud - ENT328 - re:Inve...
FINRA's Managed Data Lake: Next-Gen Analytics in the Cloud - ENT328 - re:Inve...FINRA's Managed Data Lake: Next-Gen Analytics in the Cloud - ENT328 - re:Inve...
FINRA's Managed Data Lake: Next-Gen Analytics in the Cloud - ENT328 - re:Inve...
 
STG316_Optimizing Storage for Big Data Workloads
STG316_Optimizing Storage for Big Data WorkloadsSTG316_Optimizing Storage for Big Data Workloads
STG316_Optimizing Storage for Big Data Workloads
 
ABD312_Deep Dive Migrating Big Data Workloads to AWS
ABD312_Deep Dive Migrating Big Data Workloads to AWSABD312_Deep Dive Migrating Big Data Workloads to AWS
ABD312_Deep Dive Migrating Big Data Workloads to AWS
 
Streaming Solutions for Real time problems
Streaming Solutions for Real time problemsStreaming Solutions for Real time problems
Streaming Solutions for Real time problems
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
EUT302_Data Ingestion at Seismic Scale Best Practices for Processing Petabyte...
EUT302_Data Ingestion at Seismic Scale Best Practices for Processing Petabyte...EUT302_Data Ingestion at Seismic Scale Best Practices for Processing Petabyte...
EUT302_Data Ingestion at Seismic Scale Best Practices for Processing Petabyte...
 
Stream processing on mobile networks
Stream processing on mobile networksStream processing on mobile networks
Stream processing on mobile networks
 

Mehr von Spark Summit

Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Spark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Spark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
Spark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Spark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
Spark Summit
 

Mehr von Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

KĂźrzlich hochgeladen

Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
shambhavirathore45
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
JohnnyPlasten
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
amitlee9823
 

KĂźrzlich hochgeladen (20)

Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 

Realtime Analytical Query Processing and Predictive Model Building on High Dimensional Document Datasets with Timestamps: Spark Summit East talk by Debasish Das

  • 1. Š Verizon 2017 All RightsReserved Informationcontainedhereinis providedASIS andsubject to changewithout notice. All trademarks usedhereinare property of their respective owners. Real-time analytical query processing and predictive model building on high dimensional document datasets with timestamps Debasish Das Data & Artificial Intelligence, Verizon Contributors Algorithm: Santanu Das,Zhengming Xing Platform: PonramaJegan Frontend: AltaffShaik,Jon Leonhardt
  • 2. Š Verizon 2017 All RightsReserved Informationcontainedhereinis providedASIS andsubject to changewithout notice. All trademarks usedhereinare property of their respective owners. Data Overview • Location data • Each srcip defined as unique row key • Provides approximate location of each srcip • Timeseries containing latitude, longitude, error bound, duration, timezone for each srcip • Clickstream data • Contains clickstream data of each row key • Contains startTime, duration, httphost, httpuri, upload/download bytes, httpmethod • Compatible with IPFIX/Netflow formats
  • 3. Š Verizon 2017 All RightsReserved Informationcontainedhereinis providedASIS andsubject to changewithout notice. All trademarks usedhereinare property of their respective owners. Marketing Analytics Lookalike modeling Competitive analysis • Aggregate Anonymous analysis for insights • Spark Summit Europe 2016 • Spark Summit East 2017 Demand Prediction ? ? Location Clustering
  • 4. Š Verizon 2017 All RightsReserved Informationcontainedhereinis providedASIS andsubject to changewithout notice. All trademarks usedhereinare property of their respective owners. Data Model • Schema:srcip, timestmap, tld, zip, tldvisits, zipvisits • Dense dimension,dense measure – Data: 10.1.13.120,d1H2,company1.com,94555,2, 4 • Sparse dimension, dense measure – Data: 10.1.13.120,d1, {company1.com,company2.com},{94555, 94301}, 10, 15 • Sparse dimension, sparse measure – Data: 10.1.13.120,d1, {company1.com,company2.com},{94555, 94301}, {company1.com:4,company2.com:6},{94555:8,94301:7} • Timestamp optional • Competing technologies:PowerDrill, Druid, LinkedIn Pinot, Essbase
  • 5. Š Verizon 2017 All RightsReserved Informationcontainedhereinis providedASIS andsubject to changewithout notice. All trademarks usedhereinare property of their respective owners. Lucene Document Mapping • Example Schema: srcip, timestamp, tld, zip, tldvisits, zipvisits Data: 10.1.13.120, d1, {company1.com, company2.com}, 94555, 10, 15 Data: 10.1.13.120, d4, {company1.com, company3.com}, 94301, 12, 8 • DataFrame Row to Lucene Document mapping schema Row Document OLAP srcip StringType Stored Measure timestamp TimestampType Stored Dimension tld ArrayType[StringType] Indexed + Stored Dimension zip StringType Indexed + Stored Dimension tld/zipvisits IntegerType Stored Measure
  • 6. Š Verizon 2017 All RightsReserved Informationcontainedhereinis providedASIS andsubject to changewithout notice. All trademarks usedhereinare property of their respective owners. Lucene Storage • Row storage: Spark Summit Europe 2016 – 2 indirect disk seeks for retrieval Reference: http://www.slideshare.net/lucenerevoluti on/willnauer-simon-doc-values-column- stride-fields-in-lucene
  • 7. Š Verizon 2017 All RightsReserved Informationcontainedhereinis providedASIS andsubject to changewithout notice. All trademarks usedhereinare property of their respective owners. Lucene Column Store • Column storage: Spark Summit East 2017 – References: LUCENE-3108, LUCENE-2935, LUCENE-2168, LUCENE-1231 – Cache friendly column retrieval: 1 direct disk seek – Integer column: Min-Max encoding – Numeric column: Uncompressed – Binary column: Referenced – Complex Type: Binary + Kryo Integer Binary
  • 8. Š Verizon 2017 All RightsReserved Informationcontainedhereinis providedASIS andsubject to changewithout notice. All trademarks usedhereinare property of their respective owners. DeviceAnalyzer • Goals – srcip/visits as dense measure – Real-Time queries • Aggregate • Group • Timeseries – Real-Time Timeseries forecast
  • 9. Š Verizon 2017 All RightsReserved Informationcontainedhereinis providedASIS andsubject to changewithout notice. All trademarks usedhereinare property of their respective owners. Trapezium DAIS Open Source framework to build batch, streaming and API services https://github.com/Verizon/trapezium
  • 10. Š Verizon 2017 All RightsReserved Informationcontainedhereinis providedASIS andsubject to changewithout notice. All trademarks usedhereinare property of their respective owners. Trapezium LuceneDAO • SparkSQL optimized for full scan – Column indexing not supported • Fulfills Real-Time requirements for OLAP queries • Lucene for indexing + storage per executor • Spark operators for distributed aggregation – treeAggregate – mapPartition + treeReduce • Features • Build Distributed Lucene Shards from Dataframe • Access saved shards through LuceneDAO for Analytics + ML pipelines • Save shards to HDFS for QueryProcessor like SolrCloud
  • 11. Š Verizon 2017 All RightsReserved Informationcontainedhereinis providedASIS andsubject to changewithout notice. All trademarks usedhereinare property of their respective owners. LuceneDAO Indexing /?ref=1108&?url=http://www.macys.c om&id=5 www.walmart.com%2Fc%2Fep%2Frange- hood-filters&sellermemid=459 http%3A%2F%2Fm.macys.com%2Fshop%2F product%2Fjockey-elance-cotton /?ref=1108&?url=http://www.macys.c om&id=5 m.amazon.com%2Fshop%2Fproduct%2Fjo ckey-elance-cotton https://www.walmart.com/ip/Women- Pant-Suit-Roundtree walmart://ip/?veh=dsn&wmlspartner m.macys.com%2Fshop%2Fsearch%3Fkeyw ord%3DDress ip1, macys.com, 2 ip1, walmart.com, 1 ip1, macys.com: 1 ip2, walmart.com: 1 ip1, amazon.com: 1 ip1, macys.com : 2 ip2, walmart.com: 1 Macys, 0 Walmart, 1 Amazon, 2 srcip tld visits ip1 [0,1,2] 7 ip2 [1] 2 tld doc 0 [ip1] 1 [ip1, ip2] 2 [ip1] reverse-index column-store measure: [srcip,visits] dimension: [tld]
  • 12. Š Verizon 2017 All RightsReserved Informationcontainedhereinis providedASIS andsubject to changewithout notice. All trademarks usedhereinare property of their respective owners. LuceneDAO API import trapezium.dal.lucene._ import org.apache.spark.sql.types._ object DeviceIndexer extend BatchTransaction { process(dfs: Map[String, DataFrame], batchTime: Time): { df = dfs(“DeviceStore”) olapDf= rollup(df) } persist(df: DataFrame, batchTime: Time): { val dimensions = Set(“tld”, “zip”) val types = Map(“tld” -> LuceneType(true, StringType), “srcip” -> LuceneType(false, StringType), “visits” -> LuceneType(false,IntegerType)) val dao = new LuceneDAO(“path”, dimension,types) dao.index(df,new Time(batchTime)) } Index Creation import trapezium.dal.lucene._ import org.apache.spark.sql.types._ Load: val dimensions = Set(“tld”, “zip”) val types = Map(“tld” -> LuceneType(true,StringType), “srcip” -> LuceneType(false,StringType), “visits” -> LuceneType(false,IntegerType)) val dao = new LuceneDAO(“path”, dimension,types) dao.load(sc) Queries: dao.aggregate(query:String, measure: String, aggregator:String) dao.group(query:String,dimension:String, measure: String, aggregator:String) dao.timeseries(query:String, minTime: Long, maxTime: Long, rollup:Long, measure: String, aggregator: String) dao.search(query:String, columns: Seq[String]): DataFrame Query Processing
  • 13. Š Verizon 2017 All RightsReserved Informationcontainedhereinis providedASIS andsubject to changewithout notice. All trademarks usedhereinare property of their respective owners. LuceneDAO Internals • Retrieve documents with/without relevance • ColumnAccessor over dimension + measures • Disk / In-Memory ColumnAccessor • C-store style while loops over dimension • Spark ML style aggregators • treeAggregate for distributed aggregation
  • 14. Š Verizon 2017 All RightsReserved Informationcontainedhereinis providedASIS andsubject to changewithout notice. All trademarks usedhereinare property of their respective owners. Aggregation Architecture
  • 15. Š Verizon 2017 All RightsReserved Informationcontainedhereinis providedASIS andsubject to changewithout notice. All trademarks usedhereinare property of their respective owners. Index Generation • Dataset details: 57M devices, 4.2B docs • Parquet: 79 GB • Lucene Reverse Index: 16 GB • Lucene DocValues: 59.6 GB • Global Dictionary Size: 5.5 MB • Executors: 20 Cores: 8 • RAM Driver: 16g Executor: 16g • Runtime – Parquet: • 1831.87 s – Dictionary: • 213.7 s – Index + Stored: • 360 s
  • 16. Š Verizon 2017 All RightsReserved Informationcontainedhereinis providedASIS andsubject to changewithout notice. All trademarks usedhereinare property of their respective owners. Aggregate Queries • HashSet aggregation • SparkSQL df.select(”srcip”,"tld") .where(array_contains(df("tld"), ”company1.com")) .agg(countDistinct(”srcip") as "visits") .collect() • LuceneDAO dao.aggregate("tld:company1.com”, "srcip", "count") 3.82 6.65 14.25 20.64 0 100 200 300 400 500 600 700 1 5 10 20 Runtime(s) qps spark-sql1.6 spark-sql2.0 lucene-dao
  • 17. Š Verizon 2017 All RightsReserved Informationcontainedhereinis providedASIS andsubject to changewithout notice. All trademarks usedhereinare property of their respective owners. Group Queries • HLL aggregation • SparkSQL df.select(”srcip”,"tld”, “zip”) .where(array_contains(df("tld"), "company1.com")) .select(“zip”, “srcip”).groupBy(“zip”) .agg(approxCountDistinct(”srcip") as "visits") .collect() • LuceneDAO dao.aggregate("tld:company1.com", "srcip", "count") 6.52 11.92 12.72 20.29 0 100 200 300 400 500 600 700 800 1 5 10 20 Runtime(s) qps spark-sql1.6 spark-sql2.0 lucene-dao
  • 18. Š Verizon 2017 All RightsReserved Informationcontainedhereinis providedASIS andsubject to changewithout notice. All trademarks usedhereinare property of their respective owners. Device Heat-Map company1.com
  • 19. Š Verizon 2017 All RightsReserved Informationcontainedhereinis providedASIS andsubject to changewithout notice. All trademarks usedhereinare property of their respective owners. Timeseries Queries • HLL aggregation • SparkSQL df.select(“time”,”srcip”,"tld”) .where(array_contains(df("tld"), ”company1.com")) .select(“time”, “srcip”).groupBy(“time”) .agg(approxCountDistinct(”srcip") as "visits") .collect() • LuceneDAO dao.aggregate("tld:company1.com", "srcip", "count") 1.99 4.59 7.31 13.34 0 100 200 300 400 500 600 700 1 2 3 4 spark-sql1.6 spark-sql2.0 lucene-dao
  • 20. Š Verizon 2017 All RightsReserved Informationcontainedhereinis providedASIS andsubject to changewithout notice. All trademarks usedhereinare property of their respective owners. TimeSeries Forecast • Given a query: select timestamp, (srcip) as deviceCount where tld=‘company1.com’AND state=‘CA’ • Predict deviceCount for next timestamp • Forecast deviceCount for next N timestamps TimeSeriesKNNRegression.predict Input: timeseries: Array[Double] topk: Int featureDim: Int normalize: Boolean multiStep: Int metric: KernelType=Euclidean Output: predicted values: Array[Double] Trapezium ML
  • 21. Š Verizon 2017 All RightsReserved Informationcontainedhereinis providedASIS andsubject to changewithout notice. All trademarks usedhereinare property of their respective owners. Forecast Service httpServer = { provider = "akka" hostname = "localhost" port = 19999 contextPath = "/" endPoints = [{ path = “analyzer-api" className = ”TimeseriesEndPoint" }] } Powered by Trapezium API class TimeseriesEndPoint(sc: SparkContext) extends SparkServiceEndPoint(sc) { override def route : timeseriesRoute val types = Map(“tld” -> LuceneType(true, StringType), “srcip” -> LuceneType(false, StringType), “visits” -> LuceneType(false, IntegerType)) val dao = new LuceneDAO(“path”, dimension, types) dao.load(sc) def timeseriesRoute : { post { request => { ts = dao.timeseries(request, minTime, maxTime, rollup, “srcip”, “count_approx”) predicted = TimeseriesKNNRegression.predict(ts, topk=5, featureDim=3, normalize=false, multiStep=5, metric=Euclidean) generateResponse(ts, predicted) } }
  • 22. Š Verizon 2017 All RightsReserved Informationcontainedhereinis providedASIS andsubject to changewithout notice. All trademarks usedhereinare property of their respective owners. Device-Count Forecast 5 step prediction company1.com
  • 23. Š Verizon 2017 All RightsReserved Informationcontainedhereinis providedASIS andsubject to changewithout notice. All trademarks usedhereinare property of their respective owners. Thank You. Q&A Join us and make machines intelligent Data & Artificial Intelligence Systems 499 Hamilton Ave, Palo Alto California