© Verizon 2017 All Rights Reserved
Information contained herein is provided AS IS and subject to change without notice. All trademarks used herein are property of their respective owners.
Real-time analytical query processing and
predictive model building on
high dimensional document datasets with
timestamps
Debasish Das
Distinguished Engineer
Contributors
Algorithm: Santanu Das, Zhengming Xing
Platform: Ponrama Jegan
Frontend: Altaff Shaik, Jon Leonhardt
Data Overview
• Location data
• Each srcip defined as unique row key
• Provides approximate location of each srcip
• Timeseries containing latitude, longitude, error bound, duration, timezone for
each srcip
• Clickstream data
• Each srcip defined as unique row key
• Timeseries containing startTime, duration, httphost, httpuri, upload/download
bytes, httpmethod
• Compatible with IPFIX/Netflow formats
Marketing Analytics
Lookalike Modeling Discriminant Analysis
• Aggregate Anonymous analysis for insights
• Spark Summit Europe 2016
• Spark Summit East 2017
Demand Prediction
?
?
Location Clustering
Data Model
• Schema: srcip, timestamp, tld, zip, tldvisits, zipvisits
• Dense dimension, dense measure
– Data: 10.1.13.120, d1H2, company1.com, 94555, 2, 4
• Sparse dimension, dense measure
– Data: 10.1.13.120, d1, {company1.com, company2.com}, {94555, 94301}, 10, 15
• Sparse dimension, sparse measure
– Data: 10.1.13.120, d1, {company1.com, company2.com}, {94555, 94301}, {company1.com:4, company2.com:6}, {94555:8, 94301:7}
• Timestamp optional
• Competing technologies: PowerDrill, Druid, LinkedIn Pinot, Essbase
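The three row layouts above can be sketched as plain Scala case classes. This is only an illustration of the shapes: the names DenseRow, SparseDimRow, SparseMeasureRow and the toSparseDim helper are hypothetical and are not part of Trapezium.

```scala
object DataModelSketch {
  // Dense dimension, dense measure: one tld and one zip per row.
  final case class DenseRow(srcip: String, timestamp: String, tld: String, zip: String,
                            tldVisits: Int, zipVisits: Int)

  // Sparse dimension, dense measure: sets of dimension values, one total per measure.
  final case class SparseDimRow(srcip: String, timestamp: String,
                                tld: Set[String], zip: Set[String],
                                tldVisits: Int, zipVisits: Int)

  // Sparse dimension, sparse measure: one visit count per dimension value.
  final case class SparseMeasureRow(srcip: String, timestamp: String,
                                    tldVisits: Map[String, Int],
                                    zipVisits: Map[String, Int]) {
    def tld: Set[String] = tldVisits.keySet
    def zip: Set[String] = zipVisits.keySet

    // Collapse the per-value counts into the "sparse dimension, dense measure" layout.
    def toSparseDim: SparseDimRow =
      SparseDimRow(srcip, timestamp, tld, zip, tldVisits.values.sum, zipVisits.values.sum)
  }

  // The sparse-measure example row from the slide.
  val example: SparseMeasureRow = SparseMeasureRow(
    "10.1.13.120", "d1",
    Map("company1.com" -> 4, "company2.com" -> 6),
    Map("94555" -> 8, "94301" -> 7))
}
```

Collapsing the example row reproduces the dense-measure totals shown on the slide (10 tld visits, 15 zip visits).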
Lucene Document Mapping
• Example
Schema: srcip, timestamp, tld, zip, tldvisits, zipvisits
Data: 10.1.13.120, d1, {company1.com, company2.com}, 94555, 10, 15
Data: 10.1.13.120, d4, {company1.com, company3.com}, 94301, 12, 8
• DataFrame Row to Lucene Document mapping
schema         Row                    Document          OLAP
srcip          StringType             Stored            Measure
timestamp      TimestampType          Stored            Dimension
tld            ArrayType[StringType]  Indexed + Stored  Dimension
zip            StringType             Indexed + Stored  Dimension
tld/zipvisits  IntegerType            Stored            Measure
Lucene Storage
• Row storage: Spark Summit Europe 2016
– 2 indirect disk seeks for retrieval
Reference: http://www.slideshare.net/lucenerevolution/willnauer-simon-doc-values-column-stride-fields-in-lucene
Lucene Column Store
• Column storage: Spark Summit East 2017
– References: LUCENE-3108, LUCENE-2935, LUCENE-2168, LUCENE-1231
– Cache friendly column retrieval: 1 direct disk seek
– Integer column: Min-Max encoding
– Numeric column: Uncompressed
– Binary column: Referenced
– Complex Type: Binary + Kryo
[Figure: Integer and Binary column layouts]
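One plausible reading of "Min-Max encoding" for integer columns is sketched below: store the column minimum once, keep each value as an offset from it, and size the offsets by the bits needed for (max - min). This is an assumption about what the slide means, not Lucene's actual on-disk format. The names Encoded, bitsRequired and encode are hypothetical, offsets are held in a plain Long array rather than bit-packed, and max - min is assumed not to overflow.

```scala
object MinMaxEncodingSketch {
  // Column stored as a shared minimum plus one offset per document.
  final case class Encoded(min: Long, bitsPerValue: Int, deltas: Array[Long]) {
    // Random access by document id, as a column accessor would need.
    def apply(docId: Int): Long = min + deltas(docId)
  }

  // Number of bits needed to represent any offset in [0, range].
  def bitsRequired(range: Long): Int =
    if (range == 0L) 0 else 64 - java.lang.Long.numberOfLeadingZeros(range)

  def encode(values: Array[Long]): Encoded = {
    require(values.nonEmpty, "column must not be empty")
    val min = values.min
    val max = values.max
    Encoded(min, bitsRequired(max - min), values.map(_ - min))
  }
}
```

For a column whose values all lie within a narrow range, the offsets need far fewer bits than the raw 64-bit values, which is what makes the column cache friendly.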
DeviceAnalyzer
• Goals
– srcip/visits as dense measure
– Real-Time queries
• Aggregate
• Group
• Time-series
– Real-Time Time-series forecast
Trapezium
DAIS Open Source framework to build batch, streaming and API services
https://github.com/Verizon/trapezium
Trapezium LuceneDAO
• SparkSQL optimized for full scan
– Column indexing not supported
• Fulfills Real-Time requirements for OLAP queries
• Lucene for indexing + storage per executor
• Spark operators for distributed aggregation
– treeAggregate
– mapPartitions + treeReduce
• Features
– Build distributed Lucene shards from a DataFrame
– Access saved shards through LuceneDAO for analytics + ML pipelines
– Save shards to HDFS for a QueryProcessor like SolrCloud
LuceneDAO Indexing
[Figure: indexing flow from raw URLs to a reverse index and a column store]
• Input URLs
– /?ref=1108&?url=http://www.macys.com&id=5
– www.walmart.com%2Fc%2Fep%2Frange-hood-filters&sellermemid=459
– http%3A%2F%2Fm.macys.com%2Fshop%2Fproduct%2Fjockey-elance-cotton
– /?ref=1108&?url=http://www.macys.com&id=5
– m.amazon.com%2Fshop%2Fproduct%2Fjockey-elance-cotton
– https://www.walmart.com/ip/Women-Pant-Suit-Roundtree
– walmart://ip/?veh=dsn&wmlspartner
– m.macys.com%2Fshop%2Fsearch%3Fkeyword%3DDress
• Intermediate (srcip, tld, visits) records
– ip1, macys.com, 2
– ip1, walmart.com, 1
– ip1, macys.com, 1
– ip2, walmart.com, 1
– ip1, amazon.com, 1
– ip1, macys.com, 2
– ip2, walmart.com, 1
• Dictionary (tld, id)
– macys.com, 0
– walmart.com, 1
– amazon.com, 2
• reverse-index (tld → doc)
– macys.com [ip1]
– walmart.com [ip1, ip2]
– amazon.com [ip1]
• column-store
– measure: [srcip, visits]
– dimension: [tld]
– srcip: ip1, ip2
– tld: [0,1,2], [1]
– visits: 7, 2
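The indexing flow in the figure can be approximated in a few lines of plain Scala: build a global dictionary over tld values, group records into one document per srcip with dictionary-encoded dimensions and a summed measure, and build a reverse index from tld to documents. This is a local sketch under stated assumptions, not LuceneDAO's code. The names Doc, buildDictionary, buildDocs and buildReverseIndex are hypothetical, and the records are the (srcip, tld, visits) values read off the figure.

```scala
object IndexingSketch {
  // One document per srcip: dictionary-encoded tld ids plus total visits.
  final case class Doc(srcip: String, tldIds: Seq[Int], visits: Int)

  // Global dictionary: each distinct tld gets an id in order of first appearance.
  def buildDictionary(records: Seq[(String, String, Int)]): Map[String, Int] =
    records.map(_._2).distinct.zipWithIndex.toMap

  // Column store view: group by srcip, encode tlds, sum the visits measure.
  def buildDocs(records: Seq[(String, String, Int)], dict: Map[String, Int]): Seq[Doc] =
    records.groupBy(_._1).toSeq.sortBy(_._1).map { case (ip, rs) =>
      Doc(ip, rs.map(r => dict(r._2)).distinct.sorted, rs.map(_._3).sum)
    }

  // Reverse index: tld -> sorted srcips of the documents containing it.
  def buildReverseIndex(records: Seq[(String, String, Int)]): Map[String, Seq[String]] =
    records.groupBy(_._2).map { case (tld, rs) => tld -> rs.map(_._1).distinct.sorted }

  // Records as they appear in the figure.
  val records: Seq[(String, String, Int)] = Seq(
    ("ip1", "macys.com", 2), ("ip1", "walmart.com", 1),
    ("ip1", "macys.com", 1), ("ip2", "walmart.com", 1),
    ("ip1", "amazon.com", 1), ("ip1", "macys.com", 2),
    ("ip2", "walmart.com", 1))
}
```

With these records the sketch reproduces the figure's values: dictionary ids 0, 1, 2, documents ip1 with tld [0,1,2] and 7 visits, ip2 with tld [1] and 2 visits.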
LuceneDAO API
Index Creation
import trapezium.dal.lucene._
import org.apache.spark.sql.types._
object DeviceIndexer extends BatchTransaction {
  // Roll up the raw device store into the OLAP DataFrame to be indexed
  def process(dfs: Map[String, DataFrame], batchTime: Time) = {
    val df = dfs("DeviceStore")
    val olapDf = rollup(df)
    olapDf
  }
  // Build the Lucene shards from the rolled-up DataFrame
  def persist(df: DataFrame, batchTime: Time) = {
    val dimensions = Set("tld", "zip")
    val types = Map("tld" -> LuceneType(true, StringType),
                    "srcip" -> LuceneType(false, StringType),
                    "visits" -> LuceneType(false, IntegerType))
    val dao = new LuceneDAO("path", dimensions, types)
    dao.index(df, new Time(batchTime))
  }
}
Query Processing
import trapezium.dal.lucene._
import org.apache.spark.sql.types._
Load:
val dimensions = Set("tld", "zip")
val types = Map("tld" -> LuceneType(true, StringType),
                "srcip" -> LuceneType(false, StringType),
                "visits" -> LuceneType(false, IntegerType))
val dao = new LuceneDAO("path", dimensions, types)
dao.load(sc)
Queries:
dao.aggregate(query: String, measure: String, aggregator: String)
dao.group(query: String, dimension: String, measure: String,
          aggregator: String)
dao.timeseries(query: String, minTime: Long, maxTime: Long,
               rollup: Long, measure: String, aggregator: String)
dao.search(query: String, columns: Seq[String]): DataFrame
LuceneDAO Internals
• Retrieve documents with/without relevance
• ColumnAccessor over dimension + measures
• Disk / In-Memory ColumnAccessor
• C-store style while loops over dimension
• Spark ML style aggregators
• treeAggregate for distributed aggregation
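The internals listed above can be illustrated locally: a while loop reads a measure column through an accessor for the matching documents of one shard (the seqOp), and the per-shard partial results are merged pairwise, level by level, in the way Spark's treeAggregate combines partitions (the combOp). This is a sketch that runs without Spark. ColumnAccessor here is a hypothetical in-memory stand-in, and aggregateShard, treeMerge and distinctCount are illustrative names, not LuceneDAO APIs.

```scala
object AggregationSketch {
  // Hypothetical stand-in for a per-shard accessor over one measure column.
  final class ColumnAccessor(values: Array[String]) {
    def size: Int = values.length
    def get(docId: Int): String = values(docId)
  }

  // seqOp: C-store style while loop over the matching doc ids of one shard.
  def aggregateShard(accessor: ColumnAccessor, matchingDocs: Array[Int]): Set[String] = {
    val seen = scala.collection.mutable.HashSet.empty[String]
    var i = 0
    while (i < matchingDocs.length) {
      seen += accessor.get(matchingDocs(i))
      i += 1
    }
    seen.toSet
  }

  // combOp applied pairwise per level, mimicking a tree-shaped merge.
  @annotation.tailrec
  def treeMerge(partials: Seq[Set[String]]): Set[String] =
    if (partials.isEmpty) Set.empty[String]
    else if (partials.size == 1) partials.head
    else treeMerge(partials.grouped(2).map(_.reduce(_ union _)).toList)

  // Exact distinct count of the measure across all shards (HashSet aggregation).
  def distinctCount(shards: Seq[(ColumnAccessor, Array[Int])]): Int =
    treeMerge(shards.map { case (acc, docs) => aggregateShard(acc, docs) }).size
}
```

Swapping the HashSet for an approximate sketch such as HyperLogLog keeps the same seqOp/combOp structure while bounding the size of each partial result.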
Aggregation Architecture
Index Generation
• Dataset details:
57M devices, 4.2B docs
• Parquet: 79 GB
• Lucene Reverse Index: 16 GB
• Lucene DocValues: 59.6 GB
• Global Dictionary Size: 5.5 MB
• Executors: 20, Cores: 8
• RAM: Driver 16g, Executor 16g
• Runtime
– Parquet: 1831.87 s
– Dictionary: 213.7 s
– Index + Stored: 360 s
Aggregate Queries
• HashSet aggregation
• SparkSQL
df.select("srcip", "tld")
  .where(array_contains(df("tld"), "company1.com"))
  .agg(countDistinct("srcip") as "visits")
  .collect()
• LuceneDAO
dao.aggregate("tld:company1.com", "srcip", "count")
[Chart: Runtime (s) vs. qps (1, 5, 10, 20); legend: spark-sql1.6, spark-sql2.0. Data labels at qps 1, 5, 10, 20: 61.63, 158.4, 285.53, 538.11 and 3.82, 6.65, 14.25, 20.64]
Group Queries
• HLL aggregation
• SparkSQL
df.select("srcip", "tld", "zip")
  .where(array_contains(df("tld"), "company1.com"))
  .select("zip", "srcip").groupBy("zip")
  .agg(approxCountDistinct("srcip") as "visits")
  .collect()
• LuceneDAO
dao.group("tld:company1.com", "zip", "srcip", "count_approx")
[Chart: Runtime (s) vs. qps (1, 5, 10, 20); legend: spark-sql1.6, spark-sql2.0, lucene-dao. Data labels at qps 1, 5, 10, 20: 58.07, 174.44, 298.67, 669.69 and 6.52, 11.92, 12.72, 20.29]
Device Heat-Map
company1.com
Time-series Queries
• HLL aggregation
• SparkSQL
df.select("time", "srcip", "tld")
  .where(array_contains(df("tld"), "company1.com"))
  .select("time", "srcip").groupBy("time")
  .agg(approxCountDistinct("srcip") as "visits")
  .collect()
• LuceneDAO
dao.timeseries("tld:company1.com", minTime, maxTime, rollup, "srcip", "count_approx")
Complex query supported: tld:company1.com AND zip:94* …
[Chart: Runtime (s) vs. qps (1, 5, 10, 20); legend: spark-sql1.6, spark-sql2.0, lucene-dao. Data labels at qps 1, 5, 10, 20: 54.88, 169.02, 279.44, 528.88 and 1.99, 4.59, 7.31, 13.34]
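The minTime, maxTime and rollup parameters of a time-series query can be read as bucketing: keep events inside the time window, assign each to a bucket of width rollup, and count distinct srcip per bucket. The sketch below shows that reading in plain Scala. It assumes a half-open window [minTime, maxTime) and uses an exact distinct count in place of the HLL aggregation named on the slide. The function name rollupDistinct is hypothetical.

```scala
object TimeseriesRollupSketch {
  // events: (timestampMillis, srcip). Returns (bucketStart, distinct srcip count), sorted by time.
  def rollupDistinct(events: Seq[(Long, String)],
                     minTime: Long, maxTime: Long, rollup: Long): Seq[(Long, Int)] = {
    require(rollup > 0, "rollup must be positive")
    events
      .filter { case (t, _) => t >= minTime && t < maxTime }
      // Bucket start is aligned to minTime in steps of rollup.
      .groupBy { case (t, _) => minTime + ((t - minTime) / rollup) * rollup }
      .toSeq
      .map { case (bucket, es) => (bucket, es.map(_._2).distinct.size) }
      .sortBy(_._1)
  }
}
```

The resulting sequence of per-bucket counts is the kind of series a forecasting step can consume.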
Time-Series Forecast
• Given a query:
select timestamp, srcip
.where(tld='company1.com' AND state='CA')
.groupBy("time")
.agg(approxCountDistinct("srcip") as "visits")
• Predict deviceCount for next timestamp
• Forecast deviceCount for next N timestamps
TimeSeriesKNNRegression.predict
Input:
timeseries: Array[Double]
topk: Int
featureDim: Int
normalize: Boolean
multiStep: Int
metric: KernelType=Euclidean
Output:
predicted values: Array[Double]
Trapezium ML
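A k-nearest-neighbour forecast over a time series can be sketched as follows: embed the series into sliding windows of length featureDim, find the topk windows closest to the most recent one, average the values that followed them, and feed each prediction back in to forecast multiStep values. This is a minimal sketch of the general technique, not the Trapezium ML implementation. It uses Euclidean distance only, omits the normalize option, and the names KnnForecastSketch and predictNext are hypothetical.

```scala
object KnnForecastSketch {
  private def euclidean(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.indices.map(i => (a(i) - b(i)) * (a(i) - b(i))).sum)

  // One-step prediction: mean successor of the topk windows nearest to the latest window.
  def predictNext(series: Array[Double], topk: Int, featureDim: Int): Double = {
    require(topk >= 1, "topk must be at least 1")
    require(series.length > featureDim, "need at least featureDim + 1 points")
    val query = series.takeRight(featureDim)
    val candidates = (0 until series.length - featureDim).map { start =>
      val window = series.slice(start, start + featureDim)
      (euclidean(window, query), series(start + featureDim))
    }
    val nearest = candidates.sortBy(_._1).take(topk)
    nearest.map(_._2).sum / nearest.size
  }

  // Multi-step forecast: append each prediction to the history and predict again.
  def predict(series: Array[Double], topk: Int, featureDim: Int, multiStep: Int): Array[Double] = {
    val out = new Array[Double](multiStep)
    var history = series
    var step = 0
    while (step < multiStep) {
      out(step) = predictNext(history, topk, featureDim)
      history = history :+ out(step)
      step += 1
    }
    out
  }
}
```

On a strictly periodic series the sketch continues the cycle, which makes it easy to sanity-check before trying real device counts.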
Forecast Service
httpServer = {
  provider = "akka"
  hostname = "localhost"
  port = 19999
  contextPath = "/"
  endPoints = [{
    path = "analyzer-api"
    className = "TimeseriesEndPoint"
  }]
}
Powered by Trapezium API
class TimeseriesEndPoint(sc: SparkContext)
  extends SparkServiceEndPoint(sc) {
  override def route = timeseriesRoute
  // dimensions as defined on the LuceneDAO API slide
  val dimensions = Set("tld", "zip")
  val types = Map("tld" -> LuceneType(true, StringType),
                  "srcip" -> LuceneType(false, StringType),
                  "visits" -> LuceneType(false, IntegerType))
  val dao = new LuceneDAO("path", dimensions, types)
  dao.load(sc)
  def timeseriesRoute = {
    post { request => {
      // minTime, maxTime and rollup are not defined on the slide
      val ts = dao.timeseries(request, minTime, maxTime, rollup,
                              "srcip", "count_approx")
      val predicted = TimeSeriesKNNRegression.predict(ts, topk=5,
        featureDim=3, normalize=false, multiStep=5,
        metric=Euclidean)
      generateResponse(ts, predicted)
    }}
  }
}
Device-Count Forecast
5 step prediction
company1.com
Thank You.
Q&A
Join us and make machines intelligent
Data & Artificial Intelligence Systems
499 Hamilton Ave, Palo Alto
California

More Related Content

What's hot

03-NOV-1510-Ognjen-Antonic-Telemach-stream-1
03-NOV-1510-Ognjen-Antonic-Telemach-stream-103-NOV-1510-Ognjen-Antonic-Telemach-stream-1
03-NOV-1510-Ognjen-Antonic-Telemach-stream-1
Ognjen Antonic
 

What's hot (20)

Big Telco - Yousun Jeong
Big Telco - Yousun JeongBig Telco - Yousun Jeong
Big Telco - Yousun Jeong
 
Introduction to Dremio
Introduction to DremioIntroduction to Dremio
Introduction to Dremio
 
Using Spark and Riak for IoT Apps—Patterns and Anti-Patterns: Spark Summit Ea...
Using Spark and Riak for IoT Apps—Patterns and Anti-Patterns: Spark Summit Ea...Using Spark and Riak for IoT Apps—Patterns and Anti-Patterns: Spark Summit Ea...
Using Spark and Riak for IoT Apps—Patterns and Anti-Patterns: Spark Summit Ea...
 
Spark + Flashblade: Spark Summit East talk by Brian Gold
Spark + Flashblade: Spark Summit East talk by Brian GoldSpark + Flashblade: Spark Summit East talk by Brian Gold
Spark + Flashblade: Spark Summit East talk by Brian Gold
 
Presto: SQL-on-anything
Presto: SQL-on-anythingPresto: SQL-on-anything
Presto: SQL-on-anything
 
Build Big Data Enterprise solutions faster on Azure HDInsight
Build Big Data Enterprise solutions faster on Azure HDInsightBuild Big Data Enterprise solutions faster on Azure HDInsight
Build Big Data Enterprise solutions faster on Azure HDInsight
 
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheUsing Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
 
Scaling Through Simplicity—How a 300 million User Chat App Reduced Data Engin...
Scaling Through Simplicity—How a 300 million User Chat App Reduced Data Engin...Scaling Through Simplicity—How a 300 million User Chat App Reduced Data Engin...
Scaling Through Simplicity—How a 300 million User Chat App Reduced Data Engin...
 
Operationalizing Big Data Pipelines At Scale
Operationalizing Big Data Pipelines At ScaleOperationalizing Big Data Pipelines At Scale
Operationalizing Big Data Pipelines At Scale
 
Addressing Enterprise Customer Pain Points with a Data Driven Architecture
Addressing Enterprise Customer Pain Points with a Data Driven ArchitectureAddressing Enterprise Customer Pain Points with a Data Driven Architecture
Addressing Enterprise Customer Pain Points with a Data Driven Architecture
 
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
 
Simplifying And Accelerating Data Access for Python With Dremio and Apache Arrow
Simplifying And Accelerating Data Access for Python With Dremio and Apache ArrowSimplifying And Accelerating Data Access for Python With Dremio and Apache Arrow
Simplifying And Accelerating Data Access for Python With Dremio and Apache Arrow
 
Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre...
Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre...Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre...
Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre...
 
Spark in the Enterprise - 2 Years Later by Alan Saldich
Spark in the Enterprise - 2 Years Later by Alan SaldichSpark in the Enterprise - 2 Years Later by Alan Saldich
Spark in the Enterprise - 2 Years Later by Alan Saldich
 
Built-In Security for the Cloud
Built-In Security for the CloudBuilt-In Security for the Cloud
Built-In Security for the Cloud
 
Bi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in LondonBi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in London
 
Stsg17 speaker yousunjeong
Stsg17 speaker yousunjeongStsg17 speaker yousunjeong
Stsg17 speaker yousunjeong
 
03-NOV-1510-Ognjen-Antonic-Telemach-stream-1
03-NOV-1510-Ognjen-Antonic-Telemach-stream-103-NOV-1510-Ognjen-Antonic-Telemach-stream-1
03-NOV-1510-Ognjen-Antonic-Telemach-stream-1
 
Admiral Group
Admiral GroupAdmiral Group
Admiral Group
 
Presto @ Uber Hadoop summit2017
Presto @ Uber Hadoop summit2017Presto @ Uber Hadoop summit2017
Presto @ Uber Hadoop summit2017
 

Viewers also liked

Building Realtime Data Pipelines with Kafka Connect and Spark Streaming: Spar...
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming: Spar...Building Realtime Data Pipelines with Kafka Connect and Spark Streaming: Spar...
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming: Spar...
Spark Summit
 
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Spark Summit
 
Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ...
Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ...Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ...
Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ...
Spark Summit
 
Analysis Andromeda Galaxy Data Using Spark: Spark Summit East Talk by Jose Na...
Analysis Andromeda Galaxy Data Using Spark: Spark Summit East Talk by Jose Na...Analysis Andromeda Galaxy Data Using Spark: Spark Summit East Talk by Jose Na...
Analysis Andromeda Galaxy Data Using Spark: Spark Summit East Talk by Jose Na...
Spark Summit
 
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Spark Summit
 

Viewers also liked (20)

Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
 
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming: Spar...
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming: Spar...Building Realtime Data Pipelines with Kafka Connect and Spark Streaming: Spar...
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming: Spar...
 
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
 
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...
 
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
 
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
 
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-Malla
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-MallaKerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-Malla
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-Malla
 
BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...
BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...
BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...
 
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
 
Spark Autotuning: Spark Summit East talk by Lawrence Spracklen
Spark Autotuning: Spark Summit East talk by Lawrence SpracklenSpark Autotuning: Spark Summit East talk by Lawrence Spracklen
Spark Autotuning: Spark Summit East talk by Lawrence Spracklen
 
Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ...
Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ...Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ...
Spark as the Gateway Drug to Typed Functional Programming: Spark Summit East ...
 
Improving Python and Spark Performance and Interoperability: Spark Summit Eas...
Improving Python and Spark Performance and Interoperability: Spark Summit Eas...Improving Python and Spark Performance and Interoperability: Spark Summit Eas...
Improving Python and Spark Performance and Interoperability: Spark Summit Eas...
 
Analysis Andromeda Galaxy Data Using Spark: Spark Summit East Talk by Jose Na...
Analysis Andromeda Galaxy Data Using Spark: Spark Summit East Talk by Jose Na...Analysis Andromeda Galaxy Data Using Spark: Spark Summit East Talk by Jose Na...
Analysis Andromeda Galaxy Data Using Spark: Spark Summit East Talk by Jose Na...
 
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
 
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
 
New Directions in pySpark for Time Series Analysis: Spark Summit East talk by...
New Directions in pySpark for Time Series Analysis: Spark Summit East talk by...New Directions in pySpark for Time Series Analysis: Spark Summit East talk by...
New Directions in pySpark for Time Series Analysis: Spark Summit East talk by...
 
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
 
Time Series Analytics with Spark: Spark Summit East talk by Simon Ouellette
Time Series Analytics with Spark: Spark Summit East talk by Simon OuelletteTime Series Analytics with Spark: Spark Summit East talk by Simon Ouellette
Time Series Analytics with Spark: Spark Summit East talk by Simon Ouellette
 
Custom Applications with Spark's RDD: Spark Summit East talk by Tejas Patil
Custom Applications with Spark's RDD: Spark Summit East talk by Tejas PatilCustom Applications with Spark's RDD: Spark Summit East talk by Tejas Patil
Custom Applications with Spark's RDD: Spark Summit East talk by Tejas Patil
 

Similar to Realtime Analytical Query Processing and Predictive Model Building on High Dimensional Document Datasets with Timestamps: talk by Debasish Das

Realtime Analytical Query Processing and Predictive Model Building on High Di...
Realtime Analytical Query Processing and Predictive Model Building on High Di...Realtime Analytical Query Processing and Predictive Model Building on High Di...
Realtime Analytical Query Processing and Predictive Model Building on High Di...
Spark Summit
 

Similar to Realtime Analytical Query Processing and Predictive Model Building on High Dimensional Document Datasets with Timestamps: talk by Debasish Das (20)

Realtime Analytical Query Processing and Predictive Model Building on High Di...
Realtime Analytical Query Processing and Predictive Model Building on High Di...Realtime Analytical Query Processing and Predictive Model Building on High Di...
Realtime Analytical Query Processing and Predictive Model Building on High Di...
 
Spark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod NarasimhaSpark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod Narasimha
 
DAT332_How Verizon is Adopting Amazon Aurora PostgreSQL for Enterprise Workloads
DAT332_How Verizon is Adopting Amazon Aurora PostgreSQL for Enterprise WorkloadsDAT332_How Verizon is Adopting Amazon Aurora PostgreSQL for Enterprise Workloads
DAT332_How Verizon is Adopting Amazon Aurora PostgreSQL for Enterprise Workloads
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
Migrating your traditional Data Warehouse to a Modern Data Lake
Migrating your traditional Data Warehouse to a Modern Data LakeMigrating your traditional Data Warehouse to a Modern Data Lake
Migrating your traditional Data Warehouse to a Modern Data Lake
 
Using Data Lakes
Using Data Lakes Using Data Lakes
Using Data Lakes
 
Spark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod NarasimhaSpark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod Narasimha
 
Spark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod NarasimhaSpark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod Narasimha
 
Using Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFUsing Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SF
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
Verizon Centralizes Data into a Data Lake in Real Time for Analytics
Verizon Centralizes Data into a Data Lake in Real Time for AnalyticsVerizon Centralizes Data into a Data Lake in Real Time for Analytics
Verizon Centralizes Data into a Data Lake in Real Time for Analytics
 
Streaming Solutions for Real time problems
Streaming Solutions for Real time problemsStreaming Solutions for Real time problems
Streaming Solutions for Real time problems
 
STG316_Optimizing Storage for Big Data Workloads
STG316_Optimizing Storage for Big Data WorkloadsSTG316_Optimizing Storage for Big Data Workloads
STG316_Optimizing Storage for Big Data Workloads
 
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
 
From Mainframe to Microservices: Vanguard’s Move to the Cloud - ENT331 - re:I...
From Mainframe to Microservices: Vanguard’s Move to the Cloud - ENT331 - re:I...From Mainframe to Microservices: Vanguard’s Move to the Cloud - ENT331 - re:I...
From Mainframe to Microservices: Vanguard’s Move to the Cloud - ENT331 - re:I...
 
ABD312_Deep Dive Migrating Big Data Workloads to AWS
ABD312_Deep Dive Migrating Big Data Workloads to AWSABD312_Deep Dive Migrating Big Data Workloads to AWS
ABD312_Deep Dive Migrating Big Data Workloads to AWS
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
Stream processing on mobile networks
Stream processing on mobile networksStream processing on mobile networks
Stream processing on mobile networks
 
Case Study: Sprinklr Uses Amazon EBS to Maximize Its NoSQL Deployment - DAT33...
Case Study: Sprinklr Uses Amazon EBS to Maximize Its NoSQL Deployment - DAT33...Case Study: Sprinklr Uses Amazon EBS to Maximize Its NoSQL Deployment - DAT33...
Case Study: Sprinklr Uses Amazon EBS to Maximize Its NoSQL Deployment - DAT33...
 

More from Spark Summit

Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Spark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Spark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
Spark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Spark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
Spark Summit
 

Realtime Analytical Query Processing and Predictive Model Building on High Dimensional Document Datasets with Timestamps: talk by Debasish Das

  • 1. © Verizon 2017 All Rights Reserved. Information contained herein is provided AS IS and subject to change without notice. All trademarks used herein are property of their respective owners.
      Real-time analytical query processing and predictive model building on high dimensional document datasets with timestamps
      Debasish Das, Distinguished Engineer
      Contributors
      Algorithm: Santanu Das, Zhengming Xing
      Platform: Ponrama Jegan
      Frontend: Altaff Shaik, Jon Leonhardt
  • 2. Data Overview
      • Location data
          – Each srcip defined as unique row key
          – Provides approximate location of each srcip
          – Timeseries containing latitude, longitude, error bound, duration, timezone for each srcip
      • Clickstream data
          – Each srcip defined as unique row key
          – Timeseries containing startTime, duration, httphost, httpuri, upload/download bytes, httpmethod
          – Compatible with IPFIX/Netflow formats
  • 3. Marketing Analytics
      • Aggregate anonymous analysis for insights
      • Spark Summit Europe 2016
      • Spark Summit East 2017
      [Diagram labels: Lookalike Modeling, Discriminant Analysis, Demand Prediction, Location Clustering]
  • 4. Data Model
      • Schema: srcip, timestamp, tld, zip, tldvisits, zipvisits
      • Dense dimension, dense measure
          – Data: 10.1.13.120, d1H2, company1.com, 94555, 2, 4
      • Sparse dimension, dense measure
          – Data: 10.1.13.120, d1, {company1.com, company2.com}, {94555, 94301}, 10, 15
      • Sparse dimension, sparse measure
          – Data: 10.1.13.120, d1, {company1.com, company2.com}, {94555, 94301}, {company1.com:4, company2.com:6}, {94555:8, 94301:7}
      • Timestamp optional
      • Competing technologies: PowerDrill, Druid, LinkedIn Pinot, Essbase
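The relationship between the sparse-measure and dense-measure rows on the data model slide can be sketched in plain Scala. This is only an illustration: `SparseRecord` and `DataModelSketch` are hypothetical names, not Trapezium types, and the sketch assumes the dense measure is simply the sum of the per-key counts, which matches the slide's example values (4 + 6 = 10, 8 + 7 = 15).

```scala
// Hypothetical types for illustration only; not part of Trapezium.
final case class SparseRecord(
    srcip: String,
    day: String,
    tldVisits: Map[String, Int], // sparse measure: visits per tld
    zipVisits: Map[String, Int]) // sparse measure: visits per zip

object DataModelSketch {
  // Sparse dimension, dense measure: keep the key sets as dimensions and
  // collapse each per-key count map into a single total (assumed to be a sum).
  def toDenseMeasure(r: SparseRecord): (String, String, Set[String], Set[String], Int, Int) =
    (r.srcip, r.day, r.tldVisits.keySet, r.zipVisits.keySet,
      r.tldVisits.values.sum, r.zipVisits.values.sum)
}
```

Applied to the slide's sparse-measure row, this yields the totals 10 and 15 shown in the dense-measure row.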
  • 5. Lucene Document Mapping
      • Example
          Schema: srcip, timestamp, tld, zip, tldvisits, zipvisits
          Data: 10.1.13.120, d1, {company1.com, company2.com}, 94555, 10, 15
          Data: 10.1.13.120, d4, {company1.com, company3.com}, 94301, 12, 8
      • DataFrame Row to Lucene Document mapping
          schema          Row                     Document          OLAP
          srcip           StringType              Stored            Measure
          timestamp       TimestampType           Stored            Dimension
          tld             ArrayType[StringType]   Indexed + Stored  Dimension
          zip             StringType              Indexed + Stored  Dimension
          tld/zipvisits   IntegerType             Stored            Measure
  • 6. Lucene Storage
      • Row storage: Spark Summit Europe 2016
          – 2 indirect disk seeks for retrieval
      Reference: http://www.slideshare.net/lucenerevolution/willnauer-simon-doc-values-column-stride-fields-in-lucene
  • 7. Lucene Column Store
      • Column storage: Spark Summit East 2017
          – References: LUCENE-3108, LUCENE-2935, LUCENE-2168, LUCENE-1231
          – Cache friendly column retrieval: 1 direct disk seek
          – Integer column: Min-Max encoding
          – Numeric column: Uncompressed
          – Binary column: Referenced
          – Complex Type: Binary + Kryo
      [Figure labels: Integer, Binary]
  • 8. DeviceAnalyzer
      • Goals
          – srcip/visits as dense measure
          – Real-Time queries
              • Aggregate
              • Group
              • Time-series
          – Real-Time Time-series forecast
  • 9. Trapezium
      DAIS Open Source framework to build batch, streaming and API services
      https://github.com/Verizon/trapezium
  • 10. Trapezium LuceneDAO
      • SparkSQL optimized for full scan
          – Column indexing not supported
      • Fulfills Real-Time requirements for OLAP queries
      • Lucene for indexing + storage per executor
      • Spark operators for distributed aggregation
          – treeAggregate
          – mapPartitions + treeReduce
      • Features
          – Build Distributed Lucene Shards from DataFrame
          – Access saved shards through LuceneDAO for Analytics + ML pipelines
          – Save shards to HDFS for QueryProcessor like SolrCloud
  • 11. LuceneDAO Indexing
      [Diagram: raw clickstream URLs are reduced to (srcip, tld, visits) records, then indexed into a reverse index and a column store]
      • Example input URLs
          /?ref=1108&?url=http://www.macys.com&id=5
          www.walmart.com%2Fc%2Fep%2Frange-hood-filters&sellermemid=459
          http%3A%2F%2Fm.macys.com%2Fshop%2Fproduct%2Fjockey-elance-cotton
          m.amazon.com%2Fshop%2Fproduct%2Fjockey-elance-cotton
          https://www.walmart.com/ip/Women-Pant-Suit-Roundtree
          walmart://ip/?veh=dsn&wmlspartner
          m.macys.com%2Fshop%2Fsearch%3Fkeyword%3DDress
      • Extracted records (as labelled in the diagram)
          ip1, macys.com, 2; ip1, walmart.com, 1
          ip1, macys.com: 1; ip2, walmart.com: 1; ip1, amazon.com: 1
          ip1, macys.com: 2; ip2, walmart.com: 1
      • Dictionary: macys.com, 0; walmart.com, 1; amazon.com, 2
      • reverse-index (tld → doc): macys.com [ip1]; walmart.com [ip1, ip2]; amazon.com [ip1]
      • column-store
          – measure: [srcip, visits]; dimension: [tld]
          – srcip: ip1, ip2
          – visits: 7, 2
          – tld: [0,1,2], [1]
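The three structures on the indexing slide (dictionary, reverse index, dictionary-encoded column store) can be sketched with plain Scala collections. This is a minimal sketch, not LuceneDAO's implementation: `IndexSketch` and `Doc` are hypothetical names, documents are identified by their position in the input sequence, and dictionary ids are assumed to be assigned in first-seen order.

```scala
object IndexSketch {
  // One document per device: srcip, the tlds it visited, total visits.
  final case class Doc(srcip: String, tlds: Seq[String], visits: Int)

  // Dictionary: each distinct dimension value gets a dense integer id,
  // assigned in first-seen order.
  def dictionary(docs: Seq[Doc]): Map[String, Int] =
    docs.flatMap(_.tlds).distinct.zipWithIndex.toMap

  // Reverse index: dimension value -> positions of the documents containing it.
  def reverseIndex(docs: Seq[Doc]): Map[String, Seq[Int]] =
    docs.zipWithIndex
      .flatMap { case (d, docId) => d.tlds.distinct.map(_ -> docId) }
      .groupBy(_._1)
      .map { case (tld, pairs) => tld -> pairs.map(_._2) }

  // Column store for the dimension: per-document values replaced by dictionary ids.
  def tldColumn(docs: Seq[Doc], dict: Map[String, Int]): Seq[Seq[Int]] =
    docs.map(_.tlds.map(dict))
}
```

With the slide's two devices (ip1 visiting macys.com, walmart.com and amazon.com; ip2 visiting walmart.com), the tld column comes out as [0,1,2] and [1], and walmart.com maps to both documents in the reverse index.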
  • 12. LuceneDAO API
      Index Creation
          import trapezium.dal.lucene._
          import org.apache.spark.sql.types._
          object DeviceIndexer extends BatchTransaction {
            process(dfs: Map[String, DataFrame], batchTime: Time): {
              df = dfs("DeviceStore")
              olapDf = rollup(df)
            }
            persist(df: DataFrame, batchTime: Time): {
              val dimensions = Set("tld", "zip")
              val types = Map("tld" -> LuceneType(true, StringType),
                              "srcip" -> LuceneType(false, StringType),
                              "visits" -> LuceneType(false, IntegerType))
              val dao = new LuceneDAO("path", dimensions, types)
              dao.index(df, new Time(batchTime))
            }
          }
      Query Processing
          import trapezium.dal.lucene._
          import org.apache.spark.sql.types._
          Load:
            val dimensions = Set("tld", "zip")
            val types = Map("tld" -> LuceneType(true, StringType),
                            "srcip" -> LuceneType(false, StringType),
                            "visits" -> LuceneType(false, IntegerType))
            val dao = new LuceneDAO("path", dimensions, types)
            dao.load(sc)
          Queries:
            dao.aggregate(query: String, measure: String, aggregator: String)
            dao.group(query: String, dimension: String, measure: String, aggregator: String)
            dao.timeseries(query: String, minTime: Long, maxTime: Long, rollup: Long, measure: String, aggregator: String)
            dao.search(query: String, columns: Seq[String]): DataFrame
  • 13. LuceneDAO Internals
      • Retrieve documents with/without relevance
      • ColumnAccessor over dimension + measures
      • Disk / In-Memory ColumnAccessor
      • C-store style while loops over dimension
      • Spark ML style aggregators
      • treeAggregate for distributed aggregation
  • 14. Aggregation Architecture
  • 15. Index Generation
      • Dataset details: 57M devices, 4.2B docs
      • Parquet: 79 GB
      • Lucene Reverse Index: 16 GB
      • Lucene DocValues: 59.6 GB
      • Global Dictionary Size: 5.5 MB
      • Executors: 20, Cores: 8
      • RAM: Driver 16g, Executor 16g
      • Runtime
          – Parquet: 1831.87 s
          – Dictionary: 213.7 s
          – Index + Stored: 360 s
  • 16. Aggregate Queries
      • HashSet aggregation
      • SparkSQL
          df.select("srcip", "tld")
            .where(array_contains(df("tld"), "company1.com"))
            .agg(countDistinct("srcip") as "visits")
            .collect()
      • LuceneDAO
          dao.aggregate("tld:company1.com", "srcip", "count")
      [Chart: Runtime (s) vs. qps at 1, 5, 10, 20; legend: spark-sql1.6, spark-sql2.0; data labels: 61.63, 158.4, 285.53, 538.11 and 3.82, 6.65, 14.25, 20.64]
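The HashSet aggregation named on the aggregate-queries slide, combined with the treeAggregate pattern from the internals slide, can be sketched as a pair of `seqOp`/`combOp` functions over plain collections. This is an assumption-laden illustration rather than LuceneDAO code: `DistinctCountSketch` and `Doc` are hypothetical, and the nested `Seq` merely stands in for per-executor shards so the example runs without Spark.

```scala
object DistinctCountSketch {
  final case class Doc(srcip: String, tlds: Set[String])

  // seqOp: fold one matching document into a shard-local set of srcips.
  def seqOp(acc: Set[String], d: Doc): Set[String] = acc + d.srcip

  // combOp: merge the partial sets produced by two shards.
  def combOp(a: Set[String], b: Set[String]): Set[String] = a ++ b

  // `partitions` stands in for per-executor shards; each shard is filtered and
  // aggregated locally, then the partial sets are merged and counted.
  def countDistinct(partitions: Seq[Seq[Doc]], tld: String): Int =
    partitions
      .map(_.filter(_.tlds.contains(tld)).foldLeft(Set.empty[String])(seqOp))
      .foldLeft(Set.empty[String])(combOp)
      .size
}
```

In Spark the same two functions would be passed to `RDD.treeAggregate`, which merges the partial sets in a multi-level tree instead of collecting them all at the driver.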
  • 17. Group Queries
      • HLL aggregation
      • SparkSQL
          df.select("srcip", "tld", "zip")
            .where(array_contains(df("tld"), "company1.com"))
            .select("zip", "srcip").groupBy("zip")
            .agg(approxCountDistinct("srcip") as "visits")
            .collect()
      • LuceneDAO
          dao.group("tld:company1.com", "zip", "srcip", "count_approx")
      [Chart: Runtime (s) vs. qps at 1, 5, 10, 20; legend: spark-sql1.6, spark-sql2.0, lucene-dao; data labels: 58.07, 174.44, 298.67, 669.69 and 6.52, 11.92, 12.72, 20.29]
  • 18. Device Heat-Map
      [Map figure: company1.com]
  • 19. Time-series Queries
      • HLL aggregation
      • SparkSQL
          df.select("time", "srcip", "tld")
            .where(array_contains(df("tld"), "company1.com"))
            .select("time", "srcip").groupBy("time")
            .agg(approxCountDistinct("srcip") as "visits")
            .collect()
      • LuceneDAO
          dao.timeseries("tld:company1.com", minTime, maxTime, rollup, "srcip", "count_approx")
      Complex query supported: tld:company1.com AND zip:94* ….
      [Chart: Runtime (s) vs. qps at 1, 5, 10, 20; legend: spark-sql1.6, spark-sql2.0, lucene-dao; data labels: 54.88, 169.02, 279.44, 528.88 and 1.99, 4.59, 7.31, 13.34]
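The `minTime`, `maxTime` and `rollup` parameters of the time-series query suggest fixed-width time buckets with a distinct device count per bucket. The sketch below assumes exactly that and should be read as a guess at the semantics, not as LuceneDAO's behaviour: `TimeseriesSketch` and `Event` are hypothetical, the range is treated as half-open `[minTime, maxTime)`, and an exact `HashSet` replaces the HLL sketch named on the slide to keep the example dependency-free.

```scala
import scala.collection.mutable

object TimeseriesSketch {
  final case class Event(srcip: String, time: Long)

  // Count distinct srcips per rollup-wide bucket over [minTime, maxTime).
  // An exact set per bucket is used here in place of HyperLogLog.
  def rollupCounts(events: Seq[Event], minTime: Long, maxTime: Long, rollup: Long): Array[Int] = {
    require(rollup > 0 && maxTime > minTime, "need rollup > 0 and maxTime > minTime")
    // Round up so that a partial final bucket is still represented.
    val buckets = ((maxTime - minTime + rollup - 1) / rollup).toInt
    val seen = Array.fill(buckets)(mutable.HashSet.empty[String])
    for (e <- events if e.time >= minTime && e.time < maxTime)
      seen(((e.time - minTime) / rollup).toInt) += e.srcip
    seen.map(_.size)
  }
}
```

The resulting array of per-bucket counts has the same shape as the `timeseries: Array[Double]` input that the forecasting step expects, after a conversion to `Double`.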
  • 20. Time-Series Forecast
      • Given a query:
          select timestamp, srcip
            .where(tld='company1.com' AND state='CA')
            .groupBy("time")
            .agg(approxCountDistinct("srcip") as "visits")
      • Predict deviceCount for next timestamp
      • Forecast deviceCount for next N timestamps
      Trapezium ML: TimeSeriesKNNRegression.predict
          Input:
            timeseries: Array[Double]
            topk: Int
            featureDim: Int
            normalize: Boolean
            multiStep: Int
            metric: KernelType=Euclidean
          Output:
            predicted values: Array[Double]
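One common way to do k-nearest-neighbour regression on a time series, using the parameter names from the forecast slide, is sketched below. It is not the Trapezium ML implementation, whose internals the slides do not show: `KnnForecastSketch` is hypothetical, `featureDim` is assumed to be the sliding-window length, the prediction is assumed to be the plain mean of the values that followed the `topk` closest windows, multi-step forecasts are assumed to feed each prediction back into the history, and `normalize` is omitted.

```scala
object KnnForecastSketch {
  // Euclidean distance between two equally sized windows.
  private def dist(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // One-step prediction: average the successors of the topk historical
  // windows closest to the most recent window.
  def predictNext(ts: Array[Double], topk: Int, featureDim: Int): Double = {
    require(topk > 0 && featureDim > 0, "topk and featureDim must be positive")
    require(ts.length > featureDim, "need at least featureDim + 1 points")
    val query = ts.takeRight(featureDim)
    // Pair every window that has a successor with the value that followed it.
    val examples = (0 until ts.length - featureDim).map { i =>
      (ts.slice(i, i + featureDim), ts(i + featureDim))
    }
    val nearest = examples.sortBy { case (w, _) => dist(w, query) }.take(topk)
    nearest.map(_._2).sum / nearest.size
  }

  // Multi-step forecast: append each prediction to the history and repeat.
  def predict(ts: Array[Double], topk: Int, featureDim: Int, multiStep: Int): Array[Double] = {
    var history = ts
    val out = Array.newBuilder[Double]
    for (_ <- 0 until multiStep) {
      val next = predictNext(history, topk, featureDim)
      out += next
      history = history :+ next
    }
    out.result()
  }
}
```

On a strictly periodic series such as 1, 2, 3, 1, 2, 3, 1, 2, 3 with `featureDim = 2` and `topk = 1`, the nearest window always matches exactly, so the forecast continues the cycle.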
  • 21. Forecast Service
      Powered by Trapezium API
          httpServer = {
            provider = "akka"
            hostname = "localhost"
            port = 19999
            contextPath = "/"
            endPoints = [{
              path = "analyzer-api"
              className = "TimeseriesEndPoint"
            }]
          }
          class TimeseriesEndPoint(sc: SparkContext) extends SparkServiceEndPoint(sc) {
            override def route : timeseriesRoute
            val types = Map("tld" -> LuceneType(true, StringType),
                            "srcip" -> LuceneType(false, StringType),
                            "visits" -> LuceneType(false, IntegerType))
            val dao = new LuceneDAO("path", dimension, types)
            dao.load(sc)
            def timeseriesRoute : {
              post { request => {
                ts = dao.timeseries(request, minTime, maxTime, rollup, "srcip", "count_approx")
                predicted = TimeseriesKNNRegression.predict(ts, topk=5, featureDim=3, normalize=false, multiStep=5, metric=Euclidean)
                generateResponse(ts, predicted)
              } }
            }
          }
  • 22. Device-Count Forecast
      [Chart: 5 step prediction for company1.com]
  • 23. Thank You. Q&A
      Join us and make machines intelligent
      Data & Artificial Intelligence Systems
      499 Hamilton Ave, Palo Alto, California