Realtime Analytical Query Processing and Predictive Model Building on High Dimensional Document Datasets with Timestamps: Spark Summit East talk by Debasish Das
Spark SQL and MLlib are optimized for running feature extraction and machine learning algorithms over row-based columnar datasets via full scans, but they provide no constructs for column indexing or time series analysis. Document datasets with timestamps, where features appear as a variable number of columns per document and use cases demand searching over columns and time to retrieve documents and build learning models in real time, call for a close integration of Spark and Lucene. At Spark Summit Europe 2016 we introduced LuceneDAO, which builds distributed Lucene shards from a DataFrame, but time series attributes were not part of the data model. In this talk we present our extension to LuceneDAO that maintains timestamps alongside the document-term view and supports time filters. Lucene shards maintain the time-aware document-term view for search and a vector space representation for machine learning pipelines. Spark serves as our distributed query processing engine, where each query is a boolean combination over terms with filters on time; LuceneDAO loads the shards onto Spark executors and powers sub-second distributed document retrieval.
Our synchronous API uses Spark-as-a-Service to power analytical queries, while our asynchronous API uses Kafka, Spark Streaming, and HBase to power time series prediction algorithms. We will demonstrate LuceneDAO write and read performance on millions of documents with 1M+ terms and configurable timestamp aggregate columns, and the latency of both APIs on a suite of queries generated from terms. The key takeaway is a thorough understanding of how to make Lucene-powered, time-aware search a first-class citizen in Spark for interactive analytical query processing and time series prediction.
1. © Verizon 2017 All Rights Reserved
Information contained herein is provided AS IS and subject to change without notice. All trademarks used herein are property of their respective owners.
Real-time analytical query processing and
predictive model building on
high dimensional document datasets with
timestamps
Debasish Das
Data & Artificial Intelligence, Verizon
Contributors
Algorithm: Santanu Das, Zhengming Xing
Platform: Ponrama Jegan
Frontend: Altaff Shaik, Jon Leonhardt
2.
Data Overview
• Location data
  – Each srcip defined as a unique row key
  – Provides approximate location of each srcip
  – Timeseries containing latitude, longitude, error bound, duration, timezone for each srcip
• Clickstream data
  – Contains clickstream data of each row key
  – Contains startTime, duration, httphost, httpuri, upload/download bytes, httpmethod
  – Compatible with IPFIX/Netflow formats
3.
Marketing Analytics
• Lookalike modeling
• Competitive analysis
• Demand Prediction
• Location Clustering
• Aggregate anonymous analysis for insights (Spark Summit Europe 2016, Spark Summit East 2017)
4.
Data Model
• Schema: srcip, timestamp, tld, zip, tldvisits, zipvisits
• Dense dimension, dense measure
  – Data: 10.1.13.120, d1H2, company1.com, 94555, 2, 4
• Sparse dimension, dense measure
  – Data: 10.1.13.120, d1, {company1.com, company2.com}, {94555, 94301}, 10, 15
• Sparse dimension, sparse measure
  – Data: 10.1.13.120, d1, {company1.com, company2.com}, {94555, 94301}, {company1.com:4, company2.com:6}, {94555:8, 94301:7}
• Timestamp optional
• Competing technologies: PowerDrill, Druid, LinkedIn Pinot, Essbase
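The three layouts above can be sketched as plain Scala case classes. This is an illustrative sketch only; the class and field names are assumptions, not the actual LuceneDAO data model:

```scala
// Dense dimension, dense measure: one (tld, zip) pair per row.
case class DenseRow(srcip: String, ts: String,
                    tld: String, zip: String,
                    tldVisits: Int, zipVisits: Int)

// Sparse dimension, dense measure: multi-valued dimensions, scalar totals.
case class SparseDimRow(srcip: String, ts: String,
                        tlds: Seq[String], zips: Seq[String],
                        tldVisits: Int, zipVisits: Int)

// Sparse dimension, sparse measure: a measure value per dimension value.
case class SparseRow(srcip: String, ts: String,
                     tlds: Seq[String], zips: Seq[String],
                     tldVisits: Map[String, Int], zipVisits: Map[String, Int])

// The example records from the slide, under the dense and fully sparse layouts.
val dense = DenseRow("10.1.13.120", "d1H2", "company1.com", "94555", 2, 4)
val sparse = SparseRow("10.1.13.120", "d1",
  Seq("company1.com", "company2.com"), Seq("94555", "94301"),
  Map("company1.com" -> 4, "company2.com" -> 6),
  Map("94555" -> 8, "94301" -> 7))
```

Note how the sparse-measure maps roll up to the dense totals: summing the per-tld visit counts recovers the scalar tldvisits of the middle layout.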
6.
Lucene Storage
• Row storage: Spark Summit Europe 2016
  – 2 indirect disk seeks for retrieval
Reference: http://www.slideshare.net/lucenerevolution/willnauer-simon-doc-values-column-stride-fields-in-lucene
7.
Lucene Column Store
• Column storage: Spark Summit East 2017
  – References: LUCENE-3108, LUCENE-2935, LUCENE-2168, LUCENE-1231
  – Cache-friendly column retrieval: 1 direct disk seek
  – Integer column: Min-Max encoding
  – Numeric column: Uncompressed
  – Binary column: Referenced
  – Complex type: Binary + Kryo
8.
DeviceAnalyzer
• Goals
  – srcip/visits as dense measure
  – Real-time queries
    • Aggregate
    • Group
    • Timeseries
  – Real-time timeseries forecast
9.
Trapezium
DAIS Open Source framework to build batch, streaming and API services
https://github.com/Verizon/trapezium
10.
Trapezium LuceneDAO
• SparkSQL optimized for full scan
  – Column indexing not supported
• Fulfills real-time requirements for OLAP queries
• Lucene for indexing + storage per executor
• Spark operators for distributed aggregation
  – treeAggregate
  – mapPartition + treeReduce
• Features
  – Build distributed Lucene shards from DataFrame
  – Access saved shards through LuceneDAO for analytics + ML pipelines
  – Save shards to HDFS for a QueryProcessor like SolrCloud
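The distributed aggregation pattern above can be simulated in plain Scala without a cluster. In the sketch below, each inner Seq stands in for a Spark partition, a per-partition HashSet is the partial aggregate (as in a distinct count of srcips), and partials are merged pairwise in a tree, which is what Spark's treeAggregate does across executors. Names here are illustrative, not LuceneDAO internals:

```scala
// Simulate treeAggregate-style distinct counting over "partitions".
def distinctCount(partitions: Seq[Seq[String]]): Int = {
  // seqOp: fold each partition's rows into a partial set
  val partials: Seq[Set[String]] = partitions.map(_.toSet)
  // combOp: merge partials pairwise, tree-style, until one set remains
  def treeReduce(sets: Seq[Set[String]]): Set[String] = sets match {
    case Seq(single) => single
    case _ =>
      val merged = sets.grouped(2).map(_.reduce(_ ++ _)).toSeq
      treeReduce(merged)
  }
  treeReduce(partials).size
}

// 3 simulated partitions, 4 distinct srcips overall
val parts = Seq(
  Seq("10.1.13.120", "10.1.13.121"),
  Seq("10.1.13.121", "10.1.13.122"),
  Seq("10.1.13.123"))
```

The tree shape matters at scale: merging partials level by level keeps the driver from receiving one large partial per executor at once, which is why treeAggregate is preferred over a flat reduce for wide aggregates.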
12.
LuceneDAO API
import trapezium.dal.lucene._
import org.apache.spark.sql.types._

object DeviceIndexer extends BatchTransaction {

  def process(dfs: Map[String, DataFrame], batchTime: Time): DataFrame = {
    val df = dfs("DeviceStore")
    rollup(df)
  }

  def persist(df: DataFrame, batchTime: Time): Unit = {
    val dimensions = Set("tld", "zip")
    val types = Map(
      "tld" -> LuceneType(true, StringType),
      "srcip" -> LuceneType(false, StringType),
      "visits" -> LuceneType(false, IntegerType))
    val dao = new LuceneDAO("path", dimensions, types)
    dao.index(df, new Time(batchTime))
  }
}
Index Creation
import trapezium.dal.lucene._
import org.apache.spark.sql.types._

Load:
val dimensions = Set("tld", "zip")
val types = Map(
  "tld" -> LuceneType(true, StringType),
  "srcip" -> LuceneType(false, StringType),
  "visits" -> LuceneType(false, IntegerType))
val dao = new LuceneDAO("path", dimensions, types)
dao.load(sc)

Queries:
dao.aggregate(query: String, measure: String, aggregator: String)
dao.group(query: String, dimension: String, measure: String, aggregator: String)
dao.timeseries(query: String, minTime: Long, maxTime: Long, rollup: Long, measure: String, aggregator: String)
dao.search(query: String, columns: Seq[String]): DataFrame
Query Processing
13.
LuceneDAO Internals
• Retrieve documents with/without relevance
• ColumnAccessor over dimensions + measures
• Disk / in-memory ColumnAccessor
• C-store style while loops over dimensions
• Spark ML style aggregators
• treeAggregate for distributed aggregation
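The C-store style loop above is the hot path of columnar aggregation: dimension and measure values live in parallel arrays indexed by docId, and a tight while loop scans them with no per-row object allocation. A minimal sketch, with plain arrays standing in for LuceneDAO's Disk/In-Memory ColumnAccessor (the names and layout are assumptions):

```scala
// Sum a measure column over docs whose dictionary-encoded dimension matches.
def sumWhere(dim: Array[Int], measure: Array[Long], wanted: Int): Long = {
  var i = 0
  var acc = 0L
  while (i < dim.length) {          // tight scan, one slot per docId
    if (dim(i) == wanted) acc += measure(i)
    i += 1
  }
  acc
}

// dictionary-encoded tld column: 0 = company1.com, 1 = company2.com
val tld    = Array(0, 1, 0, 0, 1)
val visits = Array(2L, 6L, 3L, 5L, 1L)
```

The while loop (rather than a `map`/`filter` chain) is deliberate: it keeps the scan branch-predictable and allocation-free, which is what makes columnar scans cache friendly.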
14.
Aggregation Architecture
15.
Index Generation
• Dataset: 57M devices, 4.2B docs
• Parquet: 79 GB
• Lucene reverse index: 16 GB
• Lucene DocValues: 59.6 GB
• Global dictionary size: 5.5 MB
• Executors: 20, cores: 8
• RAM – driver: 16g, executor: 16g
• Runtime
  – Parquet: 1831.87 s
  – Dictionary: 213.7 s
  – Index + stored: 360 s
16.
Aggregate Queries
• HashSet aggregation
• SparkSQL
df.select("srcip", "tld")
  .where(array_contains(df("tld"), "company1.com"))
  .agg(countDistinct("srcip") as "visits")
  .collect()
• LuceneDAO
dao.aggregate("tld:company1.com", "srcip", "count")
[Bar chart: runtime (s) vs. query rate (1, 5, 10, 20 qps) for spark-sql 1.6, spark-sql 2.0, and lucene-dao; lucene-dao bars labeled 3.82 s, 6.65 s, 14.25 s, 20.64 s]
17.
Group Queries
• HLL aggregation
• SparkSQL
df.select("srcip", "tld", "zip")
  .where(array_contains(df("tld"), "company1.com"))
  .select("zip", "srcip").groupBy("zip")
  .agg(approxCountDistinct("srcip") as "visits")
  .collect()
• LuceneDAO
dao.group("tld:company1.com", "zip", "srcip", "count_approx")
[Bar chart: runtime (s) vs. query rate (1, 5, 10, 20 qps) for spark-sql 1.6, spark-sql 2.0, and lucene-dao; lucene-dao bars labeled 6.52 s, 11.92 s, 12.72 s, 20.29 s]
18.
Device Heat-Map
company1.com
19.
Timeseries Queries
• HLL aggregation
• SparkSQL
df.select("time", "srcip", "tld")
  .where(array_contains(df("tld"), "company1.com"))
  .select("time", "srcip").groupBy("time")
  .agg(approxCountDistinct("srcip") as "visits")
  .collect()
• LuceneDAO
dao.timeseries("tld:company1.com", minTime, maxTime, rollup, "srcip", "count_approx")
[Bar chart: runtime (s) across four query workloads for spark-sql 1.6, spark-sql 2.0, and lucene-dao; lucene-dao bars labeled 1.99 s, 4.59 s, 7.31 s, 13.34 s]
20.
TimeSeries Forecast
• Given a query:
  select timestamp, count(srcip) as deviceCount
  where tld="company1.com" AND state="CA"
• Predict deviceCount for the next timestamp
• Forecast deviceCount for the next N timestamps

TimeSeriesKNNRegression.predict
Input:
  timeseries: Array[Double]
  topk: Int
  featureDim: Int
  normalize: Boolean
  multiStep: Int
  metric: KernelType = Euclidean
Output:
  predicted values: Array[Double]
Trapezium ML
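A KNN forecaster in this spirit can be sketched in a few lines of plain Scala. This standalone version is illustrative only (the real Trapezium ML signature is the one on the slide): embed the series into sliding windows of length featureDim, find the topk historical windows closest to the most recent window under Euclidean distance, predict the mean of the values that followed them, and roll the forecast forward for multiStep steps.

```scala
def knnForecast(series: Array[Double], topk: Int,
                featureDim: Int, multiStep: Int): Array[Double] = {
  def euclidean(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  val history = series.toBuffer
  val out = Array.ofDim[Double](multiStep)
  for (step <- 0 until multiStep) {
    // the most recent window is the query point
    val query = history.takeRight(featureDim).toArray
    // candidate windows are those with a known "next" value
    val candidates = (0 to history.length - featureDim - 1).map { i =>
      val window = history.slice(i, i + featureDim).toArray
      (euclidean(window, query), history(i + featureDim))
    }
    val next = candidates.sortBy(_._1).take(topk).map(_._2)
    out(step) = next.sum / next.length
    history += out(step)            // roll the forecast forward
  }
  out
}
```

On a perfectly periodic series the nearest neighbor of the latest window is an exact earlier copy, so the forecast reproduces the cycle; normalize and non-Euclidean kernels (as in the slide's signature) matter once series have trend or scale drift.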
21.
Forecast Service
httpServer = {
  provider = "akka"
  hostname = "localhost"
  port = 19999
  contextPath = "/"
  endPoints = [{
    path = "analyzer-api"
    className = "TimeseriesEndPoint"
  }]
}
Powered by Trapezium API
class TimeseriesEndPoint(sc: SparkContext)
    extends SparkServiceEndPoint(sc) {

  override def route = timeseriesRoute

  val types = Map(
    "tld" -> LuceneType(true, StringType),
    "srcip" -> LuceneType(false, StringType),
    "visits" -> LuceneType(false, IntegerType))
  val dao = new LuceneDAO("path", dimensions, types)
  dao.load(sc)

  def timeseriesRoute = post { request =>
    val ts = dao.timeseries(request, minTime, maxTime, rollup,
      "srcip", "count_approx")
    val predicted = TimeseriesKNNRegression.predict(ts, topk = 5,
      featureDim = 3, normalize = false, multiStep = 5, metric = Euclidean)
    generateResponse(ts, predicted)
  }
}
22.
Device-Count Forecast
5-step prediction
company1.com
23.
Thank You.
Q&A
Join us and make machines intelligent
Data & Artificial Intelligence Systems
499 Hamilton Ave, Palo Alto
California