Secondary Sort and a Custom Comparator
What is Time Series Data? 
•In statistics, signal processing, econometrics and mathematical finance, a time series is a sequence of data points, typically measured at successive time instants spaced at uniform intervals. 
•Examples of time series data are the daily adjusted close price of a stock on the NYSE or sensor readings on a power grid occurring 30 times a second. 
•Time series as a general class of problems has typically resided in the scientific and financial domains. 
•However, due to the ongoing explosion of available data, time series data is becoming more prevalent across a wider swath of industries. 
•Time series sensors are being ubiquitously integrated in places like: 
–The power grid, aka “the smart grid” 
–Cellular services 
–Military and environmental uses 
•Understanding how to refactor traditional approaches to these time series problems for MapReduce can potentially allow us to improve processing and analysis techniques in a timely fashion.
Current approaches 
•The financial industry has long been interested in time series data and has employed programming languages such as R to help deal with this class of problem. 
•So, why would a sector create a programming language specifically for one class of data when technologies like the RDBMS have existed for decades? 
•In reality, current RDBMS technology has limitations when dealing with high-resolution time series data. 
•These limiting factors include: 
–High-frequency time series data coming from a variety of sources can create huge amounts of data in very little time. 
–RDBMSs do not handle storing and indexing billions of rows well. 
–Non-distributed RDBMSs do not scale well into the hundreds of GBs, let alone TBs or PBs. 
–RDBMSs that can scale into those arenas tend to be very expensive, or require large amounts of specialized hardware. 
–To process high-resolution time series data with an RDBMS we’d need to use an analytic aggregate function in tandem with moving window predicates (e.g., the “OVER” clause), which results in rapidly increasing amounts of work as the granularity of the time series data gets finer. 
–Query results do not compose cleanly, and variable-step sliding windows (e.g., stepping 5 seconds per window move) cannot be done without significant unnecessary intermediate work or non-standard SQL functions. 
–RDBMS queries for certain time series techniques can be awkward and tend to require premature subdividing of the data and costly reconstruction during processing (example: data mining, iSAX decompositions). 
–Due to the above factors, RDBMS performance degrades as the volume of time series data scales up.
Example Problem : Simple Moving Average 
•A simple moving average is the series of unweighted averages of a subset of time series data points as a sliding window progresses over the time series data set. 
•Each time the window is moved we recalculate the average of the points in the window. 
•This produces a set of numbers representing the final moving average. 
•Typically the moving average technique is used with time series to highlight longer term trends or smooth out short-term noise. 
•Moving averages are similar to low pass filters in signal processing, and mathematically are considered a type of convolution. 
•In other terms, we take a window and fill it in a First In First Out (FIFO) manner with time series data points until we have N points in it. 
•We then take the average of these points and add this to our answer list. 
•We slide our window forward by M data points and again take the average of the data points in the window. 
•This process is repeated until the window can no longer be filled at which point the calculation is complete. 
•Let N=30, M = 1
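To make the windowing concrete, here is a minimal, self-contained sketch of the sliding-window calculation described above, written in plain Java outside of MapReduce; the class and method names are illustrative only, and the example in main() uses a tiny window of N = 3 so the output is easy to follow.

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Stand-alone sketch of the sliding-window average described above.
public class SimpleMovingAverage {

    // Computes the moving averages over 'points' using a window of n points
    // that slides forward by m points after each average is taken.
    public static List<Double> compute(double[] points, int n, int m) {
        List<Double> averages = new ArrayList<>();
        Deque<Double> window = new ArrayDeque<>();
        double sum = 0.0;
        for (double p : points) {
            window.addLast(p);                   // fill the window FIFO-style
            sum += p;
            if (window.size() == n) {
                averages.add(sum / n);           // average of the current window
                for (int j = 0; j < m && !window.isEmpty(); j++) {
                    sum -= window.removeFirst(); // slide forward by m points
                }
            }
        }
        return averages;                         // ends once the window can no longer be filled
    }

    public static void main(String[] args) {
        double[] adjClose = {36.6, 38.37, 38.71, 38.0, 38.32};
        System.out.println(compute(adjClose, 3, 1)); // N = 3, M = 1 for brevity
    }
}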
Data 
•/input/movingaverage/NYSE_daily 
exchange  stock_symbol  date       open   high   low    close  volume    adj close
NYSE      AA            3/5/2008   37.01  37.9   36.13  36.6   17752400  36.6
NYSE      AA            3/4/2008   38.85  39.28  38.26  38.37  11279900  38.37
NYSE      AA            3/3/2008   38.25  39.15  38.1   38.71  11754600  38.71
NYSE      AA            3/2/2008   37.9   38.94  37.1   38     15715600  38
NYSE      AA            3/1/2008   37.17  38.46  37.13  38.32  13964700  38.32
NYSE      AA            2/29/2008  38.77  38.82  36.94  37.14  22611400  37.14
NYSE      AA            2/28/2008  38.61  39.29  38.19  39.12  11421700  39.12
NYSE      AA            2/27/2008  38.19  39.62  37.75  39.02  14296300  39.02
NYSE      AA            2/26/2008  38.59  39.25  38.08  38.5   14417700  38.5
NYSE      AA            2/25/2008  36.64  38.95  36.48  38.85  22500100  38.85
NYSE      AA            2/24/2008  36.38  36.64  35.58  36.55  12834300  36.55
NYSE      AA            2/23/2008  36.88  37.41  36.25  36.3   13078200  36.3
NYSE      AA            2/22/2008  35.96  36.85  35.51  36.83  10906600  36.83
NYSE      AA            2/21/2008  36.19  36.73  35.84  36.2   12825300  36.2
NYSE      AA            2/20/2008  35.16  35.94  35.12  35.72  14082200  35.72
NYSE      AA            2/19/2008  36.01  36.43  35.05  35.36  18238800  35.36
NYSE      AA            2/18/2008  33.75  35.52  33.63  35.51  21082100  35.51
NYSE      AA            2/17/2008  34.33  34.64  33.26  33.49  12418900  33.49
NYSE      AA            2/16/2008  33.82  34.25  33.29  34.06  11249800  34.06
NYSE      AA            2/15/2008  32.67  33.81  32.37  33.76  10731400  33.76
NYSE      AA            2/14/2008  32.24  33.25  31.9   32.78  9058900   32.78
NYSE      AA            2/13/2008  32.95  33.37  32.26  32.41  7230300   32.41
NYSE      AA            2/12/2008  33.3   33.64  32.52  32.67  11338000  32.5
NYSE      AA            2/11/2008  34.57  34.85  33.98  34.08  9528000   33.9
NYSE      AA            2/10/2008  33.67  34.45  33.07  34.28  15186100  34.1
NYSE      AA            2/9/2008   32.13  33.34  31.95  33.09  9200400   32.92
NYSE      AA            2/8/2008   32.58  33.42  32.11  32.7   10241400  32.53
NYSE      AA            2/7/2008   31.73  33.13  31.57  32.66  14338500  32.49
NYSE      AA            2/6/2008   30.27  31.52  30.06  31.47  8445100   31.31
NYSE      AA            2/5/2008   31.16  31.89  30.55  30.69  17567800  30.53
NYSE      AA            2/4/2008   37.01  37.9   36.13  36.6   17752400  10.6
NYSE      AA            2/3/2008   38.85  39.28  38.26  38.37  11279900  8.37
Approach 
•In our simple moving average example, however, we don’t operate on a per-value basis, nor do we produce a single aggregate across all of the values. 
•Our operation in the aggregate sense involves a sliding window, which performs its operations on a subset of the data at each step. 
•We also have to consider that the points in our time series data are not guaranteed to arrive at the reducer in order and need to be sorted. 
•This is because, with multiple map functions reading multiple sections of the source data, MapReduce does not impose any order on the key-value pairs that are grouped together under the default partitioning and sorting schemes. 
•We want to group all of one stock’s adjusted close values together so we can apply the simple moving average operation over the sorted time series data. 
•We want to emit each time series key-value pair keyed on the stock symbol to group these values together (a mapper along these lines is sketched after this list). 
•In the reduce phase we can run an operation, here the simple moving average, over the data. 
•Since the data more than likely will not arrive at the reducer in sorted order, we’ll need to sort the data before we can calculate the simple moving average.
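As a concrete starting point for this approach, here is a minimal mapper sketch that keys each record on its stock symbol and passes along the date and adjusted close. The class name, the comma-separated input format, and the field positions (matching the NYSE_daily layout above) are assumptions rather than part of the original code.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits each record keyed on the stock symbol, carrying the date and the
// adjusted close as the value, so one stock's points are grouped together.
public class StockMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split(",");
        if (fields.length < 9 || "exchange".equals(fields[0])) {
            return; // skip the header row and malformed lines
        }
        String symbol = fields[1];   // stock_symbol, our natural key
        String date = fields[2];
        String adjClose = fields[8];
        context.write(new Text(symbol), new Text(date + "\t" + adjClose));
    }
}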
Problem 
•We’re limited by our Java Virtual Machine (JVM) child heap size and we are taking time to manually sort the data ourselves. 
•With a few design changes, we can solve both of these issues by taking advantage of some inherent properties of MapReduce. 
–First we want to look at the case of sorting the data in memory on each reducer. 
–Currently we have to make sure we never send more data to a single reducer than can fit in memory. 
–The way we can currently control this is to give each reducer child JVM more heap and/or to further partition our time series data in the map phase. 
–In this case we’d partition further by time, breaking our data into smaller windows of time. 
•As opposed to further partitioning of the data, another approach to this issue is to allow Hadoop to sort the data for us in what’s called the “shuffle phase” of MapReduce. 
•If the data arrives at a reducer already in sorted order 
–we can lower our memory footprint, and 
–we can reduce the number of passes through the data by only looking at the next N samples for each simple moving average calculation (a reducer along these lines is sketched after this list).
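Assuming the secondary sort described in the following slides is in place (so that values reach reduce() already ordered by timestamp), a reducer can compute the moving average in a single pass with a fixed-size window. This is only a sketch: the TimeseriesKey key type is introduced later in these slides, and the DoubleWritable value type and WINDOW constant are assumptions.

import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Single-pass moving average over values that arrive pre-sorted by timestamp.
public class MovingAverageReducer
        extends Reducer<TimeseriesKey, DoubleWritable, Text, DoubleWritable> {

    private static final int WINDOW = 30; // N = 30

    @Override
    protected void reduce(TimeseriesKey key, Iterable<DoubleWritable> values, Context context)
            throws IOException, InterruptedException {
        Deque<Double> window = new ArrayDeque<>();
        double sum = 0.0;
        for (DoubleWritable v : values) {      // values already sorted by timestamp
            window.addLast(v.get());
            sum += v.get();
            if (window.size() > WINDOW) {
                sum -= window.removeFirst();   // slide the window by one point (M = 1)
            }
            if (window.size() == WINDOW) {
                context.write(new Text(key.getGroup()), new DoubleWritable(sum / WINDOW));
            }
        }
    }
}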
The shuffle’s “secondary sort” mechanic 
•Sorting is something we can let Hadoop do for us and Hadoop has proven to be quite good at sorting large amounts of data. 
•In using the secondary sort mechanic we can solve both our heap and sort issues fairly simply and efficiently. 
•To employ secondary sort in our code, we need to make the key a composite of the natural key and the natural value.
Composite Key 
•The Composite Key gives Hadoop the needed information during the shuffle to perform a sort not only on the “stock symbol”, but on the time stamp as well. 
•The class that sorts these Composite Keys is called the key comparator. 
•The key comparator should order by the composite key, which is the combination of the natural key and the natural value. 
•Below, an abstract version of secondary sort is shown being performed on a composite key of two integers. 
•A more realistic example is a composite key made of a stock symbol string (K1) and a timestamp (K2). In the diagram the K/V pairs are sorted by both “K1: stock symbol” (the natural key) and “K2: time stamp” (the secondary key).
Partitioning by the natural key 
•Once we’ve sorted our data on the composite key, we now need to partition the data for the reduce phase. 
•Once we’ve partitioned our data the reducers can now start downloading the partition files and begin their merge phase. 
•The NaturalKeyGroupingComparator is used to make sure a reduce() call sees only the logically grouped data for a given natural key.
In short 
•To summarize, there is a recipe here to get the effect of sorting by value: 
–Make the key a composite of the natural key and the natural value. 
–The sort comparator should order by the composite key, that is, the natural key and natural value. 
–The partitioner and grouping comparator for the composite key should consider only the natural key for partitioning and grouping (a sketch of wiring these pieces into a job follows).
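A minimal sketch of how this recipe might be wired into a Hadoop job driver, using the class names from the implementation slides that follow; the mapper, reducer, value type, and input/output setup are omitted, and the driver class name is an assumption.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.mapreduce.Job;

// Wires the composite key, sort comparator, partitioner and grouping
// comparator from the recipe into a job driver.
public class MovingAverageJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "simple moving average");
        job.setJarByClass(MovingAverageJob.class);

        job.setMapOutputKeyClass(TimeseriesKey.class);                        // composite key
        job.setMapOutputValueClass(DoubleWritable.class);                     // assumed value type

        job.setSortComparatorClass(CompositeKeyComparator.class);             // sort by composite key
        job.setPartitionerClass(NaturalKeyPartitioner.class);                 // partition on natural key only
        job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class);   // group on natural key only

        // ... mapper, reducer, and input/output paths would be set here ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}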
Implementation : NaturalKey 
•The natural key is what you would normally use as the key or “group by” operator. 
–In this case the Natural Key is the “group” or “stock symbol” as we need to group potentially unsorted stock data before we can sort it and calculate the simple moving average.
Implementation : Composite Key 
•A Key that is a combination of the natural key and the natural value we want to sort by. 
–In this case it would be the TimeseriesKey class, which has two members: 
•String Group 
•long Timestamp 
–Where the natural key is “Group” and the natural value is the “Timestamp” member.
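A minimal sketch of such a composite key as a Hadoop WritableComparable; the two fields match the slide (String group, long timestamp), but the method bodies and the hashCode choice are assumptions.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Composite key: natural key (group / stock symbol) plus natural value (timestamp).
public class TimeseriesKey implements WritableComparable<TimeseriesKey> {

    private String group;    // natural key: stock symbol
    private long timestamp;  // natural value: time stamp to sort on

    public String getGroup() { return group; }
    public long getTimestamp() { return timestamp; }

    public void set(String group, long timestamp) {
        this.group = group;
        this.timestamp = timestamp;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(group);
        out.writeLong(timestamp);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        group = in.readUTF();
        timestamp = in.readLong();
    }

    // Default ordering: natural key first, then timestamp.
    @Override
    public int compareTo(TimeseriesKey other) {
        int cmp = group.compareTo(other.group);
        return cmp != 0 ? cmp : Long.compare(timestamp, other.timestamp);
    }

    @Override
    public int hashCode() { return group.hashCode(); }
}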
Implementation : CompositeKeyComparator 
•Compares two composite keys for sorting. 
•It should order by the full composite key: the natural key first, then the natural value.
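A sketch of what such a key comparator could look like, extending Hadoop’s WritableComparator and comparing the group first and the timestamp second; the constructor and comparison details are assumptions.

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Orders keys by the full composite key: group, then timestamp.
public class CompositeKeyComparator extends WritableComparator {

    protected CompositeKeyComparator() {
        super(TimeseriesKey.class, true); // true => instantiate keys for comparison
    }

    @Override
    @SuppressWarnings("rawtypes")
    public int compare(WritableComparable a, WritableComparable b) {
        TimeseriesKey k1 = (TimeseriesKey) a;
        TimeseriesKey k2 = (TimeseriesKey) b;
        int cmp = k1.getGroup().compareTo(k2.getGroup());
        return cmp != 0 ? cmp : Long.compare(k1.getTimestamp(), k2.getTimestamp());
    }
}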
Implementation : NaturalKeyPartitioner 
•The partitioner should consider only the natural key. 
•It places all data for one natural key into a logical group, inside which we want the secondary sort to occur on the natural value, i.e. the second half of the composite key. 
•The default hash partitioner would hash the entire composite key and could send key/value pairs belonging to the same stock symbol to different reducers.
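A sketch of a partitioner that hashes only the natural key so that every record for one stock symbol lands in the same partition; the value type and the exact hashing are assumptions.

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes records by the natural key only, ignoring the timestamp half of the key.
public class NaturalKeyPartitioner extends Partitioner<TimeseriesKey, DoubleWritable> {

    @Override
    public int getPartition(TimeseriesKey key, DoubleWritable value, int numPartitions) {
        // Mask off the sign bit before taking the modulus.
        return (key.getGroup().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}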
Implementation : NaturalKeyGroupingComparator 
•It should consider only the natural key. 
•Inside a partition, reduce() is called once per group within that partition. 
•A custom grouping comparator ensures that a single reduce() call sees all values that share a natural key, grouping values across natural value “borders” in the composite key.
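A sketch of a grouping comparator that compares only the natural key, so that all timestamps for one stock symbol are presented to a single reduce() call; as with the other sketches, the constructor details are assumptions.

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Groups reduce() input by the natural key only.
public class NaturalKeyGroupingComparator extends WritableComparator {

    protected NaturalKeyGroupingComparator() {
        super(TimeseriesKey.class, true);
    }

    @Override
    @SuppressWarnings("rawtypes")
    public int compare(WritableComparable a, WritableComparable b) {
        TimeseriesKey k1 = (TimeseriesKey) a;
        TimeseriesKey k2 = (TimeseriesKey) b;
        return k1.getGroup().compareTo(k2.getGroup());
    }
}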
End of session 
Day 2: Secondary Sort and a Custom Comparator
