PySpark for Time Series Analysis
David Palaitis
Two Sigma Investments
New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis
2,480 views

Whether it’s the Internet of Things (IoT), analysis of financial data, or adtech, the arrival of events in time order requires tools and techniques that are noticeably missing from the Pandas and pySpark software stack.
In this talk, we’ll cover Two Sigma’s contribution to time series analysis for Spark and our work with Pandas, and propose a roadmap to future-proof pySpark and establish Python as a first-class language in the Spark ecosystem.

Published in: Data & Analytics


  1. PySpark for Time Series Analysis. David Palaitis, Two Sigma Investments
  2. About Me
  3. Important Legal Information: The information presented here is offered for recruiting purposes only and should not be used for any other purpose (including, without limitation, the making of investment decisions). Examples provided herein are for illustrative purposes only and are not necessarily based on actual data. Nothing herein constitutes an offer to sell or the solicitation of any offer to buy any security or other interest. We consider this information to be confidential and not for redistribution or dissemination. Some of the images, logos or other material used herein may be protected by copyright and/or trademark. If so, such copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for identification and comment as fair use under international copyright and/or trademark laws. Use of such image, copyright or trademark does not imply any association with such organization (or endorsement of such organization) by Two Sigma, nor vice versa.
  4. Time Series: an ordered sequence of values of a variable. Examples: IoT feeds, sensor data, economic data
  5. Time Series Analysis
  6. Time Series Analysis
  7. Time Series Analysis
  8. Time Series at Two Sigma: millions of time series; big and small (1GB – 1PB); narrow (10 columns) and wide (1MM columns); evenly and unevenly spaced observations
  9. Let’s start from the beginning …
  10. Examples!
  11. What’s Missing?
  12. You can’t even do “Word Count”
  13. “Word Count” !
  14. What’s missing? Time.
  15. Windowed Aggregations
  16. Temporal Joins (diagram label: window)
  17. w is a window specification, e.g. 500ms, 5s, 3 business days. RDD[(K,V)] -> RDD[(K,Seq[V])]
  18. reduceByWindow(f: (V, V) => V, w): RDD[(K, W)] => RDD[(K, V)]
  19. reduceByWindow(f: (V, V) => V, w): RDD[(K, V)] => RDD[(K, V)]
  20. https://github.com/twosigma/flint
  21. Getting Started …
  22. Looking ahead.
  23. Thank You. Find me after the talk to see Flint in action.
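
The windowed aggregation on slides 15 to 19 can be illustrated without Spark. The following is a minimal plain-Python sketch, not Flint's implementation or API: it groups the values for each key into windows of size w (the RDD[(K,V)] -> RDD[(K,Seq[V])] step) and then folds each group with f. The function name reduce_by_window, the (key, timestamp, value) row layout, and the choice of non-overlapping, epoch-aligned windows are all assumptions made for illustration; the slides do not say how windows are aligned or whether they overlap.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from functools import reduce

EPOCH = datetime(1970, 1, 1)

def reduce_by_window(rows, f, w):
    """Hypothetical helper: reduce values per (key, window).

    rows: iterable of (key, timestamp, value) tuples
    f:    associative function (V, V) -> V
    w:    window size as a timedelta (assumed tumbling, epoch-aligned)
    Returns a dict mapping (key, window_start) to the reduced value.
    """
    buckets = defaultdict(list)
    # Sort by time so values inside each window are folded in time order.
    for key, ts, value in sorted(rows, key=lambda r: r[1]):
        index = (ts - EPOCH) // w          # which window this timestamp falls in
        window_start = EPOCH + index * w
        buckets[(key, window_start)].append(value)
    # Grouping is done; now apply the reduction to each group.
    return {k: reduce(f, values) for k, values in buckets.items()}

# Illustrative data only (not from the talk).
rows = [
    ("a", datetime(2017, 2, 8, 9, 0, 0), 1.0),
    ("a", datetime(2017, 2, 8, 9, 0, 3), 2.0),
    ("a", datetime(2017, 2, 8, 9, 0, 7), 4.0),
    ("b", datetime(2017, 2, 8, 9, 0, 1), 10.0),
]
sums = reduce_by_window(rows, lambda x, y: x + y, timedelta(seconds=5))
```

With a 5-second window, the first two "a" observations share a window and the third falls in the next one. A distributed version would additionally need the data partitioned by time so that each window can be reduced locally.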
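
Slide 16 names temporal joins without showing one. One common form is an "as-of" join, where each row on the left is matched with the most recent row on the right whose timestamp is not later and lies within a tolerance window. The sketch below assumes that this is the kind of join the slide has in mind; asof_join, its arguments, and the integer millisecond timestamps are hypothetical and are not taken from Flint.

```python
from bisect import bisect_right

def asof_join(left, right, tolerance):
    """Hypothetical as-of join on two time-sorted lists of (time, value).

    For every left row, attach the value of the latest right row with
    right_time <= left_time and left_time - right_time <= tolerance.
    Rows without a match get None.
    """
    right_times = [t for t, _ in right]
    joined = []
    for t, left_value in left:
        # Index of the last right row whose time is <= t (or -1 if none).
        i = bisect_right(right_times, t) - 1
        if i >= 0 and t - right_times[i] <= tolerance:
            joined.append((t, left_value, right[i][1]))
        else:
            joined.append((t, left_value, None))
    return joined

# Illustrative data only: times are milliseconds.
trades = [(1000, "t1"), (2500, "t2"), (9000, "t3")]
quotes = [(900, 10.0), (2400, 10.5), (3000, 11.0)]
matched = asof_join(trades, quotes, tolerance=500)
```

Because both inputs are sorted by time, each lookup is a binary search; the same ordering assumption is what makes joins of this kind practical on large time series.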
