Tools and Services for Data Intensive Research An Elephant Through the Eye of a Needle Roger Barga, Architect eXtreme Computing Group, Microsoft Research
Select eXtreme Computing Group (XCG) Initiatives
Cloud Computing Futures: ab initio R&D on cloud hardware/software infrastructure
Multicore academic engagement: Universal Parallel Computing Research Centers (UPCRCs)
Software incubations: multicore applications, power management, scheduling
Quantum computing: topological quantum computing investigations
Security and cryptography: theoretical explorations and software tools
Worldwide government and academic research partnerships
Inform next generation cloud computing infrastructure
Why Commercial Clouds are Important*
Research: Have good idea; Write proposal; Wait 6 months; If successful, wait 3 months; Install computers; Start work
Science Start-ups: Have good idea; Write business plan; Ask VCs to fund; If successful...; Install computers; Start work
Cloud Computing Model: Have good idea; Grab nodes from cloud provider; Start work; Pay for what you used
Also: scalability, cost, sustainability
* Slide used with permission of Paul Watson, University of Newcastle (UK)
The Pull of Economics (follow the money)
Moore’s “Law” favored consumer commodities
Economics drove enormous improvements
Specialized processors and mainframes faltered
The commodity software industry was born
Today’s economics
Unprecedented economies of scale
Enterprise moving to PaaS, SaaS, cloud computing
Opportunities for Analysis as a Service, multi-disciplinary data sets,…
This will drive changes in research computing and cloud infrastructure, just as did “killer micros” and inexpensive clusters
[Figure: chip block diagram with LPIA x86 cores, OoO x86 cores, 1 MB caches, GPUs, DRAM and PCIe controllers, and NoC]
Drinking from the Twitter Fire Hose On the “input” end
Enrich each element with significantly more metadata, e.g. geolocation. Assume the Twitter user base is on the order of 10-50 million; let’s crank this up to the 500 million range. The average Twitter user generates a relatively low incoming message rate right now. Assume that a user’s devices (phone, car, PC) are enhanced to begin auto-generating periodic Twitter messages on their behalf, e.g. with location ‘pings’ and to solve other problems that Twitter bots are emerging to address. So let’s say the input rate grows again to 10x-100x what it was in the previous step.
Drinking from the Twitter Fire Hose On the “input” end On the “output” end: three different usage modalities Each user has one or more ‘agents’ they run on their behalf, monitoring this input stream.  This might just be a client that displays a stream that is incoming from the @friends or #topics or the #interesting&@queries (user standing queries). A user can do more general queries from a search page.  This query may have more unstructured search terms than the above, and it is expected to go not just against the incoming stream but against a much larger corpus of messages from the entire input stream that has been persisted for days, weeks, months, years… Finally, analytical tools or bots whose purpose is to do trend analysis on the knowledge popping out of the stream, in real time.  Whether seeded with an interest (“let me know when a problem pops up with <product> that will damage my company’s reputation”) or just discovering a topic from the noise (“let me know when a new hot news item emerges”), both must be possible.
Pause for a Moment… Defining representative challenges or quests to focus group attention is an excellent way to proceed as a community. Publishing a whitepaper articulating these challenges is a great way to allow others to contribute to a shared research agenda. Make simulated and reference data sets available to ground such a distributed research effort.
Drinking from the Twitter Fire Hose On the “input” end On the “output” end: three different usage modalities
A combination of live data, including streaming, and historical data
Lots of necessary technology, but no single technology is sufficient
If this is going to be successful it must be accessible to the masses
Simple to use and highly scalable, which is extremely difficult because in actuality it is not simple…
This Talk is About
Effort to build & port tools for data intensive research in the cloud
Able to handle torrential streams of live and historical data
Intersection of four fundamental strategies
Distribute data and perform parallel processing;
Parallel operations to take advantage of multiple cores;
Reduce the size of the data accessed: data compression, and data structures that limit the amount of data required for queries;
Stream data processing to extract information before storage
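As a minimal sketch of how these strategies combine, assuming Python rather than any of the tools discussed in this talk, the snippet below splits a batch of messages into partitions, processes the partitions concurrently, and keeps only hashtag counts instead of the raw text. The function names and the hashtag example are illustrative assumptions; Python threads merely stand in for the separate cores or machines a real system would use.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_hashtags(messages):
    """Stream over messages once, keeping only hashtag counts (not the raw text)."""
    counts = Counter()
    for text in messages:
        for word in text.split():
            if word.startswith("#"):
                counts[word.lower()] += 1
    return counts


def parallel_hashtag_counts(messages, workers=4):
    """Partition the input, count each partition concurrently, then merge."""
    # Round-robin partitioning: partition i gets items i, i+workers, ...
    partitions = [messages[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_hashtags, partitions))
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return total
```

The merged result does not depend on how the input was partitioned, which is what lets the number of workers change without changing the answer.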
Microsoft’s Dryad
Continuously deployed since 2006
Running on >> 10^4 machines
Sifting through > 10 PB of data daily
Runs on clusters of > 3000 machines
Handles jobs with > 10^5 processes each
Used by >> 100 developers
Rich platform for data analysis
Microsoft Research, Silicon Valley: Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, Dennis Fetterly
Pause for a Moment… Data-Intensive Computing Symposium, 2007 Dryad is now freely available http://research.microsoft.com/en-us/collaboration/tools/dryad.aspx Thanks to Geoffrey Fox (Indiana) and Magda Balazinska (UW) as early adopters Commitment by External Research (MSR) to support research community use
Simple Programming Model Terasort, a well-known benchmark: time to sort 1 TB of data [J. Gray 1985]
DryadLINQ provides simple but powerful programming model
 Only a few lines of code needed to implement Terasort; benchmark run May 2008
DryadLINQ result: 349 seconds (5.8 min)
 Cluster of 240 AMD64 (quad) machines, 920 disks
 Code: 17 lines of LINQ
DryadDataContext ddc = new DryadDataContext(fileDir);
DryadTable<TeraRecord> records = ddc.GetPartitionedTable<TeraRecord>(file);
var q = records.OrderBy(x => x);
q.ToDryadPartitionedTable(output);
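To give a feel for what a sort over partitioned data involves, here is a minimal Python sketch of one common approach: sample the input, pick split points, route each record to a range partition, sort each partition independently, and concatenate. This is an illustration under stated assumptions, not DryadLINQ's implementation; `range_partition_sort` and its parameters are hypothetical names.

```python
import bisect
import random


def range_partition_sort(records, num_partitions, sample_size=100, seed=0):
    """Sort by range-partitioning on sampled split points, then sorting each part."""
    if not records:
        return []
    rng = random.Random(seed)
    # Sample the input to estimate the key distribution.
    sample = sorted(rng.sample(records, min(sample_size, len(records))))
    # Choose num_partitions - 1 split points spread evenly through the sample.
    step = len(sample) / num_partitions
    splitters = [sample[int(i * step)] for i in range(1, num_partitions)]
    # Route every record to the partition whose key range contains it.
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        partitions[bisect.bisect_right(splitters, record)].append(record)
    # Each partition could be sorted on a different machine; ranges do not
    # overlap, so concatenating the sorted partitions yields a global sort.
    result = []
    for part in partitions:
        result.extend(sorted(part))
    return result
```

Sampling only affects how evenly the work is balanced across partitions; the output is fully sorted regardless of which split points are chosen.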
LINQ Microsoft’s Language INtegrated Query Available in Visual Studio 2008 A set of operators to manipulate datasets in .NET Support traditional relational operators Select, Join, GroupBy, Aggregate, etc. Data model Data elements are strongly typed .NET objects Much more expressive than SQL tables Extremely extensible Add new custom operators Add new execution providers
Dryad Generalizes Unix Pipes
Unix Pipes: 1-D
grep | sed | sort | awk | perl
Dryad: 2-D, multi-machine, virtualized
grep^1000 | sed^500 | sort^1000 | awk^500 | perl^50
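The "2-D" idea can be sketched in a few lines of Python: a pipeline is a list of stages, and each stage runs several copies of the same function, one per data partition. This is only a toy model under assumed names (`run_stage`, `run_pipeline`, and the three stage functions are hypothetical), with threads standing in for processes on many machines.

```python
from concurrent.futures import ThreadPoolExecutor


def run_stage(func, items, width):
    """Run `width` copies of func, each on its own partition, and merge the outputs."""
    partitions = [items[i::width] for i in range(width)]
    with ThreadPoolExecutor(max_workers=width) as pool:
        outputs = list(pool.map(func, partitions))
    return [item for output in outputs for item in output]


def run_pipeline(stages, items):
    """stages is a list of (function, width) pairs, applied left to right."""
    for func, width in stages:
        items = run_stage(func, items, width)
    return items


def grep_errors(lines):
    return [line for line in lines if "error" in line]


def shorten(lines):
    return [line.replace("error", "ERR") for line in lines]


def sort_lines(lines):
    return sorted(lines)
```

For example, `run_pipeline([(grep_errors, 3), (shorten, 2), (sort_lines, 1)], lines)` runs three filters, two rewriters, and one sorter. The final sort uses width 1 here because concatenating independently sorted partitions would not be globally ordered.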
Dryad Job Structure [Figure: input files feed stages of vertices (processes such as grep, sed, sort, awk, perl) connected by channels, ending in output files]
A channel is a finite stream of items
  TCP pipes (inter-machine)
  Memory FIFOs (intra-machine)
Dryad Job Staging (figure labels: Vertex Code, JM code, Cluster services)
1. Build
2. Send .exe
3. Start JM
4. Query cluster resources
5. Generate graph
6. Initialize vertices
7. Serialize vertices
8. Monitor vertex execution
Dryad Scheduler is a State Machine
Static optimizer builds execution graph
Vertex can run anywhere once all its inputs are ready.
Dynamic optimizer mutates running graph
Distributes code, routes data;
Schedules processes on machines near data;
Adjusts available compute resources at each stage;
Automatically recovers computation, adjusts for overload
If A’s inputs are gone, run upstream vertices again (recursively);
If A is slow, run a copy elsewhere and use output from one that finishes first.
Masks failures in cluster and network;
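The recursive recovery rule can be sketched as a small Python model. It assumes a scheduler that remembers each vertex's stored output and, when an output is missing, rebuilds it by first rebuilding the inputs. The `Vertex` and `Scheduler` classes, the retry limit, and retry-on-exception are assumptions made for illustration; duplicate execution of slow vertices is not modeled.

```python
class Vertex:
    """A node in the job graph: a function plus the upstream vertices it reads."""

    def __init__(self, name, func, inputs=()):
        self.name = name
        self.func = func
        self.inputs = list(inputs)


class Scheduler:
    def __init__(self, max_attempts=3):
        self.outputs = {}  # vertex name -> stored output
        self.runs = []     # names of vertices that completed, in order
        self.max_attempts = max_attempts

    def lose_output(self, vertex):
        """Simulate a stored output disappearing (e.g. a machine going away)."""
        self.outputs.pop(vertex.name, None)

    def get(self, vertex):
        """Return the vertex's output, re-running it and its upstream as needed."""
        if vertex.name in self.outputs:
            return self.outputs[vertex.name]
        # Inputs that are gone are rebuilt recursively before this vertex runs.
        args = [self.get(upstream) for upstream in vertex.inputs]
        for attempt in range(1, self.max_attempts + 1):
            try:
                result = vertex.func(*args)
            except Exception:
                if attempt == self.max_attempts:
                    raise  # give up after the assumed retry limit
                continue   # otherwise run the vertex again
            self.runs.append(vertex.name)
            self.outputs[vertex.name] = result
            return result
```

Because a vertex is a deterministic function of its inputs in this model, re-running it after a loss gives the same output, which is what makes recovery invisible to downstream vertices.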
Combining Query Providers Local Machine Execution Engines Scalability .NET program (C#, VB, F#, etc.) DryadLINQ Cluster Query PLINQ LINQ provider interface Multi-core LINQ-to-IMDB Objects LINQ-to-CEP Single-core
LINQ == Tree of Operators A query is composed of a tree of operators As with a program AST, these trees can be analyzed, rewritten This is why PLINQ can safely introduce parallelism q = from x in A where p(x) select x3;
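A small Python model can illustrate why an operator tree makes such rewrites safe. The node classes (`Source`, `Where`, `Select`, `Concat`) and the `parallelize` rewrite below are hypothetical names for this sketch, not the LINQ or PLINQ API. The rewrite splits the source into contiguous chunks and pushes the filter and projection down onto each chunk, producing independent subtrees that could run on separate cores.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Source:
    items: List[Any]


@dataclass
class Where:
    child: Any
    pred: Callable[[Any], bool]


@dataclass
class Select:
    child: Any
    proj: Callable[[Any], Any]


@dataclass
class Concat:
    children: List[Any]


def evaluate(node):
    """Interpret the operator tree sequentially."""
    if isinstance(node, Source):
        return list(node.items)
    if isinstance(node, Where):
        return [x for x in evaluate(node.child) if node.pred(x)]
    if isinstance(node, Select):
        return [node.proj(x) for x in evaluate(node.child)]
    if isinstance(node, Concat):
        return [x for child in node.children for x in evaluate(child)]
    raise TypeError(f"unknown operator: {node!r}")


def parallelize(node, n):
    """Rewrite the tree so Where/Select apply per chunk under a top-level Concat."""
    if isinstance(node, Source):
        size = max(1, -(-len(node.items) // n))  # ceiling division, at least 1
        # Contiguous chunks keep the original element order after Concat.
        return Concat([Source(node.items[i:i + size])
                       for i in range(0, len(node.items), size)])
    if isinstance(node, Where):
        return Concat([Where(c, node.pred) for c in parallelize(node.child, n).children])
    if isinstance(node, Select):
        return Concat([Select(c, node.proj) for c in parallelize(node.child, n).children])
    raise TypeError(f"cannot parallelize: {node!r}")
```

Element-wise operators such as filters and projections look at one item at a time, so applying them per chunk and concatenating gives the same result as applying them to the whole sequence.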

Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 

Microsoft Dryad

  • 1. Tools and Services for Data Intensive Research An Elephant Through the Eye of a Needle Roger Barga, Architect eXtreme Computing Group, Microsoft Research
  • 3. Worldwide government and academic research partnerships
  • 5. Why Commercial Clouds are Important* Research Have good idea Write proposal Wait 6 months If successful, wait 3 months Install Computers Start Work Science Start-ups Have good idea Write Business Plan Ask VCs to fund If successful.. Install Computers Start Work Cloud Computing Model Have good idea Grab nodes from Cloud provider Start Work Pay for what you used also scalability, cost, sustainability * Slide used with permission of Paul Watson, University of Newcastle (UK)
  • 6. The Pull of Economics (follow the money) Moore’s “Law” favored consumer commodities Economics drove enormous improvements Specialized processors and mainframes faltered The commodity software industry was born Today’s economics Unprecedented economies of scale Enterprise moving to PaaS, SaaS, cloud computing Opportunities for Analysis as a Service, multi-disciplinary data sets,… This will drive changes in research computing and cloud infrastructure Just as did “killer micros” and inexpensive clusters [Figure: block diagram of a multicore chip with LPIA x86 cores, OoO x86 cores, 1 MB caches, GPUs, DRAM controllers, PCIe controllers, and a NoC]
  • 8. Enrich each element with significantly more metadata, e.g. geolocation. Assume the order of magnitude of the Twitter user base is in the 10-50MM range; let’s crank this up to the 500M range. The average Twitter user is generating a relatively low incoming message rate right now; assume that a user’s devices (phone, car, PC) are enhanced to begin auto-generating periodic Twitter messages on their behalf, e.g. with location ‘pings’ and solving other problems that twitterbots are emerging to address. So let’s say the input rate grows again to 10x-100x what it was in the previous step.
  • 9. Drinking from the Twitter Fire Hose On the “input” end On the “output” end: three different usage modalities Each user has one or more ‘agents’ they run on their behalf, monitoring this input stream.  This might just be a client that displays a stream that is incoming from the @friends or #topics or the #interesting&@queries (user standing queries). A user can do more general queries from a search page.  This query may have more unstructured search terms than the above, and it is expected not just to be going against incoming stream but against much larger corpus of messages from the entire input stream that has been persisted for days, weeks, months, years… Finally, analytical tools or bots whose purpose is to do trend analysis on the knowledge popping out of the stream, in real-time.  Whether seeded with an interest (“let me know when a problem pops up with <product> that will damage my company’s reputation”) or just discovering a topic from the noise (“let me know when a new hot news item emerges”), both must be possible.
  • 10. Pause for Moment… Defining representative challenges or quests to focus group attention is an excellent way to proceed as a community Publishing a whitepaper articulating these challenges is a great way to allow others to contribute to a shared research agenda Make simulated and reference data sets available to ground such a distributed research effort
  • 11. Drinking from the Twitter Fire Hose On the “input” end On the “output” end: three different usage modalities A combination of live data, including streaming, and historical data Lots of necessary technology, but no single technology is sufficient If this is going to be successful it must be accessible to the masses  Simple to use and highly scalable, which is extremely difficult because in actuality it is not simple…
  • 13. Microsoft’s Dryad Continuously deployed since 2006 Running on >> 10^4 machines Sifting through > 10Pb data daily Runs on clusters > 3000 machines Handles jobs with > 10^5 processes each Used by >> 100 developers Rich platform for data analysis Microsoft Research, Silicon Valley Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, Dennis Fetterly
  • 14. Pause for Moment… Data-Intensive Computing Symposium, 2007 Dryad is now freely available http://research.microsoft.com/en-us/collaboration/tools/dryad.aspx Thanks to Geoffrey Fox (Indiana) and Magda Balazinska (UW) as early adopters Commitment by External Research (MSR) to support research community use
  • 16. DryadLINQ provides simple but powerful programming model
  • 17. Only a few lines of code are needed to implement TeraSort (benchmark, May 2008)
  • 18. DryadLINQ result: 349 seconds (5.8 min)
  • 19. Cluster of 240 AMD64 (quad) machines, 920 disks
  • 20. Code: 17 lines of LINQ. DryadDataContext ddc = new DryadDataContext(fileDir); DryadTable<TeraRecord> records = ddc.GetPartitionedTable<TeraRecord>(file); var q = records.OrderBy(x => x); q.ToDryadPartitionedTable(output);
  • 21. LINQ Microsoft’s Language INtegrated Query Available in Visual Studio 2008 A set of operators to manipulate datasets in .NET Support traditional relational operators Select, Join, GroupBy, Aggregate, etc. Data model Data elements are strongly typed .NET objects Much more expressive than SQL tables Extremely extensible Add new custom operators Add new execution providers
  • 22. Dryad Generalizes Unix Pipes Unix Pipes: 1-D grep | sed | sort | awk | perl Dryad: 2-D, multi-machine, virtualized grep^1000 | sed^500 | sort^1000 | awk^500 | perl^50
  • 24. TCP pipes (inter-machine)
  • 26. Dryad Job Staging 1. Build 7. Serialize vertices Vertex Code 2. Send .exe 5. Generate graph JM code Cluster services 6. Initialize vertices 3. Start JM 8. Monitor vertex execution 4. Query cluster resources
  • 28. If A’s inputs are gone, run upstream vertices again (recursively);
  • 29. If A is slow, run a copy elsewhere and use output from one that finishes first. Masks failures in cluster and network;
  • 30. Combining Query Providers Local Machine Execution Engines Scalability .NET program (C#, VB, F#, etc) DryadLINQ Cluster Query PLINQ LINQ provider interface Multi-core LINQ-to-IMDB Objects LINQ-to-CEP Single-core
  • 34. Nesting queries inside of others is common. PLINQ can fuse partitions: var q1 = from x in A select x*2; var q2 = q1.Sum();
  • 35. Combining with PLINQ Query DryadLINQ subquery PLINQ
  • 36. Combining with LINQ-to-IMDB Query DryadLINQ Subquery Subquery Subquery Subquery Historical Reference Data LINQ-to-IMDB
  • 37. Combining with LINQ-to-CEP Query DryadLINQ Subquery Subquery Subquery Subquery Subquery ‘Live’ Streaming Data LINQ-to-IMDB LINQ-to-CEP
  • 38. Cost of storing data – few cents/month/MB Cost of acquiring data – negligible Extracting insight while acquiring data - priceless Mining historical data for ways to extract insight – precious CEDR CEP – the engine that makes it possible Consistent Streaming Through Time: A Vision for Event Stream Processing Roger S. Barga, Jonathan Goldstein, Mohamed H. Ali, Mingsheng Hong In the proceedings of CIDR 2007
  • 39. Complex Event Processing Complex Event Processing (CEP) is the continuous and incremental processing of event (data) streams from multiple sources based on declarative query and pattern specifications with near-zero latency.
  • 42. Separately specify desired disorder handling strategy
  • 43. Many interesting repercussions. Consistent Streaming Through Time: A Vision for Event Stream Processing Roger S. Barga, Jonathan Goldstein, Mohamed H. Ali, Mingsheng Hong In the proceedings of CIDR 2007
  • 44. CEDR (Orinoco) Overview Currently processing over 400M events per day for internal application (5000 events/sec)
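
The TeraSort slides (17 to 20) express a distributed sort as a single OrderBy over a partitioned table. One common way to run such a sort across machines is to sample the data, pick split points, route each record to a range partition, and sort each partition independently. The following is a minimal single-process Python sketch of that idea, not DryadLINQ's actual implementation; the function name range_partition_sort and its parameters are hypothetical and chosen only for illustration.

```python
import bisect
import random


def range_partition_sort(records, num_partitions, sample_size=100, seed=0):
    """Sort records by sampling split points, range-partitioning, and sorting each partition.

    Illustrative assumption: each partition stands in for one machine's share of the data.
    """
    if not records:
        return []
    rng = random.Random(seed)
    # Sample the input to estimate where the partition boundaries should fall.
    sample = sorted(rng.sample(records, min(sample_size, len(records))))
    # Choose num_partitions - 1 split points spread evenly over the sample.
    splitters = [sample[(i * len(sample)) // num_partitions]
                 for i in range(1, num_partitions)]
    # Route every record to the partition whose key range contains it.
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        partitions[bisect.bisect_right(splitters, record)].append(record)
    # Partitions cover disjoint, increasing key ranges, so sorting each one
    # and concatenating them in order yields a globally sorted result.
    result = []
    for partition in partitions:
        result.extend(sorted(partition))
    return result
```

Because the partitions hold disjoint key ranges, the per-partition sorts are independent of each other, which is what would let a cluster run them in parallel.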

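Slide 28 states the recovery rule "if A's inputs are gone, run upstream vertices again (recursively)". A minimal Python sketch of that rule follows, under the assumption that vertex outputs are kept in a dictionary and that a missing entry means the output was lost; the names materialize, graph, store, and run_log are hypothetical. It does not cover the duplicate-execution rule from slide 29.

```python
def materialize(vertex, graph, store, run_log):
    """Return the output of vertex, re-running it and any upstream vertices whose outputs are missing.

    graph maps a vertex name to (function, list of upstream vertex names).
    store maps a vertex name to its saved output; a missing key means the output is gone.
    run_log records the order in which vertices were actually executed.
    """
    if vertex in store:
        # Output still available: nothing to re-run.
        return store[vertex]
    function, upstream = graph[vertex]
    # Recursively recover every input before running this vertex.
    inputs = [materialize(name, graph, store, run_log) for name in upstream]
    run_log.append(vertex)
    store[vertex] = function(*inputs)
    return store[vertex]


# Hypothetical three-vertex job: read -> sort -> top.
example_graph = {
    "read": (lambda: [3, 1, 2], []),
    "sort": (lambda values: sorted(values), ["read"]),
    "top": (lambda values: values[0], ["sort"]),
}
```

Only the vertices whose outputs are missing get re-executed; anything still present in the store is reused as-is.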
Editor's notes

  1. Language Integrated Query is an extension of .NET that allows one to write declarative computations on collections.
  2. Dryad is a generalization of the Unix piping mechanism: instead of one-dimensional (chain) pipelines, it provides two-dimensional pipelines. The unit is still a process connected by a point-to-point channel, but the processes are replicated.
  3. This is the basic Dryad terminology.
  4. The brain of a Dryad job is a centralized Job Manager, which maintains a complete state of the job. The JM controls the processes running on a cluster, but never exchanges data with them. (The data plane is completely separated from the control plane.)
  5. Computation Staging
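
Note 2 describes a two-dimensional pipeline: a chain of stages in which each stage is replicated across many processes. The Python sketch below illustrates the shape of such a pipeline on one machine, using threads in place of cluster processes and in-memory lists in place of channels. All names (run_pipeline, grep_vertex, sed_vertex, sort_vertex) are hypothetical, and the round-robin redistribution between stages is an assumption made for simplicity rather than a description of Dryad's channels.

```python
from concurrent.futures import ThreadPoolExecutor


def run_pipeline(stages, records):
    """Run records through stages, where each stage is (vertex_function, replica_count).

    Every replica of a stage processes one partition; outputs are then
    redistributed round-robin across the replicas of the next stage.
    """
    first_width = stages[0][1]
    partitions = [records[i::first_width] for i in range(first_width)]
    for index, (vertex_function, width) in enumerate(stages):
        # One replica per partition; pool.map keeps outputs in partition order.
        with ThreadPoolExecutor(max_workers=width) as pool:
            outputs = list(pool.map(vertex_function, partitions))
        next_width = stages[index + 1][1] if index + 1 < len(stages) else 1
        flat = [item for output in outputs for item in output]
        partitions = [flat[i::next_width] for i in range(next_width)]
    return partitions[0]


def grep_vertex(lines):
    """Keep only lines containing the word 'error'."""
    return [line for line in lines if "error" in line]


def sed_vertex(lines):
    """Rewrite each line (here: upper-case it)."""
    return [line.upper() for line in lines]


def sort_vertex(lines):
    """Sort the lines of a single partition."""
    return sorted(lines)
```

A call such as run_pipeline([(grep_vertex, 4), (sed_vertex, 2), (sort_vertex, 1)], lines) mirrors the slide's "grep | sed | sort" chain with a replica count attached to each stage; the final stage has a single replica so that one sort sees all of the data.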