Data-Intensive Computing on Windows HPC Server with the DryadLINQ Framework
John Vert, Architect, Microsoft Corporation
SVR17
Moving Parts
Windows HPC Server 2008 – cluster management, job scheduling
Dryad – distributed execution engine: failure recovery, distribution, scalability across very large partitioned datasets
LINQ – .NET extensions for declarative query, easy expression of data parallelism, unified data model
PLINQ – multi-core parallelism across LINQ queries
DryadLINQ – brings the ease of LINQ programming to Dryad
Software Stack (top to bottom): .NET applications (image processing, machine learning, graph analysis, data mining, …) on DryadLINQ, on Dryad, on the HPC Job Scheduler, on Windows HPC Server 2008 running across the cluster nodes
Dryad
Provides a general, flexible distributed execution layer
Dataflow graph as the computation model; the graph can be modified by runtime optimizations
A higher language layer supplies the graph, vertex code, serialization code, and hints for data locality
Automatically handles distributed execution: distributes code, routes data, schedules processes on machines near the data, and masks failures in the cluster and network
A Dryad Job: a directed acyclic graph (DAG) of processing vertices connected by channels (file, FIFO, pipe), with inputs and outputs at the edges of the graph
2-D Piping
Unix pipes are 1-D:  grep | sed | sort | awk | perl
Dryad is 2-D:  grep^1000 | sed^500 | sort^1000 | awk^500 | perl^50  (the superscripts denote how many instances of each stage run in parallel)
LINQ: Language Integrated Query
Declarative extensions to C# and VB.NET for iterating over collections, in memory or via data providers
SQL-like and broadly adoptable by developers: easy to use, reduces written code, predictable results, scalable experience, deep tooling support
PLINQ: Parallel Language Integrated Query
Value proposition: enable LINQ developers to take advantage of parallel hardware with a basic understanding of data parallelism
Declarative data parallelism (focus on the "what", not the "how")
Alternative to LINQ-to-Objects: same set of query operators plus some extras; default is IEnumerable<T> based
Preview in the Parallel Extensions to .NET Framework 3.5 CTP; shipping in .NET Framework 4.0 Beta 2
DryadLINQ: LINQ to clusters
Declarative programming style of LINQ for clusters
Automatic parallelization: a parallel query plan exploits multi-node parallelism; PLINQ underneath exploits multi-core parallelism
Integration with Visual Studio and .NET: type safety, automatic serialization
Query plan optimizations: static optimization rules to optimize locality, plus dynamic run-time optimizations
DryadLINQ: From LINQ to Dryad
Automatic query plan generation; distributed query execution by Dryad
LINQ query → query plan → Dryad graph, for example:

var logentries = from line in logs
                 where !line.StartsWith("#")
                 select new LogEntry(line);

(the generated plan is a small graph of logs → where → select vertices)
A Simple LINQ Query

IEnumerable<BabyInfo> babies = ...;
var results = from baby in babies
              where baby.Name == queryName &&
                    baby.State == queryState &&
                    baby.Year >= yearStart &&
                    baby.Year <= yearEnd
              orderby baby.Year ascending
              select baby;
A Simple PLINQ Query

IEnumerable<BabyInfo> babies = ...;
var results = from baby in babies.AsParallel()
              where baby.Name == queryName &&
                    baby.State == queryState &&
                    baby.Year >= yearStart &&
                    baby.Year <= yearEnd
              orderby baby.Year ascending
              select baby;
A Simple DryadLINQ Query

PartitionedTable<BabyInfo> babies =
    PartitionedTable.Get<BabyInfo>("BabyInfo.pt");
var results = from baby in babies
              where baby.Name == queryName &&
                    baby.State == queryState &&
                    baby.Year >= yearStart &&
                    baby.Year <= yearEnd
              orderby baby.Year ascending
              select baby;
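The three queries above assume a BabyInfo type that the talk never shows; a minimal sketch might look like the following (the field names are inferred from the query and are an assumption, not part of the deck):

// Hypothetical shape of the BabyInfo records queried above.
public class BabyInfo
{
    public string Name  { get; set; }   // baby name, compared against queryName
    public string State { get; set; }   // state, compared against queryState
    public int    Year  { get; set; }   // birth year, filtered by yearStart/yearEnd
}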
PartitionedTable<T>: core data structure for DryadLINQ
Scale-out, partitioned container for .NET objects
Derives from IQueryable<T> and IEnumerable<T>; ToPartitionedTable() extension methods
DryadLINQ operators consume and produce PartitionedTable<T>
DryadLINQ generates code to serialize/deserialize your .NET objects
Underlying storage can be a partitioned file, a partitioned SQL table, or a cluster filesystem
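A rough sketch of the round trip, reusing the hypothetical BabyInfo type above and assuming ToPartitionedTable() accepts local collections as it does for the file list on a later slide (the LoadBabiesFromCsv helper and file names are illustrative, not from the talk):

// Stage local .NET objects into a partitioned table on the cluster;
// DryadLINQ generates the serialization code for BabyInfo automatically.
IEnumerable<BabyInfo> localBabies = LoadBabiesFromCsv("babies.csv");   // hypothetical local source
localBabies.ToPartitionedTable("BabyInfo.pt");

// Later, any DryadLINQ query can consume the same table again.
PartitionedTable<BabyInfo> babies = PartitionedTable.Get<BabyInfo>("BabyInfo.pt");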
Partitioned File: file-based container for PartitionedTable<T>
A small metadata file lists the partition count and, for each partition, its index, size in bytes, and the compute node that holds it:

XCutput20a0fcfart
20
0,1855000,HPCMETAHN01
1,1630000,HPCA1CN13
2,1707500,HPCA1CN12
3,1828820,HPCA1CN22
4,1802140,HPCA1CN07
5,1741000,HPCA1CN08
6,1733980,HPCA1CN11
7,1762620,HPCA1CN06
8,1861300,HPCA1CN14
9,1807460,HPCA1CN17
10,1807560,HPCA1CN23
11,1768120,HPCA1CN20
12,1847220,HPCA1CN03
13,1729160,HPCA1CN16
14,1767500,HPCA1CN05
15,1781520,HPCA1CN04
16,1728480,HPCA1CN09
17,1802580,HPCA1CN18
18,1862380,HPCA1CN10
19,1762540,HPCA1CN21

The data itself is stored as one partition file per node, named with a sequential suffix: HPCMETAHN01Cutput20a0fcfart.00000000 through HPCA1CN21Cutput20a0fcfart.00000019 in this example.
A typical data-intensive query

var logs = PartitionedTable.Get<string>("weblogs.pt");

var logentries = from line in logs
                 where !line.StartsWith("#")
                 select new LogEntry(line);

var user = from access in logentries
           where access.user.EndsWith(@"vert")
           select access;

var accesses = from access in user
               group access by access.page into pages
               select new UserPageCount("jvert", pages.Key, pages.Count());

var htmAccesses = from access in accesses
                  where access.page.EndsWith(".htm")
                  orderby access.count descending
                  select access;

Step by step: go through logs and keep only lines that are not comments, parsing each line into a new LogEntry object; keep only entries that are accesses by jvert; group jvert's accesses by the page they refer to and count the occurrences per page; finally, sort the pages jvert has accessed by access frequency.
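The LogEntry and UserPageCount classes used above are not shown in the talk; based on how the query uses them, a hypothetical sketch could be (the log line format is an assumption):

// Hypothetical helper types inferred from the query above; the real classes may differ.
public class LogEntry
{
    public string user;   // the requesting user, e.g. "jvert"
    public string page;   // the page that was accessed

    public LogEntry(string line)
    {
        // Assumed log format: whitespace-separated fields with user and page columns.
        var fields = line.Split(' ');
        user = fields[0];
        page = fields[1];
    }
}

public class UserPageCount
{
    public string user;
    public string page;
    public int count;     // referenced as access.count in the query

    public UserPageCount(string user, string page, int count)
    {
        this.user = user; this.page = page; this.count = count;
    }
}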
Dryad parallel DAG execution
The query from the previous slide maps onto a Dryad dataflow graph: logs → logentries → user → accesses → htmAccesses → output, with each edge carrying the intermediate collection between processing vertices.
Query plan generation
Separation of the query from its execution context: add all the loaded assemblies as resources, eliminate references to local variables by partially evaluating all the expressions in the query, distribute objects used by the query, and detect impure queries when possible
Automatic code generation: object serialization code for Dryad channels, managed code for Dryad vertices
Static query plan optimizations: pipelining (composing multiple operators into one vertex), minimizing unnecessary data repartitions, and other standard DB optimizations
DryadLINQ query plan

Query 0
Output: file://hpcmetahn01Cutput7e651a4-38b7-490c-8399-f63eaba7f29a.pt
DryadLinq0.dll was built successfully.
Input:
        [PartitionedTable: file://weblogs.pt]
Super__1:
        Where(line => !(line.StartsWith(_)))
        Select(line => new logdemo.LogEntry(line))
        Where(access => access.user.EndsWith(_))
        DryadGroupBy(access => access.page, (k__0, pages) => new LinqToDryad.Pair<String,Int32>(k__0, pages.Count()))
        DryadHashPartition(e => e.Key, e => e.Key)
Super__12:
        DryadMerge()
        DryadGroupBy(e => e.Key, e => e.Value, (k__0, g__1) => new LinqToDryad.Pair<String,Int32>(k__0, g__1.Sum()))
        Select(pages => new logdemo.UserPageCount(_, pages.Key, pages.Count()))
XML representation, generated by DryadLINQ and passed to Dryad

<Query>
  <DryadLinqVersion>1.0.1401.0</DryadLinqVersion>
  <ClusterName>hpcmetahn01</ClusterName>
  ...
  <Resources>                          <!-- list of files to be shipped to the cluster -->
    <Resource>wrappernativeinfo.dll</Resource>
    <Resource>DryadLinq0.dll</Resource>
    <Resource>System.Threading.dll</Resource>
    <Resource>logdemo.exe</Resource>
    <Resource>LinqToDryad.dll</Resource>
  </Resources>
  <QueryPlan>                          <!-- vertex definitions -->
    <Vertex>
      <UniqueId>0</UniqueId>
      <Type>InputTable</Type>
      <Name>weblogs.pt</Name>
      ...
    </Vertex>
    <Vertex>
      <UniqueId>1</UniqueId>
      <Type>Super</Type>
      <Name>Super__1</Name>
      ...
      <Children>
        <Child>
          <UniqueId>0</UniqueId>
        </Child>
      </Children>
    </Vertex>
    ...
  </QueryPlan>
</Query>
DryadLINQ generated code, compiled at runtime; the assembly is passed to Dryad to implement the vertices

public sealed class DryadLinq__Vertex
{
    public static int Super__1(string args)
    {
        < . . . >
        DryadVertexEnv denv = new DryadVertexEnv(args, dvertexparam);
        var dwriter__2 = denv.MakeWriter(DryadLinq__Extension.FactoryType__0);
        var dreader__3 = denv.MakeReader(DryadLinq__Extension.FactoryString);
        var source__4 = DryadLinqVertex.DryadWhere(dreader__3, line => (!(line.StartsWith(@"#"))), true);
        var source__5 = DryadLinqVertex.DryadSelect(source__4, line => new logdemo.LogEntry(line), true);
        var source__6 = DryadLinqVertex.DryadWhere(source__5, access => access.user.EndsWith(@"vert"), true);
        var source__7 = DryadLinqVertex.DryadGroupBy(source__6, access => access.page,
            (k__0, pages) => new LinqToDryad.Pair<System.String,System.Int32>(k__0, pages.Count<logdemo.LogEntry>()),
            null, true, true, false);
        DryadLinqVertex.DryadHashPartition(source__7, e => e.Key, null, dwriter__2);
        DryadLinqLog.Add("Vertex Super__1 completed at {0}", DateTime.Now.ToString("MM/dd/yyyy HH:mm:ss.fff"));
        return 0;
    }

    public static int Super__12(string args)
    {
        < . . . >
    }
}
DryadLINQ query operators
Almost all the useful LINQ operators: Where, Select, SelectMany, OrderBy, GroupBy, Join, GroupJoin, Distinct, Concat, Union, Intersect, Except, Count, Contains, Sum, Min, Max, Average, Any, All, Skip, Take, Aggregate
Operators introduced by DryadLINQ: HashPartition, RangePartition, Merge, Fork
Dryad Apply: operates on sequences rather than items
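As a hedged illustration of the sequence-oriented Apply operator (the exact DryadLINQ signature is not shown in the talk; this sketch assumes Apply takes a function from one sequence to another, and the windowing logic is purely illustrative):

// Sketch only: a sliding two-item sum computed within each sequence handed to Apply.
// Assumed Apply shape, roughly:
//   IQueryable<R> Apply<T, R>(this IQueryable<T> source, Expression<Func<IEnumerable<T>, IEnumerable<R>>> f)
static IEnumerable<int> PairwiseSums(IEnumerable<int> items)
{
    int? previous = null;
    foreach (var x in items)
    {
        if (previous.HasValue) yield return previous.Value + x;   // needs the previous item, so it cannot be a per-item Select
        previous = x;
    }
}

// var numbers = PartitionedTable.Get<int>("numbers.pt");     // hypothetical input table
// var sums = numbers.Apply(seq => PairwiseSums(seq));         // runs once per sequence, not once per item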
MapReduce in DryadLINQ

MapReduce(source,       // sequence of Ts
          mapper,       // T -> Ms
          keySelector,  // M -> K
          reducer)      // (K, Ms) -> Rs
{
    var map = source.SelectMany(mapper);
    var group = map.GroupBy(keySelector);
    var result = group.SelectMany(reducer);
    return result;      // sequence of Rs
}
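A concrete generic wrapper matching this pseudocode, plus a word-count usage, might look like the sketch below (the helper name, the input path, and the whitespace tokenization are illustrative assumptions, not from the talk):

using System;
using System.Collections.Generic;
using System.Linq;
using System.Linq.Expressions;

public static class MapReduceSketch
{
    // Generic MapReduce over any IQueryable source, mirroring the three-step pseudocode above.
    public static IQueryable<R> MapReduce<T, M, K, R>(
        IQueryable<T> source,
        Expression<Func<T, IEnumerable<M>>> mapper,                   // T -> Ms
        Expression<Func<M, K>> keySelector,                           // M -> K
        Expression<Func<IGrouping<K, M>, IEnumerable<R>>> reducer)    // (K, Ms) -> Rs
    {
        var map = source.SelectMany(mapper);
        var group = map.GroupBy(keySelector);
        var result = group.SelectMany(reducer);
        return result;                                                // sequence of Rs
    }
}

// Hypothetical word-count usage over a partitioned table of text lines:
// var lines = PartitionedTable.Get<string>("lines.pt");
// var counts = MapReduceSketch.MapReduce(
//     lines,
//     line => line.Split(' '),                                          // mapper: line -> words
//     word => word,                                                     // key selector
//     g => new[] { new KeyValuePair<string, int>(g.Key, g.Count()) });  // reducer: (word, occurrences) -> count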
K-means in DryadLINQ

public static Vector NearestCenter(Vector v, IEnumerable<Vector> centers)
{
    return centers.Aggregate((r, c) => (r - v).Norm2() < (c - v).Norm2() ? r : c);
}

public static IQueryable<Vector> Step(IQueryable<Vector> vectors, IQueryable<Vector> centers)
{
    return vectors.GroupBy(point => NearestCenter(point, centers))
                  .Select(group => group.Aggregate((x, y) => x + y) / group.Count());
}

var vectors = PartitionedTable.Get<Vector>("vectors.pt");
IQueryable<Vector> centers = vectors.Take(100);
for (int i = 0; i < 10; i++)
{
    centers = Step(vectors, centers);
}
centers.ToPartitionedTable<Vector>("centers.pt");

public class Vector
{
    public double[] entries;
    [Associative]
    public static Vector operator +(Vector v1, Vector v2) { … }
    public static Vector operator -(Vector v1, Vector v2) { … }
    public double Norm2() { … }
}
Putting it all together: it's LINQ all the way down
Major League Baseball dataset: pitch-by-pitch data for every MLB game since 2007
47,909 pitch XML files (one for each pitcher appearance); 6,127 player XML files (one for each player)
Hash partition the input data files to distribute the work
LINQ to XML to shred the data; DryadLINQ to analyze the dataset
Load the dataset and partition it; define Pitch and Player classes

void StagePitchData(string[] fileList, string PartitionedFile)
{
    // partition the list of filenames across 20 nodes of the cluster
    var pitches = fileList.ToPartitionedTable("filelist")
                          .HashPartition((x) => (x), 20)
                          .SelectMany((f) => XElement.Load(f).Elements("atbat"))
                          .SelectMany((a) => a.Elements("pitch")
                              .Select((p) => new Pitch((string)a.Attribute("pitcher"),
                                                       (string)a.Attribute("batter"),
                                                       p)));
    pitches.ToPartitionedTable(PartitionedFile);
}

void StagePlayerData(string[] fileList, string PartitionedFile)
{
    var players = fileList.Select((p) => new Player(XElement.Load(p)));
    players.ToPartitionedTable(PartitionedFile);
}
Analyze the dataset with LINQ

IQueryable<Pitch> FindFastest(IQueryable<Pitch> pitches, int count)
{
    return pitches.OrderByDescending((p) => p.StartSpeed)
                  .Take(count);
}
Supports LINQ Joins

IQueryable<string> FindFastestPitchers(IQueryable<Pitch> pitches,
                                       IQueryable<Player> players,
                                       int count)
{
    return pitches.OrderByDescending((p) => p.StartSpeed)
                  .Take(count)
                  .Join(players,
                        (o) => o.Pitcher,
                        (i) => i.Id,
                        (o, i) => i.FirstName + " " + i.LastName)
                  .Distinct();
}
DryadLINQ on HPC Server
The DryadLINQ program runs on a client workstation: develop, debug, and run locally
When ToPartitionedTable() is called, the query expression is materialized (code generation, query plan, optimization) and a job is submitted to HPC Server
HPC Server allocates resources for the job and schedules a single task; this task is the Dryad Job Manager
The Job Manager then schedules additional tasks to execute the vertices of the DryadLINQ query
When the job completes, the client program picks up the output result and continues
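Seen from the client side, the flow above might look roughly like this sketch (FindFastest, Pitch, PartitionedTable.Get, and ToPartitionedTable come from the earlier slides; the file names are illustrative):

// Runs on the workstation. Building the query only constructs an expression tree;
// nothing touches the cluster yet.
var pitches = PartitionedTable.Get<Pitch>("pitches.pt");
var fastest = FindFastest(pitches, 10);

// Materializing the query triggers code generation, query-plan optimization, and
// submission of an HPC job whose first task is the Dryad Job Manager.
fastest.ToPartitionedTable("fastest.pt");

// Once the job has completed, the client can read the distributed output back.
foreach (var p in PartitionedTable.Get<Pitch>("fastest.pt"))
    Console.WriteLine("{0}: {1}", p.Pitcher, p.StartSpeed);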
Examples of DryadLINQ Applications
Data mining: analysis of service logs for network security, analysis of Windows Watson/SQM data, cluster monitoring and performance analysis
Graph analysis: accelerated PageRank computation, road network shortest-path preprocessing
Image processing: image indexing, decision tree training, epitome computation
Simulation: light-flow simulations for next-generation display research, Monte Carlo simulations for mobile data
eScience: machine learning platform for health solutions, astrophysics simulation
Ongoing Work
Advanced query optimizations: combination of static analysis and annotations, sampling execution of the query plan, dynamic query optimization
Incremental computation and real-time event processing
Global scheduling: dynamically allocate cluster resources between multiple concurrent DryadLINQ applications
Scale-out partitioned storage and pluggable storage providers
DryadLINQ on Azure
Better debugging, performance analysis, visualization, etc.
Additional Resources
Dryad and DryadLINQ: http://connect.microsoft.com/DryadLINQ (DryadLINQ source, Dryad binaries, documentation, samples, blog, discussion group, etc.)
PLINQ: available in the Parallel Extensions to .NET Framework 3.5 CTP and in .NET Framework 4.0 Beta 2
  http://msdn.microsoft.com/en-us/concurrency/default.aspx
  http://msdn.microsoft.com/en-us/magazine/cc163329.aspx
Windows HPC Server 2008: http://www.microsoft.com/hpc (download it, try it, we want your feedback!)
Questions?
YOUR FEEDBACK IS IMPORTANT TO US! Please fill out session evaluation forms online at MicrosoftPDC.com
Learn More On Channel 9 Expand your PDC experience through Channel 9. Explore videos, hands-on labs, sample code and demos through the new Channel 9 training courses. channel9.msdn.com/learn Built by Developers for Developers….