Data-Intensive Computing on Windows HPC Server with the DryadLINQ Framework
John Vert, Architect, Microsoft Corporation
SVR17
Moving Parts
Windows HPC Server 2008 – cluster management, job scheduling
Dryad – distributed execution engine: failure recovery, distribution, scalability across very large partitioned datasets
LINQ – .NET extensions for declarative query, easy expression of data parallelism, unified data model
PLINQ – multi-core parallelism across LINQ queries
DryadLINQ – brings the ease of LINQ programming to Dryad
Software Stack (top to bottom): .NET applications (image processing, machine learning, graph analysis, data mining, …) on DryadLINQ, on Dryad, on the HPC Job Scheduler, on Windows HPC Server 2008 running across the cluster nodes
Dryad
Provides a general, flexible distributed execution layer
Dataflow graph as the computation model; the graph can be modified by runtime optimizations
A higher language layer supplies the graph, vertex code, serialization code, and hints for data locality
Automatically handles distributed execution: distributes code, routes data, schedules processes on machines near the data, and masks failures in the cluster and network
A Dryad Job: a directed acyclic graph (DAG) of processing vertices connected by channels (file, FIFO, pipe), with inputs and outputs at the edges of the graph
2-D Piping
Unix pipes are 1-D:  grep | sed | sort | awk | perl
Dryad is 2-D:  grep^1000 | sed^500 | sort^1000 | awk^500 | perl^50  (the superscripts denote how many instances of each stage run in parallel)
LINQ: Language Integrated Query
Declarative extensions to C# and VB.NET for iterating over collections, in memory or via data providers
SQL-like and broadly adoptable by developers: easy to use, reduces written code, predictable results, scalable experience, deep tooling support
PLINQ: Parallel Language Integrated Query
Value proposition: enable LINQ developers to take advantage of parallel hardware with a basic understanding of data parallelism
Declarative data parallelism (focus on the "what", not the "how")
Alternative to LINQ-to-Objects: same set of query operators plus some extras; default is IEnumerable<T> based
Preview in the Parallel Extensions to .NET Framework 3.5 CTP; shipping in .NET Framework 4.0 Beta 2
DryadLINQ: LINQ to clusters
Declarative programming style of LINQ for clusters
Automatic parallelization: a parallel query plan exploits multi-node parallelism; PLINQ underneath exploits multi-core parallelism
Integration with Visual Studio and .NET: type safety, automatic serialization
Query plan optimizations: static optimization rules to optimize locality, plus dynamic run-time optimizations
DryadLINQ: From LINQ to Dryad
Automatic query plan generation; distributed query execution by Dryad
LINQ query → query plan → Dryad graph, for example:

var logentries = from line in logs
                 where !line.StartsWith("#")
                 select new LogEntry(line);

(the generated plan is a small graph of logs → where → select vertices)
A Simple LINQ Query

IEnumerable<BabyInfo> babies = ...;
var results = from baby in babies
              where baby.Name == queryName &&
                    baby.State == queryState &&
                    baby.Year >= yearStart &&
                    baby.Year <= yearEnd
              orderby baby.Year ascending
              select baby;
A Simple PLINQ Query

IEnumerable<BabyInfo> babies = ...;
var results = from baby in babies.AsParallel()
              where baby.Name == queryName &&
                    baby.State == queryState &&
                    baby.Year >= yearStart &&
                    baby.Year <= yearEnd
              orderby baby.Year ascending
              select baby;
A Simple DryadLINQ Query

PartitionedTable<BabyInfo> babies =
    PartitionedTable.Get<BabyInfo>("BabyInfo.pt");
var results = from baby in babies
              where baby.Name == queryName &&
                    baby.State == queryState &&
                    baby.Year >= yearStart &&
                    baby.Year <= yearEnd
              orderby baby.Year ascending
              select baby;
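The three queries above assume a BabyInfo type that the talk never shows; a minimal sketch might look like the following (the field names are inferred from the query and are an assumption, not part of the deck):

// Hypothetical shape of the BabyInfo records queried above.
public class BabyInfo
{
    public string Name  { get; set; }   // baby name, compared against queryName
    public string State { get; set; }   // state, compared against queryState
    public int    Year  { get; set; }   // birth year, filtered by yearStart/yearEnd
}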
PartitionedTable<T>: core data structure for DryadLINQ
Scale-out, partitioned container for .NET objects
Derives from IQueryable<T> and IEnumerable<T>; ToPartitionedTable() extension methods
DryadLINQ operators consume and produce PartitionedTable<T>
DryadLINQ generates code to serialize/deserialize your .NET objects
Underlying storage can be a partitioned file, a partitioned SQL table, or a cluster filesystem
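A rough sketch of the round trip, reusing the hypothetical BabyInfo type above and assuming ToPartitionedTable() accepts local collections as it does for the file list on a later slide (the LoadBabiesFromCsv helper and file names are illustrative, not from the talk):

// Stage local .NET objects into a partitioned table on the cluster;
// DryadLINQ generates the serialization code for BabyInfo automatically.
IEnumerable<BabyInfo> localBabies = LoadBabiesFromCsv("babies.csv");   // hypothetical local source
localBabies.ToPartitionedTable("BabyInfo.pt");

// Later, any DryadLINQ query can consume the same table again.
PartitionedTable<BabyInfo> babies = PartitionedTable.Get<BabyInfo>("BabyInfo.pt");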
Partitioned File: file-based container for PartitionedTable<T>
A small metadata file lists the partition count and, for each partition, its index, size in bytes, and the compute node that holds it:

XCutput20a0fcfart
20
0,1855000,HPCMETAHN01
1,1630000,HPCA1CN13
2,1707500,HPCA1CN12
3,1828820,HPCA1CN22
4,1802140,HPCA1CN07
5,1741000,HPCA1CN08
6,1733980,HPCA1CN11
7,1762620,HPCA1CN06
8,1861300,HPCA1CN14
9,1807460,HPCA1CN17
10,1807560,HPCA1CN23
11,1768120,HPCA1CN20
12,1847220,HPCA1CN03
13,1729160,HPCA1CN16
14,1767500,HPCA1CN05
15,1781520,HPCA1CN04
16,1728480,HPCA1CN09
17,1802580,HPCA1CN18
18,1862380,HPCA1CN10
19,1762540,HPCA1CN21

The data itself is stored as one partition file per node, named with a sequential suffix: HPCMETAHN01Cutput20a0fcfart.00000000 through HPCA1CN21Cutput20a0fcfart.00000019 in this example.
A typical data-intensive query

var logs = PartitionedTable.Get<string>("weblogs.pt");

var logentries = from line in logs
                 where !line.StartsWith("#")
                 select new LogEntry(line);

var user = from access in logentries
           where access.user.EndsWith(@"vert")
           select access;

var accesses = from access in user
               group access by access.page into pages
               select new UserPageCount("jvert", pages.Key, pages.Count());

var htmAccesses = from access in accesses
                  where access.page.EndsWith(".htm")
                  orderby access.count descending
                  select access;

Step by step: go through logs and keep only lines that are not comments, parsing each line into a new LogEntry object; keep only entries that are accesses by jvert; group jvert's accesses by the page they refer to and count the occurrences per page; finally, sort the pages jvert has accessed by access frequency.
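The LogEntry and UserPageCount classes used above are not shown in the talk; based on how the query uses them, a hypothetical sketch could be (the log line format is an assumption):

// Hypothetical helper types inferred from the query above; the real classes may differ.
public class LogEntry
{
    public string user;   // the requesting user, e.g. "jvert"
    public string page;   // the page that was accessed

    public LogEntry(string line)
    {
        // Assumed log format: whitespace-separated fields with user and page columns.
        var fields = line.Split(' ');
        user = fields[0];
        page = fields[1];
    }
}

public class UserPageCount
{
    public string user;
    public string page;
    public int count;     // referenced as access.count in the query

    public UserPageCount(string user, string page, int count)
    {
        this.user = user; this.page = page; this.count = count;
    }
}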
Dryad parallel DAG execution
The query from the previous slide maps onto a Dryad dataflow graph: logs → logentries → user → accesses → htmAccesses → output, with each edge carrying the intermediate collection between processing vertices.
Query plan generation
Separation of the query from its execution context: add all the loaded assemblies as resources, eliminate references to local variables by partially evaluating all the expressions in the query, distribute objects used by the query, and detect impure queries when possible
Automatic code generation: object serialization code for Dryad channels, managed code for Dryad vertices
Static query plan optimizations: pipelining (composing multiple operators into one vertex), minimizing unnecessary data repartitions, and other standard DB optimizations
DryadLINQ query plan

Query 0
Output: file://hpcmetahn01Cutput7e651a4-38b7-490c-8399-f63eaba7f29a.pt
DryadLinq0.dll was built successfully.
Input:
        [PartitionedTable: file://weblogs.pt]
Super__1:
        Where(line => !(line.StartsWith(_)))
        Select(line => new logdemo.LogEntry(line))
        Where(access => access.user.EndsWith(_))
        DryadGroupBy(access => access.page, (k__0, pages) => new LinqToDryad.Pair<String,Int32>(k__0, pages.Count()))
        DryadHashPartition(e => e.Key, e => e.Key)
Super__12:
        DryadMerge()
        DryadGroupBy(e => e.Key, e => e.Value, (k__0, g__1) => new LinqToDryad.Pair<String,Int32>(k__0, g__1.Sum()))
        Select(pages => new logdemo.UserPageCount(_, pages.Key, pages.Count()))
XML representation, generated by DryadLINQ and passed to Dryad

<Query>
  <DryadLinqVersion>1.0.1401.0</DryadLinqVersion>
  <ClusterName>hpcmetahn01</ClusterName>
  ...
  <Resources>                          <!-- list of files to be shipped to the cluster -->
    <Resource>wrappernativeinfo.dll</Resource>
    <Resource>DryadLinq0.dll</Resource>
    <Resource>System.Threading.dll</Resource>
    <Resource>logdemo.exe</Resource>
    <Resource>LinqToDryad.dll</Resource>
  </Resources>
  <QueryPlan>                          <!-- vertex definitions -->
    <Vertex>
      <UniqueId>0</UniqueId>
      <Type>InputTable</Type>
      <Name>weblogs.pt</Name>
      ...
    </Vertex>
    <Vertex>
      <UniqueId>1</UniqueId>
      <Type>Super</Type>
      <Name>Super__1</Name>
      ...
      <Children>
        <Child>
          <UniqueId>0</UniqueId>
        </Child>
      </Children>
    </Vertex>
    ...
  </QueryPlan>
</Query>
DryadLINQ generated code, compiled at runtime; the assembly is passed to Dryad to implement the vertices

public sealed class DryadLinq__Vertex
{
    public static int Super__1(string args)
    {
        < . . . >
        DryadVertexEnv denv = new DryadVertexEnv(args, dvertexparam);
        var dwriter__2 = denv.MakeWriter(DryadLinq__Extension.FactoryType__0);
        var dreader__3 = denv.MakeReader(DryadLinq__Extension.FactoryString);
        var source__4 = DryadLinqVertex.DryadWhere(dreader__3, line => (!(line.StartsWith(@"#"))), true);
        var source__5 = DryadLinqVertex.DryadSelect(source__4, line => new logdemo.LogEntry(line), true);
        var source__6 = DryadLinqVertex.DryadWhere(source__5, access => access.user.EndsWith(@"vert"), true);
        var source__7 = DryadLinqVertex.DryadGroupBy(source__6, access => access.page,
            (k__0, pages) => new LinqToDryad.Pair<System.String,System.Int32>(k__0, pages.Count<logdemo.LogEntry>()),
            null, true, true, false);
        DryadLinqVertex.DryadHashPartition(source__7, e => e.Key, null, dwriter__2);
        DryadLinqLog.Add("Vertex Super__1 completed at {0}", DateTime.Now.ToString("MM/dd/yyyy HH:mm:ss.fff"));
        return 0;
    }

    public static int Super__12(string args)
    {
        < . . . >
    }
}
DryadLINQ query operators
Almost all the useful LINQ operators: Where, Select, SelectMany, OrderBy, GroupBy, Join, GroupJoin, Distinct, Concat, Union, Intersect, Except, Count, Contains, Sum, Min, Max, Average, Any, All, Skip, Take, Aggregate
Operators introduced by DryadLINQ: HashPartition, RangePartition, Merge, Fork
Dryad Apply: operates on sequences rather than items
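As a hedged illustration of the sequence-oriented Apply operator (the exact DryadLINQ signature is not shown in the talk; this sketch assumes Apply takes a function from one sequence to another, and the windowing logic is purely illustrative):

// Sketch only: a sliding two-item sum computed within each sequence handed to Apply.
// Assumed Apply shape, roughly:
//   IQueryable<R> Apply<T, R>(this IQueryable<T> source, Expression<Func<IEnumerable<T>, IEnumerable<R>>> f)
static IEnumerable<int> PairwiseSums(IEnumerable<int> items)
{
    int? previous = null;
    foreach (var x in items)
    {
        if (previous.HasValue) yield return previous.Value + x;   // needs the previous item, so it cannot be a per-item Select
        previous = x;
    }
}

// var numbers = PartitionedTable.Get<int>("numbers.pt");     // hypothetical input table
// var sums = numbers.Apply(seq => PairwiseSums(seq));         // runs once per sequence, not once per item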
MapReduce in DryadLINQ

MapReduce(source,       // sequence of Ts
          mapper,       // T -> Ms
          keySelector,  // M -> K
          reducer)      // (K, Ms) -> Rs
{
    var map = source.SelectMany(mapper);
    var group = map.GroupBy(keySelector);
    var result = group.SelectMany(reducer);
    return result;      // sequence of Rs
}
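A concrete generic wrapper matching this pseudocode, plus a word-count usage, might look like the sketch below (the helper name, the input path, and the whitespace tokenization are illustrative assumptions, not from the talk):

using System;
using System.Collections.Generic;
using System.Linq;
using System.Linq.Expressions;

public static class MapReduceSketch
{
    // Generic MapReduce over any IQueryable source, mirroring the three-step pseudocode above.
    public static IQueryable<R> MapReduce<T, M, K, R>(
        IQueryable<T> source,
        Expression<Func<T, IEnumerable<M>>> mapper,                   // T -> Ms
        Expression<Func<M, K>> keySelector,                           // M -> K
        Expression<Func<IGrouping<K, M>, IEnumerable<R>>> reducer)    // (K, Ms) -> Rs
    {
        var map = source.SelectMany(mapper);
        var group = map.GroupBy(keySelector);
        var result = group.SelectMany(reducer);
        return result;                                                // sequence of Rs
    }
}

// Hypothetical word-count usage over a partitioned table of text lines:
// var lines = PartitionedTable.Get<string>("lines.pt");
// var counts = MapReduceSketch.MapReduce(
//     lines,
//     line => line.Split(' '),                                          // mapper: line -> words
//     word => word,                                                     // key selector
//     g => new[] { new KeyValuePair<string, int>(g.Key, g.Count()) });  // reducer: (word, occurrences) -> count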
K-means in DryadLINQ

public static Vector NearestCenter(Vector v, IEnumerable<Vector> centers)
{
    return centers.Aggregate((r, c) => (r - v).Norm2() < (c - v).Norm2() ? r : c);
}

public static IQueryable<Vector> Step(IQueryable<Vector> vectors, IQueryable<Vector> centers)
{
    return vectors.GroupBy(point => NearestCenter(point, centers))
                  .Select(group => group.Aggregate((x, y) => x + y) / group.Count());
}

var vectors = PartitionedTable.Get<Vector>("vectors.pt");
IQueryable<Vector> centers = vectors.Take(100);
for (int i = 0; i < 10; i++)
{
    centers = Step(vectors, centers);
}
centers.ToPartitionedTable<Vector>("centers.pt");

public class Vector
{
    public double[] entries;
    [Associative]
    public static Vector operator +(Vector v1, Vector v2) { … }
    public static Vector operator -(Vector v1, Vector v2) { … }
    public double Norm2() { … }
}
Putting it all together: it's LINQ all the way down
Major League Baseball dataset: pitch-by-pitch data for every MLB game since 2007
47,909 pitch XML files (one for each pitcher appearance); 6,127 player XML files (one for each player)
Hash partition the input data files to distribute the work
LINQ to XML to shred the data; DryadLINQ to analyze the dataset
Load the dataset and partition it; define Pitch and Player classes

void StagePitchData(string[] fileList, string PartitionedFile)
{
    // partition the list of filenames across 20 nodes of the cluster
    var pitches = fileList.ToPartitionedTable("filelist")
                          .HashPartition((x) => (x), 20)
                          .SelectMany((f) => XElement.Load(f).Elements("atbat"))
                          .SelectMany((a) => a.Elements("pitch")
                              .Select((p) => new Pitch((string)a.Attribute("pitcher"),
                                                       (string)a.Attribute("batter"),
                                                       p)));
    pitches.ToPartitionedTable(PartitionedFile);
}

void StagePlayerData(string[] fileList, string PartitionedFile)
{
    var players = fileList.Select((p) => new Player(XElement.Load(p)));
    players.ToPartitionedTable(PartitionedFile);
}
Analyze the dataset with LINQ

IQueryable<Pitch> FindFastest(IQueryable<Pitch> pitches, int count)
{
    return pitches.OrderByDescending((p) => p.StartSpeed)
                  .Take(count);
}
Supports LINQ Joins

IQueryable<string> FindFastestPitchers(IQueryable<Pitch> pitches,
                                       IQueryable<Player> players,
                                       int count)
{
    return pitches.OrderByDescending((p) => p.StartSpeed)
                  .Take(count)
                  .Join(players,
                        (o) => o.Pitcher,
                        (i) => i.Id,
                        (o, i) => i.FirstName + " " + i.LastName)
                  .Distinct();
}
DryadLINQ on HPC Server
The DryadLINQ program runs on a client workstation: develop, debug, and run locally
When ToPartitionedTable() is called, the query expression is materialized (code generation, query plan, optimization) and a job is submitted to HPC Server
HPC Server allocates resources for the job and schedules a single task; this task is the Dryad Job Manager
The Job Manager then schedules additional tasks to execute the vertices of the DryadLINQ query
When the job completes, the client program picks up the output result and continues
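Seen from the client side, the flow above might look roughly like this sketch (FindFastest, Pitch, PartitionedTable.Get, and ToPartitionedTable come from the earlier slides; the file names are illustrative):

// Runs on the workstation. Building the query only constructs an expression tree;
// nothing touches the cluster yet.
var pitches = PartitionedTable.Get<Pitch>("pitches.pt");
var fastest = FindFastest(pitches, 10);

// Materializing the query triggers code generation, query-plan optimization, and
// submission of an HPC job whose first task is the Dryad Job Manager.
fastest.ToPartitionedTable("fastest.pt");

// Once the job has completed, the client can read the distributed output back.
foreach (var p in PartitionedTable.Get<Pitch>("fastest.pt"))
    Console.WriteLine("{0}: {1}", p.Pitcher, p.StartSpeed);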
Examples of DryadLINQ Applications
Data mining: analysis of service logs for network security, analysis of Windows Watson/SQM data, cluster monitoring and performance analysis
Graph analysis: accelerated PageRank computation, road network shortest-path preprocessing
Image processing: image indexing, decision tree training, epitome computation
Simulation: light-flow simulations for next-generation display research, Monte Carlo simulations for mobile data
eScience: machine learning platform for health solutions, astrophysics simulation
Ongoing Work
Advanced query optimizations: combination of static analysis and annotations, sampling execution of the query plan, dynamic query optimization
Incremental computation and real-time event processing
Global scheduling: dynamically allocate cluster resources between multiple concurrent DryadLINQ applications
Scale-out partitioned storage and pluggable storage providers
DryadLINQ on Azure
Better debugging, performance analysis, visualization, etc.
Additional Resources
Dryad and DryadLINQ: http://connect.microsoft.com/DryadLINQ (DryadLINQ source, Dryad binaries, documentation, samples, blog, discussion group, etc.)
PLINQ: available in the Parallel Extensions to .NET Framework 3.5 CTP and in .NET Framework 4.0 Beta 2
  http://msdn.microsoft.com/en-us/concurrency/default.aspx
  http://msdn.microsoft.com/en-us/magazine/cc163329.aspx
Windows HPC Server 2008: http://www.microsoft.com/hpc (download it, try it, we want your feedback!)
Questions?
YOUR FEEDBACK IS IMPORTANT TO US! Please fill out session evaluation forms online at MicrosoftPDC.com
Learn More On Channel 9 Expand your PDC experience through Channel 9. Explore videos, hands-on labs, sample code and demos through the new Channel 9 training courses. channel9.msdn.com/learn Built by Developers for Developers….