
Azure Data Lake Compendium (Azure DataLake 大全)

A walkthrough of nearly every Azure Data Lake feature.

Published in: Technology


  1. 1. #azurejp https://www.facebook.com/dahatake/ https://twitter.com/dahatake/ https://github.com/dahatake/ https://daiyuhatakeyama.wordpress.com/ https://www.slideshare.net/dahatake/
  2. 2. “Volume” (massive scale), “Velocity” (speed), “Variety” (diversity). Internal assets, search and social, open data, collaboration, visualization.
  3. 3. Inbound Data, Buffered Ingest (message bus), Event Processing Logic, Event Decoration, Spooling/Archiving, Hot Store, Analytical Store, Curation, Dashboards/Reports, Exploration, Interactive, Data Movement / Sync
  4. 4. Modern Data Lifecycle: Ingest, Process, Store, Consume, Curation
  5. 5. Modern Data Lifecycle. Ingest: Event Hubs, IoT Hubs, Service Bus, Kafka. Process: HDInsight, ADLA, Storm, Spark, Stream Analytics. Store: ADLS, Azure Storage, Azure SQL DB, Azure SQL DW. Consume: ADLS, Azure DW, Azure SQL DB, HBase, Cassandra, Azure Storage, Power BI. Curation: Azure Data Factory, Azure ML.
  6. 6. Dashboards, Interactive Exploration. APIs also need to be considered.
  7. 7. Learn it right away
  8. 8. https://start.cortanaanalytics.com/
  9. 9. HDInsight, Analytics, Store. Hadoop as a Service. Big Data Query as a Service. Unlimited capacity, raw data, access control.
  10. 10. Azure Data Lake service: store and manage unlimited data; keep raw data; high-throughput, low-latency analytics jobs; security and access control. Azure Data Lake Store; HDInsight & Azure Data Lake Analytics.
  11. 11. https://azure.microsoft.com/ja-jp/regions/services/#
  12. 12. ADL Analytics Account. Links to ADL Stores; ADL Store Account (the default one). Job Queue, key settings: Max Concurrent Jobs, Max ADLAUs per Job, Max Queue Length. Links to Azure Blob Stores. U-SQL Catalog Metadata; U-SQL Catalog Data. ADLAU = Azure Data Lake Analytics Unit.
  13. 13. ON PREMISES / CLOUD. Massive Archive: import from on-prem HDFS. Active Incoming Data: continuously updated. “Landing Zone” Data Lake Store: copy with AzCopy. Data Lake Store / Data Lake Analytics: move data to the persistent store and save the structured datasets produced by jobs. DW (many instances): create structured data. CONSUMPTION: Machine Learning (run and validate machine learning); Web Portals; Mobile Apps; Power BI; experimentation and validation (A/B tests, tracking changes in customer behavior); Jupyter Data Science Notebooks.
  14. 14. Azure Data Lake Store: a hyperscale data repository for big data analytics. No limits on scale; stores every type of data in its native format; WebHDFS in the cloud; enterprise-grade security, access control, and encryption; optimized for analytics.
  15. 15. MapReduce, HBase transactions, HDFS applications, Hive queries. Azure HDInsight: Hadoop WebHDFS client, WebHDFS endpoint, WebHDFS REST API, ADL Store files. Azure Data Lake Store.
  16. 16. Local, ADL Store • Azure Portal • Azure PowerShell • Azure CLI • Data Lake Tools for Visual Studio • Azure Data Factory • AdlCopy tool. Azure Stream Analytics, Azure HDInsight Storm, Azure Data Factory, Apache Sqoop • Apache DistCp • Azure Data Factory • AdlCopy tool
  17. 17. Each file and directory is associated with an owner and a group. Files and directories carry read (r), write (w), and execute (x) permissions for the owner, members of the group, and other users. Fine-grained ACL (access control list) rules allow management by specifying user names or group names.
  18. 18. Azure Data Lake Store file: Block 1, Block 2, Block 2, … Backend Storage: six data nodes, each holding one block.
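As a rough illustration of the WebHDFS REST endpoint mentioned on this slide, the Python sketch below only builds a request URL. It assumes the store follows the standard WebHDFS layout (`/webhdfs/v1/<path>?op=...`) on the `<account>.azuredatalakestore.net` host that appears in the DistCp slide; the account name `mystore` and the helper `webhdfs_url` are hypothetical, and authentication (an Azure Active Directory token) is deliberately left out.

```python
from urllib.parse import urlencode, quote


def webhdfs_url(account, path, op, **params):
    """Build a WebHDFS-style URL for a Data Lake Store account (hypothetical helper)."""
    query = urlencode({"op": op, **params})
    # quote() keeps "/" intact, so nested folders survive unchanged.
    return (
        f"https://{account}.azuredatalakestore.net"
        f"/webhdfs/v1/{quote(path.lstrip('/'))}?{query}"
    )


# List a folder; "mystore" is a placeholder account name.
print(webhdfs_url("mystore", "/myfolder", "LISTSTATUS"))
```

Only the URL construction is shown; sending the request would additionally need a bearer token in the Authorization header.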
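The owner / group / other model with r, w, x bits and named-user ACL entries can be sketched in a few lines of Python. This is a simplified model for illustration only, not the service's actual evaluation algorithm (real POSIX-style ACLs also involve a mask, which is ignored here); the `Entry` class and `allowed` function are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """Hypothetical model of a file or directory with POSIX-style permissions."""
    owner: str
    group: str
    mode: str                                  # e.g. "rwxr-x---": owner, group, other
    acl: dict = field(default_factory=dict)    # named user -> "r--"-style string


def allowed(entry, user, groups, perm):
    """Return True if `user` (a member of `groups`) holds `perm` ('r', 'w' or 'x')."""
    if user == entry.owner:
        bits = entry.mode[0:3]
    elif user in entry.acl:                    # named-user ACL entry
        bits = entry.acl[user]
    elif entry.group in groups:
        bits = entry.mode[3:6]
    else:
        bits = entry.mode[6:9]
    return perm in bits


logs = Entry(owner="alice", group="analysts", mode="rwxr-x---", acl={"carol": "r--"})
print(allowed(logs, "alice", set(), "w"))         # owner may write
print(allowed(logs, "bob", {"analysts"}, "w"))    # group members may not write
print(allowed(logs, "carol", set(), "r"))         # named user may read
```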
  19. 19. AdlCopy /Source <Blob source> /Dest <ADLS destination> /SourceKey <Key for Blob account> /Account <ADLA account> /Units <Number of Analytics units>
  20. 20. hadoop distcp wasb://<container_name>@<storage_account_name>.blob.core.windows.net/example/data/gutenberg adl://<data_lake_store_account>.azuredatalakestore.net:443/myfolder
  21. 21. sqoop-import --connect "jdbc:sqlserver://<sql-database-server-name>.database.windows.net:1433;username=<username>@<sql-database-server-name>;password=<password>;database=<sql-database-name>" --table Table1 --target-dir adl://<data-lake-store-name>.azuredatalakestore.net/Sqoop/SqoopImportTable1
  22. 22. Azure Data Lake Analytics: an Apache YARN-based analytics service that can process data of any size. No limits on scale; U-SQL, a new language that adds the power of C# to the benefits of SQL; optimized for Data Lake Store; FEDERATED QUERY across Azure data services; enterprise-grade security, access control, and encryption; per-job billing and scale settings.
  23. 23. HDInsight: Java, Eclipse, Hive, etc.; a fully managed Hadoop cluster. Data Lake Analytics: C#, SQL & PowerShell; a fully managed distributed processing cluster, based on Dryad.
  24. 24. U-SQL Query Result Query Azure Storage Blobs Azure SQL in VMs Azure SQL DB Azure Data Lake Analytics
  25. 25. Management Operations Java C++.NET Node.js Data Operations WebHDFS Client LibWebHDFS
  26. 26. Management Operations Java C++.NET Node.js U-SQL Extensibility
  27. 27. Management: Data Lake Analytics account. Jobs: U-SQL job. Catalog: catalog (metadata). Management: Data Lake Store account. File System: upload, download, list, delete, rename, append (WebHDFS). Analytics, Store, Azure Active Directory.
  29. 29. #azurejp string subscriptionId = "83daeeec-f16c-47f8-9dc4-6ff1ebf9feb3"; var adlaClient = new DataLakeAnalyticsAccountManagementClient(tokenCreds); adlaClient.SubscriptionId = subscriptionId;
  31. 31. U-SQL: a new language for big data. Familiar to the many SQL & .NET developers; combines the power of declarative SQL with imperative C#; unifies structured, semi-structured, and unstructured data; runs distributed queries over all data.
  32. 32. @rows = EXTRACT name string, id int FROM "/data.csv" USING Extractors.Csv();
      OUTPUT @rows TO "/output.csv" USING Outputters.Csv();
      Annotations: rowsets; EXTRACT for files; OUTPUT; schema; types; inputs & outputs; keywords are UPPERCASE.
  33. 33. REFERENCE ASSEMBLY WebLogExtASM;
      @rs = EXTRACT UserID string, Start DateTime, End DateTime, Region string, SitesVisited string, PagesVisited string FROM "swebhdfs://Logs/WebLogRecords.csv" USING new WebLogExtractor();
      @result = SELECT UserID, (End.Subtract(Start)).TotalSeconds AS Duration FROM @rs ORDER BY Duration DESC FETCH 10;
      OUTPUT @result TO "swebhdfs://Logs/Results/top10.txt" USING Outputters.Tsv();
      Annotations: type definitions are the same as C# types; the schema is defined when data is extracted (read) from a file; WebLogExtractor is a custom function that parses a proprietary file format in Data Lake Store; End.Subtract(Start) is a C# function; a rowset is close to the concept of an intermediate table; Outputters.Tsv() is a function that writes TSV.
  34. 34. U-SQL Basics
      DECLARE @endDate DateTime = DateTime.Now;
      DECLARE @startDate DateTime = @endDate.AddDays(-7);
      @orders = EXTRACT OrderId int, Customer string, Date DateTime, Amount float FROM "/input/orders.txt" USING Extractors.Tsv();
      @orders = SELECT * FROM @orders WHERE Date >= @startDate AND Date <= @endDate;
      @orders = SELECT * FROM @orders WHERE Customer.Contains("Contoso");
      OUTPUT @orders TO "/output/output.txt" USING Outputters.Tsv();
      (1) DECLARE: declare variables with C# expressions. (2) EXTRACT: the schema is fixed when the file is read, and the result becomes a rowset. (3) Rowset expressions redefine the data. (4) OUTPUT: write the data to a file.
  35. 35. Using .NET code from U-SQL
      CREATE ASSEMBLY OrdersDB.Helpers FROM @"/Assemblies/Helpers.dll";
      REFERENCE ASSEMBLY OrdersDB.Helpers;
      @rows = SELECT OrdersDB.Helpers.Normalize(Customer) AS Customer, Amount AS Amount FROM @orders;
      @rows = PROCESS @rows PRODUCE OrderId string, FraudDetectionScore double USING new OrdersDB.Detection.FraudAnalyzer();
      OUTPUT @rows TO "/output/output.dat" USING new OrdersDB.CustomOutputter();
      (1) CREATE ASSEMBLY: register the assembly in the U-SQL Catalog. (2) REFERENCE ASSEMBLY: declare a reference to the assembly. (3) Call C# methods inside U-SQL expressions. (4) PROCESS: run per-row processing with a user-defined operator. (5) OUTPUT: write in a custom data format.
  36. 36. // WASB (requires setting up a WASB DataSource in ADLS)
      @rows = EXTRACT name string, id int FROM "wasb://…/data.csv" USING Extractors.Csv();
      // ADLS (absolute path)
      @rows = EXTRACT name string, id int FROM "adl://…/data.csv" USING Extractors.Csv();
      // ADLS (relative to default ADLS for an ADLA account)
      @rows = EXTRACT name string, id int FROM "/…/data.csv" USING Extractors.Csv();
      Default extractors: Extractors.Csv(), Extractors.Tsv()
  37. 37. @rows = EXTRACT name string, id int FROM "/file1.tsv", "/file2.tsv", "/file3.tsv" USING Extractors.Tsv();
  38. 38. @rows = EXTRACT name string, id int FROM "adl://…/data.csv" USING Extractors.Csv();
      OUTPUT @rows TO "/data.tsv" USING Outputters.Tsv();
      Default outputters: Outputters.Csv(), Outputters.Tsv()
  39. 39. @rows = EXTRACT Name string, Id int FROM "/file.tsv" USING Extractors.Tsv(skipFirstNRows:1);
      OUTPUT @rows TO "/output/docsamples/output_header.csv" USING Outputters.Csv(outputHeader:true);
      skipFirstNRows specifies the number of rows to skip.
  40. 40. https://msdn.microsoft.com/ja-jp/library/azure/mt621320.aspx
  41. 41. @rows = EXTRACT <schema> FROM "adl://…/data.csv" USING Extractors.Csv();
      DECLARE @inputfile string = "adl://…/data.csv";
      @rows = EXTRACT <schema> FROM @inputfile USING Extractors.Csv();
  42. 42. Variables are defined as @name.
      DECLARE @a string = "Hello World";
      DECLARE @b int = 2;
      DECLARE @c DateTime = System.DateTime.Parse("1979/03/31");
      DECLARE @d DateTime = DateTime.Now;
      DECLARE @e Guid = System.Guid.Parse("BEF7A4E8-F583-4804-9711-7E608215EBA6");
      DECLARE @f byte [] = new byte[] { 0, 1, 2, 3, 4};
  43. 43. #azurejp @departments = SELECT * FROM (VALUES (31, "Sales"), (33, "Engineering"), (34, "Clerical"), (35, "Marketing") ) AS D( DepID, DepName );
  44. 44. #azurejp @output = SELECT Start, Region, Duration + 1.0 AS Duration2 FROM @searchlog;
  45. 45. #azurejp @output = SELECT Start, Region, Duration FROM @searchlog WHERE Region == "en-gb";
  47. 47. #azurejp @output = SELECT Start, Region, Duration FROM @searchlog; @output = SELECT * FROM @output WHERE Region == "en-gb";
  48. 48. #azurejp @output = SELECT Start, Region, Duration FROM @searchlog WHERE (Duration >= 60) OR NOT (Region == "en-gb"); // NOTE: && and || perform short-circuiting @output = SELECT Start, Region, Duration FROM @searchlog WHERE (Duration >= 60) || !(Region == "en-gb");
  49. 49. #azurejp @output = SELECT Start, Region, Duration FROM @searchlog WHERE Start >= DateTime.Parse("2012/02/16") AND Start <= DateTime.Parse("2012/02/17");
  50. 50. @rs = SELECT FirstName, LastName, JobTitle FROM People WHERE JobTitle IN ("Engineer", "Designer", "Writer");
  51. 51. @output = SELECT Region, COUNT(*) AS NumSessions, SUM(Duration) AS TotalDuration, AVG(Duration) AS AvgDwellTime, MAX(Duration) AS MaxDuration, MIN(Duration) AS MinDuration FROM @searchlog GROUP BY Region;
  52. 52. // NO GROUP BY
      @output = SELECT SUM(Duration) AS TotalDuration FROM @searchlog;
      // WITH GROUP BY
      @output = SELECT Region, SUM(Duration) AS TotalDuration FROM @searchlog GROUP BY Region;
  53. 53. // find all the Regions where the total dwell time is > 200
      @output = SELECT Region, SUM(Duration) AS TotalDuration FROM @searchlog GROUP BY Region HAVING SUM(Duration) > 200;
  54. 54. #azurejp // Option 1 @output = SELECT Region, SUM(Duration) AS TotalDuration FROM @searchlog GROUP BY Region; @output2 = SELECT * FROM @output WHERE TotalDuration > 200; // Option 2 @output = SELECT Region, SUM(Duration) AS TotalDuration FROM @searchlog GROUP BY Region HAVING SUM(Duration) > 200;
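To make the equivalence of the two options concrete, here is a small Python sketch of what GROUP BY followed by HAVING computes. It only mirrors the semantics (aggregate per group first, then filter on the aggregate); the sample rows are invented for illustration and are not from the deck.

```python
from collections import defaultdict

# Hypothetical (Region, Duration) rows standing in for @searchlog.
searchlog = [
    ("en-us", 100), ("en-us", 150),
    ("en-gb", 90), ("en-gb", 120),
    ("en-fr", 60),
]

# GROUP BY Region with SUM(Duration)
totals = defaultdict(int)
for region, duration in searchlog:
    totals[region] += duration

# HAVING SUM(Duration) > 200 (equivalently: a WHERE over the aggregated rowset)
output = {region: total for region, total in totals.items() if total > 200}
print(output)  # {'en-us': 250, 'en-gb': 210}
```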
  55. 55. #azurejp // List the sessions in increasing order of Duration @output = SELECT * FROM @searchlog ORDER BY Duration ASC FETCH FIRST 3 ROWS; // This does not work (ORDER BY requires FETCH) @output = SELECT * FROM @searchlog ORDER BY Duration ASC;
  56. 56. #azurejp OUTPUT @output TO @"/Samples/Output/SearchLog_output.tsv" ORDER BY Duration ASC USING Outputters.Tsv();
  57. 57. Joins: • INNER JOIN • LEFT OUTER JOIN • RIGHT OUTER JOIN • FULL OUTER JOIN • CROSS JOIN • LEFT SEMI JOIN • RIGHT SEMI JOIN. Set operations: • EXCEPT ALL • EXCEPT DISTINCT • INTERSECT ALL • INTERSECT DISTINCT • UNION ALL • UNION DISTINCT
  58. 58. @rs1 = SELECT ROW_NUMBER() OVER (ORDER BY Start) AS RowNumber, Start, Region FROM @searchlog;
  59. 59. The following series of queries requires the intermediate rowset @irs. Source: WebLogRecords.txt in Azure Data Lake.
      @rs = EXTRACT UserID string, Start DateTime, End DateTime, Region string, SitesVisited string, PagesVisited string FROM "swebhdfs://Logs/WebLogRecords.txt" USING new WebLogExtractor();
      @irs = SELECT UserID, Region, (End.Subtract(Start)).TotalSeconds AS Duration FROM @rs;
      @irs:
      UserID    Region  Duration
      A$A892    en-us   10500
      HG54#A    en-us   22270
      YSD78@    en-us   38790
      JADI899   en-gb   18780
      YCPB(%U   en-gb   17000
      BHPY687   en-gb   16700
      BGFSWQ    en-bs   57750
      BSD805    en-fr   15675
      BSDYTH7   en-fr   10250
  60. 60. List each user ID together with the total time all users spent on the website. The sum is taken over a window covering all rows (SUM = 207715).
      @result = SELECT UserID, SUM(Duration) OVER() AS TotalDuration FROM @irs;
      Input @irs: the nine rows shown on slide 59.
      @result:
      UserID    TotalDuration
      A$A892    207715
      HG54#A    207715
      YSD78@    207715
      JADI899   207715
      YCPB(%U   207715
      BHPY687   207715
      BGFSWQ    207715
      BSD805    207715
      BSDYTH7   207715
  61. 61. List each user ID, its region, and the total time spent on the website per region. The sum is taken over a window per region.
      @total2 = SELECT UserID, Region, SUM(Duration) OVER( PARTITION BY Region ) AS RegionTotal FROM @irs;
      Input @irs: the nine rows shown on slide 59.
      @total2:
      UserID    Region  RegionTotal
      A$A892    en-us   71560
      HG54#A    en-us   71560
      YSD78@    en-us   71560
      JADI899   en-gb   52480
      YCPB(%U   en-gb   52480
      BHPY687   en-gb   52480
      BGFSWQ    en-bs   57750
      BSD805    en-fr   25925
      BSDYTH7   en-fr   25925
  62. 62. List the number of users per region.
      @result = SELECT UserID, Region, COUNT(*) OVER( PARTITION BY Region ) AS CountByRegion FROM @irs;
      Input @irs: the nine rows shown on slide 59.
      @result:
      UserID    Region  CountByRegion
      A$A892    en-us   3
      HG54#A    en-us   3
      YSD78@    en-us   3
      JADI899   en-gb   3
      YCPB(%U   en-gb   3
      BHPY687   en-gb   3
      BGFSWQ    en-bs   1
      BSD805    en-fr   2
      BSDYTH7   en-fr   2
  63. 63. Find the two users with the longest time on site in each region.
      @ranked = SELECT UserID, Region, ROW_NUMBER() OVER( PARTITION BY Region ORDER BY Duration DESC ) AS Rank FROM @irs;
      @result = SELECT UserID, Region, Rank FROM @ranked WHERE Rank <= 2;
      Input @irs: the nine rows shown on slide 59.
      @result:
      UserID    Region  Rank
      YSD78@    en-us   1
      HG54#A    en-us   2
      JADI899   en-gb   1
      YCPB(%U   en-gb   2
      BGFSWQ    en-bs   1
      BSD805    en-fr   1
      BSDYTH7   en-fr   2
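The window results on slides 60 to 63 can be re-derived from the nine @irs rows with a short Python sketch. It mirrors the semantics of SUM(...) OVER(), SUM(...) OVER(PARTITION BY Region), and ROW_NUMBER() with a descending sort; it is an illustration of what the windows compute, not of how the U-SQL runtime executes them.

```python
from collections import defaultdict

# The @irs rowset from the slides: (UserID, Region, Duration)
irs = [
    ("A$A892", "en-us", 10500), ("HG54#A", "en-us", 22270), ("YSD78@", "en-us", 38790),
    ("JADI899", "en-gb", 18780), ("YCPB(%U", "en-gb", 17000), ("BHPY687", "en-gb", 16700),
    ("BGFSWQ", "en-bs", 57750), ("BSD805", "en-fr", 15675), ("BSDYTH7", "en-fr", 10250),
]

# SUM(Duration) OVER(): one window spanning every row
total = sum(duration for _, _, duration in irs)

# SUM(Duration) OVER(PARTITION BY Region): one window per region
region_total = defaultdict(int)
by_region = defaultdict(list)
for user, region, duration in irs:
    region_total[region] += duration
    by_region[region].append((duration, user))

# ROW_NUMBER() OVER(PARTITION BY Region ORDER BY Duration DESC), keeping Rank <= 2
top2 = []
for region, rows in by_region.items():
    for rank, (duration, user) in enumerate(sorted(rows, reverse=True), start=1):
        if rank <= 2:
            top2.append((user, region, rank))

print(total)               # 207715
print(dict(region_total))  # en-us 71560, en-gb 52480, en-bs 57750, en-fr 25925
print(top2)
```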
  64. 64. #azurejp @a = SELECT Region, Urls FROM @searchlog; @b = SELECT Region, SqlArray.Create(Urls.Split(';')) AS UrlTokens FROM @a; @c = SELECT Region, Token AS Url FROM @b CROSS APPLY EXPLODE (UrlTokens) AS r(Token); @a @b @c CROSS APPLY EXPLODE ARRAY TYPE
  65. 65. @d = SELECT Region, ARRAY_AGG<string>(Url).ToArray() AS UrlArray FROM @c GROUP BY Region;
      @e = SELECT Region, string.Join(";", UrlArray) AS Urls FROM @d;
      Rowsets: @c, @d, @e
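The round trip across slides 64 and 65 (split a delimited string into rows, then aggregate the rows back into one string per region) can be mimicked in Python. The sample rows are invented, and the sketch preserves input order, which U-SQL's ARRAY_AGG is not assumed to guarantee.

```python
# Hypothetical (Region, Urls) rows standing in for @a.
a = [("en-us", "a.com;b.com"), ("en-gb", "c.com"), ("en-us", "d.com")]

# CROSS APPLY EXPLODE: one output row per array element
c = [(region, url) for region, urls in a for url in urls.split(";")]

# ARRAY_AGG ... GROUP BY Region: collect the elements back per region
d = {}
for region, url in c:
    d.setdefault(region, []).append(url)

# string.Join(";", UrlArray)
e = [(region, ";".join(urls)) for region, urls in d.items()]

print(c)  # [('en-us', 'a.com'), ('en-us', 'b.com'), ('en-gb', 'c.com'), ('en-us', 'd.com')]
print(e)  # [('en-us', 'a.com;b.com;d.com'), ('en-gb', 'c.com')]
```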
  66. 66. log_2015_10_01.txt log_2015_10_02.txt log_2015_10_03.txt log_2015_10_04.txt log_2015_10_05.txt log_2015_10_06.txt log_2015_10_07.txt log_2015_10_08.txt log_2015_10_09.txt log_2015_10_10.txt log_2015_10_11.txt
  67. 67. suffix: {suffix}. The file name is captured as the value.
  68. 68. date: {date:yyyy} {date:MM} {date:dd}. Data is read according to the format pattern: 4-character year, 2-character month, 2-character day.
  69. 69. date and suffix: {date:yyyy} {date:MM} {date:dd} {suffix}. The file set is filtered with a C# expression.
  70. 70. Each Catalog holds N Databases: Tables, Table-Valued Functions, Assemblies.
  71. 71. CREATE TABLE Customers( id int, key int, Customer string, Date DateTime, Amount float, INDEX index1 CLUSTERED (id) PARTITIONED BY (Date) DISTRIBUTED BY HASH (key) INTO 4 );
      Logical structure: rows grouped by partition (@date1, @date2, @date3) and hash bucket (H1 to H4). Physical structure: /catalog/…/tables/Guid(T)/ containing Guid(T.p1).ss, Guid(T.p2).ss, Guid(T.p3).ss.
      Clustering -> data locality. Partition -> lifecycle management. Distribution -> data locality + spreading of data.
  72. 72. U-SQL tables are managed as structured data, stored as files under “/catalog/database”. Do not read or write the Catalog folder directly.
  73. 73. CREATE FUNCTION MyDB.dbo.GetData() RETURNS @rows TABLE ( Name string, Id int ) AS BEGIN @rows = EXTRACT Name string, Id int FROM "/file.tsv" USING Extractors.Tsv(); RETURN; END;
      @rows is the rowset that returns the result; its schema is defined in the RETURNS clause. A single concept that replaces Scope views & functions: discoverable, schematized.
  74. 74. CREATE FUNCTION MyDB.dbo.GetData() RETURNS @rows AS BEGIN @rows = EXTRACT Name string, Id int FROM "/file.tsv" USING Extractors.Tsv(); RETURN; END;
      Here the schema is taken from the rowset.
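The file-set idea on slides 66 to 69 (parts of the file name become columns that can then be filtered) can be sketched in Python with a regular expression. The pattern below is an assumed equivalent of a path such as `log_{date:yyyy}_{date:MM}_{date:dd}.{suffix}`; the deck does not show the exact path, so that form and the helper `parse` are assumptions.

```python
import re
from datetime import date

# Assumed equivalent of "log_{date:yyyy}_{date:MM}_{date:dd}.{suffix}"
PATTERN = re.compile(r"log_(?P<yyyy>\d{4})_(?P<MM>\d{2})_(?P<dd>\d{2})\.(?P<suffix>\w+)$")


def parse(name):
    """Turn a file name into a row with virtual 'date' and 'suffix' columns."""
    m = PATTERN.match(name)
    if m is None:
        return None
    return {
        "file": name,
        "date": date(int(m["yyyy"]), int(m["MM"]), int(m["dd"])),
        "suffix": m["suffix"],
    }


# The eleven log files listed on slide 66
files = [f"log_2015_10_{day:02d}.txt" for day in range(1, 12)]
rows = [row for row in map(parse, files) if row is not None]

# Filter on the virtual column, as a WHERE clause would
selected = [r["file"] for r in rows if date(2015, 10, 3) <= r["date"] <= date(2015, 10, 5)]
print(selected)  # ['log_2015_10_03.txt', 'log_2015_10_04.txt', 'log_2015_10_05.txt']
```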
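DISTRIBUTED BY HASH (key) INTO 4 places each row in one of four buckets chosen by a hash of the key. The Python sketch below shows the general idea only; CRC32 is a stand-in, since the hash function actually used by the service is not stated in the deck, and `bucket` and `distribute` are hypothetical names.

```python
import zlib
from collections import defaultdict


def bucket(key, buckets=4):
    """Map a key to one of `buckets` hash buckets (CRC32 is a stand-in hash)."""
    return zlib.crc32(str(key).encode("utf-8")) % buckets


def distribute(keys, buckets=4):
    """Group keys by bucket; equal keys always land in the same bucket."""
    placed = defaultdict(list)
    for k in keys:
        placed[bucket(k, buckets)].append(k)
    return dict(placed)


keys = list(range(1, 21))
layout = distribute(keys)
for b in sorted(layout):
    print(b, layout[b])
```

Because the bucket depends only on the key, rows sharing a key stay together, which is the locality property the slide attributes to distribution.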
  75. 75. #azurejp // A Table @rs = SELECT * FROM MyDB.dbo.Customers; // A Table valued Function @rs = SELECT * FROM MyDB.dbo.GetData();
  76. 76. #azurejp @output = SELECT Region.ToUpper() AS NewRegion FROM @searchlog; @output= SELECT Start, Region, Start.DayOfYear AS StartDayOfYear FROM @searchlog;
  77. 77. #azurejp @output= SELECT Start, Region, ((double) Duration) AS DurationDouble FROM @searchlog;
  78. 78. // User-defined code is not supported
      DECLARE @myName string = MyHelper.GetMyName();
  80. 80. CREATE ASSEMBLY MyCode FROM @"/DLLs/Helpers.dll";
      REFERENCE ASSEMBLY MyCode;
      @rows = SELECT OrdersDB.Helpers.Normalize(Customer) AS CustN, Amount AS Amount FROM @orders;
      Upload the assembly beforehand and register it in the Catalog with CREATE ASSEMBLY; REFERENCE ASSEMBLY then declares the reference.
  81. 81. Query Azure Storage Blobs Azure SQL in VMs Azure SQL DB Azure Data Lake Analytics U-SQL Query Azure SQL Data Warehouse Azure Data Lake Storage
  82. 82. ADLA Account: youradlaaccount. SQL Server: yoursqlserver. SQL DB/DW: AdventureWorksLT. U-SQL DB: AdventureWorksLT_ExternalDB. Credential: AdventureWorksLT_Creds. External DataSource: AdventureWorksLT_DS. Table: Customers. External Table: CustomersExternal. Queries against the External Table run with a predefined schema; queries can also run without specifying a schema.
  83. 83. # If you have the username & password as strings
      $username = "username"
      $passwd = ConvertTo-SecureString "password" -AsPlainText -Force
      $creds = New-Object System.Management.Automation.PSCredential($username, $passwd)
      # OR: prompt user for credentials
      $creds = Get-Credential
  84. 84. New-AdlCatalogCredential -Account "youradlaaccount" ` -DatabaseName "AdventureWorksLT_ExternalDB" ` -DatabaseHost "yoursqlserver.database.windows.net" ` -Port 1433 ` -CredentialName "AdventureWorksLT_Creds" ` -Credential $creds
  85. 85. USE DATABASE [AdventureWorksLT_ExternalDB]; CREATE DATA SOURCE IF NOT EXISTS AdventureWorksLT_DS FROM AZURESQLDB WITH ( PROVIDER_STRING = "Database=AdventureWorksLT;Trusted_Connection=False;Encrypt=True", CREDENTIAL = AdventureWorksLT_Creds, REMOTABLE_TYPES = (bool, byte, sbyte, short, ushort, int, uint, long, ulong, decimal, float, double, string, DateTime) ); Remotable types
  86. 86. USE DATABASE [AdventureWorksLT_ExternalDB]; @customers = SELECT * FROM EXTERNAL AdventureWorksLT_DS LOCATION "[SalesLT].[Customer]"; OUTPUT @customers TO @"/SalesLT_Customer.csv" USING Outputters.Csv();
  87. 87. USE DATABASE [AdventureWorksLT_ExternalDB];
      CREATE EXTERNAL TABLE IF NOT EXISTS dbo.CustomersExternal ( CustomerID int?, NameStyle bool, Title string, FirstName string, MiddleName string, LastName string, Suffix string, CompanyName string, SalesPerson string, EmailAddress string, Phone string, PasswordHash string, PasswordSalt string, Rowguid Guid, ModifiedDate DateTime? ) FROM AdventureWorksLT_DS LOCATION "[SalesLT].[Customer]";
      Reading from the external table:
      USE DATABASE [AdventureWorksLT_ExternalDB];
      @customers = SELECT * FROM dbo.CustomersExternal;
      OUTPUT @customers TO @"/SalesLT_Customer.csv" USING Outputters.Csv();
  88. 88. U-SQL:
      CREATE CREDENTIAL IF NOT EXISTS dahatakeAdmin WITH USER_NAME = "dahatake", IDENTITY = "dahatakeSec";
      CREATE DATA SOURCE IF NOT EXISTS pubsSource FROM AZURESQLDB WITH ( PROVIDER_STRING = "Initial Catalog=pubs;Encrypt=True", CREDENTIAL = dahatakeAdmin );
      @result = SELECT * FROM EXTERNAL pubsSource EXECUTE @"SELECT * FROM dbo.employee";
      OUTPUT @result TO "/output/employee.csv" USING Outputters.Csv();
      PowerShell:
      Install-Module AzureRM
      Install-AzureRM
      Login-AzureRmAccount
      Get-AzureRmSubscription
      Set-AzureRmContext -SubscriptionId "<subscription ID>"
      $passwd = ConvertTo-SecureString "<password>" -AsPlainText -Force
      $mysecret = New-Object System.Management.Automation.PSCredential("dahatakeSec", $passwd)
      New-AzureRmDataLakeAnalyticsCatalogSecret -DatabaseName "master" -AccountName "dahatakeadla" -Secret $mysecret -Host "dahatakesql.database.windows.net" -Port 1433
      Credential object reference: https://msdn.microsoft.com/ja-jp/library/azure/mt621327.aspx
  89. 89. Input Data (K, A, B, C, D); REDUCE ON K; Partitions K0, K1, K2; one REDUCER (Python/R) per partition; Output for K0, K1, K2. Adding the extensions: REFERENCE ASSEMBLY [ExtPython]; REFERENCE ASSEMBLY [ExtR]. Special reducers run Python or R code in a distributed fashion: • Extension.Python.Reducer • Extension.R.Reducer. A standard DataFrame serves as the reducer's input and output. NOTE: the reducer does not include aggregation.
  90. 90. Python Extensions: use U-SQL for parallel, distributed processing; run Python code on many nodes; common Python libraries such as NumPy and pandas are available.
      REFERENCE ASSEMBLY [ExtPython];
      DECLARE @myScript = @"
      def get_mentions(tweet):
          return ';'.join( ( w[1:] for w in tweet.split() if w[0]=='@' ) )

      def usqlml_main(df):
          del df['time']
          del df['author']
          df['mentions'] = df.tweet.apply(get_mentions)
          del df['tweet']
          return df
      ";
      @t = SELECT * FROM (VALUES ("D1","T1","A1","@foo Hello World @bar"), ("D2","T2","A2","@baz Hello World @beer") ) AS D( date, time, author, tweet );
      @m = REDUCE @t ON date PRODUCE date string, mentions string USING new Extension.Python.Reducer(pyScript:@myScript);
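The `get_mentions` helper from the embedded script can be tried locally without U-SQL or pandas. The plain-Python sketch below applies it to the two sample rows from the slide and keeps only the `date` and `mentions` columns, which is what the REDUCE statement is declared to produce; the list comprehension merely stands in for the DataFrame handling inside `usqlml_main`.

```python
def get_mentions(tweet):
    """Join every @mention in a tweet with ';', dropping the leading '@'."""
    return ';'.join(w[1:] for w in tweet.split() if w[0] == '@')


# The two rows from the slide: (date, time, author, tweet)
rows = [
    ("D1", "T1", "A1", "@foo Hello World @bar"),
    ("D2", "T2", "A2", "@baz Hello World @beer"),
]

# Keep date, add mentions, drop time/author/tweet.
result = [(d, get_mentions(tweet)) for d, _time, _author, tweet in rows]
print(result)  # [('D1', 'foo;bar'), ('D2', 'baz;beer')]
```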
  91. 91. REFERENCE ASSEMBLY ImageCommon; REFERENCE ASSEMBLY FaceSdk; REFERENCE ASSEMBLY ImageEmotion; REFERENCE ASSEMBLY ImageTagging; REFERENCE ASSEMBLY ImageOcr; @imgs = EXTRACT FileName string, ImgData byte[] FROM @"/images/{FileName:*}.jpg" USING new Cognition.Vision.ImageExtractor(); // Extract the number of objects on each image and tag them @objects = PROCESS @imgs PRODUCE FileName, NumObjects int, Tags string READONLY FileName USING new Cognition.Vision.ImageTagger(); OUTPUT @objects TO "/objects.tsv" USING Outputters.Tsv(); Imaging
  92. 92. Text Analysis
      REFERENCE ASSEMBLY [TextCommon];
      REFERENCE ASSEMBLY [TextSentiment];
      REFERENCE ASSEMBLY [TextKeyPhrase];
      @WarAndPeace = EXTRACT No int, Year string, Book string, Chapter string, Text string FROM @"/usqlext/samples/cognition/war_and_peace.csv" USING Extractors.Csv();
      @sentiment = PROCESS @WarAndPeace PRODUCE No, Year, Book, Chapter, Text, Sentiment string, Conf double USING new Cognition.Text.SentimentAnalyzer(true);
      OUTPUT @sentiment TO "/sentiment.tsv" USING Outputters.Tsv();
  93. 93. • Object recognition (tags) • Face recognition, emotion recognition • JOIN processing: who are the happy people? Pipeline: Images, Objects, Emotions; filter, join, aggregate.
      REFERENCE ASSEMBLY ImageCommon;
      REFERENCE ASSEMBLY FaceSdk;
      REFERENCE ASSEMBLY ImageEmotion;
      REFERENCE ASSEMBLY ImageTagging;
      @objects = PROCESS MegaFaceView PRODUCE FileName, NumObjects int, Tags string READONLY FileName USING new Cognition.Vision.ImageTagger();
      @tags = SELECT FileName, T.Tag FROM @objects CROSS APPLY EXPLODE(SqlArray.Create(Tags.Split(';'))) AS T(Tag) WHERE T.Tag.ToString().Contains("dog") OR T.Tag.ToString().Contains("cat");
      @emotion_raw = PROCESS MegaFaceView PRODUCE FileName string, NumFaces int, Emotion string READONLY FileName USING new Cognition.Vision.EmotionAnalyzer();
      @emotion = SELECT FileName, T.Emotion FROM @emotion_raw CROSS APPLY EXPLODE(SqlArray.Create(Emotion.Split(';'))) AS T(Emotion);
      @correlation = SELECT T.FileName, Emotion, Tag FROM @emotion AS E INNER JOIN @tags AS T ON E.FileName == T.FileName;
  94. 94. 2015/08/23
  110. 110. https://docs.microsoft.com/ja-jp/azure/data-lake-analytics/data-lake-analytics-u-sql-programmability-guide
  111. 111. Batch Streaming Machine Learning
  112. 112. Job Front End, Job Scheduler, Compiler Service, Job Queue, Job Manager, U-SQL Catalog, YARN. Job submission; job execution; U-SQL Runtime; vertex execution.
  113. 113. U-SQL C# user code C++ system code Algebra other files (system files, deployed resources) managed dll Unmanaged dll Input script Compilation output (in job folder) Files Meta Data Service Deployed to vertices Compiler & Optimizer
  114. 114. A job is split into vertices; the vertex is the unit of execution. Example: Input, Output, Output; 6 stages, 8 vertices. Vertices are organized into stages: vertices in the same stage perform the same operation; they depend on vertices of the preceding stage; a single vertex may run for at most 5 hours. The job forms an acyclic graph (a graph without cycles).
  115. 115. On screen: Preparing, Queued, Running, Finalizing, Ended (Succeeded, Failed, Cancelled). State: New, Compiling, Queued, Scheduling, Starting, Running, Ended. Checks for free ADLAUs.
  116. 116. Progress, statistics
  117. 117. Read, Process, Store. Read: EXTRACT, SELECT from Azure Data Lake, Azure SQL DB, Azure Storage Blobs into a RowSet. Process: SELECT… FROM… WHERE… Store: INSERT, OUTPUT from a RowSet to Azure Data Lake, Azure Storage Blobs.
  118. 118. (1) CREATE TABLE LogRecordsTable (UserId int, Start DateTime, Region string, INDEX idx CLUSTERED (Region ASC) PARTITIONED BY HASH (Region));
      (2) INSERT INTO LogRecordsTable SELECT UserId, Start, Region FROM @rs;
      On insert, rows are hash-distributed across three extents based on the "Region" column: Extent 1 (Region = "en-us"), Extent 2 (Region = "en-gb"), Extent 3 (Region = "en-fr").
      (3) @rs = SELECT * FROM LogRecordsTable WHERE Region == "en-gb";
      The partitions are separate.
  119. 119. @rs1 = SELECT Region, COUNT(*) AS Total FROM @rs GROUP BY Region;
      @rs2 = SELECT Region, Total FROM @rs1 ORDER BY Total FETCH 100 ROWS;
      Plan for a table clustered by Region (Extents 1 to 4): Read, Partial agg / Full agg, Sort, Top 100.
      Plan for unstructured data (Extents 1 to 4): Read, Partial agg, Partition, Full agg, Sort, Top 100. Label in the diagram: expensive operation.
  120. 120. [Chart: population by state; x-axis 1 to 51, y-axis 0 to 40,000,000]
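The partial-aggregation / full-aggregation pattern named in these plans can be sketched in Python: each extent is counted on its own, the partial counts are merged, and only then is the result sorted and truncated. The extents and their contents are invented for illustration, and the sketch sorts in descending order as a "top N" would, which is an assumption about the intended query.

```python
from collections import Counter

# Hypothetical extents, each holding the Region column of some rows.
extents = [
    ["en-us", "en-gb", "en-us"],
    ["en-fr", "en-us"],
    ["en-gb", "en-gb", "en-fr"],
    ["en-us"],
]

# Partial agg: count per extent, independently (parallelizable)
partials = [Counter(extent) for extent in extents]

# Full agg: merge the partial counts per region
full = Counter()
for partial in partials:
    full.update(partial)

# Sort + Top 100 (ties broken by region name for a stable result)
top = sorted(full.items(), key=lambda kv: (-kv[1], kv[0]))[:100]
print(top)  # [('en-us', 4), ('en-gb', 3), ('en-fr', 2)]
```

When the data is already clustered by Region, every row of a region sits in one extent, so the merge step has nothing to combine across extents; that is the saving the clustered plan reflects.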
