This document discusses using SQL-MapReduce (SQL-MR) for advanced analytical queries. SQL-MR allows parallelization of complex SQL queries using MapReduce functions. It simplifies architectures by eliminating the need for separate data warehouses, datamarts and cubes. SQL-MR enables new forms of analytics like deep, complex, operational and self-service analytics without restricting queries. The marriage of SQL and MapReduce offers significant potential by parallelizing analytical logic processing.
SQL-MapReduce for Advanced Analytics and Simplified Architecture
1. Using SQL-MapReduce for Advanced Analytical Queries, by Rick F. van der Lans, R20/Consultancy BV
2. What Did the Users Want? [Diagram: BI reports, production databases]
3. But What Did We Create? [Diagram: production database, ODS, data warehouse, datamart, cube]
4. Problems with Current DW Platforms
   45%  Poor query response
   40%  Can’t support advanced analytics
   39%  Inadequate data load speed
   37%  Can’t scale to large data volumes
   33%  Cost of scaling up is too expensive
   29%  Poorly suited to real-time or on demand workloads
   23%  Current platform is a legacy we must phase out
   23%  Can’t support data modeling we need
   21%  We need platform that supports mixed workloads
   20%  Can’t support large concurrent user count
   19%  Inadequate high availability
   16%  Inadequate support for in-memory processing
   16%  Inadequate support for web services and SOA
   15%  Current platform is 32-bit, and we need 64-bit
   14%  Current platform is SMP, and we need MPP
   13%  We need platform better suited to cloud or virtualization
   11%  Can’t secure the data properly
    4%  Other
    3%  No problems
   Source: P. Russom, ‘Next Generation Data Warehouse Platforms’, TDWI Best Practices Report, fourth quarter 2009.
5. Need for More Powerful Data Warehouse Platforms
   [Chart: planned replacement of the current DW platform. Categories: 2009, 2010, 2011, 2012, 2013, 2014, 2015 or later, no plans to replace. Values shown: 49%, 8%, 20%, 12%, 8%, 1%, 1%, 3%.]
   Source: P. Russom, ‘Next Generation Data Warehouse Platforms’, TDWI Best Practices Report, fourth quarter 2009.
6. New Forms of Analytics: advanced analytics, operational analytics, deep analytics, self-service analytics, complex analytics, automated analytics
7. Positioning of Advanced Analytics
   [Quadrant diagram. Vertical axis: complexity of analytical queries (low to high). Horizontal axis: database size (low to high). Quadrants: simple queries on small to medium size databases; complex queries on small to medium size databases; simple queries on large to ultra large databases; advanced analytics.]
8. Parallelization of SQL
   [Diagram: a database server with one master and several workers]
   SELECT * FROM CUSTOMERS WHERE LOCATION = 'New York'
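The master/worker picture can be sketched in a few lines of Python. This is only an illustration of the scatter-gather idea, not how any particular database server implements it: the partition layout, the sample rows, and the names `worker_scan` and `master_select` are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical CUSTOMERS rows, already split into one partition per worker.
PARTITIONS = [
    [{"id": 1, "location": "New York"}, {"id": 2, "location": "Boston"}],
    [{"id": 3, "location": "New York"}],
    [{"id": 4, "location": "Chicago"}],
]


def worker_scan(partition, location):
    """Worker: apply the WHERE predicate to its own partition only."""
    return [row for row in partition if row["location"] == location]


def master_select(partitions, location):
    """Master: send the same scan to every worker, then concatenate the results."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partial_results = list(
            pool.map(lambda part: worker_scan(part, location), partitions)
        )
    return [row for part in partial_results for row in part]


result = master_select(PARTITIONS, "New York")
print(result)
```

A simple filter like this parallelizes trivially because no worker needs to see another worker's rows.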
9. How Easy Is Parallelizing SQL Queries? (1)
   Example 1:
     SELECT ID, SALES_DATE, PRICE
     FROM SALES_RECORDS
     WHERE PRICE > 100
   Example 2:
     SELECT REGION_ID, SUM(PRICE)
     FROM SALES_RECORDS
     WHERE PRICE > 100
     GROUP BY REGION_ID
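Example 2 is still easy to parallelize because SUM can be computed as partial sums per worker that the master then adds up. The sketch below illustrates that idea under assumed data; the partitioning, the sample prices, and the function names `partial_sum` and `merge_sums` are hypothetical.

```python
from collections import defaultdict

# Hypothetical SALES_RECORDS rows, split over two workers.
SALES_PARTITIONS = [
    [{"region_id": 1, "price": 150}, {"region_id": 2, "price": 90}],
    [{"region_id": 1, "price": 200}, {"region_id": 2, "price": 120}],
]


def partial_sum(partition, min_price=100):
    """Worker: filter on PRICE and compute SUM(PRICE) per REGION_ID locally."""
    totals = defaultdict(int)
    for row in partition:
        if row["price"] > min_price:
            totals[row["region_id"]] += row["price"]
    return dict(totals)


def merge_sums(partials):
    """Master: add up the per-worker subtotals for each region."""
    merged = defaultdict(int)
    for partial in partials:
        for region_id, subtotal in partial.items():
            merged[region_id] += subtotal
    return dict(merged)


totals = merge_sums(partial_sum(part) for part in SALES_PARTITIONS)
print(totals)
```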
10. How Easy Is Parallelizing SQL Queries? (2)
   Example 3: Get all the flights to London for which another flight exists to London that leaves within an hour on the same day.
     SELECT *
     FROM DEPARTURES AS D1
     WHERE DESTINATION = 'London'
     AND DEPARTURE_TIME + 60 MINUTES >=
         (SELECT MIN(DEPARTURE_TIME)
          FROM DEPARTURES AS D2
          WHERE DESTINATION = 'London'
          AND D2.DEPARTURE_TIME > D1.DEPARTURE_TIME
          AND D2.DEPARTURE_DAY = D1.DEPARTURE_DAY)
     ORDER BY DEPARTURE_TIME
11. How Easy Is Parallelizing SQL Queries? (3)
   Example 4: Market basket analysis:
     SELECT A.PROD_DESC AS ITEM1, B.PROD_DESC AS ITEM2, C.PROD_DESC AS ITEM3, COUNT(*) AS CNT
     FROM (SELECT SF.STORE_ID, SF.REG_ID, SF.TRAN_NO, SF.ITEM_ID, SF.DT, PD.PROD_DESC, PD.PRICE
           FROM SALES_FACT SF INNER JOIN PRODUCT_DIM PD ON SF.ITEM_ID = PD.ITEM_ID) AS A,
          (SELECT SF.STORE_ID, SF.REG_ID, SF.TRAN_NO, SF.ITEM_ID, SF.DT, PD.PROD_DESC, PD.PRICE
           FROM SALES_FACT SF INNER JOIN PRODUCT_DIM PD ON SF.ITEM_ID = PD.ITEM_ID) AS B,
          (SELECT SF.STORE_ID, SF.REG_ID, SF.TRAN_NO, SF.ITEM_ID, SF.DT, PD.PROD_DESC, PD.PRICE
           FROM SALES_FACT SF INNER JOIN PRODUCT_DIM PD ON SF.ITEM_ID = PD.ITEM_ID) AS C
     WHERE A.STORE_ID = B.STORE_ID AND B.STORE_ID = C.STORE_ID AND A.STORE_ID = C.STORE_ID
     AND A.REG_ID = B.REG_ID AND B.REG_ID = C.REG_ID AND A.REG_ID = C.REG_ID
     AND A.TRAN_NO = B.TRAN_NO AND B.TRAN_NO = C.TRAN_NO AND A.TRAN_NO = C.TRAN_NO
     AND A.DT = B.DT AND B.DT = C.DT AND A.DT = C.DT
     AND A.ITEM_ID <> B.ITEM_ID AND A.ITEM_ID <> C.ITEM_ID AND B.ITEM_ID <> C.ITEM_ID
     GROUP BY A.PROD_DESC, B.PROD_DESC, C.PROD_DESC
     HAVING COUNT(*) > 1000
     ORDER BY COUNT(*) DESC;
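For comparison, the same market basket logic is short when written procedurally: group the rows per transaction, then count item triples. The Python sketch below is an illustration under assumptions, not a translation of the query. The sample rows and the names `baskets` and `count_triples` are hypothetical, and each unordered triple is counted once, whereas the SQL above returns every ordering of the three items as a separate row.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical joined rows: (store_id, reg_id, tran_no, dt, prod_desc).
SALES = [
    (1, 1, 100, "2010-01-05", "bread"),
    (1, 1, 100, "2010-01-05", "butter"),
    (1, 1, 100, "2010-01-05", "milk"),
    (1, 2, 101, "2010-01-05", "bread"),
    (1, 2, 101, "2010-01-05", "butter"),
    (1, 2, 101, "2010-01-05", "milk"),
    (1, 2, 101, "2010-01-05", "eggs"),
    (2, 1, 100, "2010-01-05", "bread"),
    (2, 1, 100, "2010-01-05", "milk"),
]


def baskets(rows):
    """Group distinct products per transaction (store, register, number, date)."""
    grouped = defaultdict(set)
    for store_id, reg_id, tran_no, dt, prod_desc in rows:
        grouped[(store_id, reg_id, tran_no, dt)].add(prod_desc)
    return grouped


def count_triples(rows, threshold):
    """Count item triples bought together; keep those with count > threshold."""
    counts = Counter()
    for items in baskets(rows).values():
        # Each transaction can be processed independently of all others.
        counts.update(combinations(sorted(items), 3))
    return [(triple, n) for triple, n in counts.most_common() if n > threshold]


frequent = count_triples(SALES, threshold=1)
print(frequent)
```

Because every transaction is handled on its own, the per-transaction step can run in parallel once the rows are partitioned by transaction; only the final counts have to be merged.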
12. Declarativeness and Storage Independency
   Declarativeness: the developer only has to program what has to be done, not how it should be done.
   Storage independency: the language should hide how data is physically stored and how it is accessed.
13. Advantages of Two Properties
   Productivity: less code has to be written.
   Maintainability: less code written means less code to maintain.
   Flexibility: changes to the storage layer can be made without changing the SQL code in the reports.
14. Different Types of SQL Functions
   Built-in or user-defined:
     SELECT FLIGHT, TRUNCATE(DEPARTURE_TIME, MINUTES)
     FROM DEPARTURES AS D1
     WHERE BANK_HOLIDAY(DEPARTURE_TIME) = 1
   Scalar or table:
     SELECT AVG(DURATION)
     FROM LAST_FIVE_ROWS(DEPARTURES)
   Pure SQL, procedural, or external
   Simple or complex
15. MapReduce
   MapReduce is a programming model introduced by Google.
   It is aimed at processing requests on large data sets where the processing can be distributed over a large number of nodes working in parallel.
   It has two steps, Map and Reduce: Map is like Select, Reduce is like Group-by.
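The Select/Group-by analogy can be made concrete with a minimal, single-process sketch of the programming model. It is not Google's implementation and it runs nothing in parallel; the `map_reduce` helper, the sample departures, and the flight codes are assumptions for illustration only.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run mapper on every record, group emitted values by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):  # Map: filter and project, like SELECT
            groups[key].append(value)      # Shuffle: bring equal keys together
    # Reduce: aggregate the values of each key, like GROUP BY
    return {key: reducer(key, values) for key, values in groups.items()}


# Hypothetical DEPARTURES rows.
departures = [
    {"flight": "F1", "destination": "London", "duration": 75},
    {"flight": "F2", "destination": "London", "duration": 85},
    {"flight": "F3", "destination": "Paris", "duration": 80},
    {"flight": "F4", "destination": "Paris", "duration": 60},
]


def mapper(row):
    """Emit (destination, duration) for flights longer than 70 minutes."""
    if row["duration"] > 70:
        yield row["destination"], row["duration"]


def reducer(destination, durations):
    """Average duration per destination."""
    return sum(durations) / len(durations)


avg_duration = map_reduce(departures, mapper, reducer)
print(avg_duration)
```

In a distributed setting the mapper calls run on many nodes at once and each key's group is reduced on one node, which is where the parallelism comes from.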
16. Aster Data’s SQL-MapReduce (1)
   SQL-MR is a set of built-in and user-defined external table functions.
   Example:
     SELECT *
     FROM GET_NEXT_FLIGHT_1HR (ON DEPARTURES PARTITION BY DESTINATION)
     WHERE DESTINATION = 'London'
     ORDER BY DEPARTURE_TIME
   All SQL-MR function processing is parallelized, including complex group-by operations and time-series analytics.
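The slides do not show the body of GET_NEXT_FLIGHT_1HR, so the following Python sketch is only a guess at the kind of logic such a function could contain, based on Example 3. It does not use Aster Data's actual SQL-MR API: `run_partitioned` is a hypothetical stand-in for PARTITION BY, the sample rows are invented, and departure times are assumed to be minutes since midnight.

```python
from collections import defaultdict

# Hypothetical DEPARTURES rows; departure_time is minutes since midnight.
DEPARTURES = [
    {"flight": "F1", "destination": "London", "departure_day": 1, "departure_time": 540},
    {"flight": "F2", "destination": "London", "departure_day": 1, "departure_time": 590},
    {"flight": "F3", "destination": "London", "departure_day": 1, "departure_time": 700},
    {"flight": "F4", "destination": "London", "departure_day": 2, "departure_time": 720},
    {"flight": "F5", "destination": "Paris", "departure_day": 1, "departure_time": 600},
    {"flight": "F6", "destination": "Paris", "departure_day": 1, "departure_time": 650},
]


def get_next_flight_1hr(partition_rows):
    """Emit flights followed by a later flight within 60 minutes on the same day."""
    ordered = sorted(
        partition_rows, key=lambda r: (r["departure_day"], r["departure_time"])
    )
    for i, current in enumerate(ordered):
        for following in ordered[i + 1:]:
            if following["departure_day"] != current["departure_day"]:
                break  # no later flight left on this day
            gap = following["departure_time"] - current["departure_time"]
            if gap > 0:  # first strictly later flight decides the outcome
                if gap <= 60:
                    yield current
                break


def run_partitioned(rows, partition_key, function):
    """Stand-in for PARTITION BY: apply the function to each partition separately."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[partition_key]].append(row)
    output = []
    for partition_rows in partitions.values():  # a real engine would run these in parallel
        output.extend(function(partition_rows))
    return output


matches = run_partitioned(DEPARTURES, "destination", get_next_flight_1hr)
london_flights = sorted(
    (r for r in matches if r["destination"] == "London"),
    key=lambda r: r["departure_time"],
)
print([r["flight"] for r in london_flights])
```

Partitioning by destination means each partition can be sorted and scanned once, instead of running a correlated subquery for every row.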
17. Aster Data’s SQL-MapReduce (2)
   An SQL-MR function can contain the most complex analytical logic.
   SQL programmers don’t need to learn a new language: functions can be written in Java, C++, Python, and many more.
   The SQL statements invoking SQL-MR functions are still declarative and storage-independent; the functions themselves are not.
   SQL-MR functions are usable by any BI tool that supports SQL.
31. Business Advantages of SQL-MR: simplification of architecture, deep analytics, complex analytics, operational analytics, self-service analytics, no forbidden queries
33. Conclusions
   The analytical and reporting demands are increasing.
   Most environments already have problems with performance.
   The marriage of SQL and MapReduce offers an enormous potential: parallelizing the processing of analytical logic.
35. Questions & Answers
   Rick van der Lans, R20 Consultancy, e-mail: rick@r20.nl, website: http://www.r20.nl
   Stephanie McReynolds, Director of Product Marketing, Aster Data, e-mail: smcreyno@asterdata.com
   For more information on Aster Data: http://www.asterdata.com