This session is a special one, not only because of the subject matter but also because of its set-up. It is split into two mini-sessions: one about new technologies and one introducing Parallel Data Warehouse.
New Technologies:
This part of the session is all about discovering the extremes of SQL Server.
First we will talk about SQL Server’s new in-memory technologies, the updatable columnstore and Hekaton, both of which push the boundaries of SMP machines far beyond what we thought possible 3 years ago.
SQL Server PDW:
With its new in-memory technologies, SQL Server pushes SMP machines far, but for some of us these boundaries are still too close for comfort.
So meet the scalable version of SQL Server, obliterating the limits of SMP machines. This is an introduction to SQL Server PDW, the next step in the (r)evolution of SQL Server, capable of running high-performance data warehouse queries on big data and even offering seamless integration with Hadoop using PolyBase.
20. Scale
• Standard
– Reliable SMB
– Needs maintenance hours
– Software only
– Scale up
– OLTP / small DWH
• Enterprise
– Reliable business-critical SMB
– Online maintenance, 24/7/365
– Software only
– Scale up
– OLTP / DWH up to 10’s of TB
• Fasttrack
– Reference architecture
– Based upon Enterprise edition
– Architecture (hardware and software)
– Scale up
– Data marts and small to midsize DWH
• PDW
– High-end MPP DWH
– High-end data marts and EDWs
– Appliance
– DWH scale out
– Up to PB’s
31. Load performance
[Bar chart: minutes needed to load 100 million rows (shorter is better), axis from 0 to 70; one bar each for PDW, DEV SQL -> SQL, and PROD Teradata -> SQL]
32. Data Pumps
• Reading 132 MB/s from disk = about 8 GB per minute
• That is roughly 2 DVDs per minute
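The arithmetic behind these figures can be checked with a short sketch. It assumes decimal units (1 GB = 1000 MB) and a single-layer DVD of 4.7 GB; neither assumption is stated on the slide, and the function names are only illustrative.

```python
# Back-of-the-envelope check of the data pump figures.
MB_PER_GB = 1000          # assumption: decimal units
DVD_CAPACITY_GB = 4.7     # assumption: single-layer DVD


def gb_per_minute(mb_per_second: float) -> float:
    """Convert a read rate in MB/s to GB per minute."""
    return mb_per_second * 60 / MB_PER_GB


def dvds_per_minute(mb_per_second: float, dvd_gb: float = DVD_CAPACITY_GB) -> float:
    """Express the same read rate as DVDs read per minute."""
    return gb_per_minute(mb_per_second) / dvd_gb


if __name__ == "__main__":
    rate = 132  # MB/s, the figure quoted on the slide
    print(f"{gb_per_minute(rate):.2f} GB per minute")
    print(f"{dvds_per_minute(rate):.2f} DVDs per minute")
```

Under these assumptions the rate comes out just below 8 GB per minute and a little under 2 DVDs per minute, so the slide's figures are rounded up.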
33. Scaling
• Scales linearly
– Demo PDW has only 2 units
• There is a PDW development edition
– but it is a developer appliance! An MSDN Ultimate subscription includes 1 PDW developer license.
34. Future Proof
• DWLoader is the fastest load mechanism
• Transformations can be done using CTAS statements
• Loading from remote server:
– Any remote server connected through the InfiniBand switch
– Multiple servers allowed
• PolyBase
– Ready for big data
• Can use existing SSIS
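To make the CTAS bullet concrete, here is a minimal sketch of a transformation expressed as CREATE TABLE AS SELECT. It runs against an in-memory SQLite database purely as a stand-in, because SQLite accepts the same basic statement shape; the table and column names are hypothetical. On PDW the statement would also carry a WITH clause describing how the new table is distributed, which is left out here because SQLite has no such concept.

```python
import sqlite3


def transform_with_ctas(conn: sqlite3.Connection) -> None:
    """Aggregate the staging table into a new table in a single statement."""
    conn.execute(
        """
        CREATE TABLE sales_by_region AS
        SELECT region, SUM(amount) AS total_amount
        FROM staging_sales
        GROUP BY region
        """
    )


def demo():
    conn = sqlite3.connect(":memory:")
    # Hypothetical staging table, as it might look right after a bulk load.
    conn.execute("CREATE TABLE staging_sales (region TEXT, amount INTEGER)")
    conn.executemany(
        "INSERT INTO staging_sales VALUES (?, ?)",
        [("EU", 100), ("EU", 50), ("US", 70)],
    )
    transform_with_ctas(conn)
    rows = conn.execute(
        "SELECT region, total_amount FROM sales_by_region ORDER BY region"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(demo())
```

The point of the pattern is that the data is loaded first and transformed inside the database afterwards, so the transformation runs where the data already lives.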
The world of data is changing. Let’s try to look into the near future, shall we? As soon as 2015, organizations will have to integrate high-value, diverse, and even completely new information types and sources, and will have to turn all of this into coherent information.
Regina Casonato et al., “Information Management in the 21st Century”
I am a Senior SQL Server trainer and Senior Consultant working for Kohera
Currently working as a SQL Server architect
I coach and train DBAs and developers.
Rich experience in both complex development and production environments
I’m specialised in tweaking and tuning in both virtual and physical environments
Security is currently a hugely underestimated issue.
Questions on social and web analytics
Example: What is my brand and product sentiment? How effective is my online campaign? Who am I reaching? How can I optimize or target the correct audience?
Questions that require connecting to live data feeds
Example: A large shipping company uses live weather feeds and traffic patterns to fine tune its ship and truck routes leading to improved delivery times and cost savings. Retailers analyze sales, pricing, economic, demographic, and live weather data to tailor product selections at particular stores and determine the timing of price markdowns.
Questions that require advanced analytics
Example: Financial firms use machine learning to build better fraud detection algorithms that go beyond the simple business rules involving charge frequency and location to also include an individual’s customized buying patterns, ultimately leading to a better customer experience.
Organizations that are able to take advantage of new technologies to ask and answer these new types of questions will be able to more effectively differentiate and derive new value for the business whether it is in the form of revenue growth, cost savings, or creating entirely new business models.
As we all know, the recession in 2008 dramatically impacted most organizations where, in some cases, significant cost cutting measures were put into place to control spending. This impacted IT and the CIO’s budget where spending was tightly controlled and in many cases dramatically lowered.
In 2012, Gartner did a survey with more than 2,000 CIOs and found that IT budgets will not increase dramatically from previous years. In the average case, IT showed flat budgets. However, even with this being the case, there is an expectation that technology’s role in the enterprise must provide more value than before.
This leaves IT facing two expectations at once: it must deliver more value (that is, actively contribute to the enterprise’s growth) while also helping to reduce or control costs. That’s why IT must address these tough challenges by amplifying its strategies and operations to do more with what it already has. CIOs need to be efficient in how they allocate their budgets so that they can amplify their value to the business.
Slow, inefficient and with a steep learning curve
Need for different systems
Data that is difficult to correlate
Doubtful results
SQL Server doesn’t scale
SQL Server has no DWH solution
SQL Server is like Access
SQL Server cannot handle TB-sized DWHs, let alone PB-sized ones
Databases are more and more becoming a dynamic environment
Availability: There are hardly any SLAs for databases in the cloud. To prepare and run effectively on the dynamic cloud environment, every database, regardless of its size, must run in a replicable setup, which is typically more complex and expensive.
Scalability: While scaling an application is pretty straightforward, scaling the database tier is more difficult.
Flexibility: Allowing you to add/remove resources to match your needs, with no need for over-provisioning or over-paying to prepare for any future peaks.
Overhead: Cloud IT operations are tedious, complex and often more cumbersome.
Expertise: Developers flock to the cloud – and with good reason. The flip side is that once the application gains momentum, running it effectively – and the DB in particular – requires a skill set not readily available for most developers. To allow developers to focus on their code rather than on the IT, the cloud ecosystem provides a myriad of off-the-shelf development platforms and cloud services to integrate with to streamline development and time-to-production.
Multi-tenancy: For cloud providers, PaaS, SaaS and other large customers that need to run thousands of databases simultaneously, multi-tenancy enables a cost-effective and operationally efficient framework.
SMP
PDW (MPP)
For those data warehousing clients with requirements that are multiterabyte, high-end decision support, requiring superior price/performance across significant numbers of users, massively parallel processing (MPP) is an operational necessity.
Although symmetric multiprocessing (SMP) is raising the price/performance bar and pushing the crossover point upward, SMP and MPP will coexist in a gray transition area at the high end; and clients will benefit from the choices and competition.
Many more data warehousing clients will be able to make use of standard SMP approaches than had previously been the case, but those with multiterabyte volumes combined with high-performance requirements (active data warehousing) will still require a special-purpose data warehouse server. Innovations in have reduced contention in SMP designs and have reduced the coordination costs
However, slope-of-one linear scalability across hundreds of processors still requires an MPP database.
The central trade-off between single image (SMP) and parallel processing (MPP) database warehousing servers is between ease of administration and scalability.
While MPP scales linearly over its nodes, troubleshooting so many processors can be an issue for administrators trained in the SMP world.
The performance cost of data movement through the high-speed switch (a defining characteristic of clustered hardware and MPP databases) can be significant, and data placement remains a critical success factor. This is true, but it is the required trade-off for high-performance results given complex queries against large volume points.
MPP offers superior scalability of computing power and throughput.
SMP offers the best price/performance, especially below the gray area.
MPP has an issue of leaving processing power unused throughout the day. (Consider a perfectly distributed MPP database. Now run a query that joins two tables on "date." Redistribution occurs, and more data get hashed to certain nodes. Real-time imbalance occurs.)
SMP overcomes this imbalance, as the software is able to allocate parallel tasks dynamically within a single node.
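The imbalance described above can be illustrated with a small sketch. It assumes a hypothetical 8-node system that places each row by hashing one column, with MD5 standing in for whatever hash function a real MPP engine uses; the row counts and dates are made up for the illustration.

```python
import hashlib
from collections import Counter


def node_for(key: str, node_count: int) -> int:
    """Pick a node by hashing the distribution key (MD5 is only a stand-in)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % node_count


def rows_per_node(keys, node_count: int):
    """Count how many rows land on each node."""
    counts = Counter(node_for(key, node_count) for key in keys)
    return [counts.get(node, 0) for node in range(node_count)]


def skew(counts) -> float:
    """Busiest node relative to the average; 1.0 means perfectly balanced."""
    return max(counts) / (sum(counts) / len(counts))


# Hypothetical fact table of 10,000 rows: every order id is unique,
# but 70% of the rows share a single busy date.
order_ids = [f"order-{i}" for i in range(10_000)]
order_dates = ["2013-12-24"] * 7_000 + [f"2013-11-{d:02d}" for d in range(1, 31)] * 100

if __name__ == "__main__":
    print("skew by order id:", round(skew(rows_per_node(order_ids, 8)), 2))
    print("skew by date:    ", round(skew(rows_per_node(order_dates, 8)), 2))
```

Hashing on the unique order id spreads the rows almost evenly, while redistributing on the date sends every row of the busy day to the same node, so that node is still working long after the others have finished.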
Columnstore provides dramatic performance
Updateable and clustered xVelocity columnstore
Stores data in columnar format
Memory-optimized for next-generation performance
Updateable to support bulk and/or trickle loading
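As a rough illustration of why a columnar format helps, the sketch below pivots rows into columns and applies run-length encoding, one common compression technique that works well on columns with many repeated values. It is not a description of how xVelocity works internally; the data and function names are hypothetical.

```python
from itertools import groupby


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    if not rows:
        return {}
    return {name: [row[name] for row in rows] for name in rows[0]}


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]


def column_sum(encoded):
    """Sum a numeric column directly from its run-length encoded form."""
    return sum(value * count for value, count in encoded)


# Hypothetical fact rows.
rows = [
    {"region": "EU", "quantity": 2},
    {"region": "EU", "quantity": 2},
    {"region": "EU", "quantity": 5},
    {"region": "US", "quantity": 5},
]

if __name__ == "__main__":
    columns = to_columns(rows)
    print(run_length_encode(columns["region"]))
    print(column_sum(run_length_encode(columns["quantity"])))
```

A query that only needs the quantity column touches that column alone, and it can aggregate over the compressed runs instead of visiting every row.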
The SQL Server connector for Apache Hadoop lets customers move large volumes of data between Hadoop and SQL Server while the SQL Server PDW connector for Apache Hadoop moves data between Hadoop and SQL Server Parallel Data Warehouse (PDW). These new connectors will enable customers to work effectively with both structured and unstructured data.
External tables and full SQL query access to data stored in Hadoop Distributed File System (HDFS)
HDFS bridge for direct and fully parallelized access to data in HDFS
Joining “on-the-fly” PDW data with data from HDFS
Parallel import of data from HDFS in PDW tables for persistent storage
Parallel export of PDW data into HDFS, including “round-tripping” of data
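The idea behind an external table and an on-the-fly join can be sketched conceptually as follows. A delimited string stands in for a file in HDFS, a schema is applied only at read time, and the result is joined with a relational table in an in-memory SQLite database. Everything here (file content, schema, table names) is hypothetical; on PDW this is done in T-SQL with an external table definition, and the reads run in parallel, which this single-process sketch does not attempt to show.

```python
import csv
import io
import sqlite3

# Hypothetical stand-in for a pipe-delimited file stored in HDFS.
CLICKSTREAM_FILE = "1|/home\n2|/cart\n1|/checkout\n"

# Schema applied at read time: (column name, type conversion).
CLICKSTREAM_SCHEMA = [("customer_id", int), ("url", str)]


def read_external(text, schema, delimiter="|"):
    """Impose column types on raw delimited text while reading it."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [
        tuple(cast(field) for (_name, cast), field in zip(schema, record))
        for record in reader
    ]


def join_with_customers(conn, clicks):
    """Join the externally read rows with a regular relational table."""
    conn.execute("CREATE TEMP TABLE clicks (customer_id INTEGER, url TEXT)")
    conn.executemany("INSERT INTO clicks VALUES (?, ?)", clicks)
    return conn.execute(
        """
        SELECT c.name, k.url
        FROM customers AS c
        JOIN clicks AS k ON k.customer_id = c.customer_id
        ORDER BY c.name, k.url
        """
    ).fetchall()


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT)")
    conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])
    rows = join_with_customers(conn, read_external(CLICKSTREAM_FILE, CLICKSTREAM_SCHEMA))
    conn.close()
    return rows


if __name__ == "__main__":
    print(demo())
```

The raw file stays where it is and only gets a structure when it is queried, which is the schema-on-read idea behind external tables.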
More specifically, Hadoop is a basic set of tools that help developers create applications spread across multiple CPU cores on multiple servers. It’s parallelism taken to an extreme.
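The programming model most associated with this kind of parallelism is MapReduce. Below is a minimal word-count sketch of that model in plain Python, with a thread pool standing in for the servers of a cluster; it does not use any Hadoop API, and the function names are only illustrative.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_chunk(lines):
    """Map step: count the words in one chunk of the input."""
    return Counter(word.lower() for line in lines for word in line.split())


def reduce_counts(left, right):
    """Reduce step: merge two partial counts."""
    return left + right


def word_count(lines, workers=4):
    """Split the input, count each chunk independently, then merge the results."""
    chunks = [lines[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce(reduce_counts, partials, Counter())


if __name__ == "__main__":
    print(word_count(["big data", "Big box", "data data"]))
```

Because each chunk is counted without looking at any other chunk, the map step can run on as many workers as there are chunks; only the final merge needs to see all the partial results.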
DATA WAREHOUSING IN HADOOP
If you need to work with big data, Hadoop is becoming the de facto answer. But once your data is in Hadoop, how do you query it?
If you need big data warehousing, look no further than Hive, a data warehouse built on top of Hadoop. Hive is a mature tool – it was developed at Facebook to handle their data warehouse needs. It’s best to think of Hive as an enterprise data warehouse (EDW).
Hive was designed to be easy for SQL professionals to use. Rather than write Java, developers write queries using HiveQL (based on ANSI SQL) and receive results as a table. As you’d expect from an EDW, Hive queries will take a long time to run; results are frequently pushed into tables to be consumed by reporting or business intelligence tools. It’s not uncommon to see Hive being used to pre-process data that will be pushed into a data mart or processed into a cube.
Analytics Platform System (APS) is simply a renaming of the Parallel Data Warehouse (PDW). It is not really a new product, but rather a name change due to a new feature in Appliance Update 1 (AU1) of PDW. That new feature is the ability to have an HDInsight region (a Hadoop cluster) inside the appliance.
So APS combines SQL Server and Hadoop into a single offering that Microsoft is touting as providing “big data in a box.”
Think of APS as the “evolution” of Microsoft’s current SQL Server Parallel Data Warehouse product. Using PolyBase, it now supports the ability to query data using SQL across the traditional data warehouse, plus data stored in a Hadoop region, whether in the appliance or a separate Hadoop Cluster.
APS is a no-compromise modern data warehouse solution that seamlessly combines a best-in-class relational database management system, in-memory technologies, Hadoop and cloud integration in a turnkey package built for Big Data analytics.
General details
All hosts run Windows Server 2012 Standard
All virtual machines run Windows Server 2012 Standard as a guest operating system
All fabric and workload activity happens in Hyper-V virtual machines
Fabric virtual machines, MAD01, and CTL share one server
Lower overhead costs especially for small topologies
PDW Agent runs on all hosts and all virtual machines and collects appliance health data on fabric and workload
DWConfig and Admin Console continue to exist
Minor extensions expose host-level information
Windows Storage Spaces handles mirroring and spares and enables use of lower cost DAS (JBODs) rather than SAN
PDW workload details
SQL Server 2012 Enterprise Edition (PDW build) control node and compute nodes for PDW workload
Storage details
Similar layout to V1
More files per filegroup
Larger number of spindles in parallel
Excel is one of the primary clients to enable big data analytics on Microsoft platforms. In Excel 2013, our primary BI tools are PowerPivot, a data-modeling tool, and Power View, a data-visualization tool, and they are built right into the software, no additional downloads required. This enables users of all levels to do self-service BI using the familiar interface of Excel.
Through a Hive Add-in for Excel, our HDInsight services easily integrate with the BI tools in Office 2013, allowing users to easily analyze massive amounts of structured or unstructured data with a very familiar tool.
In addition to Excel, Microsoft offers other client tools for interacting with Big Data: BI Professionals can use BI Developer Studio to design OLAP cubes or scalable PowerPivot models in SQL Server Analysis Services. Developers will continue using Visual Studio to develop and test MapReduce programs written in .NET. Finally, IT operators will manage their Hadoop clusters on HDInsight with the same System Center tools that they use today.
Direct parallel data access between PDW Compute Nodes and Hadoop Data Nodes
Support of all HDFS file formats
Introducing “structure” on the “unstructured” data
High-level goals for V2
Seamless Integration with Hadoop via regular T-SQL
Enhancing PDW query engine to process data coming from the Hadoop Distributed File System (HDFS)
Fully parallelized query processing for highly performing data import and export from HDFS
Integration with various Hadoop implementations
Hadoop on Windows Server, Hortonworks, and Cloudera
Both distributed systems
Parallel data access between PDW and Hadoop
Different goals and internal architecture
Combined power of Big Data integration