
The Shifting Landscape of Data Integration

Enterprises and organizations of every industry and scale are working to leverage data to achieve their strategic objectives, whether those are to be more profitable, effective, risk-tolerant, prepared, sustainable, or adaptable in an ever-changing world. Data has exploded in volume during the last decade as humans and machines alike produce it at an exponential pace, and exciting technologies have emerged around that data to expand what we can do with it.
Behind this data revolution, forces are at work causing enterprises to shift the way they leverage data and accelerating the demand for leverageable data. Organizations (and the climates in which they operate) are becoming more and more complex. They are also becoming increasingly digital and, thus, dependent on how data informs, transforms, and automates their operations and decisions. With increased digitization comes an increased need for both scale and agility at scale.
In this session, we undertake the ambitious goal of evaluating the current vendor landscape and assessing which platforms have made, or are in the process of making, the leap to this new generation of Data Management and integration capabilities.



The Shifting Landscape of Data Integration

  1. The Shifting Landscape of Data Integration
     Presented by: William McKnight, “#1 Global Influencer in Data Warehousing” (Onalytica), President, McKnight Consulting Group (Inc 5000)
     @williammcknight | www.mcknightcg.com | (214) 514-1444
  2. The Shifting Landscape of Data Integration: a vendor perspective from Matillion. © All rights reserved. Matillion 2021
  3. Speakers: Paul Lacey, Sr. Director Product Marketing, Matillion
  4. New Challenges in the Cloud: From 3 Vs to 3 Ds
  5. Cloud Data Challenges (survey chart). Source: Calming the Storm with Cloud Data Management, Matillion & IDC webinar, April 2021, n=401
  6. IDC Sees Shift From the 3 Vs to the 3 Ds: with cloud migration, the challenge moves from Volume, Velocity, and Variety to data that is Dynamic, Distributed, and Diverse. Source: Calming the Storm with Cloud Data Management, Matillion & IDC webinar, April 2021
  7. About Matillion: A Modern Solution to Modern Challenges
  8. Low-Code, No-Compromise: bringing the power of the cloud to modern data challenges with the flexibility of cloud-native data integration
  9. About Matillion: Delivering transformative value
     • Pay only for what you use: achieve rapid returns on cloud investments that can be paid forward across the business
     • Easy to use: get things done faster in the cloud with low-code, visual, repeatable design patterns
     • Built for the enterprise: easily scale for massive data volumes and stay within enterprise security requirements
     • Built for the cloud: leverage the full power of the cloud with push-down ELT and native cloud platform integrations
     Cloud data platforms we support: (shown as logos)
  10. Our Products (product overview)
  11. More info: matillion.com
  12. Thank You! Matillion #TeamGreen | matillion.com
  13. William McKnight
     • Dog owner
     • Fitness: current national age-group champion, Deka.Fit and Hyrox Pro
     • Piano
  14. Data, Processes, People, Privacy, Problems, and Projects Over Time (diagram): operational systems, cloud stores, and cloud apps feed enterprise analytic data, cubes or marts, and transform tools/scripts serving reports, spreadsheets, dashboards, data science, and artificial intelligence; around the data sit executives, managers, users, IT staff, data owners, data and process stewards, governance, compliance, audit, and finance, along with business terms, ops documents, and legacy COBOL reports
  15. Why So Many Data Stores?
  16. Performance
     • Performance is a critical point of interest when selecting an analytics platform.
     • To measure data warehouse performance, we use similarly priced configurations across data warehouse competitors.
     • Usually when people say they care about performance, the ultimate metric is price/performance.
     • The realities of creating fair tests can be overwhelming for many shops; it is a task that is usually underestimated.
  17. Cost Predictability and Transparency
     • The cost profile options for cloud databases are straightforward if you accept the defaults for simple workload or proof-of-concept (POC) environments.
     • Initial entry costs and inadequately scoped environments can artificially lower expectations of the true costs of jumping into a cloud data warehouse environment.
     • For some platforms, you pay for compute resources as a function of time, but you also choose the hourly rate based on certain enterprise features you need.
     • With other platforms, you pay for bytes processed and the underlying architecture is unknown; the environment scales automatically without affecting price. There is also a cost-per-hour flat rate, where you would need to estimate how long your queries take to run to completion in order to predict costs.
     • Customers need to analyze current workloads, performance, and concurrency and project those into realistic pricing on alternative platforms. A rough sketch of that projection appears below.
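A minimal sketch of that projection in Python, comparing a bytes-processed pricing model against a flat hourly one. Every figure below (query counts, scan sizes, rates) is invented for illustration and should be replaced with measured workload numbers and the vendors' actual rate cards:

```python
# All figures are invented for illustration; substitute measured workload
# numbers and real rate-card prices before drawing any conclusions.
monthly_queries = 50_000
avg_tb_scanned_per_query = 0.2
cluster_hours_per_month = 24 * 30      # always-on provisioned cluster

per_tb_rate = 5.00                     # $/TB scanned (bytes-processed model)
per_hour_rate = 8.00                   # $/hour (flat hourly model)

bytes_model_cost = monthly_queries * avg_tb_scanned_per_query * per_tb_rate
hourly_model_cost = cluster_hours_per_month * per_hour_rate

print(f"bytes-processed model: ${bytes_model_cost:,.0f}/month")
print(f"flat hourly model:     ${hourly_model_cost:,.0f}/month")
```

Even rough arithmetic like this shows which pricing model is sensitive to which workload characteristic: the bytes-processed model tracks query volume and scan size, while the hourly model tracks cluster uptime.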
  18. Administration
     • Overall costs, time, and storage and compute resources are affected by the simplicity of configuration and overall use.
     • The platform should have embraced a self-sufficiency model for its customers and be well into the process of automating repetitive tasks.
     • Easy administration starts with setup that is a simple process of asking for basic information and providing helpful guidance for selecting the storage and node configurations.
     • The data store should support mission-critical business applications with minimal downtime.
  19. Optimizer. You may need:
     • Conditional parallelism, and visibility into what causes variations in the parallelism deployed
     • Dynamic and controllable prioritization of resources for queries
     • Time and requirements for optimal queries, such as compiling indexes or updating statistics
     • Workload isolation capabilities
  20. Concurrency Scaling
     • If the database has concurrency limitations, designing around them is difficult at best and limits effective data usage.
     • If the data store automatically scales up to overcome concurrency limitations, this may be costly if the data warehouse charges by compute node.
     • If the data store charges per user, costs will also increase as the data warehouse is put to more use in the company.
     • The workload may need linear scaling in overall query workload performance as concurrent users are added; a quick harness for checking that is sketched below.
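A quick way to sanity-check linear concurrency scaling is to replay the same query mix at increasing user counts and watch throughput. In this sketch, run_query is a hypothetical stand-in for a real query against your platform; with a fixed sleep it scales roughly linearly, which is the baseline a real test would be compared against:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_query(_):
    # Stand-in for a real query against your warehouse (hypothetical);
    # replace with a call through your platform's driver.
    time.sleep(0.1)

for users in (1, 4, 16, 64):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=users) as pool:
        list(pool.map(run_query, range(users * 10)))  # 10 queries per user
    elapsed = time.perf_counter() - start
    print(f"{users:>3} concurrent users: {users * 10 / elapsed:.0f} queries/s")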
  21. Resource Elasticity
     • You may need the ability to scale up and down and take advantage of the elastic compute and storage capabilities of the cloud, public or private, without disruption or delay.
     • The more the customer needs to be involved in resource determination and provisioning, the less elastic, and less modern, the solution is.
     • One thing to watch for in elastic scaling is keeping the amount of money spent under the customer’s control.
  22. Machine Learning
     • Today, some data query languages need to be extended to include machine learning, or firms may find the programming required too challenging to keep pace.
     • Data stores today may need to weave machine learning into their data processing workflows.
     • Vendors must accommodate and extend SQL to include machine learning functions and algorithms to expand the capabilities of those tools and users; one vendor's flavor of this is sketched below.
     • If your database does not include machine learning, there are many extra things to be concerned with; other components will be needed to complete the toolbox and get the job done.
     • Ideally, security for machine learning will be the same as database security.
     • The data warehouse also needs to be able to operate at scale, beyond sampling.
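As one concrete example of SQL extended with machine learning, BigQuery ML trains and applies models with plain SQL statements, so access control rides on the database's existing permissions, matching the "same security as the database" point above. The dataset, table, and column names below are invented:

```python
# BigQuery ML-style statements, held here as Python strings; the
# dataset, table, and column names are invented for illustration.
TRAIN_SQL = """
CREATE OR REPLACE MODEL analytics.churn_model
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
SELECT tenure_months, monthly_spend, support_tickets, churned
FROM analytics.customers
"""

PREDICT_SQL = """
SELECT customer_id, predicted_churned
FROM ML.PREDICT(MODEL `analytics.churn_model`,
                (SELECT * FROM analytics.new_customers))
"""
```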
  23. Data Storage Format Alternatives
     • Cloud object storage is relatively inexpensive, making data storage at high scale affordable.
     • On-premises, specialized private cloud storage options such as Pure Storage FlashBlade tend to offer similar data type storage flexibility.
     • To take full advantage of the elasticity of the cloud without driving up costs, data warehouse compute and storage need to be scaled separately.
     • To take full advantage of the many types of data available, such as Apache ORC, Apache Parquet, JSON, and Apache Avro, modern data warehouses need to be able to analyze that data without moving or altering it.
     • A unified analytics warehouse that supports these various data formats means you can query them directly, without expanding the hierarchical complex data types into a standard tabular structure for analysis.
     • You should also be able to import data directly from these formats.
     • The ability to join data for analysis between the various internal and external data formats provides the highest level of analytic flexibility; a small illustration follows.
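A small illustration of querying and joining these formats directly, using pandas (which reads Parquet via pyarrow); the file names and the join column are invented:

```python
import pandas as pd

# Hypothetical files in two different formats; pyarrow must be
# installed for pandas to read Parquet.
orders = pd.read_parquet("orders.parquet")          # columnar Parquet
events = pd.read_json("events.json", lines=True)    # newline-delimited JSON

# Join the external formats directly, with no conversion step first.
joined = orders.merge(events, on="customer_id", how="inner")
print(joined.head())
```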
  24. Challenges of Today’s Environment
     • Complexity
     • Context
     • Cost
  25. Data Integration Product #Goals
     • Cloud native
     • Intelligently driven automation that suggests and generates new data pipelines between source and target without manual mapping or design
     • Data orchestration:
       – managing the ebb and flow of data throughout the ecosystem,
       – determining what data is analyzed/stored at the origin,
       – determining what data is moved upstream, and at what granularity and state,
       – moving, integrating, and updating data, metadata, master data, and machine learning models as evolution happens at the core
     • Trust created through transparency and understanding: modern data management and integration platforms give organizations holistic views of their enterprise data and deep, thorough lineage paths showing how critical data traces back to a trusted, primary source
     • Able to dynamically, and even automatically, scale to meet the increasing complexity and concurrency demands of query executions
  26. Capabilities for Cloud Data Integration
  27. Data Lineage Stakeholders and Use Cases
     • Legislative Bodies and Compliance
     • Data Discovery Initiatives
     • Impact Analysis
     • Audit Functions
     • Data Quality Programs
     • Project Sponsors and Managers
     • Data Architects
     • Training and Onboarding Staff
     • Root Cause Analysis
  28. Data Lineage Requirements
     • Graphical Representation
     • Impact Analysis
     • Root Cause Analysis
     • Extend to Non-Standard or Custom Sources
     • General Accessibility to Lineage and Metadata
  29. Data Integration Requirements
     • Comprehensive Native Connectivity
     • Multi-Latency Data Ingestion
     • Data Integration Patterns
     • Data Quality and Data Governance
     • Data Cataloging and Metadata Management
     • Enterprise Trust
     • Artificial Intelligence and Automation
     • Ecosystem and Multi-Cloud
  30. Data Prep Requirements
     • Designed as a low-setup, no-configuration, no-maintenance data pipeline that lifts data from operational sources and delivers it to a modern cloud data warehouse
     • Well suited to popular cloud applications like Salesforce, Stripe, and Marketo
     • Transformations are SQL-based rules written by business users and set to run on a schedule or as new data becomes available (see the sketch after this list)
     • Look for:
       – a migrate function to move or clone business logic and resources, change data capture, queue messaging, and pub-sub
       – dynamic ETL (such as job variables and iterators) plus data orchestration and staging mechanisms (such as functions to make real-time adjustments if jobs run long or fail)
       – automation in pipeline building
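A minimal sketch of that load-then-transform pattern. SQLite stands in for the cloud warehouse purely to keep the example self-contained and runnable; the table names, rows, and the SQL rule are all invented:

```python
import sqlite3

# Land raw rows untouched, then apply a business-owned SQL rule.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_payments (id INTEGER, amount_cents INTEGER, status TEXT)")
conn.executemany("INSERT INTO raw_payments VALUES (?, ?, ?)",
                 [(1, 1250, "ok"), (2, 990, "failed"), (3, 480, "ok")])

# SQL-based transformation rule of the kind a business user might own;
# a real platform would run this on a schedule or on data arrival.
conn.execute("""
CREATE TABLE payments AS
SELECT id, amount_cents / 100.0 AS amount_usd
FROM raw_payments
WHERE status = 'ok'
""")
print(conn.execute("SELECT * FROM payments").fetchall())
```

The point is the shape: raw data is loaded first, and the transformation lives in the warehouse as SQL rather than in hand-built extract code.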
  31. Application Programming Interfaces
     • A ubiquitous method and de facto standard of communication among modern information technologies.
     • APIs have begun to replace older, more cumbersome methods of information sharing with lightweight endpoints.
     • Due to the popularity and proliferation of APIs and microservices, the need has arisen to manage the multitude of services a company relies on, both internal and external.
       – APIs themselves vary greatly in protocols, methods, authorization/authentication schemes, and usage patterns.
       – Additionally, IT needs greater control over their hosted APIs, such as rate limiting, quotas, policy enforcement, and user identification, to ensure high availability and prevent abuse and security breaches. (A toy rate limiter is sketched after this list.)
       – Exposing APIs opens the door to many partners who can co-create and expand the core platform without knowing anything about the underlying technology.
     • Organizations depend on these services to be properly managed, with high performance and availability.
       – Many companies experience workloads of more than 1,000 transactions per second on their API endpoints.
       – For these organizations, the need for performance is as great as the need for management, because they rely on these API transaction rates to keep up with the speed of their business.
       – Accordingly, many companies are looking for a solution to load balance across redundant API endpoints and enable high transaction volumes.
       – A financial institution handling 1,000 transactions per second sees roughly 86 million API calls in a single 24-hour day.
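To make the rate-limiting point concrete, here is a toy token-bucket limiter of the sort an API gateway applies per client. The 1,000 TPS figure echoes the slide; the burst size and the class itself are invented for illustration:

```python
import time

class TokenBucket:
    """Toy per-client rate limiter of the kind an API gateway enforces."""
    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec          # refill rate (tokens/second)
        self.capacity = burst             # maximum burst size
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        # Refill proportionally to elapsed time, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False      # request should get HTTP 429 (too many requests)

bucket = TokenBucket(rate_per_sec=1000, burst=50)   # ~1,000 TPS, small bursts
print(sum(bucket.allow() for _ in range(100)), "of 100 burst requests admitted")
```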
  32. API & Microservices Ecosystem (diagram): public, private-external, and private-internal APIs connecting external partners with connected apps and data. Over 20,000 public APIs, according to https://www.programmableweb.com/apis/directory
  33. Platform Architecture (diagram): clients call API endpoints through a load balancer (Nginx, HAProxy, ELB, etc.) fronting a tier of API nodes backed by Postgres or Cassandra
  34. API Requirements
     • Performance: good for high-performance workloads (>1,000 TPS)
     • Reliability: all workloads completed with 100% message completion
     • Complexity: multiple plugins enabled
  35. Streaming Solutions
  36. ETL is insufficient for this combination:
     • Data platforms operating at an enterprise-wide scale
     • A high variety of data sources
     • Real-time/streaming data
     ETL forces either real-time loading without scalability or scalability with batch loading.
       – Data produced from numerous sources is a torrent of flowing information that needs to be timestamped, dispatched, and even duplicated (to protect against data loss).
       – A postman is needed to distribute data from message senders to receivers, at the right place at the right time.
  37. Real-Time Data
     • A.k.a. messaging, live feeds, real-time, event-driven
     • Comes in continuously and often quickly, which is why we call it streaming data
     • Needs special attention and can be of immense value, but only if we are alerted in time
     • Foundation for artificial intelligence: stream data forms the core of data for artificial intelligence
  38. Enter Message-Oriented Middleware (a.k.a. streaming and message queuing technology)
     • Messages can be any kind of data, wrapped in a neat package with a very simple header as a bow on top.
     • Messages are sent by “producers” (systems, sensors, or devices that generate the messages) toward a “broker.”
     • A broker does not process the messages; instead, it routes them into queues according to the information in the message header or its own routing process.
     • “Consumers” then retrieve the messages from the queues to which they subscribe (although sometimes messages are pushed to consumers rather than pulled).
     • The consumers open the messages and perform some kind of action on them. A toy version of this flow is sketched below.
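A toy, in-memory version of that producer/broker/consumer flow. Real middleware adds persistence, replication, acknowledgements, and push delivery; this sketch only shows the routing role the broker plays, and all the names in it are invented:

```python
import queue

class Broker:
    """Toy broker: routes each message to every queue subscribed to its topic."""
    def __init__(self):
        self.topics: dict[str, list[queue.Queue]] = {}

    def subscribe(self, topic: str) -> queue.Queue:
        q = queue.Queue()
        self.topics.setdefault(topic, []).append(q)
        return q

    def publish(self, topic: str, message: bytes) -> None:
        # The broker does not process messages; it only routes them.
        for q in self.topics.get(topic, []):
            q.put(message)

broker = Broker()
inbox = broker.subscribe("orders")               # consumer subscribes to a queue
broker.publish("orders", b'{"order_id": 42}')    # producer sends toward the broker
print(inbox.get())                               # consumer pulls and acts on it
```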
  39. Performance and scalability in streaming
     • Storage: ability to retain varying volumes of messages for varying lengths of time
     • Throughput: high, sustainable rate of message processing
     • Latency: fast, consistent responsiveness for publishing and consumption
     • Operations: minimal operational burden for scaling, tuning, and monitoring
  40. Streaming Architecture (diagram): a streaming platform at the center, fed by apps, change logs, and streaming data pipelines; serving messaging, stream processing, and request-response consumers such as the data warehouse, data lake, web services APIs, and big data analysis; supported by an IDE/developer GUI, parallel tools, multi-threaded math libraries, and cluster support
  41. Apache Kafka
     • Open source streaming platform developed at LinkedIn
     • A distributed publish-subscribe messaging system that maintains feeds of messages called topics
       – Publishers write data to topics and subscribers read from topics
       – Kafka topics are partitioned and replicated across multiple nodes in your Hadoop cluster
     • Enables “source to sink” data pipelines
     • Kafka messages are simple byte arrays that can store objects in virtually any format, with a key attached to each message; JSON is common
     • E and L in ETL through the Kafka Connect API
     • T in ETL through the Kafka Streams API
     • Fault-tolerant
     • DIY
     A minimal producer/consumer example follows.
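A minimal producer/consumer sketch using the third-party kafka-python client; the broker address, topic name, key, and payload are all assumptions for illustration:

```python
import json
from kafka import KafkaProducer, KafkaConsumer  # third-party kafka-python package

# Producer: keyed, JSON-encoded messages, as described on the slide.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",          # assumed broker address
    key_serializer=lambda k: k.encode(),
    value_serializer=lambda v: json.dumps(v).encode(),
)
producer.send("page-views", key="user-42", value={"page": "/pricing"})
producer.flush()

# Consumer: subscribes to the topic and reads from its partitions.
consumer = KafkaConsumer(
    "page-views",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v),
)
for record in consumer:
    print(record.key, record.value)
    break   # sketch only; a real consumer keeps iterating
```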
  42. Apache Pulsar
     • Originally developed at Yahoo
     • Began its incubation at Apache in late 2016
     • Had been in production at Yahoo for three years prior, utilized in popular services and applications like Yahoo! Mail, Finance, Sports, Flickr, Gemini Ads, and Sherpa
     • Follows the publisher-subscriber (pub-sub) model, with the same producers, topics, and consumers as some of the aforementioned technologies
     • Uses built-in multi-datacenter replication
     • Architected for multi-tenancy, using the concepts of properties and namespaces
       – A property could represent all the messaging for a particular team, application, product vertical, etc.
       – A namespace is the administrative unit where security, replication, configuration, and subscriptions are managed
       – At the topic level, messages are partitioned and managed by brokers using a user-defined routing policy, such as single, round robin, hash, or custom, granting further transparency and control in the process
     A minimal client example follows.
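A minimal sketch using the third-party pulsar-client package. Note how the topic URL carries the multi-tenancy described above (persistent://tenant/namespace/topic); the tenant, namespace, topic, subscription name, and broker address are invented:

```python
import pulsar  # third-party pulsar-client package

client = pulsar.Client("pulsar://localhost:6650")   # assumed broker address

# The topic path encodes tenant ("property") and namespace.
topic = "persistent://marketing/campaigns/clicks"

producer = client.create_producer(topic)
producer.send(b"click:ad-17")

consumer = client.subscribe(topic, subscription_name="analytics")
msg = consumer.receive()
print(msg.data())
consumer.acknowledge(msg)   # ack so the broker can discard the message
client.close()
```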
  43. Workloads are distinguished by:
     • The number of topics
     • The size of the messages being produced and consumed
     • The number of subscriptions per topic
     • The number of producers per topic
     • The rate at which producers produce messages (per second)
     • The size of the consumer’s backlog (in gigabytes)
     • The total duration of the test (in minutes)
     These parameters are captured as a small structure below.
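One way to keep benchmark runs comparable across platforms is to record each run's parameters as a small structure; the class and all field values below are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class StreamingWorkload:
    """Parameters that distinguish one streaming benchmark workload from another."""
    topics: int
    message_size_bytes: int
    subscriptions_per_topic: int
    producers_per_topic: int
    produce_rate_per_sec: int
    backlog_gb: int
    duration_minutes: int

burst_test = StreamingWorkload(topics=10, message_size_bytes=1024,
                               subscriptions_per_topic=2, producers_per_topic=4,
                               produce_rate_per_sec=50_000, backlog_gb=100,
                               duration_minutes=15)
```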
  44. Key Takeaways & Application to the Enterprise
     • An enterprise has many different types of data stores, as well as many data stores of the same type; this extends to clouds.
     • The reasons for this are many:
       – Performance
       – Cost Predictability and Transparency
       – Administration
       – Optimizer
       – Concurrency
       – Resource Elasticity
       – Machine Learning Capabilities
       – Data Storage Format Alternatives
     • It is fully expected that enterprise environments will be heterogeneous, with a primary vendor as well as other vendors.
     • APIs have begun to replace older, more cumbersome methods of information sharing with lightweight endpoints.
     • Streaming and message queuing have lasting value to organizations.
     • Streaming and messaging will be able to meet the real-time data volume, variety, and timing requirements of the coming years.
  45. The Shifting Landscape of Data Integration
     Presented by: William McKnight, “#1 Global Influencer in Data Warehousing” (Onalytica), President, McKnight Consulting Group (Inc 5000)
     @williammcknight | www.mcknightcg.com | (214) 514-1444
