HEDW-2020-Using-Data-Virtualization-to-Break-Down-Data-Silos.pptx

Mar 26, 2023

  1. Data Virtualization: Breaking Down Data Silos and Other Data Problems. Presenters: Richard Hanks (richard_hanks@byu.edu), Roger Tervort (roger_tervort@byu.edu)
  2. BRIGHAM YOUNG UNIVERSITY (BYU)
  3. Data Virtualization “Business demand for self-service access to real-time data from multiple data sources and in varied formats complicates data management.” - Gartner “Leveraging Data Virtualization in Modern Data Architectures”, April 5, 2019
  4. Data Virtualization. Distributed Data Management: technology is based on the execution of distributed data management. Flexibility: consumed by applications, query/reporting tools, message-oriented middleware, or other data management infrastructure components. Abstraction Layer: a layer of abstraction above the physical implementation of data, to simplify querying logic. Multiple Data Sources: used primarily for queries against multiple heterogeneous data sources, and federation of query results into virtual views. Virtual Integrated Views: data virtualization can be used to create virtualized and integrated views of data (in memory, rather than executing data movement). Sources: Gartner, Market Guide for Data Virtualization (16 Nov 2018); Leveraging Data Virtualization in Modern Data Architectures (5 Apr 2019)
  5. If we have the data… Why is it so hard to develop the report that I want? [Diagram labels: Data A, Data B, Data C, Data D; Excel, CSV, Oracle, MySQL, API, ETL, Manual Processes; Finding, Extracting, Cleaning, Formatting, Joining, Standardizing, Pivoting, Wrangling, Aggregating, Calculations, Field Types, Automating, Storing; Report A, Report B, Report A (version 2)]
  6. Poll: How prevalent are data silos at your school? 1 = We have nearly all data centralized; 3 = We have some central data; 5 = Most of our data is in data silos
  7. Academic Freedom Culture? Data Silos? Resulting Problems: • Lack of Centralized Data • Disparate Systems • Data Replication • Broken Data Pipelines • Overuse of ETL – just to move data • Data Security (Authentication / Authorization)
  8. Selection of DV Tools: www.dremio.com, www.denodo.com
  9. Benefits from Data Virtualization. OIT Managed Data; Enrollment Services; Department of Continuing Education; Center for Teaching and Learning; Library; Marriott School of Management. Single Point of Entry / Authorization. Tableau; Business Objects; Excel; PowerBI; Python; R; SQL; Other
  10. Benefits from Data Virtualization. OIT Managed Data; Enrollment Services; Department of Continuing Education; Center for Teaching and Learning; Library; Marriott School of Management. Tableau; Business Objects; Excel; PowerBI; Python; R; SQL; Other. Single Point of Entry / Authorization
  11. Benefits from Data Virtualization. Single Point of Entry / Authorization. Tableau; Business Objects; Excel; PowerBI; Python; R; SQL; Other. SIS Identity Registration Student Dimension Excel/CSV Data Virtual Views (Curated) Database Physical Layer Other Virtual Layer - Collibra (DSA) - Searchable (Dremio Catalog)
  12. 01 Reduction in ETL Development; 02 Reduction in Data Replication; 03 Flexible Data Pipeline for Data Science and Ad Hoc Analysis; 04 Quicker DSA Approval and Delivery; 05 Reduction in Large Tableau Data Refreshes; 06 Breakdown of Data Silos (departments still have their data)
  13. 07 Row / Column / Masking Data Security; 08 Addition of CSV, JSON, and some XLSX files; 09 Combining Data Sources (Oracle, MS SQL, MySQL, AWS, Mongo); 10 Curated Data Sets (General and Surgical); 11 Acceleration of Queries (Caching of Data); 12 Pre-Aggregation Queries (Cube-type OLAP)
  14. Case Studies – Real Life Examples
  15. Need: Library and Enrollment Services (ES) each need data the other group has. The Library stores library and patron usage data in MySQL, MongoDB, and Oracle. ES has student demographic data in Oracle (currently centralized and managed by IT). Both will need Data Sharing Agreements (DSA) and will need the data updated frequently. Old way: Extract MongoDB data to a flat file. Build ETL to combine data from MySQL, Oracle (Library), and Oracle (ES). Data is joined on common business keys. Estimated time to delivery: 3-4+ weeks (not including DSA). New way: Leave data in its place. Use Data Virtualization to create Virtual Data Sources in SQL that query all sources and combine and join the data. Change authorization for the new Virtual Data Sets. Estimated time to delivery: 2 days to 1 week (not including DSA).
  16. Need: General Studies needs an analysis of the order of courses taken to meet the Language of Learning requirement. Course data is available, but sequencing and analysis will be done in SAS. Output will be a CSV file, but it will need to be enriched with demographic data of students who took specific classes. Old way: Extract course data into CSV. Analyze data in SAS. Export the SAS result file to CSV. Load the CSV into Oracle. Enrich the SAS result data with other Oracle data. Use Tableau to deliver a dashboard of results. New way: Leave data in its place. Use Dremio to feed data into SAS via ODBC. Output results are stored to a NAS drive as CSV. Demographic data is added in a Virtual Data Set. Tableau points to the Dremio data set. Dremio becomes a data science sandbox.
  17. Need: A large campus department wanted to do a turnover analysis on its administrative and student employees. It needs 5+ years of data. No standard analysis process exists. Data is in PS Oracle and will be combined with the department's internal job descriptions and classifications. The HR Department was concerned about additional data in the same tables that would be “coming along for the ride.” Old way: Use ETL to create a custom table or custom extract into the department's databases. The department will perform its analysis in MATLAB. But how to update? New way: Leave data in its place. Use SQL in Dremio to query all sources and combine and join the data. MATLAB uses ODBC to query the Virtual Data Set for analysis. BONUS: The DSA was based on the Virtual Data Set rather than on multiple underlying Oracle tables. No data came along for the ride. Time to delivery: 2 weeks (including DSA).
  18. Need: Our Security Operations Center recently expanded its coverage to include additional Church-related academic institutions. One service it was offering was Threat and Federated Intelligence. Providing that same service to all campuses, with multiple heterogeneous systems, will be a challenge. Old way: Use Rundeck ETL, Python, and other tools to build automation from each system, and from those to S3. Somehow combine enriching data in Oracle with S3 (another ETL?). New way: Develop an event-driven microservices architecture. Microservices pull data from systems and save it to JSON files in AWS S3. Dremio makes each JSON file appear like a table, which is joined with other tables in Dremio to enrich the data. Data can feed reporting or other ad hoc analysis.
  19. Questions? richard_hanks@byu.edu, roger_tervort@byu.edu
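
The "new way" in the Library / Enrollment Services case study (leave data in place and define a virtual data set in SQL that joins the sources) can be sketched with Python's standard-library `sqlite3`. This is only an analogy, not Dremio or Denodo: two separate in-memory SQLite databases stand in for the Library and ES systems, and the table names, column names, view name, and sample rows are all hypothetical.

```python
import sqlite3


def build_virtual_layer() -> sqlite3.Connection:
    """Attach two independent stores and expose one joined view over them."""
    conn = sqlite3.connect(":memory:")
    # Each ATTACH of ':memory:' creates a separate database. They stand in
    # for the Library and Enrollment Services (ES) systems (hypothetical data).
    conn.execute("ATTACH DATABASE ':memory:' AS library")
    conn.execute("ATTACH DATABASE ':memory:' AS es")

    conn.execute("CREATE TABLE library.checkouts (student_id TEXT, item TEXT)")
    conn.execute("CREATE TABLE es.students (student_id TEXT, major TEXT)")
    conn.executemany(
        "INSERT INTO library.checkouts VALUES (?, ?)",
        [("s1", "Book A"), ("s1", "Book B"), ("s2", "Book C"), ("s3", "Book D")],
    )
    conn.executemany(
        "INSERT INTO es.students VALUES (?, ?)",
        [("s1", "Biology"), ("s2", "History"), ("s3", "Biology")],
    )

    # A TEMP view may reference tables in any attached database. It stores
    # only the query, so no rows are copied out of either source.
    conn.execute(
        """
        CREATE TEMP VIEW checkouts_by_major AS
        SELECT s.major AS major, COUNT(*) AS checkouts
        FROM library.checkouts AS c
        JOIN es.students AS s ON s.student_id = c.student_id
        GROUP BY s.major
        """
    )
    return conn


def checkouts_by_major(conn: sqlite3.Connection) -> list[tuple[str, int]]:
    """Query the view the way a reporting tool would query a virtual data set."""
    return conn.execute(
        "SELECT major, checkouts FROM checkouts_by_major ORDER BY major"
    ).fetchall()


conn = build_virtual_layer()
print(checkouts_by_major(conn))

# A new row in a source is visible on the next query, with no ETL reload.
conn.execute("INSERT INTO library.checkouts VALUES ('s2', 'Book E')")
print(checkouts_by_major(conn))
```

The join on `student_id` plays the role of the "common business keys" mentioned in the case study. A real data virtualization tool would additionally push queries down to MySQL, MongoDB, and Oracle and enforce authorization on the view, which this sketch does not attempt.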

Editor's notes

  1. (1) Single point of entry: authentication / authorization of data (row, column, masking); (2) Flexible tool (ODBC and direct connections); (3) Leave data at the source (less data replication)
  2. (4) Easier access to data across campus (via DSA with no data coming along for the ride); (5) Breakdown of existing silos; (6) Use flat files as data sources (CSV, Excel, JSON); (7) Virtual Data Warehouse (Enterprise View). Dimensions: Student, Faculty, Admin, Date, OU Structure, HR Structure, Colleges, Courses. Measures: GPA, Enrollments, Counts, Averages, Hours; (8) Curated data: data sets that make sense (some individualized). THIS is where we really help a lot of the Have Nots; (9) Work with Data Stewards to make “pre-approved” data sets; (10) Ability to search data (auto cataloging and tagging); (11) Reflections: data and aggregation query acceleration
  3. (12) Rapid prototyping: data proof of concept (avoid extensive ETL); (13) Queries across multiple database platforms, on-prem and cloud
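
As a rough illustration of what "row, column, masking" authorization at a single point of entry means, the sketch below applies all three to a small in-memory data set in plain Python. It does not use or represent any Dremio or Denodo feature; the `mask` and `secure_view` helpers, the policy parameters, and the sample records are hypothetical.

```python
from typing import Callable, Iterable, Iterator

Row = dict[str, str]


def mask(value: str, keep: int = 4) -> str:
    """Hide all but the last `keep` characters; hide short values entirely."""
    if len(value) <= keep:
        return "*" * len(value)
    return "*" * (len(value) - keep) + value[-keep:]


def secure_view(
    rows: Iterable[Row],
    columns: tuple[str, ...],
    masked: set[str],
    row_filter: Callable[[Row], bool],
) -> Iterator[Row]:
    """Yield only permitted rows and columns, masking sensitive values."""
    for row in rows:
        if not row_filter(row):
            continue  # row-level security: skip rows the consumer may not see
        # Column-level security: project only the allowed columns,
        # masking the ones flagged as sensitive.
        yield {
            col: mask(row[col]) if col in masked else row[col]
            for col in columns
        }


# Hypothetical source records.
STUDENTS = [
    {"student_id": "100200300", "college": "Business", "gpa": "3.7"},
    {"student_id": "100200301", "college": "Education", "gpa": "3.2"},
]

# A consumer cleared for Business students only, without GPA,
# and with student IDs masked.
business_view = list(
    secure_view(
        STUDENTS,
        columns=("student_id", "college"),
        masked={"student_id"},
        row_filter=lambda r: r["college"] == "Business",
    )
)
print(business_view)
```

Because the policy is applied where the view is defined, every consuming tool sees the same restricted result, which is the point of authorizing at the virtual layer rather than in each source system.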