Kudu as Storage Layer to Digitize Credit Processes

With HDFS and HBase, there are two different storage options available in the Hadoop ecosystem. Both have their strengths and weaknesses. However, neither HDFS nor HBase can be used universally for all kinds of workloads. Usually this leads to complex hybrid architectures. Kudu is a very versatile storage layer which fills this gap and simplifies the architecture of Big Data systems.
A large German bank uses Kudu as its storage layer to speed up its credit processes. Within this system, Spark jobs analyse the financial transactions of millions of customers to categorize them and to calculate key figures. In addition to this analytical workload, several frontend applications use the Kudu Java API to perform random reads and writes in real time.
The presentation will cover these topics:
- Business and technical requirements
- Data access patterns
- System architecture
- Kudu data modelling
- Kudu architecture for High Availability
- Experiences from development and operations
The slides are from my talk at DataWorks Summit 2019 #DWS19 in Barcelona.

Published in: Data & Analytics

Kudu as Storage Layer to Digitize Credit Processes

  1. Fast Analytics on Fast Data: Kudu as Storage Layer to Digitize Credit Processes
     Olaf Hein, DataWorks Summit Barcelona, March 20, 2019
     info@ordix.de, www.ordix.de
  2. Problem: Classic Credit Process
     Fast Analytics on Fast Data - DataWorks Summit Barcelona 2019 - Olaf Hein
     - A lot of paperwork
     - Application at the branch
     - Costly process
     - Manual approval
     - Deferred disbursement
  3. Solution: Digitised Credit Processes
     - No paperwork
     - Application via online banking
     - Optimised process
     - Automated approval
     - Immediate disbursement
  4. Automated Self-Disclosure
     Account transactions:
       Date    Posting Text                      Amount
       01.01.  ABC Inc. Salary                     3600
       01.01.  GOV KG1234567890                     400
       01.01.  Rental Fee Main Street, Redhill      800
       13.01.  ACME says thank you R1001345          40
       15.01.  Gas & Electricity GE0987654321       120
       15.05.  Waterworks W111222                    80
       20.05.  ACME says thank you R1001432          60
     Derived self-disclosure:
       Revenue                     Amount
         Salary                      3600
         Child Benefit                400
         Rental Income                  0
         Total Revenue               4000
       Expenses
         Rental Fee                   800
         Additional Housing Costs     200
         Other Regular Expenses         0
         Total Expenses              1000
       Summary
         Spendable Income            3000
         Credit Limit               10000
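
The aggregation behind the self-disclosure on this slide can be sketched in a few lines of Python. This is only an illustration, not the bank's implementation: the category labels attached to each transaction are an assumption inferred from the slide's totals (the two "ACME" postings are treated as uncategorised), and the credit limit is left out because the slides do not state how it is derived.

```python
from collections import defaultdict

# Transactions from the slide as (date, posting text, amount, category).
# The category labels are assumptions inferred from the slide's totals.
TRANSACTIONS = [
    ("01.01.", "ABC Inc. Salary", 3600, "salary"),
    ("01.01.", "GOV KG1234567890", 400, "child_benefit"),
    ("01.01.", "Rental Fee Main Street, Redhill", 800, "rental_fee"),
    ("13.01.", "ACME says thank you R1001345", 40, None),
    ("15.01.", "Gas & Electricity GE0987654321", 120, "add_housing_costs"),
    ("15.05.", "Waterworks W111222", 80, "add_housing_costs"),
    ("20.05.", "ACME says thank you R1001432", 60, None),
]

REVENUE = ("salary", "child_benefit", "rental_income")
EXPENSES = ("rental_fee", "add_housing_costs", "other_expenses")


def self_disclosure(transactions):
    """Sum categorised amounts and derive the totals shown on the slide."""
    sums = defaultdict(int)
    for _date, _text, amount, category in transactions:
        if category is not None:  # skip uncategorised postings
            sums[category] += amount
    result = {name: sums[name] for name in REVENUE + EXPENSES}
    result["total_revenue"] = sum(sums[name] for name in REVENUE)
    result["total_expenses"] = sum(sums[name] for name in EXPENSES)
    result["spendable_income"] = (
        result["total_revenue"] - result["total_expenses"]
    )
    return result


disclosure = self_disclosure(TRANSACTIONS)
print(disclosure)
```

With the sample data, the totals match the slide: revenue 4000, expenses 1000, spendable income 3000.
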
  5. Hadoop Storage Options (classic)
     [Figure: storage options plotted by pace of analytics (Archive, Batch Jobs, Interactive Queries, Real-Time Access) and pace of data (Static Data, Inserts Only, Frequent Updates, Real-Time Updates), showing HDFS and HBase.]
  6. HDFS: Distributed File System
     Data modelling
     - Large data files
     - Directories as tables
     - Subdirectories as partitions
     - Several supported file formats
     - No primary key and no index
     Characteristics
     - Write once read many
     - Bulk inserts
     - Large table and partition scans
     - No random access
     - No updates and no deletes
  7. HBase: NoSQL Database
     Data modelling
     - Tables
       - Row key (primary key)
       - Column families
       - Dynamic columns
     - Data files
       - Stored in HDFS
       - Rows ordered by row key
     Characteristics
     - Random access
     - Very fast inserts, updates and deletes
     - Short range scans
  8. Hadoop Storage Options (new)
     [Figure: the same chart of pace of analytics (Archive, Batch Jobs, Interactive Queries, Real-Time Access) against pace of data (Static Data, Inserts Only, Frequent Updates, Real-Time Updates), now showing HDFS, HBase and Kudu.]
  9. Kudu: Storage Layer / NoSQL Database
     Hadoop storage layer to enable fast analytics on fast data
     Data modelling
     - Tables, columns and data types
     - Partitions
     - Primary key and partition key
     - Data files stored in Linux FS
     - Columnar storage
     - Rows
       - Ordered by primary key
       - Distributed by partition key
     Characteristics
     - Random access
     - Fast inserts, updates and deletes
     - Large (and short) range scans
  10. Data Ingestion
      Bulk Inserts
      - Daily delivery of financial transactions
        - 5,000,000 median
        - 20,000,000 peak
      - Delivered by batch jobs
      - ETL processing
        - Cleansing
        - Conforming
        - Merging
  11. Categorisation & Payment Flow Analysis (PFA)
      Large Scans, Inserts and Updates
      - Categorisation of new transactions
      - Payment Flow Analysis for active customers
        - 13-month transaction history
        - 500,000,000 transactions per day
      Very Large Scans & Updates
      - Model changes
      - Update categories and self-disclosures
      - 1,500,000,000 transactions for 13 months
  12. Online Interfaces
      Random Reads
      - Credit limit
      - Self-disclosure
      Random Writes
      - Updated self-disclosure
      - Re-calculated credit limit
  13. Delete
      Random Deletes
      - Delete particular customer data
        - Transactions
        - Self-disclosure
      - Reasons
        - Outdated data (data economy)
        - Customer disables credit limit option
        - Customer cancels account
  14. Analytics
      Monitor Model
      - Complete data set
      - One account
      Tools
      - Impala SQL
      - Python
      - Anaconda
      - Jupyter Notebook
      - PySpark
  15. Summary of Data Access Patterns
      - Bulk Inserts
      - Large Scans
      - Short Scans
      - Random Reads
      - Updates
      - Deletes
      - Analytics
  16. High Level Architecture
      [Figure: architecture diagram with three areas. Data Lake: data is extracted from an RDBMS, loaded into HDFS, transformed with Spark & Impala (including PFA) and loaded into Kudu; REST services on top of Kudu serve online banking and the branch. Data Science Sandbox: Kudu on Linux FS. Data Science Workplace: Jupyter Notebooks for DS & ML.]
  17. Kudu Architecture (1/2)
      [Figure: a client reads and writes metadata via the Kudu Master leader, which is backed by two Kudu Master followers, and reads and writes data directly on the tablet servers. Three tablet servers each hold one replica of Tablet 1 and Tablet 2; each tablet has one leader and two followers. Data can also be read from followers.]
  18. Kudu Architecture (2/2)
      Kudu Master
      - Metadata management
        - Database catalog
        - Tablet-to-server mapping
      - 3 for high availability (minimum 1)
      - Reading and writing via leader
      Kudu Tablet Server
      - Data storage
      - Tables are
        - Split into tablets (partitions)
        - Replicated for availability (odd number of replicas)
      - Direct data flow between client and tablet server
        - Writing via leader
        - Reading via leader or follower
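
The recommendation of an odd number of replicas follows from majority-based consensus (Kudu replicates tablets and masters with Raft). The arithmetic can be sketched as below; the function name is hypothetical and the snippet only illustrates the quorum calculation, not Kudu's code.

```python
def tolerated_failures(replicas: int) -> int:
    """Number of replicas that may fail while a majority stays available."""
    if replicas < 1:
        raise ValueError("need at least one replica")
    majority = replicas // 2 + 1
    return replicas - majority


for n in (1, 3, 4, 5):
    print(n, "replicas tolerate", tolerated_failures(n), "failure(s)")
```

Going from 3 to 4 replicas does not increase the number of tolerated failures, which is why even replica counts add cost without adding availability.
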
  19. Data Modelling
      Primary Key Design
      - Unique index
      - Ordering of rows
      Partitioning
      - Data distribution and scalability
        - Rows on tablets
        - Tablets on tablet servers
      - Several strategies
        - Range
        - Hash
        - Multilevel
      [Figure: data model with two tables in a 1:n relationship.
       SELF_DISCLOSURE: IBAN (primary key), SALARY, CHILD_BENEFIT, RENTAL_INCOME, RENTAL_FEE, ADD_HOUSING_COSTS, OTHER_EXPENSES, CREDIT_LIMIT.
       TRANSACTION: IBAN, TRX_DATE, TRX_ID (primary key), POSTING_TEXT, AMOUNT, CATEGORY.]
  20. Hash Partitioning (1/2)
      - Partition key has to be part of primary key
      - Distribution by hash value
      - Immutable number of buckets (partitions)
      - Number of buckets depends on
        - Data volume
        - Number of tablet servers & cores per server
        - Access patterns
      [Figure: HASH(iban) distributes rows over four tablets: Tablet 1 (hash bucket 0), Tablet 2 (hash bucket 1), Tablet 3 (hash bucket 2), Tablet 4 (hash bucket 3).]
  21. Hash Partitioning (2/2)
      create table self_disclosure (
        iban string,
        amount double,
        child_benefit double,
        rental_income double,
        rental_fee double,
        add_housing_costs double,
        other_expenses double,
        credit_limit double,
        primary key (iban)
      )
      partition by hash (iban) partitions 4
      stored as kudu;
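
To illustrate how hash partitioning assigns rows to a fixed number of buckets, here is a minimal Python sketch. CRC32 is used purely as a stand-in hash; Kudu's internal hash function is different, so the actual bucket of a given key will not match. The IBAN values are made up for the example.

```python
import zlib

NUM_BUCKETS = 4  # corresponds to "partitions 4" in the DDL


def hash_bucket(iban: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a key to a bucket in [0, num_buckets).

    CRC32 is a stand-in; it is NOT the hash function Kudu uses.
    """
    return zlib.crc32(iban.encode("utf-8")) % num_buckets


# Hypothetical IBANs, for illustration only.
ibans = [
    "DE00000000000000000001",
    "DE00000000000000000002",
    "DE00000000000000000003",
]

for iban in ibans:
    print(iban, "-> bucket", hash_bucket(iban))
```

The same key always lands in the same bucket, so a lookup by `iban` touches exactly one tablet. Because the bucket depends on the bucket count, changing that count would move existing rows, which is consistent with the number of buckets being immutable.
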
  22. Range Partitioning (1/2)
      - Partition key has to be part of primary key
      - Range
        - Open or closed
        - No overlapping
      - Partitions can be added and deleted on demand
      - Range size depends on
        - Data volume
        - Number of tablet servers & cores per server
        - Access patterns
      [Figure: RANGE(trx_date) maps rows to three tablets: Tablet 1 (trx_date < '2018-01-01'), Tablet 2 ('2018-01-01' <= trx_date < '2019-01-01'), Tablet 3 ('2019-01-01' <= trx_date < '2020-01-01').]
  23. Range Partitioning (2/2)
      create table transaction (
        iban string,
        trx_date string,
        trx_id string,
        posting_text string,
        amount double,
        category int,
        primary key (iban, trx_date, trx_id)
      )
      partition by range (trx_date) (
        partition values < '2018-01-01',
        partition '2018-01-01' <= values < '2019-01-01',
        partition '2019-01-01' <= values < '2020-01-01'
      )
      stored as kudu;
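
The lookup of a range partition for a given `trx_date` can be sketched with a binary search over the partition bounds. This is an illustration of the idea, not Kudu's implementation; it relies on ISO-formatted date strings sorting lexicographically in chronological order, and the function name is hypothetical.

```python
import bisect

# Exclusive upper bounds of the three range partitions in the DDL.
# The first partition is unbounded below (values < '2018-01-01').
UPPER_BOUNDS = ["2018-01-01", "2019-01-01", "2020-01-01"]


def range_partition(trx_date: str):
    """Return the 0-based partition index, or None if no partition covers the date."""
    index = bisect.bisect_right(UPPER_BOUNDS, trx_date)
    return index if index < len(UPPER_BOUNDS) else None


for date in ("2017-12-31", "2018-01-01", "2019-12-31", "2020-01-01"):
    print(date, "->", range_partition(date))
```

A date on or after '2020-01-01' is not covered by any partition in this layout, so a new range partition would have to be added before such rows can be stored.
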
  24. Multilevel Partitioning (1/3)
      HASH(iban) combined with RANGE(trx_date):
        Tablet    Hash Bucket  Range
        Tablet 1  0            '2017-01-01' <= trx_date < '2018-01-01'
        Tablet 2  0            '2018-01-01' <= trx_date < '2019-01-01'
        Tablet 3  0            '2019-01-01' <= trx_date < '2020-01-01'
        Tablet 4  1            '2017-01-01' <= trx_date < '2018-01-01'
        Tablet 5  1            '2018-01-01' <= trx_date < '2019-01-01'
        Tablet 6  1            '2019-01-01' <= trx_date < '2020-01-01'
        Tablet 7  2            '2017-01-01' <= trx_date < '2018-01-01'
        Tablet 8  2            '2018-01-01' <= trx_date < '2019-01-01'
        Tablet 9  2            '2019-01-01' <= trx_date < '2020-01-01'
  25. Multilevel Partitioning (2/3)
      - Combination of
        - Hash + Range
        - Hash + [Hash ...] + [Range]
      - Advantages (for time series data)
        - Hash partitioning
          - Write operations are distributed over several tablets
          - Read operations for range scans are parallelized
          - All records of one IBAN in one partition (partition pruning)
        - Range partitioning
          - Add new partitions (scalability)
          - Delete old partitions
          - All records of one year in one partition (partition pruning)
  26. Multilevel Partitioning (3/3)
      create table transaction (
        iban string,
        trx_date string,
        trx_id string,
        posting_text string,
        amount double,
        category int,
        primary key (iban, trx_date, trx_id)
      )
      partition by hash (iban) partitions 3,
      range (trx_date) (
        partition '2017-01-01' <= values < '2018-01-01',
        partition '2018-01-01' <= values < '2019-01-01',
        partition '2019-01-01' <= values < '2020-01-01'
      )
      stored as kudu;
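
Partition pruning with the 3 x 3 multilevel layout can be sketched as follows. The tablet numbering follows the slides (tablets 1 to 3 belong to hash bucket 0, and so on). As before, CRC32 is only a stand-in for Kudu's hash function, and all function names and the IBAN are hypothetical.

```python
import zlib
from typing import List, Optional

NUM_BUCKETS = 3
# Range partition i covers RANGE_BOUNDS[i] <= trx_date < RANGE_BOUNDS[i + 1].
RANGE_BOUNDS = ["2017-01-01", "2018-01-01", "2019-01-01", "2020-01-01"]
NUM_RANGES = len(RANGE_BOUNDS) - 1


def hash_bucket(iban: str) -> int:
    # CRC32 is a stand-in; it is NOT the hash function Kudu uses.
    return zlib.crc32(iban.encode("utf-8")) % NUM_BUCKETS


def tablet_number(bucket: int, range_index: int) -> int:
    """Tablet numbering as on the slides (1-based, grouped by hash bucket)."""
    return bucket * NUM_RANGES + range_index + 1


def overlapping_ranges(date_from: str, date_to: str) -> List[int]:
    """Range partitions that overlap the half-open interval [date_from, date_to)."""
    return [
        i
        for i in range(NUM_RANGES)
        if RANGE_BOUNDS[i] < date_to and date_from < RANGE_BOUNDS[i + 1]
    ]


def tablets_to_scan(iban: Optional[str], date_from: str, date_to: str) -> List[int]:
    """Tablets left after pruning by IBAN (if given) and by date range."""
    buckets = range(NUM_BUCKETS) if iban is None else [hash_bucket(iban)]
    ranges = overlapping_ranges(date_from, date_to)
    return sorted(tablet_number(b, r) for b in buckets for r in ranges)


# All customers, calendar year 2018: one tablet per hash bucket.
print(tablets_to_scan(None, "2018-01-01", "2019-01-01"))
# One (hypothetical) customer, one month: a single tablet.
print(tablets_to_scan("DE00000000000000000001", "2018-03-01", "2018-04-01"))
```

A scan over one year for all customers is spread over three tablets and can run in parallel, while a lookup for a single IBAN within one year is pruned to one tablet.
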
  27. Experiences with Kudu
      Pros
      - Easy data modelling
        - Partitioning
        - Key design
      - Easy usage
        - Spark Connector
        - Java Client
        - Impala / SQL
      - Easy operations
        - High availability
        - Stability
      - Performance
      - Very versatile
        - Perfect fit for our access patterns
      Cons
      - Security
        - Only Kerberos authentication
        - No authorisation (so far)
        - Workaround using Impala and Sentry
      - Backup
        - Not implemented (so far)
        - Workaround using Impala or Spark
      - Operations
        - Sensitive to network latencies and time differences
        - Tool set still growing
      - Maximum sizing
        - 100 tablet servers
        - 8 TiB and 2,000 tablets per server (post-replication)
        - 60 tablets per table and server
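
The sizing limits listed above can be turned into a quick plausibility check for a partitioning plan. The sketch below rests on several assumptions that are not stated on the slides: tablet replicas are spread evenly over the tablet servers, and the limit of 60 tablets per table and server counts replicas. All function and constant names are hypothetical.

```python
# Limits as quoted on the slide.
MAX_TABLET_SERVERS = 100
MAX_TABLETS_PER_SERVER = 2000           # post-replication
MAX_TABLETS_PER_TABLE_AND_SERVER = 60   # assumed to count replicas


def tablets_per_server(hash_buckets: int, range_partitions: int,
                       replication_factor: int, tablet_servers: int) -> float:
    """Average tablet replicas of one table per server, assuming even spread."""
    replicas = hash_buckets * range_partitions * replication_factor
    return replicas / tablet_servers


def within_table_limit(hash_buckets: int, range_partitions: int,
                       replication_factor: int, tablet_servers: int) -> bool:
    """True if the plan stays within the per-table, per-server limit."""
    load = tablets_per_server(hash_buckets, range_partitions,
                              replication_factor, tablet_servers)
    return load <= MAX_TABLETS_PER_TABLE_AND_SERVER


# The 3 x 3 layout from the multilevel example on 3 servers, 3 replicas each.
print(tablets_per_server(3, 3, 3, 3), within_table_limit(3, 3, 3, 3))
# A hypothetical larger plan: 50 buckets x 13 ranges on 10 servers.
print(tablets_per_server(50, 13, 3, 10), within_table_limit(50, 13, 3, 10))
```
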
  28. Thank you
      ORDIX AG
      Aktiengesellschaft für Softwareentwicklung, Schulung, Beratung und Systemintegration
      Headquarters Paderborn: Karl-Schurz-Straße 19a, 33100 Paderborn, Tel.: 05251 1063-0, Fax: 0180 1 67349 0
      Training centre Wiesbaden: Kreuzberger Ring 13, 65205 Wiesbaden, Tel.: 0611 77840-00
      info@ordix.de, www.ordix.de
