Storage Characteristics Of Call Data Records In Column Store Databases
David Walker
1.
STORAGE CHARACTERISTICS OF CALL DATA RECORDS IN COLUMN STORE DATABASES
David M Walker, Data Management & Warehousing
2.
OVERVIEW
• This presentation gives a brief overview of the storage characteristics of Call Data Records in Column Store Databases
• It discusses:
  • What are Call Data Records (CDRs)?
  • What is a Column Store Database?
  • How efficient is a column store database for storing CDR and other (similar) machine generated data?
• It does not:
  • Examine performance in any detail
  • Compare column store to traditional row-based databases
Jan 2012 © 2012 Data Management & Warehousing
3.
WHAT ARE CALL DATA RECORDS (CDRs)?
• Every time a telephone call is made, data about that call is recorded. At its most basic this will include:
  • The Calling Number (who made the call)
  • The Called Number (who was called)
  • The Start Time
  • The End Time (or the duration)
  • Various pieces of technical information (which network switch was used, the mobile handset identifier, the call direction, whether it is a free x800-type call, etc.)
4.
CDRs AT MULTIPLE LEVELS
• A CDR is created at the switch. Each switch involved in a call creates its own CDRs; these are often called Network CDRs
• The Network CDRs are joined together into a single end-to-end call record through a process known as mediation. These are Unrated CDRs
• Finally the cost of the call is calculated and added to the Unrated CDRs to create Rated CDRs
5.
MORE CDR COMPLEXITY
• There are CDRs that are used for billing the subscriber, often called Retail CDRs
• There are also CDRs that are used to charge other operators when their calls travel over your network (e.g. when you make a mobile call that finishes on a land line from another operator). These are known as Interconnect CDRs or Wholesale CDRs
• There are also differences between Mobile and Fixed (Land) Line CDRs
• Finally, each Switch Manufacturer (there are over 60) and each Mediation and/or Billing system (again at least 50) uses its own format
6.
FOR THIS EXERCISE …
• We are using Mobile Rated Interconnect CDRs from a European telephone company (telco)
• We have 12,902 files, containing 435,242,447 CDRs over a 181 day period from 482,883 subscribers
• Each CDR has 80 fields and 583 characters in a fixed length record format file. We have also added one mandatory field to hold the name of the source file from which the record came
7.
DATA DISTRIBUTION IN THE CDR RECORDS (1)
• The structure of the data in the record has a massive impact on its storage. There are a number of factors to look at:
  • Data Types, Padding, Place Holders and Data Cardinality
• The example data we are using has 2 Datetime fields, 11 Char fields, 10 Numeric fields, 33 Integer fields and 25 Varchar fields, which is a fairly typical mix for this type of machine generated data. In the source file these are all held as ASCII text.
8.
DATA DISTRIBUTION IN THE CDR RECORDS (2)
• Fixed length records are padded. In our data set the ‘Calling Number’ fixed length field is defined as 24 characters long, but the longest value in the actual data is only 11 characters. This means that there are always at least 13 space characters of padding after the value
• 24 of our 80 fields have no information in them at all and 43 of the fields are mandatory and 100% populated. The remaining 13 fields are filled in between 25% and 75% of the records.
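The effect of padding can be illustrated with a minimal Python sketch. Only the 24-character width of the Calling Number field comes from the slides; the other field names, widths and the sample values are hypothetical and chosen purely for illustration.

```python
# Hypothetical layout: (field name, width in characters). Only the
# 24-character Calling Number width is taken from the slides.
LAYOUT = [("calling_number", 24), ("called_number", 24), ("duration_secs", 6)]


def parse_fixed_width(line, layout=LAYOUT):
    """Slice one fixed-length record into fields and strip the space padding."""
    record = {}
    offset = 0
    for name, width in layout:
        value = line[offset:offset + width].strip()
        # A field holding only spaces carries no information: store None.
        record[name] = value if value else None
        offset += width
    return record


def padding_chars(line, layout=LAYOUT):
    """Count how many characters of each field are padding."""
    counts = {}
    offset = 0
    for name, width in layout:
        raw = line[offset:offset + width]
        counts[name] = width - len(raw.strip())
        offset += width
    return counts


# Made-up sample record: two 11-digit numbers and a right-aligned duration.
sample = "07700900123".ljust(24) + "07700900456".ljust(24) + "125".rjust(6)
print(parse_fixed_width(sample))
print(padding_chars(sample))
```

For an 11-character number in a 24-character field, more than half of the field is padding, which is the kind of redundancy a column store does not need to keep.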
9.
DATA DISTRIBUTION IN THE CDR RECORDS (3)
• Finally, the number of distinct values (cardinality) a field has affects storage. One flag field has possible values of 0 or 1 and therefore a (low) cardinality of 2; another field has a nearly unique value for every record and therefore a very high cardinality. Of the 57 fields with data there are 20 fields with high cardinality, 5 fields with medium cardinality and the remaining 32 fields have a low cardinality
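Cardinality is simple to measure on a sample of records. The following Python sketch counts the distinct non-empty values per field; the field names and sample rows are hypothetical, and the slides do not say which thresholds were used to separate low, medium and high cardinality.

```python
from collections import defaultdict


def field_cardinality(records):
    """Return {field: number of distinct non-empty values} for dict records."""
    distinct = defaultdict(set)
    for record in records:
        for field, value in record.items():
            # Empty placeholders do not count as a value.
            if value is not None and value != "":
                distinct[field].add(value)
    return {field: len(values) for field, values in distinct.items()}


# Hypothetical sample: a near-unique identifier, a 0/1 flag and a switch name.
records = [
    {"call_id": "A1", "direction_flag": "0", "switch": "SW1"},
    {"call_id": "A2", "direction_flag": "1", "switch": "SW1"},
    {"call_id": "A3", "direction_flag": "0", "switch": "SW2"},
    {"call_id": "A4", "direction_flag": "1", "switch": ""},
]
print(field_cardinality(records))
```

On a full data set the sets of distinct values for high-cardinality fields become large, so a production profile would more likely be taken with the database's own COUNT(DISTINCT ...) than in memory.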
10.
WHAT IS A COLUMN STORE DATABASE?
• Traditionally databases are ‘row-based’, i.e. the fields of data in a record are stored next to each other.

  Forename   Surname   Gender
  David      Walker    Male
  Helen      Walker    Female
  Sheila     Jones     Female

• Column store databases store the values in columns and then hold a mapping to form the record
• This is transparent to the user, who queries a table with SQL in exactly the same way as they would a row-based database
11.
COLUMN STORAGE

  First Name Value   F Token
  David              PPP
  Helen              QQQ
  Sheila             RRR

  Surname Value      S Token
  Jones              XXX
  Walker             YYY

  Gender Value       G Token
  Female             AAA
  Male               BBB

  Mapping of tokens to records:
  F Token   S Token   G Token
  PPP       YYY       BBB
  QQQ       YYY       AAA
  RRR       XXX       AAA

Note: To the user this appears as a conventional row-based table that can be queried by standard SQL; it is only the underlying storage that is different
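The token scheme can be sketched in Python as a simple dictionary encoding. This is an illustration of the general idea only, not a description of how Sybase IQ or any other product stores data internally; it uses small integers as tokens where the slide uses labels such as PPP and YYY.

```python
def encode_column(values):
    """Dictionary-encode one column: return (dictionary, tokens)."""
    dictionary = {}  # value -> token
    tokens = []      # one token per row, in row order
    for value in values:
        if value not in dictionary:
            dictionary[value] = len(dictionary)
        tokens.append(dictionary[value])
    return dictionary, tokens


def decode_row(row_index, encoded_columns):
    """Rebuild one row from the per-column dictionaries and token lists."""
    row = {}
    for name, (dictionary, tokens) in encoded_columns.items():
        lookup = {token: value for value, token in dictionary.items()}
        row[name] = lookup[tokens[row_index]]
    return row


names = ("forename", "surname", "gender")
rows = [
    ("David", "Walker", "Male"),
    ("Helen", "Walker", "Female"),
    ("Sheila", "Jones", "Female"),
]
encoded = {
    name: encode_column([row[i] for row in rows])
    for i, name in enumerate(names)
}
print(encoded["surname"])
print(decode_row(2, encoded))
```

Each distinct string is stored once in its column dictionary, and the rows hold only tokens, so a value such as "Walker" that occurs many times costs one string plus one small token per row.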
12.
EFFICIENCIES OF COLUMN STORE DATABASES
• Column store databases offer significant storage optimisation opportunities, especially where there are low or medium cardinality character strings (e.g. the telephone numbers and reference data), because long strings are not repeatedly stored
• In addition it is possible to compress the data in column stores very efficiently
• In some column store implementations the column storage holds additional metadata that can be used to speed up specific queries (e.g. the number of records associated with each value in a column)
• Reducing the volume of data stored means reduced I/O when querying the database, which in turn improves query performance
13.
INEFFICIENCIES OF COLUMN STORE DATABASES
• In general, manipulating individual rows for updates is expensive, as the database has to go to each of the columns and then update the mapping table
• Some column store databases have specific technologies to limit the impact of this by caching updates
• Consequently Column Store Databases are not efficient for OLTP type applications. However, they are very efficient for DWH/BI/Archive type applications because the data is bulk loaded rather than inserted row by row, it is not frequently updated, and it is used in large set-based queries
14.
HOW EFFICIENT IS IT TO STORE THIS DATA?
• What hardware was used and what would be needed for a production environment?
• How was the data loaded?
• What were the storage characteristics?
15.
THE TEST ENVIRONMENT
• The test environment was designed to measure storage and not system performance
• This test was done using Sybase IQ 15.4
• Sybase has had a column storage database called IQ since 1996; it is one of the most established of the 25 or so column stores currently listed on Wikipedia
• The server was running CentOS 5.7 x64, a Red Hat Linux derivative
• The hardware consisted of:
  • Intel Xeon Quad-Core X3363
  • 16 GB Memory
  • Adaptec 5405 RAID Controller with 2 x 1 TB 7200rpm Hard Disks (RAID1)
• The database was built on file systems rather than raw devices
• Total hardware cost was less than US$3000
• Software licences were provided on evaluation
16.
A PRODUCTION ENVIRONMENT?
• The specification of a production environment would depend on the volume of data per month, the number of months of data to be held and the type of CDR
• The biggest performance driver would be to have more disk spindles, either by adding more (faster) drives or by using solid state disks. This would improve performance as well as add capacity
  • e.g. 16 x 1 TB drives in a RAID10 configuration would provide around 7.75 TB of space and store 75 billion of these CDRs
• Using raw devices instead of file systems would also improve performance
• Other performance enhancements would include:
  • Moving from 1 to 2 or 4 Quad Core CPUs
  • Adding another 16 GB of memory
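The capacity figure can be sanity-checked with a short calculation. This Python sketch scales the fully indexed storage measured in the test (41.47 GB for 435,583,388 CDRs) up to 7.75 TB. It assumes 1 TB = 1000 GB and that storage grows linearly with the number of records; the slides do not say how the 75 billion figure was derived.

```python
# Figures quoted on the slides.
CDRS_LOADED = 435_583_388
INDEXED_GB = 41.47   # fully indexed storage in the database
USABLE_TB = 7.75     # quoted usable space of 16 x 1 TB drives in RAID10

GB_PER_TB = 1000     # assumption: decimal units


def estimated_capacity(usable_tb, cdrs_loaded=CDRS_LOADED, stored_gb=INDEXED_GB):
    """Estimate how many CDRs fit, assuming storage scales linearly."""
    cdrs_per_gb = cdrs_loaded / stored_gb
    return usable_tb * GB_PER_TB * cdrs_per_gb


estimate = estimated_capacity(USABLE_TB)
print(f"{estimate / 1e9:.1f} billion CDRs")
```

Under these assumptions the estimate comes out at roughly 81 billion CDRs, the same order as the 75 billion quoted.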
17.
LOADING THE DATA
• The data was loaded using PELT, an ETL tool written and used by Data Management & Warehousing
• The loading was done to production level quality
• Data is loaded into a load table (CDR_LOAD) which has a view (CDR_CONVERT) over it that applies data quality checks. The data is then selected from the view and inserted into the main table (CDRS)
• Each step is fully logged and audited
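The load table, view and main table pattern can be sketched with Python's built-in sqlite3 module. This uses SQLite as a stand-in rather than Sybase IQ or PELT. The table and view names follow the slides, but the columns, the checks in the view and the load_file helper are hypothetical. SQLite has no TRUNCATE statement, so DELETE is used to empty the load table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Load table: everything arrives as padded text.
    CREATE TABLE CDR_LOAD (
        calling_number TEXT, called_number TEXT,
        duration_secs TEXT, source_file TEXT
    );
    -- Main table: typed, mandatory columns.
    CREATE TABLE CDRS (
        calling_number TEXT NOT NULL, called_number TEXT NOT NULL,
        duration_secs INTEGER NOT NULL, source_file TEXT NOT NULL
    );
    -- View applying illustrative clean-up and data quality checks.
    CREATE VIEW CDR_CONVERT AS
        SELECT TRIM(calling_number) AS calling_number,
               TRIM(called_number) AS called_number,
               CAST(TRIM(duration_secs) AS INTEGER) AS duration_secs,
               source_file
        FROM CDR_LOAD
        WHERE TRIM(calling_number) <> '' AND TRIM(called_number) <> '';
""")


def load_file(conn, source_file, raw_rows):
    """Load raw rows, copy them through the view, then empty the load table."""
    with conn:  # one transaction per file
        conn.executemany(
            "INSERT INTO CDR_LOAD VALUES (?, ?, ?, ?)",
            [(a, b, d, source_file) for a, b, d in raw_rows],
        )
        conn.execute("INSERT INTO CDRS SELECT * FROM CDR_CONVERT")
        conn.execute("DELETE FROM CDR_LOAD")


load_file(conn, "cdr_0001.dat", [("07700900123   ", "07700900456   ", "   125")])
loaded = conn.execute("SELECT * FROM CDRS").fetchall()
print(loaded)
```

Keeping the checks in a view means the raw load stays a plain bulk insert, while the conversion rules live in one place that can be changed without touching the loader.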
18.
THE LOADING STEPS
• Copy a compressed (Unix Compress .Z) flat file (as provided) from the incoming directory to the workspace
• Record the size of the .Z file in bytes
• Uncompress the file
• Record the size in bytes and the number of records in the uncompressed file
• Use the iSQL ‘Load’ command to insert the data into a CDR_LOAD table
• Record the size of the CDR_LOAD table in kilobytes
• Insert into the main CDR table from the DQ view CDR_CONVERT over the CDR_LOAD table
• Record the size of the CDR table in kilobytes
• Truncate the CDR_LOAD table
• Compress the source file with ‘gzip -9’ (maximum compression, longest execution)
• Record the size of the .gz file in bytes
• Move the compressed .gz file to an archive directory
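The final file-handling steps (compress with maximum gzip compression, record the sizes, move to the archive) might look like the following Python sketch. It is not the PELT implementation: the function name and the returned dictionary are hypothetical, and the database steps and the initial .Z decompression are left out.

```python
import gzip
import shutil
from pathlib import Path


def compress_and_archive(source, archive_dir):
    """gzip a file at maximum compression, move it to the archive, return sizes."""
    source = Path(source)
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)

    raw_bytes = source.stat().st_size  # size of the uncompressed file

    # compresslevel=9 corresponds to 'gzip -9': smallest output, slowest run.
    gz_path = source.with_name(source.name + ".gz")
    with open(source, "rb") as src, gzip.open(gz_path, "wb", compresslevel=9) as dst:
        shutil.copyfileobj(src, dst)
    gz_bytes = gz_path.stat().st_size  # size of the .gz file

    source.unlink()  # like gzip, remove the uncompressed original
    archived = archive_dir / gz_path.name
    shutil.move(str(gz_path), str(archived))
    return {"file": archived.name, "raw_bytes": raw_bytes, "gz_bytes": gz_bytes}
```

Returning the sizes to the caller lets each step be written to an audit log, in the spirit of the logged and audited process the slides describe.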
19.
RESULTS
• 12,902 files were loaded with zero data quality errors
• 435,583,388 CDRs
• 236.50 GB of raw files
• 20.03 GB of storage in the original .Z files
  • 11.8:1 Compression Ratio
• 12.42 GB of storage in the archive .gz files
  • 19.0:1 Compression Ratio
• 27.48 GB of un-indexed storage in the database
  • 8.6:1 Compression Ratio
• 41.47 GB of fully indexed storage in the database
  • 5.7:1 Compression Ratio
• Loading: 33 hours, 22 minutes, 12 seconds
• Indexing: 2 hours, 13 minutes, 9 seconds
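The compression ratios follow directly from the sizes quoted: each is the raw file size divided by the stored size. A short Python sketch, using only the figures on the slide, reproduces them.

```python
RAW_GB = 236.50  # size of the raw, uncompressed files

STORED_GB = {
    "original .Z files": 20.03,
    "archive .gz files": 12.42,
    "database, un-indexed": 27.48,
    "database, fully indexed": 41.47,
}


def compression_ratio(raw, stored):
    """Return how many times smaller the stored form is than the raw data."""
    return raw / stored


ratios = {
    name: round(compression_ratio(RAW_GB, size), 1)
    for name, size in STORED_GB.items()
}
for name, ratio in ratios.items():
    print(f"{name}: {ratio}:1")
```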
20.
ADDING INDEXES
• By default the table has no indexes
  • This is the same in most databases
• For this test every field was indexed
  • This added 63 indexes that took up around an additional 14 GB (41.47 GB fully indexed less 27.48 GB un-indexed)
  • The total space used was still 5.7 times smaller than the space used by the raw files
• These indexes would significantly improve query performance
• However, not all the indexes would be required in a production system, as not all fields would be actively queried, and this would reduce the space used
21.
DISK SPACE USED
[Figure: chart of disk space used]
22.
LOAD PERFORMANCE
• The average file had 33,760 records
• The ETL to load an average file took 11 seconds:
  • 2 seconds to copy to the working directory and decompress
  • 3 seconds to import into the CDR_LOAD table
  • 3 seconds to copy from the CDR_CONVERT view to the CDRS table
  • 2 seconds to gzip -9 and archive
  • 1 second for logging and truncating tables
• None of the tables were indexed during the load
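These timings translate into a load rate. The Python sketch below sums the per-step times for an average file and, separately, derives an overall rate from the total load time reported in the results (33 hours, 22 minutes, 12 seconds). The per-step times are whole seconds, so the two rates are not expected to match exactly.

```python
# Per-step timings for an average file, in seconds, as quoted on the slide.
STEP_SECONDS = {
    "copy to working directory and decompress": 2,
    "import into CDR_LOAD": 3,
    "copy from CDR_CONVERT to CDRS": 3,
    "gzip -9 and archive": 2,
    "logging and truncating tables": 1,
}
AVG_RECORDS_PER_FILE = 33_760

# Totals quoted in the results.
TOTAL_CDRS = 435_583_388
TOTAL_LOAD_SECONDS = 33 * 3600 + 22 * 60 + 12

seconds_per_file = sum(STEP_SECONDS.values())
per_file_rate = AVG_RECORDS_PER_FILE / seconds_per_file
overall_rate = TOTAL_CDRS / TOTAL_LOAD_SECONDS

print(f"{seconds_per_file} s per average file")
print(f"{per_file_rate:,.0f} records/s from the per-file timings")
print(f"{overall_rate:,.0f} records/s from the total load time")
```

Both calculations give a rate in the low thousands of records per second on the test hardware.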
23.
OBSERVATIONS (1)
• The results were approximately in the middle of our expectations and previous experience of other similar data sets, where the raw data has been compressed between 5 and 10 times
• Even low end hardware gives acceptable load performance suitable for archive functionality, but production scale hardware is needed for BI/DWH
24.
OBSERVATIONS (2)
• Some database tuning techniques are needed for truly massive data sets, but they can be designed in from the outset at low cost (e.g. which indexes/index types to use)
• It is worth considering putting each month (or some other similar date based partition) in a separate table for systems management purposes, as this makes it easy to remove the data at the end of the archiving period
• Smaller reference tables added to the schema would have little or no compression, but they are also very small and would therefore not contribute greatly to the space used
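One way to picture the month-per-table idea is a naming rule plus a retention check, sketched here in Python. The CDRS_YYYYMM naming convention, the function names and the retention period are all hypothetical; the slides suggest only that each month could live in its own table.

```python
from datetime import datetime


def monthly_table(call_start, prefix="CDRS"):
    """Return the (hypothetical) monthly table name for a call start time."""
    return f"{prefix}_{call_start:%Y%m}"


def tables_to_drop(existing_tables, now, keep_months, prefix="CDRS"):
    """Return the monthly tables that fall outside the retention window."""
    # Count months on one axis so year boundaries need no special handling.
    cutoff = now.year * 12 + (now.month - 1) - keep_months
    expired = []
    for name in existing_tables:
        suffix = name[len(prefix) + 1:]
        year, month = int(suffix[:4]), int(suffix[4:6])
        if year * 12 + (month - 1) <= cutoff:
            expired.append(name)
    return expired


print(monthly_table(datetime(2011, 12, 31, 23, 59)))
print(tables_to_drop(
    ["CDRS_201106", "CDRS_201107", "CDRS_201108", "CDRS_201201"],
    now=datetime(2012, 1, 15),
    keep_months=6,
))
```

With this layout, removing a month of data is a matter of dropping one table rather than deleting hundreds of millions of individual rows.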
25.
ALTERNATIVE SCENARIOS
• This presentation uses information gathered on specific data used for a specific purpose by a client
• Companies may wonder how their data would work in both storage and performance terms
• Vendors may also wonder how their technologies compare in both storage and performance terms
• If you are interested in finding out, please contact us with these or any other Data Warehousing/Business Intelligence enquiries
26.
CONTACT US
• Data Management & Warehousing
  • Website: http://www.datamgmt.com
  • Telephone: +44 (0) 118 321 5930
• David Walker
  • E-Mail: davidw@datamgmt.com
  • Telephone: +44 (0) 7990 594 372
  • Skype: datamgmt
  • White Papers: http://scribd.com/davidmwalker
27.
ABOUT US
Data Management & Warehousing is a UK based consultancy that has been delivering successful business intelligence and data warehousing solutions since 1995. Our consultants have worked with major corporations around the world including the US, Europe, Africa and the Middle East. We have worked in many industry sectors such as telcos, manufacturing, retail, financial and transport. We provide governance and project management as well as expertise in the leading technologies.
28.
THANK YOU
© 2012 - Data Management & Warehousing
http://www.datamgmt.com