Big Data Strategy for the Relational World
1. Big Data Strategy
for the Relational World
Embracing Disruption, Avoiding Regression
Andrew J. Brust
Founder & CEO, Blue Badge Insights
Big Data correspondent, ZDNet
Big Data Analyst, GigaOM Research
2. Bio
• CEO and Founder, Blue Badge Insights
• Big Data blogger for ZDNet
• Microsoft Regional Director, MVP
• Co-chair, Visual Studio Live! and 18 years as a speaker
• Founder, Microsoft BI User Group of NYC
– http://www.msbinyc.com
• Co-moderator, NYC .NET Developers Group
– http://www.nycdotnetdev.com
• “Redmond Review” columnist for
Visual Studio Magazine and Redmond Developer News
• Twitter: @andrewbrust
5. Big Data: Why Should You Care?
• Because analytics (i.e. BI) has always been
important, but it was expensive and obscure
• Because the economics of processing and
storage make Big Data feasible
6. Big Data: Why Should You Be Cautious?
• Too many vendors; too much churn
• Designed for the lab, not for mainstream
business
• Immature technology and tooling
– Results in serious recruiting and dev costs
• So, you can’t ignore Big Data, but you can’t
just pursue it with abandon, either.
– That’s hard!
8. Database Trends
• NoSQL
– Mongo and Cassandra, primarily
• Late-bound schema
– aka “unstructured data”
• File-based table handling
– Especially HDFS
• Columnar storage
– And Massively Parallel Processing
• Co-existence with RDBMS, OLAP databases
– Very few throwing them away
• Little change in tools/clients
– Still expect tables or cubes
10. Consistency
• CAP Theorem
–A distributed database can guarantee at most two of
the following three attributes: consistency, availability
and partition tolerance
• Most NoSQL databases do not offer “ACID”
guarantees
–Atomicity, consistency, isolation and durability
• Instead offers “eventual consistency”
–Similar to DNS propagation
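The DNS analogy can be made concrete with a toy sketch, assuming a hypothetical key-value store in which a write is acknowledged by one replica immediately and reaches the others only when propagation runs (real systems use mechanisms like anti-entropy and hinted handoff; all class and method names here are invented for illustration):

```python
class Replica:
    """A toy key-value replica that receives updates asynchronously."""
    def __init__(self):
        self.data = {}

class EventualStore:
    """Minimal sketch of eventual consistency (hypothetical API)."""
    def __init__(self, n=3):
        self.replicas = [Replica() for _ in range(n)]
        self.pending = []  # (replica_index, key, value) not yet applied

    def write(self, key, value):
        self.replicas[0].data[key] = value      # acknowledged right away
        for i in range(1, len(self.replicas)):  # others updated "later"
            self.pending.append((i, key, value))

    def read(self, key, replica=0):
        return self.replicas[replica].data.get(key)

    def propagate(self):
        """Apply queued updates -- analogous to DNS changes spreading."""
        for i, k, v in self.pending:
            self.replicas[i].data[k] = v
        self.pending = []

store = EventualStore()
store.write("user:1", "Ada")
stale = store.read("user:1", replica=2)  # None: not yet propagated
store.propagate()
fresh = store.read("user:1", replica=2)  # "Ada": replicas have converged
```

A read against a lagging replica returns stale (or missing) data until propagation completes, which is exactly the window of inconsistency the CAP trade-off describes.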
12. NoSQL Upside
• Distributed by default
• Open source lets you peg costs to personnel,
more than to customers
• Developer enthusiasm
13. Hadoop
• Open source, petabyte-scale data analysis and
processing framework
• Runs on commodity hardware
• Lots of ecosystem
• Two main components:
– Hadoop Distributed File System (HDFS)
– MapReduce engine
15. Why MapReduce is Cool
• Extremely flexible – full power of a procedural
programming language
• Map step, essentially, allows ad hoc ETL
• With Reduce step, aggregation is a first-class
concept
• Growing ecosystem of tools that generate
MapReduce code
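The map/shuffle/reduce pattern can be sketched in a few lines of plain Python (a single-process stand-in for the distributed engine, using the classic word-count example):

```python
from collections import defaultdict

def map_phase(document):
    """Map step: arbitrary per-record code -- here tokenization, but any
    ad hoc ETL logic could go here."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reduce_phase(key, values):
    """Reduce step: aggregation is a first-class concept."""
    return key, sum(values)

docs = ["big data big deal", "data wins"]
mapped = [pair for doc in docs for pair in map_phase(doc)]
counts = dict(reduce_phase(k, vs) for k, vs in shuffle(mapped))
# counts == {"big": 2, "data": 2, "deal": 1, "wins": 1}
```

The flexibility comes from the map and reduce functions being ordinary procedural code; the framework only supplies partitioning, shuffling and fault tolerance around them.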
16. Why MapReduce Sucks
• It’s a batch mode technology
• It’s not declarative
• Most BI products don’t work with MR natively
– They connect via Hive instead (by and large)
• It’s good for a group of use cases, but it’s not a
good general framework
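The "not declarative" point is easiest to see side by side: the aggregation that needs map/shuffle/reduce plumbing above is a single SQL statement. Here in-memory SQLite stands in for Hive or a warehouse (an assumption for illustration only; the SQL itself is what matters):

```python
import sqlite3

# Declarative version of word count: one GROUP BY, no plumbing.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT)")
conn.executemany("INSERT INTO words VALUES (?)",
                 [("big",), ("data",), ("big",)])
rows = conn.execute(
    "SELECT word, COUNT(*) FROM words GROUP BY word ORDER BY word"
).fetchall()
# rows == [("big", 2), ("data", 1)]
```

With SQL, the engine chooses the execution plan; with MapReduce, the developer writes it, which is powerful for odd workloads and wasteful for routine ones.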
17. The Google DNA
• Hadoop and HBase derive from Google designs
– MapReduce and GFS (Hadoop)
– BigTable (HBase)
• Hadoop was built around Google’s use cases,
and Google doesn’t rely on those designs as
extensively now
• So why is the world going Hadoop-crazy?
• So why is the world going Hadoop-crazy?
18. Benefits of Schema-Free
• Variable schema is accommodated
– Great for product catalogs, content management
and the like
• Simple for archival storage
• For analysis:
– Avoids politics of achieving consensus on
structure
– Allows different schema for different applications
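A product catalog shows the benefit in miniature: two entries with disjoint attributes, accepted without a shared schema (plain dicts stand in here for JSON documents in a store like Mongo):

```python
# Two catalog entries with different attributes -- a document store
# accepts both without prior consensus on a shared structure.
products = [
    {"sku": "B-100", "title": "Novel", "author": "A. Writer", "pages": 320},
    {"sku": "T-200", "title": "Laptop", "cpu": "8-core", "ram_gb": 16},
]

def describe(product):
    """Each application reads only the fields it cares about."""
    extras = {k: v for k, v in product.items() if k not in ("sku", "title")}
    return f'{product["sku"]}: {product["title"]} {extras}'

summaries = [describe(p) for p in products]
```

Adding a new product type never requires a schema migration; the cost is deferred to read time, where each consumer decides which fields matter.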
19. Cloud Effect
• Database as a service and SaaS BI/Analytics gets
companies excited
– Cloudant
– Amazon: DynamoDB, RDS, RedShift, Jaspersoft
• Elastic capabilities of cloud provide small customers
with access to huge clusters
– Amazon EMR, Microsoft Windows Azure HDInsight now
– Google Compute Engine, Rackspace/Hortonworks to come
• Cloud-borne reference data adds value
• But casualties emerging: e.g. Xeround
20. SQL Skillset and Ecosystem
• DBAs, most devs know it
– Making recruiting faster and cheaper
• ORMs expect it
• Reporting/analysis tools are premised on it
– Even if they also talk to MDX and NoSQL sources
• Companies are invested in it
• Abandoning it is naive
21. MPP is Big Data
(via acquisition)
• Teradata
– Acquired Aster Data
• IBM
– Netezza
• HP
– Vertica
• EMC
– Pivotal/Greenplum
• Actian
– ParAccel
• Microsoft
– SQL Server Parallel Data Warehouse, via the
DATAllegro acquisition
22. SQL – BD Convergence
• Brings the SQL language and data warehouse
products, on one side, together with Hadoop, on
the other
• Goal is to make Hadoop interactive, non-batch
• May involve Hive and its APIs
• May involve direct access to HDFS
– Bypassing MapReduce
• Think of the “database” as HDFS, and MapReduce
as merely an access method.
27. Dremel and Drill
• Dremel is Google’s column store analytical database
– Proprietary; available publicly as BigQuery
• Hierarchical/nested too
– Allows schema variance without anarchy
• “…scales to thousands of CPUs and petabytes of data,
and has thousands of users at Google.”
• Uses SQL, has growing BI tool support
• Drill is to Dremel as Hadoop is to MapReduce + GFS
• And then there’s Spanner
28. In-Memory
• SAP HANA
– And Sybase IQ
• Data Warehouse Appliances
• VoltDB
• Oracle TimesTen
• IBM solidDB
– Also TM1 (in-memory OLAP)
• Coming: SQL Server’s “Hekaton” engine
29. The Truth About In-Memory
• Judicious use of in-memory database technology can
speed analytical queries
– Combine with columnar technology, rinse, repeat
• Can also eliminate need for deferred writes
• A RAM-only strategy like HANA’s seems impractical
• Keep in mind:
– SSD is memory too. It’s slower, but it’s memory.
– Conversely, L1, L2 and L3 cache is faster than RAM. Single
Instruction, Multiple Data (SIMD) makes things faster still.
• Hybrid approaches are most sensible
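Why "combine with columnar technology" pays off can be shown with a toy comparison, row layout versus column layout for the same table (in-memory Python lists as a stand-in for real storage engines):

```python
# Row store vs column store, in miniature: an analytical aggregate over
# one attribute touches far less data when values sit contiguously.
rows = [("east", 120.0), ("west", 95.5), ("east", 80.0)]  # row-oriented

# Columnar layout: one array per attribute
regions = ["east", "west", "east"]
revenue = [120.0, 95.5, 80.0]

# Row store: scan every full record to total revenue
total_rowwise = sum(rev for _, rev in rows)

# Column store: scan only the revenue column; homogeneous columns also
# compress well, which is what makes in-memory columnar engines fast
total_colwise = sum(revenue)
assert total_rowwise == total_colwise == 295.5
```

The column scan reads only the bytes the query needs, and that advantage compounds whether the column lives in RAM, in L1/L2/L3 cache, or on SSD.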
30. What’s Ahead?
• Consolidation! We can’t have this many vendors:
– Some will go out of business
– Some will get acquired
– A few will stay independent (but may merge with each
other)
• Hadoop recedes into the service layer
• NoSQL shakes out, matures, coexists
• NewSQL gets adopted or acquired
• In-memory becomes a standard option
31. Risks and Considerations
• Pick an esoteric database now and you may be
forced to migrate later
• SQL Server and Oracle could add features that
make the specialty products superfluous
– Or new products
• Conversely, NoSQL products may acquire
ACID-like features themselves
• More convergence
32. Recommendations
• NoSQL has its use cases. But it also has its
abuses.
• Look carefully at the number of customers
• Look also at how widely deployed the product
is within those customer companies
33. Recommendations
• If you haven’t looked seriously at Hadoop, do so.
But remember, it’s infrastructure.
• You can reach out to Big Data now, or you can
wait for it to reach out to you
– Cost/benefit of earlier adoption vs. late following
• For repeatable big problems, MapReduce works
well; for iterative query, “SQL” technologies are
much better
– akin to standard reports versus ad hoc queries
34. Parting Thoughts
• NoSQL and Big Data are disruptive
• You ignore them at your peril
• But if they can’t ultimately blend into current
technology environments, they’re destined
to fail
• You can embrace the change without being
sacrificed. Just watch your back.