Modern big data solutions often incorporate Hadoop as one component and require integrating it with others, including Oracle Database. This presentation explains how Hadoop integrates with Oracle products, focusing specifically on the Oracle Database products. We explore the methods and tools available to move data between Oracle Database and Hadoop, show how to transparently access data in Hadoop from Oracle Database, and review how other products, such as Oracle Business Intelligence Enterprise Edition and Oracle Data Integrator, integrate with Hadoop.
2. Alex Gorbachev
• Chief Technology Officer at Pythian
• Blogger
• Cloudera Champion of Big Data
• OakTable Network member
• Oracle ACE Director
• Founder of BattleAgainstAnyGuess.com
• Founder of Sydney Oracle Meetup
• EVP, IOUG
4. Why the Big Data boom now?
• Advances in communication – it’s now feasible for anyone to transfer large amounts of data economically from virtually anywhere
• Commodity hardware – high performance and high capacity are available at low prices
• Commodity software – the open-source phenomenon has made advanced software products affordable to anyone
• New data sources – mobile, sensors, social media
• What was only possible at very high cost in the past can now be done by any business, small or large
6. Not everyone is Facebook, Google, or Yahoo
These guys had to push the envelope because traditional technology didn’t scale.
7. Not everyone is Facebook, Google, or Yahoo (cont.)
Mere mortals’ challenge is cost and agility.
8. System capability per $
Big Data technology may be expensive at low scale due to high engineering efforts. Traditional technology becomes too complex and expensive to scale.
[Chart: capabilities vs. investments ($), traditional vs. Big Data technology]
12. Why is Hadoop so affordable?
• Cheap hardware
• Resiliency through software
• Horizontal scalability
• Open-source software
13. How much does it cost?
Oracle Big Data Appliance X5-2 rack – $525K list price
• 18 data nodes
• 648 CPU cores
• 2.3 TB RAM
• 216 x 4 TB disks
• 864 TB of raw disk capacity
• 288 TB usable (triple mirroring)
• 40Gb InfiniBand + 10GbE networking
• Cloudera Enterprise
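The usable-capacity figure on the slide follows directly from HDFS’s default triple replication. A quick sanity check of the listed specs (a sketch; the per-node disk count is derived from 216 disks across 18 nodes):

```python
# Sanity-check the Big Data Appliance X5-2 storage specs from the slide.
data_nodes = 18
disks_per_node = 12          # 216 disks / 18 nodes
disk_tb = 4
replication_factor = 3       # HDFS default ("triple mirror")

raw_tb = data_nodes * disks_per_node * disk_tb
usable_tb = raw_tb // replication_factor

print(raw_tb)     # 864 TB raw, matching the slide
print(usable_tb)  # 288 TB usable after 3x replication
```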
14. Hadoop is very flexible
• Rich ecosystem of tools
• Can handle any data format
– Relational
– Text
– Audio, video
– Streaming data
– Logs
– Non-relational structured data (JSON, XML, binary formats)
– Graph data
• Not limited to relational data processing
15. Challenges with Hadoop
for those of us used to Oracle
• New data access tools
– Relational and non-relational data
• Non-Oracle (and non-ANSI) Hive SQL
– Java-based UDFs and UDAFs
• Security features are not there out-of-the-box
• May be slow for “small data”
17. Apache Hive
• Apache Hive provides a SQL layer over Hadoop
– data in HDFS (structured or unstructured via SerDe)
– using one of the distributed processing frameworks – MapReduce, Spark, Tez
• Presents data from HDFS as tables and columns
– Hive metastore (aka data dictionary)
• SQL language access (HiveQL)
– Parses SQL and creates execution plans in MR, Spark, or Tez
• JDBC and ODBC drivers
– Access from ETL and BI tools
– Custom apps
– Development tools
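Conceptually, Hive compiles a statement like `SELECT word, COUNT(*) FROM docs GROUP BY word` into map and reduce phases. A minimal local sketch of that translation (pure Python, no Hadoop required; the sample documents are invented):

```python
from collections import defaultdict

# What Hive's planner does for "SELECT word, COUNT(*) ... GROUP BY word",
# sketched as explicit map / shuffle / reduce phases.

def map_phase(records):
    # Map: emit (key, 1) pairs, one per word.
    for line in records:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values -- here, COUNT(*).
    return {key: sum(values) for key, values in groups.items()}

docs = ["hive compiles sql", "sql to mapreduce", "hive metastore"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["hive"])  # 2
print(counts["sql"])   # 2
```

The same shape applies whether the execution engine is MapReduce, Spark, or Tez; only the physical runtime differs.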
19. Access Hive using SQL Developer
• Demo
• Use Cloudera JDBC drivers
• Query data & browse metadata
• Run DDL from SQL tab
• Create Hive table definitions inside Oracle DB
20. Hadoop and OBIEE 11g
• OBIEE 11.1.1.7 can query Hive/Hadoop as a
data source
– Hive ODBC drivers
– Apache Hive Physical Layer database type
• Limited features
– OBIEE 11.1.1.7 ships HiveServer1 ODBC drivers
– HiveQL is only a subset of ANSI SQL
• Hive query response times are too slow for speed-of-thought analysis
21. ODI 12c
• ODI – data transformation tool
– ELT approach pushes transformations down to Hadoop, leveraging the power of the cluster
– Hive, HBase, Sqoop and OLH/ODCH KMs provide native Hadoop loading / transformation
• Upcoming support for Pig and Spark
• Workflow orchestration
• Metadata and model-driven
• GUI workflow design
• Transformation audit & data quality
22. Moving Data to Hadoop using ODI
• Interface with Apache Sqoop using the IKM SQL to Hive-HBase-File knowledge module
– Hadoop ecosystem tool
– Able to run in parallel
– Optimized Sqoop JDBC driver integration for Oracle
– Bi-directional, in and out of Hadoop to RDBMS
– Data is moved directly between the Hadoop cluster and the database
• Export RDBMS data to a file and load it using IKM File to Hive
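Under the hood, the Sqoop-based knowledge module drives a `sqoop import` invocation. A sketch of assembling one (the JDBC URL, credentials, and table names below are hypothetical placeholders, not from the presentation):

```python
# Assemble the kind of "sqoop import" command that ODI's IKM SQL to
# Hive-HBase-File drives under the hood. All connection details here
# are hypothetical placeholders.

def sqoop_import_cmd(jdbc_url, username, table, hive_table, mappers=4):
    # --num-mappers controls how many parallel slices Sqoop runs
    # on the cluster, which is where the "able to run in parallel"
    # bullet comes from.
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "--table", table,
        "--hive-import",
        "--hive-table", hive_table,
        "--num-mappers", str(mappers),
    ]

cmd = sqoop_import_cmd(
    "jdbc:oracle:thin:@//dbhost:1521/ORCL",  # hypothetical connection
    "scott", "SALES", "sales_hive", mappers=8,
)
print(" ".join(cmd))
```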
24. Oracle Big Data Connectors
• Oracle Loader for Hadoop
– Offloads some pre-processing to Hadoop MR jobs (data type conversion, partitioning, sorting)
– Direct load into the database (online method)
– Data Pump binary files in HDFS (offline method); these can then be accessed as external tables on HDFS
• Oracle Direct Connector for Hadoop
– Creates external tables on files in HDFS
– Text files or Data Pump binary files
– WARNING: lots of data movement! Great for archiving infrequently accessed data to HDFS
25. Oracle Big Data SQL
Source: http://www.slideshare.net/gwenshap/data-wrangling-and-oracle-connectors-for-hadoop
26. Oracle Big Data SQL
• Transparent access from Oracle DB to Hadoop
– Oracle SQL dialect
– Oracle DB security model
– Join data from Hadoop and Oracle
• SmartScan – pushing code to data
– Same software base as on Exadata Storage Cells
– Minimizes data transfer from Hadoop to Oracle
• Requires BDA and Exadata
• Licensed per Hadoop disk spindle
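The SmartScan idea can be shown in miniature: evaluate the WHERE predicate and column projection on the storage side, so only qualifying rows and needed columns cross the wire. A toy illustration (the log records and field names are made up):

```python
# Toy model of SmartScan: filter (WHERE) and project (SELECT list) where
# the data lives, so only the needed bytes travel to the database tier.

rows = [  # hypothetical web-log records stored "in Hadoop"
    {"ts": "2015-01-01", "status": 200, "url": "/a", "payload": "x" * 100},
    {"ts": "2015-01-02", "status": 500, "url": "/b", "payload": "x" * 100},
    {"ts": "2015-01-03", "status": 500, "url": "/c", "payload": "x" * 100},
]

def smart_scan(rows, predicate, columns):
    # Storage-side scan: evaluate the predicate first, then ship only
    # the projected columns for surviving rows.
    return [{c: r[c] for c in columns} for r in rows if predicate(r)]

errors = smart_scan(rows, lambda r: r["status"] == 500, ["ts", "url"])
print(errors)  # two narrow rows instead of three wide ones
```

Real SmartScan does this inside the storage software, but the payoff is the same: the wide `payload` column and the non-matching rows never leave the Hadoop side.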
30. Traditional Needs of Data Warehouses
• Speed of thought end user analytics experience
– BI tools coupled with DW databases
• Scalable data platform
– DW database
• Versatile and scalable data transformation engine
– ETL tools sometimes coupled with DW databases
• Data quality control and audit
– ETL tools
33. What drives Hadoop adoption for Data Warehousing?
1. Cost efficiency
2. Agility needs
34. Why is Hadoop Cost Efficient?
Hadoop leverages two main trends in the IT industry:
• Commodity hardware – high performance and high capacity are available at low prices
• Commodity software – the open-source phenomenon has made advanced software products affordable to anyone
35. How Does Hadoop Enable Agility?
• Load first, structure later
– No need to spend months changing the DW to add new types of data without knowing for sure they will be valuable to end users
– Quick and easy to verify a hypothesis – a perfect data exploration platform
• All data in one place is very powerful
– Much easier to test new theories
• Natural fit for “unstructured” data
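“Load first, structure later” is essentially schema-on-read: raw records land untouched, and structure is imposed only when a question is asked. A minimal sketch (the event records are invented for illustration):

```python
import json

# Schema-on-read: raw JSON lines are stored as-is; the "schema" is just
# the fields we choose to pull out at query time.
raw_lines = [
    '{"user": "ann", "event": "click", "device": "mobile"}',
    '{"user": "bob", "event": "view"}',                        # no device field
    '{"user": "ann", "event": "click", "extra": {"ab": "B"}}',
]

def query(lines, event_type):
    # Impose structure only now: parse, filter, and tolerate missing fields.
    rows = (json.loads(l) for l in lines)
    return [(r["user"], r.get("device", "unknown"))
            for r in rows if r["event"] == event_type]

print(query(raw_lines, "click"))  # [('ann', 'mobile'), ('ann', 'unknown')]
```

Adding a new field to the feed requires no DW schema change; queries that care about it simply start reading it.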
36. Traditional needs of DW & Hadoop
• Speed-of-thought end user analytics experience?
– Very recent features – Impala, Presto, Drill, Hadapt, etc.
– BI tools embracing Hadoop as a DW
– Totally new products becoming available
• Scalable data platform?
– Yes
• Versatile and scalable data transformation engine?
– Yes, but needs a lot of DIY
– ETL vendors have embraced Hadoop
• Data quality control and audit?
– Hadoop makes it more difficult because of the flexibility it brings
– A lot of DIY, but ETL vendors are getting better at supporting Hadoop, and new products are appearing
37. Unique Hadoop Challenges
• Still “young” technology
– requires a lot of high quality engineering talent
• Security doesn’t come out of the box
– Capabilities are there but very tedious to implement and somewhat fragile
• Challenge of selecting the right tool for the job
– Hadoop ecosystem is huge
• Hadoop breaks IT silos
• Requires commoditization of IT operations
– Large footprint with agile deployments
52. Thanks and Q&A
Contact info
gorbachev@pythian.com
+1-877-PYTHIAN
To follow us
pythian.com/blog
@alexgorbachev
@pythian
linkedin.com/company/pythian
Editor's Notes
WHERE Clause Evaluation
Column Projection
Bloom Filters for Better Join Performance
JSON Parsing, Data Mining Model Evaluation
There is a lot of interesting data that is not generated by your company.
Listings of businesses in specific locations.
Connections in social media
The data may be unstructured, semi-structured or even structured, but it isn’t structured in the way your DWH expects and needs.
We need a landing pad for cleanup, pre-processing, aggregating, filtering and structuring.
Hadoop is perfect for this.
Mappers can scrape data from websites efficiently.
Map-reduce jobs then clean up and process the data.
And then load the results into your DWH.
We want the top 3 items bought by left-handed women between the ages of 41 and 43, on November 15, 1998.
How long will it take you to answer this question? For one of my customers, the answer is 25 minutes.
As data grows older, it usually becomes less valuable to the business, and it gets aggregated and shelved off to tapes or other cheap storage. This means that for many organizations, answering detailed questions about events that happened more than a few months ago is impossible, or at least very challenging. The business learned to never ask those questions, because the answer is “you can’t”.
Hadoop combines cheap storage and massive processing power, which allows us to store the detailed history of our business and to generate reports about it. And once the answer to questions about history is “You will have your data in 25 minutes” instead of “impossible”, the questions turn out to be less rare than we assumed.
7 Petabytes of log file data
3 lines point to the security hole that allowed a break-in last week
Your DWH has aggregated information from the logs. Maybe.
Hadoop is very cost effective about storing data. Lots of cheap disks, easy to throw data in without pre-processing.
Search the data when you need it.
Bad schema design is not big data.
Using 8-year-old hardware is not big data.
Not having a purging policy is not big data.
Not configuring your database and operating system correctly is not big data.
Poor data filtering is not big data either.
Keep the data you need and use, in a way that you can actually use it.
If doing this requires cutting-edge technology, excellent! But don’t tell me you need NoSQL because you don’t purge data and have un-optimized PL/SQL running on 10-year-old hardware.