Integrated Data Warehouse with Hadoop and Oracle Database
- 2. Why Pythian
• Recognized Leader:
• Global industry leader in data infrastructure managed services and consulting with expertise
in Oracle, Oracle Applications, Microsoft SQL Server, MySQL, big data and systems
administration
• Work with over 200 multinational companies such as Forbes.com, Fox Sports, Nordion and
Western Union to help manage their complex IT deployments
• Expertise:
• One of the world’s largest concentrations of dedicated, full-time DBA expertise. Employ 8
Oracle ACEs/ACE Directors
• Hold 7 Specializations under Oracle Platinum Partner program, including Oracle Exadata,
Oracle GoldenGate & Oracle RAC
• Global Reach & Scalability:
• 24/7/365 global remote support for DBA and consulting, systems administration, special
projects or emergency response
© 2012 – Pythian
- 3. About Gwen Shapira
• Oracle ACE Director
• 13 Years with pager
• 7 as Oracle DBA
• Senior Consultant:
• Has MacBook, will travel.
• @gwenshap
• http://www.pythian.com/news/author/shapira/
- 4. Agenda
• What is Big Data?
• Why do we care about Big Data?
• Why your DWH needs Hadoop
• Examples of Hadoop in the DWH
• How to integrate Hadoop into your DWH
• Avoiding major pitfalls
- 9. What is Big Data?
• Data arriving at fast rates
• Typically unstructured
• Stored without aggregation
• Analyzed in real time
• For reasonable cost
- 10. Where does Big Data come from?
• Social media
• Enterprise transactional data
• Consumer behaviour
• Multimedia
• Sensors and embedded devices
• Network devices
- 14. Big Problems with Big Data
• It is:
• Unstructured
• Unprocessed
• Un-aggregated
• Un-filtered
• Repetitive
• And generally messy.
Oh, and there is a lot of it.
- 15. Technical Challenges
• Storage capacity
• Storage throughput
• Pipeline throughput → Scalable storage
• Processing power
• Parallel processing → Massively parallel processing
• System integration
• Data analysis → Ready-to-use tools
- 17. Hadoop in a Nutshell
• HDFS: replicated, distributed big-data file system
• Map-Reduce: framework for writing massively parallel jobs
- 18. Hadoop Benefits
• Reliable solution based on unreliable hardware
• Designed for large files
• Load data first, structure later
• Designed to maximize throughput of large scans
• Designed to maximize parallelism
• Designed to scale
• Flexible development platform
• Solution Ecosystem
- 19. Hadoop Limitations
• Hadoop is scalable but not fast
• Batteries not included
• Instrumentation not included either
• Well-known reliability limitations
- 21. ETL for Unstructured Data
Logs Flume Hadoop
Web servers, Cleanup,
app server, aggregation
clickstreams Longterm storage
DWH
BI,
batch reports
- 22. ETL for Structured Data
OLTP (Oracle, MySQL, Informix…) → Sqoop, Perl → Hadoop (transformation, aggregation, long-term storage) → DWH → BI, batch reports
- 28. Sqoop
- 29. Sqoop is Flexible (for import)
• Select columns from table where condition
• Or write your own query
• Split column
• Parallel
• Incremental
• File formats
- 30. Sqoop Import Examples

sqoop import \
  --connect jdbc:oracle:thin:@//dbserver:1521/masterdb \
  --username hr \
  --table emp \
  --where "start_date > '01-01-2012'"

sqoop import \
  --connect jdbc:oracle:thin:@//dbserver:1521/masterdb \
  --username myuser \
  --table shops \
  --split-by shop_id \
  --num-mappers 16

(shop_id must be indexed or partitioned to avoid 16 full table scans)
- 31. Less Flexible Export
• 100-row batch inserts
• Commit every 100 batches
• Parallel export
• Update mode
Example:

sqoop export \
  --connect jdbc:oracle:thin:@//dbserver:1521/masterdb \
  --table bar \
  --export-dir /results/bar_data
- 32. Fuse-DFS
• Mount HDFS on Oracle server:
• sudo yum install hadoop-0.20-fuse
• hadoop-fuse-dfs dfs://name_node_hostname:namenode_port mount_point
• Use external tables to load data into Oracle
• File Formats may vary
• All ETL best practices apply
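Once HDFS is mounted through FUSE, its files look like ordinary OS files, so a plain external table can read them. A minimal sketch, assuming a comma-delimited Hadoop output file; the directory path, table name, columns, and file name are hypothetical, since the slide does not specify a schema:

```sql
-- Hypothetical sketch: a directory object pointing at the FUSE mount point,
-- then an external table reading a comma-delimited Hadoop output file.
CREATE DIRECTORY hdfs_mount AS '/mnt/hdfs/results';

CREATE TABLE clicks_ext (
  user_id    NUMBER,
  page_url   VARCHAR2(4000),
  click_time DATE
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY hdfs_mount
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    FIELDS TERMINATED BY ','
  )
  LOCATION ('part-00000')
);

-- From here, standard ETL applies, e.g.:
-- INSERT /*+ APPEND */ INTO clicks SELECT * FROM clicks_ext;
```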
- 33. Oracle Loader for Hadoop
• Load data from Hadoop into Oracle
• Map-Reduce job inside Hadoop
• Converts data types.
• Partitions and sorts
• Direct path loads
• Reduces CPU utilization on database
- 34. Oracle Direct Connector to HDFS
• Create external tables of files in HDFS
• PREPROCESSOR HDFS_BIN_PATH:hdfs_stream
• All the features of External Tables
• Tested (by Oracle) as 5 times faster (GB/s) than FUSE-DFS
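With the Direct Connector, the external table stays in Oracle, but the PREPROCESSOR clause streams file contents straight out of HDFS, skipping the FUSE layer. A minimal sketch; the table, columns, directory object, and location file name are hypothetical:

```sql
-- Hypothetical sketch: an external table whose preprocessor (hdfs_stream)
-- reads the data directly from HDFS rather than a local file system.
CREATE TABLE sales_hdfs_ext (
  sale_id NUMBER,
  amount  NUMBER
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY ext_data_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    PREPROCESSOR HDFS_BIN_PATH:hdfs_stream
    FIELDS TERMINATED BY ','
  )
  LOCATION ('sales_hdfs.loc')  -- location file listing the HDFS paths to read
);
```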
- 39. Use Hadoop Efficiently
• Understand your bottlenecks:
• CPU, storage or network?
• Reduce use of temporary data:
• All data is over the network
• Written to disk in triplicate.
• Eliminate unbalanced workloads
• Offload work to RDBMS
• Fine-tune optimization with Map-Reduce
- 40. Your Data is NOT as BIG as you think
- 41. Getting Started
• Pick a Business Problem
• Acquire Data
• Use right tool for the job
• Hadoop can start on the cheap
• Integrate the systems
• Analyze data
• Get operational
- 42. Thank You and Q&A
To contact us…
sales@pythian.com
1-877-PYTHIAN
To follow us…
http://www.pythian.com/news/
http://www.facebook.com/pages/The-Pythian-Group/163902527671
@pythian
@pythianjobs
http://www.linkedin.com/company/pythian