SlideShare ist ein Scribd-Unternehmen logo
1 von 25
© Hortonworks Inc. 2013.© Hortonworks Inc. 2013.
Hive 0.13: An upgrade in
Performance, Scaling,
Security and Multi-tenancy
Vikram Dixit
(vikram@apache.org)
© Hortonworks Inc. 2013.© Hortonworks Inc. 2013.
Hive – SQL on Hadoop
• Open source Apache project
• Started by Facebook in 2009
• Tools to enable easy data extract/transform/load (ETL)
• Work with structured, unstructured, semi-structured data
• Access to files stored either directly in Apache HDFSTM or in other
data storage systems such as Apache HBaseTM
• Query execution via MapReduce/Tez
• Metadata sharing via HCatalog allows your Pig scripts to work with
Hive tables
© Hortonworks Inc. 2013.© Hortonworks Inc. 2013.
Using Hive Effectively
• Understanding Hive’s use case
– Current focus on making it a fast analytics engine that scales
– Transactions coming
• Understand the storage mechanism right for you
– ORC File - highest compression, metadata used to enable faster reads
– Parquet - intermediate compression, fast reads
– RC File - legacy, most widely used but suffers in performance
– Text - ease of use but lowest in terms of performance
• Use the right execution engine
– Tez is the recommended execution engine for performance
– Map reduce is chosen by default in cases where tez can not yet run the query
• Use the right configuration flags
– Many optimizations are turned on by default
– Some are not. Need to tune it for your cluster because a default is hard to come up
with.
© Hortonworks Inc. 2013.© Hortonworks Inc. 2013.
What’s new in Hive 0.13
•Speed
– Hive on Tez – Broadcast Joins, Bucket Map Joins
– Vectorized Query processing
– Split elimination for ORC file
– Parquet file format support
•Scale
– Smaller hash tables allowing more scalable map joins
– More scalable dynamic partition loads
© Hortonworks Inc. 2013.© Hortonworks Inc. 2013.
What’s new in Hive 0.13
• More SQL improvements
– SQL standard Authorization
– Char support, Decimal improvements
– Permanent UDFs
– Streaming ingest from Flume for ACID capability
• Additional Improvements
– Hive Server 2 improvements
– HCatalog parity with Hive data types
– JDBC improvements viz. job cancel, async execution
• Even more goodies
– Mavenization
– Parallel test framework
– Lots of documentation
© Hortonworks Inc. 2013.© Hortonworks Inc. 2013.
Stinger Project
(announced February 2013)
Batch AND Interactive SQL-IN-
Hadoop
Stinger Initiative
A broad, community-based effort to
drive the next generation of HIVE
Hive 0.13, April, 2013
• Hive on Apache Tez
• Query Service
• Buffer Cache
• Cost Based Optimizer (Optiq)
• Vectorized Processing
Hive 0.11, May 2013:
• Base Optimizations
• SQL Analytic Functions
• ORCFile, Modern File Format
Hive 0.12, October 2013:
• VARCHAR, DATE Types
• ORCFile predicate pushdown
• Advanced Optimizations
• Performance Boosts via YARN
Speed
Improve Hive query performance by 100X to
allow for interactive query times (seconds)
Scale
The only SQL interface to Hadoop designed
for queries that scale from TB to PB
SQL
Support broadest range of SQL semantics for
analytic applications running against Hadoop
…all IN Hadoop
Goals:
© Hortonworks Inc. 2013.© Hortonworks Inc. 2013.
SPEED: Increasing Hive
Performance
Key Highlights
– Tez: New execution engine
– Vectorized Query Processing
– Startup time improvement
– Statistics to accelerate query execution
– Cost Based Optimizer: Optiq (missed the cut)
Interactive Query Times across ALL use cases
• Simple and advanced queries in seconds
• Integrates seamlessly with existing tools
• Currently a >100x improvement in just nine months
Elements of Fast SQL Execution
• Query Planner/Cost Based Optimizer
w/ Statistics
• Query Startup
• Query Execution
• I/O Path
© Hortonworks Inc. 2013.© Hortonworks Inc. 2013.
Apache Tez (“Speed”)
• Replaces MapReduce as primitive for Pig, Hive, Cascading etc.
– Smaller latency for interactive queries
– Higher throughput for batch queries
– 22 contributors: Hortonworks (13), Facebook, Twitter, Yahoo, Microsoft
YARN ApplicationMaster to run DAG of Tez Tasks
Task with pluggable Input, Processor and Output
Tez Task - <Input, Processor, Output>
Task
ProcessorInput Output
© Hortonworks Inc. 2013.© Hortonworks Inc. 2013.
Hive – MR Hive – Tez
Hive-on-MR vs. Hive-on-Tez
SELECT g1.x, g1.avg, g2.cnt
FROM (SELECT a.x, AVERAGE(a.y) AS avg FROM a GROUP BY a.x) g1
JOIN (SELECT b.x, COUNT(b.y) AS avg FROM b GROUP BY b.x) g2
ON (g1.x = g2.x)
ORDER BY avg;
GROUP a BY a.x
JOIN (a,b)
GROUP b BY b.x
ORDER BY
M M M
R R
M M
R
M M
R
M
R
HDFS HDFS
HDFS
M M M
R R
R
M M
R
GROUP BY a.x
JOIN (a,b)
ORDER BY
GROUP BY x
Tez avoids
unnecessary writes
to HDFS
HIVE-4660
© Hortonworks Inc. 2013.© Hortonworks Inc. 2013.
Shuffle Join
SELECT ss.ss_item_sk, ss.ss_quantity, inv.inv_quantity_on_hand
FROM inventory inv
JOIN store_sales ss
ON (inv.inv_item_sk = ss.ss_item_sk);
Hive – MR Hive – Tez
© Hortonworks Inc. 2013.© Hortonworks Inc. 2013.
Broadcast Join
• Similar to map-join w/o the need to build a hash table on the
client
• Will work with any level of sub-query nesting
• Uses stats to determine if applicable
• How it works:
– Broadcast result set is computed in parallel on the cluster
– Join processor are spun up in parallel
– Broadcast set is streamed to join processor
– Join processors build hash table
– Other relation is joined with hashtable
• Tez handles:
– Best parallelism
– Best data transfer of the hashed relation
– Best scheduling to avoid latencies
© Hortonworks Inc. 2013.© Hortonworks Inc. 2013.
Broadcast Join
Hive – MR Hive – Tez
M M M
M
HDFS
M MM
M M
HDFS
SELECT ss.ss_item_sk, ss.ss_quantity, inv.inv_quantity_on_hand
FROM store_sales ss
JOIN inventory inv
ON (inv.inv_item_sk = ss.ss_item_sk);
HDFS
Inventory scan
(Runs as single
local map task)
Store Sales scan
and Join
(Inventory hash
table read as side
file)
Inventory scan
(Runs on cluster
potentially more
than 1 mapper)
Store Sales scan
and Join
Broadcast
edge
© Hortonworks Inc. 2013.© Hortonworks Inc. 2013.
Dynamically partitioned Hash join
• Kicks in when large table is bucketed
– Bucketed table
– Dynamic as part of query processing
– Enabled via set hive.convert.join.bucket.mapjoin.tez = true; (use 0.13.1)
• Uses custom edge to match the partitioning on the smaller table
• Allows hash-join in cases where broadcast would be too large
• Tez gives us the option of building custom edges and vertex
managers
– Fine grained control over how the data is replicated and partitioned
– Scheduling and actual data transfer is handled by Tez
© Hortonworks Inc. 2013.© Hortonworks Inc. 2013.
Dynamically Partitioned Hash Join
SELECT ss.ss_item_sk, ss.ss_quantity, inv.inv_quantity_on_hand
FROM store_sales ss
JOIN inventory inv
ON (inv.inv_item_sk = ss.ss_item_sk);
Hive – MR Hive – Tez
M MM
M M
HDFS
Inventory scan
(Runs on cluster
potentially more
than 1 mapper)
Store Sales scan
and Join (Custom
vertex reads
both inputs – no
side file reads)
Custom edge
(routes outputs of
previous stage to
the correct
Mappers of the
next stage)M MM
M
HDFS
Inventory scan
(Runs as single
local map task)
Store Sales scan
and Join
(Inventory hash
table read as side
file)
HDFS
© Hortonworks Inc. 2013.© Hortonworks Inc. 2013.
Dynamically Partitioned Hash Join
Plans look very similar to map join but the way things work change between
MR and Tez.
Hive – MR (Bucket map-join) Hive – Tez
• Not dynamically partitioned.
• Both tables need to be bucketed by the join key.
• Local task that generates the hash table writes n
files corresponding to n buckets.
• Number of mappers for the join must be same
as the number of buckets.
• Each of these mappers reads the corresponding
bucket file of the local task to perform the join.
• Only one of the sides needs to be bucketed and
the other side is dynamically bucketed.
• Also works if neither side is explicitly bucketed,
but another operation forced bucketing in the
pipeline (traits)
• No writing to HDFS.
• There can be more mappers than number of
buckets but splits do not span multiple buckets.
• The dynamically bucketed mappers have as
many outputs as number of buckets and a
custom tez routing ensures these outputs reach
the right mappers.
© Hortonworks Inc. 2013.© Hortonworks Inc. 2013.
Bulk Inner loop: Vectorization
• Avoid Writable objects & use primitive int/long
– Allows efficient JIT code for primitive types
• Generate per-type loops & avoid runtime type-checks
• The classes generated look like
– LongColEqualDoubleColumn
– LongColEqualLongColumn
– LongColEqualLongScalar
• Avoid duplicate operations on repeated values
– isRepeating & hasNulls
© Hortonworks Inc. 2013.© Hortonworks Inc. 2013.
ORC: ZeroCopy & caching
• Use memory mapped I/O path in HDFS
– HDFS in-memory cache
• ORC reads can start deserializing early
– there is no blocking read() call
• Allow OS read-ahead to kick-in
• Use buffer-cache pages without copying it
• Avoid wasting heap space on ORC stripes
• Decompress directly from mapped buffers
– Fast JNI code for SNAPPY decompressors
© Hortonworks Inc. 2013.© Hortonworks Inc. 2013.
Scaling
• Reduce size of map join hash tables
– Hundred bytes were being used to store an integer (Map join key)
– HIVE-6430 reduced sizes of the hash tables by 60-70% in many cases
– Allowed more efficient use of memory and hence more tables to fit in
• Large number of open record writers in ORC file reduced to just 1
– HIVE-6455
– Now in a multi-insert scenario, performance is much better and many more
inserts can be done in parallel
© Hortonworks Inc. 2013.© Hortonworks Inc. 2013.
TPC-DS 10 TB Query Times (Tuned Queries)
Page 19
Data: 10 TB data loaded into ORCFile using defaults.
Hardware: 20 nodes: 24 CPU threads, 64 GB RAM, 5 local SATA drives each.
Queries: TPC-DS queries with partition filters added.
© Hortonworks Inc. 2011
Security
Page 20
Architecting the Future of Big Data
• Old authorization based on grant/revoke
• Incomplete model - eg. Anybody can run grant statement
• Does not try following standard
• Why follow standard ?
• Lot of thought has been put into the standard – important for
security!
• It’s a standard!
• Hive should have built-in authorization
• Easy to use, no additional components to manage
• New features get added that needs authorization
• Life cycle of objects should be synced with authorization policy
© Hortonworks Inc. 2011
Managing privileges
Page 21
Architecting the Future of Big Data
• Grant/revoke privilege on object to/from user/role
• SHOW GRANT statement
• view privilege grants based on user/role name and/or object name
• INSERT, SELECT, DELETE, UPDATE, ALL
• Privileges for some actions based on object ownership
• Table/view ownership : Most alter commands, drop
• Database ownership : create table, drop database
• URI privileges based on file permissions
• https://cwiki.apache.org/confluence/display/Hive/SQL+Stand
ard+based+hive+authorization#SQLStandardBasedHiveAuthor
ization-Configuration
• Use hive 0.13.1 – fixes the issues listed under known issues in
above wiki doc.
© Hortonworks Inc. 2013.© Hortonworks Inc. 2013.
Hive Server 2 improvements
• Hive server 2 now supports thrift over HTTP and kerberos/LDAP
authentication on HTTP
• Also supports HTTPS
• HiveServer2 can keep sessions alive
– Between different JDBC queries
• New security model helps
– All secure queries run as “hive” user
• Ideal for short exploratory queries
• Uses same JARs (no download for task)
• Even better JIT performance on >1 queries
© Hortonworks Inc. 2013.© Hortonworks Inc. 2013.
Other improvements
• Insert-update-delete semantics. Streaming ingest from flume. (HIVE-
5317)
– Transaction manager added in. Support for ORC file format only at this time.
• Lots of UDF support via permanent functions. No need to have add
jar for most commonly used UDFs. Ideally, admin adds the permanent
(trusted) functions.
• Parquet is a supported storage format.
https://cwiki.apache.org/confluence/display/Hive/Parquet
• HCatalog now supports all the datatypes supported in Hive.
• Hive is now mavenized (Thanks Brock Noland!)
• Parallel test framework means Unit testing happens faster and
changes get in faster.
• Lots of new documentation for all the new features. (Thanks Lefty!)
• Bottom line: Hive 0.13 is the fastest, most feature rich version of hive
so far.
© Hortonworks Inc. 2013.© Hortonworks Inc. 2013.
Future Work
• Lot more improvements coming up
• Speed
– Sort Merge Bucket Map join in Tez
– Total ordering of data
– Skew joins
– Cost based Optimizer
• Security
– Authorizing permanent UDF access
– Authorizing ‘show grant’
– Support hdfs ACL in URI permission checks (new in hadoop 2.4)
– More SQL syntax support – eg revoke just admin option on a role
• Multi-tenancy
– Sticky HS2 sessions for improved performance in a multi-tenant environment
– Improve scheduling in a multi-tenant environment
© Hortonworks Inc. 2013.© Hortonworks Inc. 2013.
Questions?

Weitere ähnliche Inhalte

Was ist angesagt?

HBase at Bloomberg: High Availability Needs for the Financial Industry
HBase at Bloomberg: High Availability Needs for the Financial IndustryHBase at Bloomberg: High Availability Needs for the Financial Industry
HBase at Bloomberg: High Availability Needs for the Financial IndustryHBaseCon
 
HBaseCon 2015: HBase and Spark
HBaseCon 2015: HBase and SparkHBaseCon 2015: HBase and Spark
HBaseCon 2015: HBase and SparkHBaseCon
 
Hadoop Hardware @Twitter: Size does matter!
Hadoop Hardware @Twitter: Size does matter!Hadoop Hardware @Twitter: Size does matter!
Hadoop Hardware @Twitter: Size does matter!DataWorks Summit
 
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...Hadoop / Spark Conference Japan
 
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, ClouderaHBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, ClouderaCloudera, Inc.
 
Big Data and Hadoop Ecosystem
Big Data and Hadoop EcosystemBig Data and Hadoop Ecosystem
Big Data and Hadoop EcosystemRajkumar Singh
 
Large-scale Web Apps @ Pinterest
Large-scale Web Apps @ PinterestLarge-scale Web Apps @ Pinterest
Large-scale Web Apps @ PinterestHBaseCon
 
HBaseCon 2012 | You’ve got HBase! How AOL Mail Handles Big Data
HBaseCon 2012 | You’ve got HBase! How AOL Mail Handles Big DataHBaseCon 2012 | You’ve got HBase! How AOL Mail Handles Big Data
HBaseCon 2012 | You’ve got HBase! How AOL Mail Handles Big DataCloudera, Inc.
 
Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Big Data and Hadoop - History, Technical Deep Dive, and Industry TrendsBig Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Big Data and Hadoop - History, Technical Deep Dive, and Industry TrendsEsther Kundin
 
Quick Introduction to Apache Tez
Quick Introduction to Apache TezQuick Introduction to Apache Tez
Quick Introduction to Apache TezGetInData
 
Digital Library Collection Management using HBase
Digital Library Collection Management using HBaseDigital Library Collection Management using HBase
Digital Library Collection Management using HBaseHBaseCon
 
Kudu - Fast Analytics on Fast Data
Kudu - Fast Analytics on Fast DataKudu - Fast Analytics on Fast Data
Kudu - Fast Analytics on Fast DataRyan Bosshart
 
Pig on Tez - Low Latency ETL with Big Data
Pig on Tez - Low Latency ETL with Big DataPig on Tez - Low Latency ETL with Big Data
Pig on Tez - Low Latency ETL with Big DataDataWorks Summit
 
HBaseCon 2015: Elastic HBase on Mesos
HBaseCon 2015: Elastic HBase on MesosHBaseCon 2015: Elastic HBase on Mesos
HBaseCon 2015: Elastic HBase on MesosHBaseCon
 
HBase Applications - Atlanta HUG - May 2014
HBase Applications - Atlanta HUG - May 2014HBase Applications - Atlanta HUG - May 2014
HBase Applications - Atlanta HUG - May 2014larsgeorge
 
2015 GHC Presentation - High Availability and High Frequency Big Data Analytics
2015 GHC Presentation - High Availability and High Frequency Big Data Analytics2015 GHC Presentation - High Availability and High Frequency Big Data Analytics
2015 GHC Presentation - High Availability and High Frequency Big Data AnalyticsEsther Kundin
 
Migrating structured data between Hadoop and RDBMS
Migrating structured data between Hadoop and RDBMSMigrating structured data between Hadoop and RDBMS
Migrating structured data between Hadoop and RDBMSBouquet
 

Was ist angesagt? (20)

Hadoop Ecosystem Overview
Hadoop Ecosystem OverviewHadoop Ecosystem Overview
Hadoop Ecosystem Overview
 
HBase at Bloomberg: High Availability Needs for the Financial Industry
HBase at Bloomberg: High Availability Needs for the Financial IndustryHBase at Bloomberg: High Availability Needs for the Financial Industry
HBase at Bloomberg: High Availability Needs for the Financial Industry
 
HBaseCon 2015: HBase and Spark
HBaseCon 2015: HBase and SparkHBaseCon 2015: HBase and Spark
HBaseCon 2015: HBase and Spark
 
Hadoop Hardware @Twitter: Size does matter!
Hadoop Hardware @Twitter: Size does matter!Hadoop Hardware @Twitter: Size does matter!
Hadoop Hardware @Twitter: Size does matter!
 
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
 
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, ClouderaHBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera
 
Big Data and Hadoop Ecosystem
Big Data and Hadoop EcosystemBig Data and Hadoop Ecosystem
Big Data and Hadoop Ecosystem
 
Large-scale Web Apps @ Pinterest
Large-scale Web Apps @ PinterestLarge-scale Web Apps @ Pinterest
Large-scale Web Apps @ Pinterest
 
HBaseCon 2012 | You’ve got HBase! How AOL Mail Handles Big Data
HBaseCon 2012 | You’ve got HBase! How AOL Mail Handles Big DataHBaseCon 2012 | You’ve got HBase! How AOL Mail Handles Big Data
HBaseCon 2012 | You’ve got HBase! How AOL Mail Handles Big Data
 
Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Big Data and Hadoop - History, Technical Deep Dive, and Industry TrendsBig Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends
 
Quick Introduction to Apache Tez
Quick Introduction to Apache TezQuick Introduction to Apache Tez
Quick Introduction to Apache Tez
 
Digital Library Collection Management using HBase
Digital Library Collection Management using HBaseDigital Library Collection Management using HBase
Digital Library Collection Management using HBase
 
February 2014 HUG : Hive On Tez
February 2014 HUG : Hive On TezFebruary 2014 HUG : Hive On Tez
February 2014 HUG : Hive On Tez
 
Big Data Journey
Big Data JourneyBig Data Journey
Big Data Journey
 
Kudu - Fast Analytics on Fast Data
Kudu - Fast Analytics on Fast DataKudu - Fast Analytics on Fast Data
Kudu - Fast Analytics on Fast Data
 
Pig on Tez - Low Latency ETL with Big Data
Pig on Tez - Low Latency ETL with Big DataPig on Tez - Low Latency ETL with Big Data
Pig on Tez - Low Latency ETL with Big Data
 
HBaseCon 2015: Elastic HBase on Mesos
HBaseCon 2015: Elastic HBase on MesosHBaseCon 2015: Elastic HBase on Mesos
HBaseCon 2015: Elastic HBase on Mesos
 
HBase Applications - Atlanta HUG - May 2014
HBase Applications - Atlanta HUG - May 2014HBase Applications - Atlanta HUG - May 2014
HBase Applications - Atlanta HUG - May 2014
 
2015 GHC Presentation - High Availability and High Frequency Big Data Analytics
2015 GHC Presentation - High Availability and High Frequency Big Data Analytics2015 GHC Presentation - High Availability and High Frequency Big Data Analytics
2015 GHC Presentation - High Availability and High Frequency Big Data Analytics
 
Migrating structured data between Hadoop and RDBMS
Migrating structured data between Hadoop and RDBMSMigrating structured data between Hadoop and RDBMS
Migrating structured data between Hadoop and RDBMS
 

Andere mochten auch

Aziksa hadoop for buisness users2 santosh jha
Aziksa hadoop for buisness users2 santosh jhaAziksa hadoop for buisness users2 santosh jha
Aziksa hadoop for buisness users2 santosh jhaData Con LA
 
140614 bigdatacamp-la-keynote-jon hsieh
140614 bigdatacamp-la-keynote-jon hsieh140614 bigdatacamp-la-keynote-jon hsieh
140614 bigdatacamp-la-keynote-jon hsiehData Con LA
 
20140614 introduction to spark-ben white
20140614 introduction to spark-ben white20140614 introduction to spark-ben white
20140614 introduction to spark-ben whiteData Con LA
 
Ag big datacampla-06-14-2014-ajay_gopal
Ag big datacampla-06-14-2014-ajay_gopalAg big datacampla-06-14-2014-ajay_gopal
Ag big datacampla-06-14-2014-ajay_gopalData Con LA
 
Big datacamp june14_alex_liu
Big datacamp june14_alex_liuBig datacamp june14_alex_liu
Big datacamp june14_alex_liuData Con LA
 
Big Data Day LA 2015 - Solr Search with Spark for Big Data Analytics in Actio...
Big Data Day LA 2015 - Solr Search with Spark for Big Data Analytics in Actio...Big Data Day LA 2015 - Solr Search with Spark for Big Data Analytics in Actio...
Big Data Day LA 2015 - Solr Search with Spark for Big Data Analytics in Actio...Data Con LA
 
2014 bigdatacamp asya_kamsky
2014 bigdatacamp asya_kamsky2014 bigdatacamp asya_kamsky
2014 bigdatacamp asya_kamskyData Con LA
 
Yarn cloudera-kathleenting061414 kate-ting
Yarn cloudera-kathleenting061414 kate-tingYarn cloudera-kathleenting061414 kate-ting
Yarn cloudera-kathleenting061414 kate-tingData Con LA
 
Summit v4 dave wolcott
Summit v4 dave wolcottSummit v4 dave wolcott
Summit v4 dave wolcottData Con LA
 
Big Data Day LA 2015 - HBase at Factual: Real time and Batch Uses by Molly O'...
Big Data Day LA 2015 - HBase at Factual: Real time and Batch Uses by Molly O'...Big Data Day LA 2015 - HBase at Factual: Real time and Batch Uses by Molly O'...
Big Data Day LA 2015 - HBase at Factual: Real time and Batch Uses by Molly O'...Data Con LA
 
Big Data Day LA 2015 - NoSQL: Doing it wrong before getting it right by Lawre...
Big Data Day LA 2015 - NoSQL: Doing it wrong before getting it right by Lawre...Big Data Day LA 2015 - NoSQL: Doing it wrong before getting it right by Lawre...
Big Data Day LA 2015 - NoSQL: Doing it wrong before getting it right by Lawre...Data Con LA
 
Kiji cassandra la june 2014 - v02 clint-kelly
Kiji cassandra la   june 2014 - v02 clint-kellyKiji cassandra la   june 2014 - v02 clint-kelly
Kiji cassandra la june 2014 - v02 clint-kellyData Con LA
 
Big Data Day LA 2015 - Lessons Learned from Designing Data Ingest Systems by ...
Big Data Day LA 2015 - Lessons Learned from Designing Data Ingest Systems by ...Big Data Day LA 2015 - Lessons Learned from Designing Data Ingest Systems by ...
Big Data Day LA 2015 - Lessons Learned from Designing Data Ingest Systems by ...Data Con LA
 
Hadoop and NoSQL joining forces by Dale Kim of MapR
Hadoop and NoSQL joining forces by Dale Kim of MapRHadoop and NoSQL joining forces by Dale Kim of MapR
Hadoop and NoSQL joining forces by Dale Kim of MapRData Con LA
 
Hadoop Innovation Summit 2014
Hadoop Innovation Summit 2014Hadoop Innovation Summit 2014
Hadoop Innovation Summit 2014Data Con LA
 
Big Data Day LA 2015 - Introducing N1QL: SQL for Documents by Jeff Morris of ...
Big Data Day LA 2015 - Introducing N1QL: SQL for Documents by Jeff Morris of ...Big Data Day LA 2015 - Introducing N1QL: SQL for Documents by Jeff Morris of ...
Big Data Day LA 2015 - Introducing N1QL: SQL for Documents by Jeff Morris of ...Data Con LA
 
Big Data Day LA 2015 - Deep Learning Human Vocalized Animal Sounds by Sabri S...
Big Data Day LA 2015 - Deep Learning Human Vocalized Animal Sounds by Sabri S...Big Data Day LA 2015 - Deep Learning Human Vocalized Animal Sounds by Sabri S...
Big Data Day LA 2015 - Deep Learning Human Vocalized Animal Sounds by Sabri S...Data Con LA
 
Big Data Day LA 2016/ Data Science Track - Decision Making and Lambda Archite...
Big Data Day LA 2016/ Data Science Track - Decision Making and Lambda Archite...Big Data Day LA 2016/ Data Science Track - Decision Making and Lambda Archite...
Big Data Day LA 2016/ Data Science Track - Decision Making and Lambda Archite...Data Con LA
 
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Introduction to Kafka - Je...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Introduction to Kafka - Je...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Introduction to Kafka - Je...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Introduction to Kafka - Je...Data Con LA
 
Big Data Day LA 2016/ Big Data Track - Twitter Heron @ Scale - Karthik Ramasa...
Big Data Day LA 2016/ Big Data Track - Twitter Heron @ Scale - Karthik Ramasa...Big Data Day LA 2016/ Big Data Track - Twitter Heron @ Scale - Karthik Ramasa...
Big Data Day LA 2016/ Big Data Track - Twitter Heron @ Scale - Karthik Ramasa...Data Con LA
 

Andere mochten auch (20)

Aziksa hadoop for buisness users2 santosh jha
Aziksa hadoop for buisness users2 santosh jhaAziksa hadoop for buisness users2 santosh jha
Aziksa hadoop for buisness users2 santosh jha
 
140614 bigdatacamp-la-keynote-jon hsieh
140614 bigdatacamp-la-keynote-jon hsieh140614 bigdatacamp-la-keynote-jon hsieh
140614 bigdatacamp-la-keynote-jon hsieh
 
20140614 introduction to spark-ben white
20140614 introduction to spark-ben white20140614 introduction to spark-ben white
20140614 introduction to spark-ben white
 
Ag big datacampla-06-14-2014-ajay_gopal
Ag big datacampla-06-14-2014-ajay_gopalAg big datacampla-06-14-2014-ajay_gopal
Ag big datacampla-06-14-2014-ajay_gopal
 
Big datacamp june14_alex_liu
Big datacamp june14_alex_liuBig datacamp june14_alex_liu
Big datacamp june14_alex_liu
 
Big Data Day LA 2015 - Solr Search with Spark for Big Data Analytics in Actio...
Big Data Day LA 2015 - Solr Search with Spark for Big Data Analytics in Actio...Big Data Day LA 2015 - Solr Search with Spark for Big Data Analytics in Actio...
Big Data Day LA 2015 - Solr Search with Spark for Big Data Analytics in Actio...
 
2014 bigdatacamp asya_kamsky
2014 bigdatacamp asya_kamsky2014 bigdatacamp asya_kamsky
2014 bigdatacamp asya_kamsky
 
Yarn cloudera-kathleenting061414 kate-ting
Yarn cloudera-kathleenting061414 kate-tingYarn cloudera-kathleenting061414 kate-ting
Yarn cloudera-kathleenting061414 kate-ting
 
Summit v4 dave wolcott
Summit v4 dave wolcottSummit v4 dave wolcott
Summit v4 dave wolcott
 
Big Data Day LA 2015 - HBase at Factual: Real time and Batch Uses by Molly O'...
Big Data Day LA 2015 - HBase at Factual: Real time and Batch Uses by Molly O'...Big Data Day LA 2015 - HBase at Factual: Real time and Batch Uses by Molly O'...
Big Data Day LA 2015 - HBase at Factual: Real time and Batch Uses by Molly O'...
 
Big Data Day LA 2015 - NoSQL: Doing it wrong before getting it right by Lawre...
Big Data Day LA 2015 - NoSQL: Doing it wrong before getting it right by Lawre...Big Data Day LA 2015 - NoSQL: Doing it wrong before getting it right by Lawre...
Big Data Day LA 2015 - NoSQL: Doing it wrong before getting it right by Lawre...
 
Kiji cassandra la june 2014 - v02 clint-kelly
Kiji cassandra la   june 2014 - v02 clint-kellyKiji cassandra la   june 2014 - v02 clint-kelly
Kiji cassandra la june 2014 - v02 clint-kelly
 
Big Data Day LA 2015 - Lessons Learned from Designing Data Ingest Systems by ...
Big Data Day LA 2015 - Lessons Learned from Designing Data Ingest Systems by ...Big Data Day LA 2015 - Lessons Learned from Designing Data Ingest Systems by ...
Big Data Day LA 2015 - Lessons Learned from Designing Data Ingest Systems by ...
 
Hadoop and NoSQL joining forces by Dale Kim of MapR
Hadoop and NoSQL joining forces by Dale Kim of MapRHadoop and NoSQL joining forces by Dale Kim of MapR
Hadoop and NoSQL joining forces by Dale Kim of MapR
 
Hadoop Innovation Summit 2014
Hadoop Innovation Summit 2014Hadoop Innovation Summit 2014
Hadoop Innovation Summit 2014
 
Big Data Day LA 2015 - Introducing N1QL: SQL for Documents by Jeff Morris of ...
Big Data Day LA 2015 - Introducing N1QL: SQL for Documents by Jeff Morris of ...Big Data Day LA 2015 - Introducing N1QL: SQL for Documents by Jeff Morris of ...
Big Data Day LA 2015 - Introducing N1QL: SQL for Documents by Jeff Morris of ...
 
Big Data Day LA 2015 - Deep Learning Human Vocalized Animal Sounds by Sabri S...
Big Data Day LA 2015 - Deep Learning Human Vocalized Animal Sounds by Sabri S...Big Data Day LA 2015 - Deep Learning Human Vocalized Animal Sounds by Sabri S...
Big Data Day LA 2015 - Deep Learning Human Vocalized Animal Sounds by Sabri S...
 
Big Data Day LA 2016/ Data Science Track - Decision Making and Lambda Archite...
Big Data Day LA 2016/ Data Science Track - Decision Making and Lambda Archite...Big Data Day LA 2016/ Data Science Track - Decision Making and Lambda Archite...
Big Data Day LA 2016/ Data Science Track - Decision Making and Lambda Archite...
 
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Introduction to Kafka - Je...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Introduction to Kafka - Je...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Introduction to Kafka - Je...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Introduction to Kafka - Je...
 
Big Data Day LA 2016/ Big Data Track - Twitter Heron @ Scale - Karthik Ramasa...
Big Data Day LA 2016/ Big Data Track - Twitter Heron @ Scale - Karthik Ramasa...Big Data Day LA 2016/ Big Data Track - Twitter Heron @ Scale - Karthik Ramasa...
Big Data Day LA 2016/ Big Data Track - Twitter Heron @ Scale - Karthik Ramasa...
 

Ähnlich wie La big datacamp2014_vikram_dixit

Performance Hive+Tez 2
Performance Hive+Tez 2Performance Hive+Tez 2
Performance Hive+Tez 2t3rmin4t0r
 
Gunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stingerGunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stingerhdhappy001
 
Apache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingApache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingTeddy Choi
 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosLester Martin
 
Unit II Hadoop Ecosystem_Updated.pptx
Unit II Hadoop Ecosystem_Updated.pptxUnit II Hadoop Ecosystem_Updated.pptx
Unit II Hadoop Ecosystem_Updated.pptxBhavanaHotchandani
 
Stinger hadoop summit june 2013
Stinger hadoop summit june 2013Stinger hadoop summit june 2013
Stinger hadoop summit june 2013alanfgates
 
An In-Depth Look at Putting the Sting in Hive
An In-Depth Look at Putting the Sting in HiveAn In-Depth Look at Putting the Sting in Hive
An In-Depth Look at Putting the Sting in HiveDataWorks Summit
 
Mapreduce over snapshots
Mapreduce over snapshotsMapreduce over snapshots
Mapreduce over snapshotsenissoz
 
Sept 17 2013 - THUG - HBase a Technical Introduction
Sept 17 2013 - THUG - HBase a Technical IntroductionSept 17 2013 - THUG - HBase a Technical Introduction
Sept 17 2013 - THUG - HBase a Technical IntroductionAdam Muise
 
April 2013 HUG: The Stinger Initiative - Making Apache Hive 100 Times Faster
April 2013 HUG: The Stinger Initiative - Making Apache Hive 100 Times FasterApril 2013 HUG: The Stinger Initiative - Making Apache Hive 100 Times Faster
April 2013 HUG: The Stinger Initiative - Making Apache Hive 100 Times FasterYahoo Developer Network
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2tcloudcomputing-tw
 
Big Data and Cloud Computing
Big Data and Cloud ComputingBig Data and Cloud Computing
Big Data and Cloud ComputingFarzad Nozarian
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewNisanth Simon
 
February 2014 HUG : Tez Details and Insides
February 2014 HUG : Tez Details and InsidesFebruary 2014 HUG : Tez Details and Insides
February 2014 HUG : Tez Details and InsidesYahoo Developer Network
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3tcloudcomputing-tw
 
Stinger Initiative - Deep Dive
Stinger Initiative - Deep DiveStinger Initiative - Deep Dive
Stinger Initiative - Deep DiveHortonworks
 
Big data processing engines, Atlanta Meetup 4/30
Big data processing engines, Atlanta Meetup 4/30Big data processing engines, Atlanta Meetup 4/30
Big data processing engines, Atlanta Meetup 4/30Ashish Narasimham
 
A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 DataWorks Summit
 

Ähnlich wie La big datacamp2014_vikram_dixit (20)

Performance Hive+Tez 2
Performance Hive+Tez 2Performance Hive+Tez 2
Performance Hive+Tez 2
 
Gunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stingerGunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stinger
 
Apache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingApache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query Processing
 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
 
Unit II Hadoop Ecosystem_Updated.pptx
Unit II Hadoop Ecosystem_Updated.pptxUnit II Hadoop Ecosystem_Updated.pptx
Unit II Hadoop Ecosystem_Updated.pptx
 
Stinger hadoop summit june 2013
Stinger hadoop summit june 2013Stinger hadoop summit june 2013
Stinger hadoop summit june 2013
 
An In-Depth Look at Putting the Sting in Hive
An In-Depth Look at Putting the Sting in HiveAn In-Depth Look at Putting the Sting in Hive
An In-Depth Look at Putting the Sting in Hive
 
Mapreduce over snapshots
Mapreduce over snapshotsMapreduce over snapshots
Mapreduce over snapshots
 
Sept 17 2013 - THUG - HBase a Technical Introduction
Sept 17 2013 - THUG - HBase a Technical IntroductionSept 17 2013 - THUG - HBase a Technical Introduction
Sept 17 2013 - THUG - HBase a Technical Introduction
 
April 2013 HUG: The Stinger Initiative - Making Apache Hive 100 Times Faster
April 2013 HUG: The Stinger Initiative - Making Apache Hive 100 Times FasterApril 2013 HUG: The Stinger Initiative - Making Apache Hive 100 Times Faster
April 2013 HUG: The Stinger Initiative - Making Apache Hive 100 Times Faster
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q2
 
Getting started big data
Getting started big dataGetting started big data
Getting started big data
 
Big Data and Cloud Computing
Big Data and Cloud ComputingBig Data and Cloud Computing
Big Data and Cloud Computing
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce Overview
 
February 2014 HUG : Tez Details and Insides
February 2014 HUG : Tez Details and InsidesFebruary 2014 HUG : Tez Details and Insides
February 2014 HUG : Tez Details and Insides
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
 
Stinger Initiative - Deep Dive
Stinger Initiative - Deep DiveStinger Initiative - Deep Dive
Stinger Initiative - Deep Dive
 
Big data processing engines, Atlanta Meetup 4/30
Big data processing engines, Atlanta Meetup 4/30Big data processing engines, Atlanta Meetup 4/30
Big data processing engines, Atlanta Meetup 4/30
 
MHUG - YARN
MHUG - YARNMHUG - YARN
MHUG - YARN
 
A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0
 

Mehr von Data Con LA

Data Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA
 
Data Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA
 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA
 
Data Con LA 2022 - Startup Showcase
Data Con LA 2022 - Startup ShowcaseData Con LA 2022 - Startup Showcase
Data Con LA 2022 - Startup ShowcaseData Con LA
 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA
 
Data Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendationsData Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendationsData Con LA
 
Data Con LA 2022 - AI Ethics
Data Con LA 2022 - AI EthicsData Con LA 2022 - AI Ethics
Data Con LA 2022 - AI EthicsData Con LA
 
Data Con LA 2022 - Improving disaster response with machine learning
Data Con LA 2022 - Improving disaster response with machine learningData Con LA 2022 - Improving disaster response with machine learning
Data Con LA 2022 - Improving disaster response with machine learningData Con LA
 
Data Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and AtlasData Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and AtlasData Con LA
 
Data Con LA 2022 - Real world consumer segmentation
Data Con LA 2022 - Real world consumer segmentationData Con LA 2022 - Real world consumer segmentation
Data Con LA 2022 - Real world consumer segmentationData Con LA
 
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...Data Con LA
 
Data Con LA 2022 - Moving Data at Scale to AWS
Data Con LA 2022 - Moving Data at Scale to AWSData Con LA 2022 - Moving Data at Scale to AWS
Data Con LA 2022 - Moving Data at Scale to AWSData Con LA
 
Data Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA 2022 - Collaborative Data Exploration using Conversational AIData Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA 2022 - Collaborative Data Exploration using Conversational AIData Con LA
 
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...Data Con LA
 
Data Con LA 2022 - Intro to Data Science
Data Con LA 2022 - Intro to Data ScienceData Con LA 2022 - Intro to Data Science
Data Con LA 2022 - Intro to Data ScienceData Con LA
 
Data Con LA 2022 - How are NFTs and DeFi Changing Entertainment
Data Con LA 2022 - How are NFTs and DeFi Changing EntertainmentData Con LA 2022 - How are NFTs and DeFi Changing Entertainment
Data Con LA 2022 - How are NFTs and DeFi Changing EntertainmentData Con LA
 
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...Data Con LA
 
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...Data Con LA
 
Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...Data Con LA
 
Data Con LA 2022 - Data Streaming with Kafka
Data Con LA 2022 - Data Streaming with KafkaData Con LA 2022 - Data Streaming with Kafka
Data Con LA 2022 - Data Streaming with KafkaData Con LA
 

Mehr von Data Con LA (20)

Data Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA 2022 Keynotes
Data Con LA 2022 Keynotes
 
Data Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA 2022 Keynotes
Data Con LA 2022 Keynotes
 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 Keynote
 
Data Con LA 2022 - Startup Showcase
Data Con LA 2022 - Startup ShowcaseData Con LA 2022 - Startup Showcase
Data Con LA 2022 - Startup Showcase
 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 Keynote
 
Data Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendationsData Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendations
 
Data Con LA 2022 - AI Ethics
Data Con LA 2022 - AI EthicsData Con LA 2022 - AI Ethics
Data Con LA 2022 - AI Ethics
 
Data Con LA 2022 - Improving disaster response with machine learning
Data Con LA 2022 - Improving disaster response with machine learningData Con LA 2022 - Improving disaster response with machine learning
Data Con LA 2022 - Improving disaster response with machine learning
 
Data Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and AtlasData Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and Atlas
 
Data Con LA 2022 - Real world consumer segmentation
Data Con LA 2022 - Real world consumer segmentationData Con LA 2022 - Real world consumer segmentation
Data Con LA 2022 - Real world consumer segmentation
 
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
 
Data Con LA 2022 - Moving Data at Scale to AWS
Data Con LA 2022 - Moving Data at Scale to AWSData Con LA 2022 - Moving Data at Scale to AWS
Data Con LA 2022 - Moving Data at Scale to AWS
 
Data Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA 2022 - Collaborative Data Exploration using Conversational AIData Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA 2022 - Collaborative Data Exploration using Conversational AI
 
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
 
Data Con LA 2022 - Intro to Data Science
Data Con LA 2022 - Intro to Data ScienceData Con LA 2022 - Intro to Data Science
Data Con LA 2022 - Intro to Data Science
 
Data Con LA 2022 - How are NFTs and DeFi Changing Entertainment
Data Con LA 2022 - How are NFTs and DeFi Changing EntertainmentData Con LA 2022 - How are NFTs and DeFi Changing Entertainment
Data Con LA 2022 - How are NFTs and DeFi Changing Entertainment
 
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
 
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
 
Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...
 
Data Con LA 2022 - Data Streaming with Kafka
Data Con LA 2022 - Data Streaming with KafkaData Con LA 2022 - Data Streaming with Kafka
Data Con LA 2022 - Data Streaming with Kafka
 

Kürzlich hochgeladen

Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 

Kürzlich hochgeladen (20)

Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 

La big datacamp2014_vikram_dixit

  • 1. © Hortonworks Inc. 2013.© Hortonworks Inc. 2013. Hive 0.13: An upgrade in Performance, Scaling, Security and Multi-tenancy Vikram Dixit (vikram@apache.org)
  • 2. © Hortonworks Inc. 2013.© Hortonworks Inc. 2013. Hive – SQL on Hadoop • Open source Apache project • Started by Facebook in 2009 • Tools to enable easy data extract/transform/load (ETL) • Work with structured, unstructured, semi-structured data • Access to files stored either directly in Apache HDFSTM or in other data storage systems such as Apache HBaseTM • Query execution via MapReduce/Tez • Metadata sharing via HCatalog allows your Pig scripts to work with Hive tables
  • 3. © Hortonworks Inc. 2013.© Hortonworks Inc. 2013. Using Hive Effectively • Understanding Hive’s use case – Current focus on making it a fast analytics engine that scales – Transactions coming • Understand the storage mechanism right for you – ORC File - highest compression, metadata used to enable faster reads – Parquet - intermediate compression, fast reads – RC File - legacy, most widely used but suffers in performance – Text - ease of use but lowest in terms of performance • Use the right execution engine – Tez is the recommended execution engine for performance – Map reduce is chosen by default in cases where tez can not yet run the query • Use the right configuration flags – Many optimizations are turned on by default – Some are not. Need to tune it for your cluster because a default is hard to come up with.
  • 4. © Hortonworks Inc. 2013.© Hortonworks Inc. 2013. What’s new in Hive 0.13 •Speed – Hive on Tez – Broadcast Joins, Bucket Map Joins – Vectorized Query processing – Split elimination for ORC file – Parquet file format support •Scale – Smaller hash tables allowing more scalable map joins – More scalable dynamic partition loads
  • 5. © Hortonworks Inc. 2013.© Hortonworks Inc. 2013. What’s new in Hive 0.13 • More SQL improvements – SQL standard Authorization – Char support, Decimal improvements – Permanent UDFs – Streaming ingest from Flume for ACID capability • Additional Improvements – Hive Server 2 improvements – HCatalog parity with Hive data types – JDBC improvements viz. job cancel, async execution • Even more goodies – Mavenization – Parallel test framework – Lots of documentation
  • 6. © Hortonworks Inc. 2013.© Hortonworks Inc. 2013. Stinger Project (announced February 2013) Batch AND Interactive SQL-IN- Hadoop Stinger Initiative A broad, community-based effort to drive the next generation of HIVE Hive 0.13, April, 2013 • Hive on Apache Tez • Query Service • Buffer Cache • Cost Based Optimizer (Optiq) • Vectorized Processing Hive 0.11, May 2013: • Base Optimizations • SQL Analytic Functions • ORCFile, Modern File Format Hive 0.12, October 2013: • VARCHAR, DATE Types • ORCFile predicate pushdown • Advanced Optimizations • Performance Boosts via YARN Speed Improve Hive query performance by 100X to allow for interactive query times (seconds) Scale The only SQL interface to Hadoop designed for queries that scale from TB to PB SQL Support broadest range of SQL semantics for analytic applications running against Hadoop …all IN Hadoop Goals:
  • 7. © Hortonworks Inc. 2013.© Hortonworks Inc. 2013. SPEED: Increasing Hive Performance Key Highlights – Tez: New execution engine – Vectorized Query Processing – Startup time improvement – Statistics to accelerate query execution – Cost Based Optimizer: Optiq (missed the cut) Interactive Query Times across ALL use cases • Simple and advanced queries in seconds • Integrates seamlessly with existing tools • Currently a >100x improvement in just nine months Elements of Fast SQL Execution • Query Planner/Cost Based Optimizer w/ Statistics • Query Startup • Query Execution • I/O Path
  • 8. © Hortonworks Inc. 2013.© Hortonworks Inc. 2013. Apache Tez (“Speed”) • Replaces MapReduce as primitive for Pig, Hive, Cascading etc. – Smaller latency for interactive queries – Higher throughput for batch queries – 22 contributors: Hortonworks (13), Facebook, Twitter, Yahoo, Microsoft YARN ApplicationMaster to run DAG of Tez Tasks Task with pluggable Input, Processor and Output Tez Task - <Input, Processor, Output> Task ProcessorInput Output
  • 9. © Hortonworks Inc. 2013.© Hortonworks Inc. 2013. Hive – MR Hive – Tez Hive-on-MR vs. Hive-on-Tez SELECT g1.x, g1.avg, g2.cnt FROM (SELECT a.x, AVERAGE(a.y) AS avg FROM a GROUP BY a.x) g1 JOIN (SELECT b.x, COUNT(b.y) AS avg FROM b GROUP BY b.x) g2 ON (g1.x = g2.x) ORDER BY avg; GROUP a BY a.x JOIN (a,b) GROUP b BY b.x ORDER BY M M M R R M M R M M R M R HDFS HDFS HDFS M M M R R R M M R GROUP BY a.x JOIN (a,b) ORDER BY GROUP BY x Tez avoids unnecessary writes to HDFS HIVE-4660
  • 10. © Hortonworks Inc. 2013.© Hortonworks Inc. 2013. Shuffle Join SELECT ss.ss_item_sk, ss.ss_quantity, inv.inv_quantity_on_hand FROM inventory inv JOIN store_sales ss ON (inv.inv_item_sk = ss.ss_item_sk); Hive – MR Hive – Tez
  • 11. © Hortonworks Inc. 2013.© Hortonworks Inc. 2013. Broadcast Join • Similar to map-join w/o the need to build a hash table on the client • Will work with any level of sub-query nesting • Uses stats to determine if applicable • How it works: – Broadcast result set is computed in parallel on the cluster – Join processor are spun up in parallel – Broadcast set is streamed to join processor – Join processors build hash table – Other relation is joined with hashtable • Tez handles: – Best parallelism – Best data transfer of the hashed relation – Best scheduling to avoid latencies
  • 12. © Hortonworks Inc. 2013.© Hortonworks Inc. 2013. Broadcast Join Hive – MR Hive – Tez M M M M HDFS M MM M M HDFS SELECT ss.ss_item_sk, ss.ss_quantity, inv.inv_quantity_on_hand FROM store_sales ss JOIN inventory inv ON (inv.inv_item_sk = ss.ss_item_sk); HDFS Inventory scan (Runs as single local map task) Store Sales scan and Join (Inventory hash table read as side file) Inventory scan (Runs on cluster potentially more than 1 mapper) Store Sales scan and Join Broadcast edge
  • 13. © Hortonworks Inc. 2013.© Hortonworks Inc. 2013. Dynamically partitioned Hash join • Kicks in when large table is bucketed – Bucketed table – Dynamic as part of query processing – Enabled via set hive.convert.join.bucket.mapjoin.tez = true; (use 0.13.1) • Uses custom edge to match the partitioning on the smaller table • Allows hash-join in cases where broadcast would be too large • Tez gives us the option of building custom edges and vertex managers – Fine grained control over how the data is replicated and partitioned – Scheduling and actual data transfer is handled by Tez
  • 14. © Hortonworks Inc. 2013.© Hortonworks Inc. 2013. Dynamically Partitioned Hash Join SELECT ss.ss_item_sk, ss.ss_quantity, inv.inv_quantity_on_hand FROM store_sales ss JOIN inventory inv ON (inv.inv_item_sk = ss.ss_item_sk); Hive – MR Hive – Tez M MM M M HDFS Inventory scan (Runs on cluster potentially more than 1 mapper) Store Sales scan and Join (Custom vertex reads both inputs – no side file reads) Custom edge (routes outputs of previous stage to the correct Mappers of the next stage)M MM M HDFS Inventory scan (Runs as single local map task) Store Sales scan and Join (Inventory hash table read as side file) HDFS
  • 15. © Hortonworks Inc. 2013.© Hortonworks Inc. 2013. Dynamically Partitioned Hash Join Plans look very similar to map join but the way things work change between MR and Tez. Hive – MR (Bucket map-join) Hive – Tez • Not dynamically partitioned. • Both tables need to be bucketed by the join key. • Local task that generates the hash table writes n files corresponding to n buckets. • Number of mappers for the join must be same as the number of buckets. • Each of these mappers reads the corresponding bucket file of the local task to perform the join. • Only one of the sides needs to be bucketed and the other side is dynamically bucketed. • Also works if neither side is explicitly bucketed, but another operation forced bucketing in the pipeline (traits) • No writing to HDFS. • There can be more mappers than number of buckets but splits do not span multiple buckets. • The dynamically bucketed mappers have as many outputs as number of buckets and a custom tez routing ensures these outputs reach the right mappers.
  • 16. © Hortonworks Inc. 2013.© Hortonworks Inc. 2013. Bulk Inner loop: Vectorization • Avoid Writable objects & use primitive int/long – Allows efficient JIT code for primitive types • Generate per-type loops & avoid runtime type-checks • The classes generated look like – LongColEqualDoubleColumn – LongColEqualLongColumn – LongColEqualLongScalar • Avoid duplicate operations on repeated values – isRepeating & hasNulls
  • 17. © Hortonworks Inc. 2013.© Hortonworks Inc. 2013. ORC: ZeroCopy & caching • Use memory mapped I/O path in HDFS – HDFS in-memory cache • ORC reads can start deserializing early – there is no blocking read() call • Allow OS read-ahead to kick-in • Use buffer-cache pages without copying it • Avoid wasting heap space on ORC stripes • Decompress directly from mapped buffers – Fast JNI code for SNAPPY decompressors
  • 18. © Hortonworks Inc. 2013.© Hortonworks Inc. 2013. Scaling • Reduce size of map join hash tables – Hundred bytes were being used to store an integer (Map join key) – HIVE-6430 reduced sizes of the hash tables by 60-70% in many cases – Allowed more efficient use of memory and hence more tables to fit in • Large number of open record writers in ORC file reduced to just 1 – HIVE-6455 – Now in a multi-insert scenario, performance is much better and many more inserts can be done in parallel
  • 19. © Hortonworks Inc. 2013.© Hortonworks Inc. 2013. TPC-DS 10 TB Query Times (Tuned Queries) Page 19 Data: 10 TB data loaded into ORCFile using defaults. Hardware: 20 nodes: 24 CPU threads, 64 GB RAM, 5 local SATA drives each. Queries: TPC-DS queries with partition filters added.
  • 20. © Hortonworks Inc. 2011 Security Page 20 Architecting the Future of Big Data • Old authorization based on grant/revoke • Incomplete model - eg. Anybody can run grant statement • Does not try following standard • Why follow standard ? • Lot of thought has been put into the standard – important for security! • It’s a standard! • Hive should have built-in authorization • Easy to use, no additional components to manage • New features get added that needs authorization • Life cycle of objects should be synced with authorization policy
  • 21. © Hortonworks Inc. 2011 Managing privileges Page 21 Architecting the Future of Big Data • Grant/revoke privilege on object to/from user/role • SHOW GRANT statement • view privilege grants based on user/role name and/or object name • INSERT, SELECT, DELETE, UPDATE, ALL • Privileges for some actions based on object ownership • Table/view ownership : Most alter commands, drop • Database ownership : create table, drop database • URI privileges based on file permissions • https://cwiki.apache.org/confluence/display/Hive/SQL+Stand ard+based+hive+authorization#SQLStandardBasedHiveAuthor ization-Configuration • Use hive 0.13.1 – fixes the issues listed under known issues in above wiki doc.
  • 22. © Hortonworks Inc. 2013.© Hortonworks Inc. 2013. Hive Server 2 improvements • Hive server 2 now supports thrift over HTTP and kerberos/LDAP authentication on HTTP • Also supports HTTPS • HiveServer2 can keep sessions alive – Between different JDBC queries • New security model helps – All secure queries run as “hive” user • Ideal for short exploratory queries • Uses same JARs (no download for task) • Even better JIT performance on >1 queries
  • 23. © Hortonworks Inc. 2013.© Hortonworks Inc. 2013. Other improvements • Insert-update-delete semantics. Streaming ingest from flume. (HIVE- 5317) – Transaction manager added in. Support for ORC file format only at this time. • Lots of UDF support via permanent functions. No need to have add jar for most commonly used UDFs. Ideally, admin adds the permanent (trusted) functions. • Parquet is a supported storage format. https://cwiki.apache.org/confluence/display/Hive/Parquet • HCatalog now supports all the datatypes supported in Hive. • Hive is now mavenized (Thanks Brock Noland!) • Parallel test framework means Unit testing happens faster and changes get in faster. • Lots of new documentation for all the new features. (Thanks Lefty!) • Bottom line: Hive 0.13 is the fastest, most feature rich version of hive so far.
  • 24. © Hortonworks Inc. 2013.© Hortonworks Inc. 2013. Future Work • Lot more improvements coming up • Speed – Sort Merge Bucket Map join in Tez – Total ordering of data – Skew joins – Cost based Optimizer • Security – Authorizing permanent UDF access – Authorizing ‘show grant’ – Support hdfs ACL in URI permission checks (new in hadoop 2.4) – More SQL syntax support – eg revoke just admin option on a role • Multi-tenancy – Sticky HS2 sessions for improved performance in a multi-tenant environment – Improve scheduling in a multi-tenant environment
  • 25. © Hortonworks Inc. 2013.© Hortonworks Inc. 2013. Questions?

Hinweis der Redaktion

  1. With Hive and Stinger we are focused on enabling the SQL ecosystem and to do that we’ve put Hive on a clear roadmap to SQL compliance. That includes adding critical datatypes like character and date types as well as implementing common SQL semantics seen in most databases.