Impala - Turbocharge Your Big Data Access
Ophir Cohen,
Data Platform Group Leader,
ophirc@liveperson.com
Jan 2014
Connection Before Content
--> What is my age?
--> How many children do I have?
--> What is my favorite sport?
Also:
• Past: Co-Founder at Collarity, a user-to-content matching
and relevancy engine
• A Big Data expert
• Technology enthusiast with a preference for open source
LivePerson Is...
8,500 customers
Creating Meaningful
Customer Connections
SaaS pioneer since 1998
Mission
Customers
Technology
Volumes
13 TB of data per month
20M engagements per month
1.8B visits per month
Data challenges @ LP
1. ~ 13TB of data each month
2. > 1PB Hadoop cluster
3. A few clusters across the globe
4. ~ 15,000 MR jobs daily on our main cluster
5. Various heterogeneous users (R&D projects, PS, analysts, scientists
and more…)
Data accessing challenges @ LP
1. Querying one month of data can take a few hours (or days!)
2. Complex data model
3. PS and analysts do not know Java (or Scala ;) )
Hadoop Recap
1. Created by Doug Cutting (a Yahoo employee back then) around 2005
2. HDFS - distributed, scalable, and portable file system
3. MapReduce - framework for easily writing applications which process
vast amounts of data (multi-terabyte data-sets) in-parallel on large
clusters (thousands of nodes) in a reliable, fault-tolerant manner.
4. The leading big data solution
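The MapReduce model described above can be sketched in a few lines of plain Python. This is only an in-memory illustration of the map, shuffle, and reduce steps on a single machine, not Hadoop's Java API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, which the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one key."""
    return key, sum(values)


def word_count(lines):
    """Run the three steps in sequence over an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

On a real cluster the mappers and reducers run in parallel on different nodes and the shuffle moves data over the network; here everything runs sequentially so the data flow is easy to follow.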
Accessing Hadoop
Stone Age
Java Map/Reduce
Java Map/Reduce
Pros
➢ It has been there from the beginning
➢ Reliable
➢ Flexible
➢ Easy for Java developers
Cons
➢ You need to know Java
➢ You need to write in Java ;)
➢ Exhausting development cycle
Bronze Age
Hive
Hive
Pros
➢ Common SQL-like language (HQL)
➢ Running on the cluster - great for trial-and-error method
Cons
➢ Declarative language (and thus limited)
➢ Each query takes time
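To illustrate why a SQL-like language lowers the barrier compared with hand-written Java jobs, here is a minimal sketch of a declarative aggregation. It uses Python's built-in `sqlite3` module purely as a local stand-in for Hive, and the `visits` table, its columns, and the `visits_per_site` helper are all hypothetical. A Hive query would have roughly this SELECT / GROUP BY shape, but it would run as jobs on the cluster.

```python
import sqlite3


def visits_per_site(rows):
    """Count visits per site with one declarative query.

    rows: iterable of (site, visitor) tuples (hypothetical schema).
    sqlite3 is only a local stand-in for Hive in this sketch.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE visits (site TEXT, visitor TEXT)")
    conn.executemany("INSERT INTO visits VALUES (?, ?)", rows)
    query = """
        SELECT site, COUNT(*) AS visit_count
        FROM visits
        GROUP BY site
        ORDER BY visit_count DESC, site
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result
```

The user states what result is wanted and the engine decides how to compute it, which is both the strength (no code to write) and the limitation (only what the language can express) listed above.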
Iron Age
Impala
Impala
1. Cloudera initiative
2. As Cloudera says: “Real-Time Queries in Apache Hadoop,
For Real”
3. scalable parallel database technology
4. Impala is integrated with Hadoop to use the same file and
data formats, metadata, security and resource management
frameworks used by MapReduce, Apache Hive, Apache Pig
and other Hadoop software
Impala
Impala
✓ Uses Hive interface (HQL)
➢ No new education needed
➢ Little or no rewriting of Hive queries needed
✓ 4 to 30x faster than Hive
➢ Trial and error works!
✓ Bypass map/reduce
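As a rough worked example of what the quoted 4 to 30x range means for trial-and-error work, the sketch below converts a Hive runtime into the corresponding Impala range. The 60-minute input in the usage note is a made-up illustration, not a measurement, and the helper name is hypothetical.

```python
def impala_runtime_range(hive_minutes, min_speedup=4, max_speedup=30):
    """Return (best_case, worst_case) runtime in minutes.

    Assumes the 4x to 30x speedup range quoted in the slides applies
    uniformly, which real workloads will not guarantee.
    """
    best_case = hive_minutes / max_speedup
    worst_case = hive_minutes / min_speedup
    return best_case, worst_case
```

Under these assumptions, a query that takes 60 minutes in Hive would take between 2 and 15 minutes, which is the difference between one attempt per hour and several.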
Modern History
What next?
What next???
➢ RDBMS-like capabilities on top of Hadoop
■ ACID
➢ Faster and faster access
➢ Security
➢ Security
➢ Security
➢ Data serialization solutions
THANK YOU!
We are Hiring
Ophir Cohen,
ophchu@gmail.com
@ophchu
An extended version of this presentation will be given soon.
Look for it in the ILTechTalks group on Meetup:
http://www.meetup.com/ILTechTalks/

Editor's Notes

  1. At LP we say ‘Connection before content’. Enthusiastic about new technologies, with a preference for open source.
  2. A couple of facts about LP: it has been around since '98 and doing SaaS since '98 (before somebody invented SaaS…). 8 of the top 10 Fortune companies are using LP. And a LOT of data.
  3. Complex data model. It takes a few rounds of trial and error to find what you need. Crossing data is hard.
  4. Great for data scientists and PS. BUT! Also great for me if I just want to check something: no need to write any code. Two main problems: 1. Declarative language (and thus limited) 2. Even the trial-and-error queries take ages to execute.
  5. MPP (Massively Parallel Processing)
  6. What do you think are the next steps?