SlideShare ist ein Scribd-Unternehmen logo
1 von 33
Downloaden Sie, um offline zu lesen
Rapid Data Exploration
    With Hadoop
      Peter Skomoroch
     Senior Data Scientist




      @peteskomoroch
Outline
• Overview: LinkedIn Biz, Tech, & Analytics
• Rapid Data Exploration 101
        - Spatial Analytics Pig Code
        - Trend detection with Pig & Python
        - R Streaming Example
•   Deep Dive: Our Data Analysis Approach
•   Building Data Products
•   LinkedIn Data Insights
Connect the world’s professionals to make
  them more productive and successful
Professional Identity
LinkedIn at a glance
• Founded in 2003
• #17 site in the US (Alexa)
• 60+ million members
• First million members = 477 days
• Latest million = 9 days
• 500K+ company profiles
• 12+ million small business professionals
• In 2009 - 1billion people searches
• Average age: 41
• Household income $107,000
• 42% are “decision makers”
How International?
• More than 50% international
  (members in over 200 countries & territories)
• 13+ million in Europe
• 4+ million in India
• 3+ million in UK
• #13 site in UK (Alexa)
How do we keep the lights on?
• Profitable since 2007
• Valued at over $1B at the last funding round
• Subscriptions
• Ads
• Job Postings
• Enterprise Client
Hadoop on LinkedIn
1,400+ members list “Hadoop” on their profile
What other skills do they have?
•HBase, Lucene, Solr, MapReduce, Nutch...
Where are they?        Who do they work for?
 • 36% in Bay Area      • 11% Yahoo!
 • 8% in India          • 2% Apache Software Foundation
 • 6% in NYC            • 1% LinkedIn
 • 4% in Seattle        • 1% Google
 • 4% in Los Angeles    • 1% Facebook
Hadoop at LinkedIn
Voldemort Data Storage
Compact, compressed, binary data (something like Avro)
 Type can be any combination of int, double, float, String,
Map, List, etc. => Sequence Files
 Example member definition:
  {

 ‘member_id’: ‘int32’,
     ‘first_name': 'string',
     ’last_name': ’string’,
     ‘age’    : ‘int32’
      …
    }
Getting Data In
•From Databases (user data, news, jobs etc.)
  • Need a way to get data reliably periodically
  • Need tests to verify data
  • Support for incremental replication
  • Solution: Transmogrify Driver Program
    • InputReader: JDBCReader, CSV Reader
    • Output Writer: JDBCWriter, HDFS writers
• From web logs (page views, search, clicks etc)
  • Weblogs files are rsynced and loaded up in HDFS
  • Hadoop jobs for date cleaning and transformation.
Getting Data Out
Giving Back: Open Source
http://sna-projects.com/sna/
Analytics Technologies
We Build Things With Data

           Give smart people great tools,
           enable them to solve problems
Prototyping Culture
How does Hadoop
 enable rapid data
   exploration?
Pig for Spatial Analytics
US County HeatMap
Pig for Trend Detection
Python Streaming Script
Sort Output & Display
R Streaming Also Easy




*from http://www.stat.uiowa.edu/~luke/classes/295-hpc/
Let’s Talk Data
Business is recognizing the importance of analytics
What data do we start with?
We can also leverage...
• Connection Graph          • Company Pages
• Recommendations           • Talent Match
• Address Book Uploads      • Web Referrals
• Search Logs               • 1M+ Twitter Accounts
• Profile Views & Activity   • Wikipedia Data
• Job Postings              • Mechanical Turk
• LinkedIn Groups           • Census, BLS, & Data.gov
• LinkedIn Questions        • Much more...
How do we think of Analytics?




      Data Jujitsu
Lots of Medium can be
more powerful than Big


             >
Reconstruct Reality
        from Data Exhaust
Data Scientist Lessons
• Follow the data, avoid assumptions
• Sanity check the extremes (0, infinity)
• Don’t get mired in rare edge cases
• Data Jujitsu: solve easier auxiliary problems
• Build smaller consistent samples to test code
• Establish a baseline model quickly, iterate often
• Use the right tool for the job at hand
• Iterate quickly with high level languages
Where did the bankers go?
We’re Hiring!
http://sna-projects.com/sna/
pskomoro@linkedin.com
@peteskomoroch

Weitere ähnliche Inhalte

Was ist angesagt?

Getting started in data science (4:3)
Getting started in data science (4:3)Getting started in data science (4:3)
Getting started in data science (4:3)Thinkful
 
Getting started in data science (4:3)
Getting started in data science (4:3)Getting started in data science (4:3)
Getting started in data science (4:3)Thinkful
 
Getting started in Data Science (April 2017, Los Angeles)
Getting started in Data Science (April 2017, Los Angeles)Getting started in Data Science (April 2017, Los Angeles)
Getting started in Data Science (April 2017, Los Angeles)Thinkful
 
Data Science Overview
Data Science OverviewData Science Overview
Data Science OverviewDavide Mauri
 
Demystifying Data Science with an introduction to Machine Learning
Demystifying Data Science with an introduction to Machine LearningDemystifying Data Science with an introduction to Machine Learning
Demystifying Data Science with an introduction to Machine LearningJulian Bright
 
Data Culture Series - Keynote & Panel - Birmingham - 8th April 2015
Data Culture Series  - Keynote & Panel - Birmingham - 8th April 2015Data Culture Series  - Keynote & Panel - Birmingham - 8th April 2015
Data Culture Series - Keynote & Panel - Birmingham - 8th April 2015Jonathan Woodward
 
Building Satori: Web Data Extraction On Hadoop
Building Satori: Web Data Extraction On HadoopBuilding Satori: Web Data Extraction On Hadoop
Building Satori: Web Data Extraction On HadoopNikolai Avteniev
 
What Is GDS and Neo4j’s GDS Library
What Is GDS and Neo4j’s GDS LibraryWhat Is GDS and Neo4j’s GDS Library
What Is GDS and Neo4j’s GDS LibraryNeo4j
 
David golynskiy resume it5
David golynskiy resume it5 David golynskiy resume it5
David golynskiy resume it5 David Golynskiy
 
Introduction to Graph databases and Neo4j (by Stefan Armbruster)
Introduction to Graph databases and Neo4j (by Stefan Armbruster)Introduction to Graph databases and Neo4j (by Stefan Armbruster)
Introduction to Graph databases and Neo4j (by Stefan Armbruster)barcelonajug
 
Semantically Enabled Personal Information Management with Cluug.com
Semantically Enabled Personal Information Management with Cluug.comSemantically Enabled Personal Information Management with Cluug.com
Semantically Enabled Personal Information Management with Cluug.comBernhard Schandl
 
Graph Databases - Where Do We Do the Modeling Part?
Graph Databases - Where Do We Do the Modeling Part?Graph Databases - Where Do We Do the Modeling Part?
Graph Databases - Where Do We Do the Modeling Part?DATAVERSITY
 
Data Science: Harnessing Open Data for High Impact Solutions
Data Science: Harnessing Open Data for High Impact SolutionsData Science: Harnessing Open Data for High Impact Solutions
Data Science: Harnessing Open Data for High Impact SolutionsMohd Izhar Firdaus Ismail
 
AI in the Intelligent Workplace
AI in the Intelligent WorkplaceAI in the Intelligent Workplace
AI in the Intelligent WorkplaceSharon O'Dea
 
Personalized News and Video Recomendation System at LinkSure
Personalized News and Video Recomendation System at LinkSurePersonalized News and Video Recomendation System at LinkSure
Personalized News and Video Recomendation System at LinkSureLeanne Hwee
 
Smartlogic, Semaphore and Semantically Enhanced Search – For “Discovery”
Smartlogic, Semaphore and Semantically Enhanced Search –  For “Discovery”Smartlogic, Semaphore and Semantically Enhanced Search –  For “Discovery”
Smartlogic, Semaphore and Semantically Enhanced Search – For “Discovery”VOGIN-academie
 
Big Data Content Organization, Discovery, and Management
Big Data Content Organization, Discovery, and ManagementBig Data Content Organization, Discovery, and Management
Big Data Content Organization, Discovery, and ManagementAccess Innovations, Inc.
 
Graph databases and the #panamapapers
Graph databases and the #panamapapersGraph databases and the #panamapapers
Graph databases and the #panamapapersdarthvader42
 

Was ist angesagt? (20)

Getting started in data science (4:3)
Getting started in data science (4:3)Getting started in data science (4:3)
Getting started in data science (4:3)
 
Getting started in data science (4:3)
Getting started in data science (4:3)Getting started in data science (4:3)
Getting started in data science (4:3)
 
Getting started in Data Science (April 2017, Los Angeles)
Getting started in Data Science (April 2017, Los Angeles)Getting started in Data Science (April 2017, Los Angeles)
Getting started in Data Science (April 2017, Los Angeles)
 
Data Science Overview
Data Science OverviewData Science Overview
Data Science Overview
 
Demystifying Data Science with an introduction to Machine Learning
Demystifying Data Science with an introduction to Machine LearningDemystifying Data Science with an introduction to Machine Learning
Demystifying Data Science with an introduction to Machine Learning
 
Data Culture Series - Keynote & Panel - Birmingham - 8th April 2015
Data Culture Series  - Keynote & Panel - Birmingham - 8th April 2015Data Culture Series  - Keynote & Panel - Birmingham - 8th April 2015
Data Culture Series - Keynote & Panel - Birmingham - 8th April 2015
 
Building Satori: Web Data Extraction On Hadoop
Building Satori: Web Data Extraction On HadoopBuilding Satori: Web Data Extraction On Hadoop
Building Satori: Web Data Extraction On Hadoop
 
What Is GDS and Neo4j’s GDS Library
What Is GDS and Neo4j’s GDS LibraryWhat Is GDS and Neo4j’s GDS Library
What Is GDS and Neo4j’s GDS Library
 
David golynskiy resume it5
David golynskiy resume it5 David golynskiy resume it5
David golynskiy resume it5
 
Introduction to Graph databases and Neo4j (by Stefan Armbruster)
Introduction to Graph databases and Neo4j (by Stefan Armbruster)Introduction to Graph databases and Neo4j (by Stefan Armbruster)
Introduction to Graph databases and Neo4j (by Stefan Armbruster)
 
Semantically Enabled Personal Information Management with Cluug.com
Semantically Enabled Personal Information Management with Cluug.comSemantically Enabled Personal Information Management with Cluug.com
Semantically Enabled Personal Information Management with Cluug.com
 
Graph Databases - Where Do We Do the Modeling Part?
Graph Databases - Where Do We Do the Modeling Part?Graph Databases - Where Do We Do the Modeling Part?
Graph Databases - Where Do We Do the Modeling Part?
 
Data Science: Harnessing Open Data for High Impact Solutions
Data Science: Harnessing Open Data for High Impact SolutionsData Science: Harnessing Open Data for High Impact Solutions
Data Science: Harnessing Open Data for High Impact Solutions
 
AI in the Intelligent Workplace
AI in the Intelligent WorkplaceAI in the Intelligent Workplace
AI in the Intelligent Workplace
 
Kurukshetra - Big Data
Kurukshetra - Big DataKurukshetra - Big Data
Kurukshetra - Big Data
 
Personalized News and Video Recomendation System at LinkSure
Personalized News and Video Recomendation System at LinkSurePersonalized News and Video Recomendation System at LinkSure
Personalized News and Video Recomendation System at LinkSure
 
Paving The Way To Data Driven
Paving The Way To Data DrivenPaving The Way To Data Driven
Paving The Way To Data Driven
 
Smartlogic, Semaphore and Semantically Enhanced Search – For “Discovery”
Smartlogic, Semaphore and Semantically Enhanced Search –  For “Discovery”Smartlogic, Semaphore and Semantically Enhanced Search –  For “Discovery”
Smartlogic, Semaphore and Semantically Enhanced Search – For “Discovery”
 
Big Data Content Organization, Discovery, and Management
Big Data Content Organization, Discovery, and ManagementBig Data Content Organization, Discovery, and Management
Big Data Content Organization, Discovery, and Management
 
Graph databases and the #panamapapers
Graph databases and the #panamapapersGraph databases and the #panamapapers
Graph databases and the #panamapapers
 

Ähnlich wie Rapid Data Exploration With Hadoop

Building Competitive Moats With Data
Building Competitive Moats With DataBuilding Competitive Moats With Data
Building Competitive Moats With DataPeter Skomoroch
 
Thinkful DC - Intro to Data Science
Thinkful DC - Intro to Data Science Thinkful DC - Intro to Data Science
Thinkful DC - Intro to Data Science TJ Stalcup
 
Data Science-Why?What?How? By Hari Prasad
Data Science-Why?What?How? By Hari PrasadData Science-Why?What?How? By Hari Prasad
Data Science-Why?What?How? By Hari PrasadHari Prasad
 
SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018 SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018 CareerBuilder.com
 
Data Foundation for Analytics Excellence by Tanimura, cathy from Okta
Data Foundation for Analytics Excellence by Tanimura, cathy from OktaData Foundation for Analytics Excellence by Tanimura, cathy from Okta
Data Foundation for Analytics Excellence by Tanimura, cathy from OktaTin Ho
 
2017 06-14-getting started with data science
2017 06-14-getting started with data science2017 06-14-getting started with data science
2017 06-14-getting started with data scienceThinkful
 
Intro to Data Science
Intro to Data ScienceIntro to Data Science
Intro to Data ScienceTJ Stalcup
 
How Oracle Uses CrowdFlower For Sentiment Analysis
How Oracle Uses CrowdFlower For Sentiment AnalysisHow Oracle Uses CrowdFlower For Sentiment Analysis
How Oracle Uses CrowdFlower For Sentiment AnalysisCrowdFlower
 
Architecting for Big Data: Trends, Tips, and Deployment Options
Architecting for Big Data: Trends, Tips, and Deployment OptionsArchitecting for Big Data: Trends, Tips, and Deployment Options
Architecting for Big Data: Trends, Tips, and Deployment OptionsCaserta
 
Frank Bien Opening Keynote - Join 2016
Frank Bien Opening Keynote - Join 2016Frank Bien Opening Keynote - Join 2016
Frank Bien Opening Keynote - Join 2016Looker
 
Frank Bien Opening Keynote - Join 2016
Frank Bien Opening Keynote - Join 2016Frank Bien Opening Keynote - Join 2016
Frank Bien Opening Keynote - Join 2016Looker
 
Big Data Landscape 2018
Big Data Landscape 2018Big Data Landscape 2018
Big Data Landscape 2018Leanne Hwee
 
Data Science at LinkedIn - Data-Driven Products & Insights
Data Science at LinkedIn - Data-Driven Products & InsightsData Science at LinkedIn - Data-Driven Products & Insights
Data Science at LinkedIn - Data-Driven Products & InsightsYael Garten
 
What Managers Need to Know about Data Science
What Managers Need to Know about Data ScienceWhat Managers Need to Know about Data Science
What Managers Need to Know about Data ScienceAnnie Flippo
 
Neo4j GraphTalk Düsseldorf - How Graphs revolutionise Identity & Access Manag...
Neo4j GraphTalk Düsseldorf - How Graphs revolutionise Identity & Access Manag...Neo4j GraphTalk Düsseldorf - How Graphs revolutionise Identity & Access Manag...
Neo4j GraphTalk Düsseldorf - How Graphs revolutionise Identity & Access Manag...Neo4j
 

Ähnlich wie Rapid Data Exploration With Hadoop (20)

Building Competitive Moats With Data
Building Competitive Moats With DataBuilding Competitive Moats With Data
Building Competitive Moats With Data
 
Thinkful DC - Intro to Data Science
Thinkful DC - Intro to Data Science Thinkful DC - Intro to Data Science
Thinkful DC - Intro to Data Science
 
Semantics and Machine Learning
Semantics and Machine LearningSemantics and Machine Learning
Semantics and Machine Learning
 
Data Science-Why?What?How? By Hari Prasad
Data Science-Why?What?How? By Hari PrasadData Science-Why?What?How? By Hari Prasad
Data Science-Why?What?How? By Hari Prasad
 
Hadoop and SAP BI
Hadoop and SAP BI   Hadoop and SAP BI
Hadoop and SAP BI
 
SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018 SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018
 
Data Foundation for Analytics Excellence by Tanimura, cathy from Okta
Data Foundation for Analytics Excellence by Tanimura, cathy from OktaData Foundation for Analytics Excellence by Tanimura, cathy from Okta
Data Foundation for Analytics Excellence by Tanimura, cathy from Okta
 
2017 06-14-getting started with data science
2017 06-14-getting started with data science2017 06-14-getting started with data science
2017 06-14-getting started with data science
 
Intro to Data Science
Intro to Data ScienceIntro to Data Science
Intro to Data Science
 
How Oracle Uses CrowdFlower For Sentiment Analysis
How Oracle Uses CrowdFlower For Sentiment AnalysisHow Oracle Uses CrowdFlower For Sentiment Analysis
How Oracle Uses CrowdFlower For Sentiment Analysis
 
Architecting for Big Data: Trends, Tips, and Deployment Options
Architecting for Big Data: Trends, Tips, and Deployment OptionsArchitecting for Big Data: Trends, Tips, and Deployment Options
Architecting for Big Data: Trends, Tips, and Deployment Options
 
Ds01 data science
Ds01   data scienceDs01   data science
Ds01 data science
 
Big databigideasit4bc
Big databigideasit4bcBig databigideasit4bc
Big databigideasit4bc
 
Frank Bien Opening Keynote - Join 2016
Frank Bien Opening Keynote - Join 2016Frank Bien Opening Keynote - Join 2016
Frank Bien Opening Keynote - Join 2016
 
Frank Bien Opening Keynote - Join 2016
Frank Bien Opening Keynote - Join 2016Frank Bien Opening Keynote - Join 2016
Frank Bien Opening Keynote - Join 2016
 
Big Data Landscape 2018
Big Data Landscape 2018Big Data Landscape 2018
Big Data Landscape 2018
 
Data Science at LinkedIn - Data-Driven Products & Insights
Data Science at LinkedIn - Data-Driven Products & InsightsData Science at LinkedIn - Data-Driven Products & Insights
Data Science at LinkedIn - Data-Driven Products & Insights
 
Big Data for HR
Big Data for HRBig Data for HR
Big Data for HR
 
What Managers Need to Know about Data Science
What Managers Need to Know about Data ScienceWhat Managers Need to Know about Data Science
What Managers Need to Know about Data Science
 
Neo4j GraphTalk Düsseldorf - How Graphs revolutionise Identity & Access Manag...
Neo4j GraphTalk Düsseldorf - How Graphs revolutionise Identity & Access Manag...Neo4j GraphTalk Düsseldorf - How Graphs revolutionise Identity & Access Manag...
Neo4j GraphTalk Düsseldorf - How Graphs revolutionise Identity & Access Manag...
 

Mehr von Peter Skomoroch

Bridging the AI Gap: Building Stakeholder Support
Bridging the AI Gap: Building Stakeholder SupportBridging the AI Gap: Building Stakeholder Support
Bridging the AI Gap: Building Stakeholder SupportPeter Skomoroch
 
Managing Machines: The New AI Dev Stack
Managing Machines: The New AI Dev StackManaging Machines: The New AI Dev Stack
Managing Machines: The New AI Dev StackPeter Skomoroch
 
Product Management for AI
Product Management for AIProduct Management for AI
Product Management for AIPeter Skomoroch
 
Executive Briefing: Why managing machines is harder than you think
Executive Briefing: Why managing machines is harder than you thinkExecutive Briefing: Why managing machines is harder than you think
Executive Briefing: Why managing machines is harder than you thinkPeter Skomoroch
 
SF Data Science: Developing Data Products
SF Data Science: Developing Data ProductsSF Data Science: Developing Data Products
SF Data Science: Developing Data ProductsPeter Skomoroch
 
Skills, Reputation, and Search
Skills, Reputation, and SearchSkills, Reputation, and Search
Skills, Reputation, and SearchPeter Skomoroch
 
LinkedIn Endorsements: Reputation, Virality, and Social Tagging
LinkedIn Endorsements: Reputation, Virality, and Social TaggingLinkedIn Endorsements: Reputation, Virality, and Social Tagging
LinkedIn Endorsements: Reputation, Virality, and Social TaggingPeter Skomoroch
 
Developing Data Products
Developing Data ProductsDeveloping Data Products
Developing Data ProductsPeter Skomoroch
 
Practical Problem Solving with Data - Onlab Data Conference, Tokyo
Practical Problem Solving with Data - Onlab Data Conference, TokyoPractical Problem Solving with Data - Onlab Data Conference, Tokyo
Practical Problem Solving with Data - Onlab Data Conference, TokyoPeter Skomoroch
 
Street Fighting Data Science
Street Fighting Data ScienceStreet Fighting Data Science
Street Fighting Data SciencePeter Skomoroch
 
Data Mashups -Data Science Summit
Data Mashups -Data Science SummitData Mashups -Data Science Summit
Data Mashups -Data Science SummitPeter Skomoroch
 
Geo Analytics Tutorial - Where 2.0 2011
Geo Analytics Tutorial - Where 2.0 2011Geo Analytics Tutorial - Where 2.0 2011
Geo Analytics Tutorial - Where 2.0 2011Peter Skomoroch
 
Prototyping Data Intensive Apps: TrendingTopics.org
Prototyping Data Intensive Apps: TrendingTopics.orgPrototyping Data Intensive Apps: TrendingTopics.org
Prototyping Data Intensive Apps: TrendingTopics.orgPeter Skomoroch
 

Mehr von Peter Skomoroch (14)

Bridging the AI Gap: Building Stakeholder Support
Bridging the AI Gap: Building Stakeholder SupportBridging the AI Gap: Building Stakeholder Support
Bridging the AI Gap: Building Stakeholder Support
 
Managing Machines: The New AI Dev Stack
Managing Machines: The New AI Dev StackManaging Machines: The New AI Dev Stack
Managing Machines: The New AI Dev Stack
 
Product Management for AI
Product Management for AIProduct Management for AI
Product Management for AI
 
Executive Briefing: Why managing machines is harder than you think
Executive Briefing: Why managing machines is harder than you thinkExecutive Briefing: Why managing machines is harder than you think
Executive Briefing: Why managing machines is harder than you think
 
SF Data Science: Developing Data Products
SF Data Science: Developing Data ProductsSF Data Science: Developing Data Products
SF Data Science: Developing Data Products
 
Skills, Reputation, and Search
Skills, Reputation, and SearchSkills, Reputation, and Search
Skills, Reputation, and Search
 
LinkedIn Endorsements: Reputation, Virality, and Social Tagging
LinkedIn Endorsements: Reputation, Virality, and Social TaggingLinkedIn Endorsements: Reputation, Virality, and Social Tagging
LinkedIn Endorsements: Reputation, Virality, and Social Tagging
 
Developing Data Products
Developing Data ProductsDeveloping Data Products
Developing Data Products
 
Practical Problem Solving with Data - Onlab Data Conference, Tokyo
Practical Problem Solving with Data - Onlab Data Conference, TokyoPractical Problem Solving with Data - Onlab Data Conference, Tokyo
Practical Problem Solving with Data - Onlab Data Conference, Tokyo
 
Street Fighting Data Science
Street Fighting Data ScienceStreet Fighting Data Science
Street Fighting Data Science
 
Data Mashups -Data Science Summit
Data Mashups -Data Science SummitData Mashups -Data Science Summit
Data Mashups -Data Science Summit
 
Geo Analytics Tutorial - Where 2.0 2011
Geo Analytics Tutorial - Where 2.0 2011Geo Analytics Tutorial - Where 2.0 2011
Geo Analytics Tutorial - Where 2.0 2011
 
Prototyping Data Intensive Apps: TrendingTopics.org
Prototyping Data Intensive Apps: TrendingTopics.orgPrototyping Data Intensive Apps: TrendingTopics.org
Prototyping Data Intensive Apps: TrendingTopics.org
 
Elasticwulf Pycon Talk
Elasticwulf Pycon TalkElasticwulf Pycon Talk
Elasticwulf Pycon Talk
 

Kürzlich hochgeladen

Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 

Kürzlich hochgeladen (20)

Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 

Rapid Data Exploration With Hadoop

  • 1. Rapid Data Exploration With Hadoop Peter Skomoroch Senior Data Scientist @peteskomoroch
  • 2. Outline • Overview: LinkedIn Biz, Tech, & Analytics • Rapid Data Exploration 101 - Spatial Analytics Pig Code - Trend detection with Pig & Python - R Streaming Example • Deep Dive: Our Data Analysis Approach • Building Data Products • LinkedIn Data Insights
  • 3. Connect the world’s professionals to make them more productive and successful
  • 5. LinkedIn at a glance • Founded in 2003 • #17 site in the US (Alexa) • 60+ million members • First million members = 477 days • Latest million = 9 days • 500K+ company profiles • 12+ million small business professionals • In 2009 - 1billion people searches • Average age: 41 • Household income $107,000 • 42% are “decision makers”
  • 6. How International? • More than 50% international (members in over 200 countries & territories) • 13+ million in Europe • 4+ million in India • 3+ million in UK • #13 site in UK (Alexa)
  • 7. How do we keep the lights on? • Profitable since 2007 • Valued at over $1B at the last funding round • Subscriptions • Ads • Job Postings • Enterprise Client
  • 8. Hadoop on LinkedIn 1,400+ members list “Hadoop” on their profile What other skills do they have? •HBase, Lucene, Solr, MapReduce, Nutch... Where are they? Who do they work for? • 36% in Bay Area • 11% Yahoo! • 8% in India • 2% Apache Software Foundation • 6% in NYC • 1% LinkedIn • 4% in Seattle • 1% Google • 4% in Los Angeles • 1% Facebook
  • 10. Voldemort Data Storage Compact, compressed, binary data (something like Avro) Type can be any combination of int, double, float, String, Map, List, etc. => Sequence Files Example member definition: { ‘member_id’: ‘int32’, ‘first_name': 'string', ’last_name': ’string’, ‘age’ : ‘int32’ … }
  • 11. Getting Data In •From Databases (user data, news, jobs etc.) • Need a way to get data reliably periodically • Need tests to verify data • Support for incremental replication • Solution: Transmogrify Driver Program • InputReader: JDBCReader, CSV Reader • Output Writer: JDBCWriter, HDFS writers • From web logs (page views, search, clicks etc) • Weblogs files are rsynced and loaded up in HDFS • Hadoop jobs for date cleaning and transformation.
  • 13. Giving Back: Open Source http://sna-projects.com/sna/
  • 15. We Build Things With Data Give smart people great tools, enable them to solve problems
  • 17. How does Hadoop enable rapid data exploration?
  • 18. Pig for Spatial Analytics
  • 20. Pig for Trend Detection
  • 22. Sort Output & Display
  • 23. R Streaming Also Easy *from http://www.stat.uiowa.edu/~luke/classes/295-hpc/
  • 25. Business is recognizing the importance of analytics
  • 26. What data do we start with?
  • 27. We can also leverage... • Connection Graph • Company Pages • Recommendations • Talent Match • Address Book Uploads • Web Referrals • Search Logs • 1M+ Twitter Accounts • Profile Views & Activity • Wikipedia Data • Job Postings • Mechanical Turk • LinkedIn Groups • Census, BLS, & Data.gov • LinkedIn Questions • Much more...
  • 28. How do we think of Analytics? Data Jujitsu
  • 29. Lots of Medium can be more powerful than Big >
  • 30. Reconstruct Reality from Data Exhaust
  • 31. Data Scientist Lessons • Follow the data, avoid assumptions • Sanity check the extremes (0, infinity) • Don’t get mired in rare edge cases • Data Jujitsu: solve easier auxiliary problems • Build smaller consistent samples to test code • Establish a baseline model quickly, iterate often • Use the right tool for the job at hand • Iterate quickly with high level languages
  • 32. Where did the bankers go?