Rapid Data Exploration With Hadoop


LinkedIn is the premier professional social network, with over 60 million users and a new user joining every second. One of LinkedIn's strategic advantages is its unique data. While most organizations consider data a service function, LinkedIn considers data a cornerstone of its product portfolio. To develop these products rapidly, LinkedIn leverages a number of technologies, including open source, third-party solutions, and some we've had to invent along the way. This LinkedIn talk at the NYC Hadoop Meetup, held 3/18 at ContextWeb, focused on best practices for quickly uncovering patterns, visualizing trends, and generating actionable insights from large datasets.


Rapid Data Exploration
With Hadoop
Peter Skomoroch
Senior Data Scientist

@peteskomoroch

Outline
• Overview: LinkedIn Biz, Tech, & Analytics
• Rapid Data Exploration 101
- Spatial Analytics Pig Code
- Trend detection with Pig & Python
- R Streaming Example
• Deep Dive: Our Data Analysis Approach
• Building Data Products
• LinkedIn Data Insights

Connect the world’s professionals to make
them more productive and successful

LinkedIn at a glance
• Founded in 2003
• #17 site in the US (Alexa)
• 60+ million members
• First million members = 477 days
• Latest million = 9 days
• 500K+ company profiles
• 12+ million small business professionals
• 1 billion people searches in 2009
• Average age: 41
• Household income $107,000
• 42% are “decision makers”

How International?
• More than 50% international
(members in over 200 countries & territories)
• 13+ million in Europe
• 4+ million in India
• 3+ million in UK
• #13 site in UK (Alexa)

How do we keep the lights on?
• Profitable since 2007
• Valued at over $1B at the last funding round
• Subscriptions
• Ads
• Job Postings
• Enterprise Client

Hadoop on LinkedIn
1,400+ members list “Hadoop” on their profile
What other skills do they have?
• HBase, Lucene, Solr, MapReduce, Nutch...
Where are they?
• 36% in Bay Area
• 8% in India
• 6% in NYC
• 4% in Seattle
• 4% in Los Angeles
Who do they work for?
• 11% Yahoo!
• 2% Apache Software Foundation
• 1% LinkedIn
• 1% Google
• 1% Facebook

Voldemort Data Storage
• Compact, compressed, binary data (something like Avro)
• Type can be any combination of int, double, float, String, Map, List, etc.
• => Sequence Files
• Example member definition:
{ 'member_id': 'int32', 'first_name': 'string', 'last_name': 'string', 'age': 'int32', … }
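
A type definition like the member example can be used to check records before they are written. The following is a minimal Python sketch, not the actual Voldemort serialization code: `MEMBER_SCHEMA`, `conforms`, and `validate` are hypothetical names, only the four fields shown on the slide are included (the slide elides the rest), and only the `int32` and `string` types are handled.

```python
# Hypothetical schema: only the four fields visible on the slide.
MEMBER_SCHEMA = {
    "member_id": "int32",
    "first_name": "string",
    "last_name": "string",
    "age": "int32",
}

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def conforms(value, type_name):
    """Return True if value fits the named type (assumed semantics)."""
    if type_name == "int32":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return (
            isinstance(value, int)
            and not isinstance(value, bool)
            and INT32_MIN <= value <= INT32_MAX
        )
    if type_name == "string":
        return isinstance(value, str)
    raise ValueError(f"unsupported type: {type_name}")


def validate(record, schema):
    """Return a list of problems; an empty list means the record is valid."""
    problems = []
    for field, type_name in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not conforms(record[field], type_name):
            problems.append(f"bad type for {field}: expected {type_name}")
    for field in record:
        if field not in schema:
            problems.append(f"unexpected field: {field}")
    return problems
```

Calling `validate(record, MEMBER_SCHEMA)` on each record before a load is one cheap way to catch type drift early.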

Getting Data In
• From databases (user data, news, jobs, etc.)
- Need a way to get data reliably and periodically
- Need tests to verify data
- Support for incremental replication
- Solution: Transmogrify Driver Program
- InputReader: JDBCReader, CSV Reader
- Output Writer: JDBCWriter, HDFS writers
• From web logs (page views, search, clicks, etc.)
- Weblog files are rsynced and loaded into HDFS
- Hadoop jobs for data cleaning and transformation
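
The reader/writer split described above can be sketched in a few lines of Python. This is an illustration of the pattern only, not the Transmogrify API: `CsvReader`, `ListWriter`, and `transfer` are hypothetical names, the in-memory writer stands in for an HDFS writer, and incremental replication is approximated by skipping rows whose integer key is not above a previously seen high-water mark.

```python
import csv
import io


class CsvReader:
    """Hypothetical input reader: yields one dict per CSV row."""

    def __init__(self, text):
        self._text = text

    def read(self):
        yield from csv.DictReader(io.StringIO(self._text))


class ListWriter:
    """Hypothetical output writer that collects rows in memory."""

    def __init__(self):
        self.rows = []

    def write(self, row):
        self.rows.append(row)


def transfer(reader, writer, key, since=None):
    """Copy rows from reader to writer and return the number copied.

    If `since` is given, only rows whose integer key is greater than it
    are copied (a simple stand-in for incremental replication).
    """
    copied = 0
    for row in reader.read():
        if since is not None and int(row[key]) <= since:
            continue
        writer.write(row)
        copied += 1
    return copied


SAMPLE = "member_id,first_name\n1,Ann\n2,Bob\n3,Cy\n"

full = ListWriter()
full_count = transfer(CsvReader(SAMPLE), full, "member_id")

incremental = ListWriter()
incremental_count = transfer(CsvReader(SAMPLE), incremental, "member_id", since=1)
```

Comparing the returned count with the number of rows the source reports is the simplest form of the "tests to verify data" bullet.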

Giving Back: Open Source
http://sna-projects.com/sna/

We Build Things With Data

Give smart people great tools,
enable them to solve problems

How does Hadoop
enable rapid data
exploration?

R Streaming Also Easy

*from http://www.stat.uiowa.edu/~luke/classes/295-hpc/
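
Hadoop Streaming treats any program that reads lines on standard input and writes tab-separated key/value lines on standard output as a mapper or reducer, which is what makes scripting languages convenient here. The following is a minimal Python sketch of a trend-scoring reducer, not the script from the talk: the `topic<TAB>date<TAB>count` input layout, the scoring rule (latest count divided by the mean of earlier counts), and the names `parse`, `trend_scores`, and `run` are all assumptions, and the input is assumed to arrive sorted by topic and then by date.

```python
import sys
from itertools import groupby


def parse(lines):
    """Split 'topic<TAB>date<TAB>count' lines into typed tuples."""
    for line in lines:
        topic, date, count = line.rstrip("\n").split("\t")
        yield topic, date, int(count)


def trend_scores(lines):
    """Yield (topic, score) for input sorted by topic, then date.

    score = latest count / mean of all earlier counts. Topics with
    fewer than two observations or a zero baseline are skipped.
    """
    for topic, group in groupby(parse(lines), key=lambda rec: rec[0]):
        counts = [count for _, _, count in group]
        if len(counts) < 2:
            continue
        baseline = sum(counts[:-1]) / len(counts[:-1])
        if baseline > 0:
            yield topic, counts[-1] / baseline


def run(stream_in=sys.stdin, stream_out=sys.stdout):
    """Reducer entry point: read records, write 'topic<TAB>score' lines."""
    for topic, score in trend_scores(stream_in):
        stream_out.write(f"{topic}\t{score:.2f}\n")
```

Used as a reducer, the script would end with a call to `run()`; the same functions can be exercised locally on a small sorted sample before submitting a job.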

Business is recognizing the importance of analytics

We can also leverage...
• Connection Graph
• Recommendations
• Address Book Uploads
• Search Logs
• Profile Views & Activity
• Job Postings
• LinkedIn Groups
• LinkedIn Questions
• Company Pages
• Talent Match
• Web Referrals
• 1M+ Twitter Accounts
• Wikipedia Data
• Mechanical Turk
• Census, BLS, & Data.gov
• Much more...

How do we think of Analytics?

Data Jujitsu

Lots of Medium can be more powerful than Big

Data Scientist Lessons
• Follow the data, avoid assumptions
• Sanity check the extremes (0, infinity)
• Don’t get mired in rare edge cases
• Data Jujitsu: solve easier auxiliary problems
• Build smaller consistent samples to test code
• Establish a baseline model quickly, iterate often
• Use the right tool for the job at hand
• Iterate quickly with high level languages
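
One way to act on "build smaller consistent samples" is to sample by hashing a stable key, so the same members land in the sample on every run and across every dataset that shares the key. This is a minimal Python sketch under stated assumptions: `in_sample` and the `salt` value are hypothetical, and the key is assumed to be a member ID.

```python
import hashlib


def in_sample(member_id, percent, salt="sample-v1"):
    """Return True if member_id falls in a deterministic percent-sized sample.

    The ID is hashed into one of 100 buckets; buckets below `percent`
    are kept. The same ID always maps to the same bucket, and a smaller
    sample is always a subset of a larger one using the same salt.
    """
    digest = hashlib.md5(f"{salt}:{member_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < percent


# Keep roughly 10% of a hypothetical ID range.
sample_ids = [mid for mid in range(10000) if in_sample(mid, 10)]
```

Because membership depends only on the key, joins between sampled tables stay consistent, which random per-row sampling would not guarantee.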

We’re Hiring!
http://sna-projects.com/sna/
pskomoro@linkedin.com
@peteskomoroch


Rapid Data Exploration With Hadoop


4. Professional Identity


9. Hadoop at LinkedIn


12. Getting Data Out


14. Analytics Technologies


16. Prototyping Culture


18. Pig for Spatial Analytics

19. US County HeatMap

20. Pig for Trend Detection

21. Python Streaming Script

22. Sort Output & Display


24. Let’s Talk Data


26. What data do we start with?


30. Reconstruct Reality from Data Exhaust


32. Where did the bankers go?
