Introduction to Data Engineering
Vivek A. Ganesan
vivganes@gmail.com
Agenda
Copyright 2013, Vivek A. Ganesan, All rights reserved 1
o Introduction
o What is data engineering?
o Why data engineering?
o Required Skills
o Questions?
Introduction
o What’s with the name?
o All other names were taken :)
o Gods = Geeks on Data
o Well, it is now Geeking out on Data
o Why a Data Geek?
o Geeks are cool
o Data Geeks are way cool
Partial Omniscience (Super power of Prediction)
Data, Data, Data!
• Significant increase in data (Volume)
  • Social Networks
  • Transaction Logs
• Fast streams of data (Velocity)
  • Sensor data
  • Machine-to-machine data
• Different kinds of data (Variety)
  • Text
  • Audio
  • Video
• This trend is only going to grow!
Note: EB = Exabyte = 1,000 Petabytes (1 million Terabytes)
Big Data Trends
Before Big Data
• Life was simple … well mostly
• The ETL engineers managed data pipelines
• The Data Scientists (they weren’t called that, btw, they were mostly Statisticians who programmed in SAS, SPSS or S) did the analysis
• Data Warehouses, Data marts and OLAP cubes were the platforms
• Data Analysts mostly generated reports but they were proficient in SQL, Excel, Pivot Tables etc.
• Data Architects … well, they architected :)
• They managed:
  • Data models
  • Star Schemas
  • Data Governance
  • Master Data Management (MDM)
  • Data Security
• For the most part, they had to coax different groups to share data
Big Data – What Changed?
• Life … got interesting
• Huge data volumes: ETL became a problem
• Traditional statistical tools couldn’t handle the volume
• Data Warehouses, Data marts and OLAP cubes are no longer the primary means of analysis; “in situ” analysis is preferred, i.e. no moving data to an analytics platform
• Data Analysts are still on point for reports, but they no longer have SQL interfaces (thanks to NoSQL and MapReduce)
• Data Architects … well, they still need to architect :)
• Still need:
  • Data models
  • Data Governance
  • Data Security
• For the most part, they still have to coax different groups to share data
• They have to do all of this while the technology is rapidly evolving
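To make the "no SQL interface" point concrete, here is a minimal sketch (not from the original slides) of what a one-line SQL aggregate such as SELECT page, COUNT(*) FROM log GROUP BY page turns into when written as explicit map, shuffle and reduce steps. The log records and all function names are hypothetical, and everything runs in a single Python process rather than on a cluster.

```python
from collections import defaultdict

# Hypothetical transaction-log records: (user, page).
LOG = [
    ("alice", "/home"),
    ("bob", "/cart"),
    ("alice", "/cart"),
    ("carol", "/home"),
    ("bob", "/home"),
]


def map_phase(records):
    """Emit one (key, 1) pair per record, as a mapper would."""
    for _user, page in records:
        yield page, 1


def shuffle_phase(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values for each key, as a reducer would."""
    return {key: sum(values) for key, values in groups.items()}


def count_hits(records):
    """Equivalent of: SELECT page, COUNT(*) FROM log GROUP BY page."""
    return reduce_phase(shuffle_phase(map_phase(records)))


if __name__ == "__main__":
    print(count_hits(LOG))
```

An analyst who used to write the one-line query now has to think in terms of these three phases, which is the gap that the "SQL semantics" opportunity later in the deck is about.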
Life in the Big Data Universe
• The Good
  • Data recognized as an asset
  • Data Driven Products more common
  • Working with Data is cool
• The Bad
  • Complexity is overwhelming
  • No sophisticated toolset yet
  • Technology is fast changing
• The Ugly
  • No SQL!
  • Security
  • Governance
  • Performance
• The Opportunity
  • Solve for:
    • SQL semantics
    • Data Governance
    • Data Security
    • Benchmarking, Profiling and Performance measurement tools
  • Build:
    • Real-time solutions
    • Data Marts/Data Warehouses on top
Life in the Big Data Universe
Data Scientist
• Building Models
• Validation/Testing
• Algorithms
• Continuous Improvement
• Knowledge of:
  • Statistics
  • Linear Algebra
  • Machine Learning
  • R, Matlab etc.
• Deep Domain Knowledge

Data Analyst
• Report Generation
• Data Exploration
• Hypotheses Testing
• Pattern Discovery
• Correlations
• Serendipitous Discovery

Data Engineer
• Data Pipelines
• Manage Platforms
• Productionalize Algorithms
• Agile Development
• Knowledge of:
  • Platforms
  • Algorithms
  • Java, C++ etc.
  • Scripting languages like Python
Data Engineering
• Strong CS Background
  • Algorithms
  • Database theory
  • Scripting languages
  • Server side languages
• Distributed Systems Background
  • Clusters
  • Networking
  • Monitoring/Performance
• Data Science/Machine Learning
  • Search/IR
  • Text Analytics
  • Classification
  • Clustering
• Infrastructure
  • Hadoop
  • Cassandra
  • MongoDB
• Platforms
  • Solr
  • Hive
  • HBase
  • Mahout
• Applications
  • Recommendation Engines
  • Fraud Prevention
  • Disease Prevention
Data Engineer’s Role
• Data Dialysis – Cleaning up Data
  • Hard to do at Scale
  • Newer tools in this space
  • Great scope for innovation
• ETL -> ELT
  • Distributed Bulk loading
  • Full-fledged data pipelines
  • Supporting both data scientists and data analysts
• Productionalizing algorithms
  • Production support
  • Optimization
  • A/B Testing and Continuous Improvement
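The "ETL -> ELT" shift and the data clean-up step can be illustrated with a small sketch, assuming an in-memory SQLite database as a stand-in for the data platform (the slides do not name one here). The raw rows, table names and the run_elt function are hypothetical. The raw data is loaded untouched first and cleaned afterwards with SQL inside the store; in a classic ETL flow the same clean-up would run in a separate tool before loading.

```python
import sqlite3

# Hypothetical raw records with messy casing, whitespace and a missing amount.
RAW_ROWS = [
    (" Alice ", "12.50"),
    ("BOB", "7"),
    ("carol", ""),
]


def run_elt(rows):
    """Load raw rows as-is, then transform them with SQL inside the store."""
    conn = sqlite3.connect(":memory:")
    # Load: land the data unchanged in a staging table.
    conn.execute("CREATE TABLE staging (name TEXT, amount TEXT)")
    conn.executemany("INSERT INTO staging VALUES (?, ?)", rows)
    # Transform: clean up where the data already lives.
    conn.execute(
        """
        CREATE TABLE orders AS
        SELECT lower(trim(name)) AS name,
               CAST(amount AS REAL) AS amount
        FROM staging
        WHERE trim(amount) <> ''
        """
    )
    cleaned = conn.execute(
        "SELECT name, amount FROM orders ORDER BY name"
    ).fetchall()
    conn.close()
    return cleaned


if __name__ == "__main__":
    for row in run_elt(RAW_ROWS):
        print(row)
```

Keeping the staging table around means the raw data stays available if the cleaning rules change later, which is one common reason to prefer loading before transforming.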
About this Meetup : Structure
• Agile teams
• Monthly Scrum
  • Week 1: Introduction to Problem
  • Week 2: Algorithm + Platform
  • Week 3: Technical help (Algorithm, Platform, Testing and Deployment)
  • Week 4: Panel + Demo
    • Showcase Startups/Experts in the space
    • Teams show demos
    • Panel judges winners
    • We might have prizes (needs to be figured out)
• Weekly Meetup (on Mondays)
  • Might move to a bigger venue if there is enough demand
About this Meetup : Schedule
• May 29th: Kickoff
• Scrum 1
  • June 3rd – Collaborative Filtering Introduction
  • June 10th – MongoDB Introduction
  • June 17th – Analytics on MongoDB
  • June 24th – Panel + Demo
• Scrum 2 (TBD)
• Come along now, it will be fun!
• Oh, the name :)
Questions? Comments?
Thank You!
E-mail: vivganes@gmail.com
Twitter : onevivek
