wealthfront.com

DATA FLOW
IN THE DATA CENTER

Adam Cataldo @djscrooge
November 7, 2013
Wealthfront & Me
• Wealthfront is the largest and fastest growing software-based financial advisor
• We manage the first $10,000 for free and the rest for only
0.25% a year
• Our automated trading system continuously rebalances
a portfolio of low-cost ETFs, with continuous tax-loss
harvesting for accounts over $100,000
• I’ve been working on the data platform we use for
website optimization, investment research, business
analytics, and operations

Why the Ptolemy conference?
• This is not a talk about modeling, simulation, and
design of concurrent, real-time embedded systems
• This is a talk about the design of a data analytics
system
• It turns out many of the patterns are the same in both
fields

MapReduce & Hadoop

Hadoop at a Glance
• Scales well for large data sets
• Industry standard for data processing
• Optimized for batch-processing throughput

• Long latency
• Overkill for small data sets

Cascading

Why Cascading?
• Most real problems require multiple MapReduce jobs
• Provides a data-flow abstraction to specify data
transformations
• Builds on standard database concepts: joins, groups,
and so on
• Provides decent testing capabilities, which we’ve
extended

From SQL to Cascading

select name from users join mails on users.email = mails.to

Pipe joined = new CoGroup(users, new Fields("email"), mails, new Fields("to"));
Pipe name = new Retain(joined, new Fields("name"));

Cascading to Hadoop

[Diagram: the users and mails inputs each pass through their own mappers; join reducers combine the mapper outputs into the result.]
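The mapper and reducer stages on this slide can be sketched in plain Java. This is a minimal in-memory simulation, not Cascading or Hadoop code: the class name, the `Tagged` helper, the row layouts, and the sample data are all assumptions made for illustration. Mappers tag each record with its source and emit it under the join key, the shuffle groups records by key, and the reducer pairs users with mails inside each group.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory simulation of a reduce-side join; not Cascading or Hadoop code. */
final class ReduceSideJoinSketch {

    /** A mapper output value: the input it came from plus the payload carried into the join. */
    static final class Tagged {
        final String source;   // "users" or "mails"
        final String payload;
        Tagged(String source, String payload) { this.source = source; this.payload = payload; }
    }

    /**
     * Assumed row layouts: users rows are {name, email}, mails rows are {to, subject}.
     * Returns one user name per matching (user, mail) pair, like the SQL inner join.
     */
    static List<String> joinNames(List<String[]> users, List<String[]> mails) {
        // Map phase plus shuffle: emit (join key, tagged value) and group values by key.
        Map<String, List<Tagged>> shuffled = new TreeMap<>();
        for (String[] u : users) {
            shuffled.computeIfAbsent(u[1], k -> new ArrayList<>()).add(new Tagged("users", u[0]));
        }
        for (String[] m : mails) {
            shuffled.computeIfAbsent(m[0], k -> new ArrayList<>()).add(new Tagged("mails", m[1]));
        }
        // Reduce phase: within each key, pair every users value with every mails value.
        List<String> result = new ArrayList<>();
        for (List<Tagged> group : shuffled.values()) {
            for (Tagged left : group) {
                if (!left.source.equals("users")) continue;
                for (Tagged right : group) {
                    if (right.source.equals("mails")) result.add(left.payload);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> users = List.of(
            new String[] {"Ann", "ann@example.com"},
            new String[] {"Bob", "bob@example.com"});
        List<String[]> mails = List.of(
            new String[] {"ann@example.com", "hello"},
            new String[] {"ann@example.com", "again"},
            new String[] {"carol@example.com", "hi"});
        System.out.println(joinNames(users, mails)); // [Ann, Ann]
    }
}
```

Bob has no mail and Carol is not a user, so neither appears in the output, which matches inner-join semantics.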

Getting data ready for Cascading

[Diagram: extract, transform, load. Data moves from the production MySQL DB into Avro files and on to Amazon Simple Storage Service.]

Why Avro?

• A compact data format, capable of storing large data sets
• We compress with Google Snappy
• Compressed files remain splittable into 128 MB chunks
• De facto file format for Hadoop
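As a rough illustration of what splitting into 128 MB chunks means for parallelism, the arithmetic can be sketched as a ceiling division. The class and method names are hypothetical, and the calculation is a simplification: Hadoop's real split computation applies additional rules, so treat the result as an estimate only.

```java
/** Hypothetical helper: estimates how many 128 MB chunks a file would be divided into. */
final class SplitCountSketch {
    static final long CHUNK_BYTES = 128L * 1024 * 1024; // 128 MB, the chunk size from the slide

    /** Ceiling division: every started chunk counts as one unit of parallel work. */
    static long chunkCount(long fileBytes) {
        if (fileBytes < 0) throw new IllegalArgumentException("negative size");
        return (fileBytes + CHUNK_BYTES - 1) / CHUNK_BYTES;
    }

    public static void main(String[] args) {
        long oneGigabyte = 1024L * 1024 * 1024;
        System.out.println(chunkCount(oneGigabyte));     // 8
        System.out.println(chunkCount(oneGigabyte + 1)); // 9
    }
}
```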

Running Cascading Jobs
[Diagram: Elastic MapReduce, the production MySQL DB, Amazon Simple Storage Service, online systems, and the Redshift data warehouse.]

What do we do with the data?
• We use it to track how well the investment product is
performing
• We use it to track how well the business is performing
• We use it to monitor our production systems
• We use it to test how well new features perform on the
website

Bandit Testing
• When rolling new features out, we expose
the new version to some users and the old
version to the rest
• We monitor what percent of users
“convert”: sign up, fund account, etc.
• We gradually send more traffic to the
winning variant of the experiment
• Similar to A/B testing, but way faster

Does anyone know where the name bandit testing comes from?

Thompson Sampling
1. Estimate the probability for each variant of the
experiment that it performs best, using Bayesian
inference
2. Weight the percentage of traffic sent to each variant
according to this probability
3. End the experiment when one variant has a 95%
chance of winning, or when the losing arms have no
more than a 5% chance of beating the winner by more
than 1%
4. In 2012, Kaufmann et al. proved the asymptotic optimality of
Thompson sampling
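Steps 1 to 3 can be sketched as follows. This is a minimal illustration, not Wealthfront's implementation: it assumes conversions are modeled with Beta posteriors under a uniform Beta(1, 1) prior, estimates the "performs best" probability by Monte Carlo sampling, and checks only the first stopping condition (one variant reaching the 95% threshold). The class name, method names, and visitor counts are made up.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical sketch of steps 1 to 3 with Beta posteriors; names and prior are assumptions. */
final class ThompsonSketch {

    /** Draws from Gamma(shape, 1) for a positive integer shape as a sum of exponentials. */
    static double gammaInt(int shape, Random rng) {
        double sum = 0.0;
        for (int i = 0; i < shape; i++) {
            sum -= Math.log(1.0 - rng.nextDouble()); // 1 - U lies in (0, 1], so the log is finite
        }
        return sum;
    }

    /** Draws from Beta(a, b) for positive integers a and b using two Gamma draws. */
    static double betaInt(int a, int b, Random rng) {
        double x = gammaInt(a, rng);
        double y = gammaInt(b, rng);
        return x / (x + y);
    }

    /** Step 1: Monte Carlo estimate of the probability that each variant converts best. */
    static double[] probabilityBest(int[] conversions, int[] visitors, int draws, Random rng) {
        int[] wins = new int[conversions.length];
        for (int d = 0; d < draws; d++) {
            int best = 0;
            double bestRate = -1.0;
            for (int v = 0; v < conversions.length; v++) {
                // Posterior under the assumed uniform prior: Beta(successes + 1, failures + 1).
                double rate = betaInt(conversions[v] + 1, visitors[v] - conversions[v] + 1, rng);
                if (rate > bestRate) { bestRate = rate; best = v; }
            }
            wins[best]++;
        }
        double[] probability = new double[wins.length];
        for (int v = 0; v < wins.length; v++) probability[v] = (double) wins[v] / draws;
        return probability;
    }

    /** Step 3, first condition only: stop once some variant reaches the threshold. */
    static boolean shouldStop(double[] probabilityBest, double threshold) {
        for (double p : probabilityBest) if (p >= threshold) return true;
        return false;
    }

    public static void main(String[] args) {
        // Made-up counts: variant 0 converts 5 of 100 visitors, variant 1 converts 30 of 100.
        double[] p = probabilityBest(new int[] {5, 30}, new int[] {100, 100}, 2000, new Random(42));
        // Step 2: p doubles as the share of new traffic to send to each variant.
        System.out.println(Arrays.toString(p) + " stop=" + shouldStop(p, 0.95));
    }
}
```

The second stopping condition (losing arms having at most a 5% chance of beating the winner by more than 1%) could be estimated from the same posterior draws, but it is left out here to keep the sketch short.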

What’s Redshift?
• Amazon’s cloud-based data
warehouse database
• To support ad-hoc analysis,
we copy all raw and computed
data into Redshift
• It’s a column-oriented
database, optimized for
aggregate queries and joins
over large batch sizes
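To see why a column-oriented layout suits aggregate queries, consider this small sketch. It is not how Redshift is implemented; the class, the column names, and the data are hypothetical. Each column lives in its own array, so an aggregate over one column never reads the others.

```java
/** Hypothetical sketch of a column-oriented table; not how Redshift is implemented. */
final class ColumnStoreSketch {
    // Each column is stored contiguously, so an aggregate reads only the column it needs.
    final String[] accountId;
    final double[] balance;

    ColumnStoreSketch(String[] accountId, double[] balance) {
        if (accountId.length != balance.length) throw new IllegalArgumentException("ragged columns");
        this.accountId = accountId;
        this.balance = balance;
    }

    /** Equivalent of "select sum(balance)": scans one array and never touches accountId. */
    double sumBalance() {
        double total = 0.0;
        for (double b : balance) total += b;
        return total;
    }

    public static void main(String[] args) {
        ColumnStoreSketch table = new ColumnStoreSketch(
            new String[] {"a1", "a2", "a3"}, new double[] {100.0, 250.0, 50.0});
        System.out.println(table.sumBalance()); // 400.0
    }
}
```

A row-oriented layout would store each account's fields together, so the same sum would have to step over every field of every row.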

What are the technical challenges?
• Testing complicated analytics computations is nontrivial
  - We ended up writing a small library to make testing
    Cascading jobs simpler

• Running multiple Hadoop jobs on large datasets takes a
long time
  - We use Spark for prototyping, to get a speedup

• Your assumptions about the constraints on the data are
always wrong

Where’s this heading?
• We have a unique collection of
consumer web data and
financial data
• There are many ways we can
combine this data to make our
product better
• Hypothetical example: suggest
portfolio risk adjustments
based on a client’s withdrawal
patterns

How is this relevant?
• We use data flow as the
primary model of computation
• While the time scales are much
slower, we have timing
constraints, called SLAs,
imposed by production use
cases
• We have to make sure all code
can safely execute
concurrently on multiple
machines, cores, and threads

Disclosure
Nothing in this presentation should be construed as
a solicitation or offer, or recommendation, to buy
or sell any security. Financial advisory services
are only provided to investors who become
Wealthfront clients pursuant to a written agreement,
which investors are urged to read and carefully
consider in determining whether such agreement is
suitable for their individual facts and
circumstances. Past performance is no guarantee of
future results, and any hypothetical returns,
expected returns, or probability projections may not
reflect actual future performance. Investors should
review Wealthfront’s website for additional
information about advisory services.