How to Be a Productive Data Engineer
Rafal Wojdyla - rav@spotify.com
Note: My views are my own and don't necessarily represent those of Spotify.
Speaker’s cut version – includes raw notes
Hi – my name is Rafal and I'm an engineer at Spotify. In this
presentation I will talk about how to be a productive data
engineer. I will combine the knowledge of multiple productive
engineers at Spotify and touch on different areas of your
daily work life. I will use real world examples, failures,
success stories – but mostly failures. So if you are, or want to
be, a data engineer, hopefully after this presentation every
single one of you will learn something new. I hope this
learning will improve your productivity, bring a new feature to
your infrastructure or maybe spark a discussion inside your
team.
• Operations
• Development
• Organization
• Culture
We will go through the lessons of a productive data engineer and cover 4
different areas – operations, development, organization and culture.
So we will kind of work our way up from low level admin tips and
spectacular disasters; after hard-core operations we will talk about
development on Hadoop – what to avoid, and how Spotify is overcoming
the huge problem of legacy Hadoop tools. After the development part we
will take a look at how organization structure can affect your productivity
and how one can tackle this problem. We will finish with the ubiquitous
culture – how culture can help in being productive. So in a way we
start with the low level scope of your productivity – how the cluster itself
can affect you, how operators can help you out – to later talk about
your development decisions and the structure of the company, and finish
with how the environment can influence your work. There will be time for
questions at the end so please keep them till then.
What is Spotify?
For everyone:
• Streaming Service
• Launched in October 2008
• 60 Million Monthly Users
• 15 Million Paid Subscribers
+ and for me:
• 1.3K nodes Hadoop cluster
Before we go deep into the presentation let's first talk about what Spotify
is. Spotify is a streaming service, launched in 2008 in beautiful
Stockholm, Sweden. The current public numbers are that we have 60M
monthly users and 15M subscribers. What's unique about the Spotify
service is that it can play a perfect song for every single moment, and
some of this is powered through Hadoop, which makes it even cooler!
For me Spotify is also 1.3K Hadoop nodes – which is like a baby for
a team of 4 people. A baby that is sometimes very frustrating – shit
happens all the time and you have to wake up in the middle of the
night and clean it up – but it's our baby and we love it. Without further
ado let's move to the core of the topic and start with operations. If
there's one lesson that comes from operating Hadoop clusters from a
handful of nodes in the corner of the office to 1300 nodes – it would be
AUTOMATION.
Automation
Automation is crucial – especially when talking about Hadoop. Hadoop is
a huge beast to manage: there are loads of moving parts, loads of new
stuff coming in, and there's always a reason for Hadoop clusters to go
down. As if Hadoop were not enough, there's always something that your
company will push on the poor operators – whether it's a new Linux distro,
or maybe there's a bug in libc and you need to restart all the
daemons, and so on and so on.
ME
ADAM
You want to be proactive and do as little as possible – without
automation even coffee won't help you. You want to be Adam – be
happy and work on new features, enhance Hadoop, bring joy to
Hadoop users. You don't want to be the poor operator on the left,
focused primarily on putting out fires, exhausted. By the way, this is a
picture of Adam and me after 40 hours of a Hadoop upgrade from
Hadoop 1 to 2 in 2013.
Apache Ambari
Cloudera Manager
So how do you reach good enough automation of your cluster? Let's take Spotify as an example
first. Spotify started with Hadoop in 2009, very early; then there were a couple of tiny expansions and a
short episode of Hadoop on EMR, and we went back to on-premise with a shiny new 60 nodes.
At that point we had to make a decision on how to manage Hadoop – and because back then
CM was limited, Ambari didn't exist, and Spotify loves Puppet, we decided to use
Puppet for this use case. It was a rather big effort and took time, during which we had to drop
some work, put out fires and work on Puppet, but it was a great investment. Today, after a few
iterations, we like our Puppet setup. As an example, the most recent ongoing expansion is rather easy:
we name the machines using the proper naming convention and Puppet kicks in, installs all the
services and configuration, and keeps the machines in a normalized state – a very, very important piece
of our infrastructure. But wait – the slide says something about Ambari and CM. Yes – because if
we were to set up a cluster today, we would most likely evaluate at least these two solutions.
Like I said, Spotify basically didn't have a choice and we settled on Puppet, and we are happy
about it right now, but there's huge leverage you can gain out of using these tools – loads of
features that you get out of the box that we had to implement ourselves. So if you are
considering building a Hadoop cluster, make sure to give these tools a good try. They may not
solve all your issues and use cases, but they will for sure bring loads of value, and in time you will get
even more features just from the community – which is great, and is something that we are
missing.
+ Puppet
That said – even if you decide to use Ambari or CM, most likely
you will still need some kind of configuration management tool.
Whether it's Puppet, Chef, Salt or whatever is your favorite,
you will need one: there will always be some extra library that you
need to install and configure, or some user to create, and so on.
There's another interesting outcome of us building our own Puppet
infrastructure – we know exactly how our Hadoop is configured,
every single piece of it, which comes in handy in case of
troubleshooting. In this case we touched a little on the problem of 3rd party
solutions versus implementing our own tailored solutions. How many of
you are aware of the NIH problem?
Not Invented
Here
I will argue that there's a number of cases and teams where this
problem occurs at Spotify. The NIH problem, in a nutshell, is when you
undervalue 3rd party solutions and convince others to implement your
own – in most cases this is a huge problem. The lesson
that we have learned is that you need to give external tools a try and
experiment, but don't expect something to solve all your problems –
preferably define acceptance metrics prior to evaluating the tools.
Never Invented
Here
But what is actually very interesting in the data area is a kind of
sibling problem of NIH – NeIH – a problem described, I
believe, by Michael O. Church. It's the opposite approach: it's
when you overvalue 3rd party solutions and end up in a messy
place of glue implementation madness. There are loads of great tools
in the BigData area – not all of them work well with each other, and not all of
them do well what they are meant to do. I urge you to be critical;
sometimes implementing your own tool, or keeping a shiny new
framework out of your infrastructure for a while, may be a good thing to do – but
it has to be a data driven decision that brings value. Think about these
two problems, and ask yourself: are there examples of such solutions
at your company?
Wild Wild West
To illustrate this I will tell you a real story – a story of great failure, and success at the end. We had an
external consultant at Spotify whose goal was to certify our cluster – basically 4 days of looking at
different corners of our infrastructure. The first two days went really smoothly: we went through our configuration,
the state of the cluster and so on, and we could not find a way to improve our cluster easily – which made us feel
proud, because you know, we have this world class, talented Hadoop expert over and he can't find a way to
improve our cluster, right? But oh boy, was that a big mistake. On day number 3, we are sitting in a
room, the whole team and the consultant, and due to miscommunication and misconfiguration our standby NN and
RM go down. That is still fine, because the RM starts in a minute or two and the standby can start in the background – but
unfortunately, during the troubleshooting, by mistake we killed our active NN. At this point basically the
whole infrastructure was down – at our scale that means about 2 hours of downtime. It was bad! But wait
for day number 4. The next day we are sitting in the room, again the whole team and the consultant, but also our
managers, and we listen to the consultant saying that our testing and deployment procedures are like the Wild Wild
West and we act like cowboys. It was hard to listen to, but he was right and we knew it. The next thing we did
was to go to a room with the team and come up with something to solve this issue. We came up with something
that may be obvious – a preproduction cluster: a cluster made out of the same machine profile and almost
identical configuration, which we use for testing. But how to test was the real question. We went into
research mode and started reading and watching presentations – we were especially impressed by a tool
called HIT by Yahoo, so we contacted the creators. Unfortunately there was no plan to open source it – but
they gave us a nice tip: look at Apache Bigtop.
Apache Bigtop
Apache Bigtop primarily facilitates building, testing and deployment of a Hadoop
distribution – but you can also use it in a slightly different way: you can point
Bigtop at your preproduction cluster and use its smoke tests to test the
infrastructure. So our current flow of testing and deployment is to first deploy
to the preproduction cluster and run the Bigtop tests to get instant feedback about the
change. If the feedback is fine, we deploy to production; if not, there's something
wrong with the change and we know that before it is deployed to production.
One finding from using Bigtop is that it's actually very easy to extend, so we
were able to add smoke tests for our own tools like snakebite and Luigi. Also,
very importantly, we run some production workloads as part of the
smoke tests – which actually makes us feel sure about the change.
So in the case of Apache Bigtop the problem was testing of Hadoop infrastructure
– even though Bigtop is not perfect for this, it provides loads of value just out of the
box, and thus it's a great example of avoiding the NIH problem.
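To give an idea of what such an extra smoke test can look like, here is a minimal sketch in the spirit of the ones we added, written against snakebite's Python client – the namenode address and test path are assumptions, not our real setup:

from snakebite.client import Client

NAMENODE_HOST = 'preprod-namenode.example.net'  # assumption: preproduction namenode
NAMENODE_PORT = 8020
TEST_DIR = '/tmp/smoke-test'

def smoke_test_hdfs():
    client = Client(NAMENODE_HOST, NAMENODE_PORT)
    # create a directory, list it back, then clean up -- any RPC failure raises
    list(client.mkdir([TEST_DIR], create_parent=True))
    assert any(entry['path'] == TEST_DIR for entry in client.ls(['/tmp']))
    list(client.delete([TEST_DIR], recurse=True))

if __name__ == '__main__':
    smoke_test_hdfs()
    print('HDFS smoke test passed')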
Enable log
aggregation
As an operator there are many ways to help yourself and also
delegate some of the work to the developers themselves. One great
feature of Hadoop 2, disabled by default, is log aggregation – how
many of you have log aggregation enabled on your cluster? In a
nutshell, this feature will aggregate YARN logs from the workers and store
them on HDFS for inspection – very useful for troubleshooting. So
most of you probably know how to enable it, right?
To enable log aggregation
yarn.log-aggregation-enable = true
yarn.log-aggregation.retain-seconds = ?
It's dead simple. But there's one question – how long should we keep
the logs for? So we thought about it for a while, talked with HW a little
bit, and since we have a huge cluster, why not store them for a long time –
maybe we will need these logs for some analytics etc.
+ <property>
+ <name>yarn.log-aggregation-enable</name>
+ <value>true</value>
+ </property>
+
+ <property>
+ <name>yarn.log-aggregation.retain-seconds</name>
+ <value>315569260</value>
+ <!--retention: 10 years-->
+ </property>
This is our initial change to the configuration – 10 years. Does anyone
know what bad things can happen if you do that?
Heap Memory used is 97%
If you run enough jobs/tasks, after some time you will see something
like this on your NN – and when you see something like this on your
NN, then you end up with a hellelephant!
Hellelephant
It's a situation when your Hadoop cluster spectacularly goes down.
What happens is that log aggregation will in time create many files,
and a very important consequence of many files on HDFS is a growing heap
size – and then you get an out of memory on the NN. The lesson is that it's
a good idea to alert on heap usage for your master daemons, but also
to understand your configuration and its consequences: keep yourself
up to date on changes in the configuration, read the code behind Hadoop
configuration keys and how they are connected to each other
– not all configuration parameters are documented.
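As an illustration of such an alert, here is a minimal sketch that polls the NameNode's /jmx servlet – the host, the port (the Hadoop 2 web UI default) and the 90% threshold are assumptions to adapt:

import requests

NN_JMX = 'http://namenode.example.net:50070/jmx?qry=java.lang:type=Memory'

def heap_used_fraction():
    # the Memory MBean exposes used/max heap in bytes
    heap = requests.get(NN_JMX).json()['beans'][0]['HeapMemoryUsage']
    return float(heap['used']) / heap['max']

if __name__ == '__main__':
    fraction = heap_used_fraction()
    if fraction > 0.90:  # illustrative threshold -- page someone before the hellelephant
        print('NameNode heap at %.0f%% -- alert!' % (fraction * 100))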
Custom logs
• Profiling
• Garbage collection
The same log aggregation machinery also gives you a channel for
custom logs: anything your tasks write to stdout or stderr – profiler
output, garbage collection logs and so on – is aggregated to HDFS
alongside the regular task logs, where you can inspect it later.
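For example, to get garbage collection logs into the aggregated task logs, it's enough to add standard JVM GC flags to the task options – a hedged sketch, the heap size here is illustrative:

mapreduce.map.java.opts = -Xmx1024m -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
mapreduce.reduce.java.opts = -Xmx1024m -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps

The GC output goes to the task JVM's stdout, which log aggregation ships to HDFS with the rest of the container logs.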
Right tool for
the job
There are a couple of interesting lessons about productive development. The first,
and arguably most important, is to pick the right tool for the job – ask what is currently
the most important value to bring. Let's talk Spotify: in 2009 Spotify started with
Hadoop streaming as the supported framework for MR development. Hadoop
streaming basically enables you to implement MR jobs in languages other
than Java, and for many years it was THE framework – because Spotify loved
Python and it enabled us to iterate faster and thus provide knowledge for our
business. Time was passing and our Hadoop cluster was growing; in time we
needed something different, something better when it comes to performance
but also maturity. After a long evaluation – and I encourage you to watch the
presentations by David Whiting about different frameworks – we decided
to use Apache Crunch as the supported framework for batch MR. Why? A couple
of reasons – first, ease of testing and type safety.
David’s presentations:
* https://www.youtube.com/watch?v=XKY_0s7pESQ
* https://www.youtube.com/watch?v=-WOiZ2w7xtI
This graph shows the number of successful and failed jobs divided by
framework over 6 months – and these are production jobs. As you can
see, the two most popular frameworks are Hadoop streaming and Crunch
– but the difference between failed and successful jobs is crucial.
Crunch jobs behave much better and have better testing. Type safety
helps to discover problems at compile time, and the testing framework
that comes with Crunch – which we were able to enhance with the Hadoop
minicluster – helps users easily test their jobs. It basically makes
testing easy, something that we missed for our Hadoop streaming
jobs. But performance is another thing.
On this graph we can see map throughput for Apache Crunch and
Hadoop streaming – these are production workloads over 6 months –
and there's a huge difference. Crunch turns out to be on average 8 times
faster. What is more interesting is that we actually see higher
utilization of our cluster the more Crunch jobs run on it –
which makes us super happy.
Right abstraction
for the job
Another thing that Crunch provides is great abstraction – and that is
another thing a productive developer needs to keep in mind: pick
the right abstraction for the job. In the case of Crunch we can start
thinking in terms of high level operations like filter, groupBy, joins and
so on, instead of the old map/reduce legacy. This makes the implementation
more intuitive and simply pleasant, and thus makes the developer
experience much better. The interesting thing that we have observed
is that a higher abstraction may remove some of the opportunities for
optimization, so it's not as easy to implement the best performing
job – but on the other hand it reduces the problem of premature
optimization, and on average it performs really well. There are very few
people at Spotify who actually know how to optimize pure Java MR jobs or
Hadoop streaming jobs – but the average performance that we
get from Crunch turns out to be really good, as you could see on the
performance graph.
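Crunch itself is Java, but the shift in mindset is language neutral. As a toy illustration (not Crunch code): instead of writing map/shuffle/reduce plumbing, you state the intent – filter, group by key, aggregate:

# toy data: (user, play_count) pairs
plays = [('alice', 3), ('bob', 1), ('alice', 2), ('bob', 4)]

# collection style: say what you want, not how to shuffle it
totals = {}
for user, count in (p for p in plays if p[1] >= 2):   # filter
    totals[user] = totals.get(user, 0) + count        # group by key + sum

print(totals)  # {'alice': 5, 'bob': 4}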
Scaling
machines is
easy, scaling
people is hard
We do have loads of nodes – and we have scaling machines nailed
down; Crunch scales very well. But there's a big problem that we
currently have: scaling people. How do you scale support and best
practices? We constantly see problems with code repetition, HDFS
mess, lack of data management, YARN resource contention – all of this
brings our productivity down. There's not enough time to go through
all of them, but some of these problems we are trying to tackle with
nothing different than our beloved automation. Let's see examples:
Automation
• Map split size
• Number of reducers
• HDFS data retention
• User feedback (ongoing)
We automate the map split size calculation, and thus the number of map tasks, but also the
number of reducers, and therefore the number and size of output files – all this is done by
estimation over historical data using our workflow manager Luigi, which I encourage you to take
a look at! Luigi github: https://github.com/spotify/luigi
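To give a flavour, here is a toy sketch in the spirit of that automation – a Luigi Hadoop task that derives its reducer count from an estimated input size instead of hard-coding it. The helper, numbers and paths are illustrative assumptions, not our production code:

import luigi
import luigi.contrib.hadoop
import luigi.contrib.hdfs

TARGET_BYTES_PER_REDUCER = 1024 ** 3  # aim for roughly 1 GB per reducer (illustrative)

def estimated_input_bytes():
    # stand-in for a lookup against historical job stats
    return 120 * 1024 ** 3

class AggregatePlays(luigi.contrib.hadoop.JobTask):
    @property
    def n_reduce_tasks(self):
        # derive the reducer count (and thus output file count/size) from the estimate
        return max(1, estimated_input_bytes() // TARGET_BYTES_PER_REDUCER)

    def output(self):
        return luigi.contrib.hdfs.HdfsTarget('/tmp/plays-aggregated')

    def mapper(self, line):
        user, _rest = line.split('\t', 1)
        yield user, 1

    def reducer(self, user, counts):
        yield user, sum(counts)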
We are about to finish the second iteration of our HDFS retention policy, which will
automatically remove data, therefore reducing HDFS usage and, in the long term,
hopefully reducing the HDFS legacy mess.
Another ongoing effort is the second iteration of automatic user feedback. We already
expose a database with aggregated information about all MR jobs, which our users can
query to learn how their jobs are performing – but we also plan another, very simple
iteration, focused on Crunch, that right after a workflow pipeline is done
will provide the user with instant feedback – memory usage, garbage collection and so
on: very simple tweaks users can apply to improve their jobs. For example, if a user
gives a pipeline 8GB of memory for each task and, after going through the counters, we
see that the tasks are actually using only 3GB at most, instant feedback to reduce the memory
could improve the multitenancy of your cluster and thus improve productivity.
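That last hint boils down to a tiny heuristic over two numbers that are already in the job counters. A sketch – the 0.6 threshold and 1.25 headroom factor are illustrative:

def memory_feedback(allocated_mb, peak_used_mb, headroom=1.25):
    # suggest a smaller container when peak usage sits far below the grant
    suggested_mb = int(peak_used_mb * headroom)
    if suggested_mb < 0.6 * allocated_mb:
        return 'consider lowering task memory from %d MB to ~%d MB' % (
            allocated_mb, suggested_mb)
    return 'allocation looks reasonable'

print(memory_feedback(8192, 3072))  # the 8 GB vs 3 GB case from above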
Organization
With that, let's talk about organization structure – how it can improve your
productivity. But before that, let's take a look at a graph.
This graph shows Hadoop availability by quarter at Spotify; higher is
of course better. OK – so let's see what happened here:
Ownerless
In the first part the Hadoop cluster was ownerless: it was best effort
support by a team of people who mostly didn't even want to do
Hadoop operations. Therefore multiple days of downtime happened,
and the infrastructure was in bad shape, denormalized – overall a terrible
state to be in. But there was a light at the end of the tunnel – in Q3
we decided to create a squad: 3 people focused solely on the
Hadoop infrastructure.
Ownerless
Squad
There was instant feedback right after the squad was created – users
were happy and the infrastructure was getting into shape. One of the first
decisions we made was to move to YARN in Q4.
Ownerless
Squad
Upgrades
In Q4 and the beginning of '14 we again saw a drop in availability, mostly
due to a huge upgrade and its consequences thereafter. The
upgrade itself took a whole weekend, and after the upgrade we saw
many issues and fires that we had to put out. During this time we
were mostly reactive, but also working on polishing our Puppet
manifests. The whole situation stabilized after most fires were gone and
Puppet was in good shape.
Ownerless
Squad
Upgrades
Getting there
Our goal is to keep Hadoop at 3 nines of availability – that is, at most
about nine hours of downtime a year – and we have been getting there
since Q2 2014. The Hadoop squad is receiving constant feedback from
users, and it's common to hear that availability was drastically improved,
which improved productivity and the overall experience. That is great and
makes us want to work even harder to achieve better results.
Culture
With that, let's now talk about what surrounds us – the culture. I
strongly believe the culture at Spotify has a huge influence on
productivity. There are three main pillars of this culture.
Experiment
Fail Fast
Embrace Failure
Experiment, fail fast and embrace failure. We love to experiment and
we have time to experiment – whether it's the company-wide hack week
or R&D days, if you wish to experiment there's time to do that, and
with loads of curious people at Spotify there's always something
going on. The most successful data-based experiments are Luigi – a
Hadoop workflow manager – and snakebite – a pure Python HDFS client –
and I encourage you to take a look at them. Fail fast – take small steps –
and don't be afraid to admit failure; keep it as part of the learning process,
to the point of embracing it. Talk about your failures, share them publicly,
for example through presentations both internally and externally – it
will make experimentation, and thus innovation, flow much more smoothly. To
back this up with an example, let's talk about one ongoing experiment.
Luigi github: https://github.com/spotify/luigi
Snakebite github: https://github.com/spotify/snakebite
Spark
But we
have
tried!
Non grata
Spark is pretty much an ongoing experiment that we come back to
every now and then, but officially it's not welcome on the production
cluster due to immaturity and poor multitenancy support. That said,
the most recent releases (>1.3) are very promising; we are
constantly playing with it and have high hopes for it, especially for the
most recent dynamic resource allocation feature. There's not much
time left, but I would like to share with you two important
lessons from our evaluation of a heavy Spark job.
Spark
spark.storage.memoryFraction (0.6)
spark.shuffle.memoryFraction (0.2)
In shuffle heavy algorithms reduce
cache fraction in favour of shuffle.
The first hint is about memory settings. There are two important settings
that can improve the stability of your heavy Spark jobs: the memory
available for caching – spark.storage.memoryFraction – and the memory
available for shuffle – spark.shuffle.memoryFraction. The defaults
are 0.6 and 0.2, leaving 0.2 for the runtime. In our case we had a heavy
machine learning job that was doing almost a terabyte of shuffle but
(proportionally) very little caching. Initially we had issues with the shuffle
step, but reducing the storage memory and leaving extra memory for
shuffle and the runtime improved stability.
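In PySpark terms this looks roughly like the following – a sketch using the Spark 1.x legacy memory settings, with illustrative fractions to tune for your own job:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName('shuffle-heavy-ml-job')            # illustrative name
        .set('spark.storage.memoryFraction', '0.3')    # shrink the cache share...
        .set('spark.shuffle.memoryFraction', '0.5'))   # ...to give shuffle more room
sc = SparkContext(conf=conf)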
Spark
spark.executor.heartbeatInterval (10K)
spark.core.connection.ack.wait.timeout (60)
Increase in case of long GC pauses.
Another issue that we hit was long GC pauses: executors
would disappear, which in turn triggers recomputation and, in the end,
potentially application failure. After tweaking the heartbeat interval and
the ack.wait timeout we saw an improvement in stability, and even though GC
pauses still occurred they were less harmful.
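And a corresponding sketch for the GC-pause tweaks – the values are illustrative, the defaults are the ones on the slide:

from pyspark import SparkConf

conf = (SparkConf()
        .set('spark.executor.heartbeatInterval', '60000')        # milliseconds; default is 10000
        .set('spark.core.connection.ack.wait.timeout', '300'))   # seconds; default is 60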
Learnings
• Operations → Automation
• Development → Abstraction
• Organization → Team
• Culture → Experiment
Join the band
Engineers wanted in
NYC & Stockholm
http://spotify.com/jobs

Weitere ähnliche Inhalte

Andere mochten auch

Trivadis TechEvent 2017 With the CLI through the Oracle Cloud Martin Berger
Trivadis TechEvent 2017 With the CLI through the Oracle Cloud Martin BergerTrivadis TechEvent 2017 With the CLI through the Oracle Cloud Martin Berger
Trivadis TechEvent 2017 With the CLI through the Oracle Cloud Martin BergerTrivadis
 
Purchasing executive perfomance appraisal 2
Purchasing executive perfomance appraisal 2Purchasing executive perfomance appraisal 2
Purchasing executive perfomance appraisal 2tonychoper1004
 
Production executive perfomance appraisal 2
Production executive perfomance appraisal 2Production executive perfomance appraisal 2
Production executive perfomance appraisal 2tonychoper1004
 
Trivadis TechEvent 2017 Secrets of creation of reliable + maintainable (=cost...
Trivadis TechEvent 2017 Secrets of creation of reliable + maintainable (=cost...Trivadis TechEvent 2017 Secrets of creation of reliable + maintainable (=cost...
Trivadis TechEvent 2017 Secrets of creation of reliable + maintainable (=cost...Trivadis
 
Seo executive perfomance appraisal 2
Seo executive perfomance appraisal 2Seo executive perfomance appraisal 2
Seo executive perfomance appraisal 2tonychoper1004
 
Principal engineer perfomance appraisal 2
Principal engineer perfomance appraisal 2Principal engineer perfomance appraisal 2
Principal engineer perfomance appraisal 2tonychoper1004
 
Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo...
Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo...Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo...
Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo...Edureka!
 
2017 Florida Data Science for Social Good Big Reveal
2017 Florida Data Science for Social Good Big Reveal2017 Florida Data Science for Social Good Big Reveal
2017 Florida Data Science for Social Good Big RevealKarthikeyan Umapathy
 
Data Scientist/Engineer Job Demand Analysis
Data Scientist/Engineer Job Demand AnalysisData Scientist/Engineer Job Demand Analysis
Data Scientist/Engineer Job Demand AnalysisBilong Chen
 
Logistic executive perfomance appraisal 2
Logistic executive perfomance appraisal 2Logistic executive perfomance appraisal 2
Logistic executive perfomance appraisal 2tonychoper5504
 
Computer software engineer performance appraisal
Computer software engineer performance appraisalComputer software engineer performance appraisal
Computer software engineer performance appraisaljamespoter576
 

Andere mochten auch (13)

MA2017 | Danny Nou | The Science of Empathy
MA2017 | Danny Nou | The Science of Empathy MA2017 | Danny Nou | The Science of Empathy
MA2017 | Danny Nou | The Science of Empathy
 
Trivadis TechEvent 2017 With the CLI through the Oracle Cloud Martin Berger
Trivadis TechEvent 2017 With the CLI through the Oracle Cloud Martin BergerTrivadis TechEvent 2017 With the CLI through the Oracle Cloud Martin Berger
Trivadis TechEvent 2017 With the CLI through the Oracle Cloud Martin Berger
 
Purchasing executive perfomance appraisal 2
Purchasing executive perfomance appraisal 2Purchasing executive perfomance appraisal 2
Purchasing executive perfomance appraisal 2
 
Production executive perfomance appraisal 2
Production executive perfomance appraisal 2Production executive perfomance appraisal 2
Production executive perfomance appraisal 2
 
Trivadis TechEvent 2017 Secrets of creation of reliable + maintainable (=cost...
Trivadis TechEvent 2017 Secrets of creation of reliable + maintainable (=cost...Trivadis TechEvent 2017 Secrets of creation of reliable + maintainable (=cost...
Trivadis TechEvent 2017 Secrets of creation of reliable + maintainable (=cost...
 
Seo executive perfomance appraisal 2
Seo executive perfomance appraisal 2Seo executive perfomance appraisal 2
Seo executive perfomance appraisal 2
 
Energy to 2050
Energy to 2050Energy to 2050
Energy to 2050
 
Principal engineer perfomance appraisal 2
Principal engineer perfomance appraisal 2Principal engineer perfomance appraisal 2
Principal engineer perfomance appraisal 2
 
Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo...
Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo...Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo...
Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo...
 
2017 Florida Data Science for Social Good Big Reveal
2017 Florida Data Science for Social Good Big Reveal2017 Florida Data Science for Social Good Big Reveal
2017 Florida Data Science for Social Good Big Reveal
 
Data Scientist/Engineer Job Demand Analysis
Data Scientist/Engineer Job Demand AnalysisData Scientist/Engineer Job Demand Analysis
Data Scientist/Engineer Job Demand Analysis
 
Logistic executive perfomance appraisal 2
Logistic executive perfomance appraisal 2Logistic executive perfomance appraisal 2
Logistic executive perfomance appraisal 2
 
Computer software engineer performance appraisal
Computer software engineer performance appraisalComputer software engineer performance appraisal
Computer software engineer performance appraisal
 

Ähnlich wie How to Be a Productive Data Engineer

A Self Funding Agile Transformation
A Self Funding Agile TransformationA Self Funding Agile Transformation
A Self Funding Agile TransformationDaniel Poon
 
Wait A Moment? How High Workload Kills Efficiency! - Roman Pickl
Wait A Moment? How High Workload Kills Efficiency! - Roman PicklWait A Moment? How High Workload Kills Efficiency! - Roman Pickl
Wait A Moment? How High Workload Kills Efficiency! - Roman PicklPROIDEA
 
Growth Hacking Conference '17 - Antwerp
Growth Hacking Conference '17 - AntwerpGrowth Hacking Conference '17 - Antwerp
Growth Hacking Conference '17 - AntwerpThibault Imbert
 
CppCat, an Ambitious C++ Code Analyzer from Tula
CppCat, an Ambitious C++ Code Analyzer from TulaCppCat, an Ambitious C++ Code Analyzer from Tula
CppCat, an Ambitious C++ Code Analyzer from TulaAndrey Karpov
 
Industry stories on agile, scrum and kanban
Industry stories on agile, scrum and kanbanIndustry stories on agile, scrum and kanban
Industry stories on agile, scrum and kanbanBusiness901
 
How spotify builds products
How spotify builds productsHow spotify builds products
How spotify builds products양미 김
 
The Open Commerce Conference - Premature Optimisation: The Root of All Evil
The Open Commerce Conference - Premature Optimisation: The Root of All EvilThe Open Commerce Conference - Premature Optimisation: The Root of All Evil
The Open Commerce Conference - Premature Optimisation: The Root of All EvilFabio Akita
 
DevOps Transition Strategies
DevOps Transition StrategiesDevOps Transition Strategies
DevOps Transition StrategiesAlec Lazarescu
 
Kanban discussion with David Anderson
Kanban discussion with David AndersonKanban discussion with David Anderson
Kanban discussion with David AndersonBusiness901
 
The Heek Product Cycle
The Heek Product CycleThe Heek Product Cycle
The Heek Product CycleHeek Team
 
Lean Blog Podcast #115 - Mark Graban Interviews Eric Ries on "The Lean Startup"
Lean Blog Podcast #115 - Mark Graban Interviews Eric Ries on "The Lean Startup"Lean Blog Podcast #115 - Mark Graban Interviews Eric Ries on "The Lean Startup"
Lean Blog Podcast #115 - Mark Graban Interviews Eric Ries on "The Lean Startup"Mark Graban
 
[DSC Adria 23] Radovan Bacovic Steal Our Knowledge Please.pptx
[DSC Adria 23] Radovan Bacovic Steal Our Knowledge Please.pptx[DSC Adria 23] Radovan Bacovic Steal Our Knowledge Please.pptx
[DSC Adria 23] Radovan Bacovic Steal Our Knowledge Please.pptxDataScienceConferenc1
 
The nuts bolts_of_prudal (2)
The nuts bolts_of_prudal (2)The nuts bolts_of_prudal (2)
The nuts bolts_of_prudal (2)Michael Grigsby
 
Jr devsurvivalguide
Jr devsurvivalguideJr devsurvivalguide
Jr devsurvivalguideJames York
 
Front Porch Keynote 2014
Front Porch Keynote 2014Front Porch Keynote 2014
Front Porch Keynote 2014amboy00
 
SaltStack For DevOps, Free Sample
SaltStack For DevOps, Free SampleSaltStack For DevOps, Free Sample
SaltStack For DevOps, Free SampleAymen EL Amri
 
The nuts bolts_of_prudal (1)
The nuts bolts_of_prudal (1)The nuts bolts_of_prudal (1)
The nuts bolts_of_prudal (1)Michael Grigsby
 
Scaling Product Development at a
Scaling Product Development at a Scaling Product Development at a
Scaling Product Development at a James Birchler
 

Ähnlich wie How to Be a Productive Data Engineer (20)

A Self Funding Agile Transformation
A Self Funding Agile TransformationA Self Funding Agile Transformation
A Self Funding Agile Transformation
 
Wait A Moment? How High Workload Kills Efficiency! - Roman Pickl
Wait A Moment? How High Workload Kills Efficiency! - Roman PicklWait A Moment? How High Workload Kills Efficiency! - Roman Pickl
Wait A Moment? How High Workload Kills Efficiency! - Roman Pickl
 
Growth Hacking Conference '17 - Antwerp
Growth Hacking Conference '17 - AntwerpGrowth Hacking Conference '17 - Antwerp
Growth Hacking Conference '17 - Antwerp
 
CppCat, an Ambitious C++ Code Analyzer from Tula
CppCat, an Ambitious C++ Code Analyzer from TulaCppCat, an Ambitious C++ Code Analyzer from Tula
CppCat, an Ambitious C++ Code Analyzer from Tula
 
Industry stories on agile, scrum and kanban
Industry stories on agile, scrum and kanbanIndustry stories on agile, scrum and kanban
Industry stories on agile, scrum and kanban
 
How spotify builds products
How spotify builds productsHow spotify builds products
How spotify builds products
 
The Open Commerce Conference - Premature Optimisation: The Root of All Evil
The Open Commerce Conference - Premature Optimisation: The Root of All EvilThe Open Commerce Conference - Premature Optimisation: The Root of All Evil
The Open Commerce Conference - Premature Optimisation: The Root of All Evil
 
DevOps Transition Strategies
DevOps Transition StrategiesDevOps Transition Strategies
DevOps Transition Strategies
 
Kanban discussion with David Anderson
Kanban discussion with David AndersonKanban discussion with David Anderson
Kanban discussion with David Anderson
 
The Heek Product Cycle
The Heek Product CycleThe Heek Product Cycle
The Heek Product Cycle
 
Lean Blog Podcast #115 - Mark Graban Interviews Eric Ries on "The Lean Startup"
Lean Blog Podcast #115 - Mark Graban Interviews Eric Ries on "The Lean Startup"Lean Blog Podcast #115 - Mark Graban Interviews Eric Ries on "The Lean Startup"
Lean Blog Podcast #115 - Mark Graban Interviews Eric Ries on "The Lean Startup"
 
[DSC Adria 23] Radovan Bacovic Steal Our Knowledge Please.pptx
[DSC Adria 23] Radovan Bacovic Steal Our Knowledge Please.pptx[DSC Adria 23] Radovan Bacovic Steal Our Knowledge Please.pptx
[DSC Adria 23] Radovan Bacovic Steal Our Knowledge Please.pptx
 
The nuts bolts_of_prudal (2)
The nuts bolts_of_prudal (2)The nuts bolts_of_prudal (2)
The nuts bolts_of_prudal (2)
 
DevOps for Managers
DevOps for ManagersDevOps for Managers
DevOps for Managers
 
Jr devsurvivalguide
Jr devsurvivalguideJr devsurvivalguide
Jr devsurvivalguide
 
Big Data and Hadoop in the Cloud
Big Data and Hadoop in the CloudBig Data and Hadoop in the Cloud
Big Data and Hadoop in the Cloud
 
Front Porch Keynote 2014
Front Porch Keynote 2014Front Porch Keynote 2014
Front Porch Keynote 2014
 
SaltStack For DevOps, Free Sample
SaltStack For DevOps, Free SampleSaltStack For DevOps, Free Sample
SaltStack For DevOps, Free Sample
 
The nuts bolts_of_prudal (1)
The nuts bolts_of_prudal (1)The nuts bolts_of_prudal (1)
The nuts bolts_of_prudal (1)
 
Scaling Product Development at a
Scaling Product Development at a Scaling Product Development at a
Scaling Product Development at a
 

Kürzlich hochgeladen

Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...GQ Research
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 

Kürzlich hochgeladen (20)

Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 

How to Be a Productive Data Engineer

  • 1. How to Be Productive Data Engineer Rafal Wojdyla - rav@spotify.com Note: My views are my own and don't necessarily represent those of Spotify. Speaker’s cut version – includes raw notes
  • 2. Hi – my name is Rafal and I’m an engineer at Spotify, in this presentation I will talk about how to be a productive data engineer. I will combine knowledge of multiple productive engineers at Spotify and touch different areas of our your daily work life. I will use real world examples, failures, success stories – but mostly failures. So if you are or want to be a data engineer hopefully after this presentation every single one of you will learn something now. I hope this learning will improve your productivity, bring new feature to your infrastructure or maybe spark a discussion inside your team.
  • 3. • Operations • Development • Organization • Culture
  • 4. We will go through lessons of productive data engineer and cover 4 different areas – operations, development, organization and culture. So we will kinda work our way from low level admin tips and spectacular disasters, after hard core operation we will talk about development on Hadoop – what to avoid, how Spotify is overcoming huge problem of legacy Hadoop tools. After development part we will take a look at how organization structure can affect your productivity and how one can tackle this problem. We will finish with ubiquitous culture – how culture can help in being productive. So in a way we start with low level scope of you productivity, how the cluster itself can affect you, how operators can help you out, to later talk about your development decision, structure of company and finish with how environment can influence your work. There will be time for questions at the end so please keep them till the end.
  • 5. What is Spotify? For everyone: • Streaming Service • Launched in October 2008 • 60 Million Monthly Users • 15 Million Paid Subscribers + and for me: • 1.3K nodes Hadoop cluster
  • 6. Before we go deep into presentation let’s first talk about what spotify is. Spotify is a streaming service, launched in 2008 in beautiful Stockholm, Sweden. Current public numbers are that we have 60M monthly users, 15M subscribers. And what’s unique about Spotify service is that it can play a perfect song for every single moment, and some of this is powered through Hadoop which makes it even cooler! For me Spotify is also 1.3K Hadoop nodes – which is like a baby for a team of 4 people. A baby that is sometimes very frustrating, shit happens all the time and you have to wake up in the middle of the night and clean it up, but it’s our baby and we love it. Without further ado let’s move to the core of the topic and start with operations. If there’s one lesson that comes from operating hadoop clusters from a handful of nodes in the corner of the office to 1300 nodes – it will be AUTOMATION.
  • 8. Automation is crucial – especially if talking about Hadoop. Hadoop is huge beast to manage, there is loads of moving parts, loads of new stuff coming in and there’s always a reason for hadoop clusters to go down. If hadoop was not enough there’s always something that your company will push on poor operators – whether it’s a new linux distro or maybe there’s a bug in libc and you need to restart all the daemons and so on and so on.
  • 10. You want to be proactive and do as little as possible – without automation even coffee won’t help you. You want to be Adam – be happy and work on new features, enhance hadoop – bring joy to hadoop users. You don’t want to be the poor operator on the left, focused primarily on putting out fires, exhausted. Btw this is the picture of Adam and me after 40 hours of hadoop upgrade from Hadoop 1 to 2 in 2013.
  • 12. So how to reach good enough automation of your cluster – let’s take Spotify as example first – Spotify started with hadoop in 2009, very early, then there was a couple of tiny expansions, a short episode of hadoop in EMR, and we went back to on premise with shinny new 60 nodes – at that point we had to make a decision on how to manage hadoop – and because back then CM was limited and Ambari didn’t exist and cause Spotify loves puppet we have decided to use Puppet for this use case. It was rather big effort and took time, during which we had to drop some work, put out fires and work on Puppet, but it was a great investment. Today after a few iteration we like our puppet – as an example – most recent ongoing expansion is rather easy – we name the machines using proper naming convention and puppet kicks in installs all the services, configuration and keeps the machines in normalized state – very very important piece of our infrastructure. But wait – the slide says something aboutAmbari and CM – yes – cause if we were to set up a cluster today, we would most likely evaluate at least these two solution. Like I said Spotify basically didn’t have a choice and we settled on puppet and we are happy about it right now, but there’s huge leverage you can gain out of using these tools, loads of features that you get out of the box, that we need to implement ourselves. So if you are considering building hadoop cluster – make sure to give these tools a good try, they may not solve all your issues and use cases but for sure will bring loads of value and in time your will get even more features just from community – which is great, and is something that we are missing.
  • 14. That said – even if you decided to use Ambari of CM – mostly likely you will still need some kind of configuration management tool – whether it will be puppet, or chef or salt or whatever is your favorite – you will need one, there will always be some extra library that you need to install and configure or some user to create and so on. There’s another interesting outcome of us building our own puppet infrastructure – we know exactly how our hadoop is configured – every single piece of it – which comes in handy in case of trouble shooting. In this case we touched a little bit a problem of 3rd party solutions vs let’s implement our own tailored solutions. How many of you are aware of NIH problem?
  • 16. I will argue that there’s a number of cases and teams where this problem occurs at Spotify. NIH problem is an nutshell when you undervalue 3rd party solutions and convince others to implement your own solutions – in most cases this is a huge problem. The lesson that we have learned is that you need to give external tools a try, experiment but don’t expect something to solve all your problems – preferable define metrics of acceptance priori evaluation of tools.
  • 18. But what is actually very interesting in case of data areas is kinda sibling problem on NIH – which is NeIH – a problem described I believe by Michael O Church – it’s kinda opposite approach, it’s when you overvalue 3rd party solutions, and end you in a messy place of glue implementation madness. There’s loads of great tools in BigData areas – not all of the work well with each other, not all of them do well what them meant to be doing. I urge you to be critical, sometimes implementation of your own tool or postponing a new, shinny framework from infrastructure may be a good thing to do – but it has to be a data driven decision that brings value. Think about this two problems, and ask yourself are there examples of such solutions at your company?
  • 20. To illustrate this I will tell you a real story – a story of great failure and success at the end. So we had an external consultant at Spotify – and his goal was to certify our cluster – basically 4 days of looking at different corner of our infrastructure. First two days went really smooth, we went through our configuration, state of the cluster and so on, we could not find a way to improve our cluster easily – which made as feel proud, cause you know, we have this world class, talented hadoop expert over and he can’t find a way to improve our cluster right? But oh boy was that a big mistake – so on day number 3, we are sitting in a room, whole team and consultant and due to miscommunication and misconfiguration our standby NN and RM go down – but that is still fine, cause RM starts in minute or two, standby can start in background – but unfortunately during the troubleshooting by mistake we have killed our active NN – at this point basically whole infrastructure was down – at our scale that means about 2 hours of downtime. It was bad! But wait for day number 4 – so next day we are sitting in the room, again whole team, consultant but also our managers and we listen to consultant saying that our testing and deployment procedures are like Wild Wild West and we act like cowboys – it was hard to listen to but he was right and we knew it. Next thing we do, was to go to a room with the team and come up something to solve this issue, we came up with something that may be obvious – a preproduction cluster – a cluster made out of the same machine profile and almost identical configuration which we will use for testing. But how to test was a real question. We went into research mode and started reading and watching presentation – we were especially impressed by tool called HIT by Yahoo, so we contacted the creators, unfortunately there was no plan to open source it – but they gave us a nice tip – look atApache Bigtop.
  • 22. Apache Bigtop primarily facilities building, testing and deployment of hadoop distribution – but you can also us it in a slightly different way – you can point you bigtop at your preproduction cluster and use its smoke tests to test the infrastructure. So our current flow of testing and deployment is to first deploy to preproduction cluster, run bigtop tests – get instant feedback about the change, if feedback is fine deploy to production if not – there’s something wrong with the change and we know that before it is deployed to production. Some findings from using BigTop is that it’s actually very easy to extend so we were able to add smoke tests for our own tools like snakebite and luigi but also what is very important we also run some production workloads as part of smoke tests – which actually makes us feel sure about the change. So in case of Apache Bigtop the problem was testing of hadoop infrastructure – even tho Bigtop is not perfect for this it provides loads of value just out of the box and thus it’s a great example of preventing NIH problem.
  • 24. As an operator there are many ways to help yourself and also delegate some of the work to developers themselves – one disabled by default, but great feature of Hadoop 2 is log aggregation – how many of you have log aggregation enable on your cluster? So in a nutshell this feature will aggregate yarn logs from workers and store them on HDFS for inspection, very useful for troubleshooting. So most of you probably know how to enable it right?
  • 25. To enable log aggregation yarn.log-aggregation-enable = true yarn.log-aggregation.retain-seconds = ?
  • 26. It’s dead simple. But there’s one question – how long should we keep the logs for? So we thought about it for a while, talked with HW a little bit and since we have huge cluster why not story it for long time – maybe we will need these logs for some analytics etc.
  • 27. + <property> + <name>yarn.log-aggregation-enable</name> + <value>true</value> + </property> + + <property> + <name>yarn.log-aggregation.retain-seconds</name> + <value>315569260</value> + <!--retention: 10 years--> + </property>
  • 28. This is our initial change to configuration – 10 years. Does anyone know what bad can happen if you do that?
  • 30. If you run enough jobs/tasks after some time you will see something like this in your NN – and when you see something like this in your NN then you end up with hellelepahant!
  • 32. It’s a situation when your hadoop cluster spectacularly goes down. What happen is that log aggregation will in time create many files, very important consequence of many files on HDFS is growing heap size, and then you get out of memory on NN. The lesson is that it’s “good idea” to alert on heap usage for your master daemons but also understand your configuration and its consequences, keep yourself up to data on the changes in configuration, read code of hadoop configuration keys and how they are connected between each other – no all configuration parameters are documented.
  • 33. Custom logs • Profiling • Garbage collection
  • 34. It’s a situation when your hadoop cluster spectacularly goes down. What happen is that log aggregation will in time create many files, very important consequence of many files on HDFS is growing heap size, and then you get out of memory on NN. The lesson is that it’s “good idea” to alert on heap usage for your master daemons but also understand your configuration and its consequences, keep yourself up to data on the changes in configuration, read code of hadoop configuration keys and how they are connected between each other – no all configuration parameters are documented.
  • 36. There are a couple of interesting lessons about productive development – the first, and arguably most important, is to pick the right tool for the job – ask what the most important value to bring right now is. Let’s talk Spotify – in 2009 Spotify started with hadoop streaming as the supported framework for MR development – hadoop streaming basically enables you to implement MR jobs in languages other than java – and for many years it was THE framework, because Spotify loved python and it enabled us to iterate faster and thus provide knowledge for our business. Time was passing and our hadoop cluster was growing – in time we needed something different, something better when it comes to performance but also maturity. After a long evaluation – and I encourage you to watch the presentations by David Whiting about different frameworks – we decided to use Apache Crunch as the supported framework for batch MR. Why? A couple of reasons – first, ease of testing and type safety. David’s presentations: * https://www.youtube.com/watch?v=XKY_0s7pESQ * https://www.youtube.com/watch?v=-WOiZ2w7xtI
  • 37.
  • 38. This graph shows the number of successful and failed jobs divided by framework over 6 months – and these are production jobs. As you can see, the two most popular frameworks are hadoop streaming and crunch – but the difference between failed and successful jobs is crucial. Crunch jobs behave much better and have better testing. Type safety helps to discover problems at compile time, and the testing framework that comes with crunch – which we were able to enhance with the hadoop minicluster – helps users to easily test their jobs. It basically makes testing easy, something we missed for our hadoop streaming jobs. But performance is another thing.
  • 39.
  • 40. On this graph we can see map throughput for apache crunch and hadoop streaming – these are production workloads over 6 months – and there’s a huge difference. Crunch turns out to be on average 8 times faster; 75% of all crunch jobs are faster than all hadoop streaming jobs. What is more interesting is that we actually see higher utilization of our cluster the more crunch jobs run on it – which makes us super happy.
  • 42. Another thing that crunch provides is great abstraction – and that is another thing a productive developer needs to keep in mind: pick the right abstraction for the job. In the case of crunch we can start thinking in terms of high level operations like filter, groupBy, joins and so on instead of the old map/reduce legacy (see the sketch below). This makes implementations more intuitive and simply pleasant – and thus makes the developer experience much better. The interesting thing we have observed is that a higher abstraction may remove some of the opportunities for optimization, so it’s not as easy to implement the best performing job – but on the other hand it reduces the problem of premature optimization, and on average it performs really well. There are very few people at Spotify who actually know how to optimize pure java MR jobs or Hadoop streaming jobs – but the average performance we get from crunch turns out to be really good, as you could see on the performance graph.
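  Crunch itself is a Java API, so to keep the examples here in python, here is roughly the same style of abstraction expressed with Spark’s high-level operations – this illustrates the “filter, then count per key” way of thinking, not Crunch code itself, and the paths and field layout are hypothetical:

    from pyspark import SparkContext

    sc = SparkContext(appName='plays-per-track')

    # Hypothetical input: tab-separated play log lines of "track_id<TAB>ms_played".
    plays = sc.textFile('hdfs:///logs/plays/2015-05-01')

    counts = (plays
              .map(lambda line: line.split('\t'))
              .filter(lambda fields: int(fields[1]) >= 30000)  # keep plays >= 30s
              .map(lambda fields: (fields[0], 1))
              .reduceByKey(lambda a, b: a + b))                # plays per track

    counts.saveAsTextFile('hdfs:///agg/plays/2015-05-01')

  The job reads as a chain of declarative operations; how they compile down to map and reduce phases is the framework’s problem, not the developer’s.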
  • 44. We do have loads of nodes – and we have scaling machines nailed down; crunch scales very well. But there’s a big problem that we currently have: scaling people. How do you scale support and best practices? We constantly see problems with code repetition, HDFS mess, lack of data management, YARN resource contention – all of this brings our productivity down. There’s not enough time to go through all of them, but some of these problems we are trying to tackle with nothing other than our beloved automation. Let’s see examples:
  • 45. Automation • Map split size • Number of reducers • HDFS data retention • User feedback (ongoing)
  • 46. We automate map split size calculation and thus the number of map tasks, but also the number of reducers and therefore the number and size of output files – all of this is done by estimation on historical data using our workflow manager Luigi, which I encourage you to take a look at (see the sketch below)! Luigi github: https://github.com/spotify/luigi We are about to finish our second iteration of an HDFS retention policy that will automatically remove data, reduce HDFS usage and, in the long term, hopefully reduce the HDFS legacy mess. Another ongoing effort is the second iteration of automatic user feedback – we already expose a database with aggregated information about all MR jobs, which our users can query to learn how their jobs are performing – but we also plan another, very simple iteration focused on Crunch, which right after a workflow pipeline is done will provide the user with instant feedback – memory usage, garbage collection and so on – very simple tweaks users can apply to improve their jobs. For example, if a user gives a pipeline 8GB of memory for each task and, going through the counters, we see that tasks are actually using only 3GB max, instant feedback to reduce memory could improve multitenancy of your cluster and thus improve productivity.
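  To make the reducer-count automation concrete, here is a minimal sketch in the spirit of what Luigi allows – the task, the paths and the one-gigabyte-per-reducer heuristic are all hypothetical, and in a real setup the input size would be estimated from HDFS stats or historical runs rather than a constant:

    import luigi
    import luigi.contrib.hadoop  # named luigi.hadoop in older releases
    import luigi.contrib.hdfs

    GB = 1024 ** 3

    class PlayLogs(luigi.ExternalTask):
        # Hypothetical, externally produced input data.
        def output(self):
            return luigi.contrib.hdfs.HdfsTarget('/logs/plays/2015-05-01')

    class PlaysPerTrack(luigi.contrib.hadoop.JobTask):
        def requires(self):
            return PlayLogs()

        def output(self):
            return luigi.contrib.hdfs.HdfsTarget('/agg/plays/2015-05-01')

        @property
        def n_reduce_tasks(self):
            # Target roughly 1GB per reducer; a real implementation would
            # derive the estimate from HDFS or from previous runs.
            estimated_input_bytes = 50 * GB
            return max(1, estimated_input_bytes // GB)

        def mapper(self, line):
            track_id = line.split('\t')[0]
            yield track_id, 1

        def reducer(self, key, values):
            yield key, sum(values)

  The point is that the sizing decision lives in one place in the workflow manager, instead of every developer hand-tuning every job.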
  • 48. With that, let’s talk organization structure – how it can improve your performance – but before that, let’s take a look at a graph.
  • 49.
  • 50. This graph shows Hadoop availability by quarter at Spotify – higher is of course better. OK – so let’s see what happened here:
  • 52. In the first part, the hadoop cluster was ownerless; it had best-effort support from a team of people who mostly didn’t even want to do hadoop operations. Therefore multiple days of downtime happened, and the infrastructure was in bad shape, denormalized – overall a terrible state to be in. But there was a light at the end of the tunnel – in Q3 we decided to create a squad – 3 people focused solely on hadoop infrastructure.
  • 54. There was instant feedback right after the squad was created – users were happy and the infrastructure was getting in shape. One of the first decisions we made was to move to yarn in Q4.
  • 56. In Q4 and the beginning of ’14 we again saw a drop in availability, mostly due to a huge upgrade and its consequences thereafter. The upgrade itself took a whole weekend, and after the upgrade we saw many issues and fires that we had to put out; during this time we were mostly reactive, but also working on polishing our puppet manifests. The whole situation stabilized after most fires were gone and puppet was in good shape.
  • 58. Our goal is to keep hadoop at 3 nines of availability – and we have been getting there since Q2 2014. The hadoop squad is receiving constant feedback from users, and it’s common to hear that availability was drastically improved, which improved productivity and the overall experience – which is great and makes us want to work even harder to achieve better results. As you can see, there’s a small drop at the beginning of Q1 2015 – does anyone know why, or can guess?
  • 60. With that, let’s now talk about what surrounds us – the culture. I strongly believe the culture at Spotify has a huge influence on productivity – there are three main pillars of this culture.
  • 62. Experiment, fail fast and embrace failure. We love to experiment and we have time to experiment – whether it’s the company-wide hack week or R&D days – if one wishes to experiment, there’s time to do that, and with loads of curious people at Spotify there’s always something going on. The most successful data-based experiments are Luigi – a hadoop workflow manager – and snakebite – a pure python hdfs client – I encourage you to take a look at them. Fail fast – take small steps – don’t be afraid to admit failure; keep it as part of the learning process, to the point of embracing it. Talk about your failures, share them publicly, for example through presentations both internally and externally – it will make experimentation, and thus innovation, flow much smoother. To back this up with an example, let’s talk about one ongoing experiment. Luigi github: https://github.com/spotify/luigi Snakebite github: https://github.com/spotify/snakebite
  • 64. Spark – it’s pretty much an ongoing experiment that we come back to every now and then, but officially it’s not welcome on the production cluster due to immaturity and poor multitenancy support. That said, the most recent releases (>1.3) are very promising; we are constantly playing with it and have high hopes for it, especially for the most recent dynamic resource allocation feature. There’s not much time left, but I would like to share with you two important lessons from our evaluation of a heavy spark job.
  • 65. Spark
  spark.storage.memoryFraction (0.6)
  spark.shuffle.memoryFraction (0.2)
  In shuffle-heavy algorithms reduce the cache fraction in favour of shuffle.
  • 66. The first hint is about memory settings – there are two important settings that can improve the stability of your heavy Spark jobs: memory available for caching – storage.memoryFraction – and memory available for shuffle – shuffle.memoryFraction. The defaults are .6 and .2, leaving .2 for the runtime. In our case we had a heavy machine learning job that was doing almost a terabyte of shuffle – but (proportionally) very little caching. Initially we had issues with the shuffle step, but reducing storage memory and leaving extra memory for shuffle and the runtime improved stability (see the sketch below).
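  As an illustration, a shuffle-heavy job of this kind could be submitted with the split shifted toward shuffle – the exact values here are illustrative and depend on the job, and these two properties belong to the Spark 1.x era (later versions replaced them with unified memory management):

    spark-submit \
      --conf spark.storage.memoryFraction=0.2 \
      --conf spark.shuffle.memoryFraction=0.5 \
      my_heavy_job.py   # hypothetical job

  The remaining fraction is what the runtime gets, so don’t squeeze it below the default when reshuffling the split.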
  • 68. Another issue that we hit was long GC pauses – executors would appear to disappear, which in turn triggers recomputation and, in the end, potentially application failure. After tweaking the heartbeat interval and the ack.wait timeout we saw improved stability, and even though GC pauses still occurred, they were less harmful (see the sketch below).
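  The talk doesn’t name the exact keys, but in Spark 1.x the relevant settings were along these lines – the values are illustrative, and names and units vary between versions, so verify against your own:

    spark.executor.heartbeatInterval          30s    # default 10s; raise so a long GC pause isn't mistaken for a dead executor
    spark.core.connection.ack.wait.timeout    600    # seconds, default 60; give acks time to survive GC pauses

  The idea in both cases is the same: widen the windows the framework uses to declare an executor dead, so a stop-the-world pause doesn’t cascade into recomputation.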
  • 69. Learnings • Operations → Automation • Development → Abstraction • Organization → Team • Culture → Experiment
  • 70. Join the band Engineers wanted in NYC & Stockholm http://spotify.com/jobs
