Puppet is widely known in the DevOps community but is less popular among data teams. Nevertheless, Puppet can easily empower your data teams. This talk presents hands-on experience of using Puppet for different data topics, starting with configuring a Windows machine for Business Intelligence and finishing with an advanced ranking infrastructure based on Puppet.
The talk walks you through setting up a standalone Puppet configuration that is used to provision a Windows machine for Business Intelligence purposes such as Tableau and Talend Big Data configurations, ETL scheduling, etc. The second part of the talk covers a use case of Puppet for enabling a lean ranking infrastructure.
Helping Data Teams with Puppet / Puppet Camp London - Apr 13, 2015
1. STYLIGHT.COM
Helping Data Teams with Puppet
Sergii Khomenko, Data Scientist,
sergii.khomenko@stylight.com, @lc0d3r
2. AGENDA
Who? What? Why?
Setting up your BI with Puppet.
Small tips and tricks
Puppet your ranking
3. Data scientist at one of the biggest fashion communities,
STYLIGHT. Data analysis and visualization hobbyist.
Speaker at Berlin Buzzwords 2014, ApacheCon Europe 2014
Founder and speaker at Munich Golang UG, Munich Tableau UG.
Speaker at Munich UseR Group, Munich Search UG, Munich
Quantified Self UG.
Sergii Khomenko
Milos Radovanovic
Passionate about DevOps stuff:
1. microservices
2. docker
3. 12 factor apps
4. continuous integration/deployment
6. Live in 12 countries
STYLIGHT – international community
7. STYLIGHT.COM
Setting up your BI with Puppet.
8. Tableau - reporting and ad-hocs
Python/Talend ETL tools
Minimum Viable BI
9. RUNNING PUPPET IN A STANDALONE MODE
Minimum Viable BI
We use Puppet for *nix servers and can't merge that setup with the Windows machines
Standalone mode for Puppet
– easier to start and develop
– Windows machines are separated from *nix ones
10. RUNNING PUPPET IN A STANDALONE MODE
Minimum Viable BI
REM Pull the latest configuration, then apply it locally
cd c:\folder\with\our-bi
git pull origin master
IF %ERRORLEVEL% NEQ 0 set context=GIT_FAILURE && goto error_handler
puppet apply --modulepath=puppet\modules puppet\win-node-name.net.pp
IF %ERRORLEVEL% NEQ 0 set context=PUPPET_FAILURE && goto error_handler
goto end
11. RUNNING PUPPET IN A STANDALONE MODE
Minimum Viable BI
:error_handler
echo entering error_handler
EVENTCREATE /T ERROR /L APPLICATION /SO Puppet_Scheduler /ID 100 /D "EXECUTION FAILED REASON %context%"
goto end
:end
echo DONE
12. Minimum Viable BI
Standalone mode for Puppet
– configuration is totally separated
– custom modules --modulepath=puppet\modules
– GitHub-hosted configuration
– Error handling via Windows event log
RUNNING PUPPET IN A STANDALONE MODE
13. Minimum Viable BI
node 'win-node-name.net' {
  scheduled_task { 'refresh-1':
    ensure    => present,
    enabled   => true,
    command   => 'C:\path\to\your\script.bat',
    arguments => 'some args ',
SCHEDULING IS IMPORTANT
14. Minimum Viable BI
    user      => 'your-user',
    password  => 'your-password',
    trigger   => {
      schedule   => daily,
      start_time => '06:00',
    }
  }
}
SCHEDULING IS IMPORTANT
15. Minimum Viable BI
# Can't use Puppet's scheduled_task, as it does not support running a scheduled task every 5 minutes.
https://github.com/sdliangzhihua/windows-puppet-example/blob/master/manifest.pp#L68
SYNC MY CONFIGURATION EVERY 15 MIN
16. Minimum Viable BI
$cmd = 'C:\Windows\system32\cmd.exe'
$job_name = 'sync_code'
# Create the task only when "schtasks /query" reports that it does not exist yet
exec { 'CreateCodeSyncScheduledTask':
  command => "${cmd} /C schtasks /create /sc MINUTE /mo 15 /tn ${job_name} /tr C:\\your\\puppet.bat /ru administrator /f",
  onlyif  => ["${cmd} /C schtasks /query /tn ${job_name} & if errorlevel 1 (exit /b 0) else exit /b 1"],
}
SYNC MY CONFIGURATION EVERY 15 MIN
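The exec workaround above reflects the scheduled_task type as it was at the time of the talk. As a hedged alternative sketch: later Puppet releases and the puppetlabs-scheduled_task module document a minutes_interval key in the trigger hash, which would let the same 15-minute sync be declared natively. Whether that key is available on your Puppet version is an assumption to verify, and the script path, user, password and start time below are placeholders, not values from the talk.

```puppet
# Sketch only: assumes a scheduled_task implementation whose trigger
# hash supports minutes_interval. All values are hypothetical.
scheduled_task { 'sync_code':
  ensure   => present,
  enabled  => true,
  command  => 'C:\bi\run-puppet.bat',  # hypothetical wrapper script
  user     => 'your-user',
  password => 'your-password',
  trigger  => {
    schedule         => daily,
    start_time       => '00:00',
    minutes_interval => 15,            # repeat every 15 minutes
  },
}
```

If the key is not supported, the exec-based approach from the slide remains the fallback.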
17. STYLIGHT.COM
Small tips and tricks
do not repeat yourself and other tricks
18. Minimum Viable BI
node 'win-node-name.net' {
  scheduled_task { 'refresh-1':
    ensure    => present,
    enabled   => true,
    command   => 'C:\path\to\your\script.bat',
    arguments => 'some args ',
SCHEDULING IS IMPORTANT
19. Small tips and tricks
class job_scheduler (
  $ensure      = $job_scheduler::params::ensure,
  $enabled     = $job_scheduler::params::enabled,
  $user        = $job_scheduler::params::user,
  $password    = $job_scheduler::params::password,
  $working_dir = $job_scheduler::params::working_dir,
) inherits job_scheduler::params {
}
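The class above takes all its defaults from job_scheduler::params, which the slides do not show. A minimal sketch of what such a params class could look like follows; every value in it is hypothetical and chosen only for illustration.

```puppet
# Hypothetical defaults for job_scheduler; none of these values
# come from the talk.
class job_scheduler::params {
  $ensure      = 'present'
  $enabled     = true
  $user        = 'your-user'
  $password    = 'your-password'
  $working_dir = 'C:\bi\jobs'
}
```

Keeping the shared account and working directory in one place means each job definition only has to state what differs, such as its command and start time.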
20. Small tips and tricks
define job_scheduler::job
(
  $arguments     = 'tableau_adobe.py',
  $command       = 'c:\Py27-32\python.exe',
  $schedule_type = 'daily',
  $start_time    = '08:15',
  $day_of_week   = 'every',
)
{
21. Small tips and tricks
define job_scheduler::tableau_job
(
  $arguments     = 'default-tableau',
  $command       = 'c:\folder\tableau.bat',
  $schedule_type = 'daily',
  $start_time    = '21:00',
  $day_of_week   = 'every',
)
{
22. Small tips and tricks
# Params with default values for the tableau job
# that might be changed in a job definition
#
# 1. $arguments ='default-argument',
# 2. $command ='c:\folder\script.bat',
# 3. $schedule_type ='daily',
# 4. $start_time ='21:00',
# 5. $day_of_week ='every',
####################
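The slides show only the parameter lists of these defined types, not their bodies. One plausible body, sketched under the assumption that each job simply wraps Puppet's scheduled_task and reads the shared settings from the job_scheduler class, might look like this. The body is an illustration rather than the author's implementation, and the $day_of_week parameter is left out because its handling is not shown in the talk.

```puppet
# Sketch of a possible body for job_scheduler::job (assumed, not
# taken from the slides).
define job_scheduler::job (
  $arguments     = 'tableau_adobe.py',
  $command       = 'c:\Py27-32\python.exe',
  $schedule_type = 'daily',
  $start_time    = '08:15',
) {
  # Make the shared settings (user, password, ...) available.
  include job_scheduler

  scheduled_task { $title:
    ensure      => $job_scheduler::ensure,
    enabled     => $job_scheduler::enabled,
    command     => $command,
    arguments   => $arguments,
    working_dir => $job_scheduler::working_dir,
    user        => $job_scheduler::user,
    password    => $job_scheduler::password,
    trigger     => {
      schedule   => $schedule_type,
      start_time => $start_time,
    },
  }
}
```

With a definition like this, a new job needs only a title and the parameters that differ from the defaults, for example `job_scheduler::job { 'nightly export': start_time => '02:30' }` (a hypothetical job name).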
24. Small tips and tricks
job_scheduler::redshift_job {
  'RS tagged products': start_time => '00:40', params => '..\datasources\something.tds';
  'RS another job': start_time => '00:50', params => '..\datasources\else.tds'
}
25. STYLIGHT.COM
Puppet your ranking
Lean, flexible, powerful
26. A ranking is a relationship between a set of items such that, for any two items, the first is either 'ranked higher than', 'ranked lower than' or 'ranked equal to' the second.
27. Ranking specifics:
• Seasonal influence
• Trends
• Cold start of new countries, shops
• Multiple dimensions of ranking model
28. Requirements:
• Decreasing time to implement new ranking
model
• Keeping working infrastructure alive
• A/B testing without changing entire
infrastructure
• Performance level - “still fast” and
“transparent”
Lean approach to Ranking
Multiple points of evaluation
33. Lean approach to Ranking
<% urls.each do |url| -%>
if ($args ~* <% if url['gender'] > 0 -%>gender_id%3A<%= url['gender'] %>.*<% end -%><% url['tags'].each do |tag| -%>tag_id%3A<%= tag %>.*<% end -%><% if url['brand'] > 0 -%>brand_id%3A%28<%= url['brand'] %>%29<% end -%>) {
    set $orig $args;
    set $args "q={!boost+b=%24b+defType=dismax+v=%24qq}&qq=id:*";
    rewrite ^(.*)$ "$1?$orig" break;
}
<% end -%>
nginx/templates/conf/solr-rewrites.conf.erb
34. Stages to evaluate a model:
• R ranking model
• Independent Solr-node
1. For internal use-cases
2. Testing for some of pages
3. A/B roll out for % of users
• Production roll out
Lean approach to Ranking
Multiple points of evaluation
36. STYLIGHT.COM
Sergii Khomenko
Data Scientist
STYLIGHT GmbH
sergii.khomenko@stylight.com
@lc0d3r
Nymphenburger Straße 86
80636 Munich, Germany