Project: Cloudera Data Lake Migration
Workflows summary
Existing Cloudera Data Lake workflows in *** as of 11 July 2017 (the number has presumably grown by at least 30% since then):

Category  Description              Count (July 2017)  Count (June 2018, estimate)
PROD      Production workflows     XXX                XXX
UAT       UAT workflows            XXX                XXX
MSC       Miscellaneous workflows  XX                 XX
Total workflows:                   XXX                XXX
Total coordinators:                XXX                XXX

• Most workflows include dependencies (jar files, bash scripts, connection files, Hive/Pig scripts etc.)
• Dependencies are usually stored in the workflow's workspace
• Avro schema files are stored separately
Two Oozie applications work side by side in Cloudera:
• Oozie Engine, which runs a workflow defined in XML format
• Oozie Editor, which runs a workflow defined in JSON format
Development is done in Oozie Editor, so all workflows should be migrated in JSON format together with all their dependencies.
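Because each workflow has to travel with its dependencies, a first step is listing the files a JSON definition refers to. The following is a minimal Python sketch of that idea, not the POC application itself (which the slides describe as a Scala application). The JSON layout, the file paths and the list of extensions are all hypothetical, since the real Hue export format is not shown in the slides.

```python
import json
from pathlib import PurePosixPath

# File extensions treated as workflow dependencies (an assumption for this sketch).
DEPENDENCY_SUFFIXES = {".jar", ".sh", ".hql", ".pig", ".properties", ".avsc"}


def find_dependencies(node):
    """Recursively collect string values that look like dependency file paths."""
    found = set()
    if isinstance(node, dict):
        for value in node.values():
            found |= find_dependencies(value)
    elif isinstance(node, list):
        for item in node:
            found |= find_dependencies(item)
    elif isinstance(node, str):
        if PurePosixPath(node).suffix.lower() in DEPENDENCY_SUFFIXES:
            found.add(node)
    return found


# Hypothetical export: the real Hue JSON layout is not shown in the slides.
SAMPLE_EXPORT = json.loads("""
{
  "workflow": {
    "name": "daily_load",
    "nodes": [
      {"type": "shell", "script": "/user/team_a/daily_load/run.sh"},
      {"type": "hive", "script": "/user/team_a/daily_load/load.hql"},
      {"type": "spark", "jars": ["/user/team_a/lib/loader.jar"]}
    ]
  }
}
""")

if __name__ == "__main__":
    for path in sorted(find_dependencies(SAMPLE_EXPORT)):
        print(path)
```

Walking the whole document rather than reading fixed keys keeps the sketch independent of the exact export layout, at the cost of possible false positives on strings that merely end in one of the listed extensions.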
Migration options
Cloudera migration tools
• Hue dump + restore
• Dump MySQL database + restore
Third-party tools
• IT consulting services providers
• Cloudera partners etc.
Custom in-house tool
• UiPath or other Robotic Process Automation tool
• Custom Scala application
Tech Enablers are currently working with Cloudera on this option
Current and proposed deployment process
Current
A workflow is developed and tested in the Dev environment
A workflow is committed to GIT
A workflow and its dependencies are deployed in UAT and then in Production
• Deployment is manual
A workflow modified in UAT/Production is committed to GIT
• This step is sometimes skipped/forgotten/missed
Proposed
A workflow is developed and tested in the Dev environment
A workflow is committed to GIT
• GIT always has the latest version
A workflow is automatically deployed in SIT/UAT/Production
• All workflows are read-only in Production
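The risk in the current process is that a workflow changed directly in UAT or Production never makes it back to GIT. One way to detect this is to compare a fingerprint of each deployed definition with the version held in GIT. The Python sketch below illustrates the comparison only; the function names and the dictionary-based inputs are assumptions, and how the definitions are fetched from GIT or from the cluster is left out.

```python
import hashlib
import json


def fingerprint(definition):
    """Hash a workflow definition in a key-order-independent way."""
    canonical = json.dumps(definition, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_drift(git_versions, deployed_versions):
    """Return names of deployed workflows that differ from GIT or are missing from it."""
    drifted = []
    for name, deployed in sorted(deployed_versions.items()):
        in_git = git_versions.get(name)
        if in_git is None or fingerprint(in_git) != fingerprint(deployed):
            drifted.append(name)
    return drifted


if __name__ == "__main__":
    # Hypothetical definitions keyed by workflow name.
    git_versions = {"wf_a": {"name": "wf_a", "steps": [1, 2]}, "wf_b": {"x": 1}}
    deployed_versions = {
        "wf_a": {"steps": [1, 2], "name": "wf_a"},  # same content, different key order
        "wf_b": {"x": 2},                           # edited after deployment
        "wf_c": {},                                 # never committed
    }
    print(find_drift(git_versions, deployed_versions))
```

Under the proposed process such a check should always come back empty, because Production workflows are read-only and every change goes through GIT first.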
Proposed project plan
Phase 1 (POC) - DONE
• Compile a list of all UAT/Production Oozie workflows for Team A
• Download/export their definitions in JSON format
• Map each artefact to its GIT path (optional)
• Automatically deploy all artefacts into a separate Cloudera environment
Phase 2
Once we are happy with Phase 1, we can do the same for all teams. After the latest version of all artefacts has been obtained for all teams, scrum masters will map each artefact to its GIT path.
Phase 3
All downloaded artefacts will be stored in a separate GIT repository (the structure will be decided later) together with all dependencies (bash scripts, options files, Hive/Pig scripts etc.). Jar files will be stored separately in Nexus. A separate application will deploy each artefact automatically to the new Cloudera environment using configurable parameters.
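Phase 3 splits storage between two systems: jar files go to Nexus and everything else goes to the GIT repository. A minimal Python sketch of that routing rule follows. The function name and the example paths are hypothetical, and the sketch only decides where each file belongs; it does not upload anything.

```python
from pathlib import PurePosixPath


def plan_storage(dependency_paths):
    """Route jar files to Nexus and every other dependency to GIT."""
    plan = {"git": [], "nexus": []}
    for path in sorted(dependency_paths):
        is_jar = PurePosixPath(path).suffix.lower() == ".jar"
        plan["nexus" if is_jar else "git"].append(path)
    return plan


if __name__ == "__main__":
    # Hypothetical dependency list for one workflow.
    print(plan_storage(["b/run.sh", "a/app.jar", "c/load.hql"]))
```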
Demonstrated POC application workflow
[Diagram. Components: GIT, JENKINS, NEXUS, SonarQube. Artefacts: workflow/coordinator JSON, dependencies, Spark jar, deployment configuration (YAML), automatic tests. Environments: DEV, SIT, UAT, PROD.]
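The deployment configuration is what lets one workflow definition be deployed to several environments with different names, users and parameters. The Python sketch below shows the general idea with a plain dictionary standing in for the parsed YAML file. Every key in it (name_suffix, owner, parameters) and every value is hypothetical, as the slides do not show the real configuration format.

```python
import copy

# Stand-in for a parsed YAML deployment configuration (hypothetical keys and values).
DEPLOYMENT_CONFIG = {
    "SIT": {"name_suffix": "_sit", "owner": "svc_sit", "parameters": {"db": "lake_sit"}},
    "PROD": {"name_suffix": "", "owner": "svc_prod", "parameters": {"db": "lake"}},
}


def render_for_environment(workflow, environment, config=DEPLOYMENT_CONFIG):
    """Return a copy of the workflow adjusted for the target environment."""
    settings = config[environment]
    rendered = copy.deepcopy(workflow)  # leave the version from GIT untouched
    rendered["name"] = workflow["name"] + settings["name_suffix"]
    rendered["owner"] = settings["owner"]
    rendered.setdefault("parameters", {}).update(settings["parameters"])
    return rendered


if __name__ == "__main__":
    base = {"name": "daily_load", "parameters": {"db": "lake_dev", "retries": 3}}
    print(render_for_environment(base, "SIT"))
```

Returning a modified copy rather than editing in place means the same definition can be rendered for each environment in turn without one deployment leaking settings into the next.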
Benefits
CI/CD POC application (Scala) vs other methods
• No admin privileges needed, only a normal development user with access to the Hue UI
• Works with all Cloudera and Hue versions
• Fully customisable to meet special requirements (PK sync between environments, different/same users, changing workflow/coordinator names/parameters etc.)

Uploaded by Vera Ekimenko

Cloudera migration oozie_hadoop_ci_cd_pipeline
