2. Existing Cloudera Data Lake workflows in *** as of 11 July 2017 (the number has presumably grown by at least 30%
since then)
Category   Description               Count (July 2017)   Count (June 2018, estimate)
PROD       Production workflows      XXX                 XXX
UAT        UAT workflows             XXX                 XXX
MSC        Miscellaneous workflows   XX                  XX
Total workflows:                     XXX                 XXX
Total coordinators:                  XXX                 XXX
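The June 2018 column is an estimate based on the July 2017 counts and the assumed growth of at least 30%. A minimal sketch of that arithmetic follows; the function name and the input value are illustrative only, since the real counts are redacted.

```python
def estimate_count(count_2017: int, growth_percent: int = 30) -> int:
    """Lower-bound estimate: apply the assumed growth and round up.

    Integer arithmetic is used so the rounding is exact.
    """
    return (count_2017 * (100 + growth_percent) + 99) // 100


# 200 is a placeholder, not a figure from the table above.
print(estimate_count(200))
```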
• Most workflows include dependencies (jar files, bash scripts, connection files, Hive/Pig scripts etc.)
• Dependencies usually stored in workflow’s workspace
• AVRO schema files stored separately
There are 2 Oozie applications working back to back in Cloudera:
• Oozie Engine running a workflow defined in XML format
• Oozie Editor running a workflow defined in JSON format
Development is done in the Oozie Editor, so all workflows should be migrated in JSON format together with all their dependencies.
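Migrating a workflow "with all dependencies" means first knowing which files its JSON definition refers to. The sketch below walks a JSON document and collects every string value that looks like a dependency file. The JSON layout, the paths and the suffix list are hypothetical; a real Oozie Editor export is laid out differently and would need its own handling.

```python
import json

# Assumed file types, based on the dependency kinds listed above.
DEPENDENCY_SUFFIXES = (".jar", ".sh", ".hql", ".pig", ".avsc", ".properties")


def find_dependencies(node):
    """Recursively collect string values that look like dependency file paths."""
    found = set()
    if isinstance(node, dict):
        for value in node.values():
            found |= find_dependencies(value)
    elif isinstance(node, list):
        for item in node:
            found |= find_dependencies(item)
    elif isinstance(node, str) and node.lower().endswith(DEPENDENCY_SUFFIXES):
        found.add(node)
    return found


# Hypothetical, simplified definition used only for illustration.
workflow = json.loads("""
{
  "name": "example_wf",
  "nodes": [
    {"type": "shell", "script": "/user/example/wf/run.sh"},
    {"type": "hive", "script": "/user/example/wf/load.hql"},
    {"type": "spark", "jars": ["/user/example/wf/lib/app.jar"]}
  ]
}
""")

for path in sorted(find_dependencies(workflow)):
    print(path)
```

Because the walk does not depend on where a path sits in the document, it tolerates layout changes, at the cost of possibly picking up strings that merely end in one of the suffixes.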
Workflows summary
3. Migration options
Cloudera migration tools
• Hue dump + restore
• Dump MySQL database + restore
Third party tools
• IT consulting services providers
• Cloudera partners etc.
Custom in-house tool
• UiPath or other Robotic Process Automation tool
• Custom Scala application
Tech Enablers are currently working with Cloudera on this option
4. Current and proposed deployment process
Current
• A workflow is developed and tested in the Dev environment
• The workflow is committed to GIT
• The workflow and its dependencies are deployed to UAT and then to Production (deployment is manual)
• A workflow modified in UAT/Production is committed to GIT (this step is sometimes skipped, forgotten or missed)
Proposed
• A workflow is developed and tested in the Dev environment
• The workflow is committed to GIT (GIT always has the latest version)
• The workflow is automatically deployed to SIT/UAT/Production (all workflows are read-only in Production)
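When changes made directly in UAT/Production are not committed back, GIT and the deployed definitions drift apart. One way to detect this is to compare fingerprints of the two copies. The sketch below assumes both sides are available as parsed JSON keyed by workflow name; the workflow names and fields are made up.

```python
import hashlib
import json


def fingerprint(definition: dict) -> str:
    """Hash a definition so that key order does not affect the result."""
    canonical = json.dumps(definition, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_drift(git_defs: dict, deployed_defs: dict) -> list:
    """Names of deployed workflows that differ from GIT or are missing from it."""
    return sorted(
        name
        for name, deployed in deployed_defs.items()
        if name not in git_defs
        or fingerprint(git_defs[name]) != fingerprint(deployed)
    )


# Illustrative data only.
git_defs = {
    "load_sales": {"name": "load_sales", "retries": 1},
    "export": {"name": "export"},
}
deployed_defs = {
    "load_sales": {"retries": 3, "name": "load_sales"},  # edited in place
    "export": {"name": "export"},                        # unchanged
    "hotfix": {"name": "hotfix"},                        # never committed
}

print(find_drift(git_defs, deployed_defs))
```

With read-only workflows in Production, as proposed, such a check should report nothing.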
5. Phase 1 (POC) - DONE
• Compile a list of all UAT/Production Oozie workflows for Team A
• Download/export their definitions in JSON format
• Map each artefact to its GIT path (optional)
• Automatically deploy all artefacts into a separate Cloudera environment
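The first three steps amount to building an inventory in which the GIT path may still be empty. A minimal sketch of such an inventory is shown here; the class, field names and sample entries are hypothetical and do not describe the POC's actual data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Artefact:
    name: str
    environment: str                 # "UAT" or "PROD"
    git_path: Optional[str] = None   # mapping is optional in Phase 1


def build_inventory(exported, git_paths):
    """Attach a GIT path to each exported artefact where a mapping is known."""
    return [Artefact(name, env, git_paths.get(name)) for name, env in exported]


def unmapped(inventory):
    """Names of artefacts that still have no GIT path."""
    return [a.name for a in inventory if a.git_path is None]


# Illustrative entries only.
exported = [("load_sales", "PROD"), ("export", "UAT")]
git_paths = {"load_sales": "workflows/load_sales.json"}

inventory = build_inventory(exported, git_paths)
print(unmapped(inventory))
```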
Phase 2
Once we are happy with Phase 1, we can do the same for all teams. After the latest version of all artefacts has been
obtained for all teams, the scrum masters will map each artefact to its GIT path.
Phase 3
All downloaded artefacts will be stored in a separate GIT repository (the structure will be decided later) together with
all their dependencies (bash scripts, options files, Hive/Pig scripts etc.); jar files will be stored separately in Nexus. A
separate application will deploy each artefact automatically to the new Cloudera environment using configurable
parameters.
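"Configurable parameters" can be pictured as per-environment overrides applied to a definition before it is deployed. The sketch below uses a plain dictionary as a stand-in for a deployment configuration file; the keys, the naming scheme and the sample values are assumptions for illustration, not the POC's actual configuration.

```python
import copy


def render_for_environment(definition: dict, env_config: dict) -> dict:
    """Return a copy of the definition with environment-specific values applied."""
    rendered = copy.deepcopy(definition)
    rendered["name"] = env_config.get("name_prefix", "") + definition["name"]
    rendered["parameters"] = {
        **definition.get("parameters", {}),
        **env_config.get("parameters", {}),  # environment values win
    }
    return rendered


# Hypothetical per-environment settings.
DEPLOYMENT_CONFIG = {
    "UAT": {"name_prefix": "uat_", "parameters": {"db": "sales_uat"}},
    "PROD": {"name_prefix": "", "parameters": {"db": "sales"}},
}

definition = {"name": "load_sales", "parameters": {"db": "sales_dev", "retries": "2"}}

for env in ("UAT", "PROD"):
    print(env, render_for_environment(definition, DEPLOYMENT_CONFIG[env]))
```

Working on a copy keeps the definition stored in GIT unchanged, so the same artefact can be rendered for each environment in turn.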
Proposed project plan
6. Demonstrated POC application workflow
[Diagram]
Tools: GIT, JENKINS, NEXUS, SonarQube
Environments: DEV, SIT, UAT, PROD
Other diagram labels: Workflow/Coordinator JSON, Dependencies, Spark Jar, Deployment configuration (YAML), Automatic tests
7. Benefits
CI/CD POC application (Scala) vs other methods
• No admin privileges needed, only a normal development user with access to the Hue UI
• Works with all Cloudera and Hue versions
• Fully customisable to meet special requirements (PK sync between environments, different/same users,
changing workflow/coordinator names/parameters etc.)