Companies are gathering and processing more data than ever, and data lakes give them a central location for storing that data in its native format. However, deploying a data lake can be complex and time-consuming, as it requires the installation and configuration of numerous tools and services. Ansible, a powerful open-source tool that allows you to automate the configuration, deployment, and management of IT infrastructure, can simplify the deployment of data lakes. In this lecture, we will introduce you to Ansible and demonstrate how we used it to automate the deployment of an Elasticsearch-based data lake for the Cyber4De project.
Poslovna inteligencija
• leading ICT consulting company in South-Eastern Europe
• strong partnerships with all the leading technology vendors
• Data & Analytics, Performance Management, Data Migration, Data Engineering and Machine Learning
• 200+ employees
• offices in London, Vienna, Stockholm, Ljubljana, Zagreb, Belgrade, Podgorica and Sarajevo
Cyber4De project
• “Cyber Rapid Response Toolbox for Defence Use” (CYBER4DE)
• launched under the European Defence Industrial Development Programme in December 2021
• “develop an easily deployable, modular, and scalable cyber rapid response toolbox to manage cyber incidents in different complex national and international scenarios”
Data lake
• centralized repository
• store all your structured and unstructured data
• perform data processing tasks on the data without having to move it to different locations or convert it to different formats
Ansible
• open-source, command-line automation tool
• configuration, deployment and management of IT infrastructure
• simple, easy to use
• bare-metal servers, virtual machines
• Unix, Windows
• agentless – SSH or WinRM
• highly extensible
Ansible
• Playbooks:
  • automation blueprints, in YAML format
  • contain a list of plays
  • run with ansible-playbook
• Play:
  • an ordered list of tasks
  • maps tasks to managed nodes
• Task:
  • defines the operations (calls to modules)
TURN YOUR DATA INTO A STRATEGIC ASSET
Speak with one of our Data Experts and discuss your current data analytics challenges!
SCHEDULE A FREE CONSULTATION
Editor's Notes
Hello, my name is Kristijan Pavlović and I work as a Data Engineer at Poslovna inteligencija. Today, I’m going to introduce you to Ansible and demonstrate how we used it to automate the deployment of an Elasticsearch-based data lake for the Cyber4De project.
So who are we?
Poslovna inteligencija is the leading ICT consulting company in South-Eastern Europe
We nurture strong partnerships with all the leading technology vendors and are technology agnostic
We have over 20 years of experience in the largest Data & Analytics, Performance Management, Data Migration, Data Engineering and Machine learning projects
We have more than 200 employees that are primarily located in Zagreb, but we also operate from offices in London, Vienna, Stockholm, Ljubljana, Belgrade, Podgorica and Sarajevo
“Cyber Rapid Response Toolbox for Defence Use”, or CYBER4DE for short is a project…
launched under the European Defence Industrial Development Programme in December 2021 and…
it takes on the challenge to develop an easily deployable, modular, and scalable cyber rapid response toolbox to manage cyber incidents in different complex national and international scenarios – this is a direct quote which I will explain in simpler terms in a few slides
The project has 11 participants from 6 countries…
Our coordinator is:
Baltic Institute of Advanced Technology, BPTI from Lithuania
and the consortium is comprised of members from:
Lithuania (NRD Cyber Security),
Poland (Asseco),
Romania (The Military Equipment and Technologies Research Agency, METRA - part of Ministry of National Defence of Romania),
Croatia (Infigo IS, Poslovna Inteligencija),
France (Airbus Cybersecurity, Thales),
Italy (Leonardo)
We also have some linked third parties on this project:
CY4GATE (Italy) and ComCERT (Poland)
This is the design of the toolbox:
The idea is to have a back-office infrastructure (right side) at the client's location (for example, the Ministry of Defence of a country) and the ability to create so-called remote sites
A remote site is an off-site location (for example, a military base or part of a critical infrastructure) that collects and analyses data; depending on the type of mission, it can also send the data to the back-office for further analysis
Poslovna inteligencija is working on Data Science Workplace module (my colleague Petar Zečević is the module owner, he couldn’t be here today) and Entity Linking module (my colleague Goran Gvozden is the module owner, he also has a presentation today)
The role of the Data Science Workplace module is to provide a fast, configurable data lake for logs and network activity
The module integrates data from different data sources and provides a single source of data available for visualization, analysis, and reporting by other modules
This is an overview of just the DS Module:
it collects data from collection modules and OPSEC and Vulnerability Assessment modules
Data is enriched with output from Anomaly Analysis, Binary and Code Analysis and Entity Linking modules
Data is visualized in the Visualization Module
Analysts can individually access Data Science module to manually analyze the available data
This is our module’s tech stack:
We use Ansible to deploy and manage everything listed on this slide, we have scripts that set up everything and tear everything down, more on that later…
Everything is deployed on Kubernetes, where we have 3 master and 3 worker nodes and everything is set up to be highly available
We use Elasticsearch as our main storage engine for the collected data. It is a distributed, highly-available search engine that allows quick indexing and retrieval of various structured and semi-structured data
Apache Spark is a general-purpose, distributed and fast data processing engine offering DS module users data analysis and cross-incident correlation capabilities
Apache Airflow is a workflow management tool that enables the Data Science module to schedule and orchestrate data synchronization jobs
JupyterHub offers a flexible and powerful environment for writing jobs that interact with Apache Spark and Elasticsearch
A data lake is a centralized repository that stores all of your structured and unstructured data at any scale.
It allows organizations to perform data processing tasks on their data without having to move it to different locations or convert it to different formats.
Ansible is an open-source, command-line automation tool written in Python
It can configure systems, deploy software, and orchestrate advanced workflows to support application deployment, system updates, and more
Ansible’s main strengths are simplicity and ease of use. It also has a strong focus on security and reliability, featuring minimal moving parts
It can be used on bare-metal servers and VMs
It’s designed to configure both Unix-like systems and Microsoft Windows, but it needs to be run from a Unix-like machine (or a WSL distribution)
Ansible is agentless, relying on temporary remote connections via SSH or Windows Remote Management
It’s highly extensible – you can write a custom module, but you can also use one of the 1,000+ included modules
The control node is the system on which Ansible is installed. You run Ansible commands on the control node.
A managed node is a remote system, or host, that Ansible controls.
The inventory provides a list of managed nodes that are logically organized. You create an inventory on the control node to describe host deployments to Ansible.
Based on the main components mentioned on the previous slide, this is our inventory file…
master-01 is our control node, we installed Ansible on it and use it to connect to the rest of the machines
every other node is a managed node. We organized them in groups…
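As a rough sketch of what such an inventory can look like in Ansible's YAML inventory format (only master-01 and the k8s-masters group are taken from this talk; the other host and group names are illustrative, not the project's actual inventory):
all:
  children:
    k8s-masters:
      hosts:
        master-01:        # also our control node
        master-02:        # illustrative host name
        master-03:        # illustrative host name
    k8s-workers:          # illustrative group name
      hosts:
        worker-01:
        worker-02:
        worker-03: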
The first step would be to set up SSH connections so Ansible can connect to the managed nodes, which is done by adding your public SSH key to the authorized_keys file on each remote system. That can be done manually or by using Ansible playbooks. We asked our colleagues from the IT department to create a user with root privileges on all of the VMs, and we did everything else via Ansible.
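As an illustration of the playbook approach to key distribution (a minimal sketch, assuming the ansible.posix collection is installed; the user name and key path are placeholders, not the project's values):
- name: Distribute the controller's public key
  hosts: all
  become: true
  tasks:
    - name: Add the public key to authorized_keys on every managed node
      ansible.posix.authorized_key:
        user: ansible                                      # placeholder user name
        state: present
        key: "{{ lookup('file', '~/.ssh/id_rsa.pub') }}"   # assumed key location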
After creating the inventory files, we need to write our automations…
Ansible playbooks are automation blueprints, in YAML format, that Ansible uses to deploy and configure managed nodes. A playbook can be run from the terminal using the ansible-playbook command.
A play is an ordered list of tasks that maps them to managed nodes in an inventory
A task defines the operations that Ansible performs (calls to modules)
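To make the playbook → play → task hierarchy concrete, here is a minimal, hypothetical playbook (not one from the project) with a single play containing two tasks:
- name: Prepare the worker nodes              # a play
  hosts: k8s-workers                          # maps the tasks below to this inventory group
  become: true
  tasks:
    - name: Install chrony                    # a task: a call to the apt module
      ansible.builtin.apt:
        name: chrony
        state: present
    - name: Ensure chrony is running          # another task: a call to the service module
      ansible.builtin.service:
        name: chrony
        state: started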
There are many possible ways to organize playbook contents and we are following the Ansible “roles” organization feature
The inventory is defined by a file in the root of the directory; in our case, the development and production files
The playbook is also a file in the root of the directory; in our case, elasticsearch.yml
The playbook runs the play located in the roles/ folder
Every play can have multiple folders, but we are using just a few of them… explain all of the folders in the elasticsearch/ folder
Also, group_vars/ defines variables for a particular group from the inventory file (for example, the k8s-masters nodes) and host_vars/ defines variables for a particular host from the inventory file (for example, the master-01 host)
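Put together, the layout of such a repository looks roughly like this (a simplified sketch; anything not named elsewhere in this talk is illustrative):
development                  # inventory for the development environment
production                   # inventory for the production environment
elasticsearch.yml            # playbook
group_vars/                  # variables per inventory group (e.g. k8s-masters)
host_vars/                   # variables per host (e.g. master-01)
roles/
  elasticsearch/
    defaults/main.yml        # default variables (lowest precedence)
    vars/main.yml            # role variables
    tasks/main.yml           # the task list the playbook runs
    files/                   # static files available to the play
    templates/               # templates available to the play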
This is an example based on our Elasticsearch Ansible playbook
As you can see, it’s quite a small file that contains a hosts line with one of the groups from the inventory file. In our case, this playbook is only executed on the Kubernetes master
It also contains the name of the role, in our case, just the Elasticsearch role (play)
It runs the roles/elasticsearch/tasks/main.yml file
If we didn’t use the Ansible “roles” organization feature, we could have put all of the tasks and variables right here in the playbook file, but that is not a best practice
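A simplified sketch of what such a roles-based playbook looks like (the group name k8s-masters is assumed from the inventory example above; the real file may differ in details):
- name: Deploy Elasticsearch
  hosts: k8s-masters              # only executed on the Kubernetes masters
  roles:
    - elasticsearch               # runs roles/elasticsearch/tasks/main.yml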
This is the file that the playbook runs
It is composed of the list of tasks, some of them are here on the slide
It loads all of the variables (by order of precedence):
all of the variables from the defaults folder
all of the variables from the group_vars/ and host_vars/ folders from the root of the directory,
all of the variables from the vars folder
It also makes all of the files from the files/ and templates/ folders available to this play (you just write the name of the file, no need for relative/absolute paths)
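The actual task list isn't reproduced here, but a hypothetical roles/elasticsearch/tasks/main.yml for deploying to Kubernetes could look like this (assuming the kubernetes.core collection is available; the template and file names are placeholders, not the project's):
- name: Create the elasticsearch namespace
  kubernetes.core.k8s:
    name: elasticsearch
    api_version: v1
    kind: Namespace
    state: present

- name: Render the Elasticsearch manifest       # the template would live in templates/
  ansible.builtin.template:
    src: elasticsearch.yml.j2                   # placeholder template name
    dest: /tmp/elasticsearch-rendered.yml

- name: Apply the rendered manifest
  kubernetes.core.k8s:
    state: present
    src: /tmp/elasticsearch-rendered.yml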
To run the Ansible playbook, you run the ansible-playbook command from your terminal
You provide the path to the inventory file by using the -i flag, and at the end you provide a path to the playbook file
You can also provide other information by using flags, for example, extra variables by using the -e flag (they have the highest precedence)
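For the Elasticsearch example in this talk, the command would look roughly like ansible-playbook -i production elasticsearch.yml, with any extra variables appended as -e "key=value" pairs (the key names there would depend on the role).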
The playbook generates output and at the end, you get the recap of the executed tasks
If there’s an error in any of the steps, the output will provide the error message
And that’s it! If you run the Kubernetes command to list all of the pods in the Elasticsearch namespace, you will see that the pods are all starting, and in a few moments everything is up and ready
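For example, kubectl get pods -n elasticsearch would show the pods coming up (assuming the namespace is literally named elasticsearch).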