In Data Engineer's Lunch #55, CEO of Anant, Rahul Singh, will cover 10 resources every data engineer needs to get started or master their game.
Accompanying Blog: Coming Soon!
Accompanying YouTube: Coming Soon!
Sign Up For Our Newsletter: http://eepurl.com/grdMkn
Join Data Engineer’s Lunch Weekly at 12 PM EST Every Monday:
https://www.meetup.com/Data-Wranglers-DC/events/
Cassandra.Link:
https://cassandra.link/
Follow Us and Reach Us At:
Anant:
https://www.anant.us/
Awesome Cassandra:
https://github.com/Anant/awesome-cassandra
Email:
solutions@anant.us
LinkedIn:
https://www.linkedin.com/company/anant/
Twitter:
https://twitter.com/anantcorp
Eventbrite:
https://www.eventbrite.com/o/anant-1072927283
Facebook:
https://www.facebook.com/AnantCorp/
Join The Anant Team:
https://www.careers.anant.us
2. Business Platform Success
We help our clients build their global platforms on scalable data platforms with our Playbook, Framework, and Knowledge base.
4. ARCHITECT
noun: architect; chief builder
verb: architect; design or make (COMPUTING)
“We create and manage global platforms that run on Cassandra and related technologies.”
5. Things We Love: Scalable Fast Data
Without DataStax
With DataStax
6. The Landscape of Cloud Data Engineering
[Diagram labels: Query, BI, Data Warehouse, DataOps, DevOps, Data Engineering, SQL, NoSQL, Queues, Data Lake]
7. SQL - The foundation of data engineering. Still very relevant.
8. SQL / Relational Databases in Data Engineering
Tools
1. MySQL - The most popular SQL database in use (MariaDB is a variant).
2. PostgreSQL - Increasingly used to replace Oracle.
3. Microsoft SQL - Still relevant. Not going anywhere.
4. Oracle - Big companies use this. Still relevant.
Factors
1. Popularity - Very popular because most software, commercial or open source, runs on relational databases.
2. Function - What SQL can do with ACID transactions is currently hard to beat in NoSQL.
3. Staying Power - Open, commercial, and cloud options. No reason to expect it to disappear.
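The ACID point under Function can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for any relational database; the accounts table, the names, and the transfer helper are hypothetical and chosen only for illustration.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with a hypothetical accounts table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        "name TEXT PRIMARY KEY, "
        "balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
    )
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """Credit dst, then debit src, as one atomic unit. Returns True on commit."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # This debit violates the CHECK constraint when funds are short,
            # which also undoes the credit above.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return {name: balance} for every account."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

A transfer that would overdraw the source account leaves both balances untouched, which is the all-or-nothing behavior that is harder to get in many NoSQL stores.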
9. NoSQL - The foundation for big data applications. Lots of variants.
10. NoSQL / Non-relational DBs in Data Engineering
Tools
1. Mongo - In use everywhere, due to its popularity in the Node world.
2. Redis - Needed not only for apps but also in the process of data engineering.
3. Dynamo - Easy to get started. Lots of apps on AWS run on Dynamo.
4. Cassandra - In use by the largest companies for critical operations.
Factors
1. Popularity - Popular because they are easy to get started with.
2. Function - Each has its own special reason to be useful.
3. Staying Power - The different variants, implementations, and managed services for these DBs show that enough people need them to sustain these additional service markets.
11. Data Lakes on HDFS - The standard for storage and retrieval of files - structured, unstructured, semi-structured, or binary.
12. Data Lakes on HDFS / S3 Distributed File Storage
Tools
1. HDFS - Universal protocol for distributed file system access.
2. Amazon S3 - Accessible through Hadoop-compatible file system connectors (such as S3A); the S3 object API is itself a standard now.
3. Google Storage - Does what S3 does on Google.
4. Azure Blob - Does what S3 does on Azure.
Factors
1. Popularity - Popularized by big data and clouds needing their own distributed file storage.
2. Function - Use as object storage (key:value), to store raw files, or to store structured data for later use in query engines.
3. Staying Power - Responsible for the massive storage of all “cold” data that doesn’t need to be in a database. The HDFS/S3 standards are now universal.
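The key:value idea behind object storage can be sketched with a toy in-memory store. This is not the S3 API or any vendor's client; the class ToyObjectStore and its put, get, and list_keys methods are hypothetical names used only to show that "folders" are just key prefixes and that values are opaque bytes.

```python
class ToyObjectStore:
    """In-memory stand-in for an object store: flat keys mapped to bytes."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        """Store raw bytes under a key, replacing any previous object."""
        self._objects[key] = bytes(data)

    def get(self, key):
        """Return the bytes stored under a key (KeyError if missing)."""
        return self._objects[key]

    def list_keys(self, prefix=""):
        """List keys that start with a prefix, in sorted order."""
        return sorted(k for k in self._objects if k.startswith(prefix))


store = ToyObjectStore()
# Raw files and structured data live side by side; the store does not care.
store.put("raw/logo.png", b"\x89PNG...")
store.put("sales/2021.csv", b"region,amount\neast,10\n")
store.put("sales/2022.csv", b"region,amount\nwest,5\n")
```

Listing with the prefix "sales/" returns only the two CSV keys, which is how a query engine would later find the files it needs to read.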
14. Streams / Queues in Data Engineering
Tools
1. RabbitMQ - Lots of use in business, works well until it doesn’t.
2. Apache Kafka - Full ecosystem, plus variants that support the Kafka protocol.
3. Amazon SQS - Easy to get started with and use in Amazon. Similar services exist in other clouds.
Factors
1. Popularity - Popular because of the rise of real-time use cases in business platforms.
2. Function - Used to store “everything” that’s happening, as well as focused “events” that trigger processes.
3. Staying Power - Demand in the market, and current users continue to grow their use cases.
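The two functions named above, keeping a record of everything and triggering processes on specific events, can be sketched in a few lines. This toy is not Kafka, RabbitMQ, or SQS; ToyEventLog and its method names are hypothetical, and a real broker adds durability, partitioning, and delivery guarantees that are out of scope here.

```python
from collections import defaultdict


class ToyEventLog:
    """Append-only log with per-event-type handlers."""

    def __init__(self):
        self._log = []                      # every event, in arrival order
        self._handlers = defaultdict(list)  # event type -> callbacks

    def subscribe(self, event_type, handler):
        """Register a callback to run whenever event_type is published."""
        self._handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        """Append the event, trigger its handlers, and return its offset."""
        offset = len(self._log)
        self._log.append((event_type, payload))
        for handler in self._handlers[event_type]:
            handler(payload)
        return offset

    def replay(self, from_offset=0):
        """Return all events from a given offset, as a consumer would read them."""
        return list(self._log[from_offset:])
```

Every published event stays in the log for later replay, while only subscribed event types trigger a process when they arrive.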
16. Popular Data Engineering Tools
Tools
1. Apache Spark - The most popular big data engineering toolkit. Python, Scala, Java, R, C#.
2. dbt - New tool but very powerful. Abstracts database engineering into SQL.
3. Fivetran - Commercial tool for visually managing data flows.
4. Stitch - Similar to Fivetran, with many connectors and the open Singer framework.
Factors
1. Popularity - Different reasons for popularity. Commercial tools save tons of time.
2. Function - Lets you consolidate and standardize all flows into a single system.
3. Staying Power - Apache Spark is a core part of cloud offerings. Stitch and Fivetran are popular at large companies. dbt is new but has good growth.
18. Data Operations in Data Engineering
Tools
1. Airflow - Many connectors to manage complex data flows.
2. Jenkins - Used for CI/CD; can do linear pipelines.
3. Prefect - New but powerful tool in Python.
4. Argo - Does CI/CD, but the Workflow engine is useful and runs Kubeflow.
Factors
1. Popularity - Traction in big and small companies.
2. Function - Lets you orchestrate complex workflows of tasks (DAGs).
3. Staying Power - Airflow is future-proof in Kubernetes; Argo is the new kid in Kubernetes. Jenkins is in use at many companies.
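Orchestrating a workflow as a DAG comes down to running each task only after the tasks it depends on. The sketch below shows that idea with Python's standard graphlib module (Python 3.9+); it is not how Airflow, Prefect, or Argo are implemented, and run_pipeline and the task names are hypothetical.

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks, dependencies):
    """Run tasks in dependency order.

    tasks: task name -> callable that takes the dict of earlier results
    dependencies: task name -> set of task names that must run first
    Returns (execution order, results by task name).
    """
    graph = {name: dependencies.get(name, set()) for name in tasks}
    order = list(TopologicalSorter(graph).static_order())
    results = {}
    for name in order:
        results[name] = tasks[name](results)
    return order, results


tasks = {
    "extract": lambda r: [1, 2, 3],
    "transform": lambda r: [x * 10 for x in r["extract"]],
    "validate": lambda r: all(x > 0 for x in r["extract"]),
    "load": lambda r: sum(r["transform"]) if r["validate"] else None,
}
dependencies = {
    "transform": {"extract"},
    "validate": {"extract"},
    "load": {"transform", "validate"},
}
order, results = run_pipeline(tasks, dependencies)
```

Here "extract" always runs first and "load" always runs last; "transform" and "validate" may run in either order because neither depends on the other. A cycle in the dependencies raises graphlib.CycleError, which is why these tools require the graph to be acyclic.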
20. Data Warehouse - Analytics Across Data
Tools
1. Redshift - Widely used due to Amazon.
2. BigQuery - Well-integrated query engine in Google.
3. Snowflake - Does a bit of data engineering as well as being a query engine.
4. MsSQL/Oracle - Commercial DBs have a data warehouse configuration.
Factors
1. Popularity - Warehousing conventions (dimensions, facts) have been around for a while.
2. Function - After bringing data together and relating it, you can run massive SQL queries.
3. Staying Power - The theory isn’t going anywhere. Technologies may change, but the core concept is solid.
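The dimensions-and-facts convention can be shown with a tiny star schema. The sketch uses Python's built-in sqlite3 module rather than a real warehouse, and the tables dim_product, dim_date, and fact_sales and their rows are made up for illustration.

```python
import sqlite3


def build_warehouse():
    """Create a hypothetical star schema: one fact table, two dimensions."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
        CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER);
        CREATE TABLE fact_sales (product_id INTEGER, date_id INTEGER, amount INTEGER);
        INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
        INSERT INTO dim_date VALUES (10, 2020), (11, 2021);
        INSERT INTO fact_sales VALUES (1, 10, 5), (1, 11, 7), (2, 11, 20), (2, 11, 1);
        """
    )
    return conn


def sales_by_category_and_year(conn):
    """Join facts to their dimensions and aggregate the measure."""
    return conn.execute(
        """
        SELECT p.category, d.year, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_id = f.product_id
        JOIN dim_date AS d ON d.date_id = f.date_id
        GROUP BY p.category, d.year
        ORDER BY p.category, d.year
        """
    ).fetchall()
```

The fact table holds the measurements and foreign keys, the dimension tables hold the descriptive attributes, and the analytical query is a join plus a GROUP BY. The same shape of query is what runs at much larger scale in a warehouse.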
22. Query Engines - Analytics Across Data Sources
Tools
1. Apache Hive - Available in the Hadoop ecosystem or in some variants by cloud vendors.
2. Spark SQL / Hive - Like Hive but on Spark.
3. PrestoDB - Open data virtualization, can run on Spark, works with Hive.
4. Denodo - Commercial data virtualization, can run on Spark.
Factors
1. Popularity - Hive is a standard and works in different systems like Spark and Hadoop. Presto is popular. Denodo is coming up.
2. Function - Separates storage from query. “Virtualizes” queries.
3. Staying Power - The theory of separating storage from query has now been implemented in Snowflake and Redshift. These will stick.
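Separating storage from query means the data stays where it is, as plain files, and the engine reads it only when a query runs. The sketch below is a toy illustration of that split, not how Hive or Presto work internally; the STORAGE dict stands in for files in a data lake, and scan and total_by are hypothetical helpers.

```python
import csv
import io

# Stand-in for CSV files sitting in a data lake, keyed by path.
STORAGE = {
    "sales/2021.csv": "region,amount\neast,10\nwest,5\n",
    "sales/2022.csv": "region,amount\neast,7\n",
}


def scan(prefix):
    """Yield rows (as dicts) from every stored file under a path prefix."""
    for key in sorted(STORAGE):
        if key.startswith(prefix):
            yield from csv.DictReader(io.StringIO(STORAGE[key]))


def total_by(prefix, group_col, value_col):
    """Sum value_col per group_col across all files under prefix."""
    totals = {}
    for row in scan(prefix):
        group = row[group_col]
        totals[group] = totals.get(group, 0) + int(row[value_col])
    return totals
```

Nothing is loaded into a database ahead of time: adding another file under "sales/" changes the next query's answer without any import step.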
24. Business Intelligence tools for Data Engineers
Tools
1. Tableau - Very popular since they give people community access.
2. Looker - Commercial-grade tool; expect a good UI.
3. Redash - Powerful open source tool for data professionals to make reports and dashboards.
4. Metabase - Easy-to-use tool for non-admin / non-DBA types.
Factors
1. Popularity - BI is HUGE. Learning it is not just about the tool. Tools are always coming and going.
2. Function - Allows non-programmers to discover, analyze, and create visualizations and reports that other non-technical people can consume.
3. Staying Power - Tableau will stick around. Open source Redash is now supported by Databricks.
26. DevOps Tools for Data Engineering
Tools More Tools
1. Terraform - Manage different clouds with one language.
2. Prometheus / Grafana - The O.G. of time series system data visualization.
3. Ansible - Better organizes the commands that need to be run: setup, configure, run ad-hoc commands.
1. Docker - Customize your image.
2. Kubernetes - Run your cluster.
3. Argo - CI/CD for containers in Kubernetes land.
4. Jenkins - General-purpose CI/CD; can use this to run other tools.
28. Create and manage global data platforms.
www.anant.us | solutions@anant.us | (855) 262-6826
3 Washington Circle, NW | Suite 301 | Washington, DC 20037