What power do business departments hold? What is missing in the communication between the layers responsible for building big data solutions? And what mistakes can happen when IT departments are too proactive in creating big data solutions?
1. Lukáš Vereš May 3, 2021
Webinar:
Building big data pipelines: Lessons learned
2. About me
› Lead big data delivery projects; act as a delivery lead and solution architect
› Projects focused on the creation of pipeline-generation frameworks and data delivery for customers
› Experience from big pharma and finance-related companies
› 10+ years of experience in the industry
Lukáš Vereš
Delivery lead for big data projects
3. PROFINIT
Our competencies
Company stats
SOFTWARE DEVELOPMENT
APPLICATION OUTSOURCING
ENTERPRISE INTEGRATION
BUSINESS INTELLIGENCE/DWH
BIG DATA AND DATA SCIENCE
22+ yrs. – On the tech market since 1998.
Prague – Headquarters at the centre of Europe.
500+ – Experienced and enthusiastic professionals.
Top 3 – CAD company in the Czech Republic (IDC study).
26M € – Company revenue in 2019.
Multiple areas – Clients from the finance, insurance, and telco industries.
50+ – We serve many prominent world clients.
Certifications, culture & quality
A long history of technical engineering excellence has led Western companies to rely heavily on skills and expertise from the Czech Republic. We are proud of the quality of our services and of our ISO 9001, ISO 27001, ISO 20000, and PRINCE2 certifications, which underpin our commitment to providing high-quality, sustainable services.
8. The business perspective on big data
› Validity
– Is the system under development or is it stable?
– Are data secure and can you trust them?
– Are the data compliant with laws and regulatory policies like the GDPR and CCPA?
› Value
– Define value objectives up front, then track them with the metrics you have chosen
› Visualisation
– All data flows and processes need to be monitored and illustrated descriptively
– Understand what is actually being carried out and how
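The monitoring point above can be made concrete: even a minimal run log per pipeline step makes it visible what was actually carried out and how. A hypothetical pure-Python sketch (class and step names are illustrative, not from the talk):

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class StepRun:
    name: str
    status: str
    duration_s: float

@dataclass
class PipelineRun:
    pipeline: str
    steps: List[StepRun] = field(default_factory=list)

    def run_step(self, name: str, fn: Callable[[], None]) -> None:
        # Time the step and record its outcome instead of letting it fail silently.
        start = time.perf_counter()
        try:
            fn()
            status = "ok"
        except Exception:
            status = "failed"
        self.steps.append(StepRun(name, status, time.perf_counter() - start))

def extract() -> None:
    pass  # stand-in for a real extract step

def validate() -> None:
    raise ValueError("bad record")  # simulate a failing quality check

run = PipelineRun("customer_ingest")
run.run_step("extract", extract)
run.run_step("validate", validate)
# run.steps now records, per step, what ran, whether it succeeded, and how long it took
```

A record like this is the raw material for the descriptive dashboards the slide asks for.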
9. Technical perspective on big data
› Data storage
– Data storage systems: HDFS, GlusterFS, etc.
– File formats with internal structure: Avro, Parquet, Delta Lake, etc.
› Data processing
– Data transformation: Spark, MapReduce, etc.
– SQL-on-Hadoop engines: Impala, Presto, Hive, HBase with Phoenix
› Open technologies
– Free access, with the option to get support for special components
› Big data volume
– Datasets bigger than 1 TB, hundreds of datasets or more from one source, different file formats
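One way to see why the columnar formats above (Parquet, Delta Lake) suit analytics better than row-oriented ones (Avro): an analytical scan touches only the columns it needs. A simplified pure-Python illustration of the two layouts, not using the real libraries:

```python
# Row-oriented layout: one whole record per entry, as Avro stores data.
rows = [
    {"id": 1, "country": "CZ", "amount": 10.0},
    {"id": 2, "country": "DE", "amount": 25.5},
    {"id": 3, "country": "CZ", "amount": 4.5},
]

def to_columnar(records):
    """Pivot row-oriented records into a column-oriented layout,
    analogous to how Parquet stores values column by column."""
    columns = {key: [] for key in records[0]}
    for record in records:
        for key, value in record.items():
            columns[key].append(value)
    return columns

columns = to_columnar(rows)

# Aggregating one column reads only that column's contiguous values;
# a row store would have to deserialize every full record.
total = sum(columns["amount"])
```

The real formats add compression and encoding per column on top of this layout, which is why they dominate analytical workloads on big datasets.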
12. Real life stories
› Implemented two different architectures and toolsets for the same purpose
› Added one more reporting tool to the pile
13. Benefits of multiple technologies
› Two different solutions for the same thing create a competitive environment
› New ideas emerge as each side tries to differentiate itself
› People get the chance to decide which technology they prefer to work with
14. Downsides of multiple technologies
› Choosing a toolset can add more work to onboarding projects
› Less transparency in decision-making
› Every tool has to be supported, adding complexity to the infrastructure
15. Key Takeaways
› Do your homework and write down any discrepancies and recommendations
› Be transparent in your decision-making
› IT teams need to support business teams by teaching them how to use existing tools
17. Real life stories
› The development team relocated a support team member to work with them on product development
› A member of the support team had a workstation next to the developers
› The data warehouse team struggled to understand the impact of a data lake on their transformations
18. Benefits of Intensive Collaboration
› Helps the support team understand the technology and improves constructive discussion
› Involvement of the other team can improve the quality of the delivery
› Decreases frustration in teams where there are misunderstandings
19. Downsides of Intensive Collaboration
› It consumes the capacity of the support team member
› Not everyone is both a good learner and a good teacher
20. Key Takeaways
› Invest in teams, not only in individuals
› It is a process, not a one-time experience; it will take time to evolve
23. Real life stories
› Too many data sources to load
– The goal was to speed up the pipeline delivery process
– Difficulties with changing the architecture
› Developed a solution for generating data ingest pipelines
– A semi-manual self-service approach to speeding up delivery
– Built on the AWS platform as a serverless architecture
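The pipeline-generation idea above can be sketched as a config-driven generator: each dataset is described declaratively, and the framework expands it into the same standardized sequence of ingest steps. A hypothetical Python sketch (the dataset fields and step vocabulary are my assumptions, not the framework from the talk):

```python
def generate_ingest_pipeline(dataset):
    """Expand one declarative dataset config into a standard
    sequence of ingest steps (hypothetical step names)."""
    name = dataset["name"]
    return [
        {"step": "extract", "source": dataset["source_path"]},
        {"step": "validate", "schema": dataset["schema"]},
        {"step": "convert", "format": "parquet"},
        {"step": "load", "target": f"datalake/{name}"},
    ]

datasets = [
    {"name": "customers", "source_path": "s3://raw/customers", "schema": "customers_v1"},
    {"name": "orders", "source_path": "s3://raw/orders", "schema": "orders_v1"},
]

# Every dataset gets the same standardized pipeline shape, so onboarding
# a new source becomes a config change rather than new pipeline code.
pipelines = {d["name"]: generate_ingest_pipeline(d) for d in datasets}
```

The semi-manual self-service part then reduces to letting users submit a config entry; in a serverless setup, each generated step would map to a managed service rather than code the team operates itself.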
24. Benefits of Data Pipelines
› Speeding up pipeline development can increase data scientists’ and data analysts’ interest in getting data in a more standardized way
› Standardizing data ingest improves the quality of the data that data scientists and data analysts work with
› Serverless architecture
25. Downsides of Data Pipelines
› Managing the lifecycle of data sources
› It takes time to build
› Keeping costs transparent
26. Key Takeaways
› Build a framework for pipeline generation; it will pay off in the long run
– Saves money on support
– Provides a unified, standardized way to give analytics teams access to data
› Think about how to load datasets into target systems faster
– Opens new possibilities for business customers
– Gives even very small customers without big budgets access to data
28. Real life stories
› Implemented a framework from scratch without involving the business side from the beginning
› Loading all datasets into the data lake so they would be ready for data analysts and data scientists turned out to be the wrong dogma
29. Benefits of Late Business Involvement
› Gives developers space to focus on technology and ideas
› Developers can try new things, even ones that turn out to be dead ends
› Loading all datasets ahead of time gets data closer to the data scientists and data analysts, so it is ready anytime they need it
30. Downsides of Late Business Involvement
› If developers have more space, they might build a solution that doesn’t fit real use cases
› Source systems keep changing, and the business needs to pay to support these changes and their impacts
31. Key Takeaways
› Think about when the right time is to involve people from the business side
– The business can give developers space to work on a framework, but at the same time, they should provide specific use cases
› The dogma of loading all datasets in advance is wrong
– Higher costs for pipeline support
– Frustration from fixing issues in pipelines that no one uses
– The focus shifts to fixing bugs instead of delivering higher quality
33. Lessons to be learned from this presentation
› Be constructive and honest when choosing technologies
› Have people work with different teams
› Deliver datasets from source systems faster
› Create solutions around business use cases
34. Profinit EU, s.r.o.
Tychonova 2, 160 00 Prague 6 | Phone +420 224 316 016
Web
www.profinit.eu
LinkedIn
linkedin.com/company/profinit
Twitter
twitter.com/Profinit_EU
Facebook
facebook.com/Profinit.EU
YouTube
Profinit EU
Thank you
for your attention
35. We need your help to be better!
› Since you are here, please help us improve our events and webinars by taking a look at our short survey. We appreciate your willingness to help us grow.
www.bigdataforbanking.com
linkedin.com/company/profinit
www.profinit.eu
› Contacts
Lukáš Vereš
lukas.veres@profinit.eu
Delivery lead for big data projects