Data is growing exponentially, and companies need a data governance strategy and roadmap to take full advantage of their big data. This presentation was given by Raquel Seville at Jamaica Computer Society BizTech 2017.
1. Building your Big Data
Roadmap and Data
Governance Strategy
Lessons from a Data Nerd - Raquel Seville
2. About Raquel
➔ Big Data Nerd: I have been working with massive volumes of
structured and unstructured data for over a decade
➔ Analytics Wrangler: I am keen on BI user adoption and having the
right tools, talent and processes to ensure maximum ROI
➔ Blogger: I blog all things data at www.exportBI.com
➔ Author: SAP OpenUI5 for Mobile BI and Analytics
➔ SAP Mentor
➔ Foodie, Travel Addict @QuelzSeville
3. Digital Age?
“The Stone Age did not
end for lack of stone, and
the Oil Age will end long
before the world runs out
of oil.”
- Saudi Oil Minister
Sheikh Yamani
4. What is the fastest growing
commodity in the world?
6. Data Never
Sleeps: In 60 seconds,
WhatsApp users send 29
million messages,
Google receives 4
million search queries,
Instagram has over 65k
photos uploaded, and
Facebook has 3.3
million posts
7. Data is growing!
The exponential growth of data
and intelligent things in an
environment of ubiquitous
Internet connectivity is enabling
a fourth industrial revolution —
digital business
transformation
- Jen Underwood, Founder,
Impact Analytix, LLC
8. Say Hello to Netflix.
Over 109 million members
globally in over 190 countries
Streaming over 125 million
hours of content per day
Data warehouse size is over
60 petabytes
9. Netflix Big Data
Two streams of data - event and
dimension data
Event data from cloud services via
Ursula (data pipeline)
Dimension data is pulled from
Cassandra cluster
S3 is the single source of truth (Cloud
DW from AWS)
11. Netflix offers Big Data as
a Service
Netflix developed Genie to manage
access to clusters and data abstraction
Genie is a federated job orchestration
engine
It is designed to manage various big
data jobs such as Hadoop, Pig, Hive
Read more: https://github.com/Netflix/genie
12. Source: QCon SF 2016 - Netflix Big Data Infrastructure
https://www.infoq.com/presentations/netflix-big-data-infrastructure
13. Quality user experience
Netflix uses big data to predict the next hit
series and this helps to strengthen their
position as a content provider
Big Data also helps to drive customer
recommendations and helps to improve
predictions based on customer’s viewing
habits
14. Your Roadmap
When building your big data roadmap and
data governance strategy, there are two
broad areas to focus on, and they
can be framed as questions:
➔ Where are you now?
Take a closer look at your existing
environment, tools, data, users
➔ Where do you want to be?
Develop a plan and determine
accessibility, budget, stakeholders and
so on.
15. Where are you now?
Existing tools, infrastructure
and resources that support
reporting, data warehousing,
dashboards, data mining
and analytics
16. Where are you now?
What are your big data
sources (in-house databases,
social media, websites)?
17. Where are you now?
Analyse user base, company
size, user roles (SME/Domain
expert, analysts, power users,
consumers), access and
security restrictions
18. Where are you now?
What problems are you
trying to solve?
What decisions are you trying
to make?
19. Where are you now?
What is the existing data
culture?
Document processes and
identify gaps
20. Where do you want to be?
Create a project
plan and identify
scope, budget,
timelines, KPIs,
stakeholders
21. Where do you want to be?
Determine
accessibility
needs for
deployment, such as
desktop, mobile, cloud,
apps, web
22. Where do you want to be?
Deliver quick wins
and tangible value-
added solutions
23. Where do you want to be?
Align to the analytics value
escalator:
descriptive,
diagnostic,
predictive, or
prescriptive
24. “You can have data without
information, but you cannot
have information without
data.”
– Daniel Keys Moran
Editor's Notes
After disruption, there is a shift.
With advances in technology, there is greater diversification of energy resources: solar, wind, nuclear, etc.
Internet of things
Digital business
A petabyte (PB) is 10^15 bytes of data: 1,000 terabytes (TB) or 1,000,000 gigabytes (GB).
Multiple Hadoop clusters
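The conversion in the note above can be checked with a few lines of Python. This is a minimal sketch that assumes decimal (SI) units; the helper name `petabytes_to` is illustrative, and the 60 PB figure is the warehouse size quoted on the Netflix slide.

```python
# Decimal (SI) storage units, as used in the note above.
BYTES_PER_GB = 10**9
BYTES_PER_TB = 10**12
BYTES_PER_PB = 10**15


def petabytes_to(unit_bytes: int, petabytes: float) -> float:
    """Convert a size in petabytes to another unit, given that unit's size in bytes."""
    return petabytes * BYTES_PER_PB / unit_bytes


# One petabyte expressed in smaller units.
print(petabytes_to(BYTES_PER_TB, 1))   # 1000.0
print(petabytes_to(BYTES_PER_GB, 1))   # 1000000.0

# The 60 PB warehouse size from the Netflix slide, in terabytes.
print(petabytes_to(BYTES_PER_TB, 60))  # 60000.0
```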
Apache Cassandra is a free and open-source distributed NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure.
https://github.com/Netflix/genie
Parquet file format: It is column-oriented, which allows for improved compression. Parquet files also store additional metadata, such as per-column minimum and maximum values and sizes. This allows operations such as counts, or skipping data that cannot match a query, to be performed very quickly.
Hadoop Distributed File System (HDFS)
Netflix data: time spent selecting movies, time of day, playback habits
What we are not fully aware of, however, is how to leverage all this data to get the most value for decision making and analysis. The key to solving this problem is a solid, yet evolving, big data roadmap and data governance strategy.
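The idea behind the fast counts and skipping mentioned in the Parquet note can be illustrated with a toy model. The sketch below is plain Python, not the Parquet library or its file layout; `ColumnChunk`, `make_chunks`, `count_greater_than`, and the sample column are all hypothetical names and data chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ColumnChunk:
    """A block of values from one column plus the statistics stored with it."""
    values: List[int]
    min_value: int
    max_value: int


def make_chunks(column: List[int], chunk_size: int) -> List[ColumnChunk]:
    """Split a column into fixed-size chunks and record min/max per chunk."""
    chunks = []
    for start in range(0, len(column), chunk_size):
        block = column[start:start + chunk_size]
        chunks.append(ColumnChunk(block, min(block), max(block)))
    return chunks


def count_greater_than(chunks: List[ColumnChunk], threshold: int) -> Tuple[int, int]:
    """Count values above threshold, reading only chunks that need a scan.

    Returns (match_count, chunks_actually_scanned).
    """
    matches = 0
    scanned = 0
    for chunk in chunks:
        if chunk.max_value <= threshold:
            continue  # statistics prove nothing qualifies: skip the chunk
        if chunk.min_value > threshold:
            matches += len(chunk.values)  # everything qualifies: no scan needed
            continue
        scanned += 1
        matches += sum(1 for v in chunk.values if v > threshold)
    return matches, scanned


# Hypothetical column: minutes watched per session, stored in sorted order.
minutes_watched = list(range(1, 101))  # 1..100
chunks = make_chunks(minutes_watched, chunk_size=10)
print(count_greater_than(chunks, threshold=85))  # (15, 1)
```

Of the ten chunks, only one has to be read value by value; the rest are answered from the stored minimum and maximum alone.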
Problems: poor quality, redundancy, security, privacy, availability, updates, complexity, volume
Decisions: company goals, improve decision making in specific areas, strategic objectives
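Two of the problems listed above, poor quality and redundancy, are easy to measure as a first step when documenting the current state. The following is a minimal sketch, not a full data-quality tool; the `profile` function and the customer records are invented for illustration.

```python
from collections import Counter
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]


def profile(records: List[Record], key: str) -> Dict[str, int]:
    """Count records with missing fields and records that repeat an existing key."""
    missing = sum(
        1 for r in records if any(v in (None, "") for v in r.values())
    )
    key_counts = Counter(r[key] for r in records if r.get(key))
    duplicates = sum(n - 1 for n in key_counts.values() if n > 1)
    return {"total": len(records), "missing": missing, "duplicates": duplicates}


# Hypothetical customer records, invented for illustration only.
customers = [
    {"id": "c1", "email": "ann@example.com", "country": "JM"},
    {"id": "c2", "email": "", "country": "JM"},
    {"id": "c1", "email": "ann@example.com", "country": "JM"},
    {"id": "c3", "email": "lee@example.com", "country": None},
]
print(profile(customers, key="id"))
# {'total': 4, 'missing': 2, 'duplicates': 1}
```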
Senior management buy-in is critical
Determine accessibility needs for deployment, such as desktop, mobile, cloud, apps, web etc.
Use iterative and agile methods to deliver quick wins and tangible value-added solutions in short sprints (high-value, low-cost)
Decide where you want to be along the analytics value escalator: Descriptive, Diagnostic, Predictive or Prescriptive (Machine Learning/Artificial Intelligence).
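The four steps of the analytics value escalator can be made concrete on a tiny dataset. The sketch below is only an illustration under stated assumptions: the ad spend and sales figures, the cost factor of 1.5, and the candidate spend levels are all invented, and a simple least-squares line stands in for whatever predictive model an organisation would actually use.

```python
import math
from typing import List, Tuple


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


def correlation(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation between two equally long series."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Least-squares fit; returns (slope, intercept)."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


# Hypothetical weekly figures, invented for illustration.
ad_spend = [1, 2, 3, 4, 5]
sales = [12, 14, 16, 18, 20]

# Descriptive: what happened?
average_sales = mean(sales)

# Diagnostic: why did it happen? (how strongly sales move with spend)
spend_sales_corr = correlation(ad_spend, sales)

# Predictive: what is likely to happen at a new spend level?
slope, intercept = fit_line(ad_spend, sales)


def predict(spend: float) -> float:
    return intercept + slope * spend


# Prescriptive: which action should we take? (assumed cost of 1.5 per unit of spend)
options = [2, 4, 6]
best_spend = max(options, key=lambda s: predict(s) - 1.5 * s)

print(average_sales, spend_sales_corr, predict(6), best_spend)
```

Each step reuses the output of the one before it, which is why moving up the escalator depends on having the lower steps in place first.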