In the race to turn exclusive insights into multi-million-dollar business opportunities, data scientists and engineers are hampered by a multitude of challenges just to make one use case a reality: the need to ingest data from multiple sources, apply real-time analytics, build machine learning algorithms, and intermix different data processing models, all while navigating legacy data infrastructure that is simply not up to the task. These challenges have created demand for Virtual Analytics, in which the complexities of disparate data and technology silos are abstracted away and coupled with a powerful range of analytics and processing horsepower in one unified data platform. This talk describes how Databricks is powering this revolutionary new trend with Apache Spark.
11. VIRTUAL ANALYTICS
The New Paradigm
Decoupled compute and storage
Uniform data management and security model
Unified analytics engine
Enterprise-wide collaboration
DATA: data warehouses, cloud storage, Hadoop storage, and many others
PEOPLE: data science, data engineering, BI analysts, and many others
12. Is Spark the Answer?
DATA: data warehouses, cloud storage, Hadoop storage, and many others
PEOPLE: data science, data engineering, BI analysts, and many others
13. Databricks + Apache Spark
Managed Cloud Platform
Integrated Workspace
Production Workflow Automation
Optimized Data Access Layer
Databricks Enterprise Security
DATA: data warehouses, cloud storage, Hadoop storage, and many others
PEOPLE: data science, data engineering, BI analysts, and many others
14. Case Study | Viacom
Grow the Viacom audience
Video quality
Real-time anomaly detection
Viewer loyalty
In every industry sector I’ve encountered, the interest in big data is stronger than ever.
Why are they so interested? They believe data is the key to transforming their businesses.
You’ve already heard of some of these examples.
Yesterday, Salesforce came on stage and talked about their plan to build their next-generation CRM product with AI – what they call Einstein. And they are using Spark.
Today, we will hear from the likes of HP – a pedigreed company built on manufacturing devices, which is now using Spark to create a service-based business model with IoT data.
Or another familiar name – McGraw Hill – who has been creating education material for decades but is now looking to Spark to revolutionize learning. They want to use behavior data from students to identify gaps in understanding and provide personalized learning approaches to achieve better outcomes.
Many of the companies we talk to aspire to leverage greater intelligence with data throughout their business, but unfortunately this is much more difficult than it seems.
The first observation is about the catalyst: the data.
Everyone knows that data is bigger and more diverse, but what people underestimate is just how inaccessible and siloed it is.
The reason the volume and variety of data are growing so fast is that there are now many more ways to generate it – it has gone beyond just web servers or enterprise resource planning systems.
Today, it’s the electronic medical records at your doctor’s office, connected sensors embedded in transformers in an electrical substation, or even more outrageously – a fusion of medical records and connected sensors in the form of fitness trackers that you wear every minute of the day.
And in every instance, new data stores are being instantiated in all corners of the business faster than you can ever imagine.
So yes, storage is a problem, but that’s not even _the_ problem.
The real problem at the enterprise level is how to catalog, organize, secure, and govern this complex federation of data.
Next, let’s talk about AI
AI is a loose collection of many different algorithms that allow machines to make predictions or decisions.
It’s a game-changer: once developed, it can automate complex tasks or aid human decision making.
There are many varieties of algorithms at our disposal today, and more are being developed constantly.
The challenge in building great AI – in addition to having the right data, of course – is picking the right algorithm for the problem.
How would you know which is the right algorithm? It’s hard to say; you may have to try a few different approaches.
Certainly, when you have many use cases, it is unlikely that a single approach can be used everywhere.
This means the problem is not just getting one algorithm to work, but having a way to apply many different types of algorithms depending on the context.
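The "try a few approaches and keep what works" idea can be sketched in plain Python. Everything here is illustrative: the tiny dataset, the two candidate models (a mean baseline and a closed-form one-feature linear fit), and the validation split are hypothetical stand-ins for a real model-selection pipeline.

```python
# Illustrative sketch: evaluate several candidate algorithms on held-out
# data and keep the best performer, rather than betting on one approach.

def mean_model(train):
    """Baseline: always predict the mean of the training targets."""
    mean = sum(y for _, y in train) / len(train)
    return lambda x: mean

def linear_model(train):
    """Closed-form least-squares fit of y = a*x + b (one feature)."""
    n = len(train)
    sx = sum(x for x, _ in train)
    sy = sum(y for _, y in train)
    sxx = sum(x * x for x, _ in train)
    sxy = sum(x * y for x, y in train)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return lambda x: a * x + b

def validation_error(model, data):
    """Mean squared error on a held-out validation set."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

# Hypothetical data where y = 2*x, so the linear candidate should win.
train = [(1, 2), (2, 4), (3, 6), (4, 8)]
valid = [(5, 10), (6, 12)]

candidates = {"mean": mean_model, "linear": linear_model}
scores = {name: validation_error(fit(train), valid)
          for name, fit in candidates.items()}
best = min(scores, key=scores.get)  # pick the lowest validation error
```

In a real pipeline the candidates would be library models and the scoring would use cross-validation, but the selection loop has the same shape.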
Finally, let’s talk about all the functional roles involved in making a use case successful.
This is probably the most often over-looked element in this whole equation.
In every enterprise data use case, many different teams must work together seamlessly to be successful.
What I mean by working together is this:
You first need the business context – someone who has the domain knowledge.
You then need the experts who can bring the data together – handling data integration and cleansing, all in a reliable and timely way.
You need people who can systematically use the data to derive answers, or use algorithms to build models that derive the answers.
These different roles exist because today’s enterprises and their business models are so vast and complex that no single team can do all these jobs.
Typically, people start with the data warehouse.
It was created to solve a very narrow and specific problem: when data is highly structured, give business analysts a way to use it for decision making.
It has many limitations:
First, it does not scale up to big data – only a small percentage of enterprise data is used in decision-making.
Second, the data warehouse does not offer a way to build AI, so there is no way to automate decision-making. Businesses still have to rely on a handful of business analysts to manually sift through the data, build dashboards, or create reports to support the business.
Instead of centralizing data and building a complex zoo of tools on top of a single storage system, there is another approach:
Separate compute and storage
The new approach uses a flexible compute layer to:
Connect to different data stores without migrating data, manage metadata across silos
Run diverse workloads to support a wide range of analytics approaches
Provide simplified interfaces for users with different skillsets and objectives
Effectively, we want to virtualize the analytics layer.
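One way to picture this virtualized analytics layer is a thin compute interface that registers many stores and queries them in place. This is only an illustrative sketch of the idea, not Databricks' implementation; the class names, store names, and records are all hypothetical.

```python
# Minimal sketch of a "virtualized" analytics layer: a single compute
# interface spans several data stores without copying data into one
# central system.

class InMemoryStore:
    """Stand-in for a warehouse, cloud bucket, or Hadoop cluster."""
    def __init__(self, records):
        self.records = records

    def scan(self):
        yield from self.records

class VirtualAnalyticsLayer:
    """Decoupled compute: registers stores and runs queries across them."""
    def __init__(self):
        self.catalog = {}  # uniform metadata across silos

    def register(self, name, store):
        self.catalog[name] = store

    def query(self, predicate):
        # One query spans every registered store -- no data migration.
        for name, store in self.catalog.items():
            for record in store.scan():
                if predicate(record):
                    yield name, record

layer = VirtualAnalyticsLayer()
layer.register("warehouse", InMemoryStore([{"user": "a", "spend": 120}]))
layer.register("cloud", InMemoryStore([{"user": "b", "spend": 30}]))
layer.register("hadoop", InMemoryStore([{"user": "c", "spend": 75}]))

# One predicate, evaluated against every silo in place.
big_spenders = list(layer.query(lambda r: r["spend"] > 50))
```

In Spark the same pattern appears as one engine reading from many sources through a common DataFrame interface; the point of the sketch is that compute, not storage, is the unifying layer.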
Viacom is the parent company of MTV and Nickelodeon. It is one of the largest media companies in the world, and its content is broadcast in more than 160 countries.
Delivering high-quality video and growing viewer engagement are core to Viacom’s mission.