"Hadoop: What we've learned in 5 years", Martin Oberhuber, Senior Data Scientist at ThinkBig



"Hadoop 2015: What we’ve learned in 5 years", Martin Oberhuber, Senior Data Scientist at ThinkBig YouTube Link: https://www.youtube.com/watch?v=odOTsGgfzm8 Watch more from Data Natives 2015 here: http://bit.ly/1OVkK2J Visit the conference website to learn more: www.datanatives.io Follow Data Natives: https://www.facebook.com/DataNatives https://twitter.com/DataNativesConf Stay Connected to Data Natives by Email: Subscribe to our newsletter to get the news first about Data Natives 2016: http://bit.ly/1WMJAqS


1st professional services provider with 100% focus on open source and the Big Data Hadoop ecosystem
• Founded 2010
• 100+ Successful Programs
• 80+ Clients
• Global Delivery Capabilities

3 Core Analytic Solution Domains
Device Analytics, Customer Analytics, Risk Analytics
• eCommerce: 2 of Global Top 5
• Retail: 2 of Global Top 5
• Social Networking: Global #1
• Telecommunications: 2 of Global Top 5
• Media & Advertising: 2 of Global Top 5
• Internet Transaction Security: Global #1
• Semiconductor: 2 of Global Top 5
• Data Storage Device: 3 of Global Top 5
• Disk Manufacturing: Global #1
• Telecommunications: 2 of Global Top 5
• Brokerage & Mutual Funds: 2 of Global Top 5
• Asset Management: Global #1
• Credit Issuer: 2 of Global Top 5
• Banking: 4 of Global Top 10
• Financial Data Services: 2 of Global Top 5

Think Big VELOCITY Methodology
• Big Data Strategy
• Think Big Academy
• Big Data Program Management
• Business Analytics
• Managed Services
• Data Engineering
• Big Data Lab
Think Big engages with its clients' business, technical, analyst and support teams in an agile-inspired VELOCITY Methodology to continuously develop Big Data solutions.

Think Big Enterprise Data Lake
Information Sources → Enterprise Data Lake → Downstream Applications
Source preparation: Evaluate Source Data, Prepare Source Metadata, Prepare Data for Ingest
Enterprise Data Lake: Ingest, Sequence, Automate, Apply Structure, Compress, Protect
Supporting components: Dashboard Engine, Collect & Manage Metadata, Perimeter-Authentication-Authorisation

Think Big Data Science
- Customer Segmentation
- Product Performance Analysis
- Ad Hoc Insights
- Cluster Analysis
- Path Analysis
- Fraud Prediction
- Churn Modeling
- Demand Forecasting
- Predictive Asset Maintenance
- Energy Consumption Prediction
- Proactive Customer Support
- Display Targeting Optimization
- Dynamic Pricing
- Promotion Strategy Optimization
- A/B testing
Descriptive Statistics
Predictive Modeling
Real-time Optimization

Leveraging Expertise Across Industries
- Dynamic Pricing
- Fraud Detection
- Customer Segmentation
- Recommendation Engine
- Predictive Asset Maintenance
- Proactive Customer Support
- Credit Default Prediction
- Churn Modeling
- Scenario Simulation
- A/B Testing
- Display Targeting Optimization
- Demand Forecast
- Cluster Analysis & Segmentation
Domains: Device Analytics, Risk Analytics, Customer Analytics

Predictive Modeling Approach
Model Building (offline): Historical Data → Features + Target → Machine Learning → Model
Model Scoring (online): New Data → Features → Model → Scoring → Prediction
Analytics GUI → Act!
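The split between offline model building and online scoring can be illustrated with a small sketch. It is not the speaker's implementation: the nearest-centroid classifier, the JSON serialization, the function names `build_model` and `score`, and the toy churn data are all assumptions chosen to keep the example self-contained.

```python
import json
from statistics import mean


def build_model(history):
    """Offline step: learn one centroid per target class from historical data.

    `history` is a list of (features, target) pairs (hypothetical format).
    """
    by_target = {}
    for features, target in history:
        by_target.setdefault(target, []).append(features)
    centroids = {
        target: [mean(column) for column in zip(*rows)]
        for target, rows in by_target.items()
    }
    # Serialize so the model can be shipped to a separate scoring service.
    return json.dumps(centroids)


def score(model_json, features):
    """Online step: predict the target for one new record."""
    centroids = json.loads(model_json)

    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))

    return min(centroids, key=lambda target: squared_distance(centroids[target]))


# Toy historical data: [support_calls, monthly_usage] -> outcome (made up).
history = [
    ([1.0, 20.0], "stays"),
    ([2.0, 18.0], "stays"),
    ([9.0, 2.0], "churns"),
    ([8.0, 3.0], "churns"),
]
model = build_model(history)            # built once, offline
prediction = score(model, [8.5, 2.5])   # applied per record, online
```

The point of the separation is that `build_model` can run as a slow batch job over historical data, while `score` only needs the serialized model and one new record.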

Data Science Approaches
Single Workstation
- Small data sets
- No distributed analytics across multiple nodes
- Powerful tools are R or Python
- Data Scientist can focus on business problem
Mixed: Single Workstation + Cluster
- Small or large data sets
- Data wrangling and feature engineering are performed on cluster
- Predictive analysis and modeling can be performed on single workstation
- Powerful tools are Hadoop Streaming and Spark combined with R and Python
- Data Scientist has to parallelize some data mining tasks
Cluster
- Large data sets
- Both data wrangling and modeling are performed on cluster
- Spark is one of the few tools that support efficient parallel machine learning
- Parallelizing machine learning algorithms is challenging
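For the mixed approach, feature engineering in the Hadoop Streaming style might look like the sketch below. Hadoop Streaming runs a mapper and a reducer as scripts that read lines from stdin and write tab-separated key/value lines to stdout, with the framework sorting by key in between. Here both steps are plain Python functions and the sort is simulated in-process so the example runs on a single workstation; the input format (`customer_id,amount`) and the per-customer count and sum features are assumptions for illustration.

```python
from itertools import groupby


def mapper(lines):
    """Map step: turn 'customer_id,amount' records into 'key<TAB>value' lines."""
    for line in lines:
        customer_id, amount = line.rstrip("\n").split(",")
        yield f"{customer_id}\t{amount}"


def reducer(sorted_lines):
    """Reduce step: aggregate per-customer features from key-sorted input."""
    pairs = (line.split("\t") for line in sorted_lines)
    for customer_id, group in groupby(pairs, key=lambda pair: pair[0]):
        amounts = [float(value) for _, value in group]
        # Emit: customer_id, number of transactions, total amount.
        yield f"{customer_id}\t{len(amounts)}\t{sum(amounts):.2f}"


# Simulate map -> shuffle/sort -> reduce in one process.
raw = ["c1,10.0", "c2,5.5", "c1,2.5"]
features = list(reducer(sorted(mapper(raw))))
```

The resulting feature table is small (one row per customer), which is what makes it practical to pull onto a single workstation for modeling in R or Python.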

Productionizing Analytics
Data Sources → Ingestion → Data Lake (HDFS) → Core Data Science → Production
Production:
• Model scoring
• Dashboards
Plug & play model deployment
Real-time Data → Real-time Optimization with Multi-armed Bandit
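The slide names multi-armed bandits for real-time optimization but does not say which strategy is used. As an illustration only, the sketch below uses epsilon-greedy, one of the simplest bandit strategies: most of the time it plays the arm with the best observed reward rate, and with probability `epsilon` it explores a random arm. The simulated conversion rates, the function name, and all parameter values are assumptions.

```python
import random


def epsilon_greedy(true_rates, rounds, epsilon=0.1, seed=0):
    """Simulate an epsilon-greedy bandit against arms with fixed success rates.

    `true_rates` stands in for the unknown real-world conversion rates
    (for example, of competing ad variants); it is only used to draw rewards.
    """
    rng = random.Random(seed)
    n_arms = len(true_rates)
    pulls = [0] * n_arms      # how often each arm was played
    value = [0.0] * n_arms    # running mean reward per arm
    for _ in range(rounds):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                        # explore
        else:
            arm = max(range(n_arms), key=lambda a: value[a])   # exploit
        reward = 1.0 if rng.random() < true_rates[arm] else 0.0
        pulls[arm] += 1
        # Incremental update of the mean reward for the chosen arm.
        value[arm] += (reward - value[arm]) / pulls[arm]
    return pulls, value


pulls, value = epsilon_greedy([0.1, 0.2, 0.8], rounds=5000)
best_arm = pulls.index(max(pulls))
```

Unlike a fixed-split A/B test, the bandit shifts traffic toward the better-performing arm while the experiment is still running, which is why it suits real-time data.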
