Apache Spark Machine Learning


Slide deck used by Praveen Devarao for the Apache Spark Machine Learning session organized by the Bangalore Spark enthusiasts meetup group at the IBM campus on 10th September 2016. The demo notebook can be found at https://gist.github.com/praveend/fe9a0c5eacd6b43ee210e88a374eb230


Apache Spark Machine Learning - Praveen Devarao

Agenda

•  What is Machine Learning?
•  The machine learning module in Spark
•  SparkML pipelines
•  Extraction, Selection and Tuning
•  Demo

What is Machine Learning?

•  A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E
•  Field of study that gives computers the ability to learn without being explicitly programmed

How is it achieved?

•  Build mathematical models for given tasks
•  Represent the given dataset mathematically
•  Apply statistical methods to this mathematical representation
•  Tune and derive a model that can perform the needed task
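The four steps above can be illustrated with a very small sketch in plain Python. It is not taken from the talk: the dataset (hours studied versus exam score) is made up, and ordinary least squares is assumed as the statistical method purely for illustration.

```python
# Step 2: represent the dataset mathematically (made-up numbers).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]       # input feature, e.g. hours studied
ys = [52.0, 54.0, 56.0, 58.0, 60.0]  # output to predict, e.g. exam score


def fit_line(xs, ys):
    """Steps 1 and 3: model y = slope * x + intercept, fitted by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Step 4: derive the model, then use it on an unseen input.
slope, intercept = fit_line(xs, ys)


def predict(x):
    return slope * x + intercept


print(slope, intercept, predict(6.0))
```

Real workloads replace the closed-form fit with a library algorithm, but the shape of the process stays the same.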

Categories of ML

•  Supervised learning
    •  The program is “trained” on a pre-defined set of “training examples”, which then facilitate its ability to reach an accurate conclusion when given new data
    •  The goal is to learn a general rule that maps inputs to outputs
•  Unsupervised learning
    •  No labels are given to the learning algorithm, leaving it on its own to find structure (patterns and relationships) in its input
    •  Unsupervised learning can be a goal in itself (discovering hidden patterns in data) or a means towards an end (feature learning)
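To make the contrast concrete, here is a toy sketch in plain Python on one-dimensional, made-up data. The two techniques (a nearest-centroid classifier and a two-cluster k-means) are chosen only for illustration and are not named in the slides.

```python
points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
labels = ["low", "low", "low", "high", "high", "high"]  # used by the supervised case only


def nearest_centroid_fit(points, labels):
    """Supervised: the labels tell us which points belong together."""
    sums, counts = {}, {}
    for p, label in zip(points, labels):
        sums[label] = sums.get(label, 0.0) + p
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def nearest_centroid_predict(centroids, p):
    """Map a new input to the label of the closest learned centroid."""
    return min(centroids, key=lambda label: abs(p - centroids[label]))


def two_means(points, iterations=10):
    """Unsupervised: find two cluster centres without using any labels."""
    c0, c1 = min(points), max(points)
    for _ in range(iterations):
        g0 = [p for p in points if abs(p - c0) <= abs(p - c1)]
        g1 = [p for p in points if abs(p - c0) > abs(p - c1)]
        if g0:
            c0 = sum(g0) / len(g0)
        if g1:
            c1 = sum(g1) / len(g1)
    return c0, c1


centroids = nearest_centroid_fit(points, labels)
print(nearest_centroid_predict(centroids, 4.1))  # supervised prediction for new data
c0, c1 = two_means(points)
print(c0, c1)  # structure discovered without labels
```

The supervised function returns a named label because it was given names to learn from; the unsupervised one can only report that two groups exist and where they are.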

Categories of ML

[Figure: two plots over features f1 and f2, one labeled Supervised and one labeled Un-Supervised]

SparkML - The Machine learning module of Spark

•  APIs based on DataFrames
    •  Distributed collection of data organized as columns
•  Contains commonly used ML algorithms
    •  Classification
    •  Regression
    •  Clustering
•  Featurization - feature extraction, transformation, dimensionality reduction, and selection
•  Pipelines - tools for constructing, evaluating, and tuning
•  Persistence of models and pipelines

SparkML Pipelines

•  Transformer: Algorithm to transform one DataFrame into another
•  Estimator: Algorithm applied on a DataFrame to produce a Transformer
•  Parameters: Factors affecting the Estimators
•  Pipeline: Chain of multiple Transformers and Estimators that forms the ML flow
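The relationship between these four terms can be sketched with a toy model in plain Python. This is not Spark's implementation: every class name below is hypothetical, and lists of dicts stand in for DataFrames. Only the `fit()`/`transform()` contract mirrors what Spark's Estimators and Transformers expose.

```python
class LowerCaseTokenizer:
    """Transformer: maps rows to rows; nothing is learned."""

    def transform(self, rows):
        return [dict(row, words=row["text"].lower().split()) for row in rows]


class VocabularyModel:
    """Transformer produced by VocabularyEstimator.fit()."""

    def __init__(self, vocab):
        self.vocab = vocab

    def transform(self, rows):
        return [dict(row, features=[row["words"].count(w) for w in self.vocab])
                for row in rows]


class VocabularyEstimator:
    """Estimator: fit() learns from the data and returns a Transformer."""

    def __init__(self, min_count=1):
        self.min_count = min_count  # a parameter that changes what fit() learns

    def fit(self, rows):
        counts = {}
        for row in rows:
            for word in row["words"]:
                counts[word] = counts.get(word, 0) + 1
        vocab = sorted(w for w, c in counts.items() if c >= self.min_count)
        return VocabularyModel(vocab)


class ToyPipelineModel:
    """Fitted pipeline: a chain made of Transformers only."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class ToyPipeline:
    """Pipeline: fits Estimators in order, feeding each stage's output onward."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):      # Estimator: replace it by its fitted model
                stage = stage.fit(rows)
            rows = stage.transform(rows)   # every fitted stage is a Transformer
            fitted.append(stage)
        return ToyPipelineModel(fitted)


train = [{"text": "Spark ML pipelines"}, {"text": "spark is fast"}]
model = ToyPipeline([LowerCaseTokenizer(), VocabularyEstimator(min_count=2)]).fit(train)
out = model.transform([{"text": "Spark Spark rocks"}])
print(out[0]["features"])
```

Fitting the pipeline returns another Transformer, so the same chain of steps can be replayed on new data without relearning anything.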

Extractors

•  Algorithms to extract features from raw data
•  TermFrequency-InverseDocumentFrequency (TF-IDF)
•  Word2Vec:
    •  2 layer neural network that converts words to vectors
•  CountVectorizer:
    •  Converts text documents to vectors of token counts
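As a rough illustration of what TF-IDF computes, the sketch below works through three made-up, already tokenized documents in plain Python. It uses raw counts as term frequency and the smoothed form log((N + 1) / (df + 1)) for inverse document frequency, which is the formula given in Spark's feature extraction guide; the function names are hypothetical.

```python
import math

docs = [
    ["spark", "ml", "spark"],
    ["ml", "pipeline"],
    ["spark", "streaming"],
]


def term_frequency(doc):
    """Raw count of each token in one document."""
    counts = {}
    for token in doc:
        counts[token] = counts.get(token, 0) + 1
    return counts


def inverse_document_frequency(docs):
    """log((N + 1) / (df + 1)), where df is the number of documents containing the token."""
    n = len(docs)
    df = {}
    for doc in docs:
        for token in set(doc):
            df[token] = df.get(token, 0) + 1
    return {token: math.log((n + 1) / (d + 1)) for token, d in df.items()}


idf = inverse_document_frequency(docs)
tfidf = [{t: tf * idf[t] for t, tf in term_frequency(doc).items()} for doc in docs]

for weights in tfidf:
    print(weights)
```

Tokens that occur in few documents ("pipeline", "streaming") receive a larger IDF than tokens that occur in most of them ("spark", "ml").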

Transformers and Selectors

•  Transformers:
    •  Algorithms for scaling, modifying or converting features
    •  Tokenizer
    •  StringIndexer
    •  VectorAssembler
    •  PCA
•  Selectors:
    •  Algorithms for selecting a subset from a larger set of features
    •  VectorSlicer
    •  RFormula
    •  ChiSqSelector
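The idea behind two of the listed transformers can be imitated in a few lines of plain Python. This is a sketch of the concept only, not Spark code: the function names and the sample rows are hypothetical. It assumes the frequency-descending ordering that StringIndexer uses by default (the most frequent value gets index 0) and breaks ties alphabetically.

```python
from collections import Counter


def fit_string_index(values):
    """StringIndexer-like: most frequent value gets 0.0, next gets 1.0, and so on."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}


def assemble(row, columns):
    """VectorAssembler-like: concatenate the chosen numeric columns into one vector."""
    return [float(row[c]) for c in columns]


rows = [
    {"colour": "red", "width": 2.0, "height": 1.0},
    {"colour": "blue", "width": 3.0, "height": 4.0},
    {"colour": "red", "width": 5.0, "height": 6.0},
]

index = fit_string_index(row["colour"] for row in rows)
for row in rows:
    row["colour_idx"] = index[row["colour"]]

vectors = [assemble(row, ["colour_idx", "width", "height"]) for row in rows]
print(index)
print(vectors)
```

The indexing step has to see the data before it can transform it, while the assembling step does not, which is why the first is an Estimator in Spark and the second a plain Transformer.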

Model Evaluation Techniques

•  Evaluation:
    •  F1 Score
        Calculate precision and recall from the confusion matrix:

        $$\text{precision} = \frac{\text{True Positives}}{\text{Predicted Positives}}, \qquad \text{recall} = \frac{\text{True Positives}}{\text{Actual Positives}}$$
    •  ROC

Confusion Matrix

                     Predicted Positive    Predicted Negative
Actual Positive      True Positive         False Negative
Actual Negative      False Positive        True Negative
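A short worked example in plain Python shows how the three numbers follow from the confusion matrix. The counts are made up. The F1 score is taken as the harmonic mean of precision and recall, which is its standard definition but is not spelled out on the slide.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    predicted_positives = tp + fp   # first column of the confusion matrix
    actual_positives = tp + fn      # first row of the confusion matrix
    precision = tp / predicted_positives if predicted_positives else 0.0
    recall = tp / actual_positives if actual_positives else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up counts: 40 true positives, 10 false positives, 20 false negatives.
precision, recall, f1 = precision_recall_f1(tp=40, fp=10, fn=20)
print(precision, recall, f1)
```

True negatives do not appear in any of the three formulas, which is why F1 is often preferred over accuracy when the negative class dominates.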

SparkML Evaluators and Tuning

•  Evaluators:
    •  BinaryClassificationEvaluator
        •  areaUnderROC & areaUnderPR
    •  MulticlassClassificationEvaluator
        •  f1, weightedPrecision, weightedRecall
    •  RegressionEvaluator
        •  MSE, RMSE
•  Model Tuning and Selection:
    •  CrossValidator
        •  k (train, test) dataset pairs are created, one per fold
        •  Trains and evaluates for different param settings
        •  Expensive
    •  TrainValidationSplit
        •  1 (train, test) dataset pair is created
        •  Evaluates each combination of the params only once
        •  Less expensive than cross-validation
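The cost difference between the two tuning strategies comes down to how many (train, test) pairs each parameter combination is evaluated on. The plain Python sketch below counts model fits for a small grid. The helper functions are hypothetical, the parameter names `regParam` and `maxIter` are used only as illustrative labels, and the splits are deterministic here, whereas Spark splits the data randomly.

```python
from itertools import product


def k_fold_pairs(n_rows, k):
    """CrossValidator-like: k (train, test) pairs, each fold used once as test."""
    indices = list(range(n_rows))
    folds = [indices[i::k] for i in range(k)]
    pairs = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        pairs.append((train, test))
    return pairs


def train_validation_pair(n_rows, train_ratio=0.75):
    """TrainValidationSplit-like: a single (train, test) pair."""
    cut = int(n_rows * train_ratio)
    indices = list(range(n_rows))
    return [(indices[:cut], indices[cut:])]


# A 2 x 3 grid of candidate settings gives 6 parameter combinations.
param_grid = [dict(zip(("regParam", "maxIter"), combo))
              for combo in product([0.01, 0.1], [10, 50, 100])]

pairs = k_fold_pairs(12, k=3)
cv_fits = len(param_grid) * len(pairs)                       # each combination fitted k times
tvs_fits = len(param_grid) * len(train_validation_pair(12))  # each combination fitted once

print(cv_fits, tvs_fits)
```

With k folds, cross-validation costs k times as many fits as a single split, in exchange for a less noisy estimate of each combination's quality.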

