BIG DATA AND MACHINE LEARNING
1. BIG DATA AND MACHINE LEARNING
Big Data & IoT
Lecture #3
Umair Shafique (03246441789)
Scholar MS Information Technology - University of Gujrat
2. Table of contents
• Define big data
• Big data as 10 V's
• Some pros and cons of big data
• Perceived challenges of big data
• Define machine learning
• Real-world examples
• Working flow of ML
• Types of ML
• Challenges of ML
• Relate big data with ML
• Features of ML with big data
• Framework based on ML for big data processing
• Tools and technologies for big data and ML
• Difference b/w ML and big data
• Research challenges and open issues
• Summary
• References
3. What is Big Data?
Big Data is a collection of data that is huge in volume and growing exponentially with time. Its size and complexity are so great that no traditional data management tool can store or process it efficiently.
4. Who’s Generating Big Data?
Progress and innovation are no longer hindered by the ability to collect data, but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion.
9. What is Machine Learning?
Machine learning is an application of AI that gives systems the ability to learn on their own and improve from experience without being explicitly programmed. If your computer had machine learning, it might be able to play difficult parts of a game or solve a complicated mathematical equation for you.
10. Real world examples of machine learning
Machine learning is relevant in many fields, industries, and has the capability to grow over time. Here are
six real-life examples of how machine learning is being used.
1. Image recognition
Image recognition is a well-known and widespread example of machine learning in the real world. It can identify an object in a digital image, based on the intensity of the pixels in black-and-white or colour images.
e.g.
• Label an x-ray as cancerous or not
• Assign a name to a photographed face (aka “tagging” on social media)
• Recognise handwriting by segmenting a single letter into smaller images
• Machine learning is also frequently used for facial recognition within an image. Using a database of
people, the system can identify commonalities and match them to faces. This is often used in law
enforcement.
11. 2. Speech recognition
Machine learning can translate speech into text. Certain software applications can convert live voice and recorded
speech into a text file. The speech can be segmented by intensities on time-frequency bands as well.
• Voice search
• Voice dialling
• Appliance control
• Speech recognition software powers some of the most common consumer devices, such as Google Home and Amazon Alexa.
3. Medical diagnosis
Machine learning can help with the diagnosis of diseases. Many physicians use chatbots with speech recognition
capabilities to discern patterns in symptoms.
• Assisting in formulating a diagnosis or recommending a treatment option
• Oncology and pathology use machine learning to recognise cancerous tissue
• Analyse bodily fluids
• In the case of rare diseases, the joint use of facial recognition software and machine learning helps scan patient
photos and identify phenotypes that correlate with rare genetic diseases.
12. 4. Predictive analytics
Machine learning can classify available data into groups, which are then defined by rules set by analysts. When the
classification is complete, the analysts can calculate the probability of a fault.
• Predicting whether a transaction is fraudulent or legitimate
• Improve prediction systems to calculate the possibility of fault
• Predictive analytics is one of the most promising examples of machine learning. It is applicable to everything from product development to real-estate pricing.
5. Extraction
Machine learning can extract structured information from unstructured data. Organizations amass huge volumes of
data from customers. A machine learning algorithm automates the process of annotating datasets for predictive
analytics tools.
• Generate a model to predict vocal cord disorders
• Develop methods to prevent, diagnose, and treat the disorders
• Help physicians diagnose and treat problems quickly
• Typically, these processes are tedious. But machine learning can track and extract information to obtain billions
of data samples.
13. How Machine Learning Works?
Consider a system with input data that contains photos of various kinds of fruits. You want the system to
group the data according to the different types of fruits.
First, the system will analyze the input data. Next, it tries to find patterns, such as shape, size, and color. Based
on these patterns, the system will try to predict the different types of fruit and segregate them. Finally, it
keeps track of all the decisions it made during the process to ensure it is learning. The next time you ask
the same system to predict and segregate the different types of fruits, it won't have to go through the
entire process again. That’s how machine learning works.
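The workflow above can be sketched in a few lines of plain Python. The (size, colour) feature vectors, the distance threshold, and the `group` helper are all invented for illustration; the point is only that the learned centroids are kept and reused, so a second batch of fruit does not repeat the whole process.

```python
# Illustrative sketch only: made-up (size_cm, redness) features stand in
# for real fruit photos; names and thresholds are assumptions.

def distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def group(samples, threshold=2.0, centroids=None):
    """Greedy grouping: assign each sample to the nearest known centroid,
    or start a new group if none is close enough. Returning the centroids
    lets a later call skip re-learning (the 'remember decisions' step)."""
    if centroids is None:
        centroids = []
    labels = []
    for s in samples:
        best, best_d = None, float("inf")
        for i, c in enumerate(centroids):
            d = distance(s, c)
            if d < best_d:
                best, best_d = i, d
        if best is None or best_d > threshold:
            centroids.append(list(s))
            best = len(centroids) - 1
        labels.append(best)
    return labels, centroids

# (size_cm, redness): apples are small and red, watermelons big and green
fruits = [(8, 9), (7, 8), (30, 2), (29, 1)]
labels, learned = group(fruits)
# The second batch reuses the learned centroids instead of starting over.
labels2, _ = group([(8, 8), (31, 2)], centroids=learned)
```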
14. Types of Machine Learning
• Supervised machine learning: You supervise the machine
while training it to work on its own. This requires labeled
training data
• Unsupervised learning: There is training data, but it won’t
be labeled
• Reinforcement learning: The system learns on its own through trial and error, guided by rewards and penalties
15. Supervised Learning
To understand how supervised learning works, look at the example
below, where you have to train a model or system to recognize an
apple.
• First, you have to provide a data set that contains pictures of a
kind of fruit, e.g., apples.
• Then, provide another data set that lets the model know that
these are pictures of apples. This completes the training phase.
• Next, provide a new set of data that only contains pictures of
apples. At this point, the system can recognize what the fruit is and
will remember it.
• That's how supervised learning works. You are training the model
to perform a specific operation on its own. This kind of model is
often used in filtering spam mail from your email accounts.
16. Supervised learning includes:
Classification: Classification is a typical supervised learning task. The spam filter mentioned above is one example: it is trained on many example emails, each labeled with its class (Spam, Not-Spam), and then classifies new emails automatically.
Used for:
• Spam filtering
• Sentiment analysis
• Recognition of handwritten characters and numbers
• Fraud detection
Popular algorithms: Naive Bayes, Decision Tree, Linear Regression, Logistic Regression, K-Nearest
Neighbors, Support Vector Machine, Neural Networks
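As an illustration of the classification idea, here is a toy Naive Bayes spam filter in plain Python. The four training emails and their words are invented for the example; a real filter would train on far more data.

```python
# Toy Naive Bayes spam filter over word counts — a sketch, not production code.
import math
from collections import Counter

# Tiny invented training set: (text, class) pairs
train = [
    ("win money now", "spam"),
    ("free money offer", "spam"),
    ("meeting at noon", "ham"),
    ("lunch at noon tomorrow", "ham"),
]

word_counts = {"spam": Counter(), "ham": Counter()}
doc_counts = Counter()
for text, label in train:
    doc_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def classify(text):
    """Return the class with the highest log-probability under a
    bag-of-words model with add-one (Laplace) smoothing."""
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        score = math.log(doc_counts[label] / sum(doc_counts.values()))
        total = sum(counts.values())
        for word in text.split():
            score += math.log((counts[word] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

With this training set, `classify("free money")` lands on the spam class because "free" and "money" only appear in spam examples.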
Regression: Regression is essentially classification where we forecast a number instead of a category. Examples are predicting a car's price from its mileage, traffic by time of day, demand volume from company growth, etc. Regression is a natural fit when something depends on time.
Used for:
• Stock price forecasts
• Demand and sales volume analysis
• Medical diagnosis
• Any number-time correlations
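The car-price-by-mileage example can be sketched with ordinary least squares in closed form; the mileage and price figures below are made up for illustration.

```python
# Fit price as a linear function of mileage (illustrative data).

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b, closed form."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b

mileage = [10_000, 50_000, 90_000, 130_000]
price   = [20_000, 16_000, 12_000,  8_000]   # invented numbers

a, b = fit_line(mileage, price)
predicted = a * 70_000 + b   # forecast a number, not a category
```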
17. Unsupervised Learning
Consider a cluttered dataset: a collection of pictures of different fruits. You feed this data to the model, and the model analyzes it to recognize any patterns. In the end, the machine categorizes the photos into groups based on their similarities. Flipkart uses this kind of model to find and recommend products that are well suited for you.
Unsupervised learning includes:
• Clustering: A clustering algorithm tries to find objects that are similar (by some features) and merge them into a cluster. Objects with many similar features are joined in one class. With some algorithms, you can even specify the exact number of clusters you want.
Used:
• For market segmentation (types of customers, loyalty)
• For image compression
• To analyze and label new data
• To detect abnormal behavior
Popular Clustering algorithms are:
• K-Means
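A bare-bones version of K-Means on made-up one-dimensional data might look like the sketch below (naive seeding, fixed iteration count); real use would rely on a library implementation such as scikit-learn's.

```python
# Minimal K-Means sketch: alternate between assigning points to the
# nearest centroid and recomputing centroids as cluster means.

def kmeans(points, k, iterations=10):
    centroids = points[:k]  # naive seeding: first k points
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[i].append(p)
        # Empty clusters keep their old centroid.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Two obvious groups around 1.0 and 10.0 (invented data)
data = [1.0, 1.2, 0.8, 10.0, 10.5, 9.5]
centroids, clusters = kmeans(data, k=2)
```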
18. Reinforcement Learning
The system learns by interacting with an environment, receiving rewards or penalties for its actions.
Used today for:
• Game playing (chess, Go, video games)
• Robotics and autonomous vehicle control
• Resource management and scheduling
• Online recommendation and real-time decision systems
19. Main Challenges of Machine Learning
• Poor-Quality Data
• Irrelevant Features
• Testing and Validating
20. Big Data & Machine Learning (How Do They Relate?)
To recap, big data refers to vast amounts of data that traditional storage
methods cannot handle. Machine learning is the ability of computer systems to learn to
make predictions from observations and data. Machine learning can use the information
provided by the study of big data to generate valuable business insights.
Machine learning tools use data-driven algorithms and statistical models to analyze data
sets and then draw inferences from identified patterns or make predictions based on them.
The algorithms learn from the data as they run against it, as opposed to traditional rules-
based analytics systems that follow explicit instructions.
Big data provides ample amounts of raw material from which machine learning systems
can derive insights. By combining them, organizations are producing significant analytics
findings and results.
21. Features of Machine Learning with Big Data
•Sparse Representation
•Mining Structured Relations
•High Scalability and High Speed.
23. Big data processing procedure with machine learning:
We suppose the big data processing procedure mainly consists of the following four
phases:
• pre-processing phase
• analysis phase
• model establishment phase
• model updating phase
24. Tools and technologies for big data and ML:
• Snowflake
• Matplotlib
• TensorFlow
• BigML
• Apache Spark
• KNIME
• Cloudera
27. Summary of lecture
• In this lecture, we first provided an overview of big data and summarized its characteristics.
• We then gave an overview of machine learning. To highlight the differences of machine learning techniques in the context of big data, we analyzed the new features of machine learning with big data.
• Next, we related big data and machine learning.
• We also proposed a reference framework for processing big data based on machine learning techniques with the power of distributed storage and parallel computing. Finally, we presented several research challenges and open issues.
• We hope that this lecture can stimulate more interest in research and development of machine learning techniques for big data processing.
Better decision-making: In the NewVantage Partners survey, 36.2 percent of respondents said that better decision-making was the number one goal of their big data analytics efforts. In addition, 84.1 percent had started working toward that goal, and 59.0 percent had experienced some measurable success, for an overall success rate of 69.0 percent. Analytics can give business decision-makers the data-driven insights they need to help their companies compete and grow.
Increased productivity: A separate survey from vendor Syncsort found that 59.9 percent of respondents were using big data tools like Hadoop and Spark to increase business user productivity. Modern big data tools are allowing analysts to analyze more data, more quickly, which increases their personal productivity. In addition, the insights gained from those analytics often allow organizations to increase productivity more broadly throughout the company.
Reduce costs: Both the Syncsort and the NewVantage surveys found that big data analytics were helping companies decrease their expenses. Nearly six out of ten (59.4 percent) respondents told Syncsort big data tools had helped them increase operational efficiency and reduce costs, and about two thirds (66.7 percent) of respondents to the NewVantage survey said they had started using big data to decrease expenses. Interestingly, however, only 13.0 percent of respondents selected cost reduction as their primary goal for big data analytics, suggesting that for many this is merely a very welcome side benefit.
Improved customer service: Among respondents to the NewVantage survey, improving customer service was the second most common primary goal for big data analytics projects, and 53.4 percent of companies had experienced some success in this regard. Social media, customer relationship management (CRM) systems and other points of customer contact give today’s enterprises a wealth of information about their customers, and it is only natural that they would use this data to better serve those customers.
Fraud detection: Another common use for big data analytics — particularly in the financial services industry — is fraud detection. One of the big advantages of big data analytics systems that rely on machine learning is that they are excellent at detecting patterns and anomalies. These abilities can give banks and credit card companies the ability to spot stolen credit cards or fraudulent purchases, often before the cardholder even knows that something is wrong.
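One simple, illustrative version of this pattern-and-anomaly idea is a z-score check against a customer's usual transaction amounts. The history and the 2.5-sigma cutoff below are assumptions for the sketch, not a production fraud model.

```python
# Flag transactions that deviate strongly from a customer's usual amounts.
import statistics

def flag_anomalies(amounts, threshold=2.5):
    """Return amounts more than `threshold` standard deviations
    from the mean (a basic z-score anomaly check)."""
    mean = statistics.mean(amounts)
    stdev = statistics.pstdev(amounts)
    return [a for a in amounts if stdev and abs(a - mean) / stdev > threshold]

# Invented history: everyday charges, plus one suspicious one
history = [20, 25, 22, 18, 24, 21, 19, 23, 900]
suspicious = flag_anomalies(history)
```

A real system would use a richer model per customer (time, merchant, location), but the core idea of scoring deviation from learned behavior is the same.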
Greater innovation: Innovation is another common benefit of big data, and the NewVantage survey found that 11.6 percent of executives are investing in analytics primarily as a means to innovate and disrupt their markets. They reason that if they can glean insights that their competitors don’t have, they may be able to get out ahead of the rest of the market with new products and services.
Need for talent: Data scientists and big data experts are among the most highly coveted —and highly paid — workers in the IT field. The AtScale survey found that the lack of a big data skill set has been the number one big data challenge for the past three years. And in the Syncsort survey, respondents ranked skills and staff as the second biggest challenge when creating a data lake. Hiring or training staff can increase costs considerably, and the process of acquiring big data skills can take considerable time.
Data quality: In the Syncsort survey, the number one disadvantage to working with big data was the need to address data quality issues. Before they can use big data for analytics efforts, data scientists and analysts need to ensure that the information they are using is accurate, relevant and in the proper format for analysis. That slows the reporting process considerably, but if enterprises don’t address data quality issues, they may find that the insights generated by their analytics are worthless — or even harmful if acted upon.
Need for cultural change: Many of the organizations that are utilizing big data analytics don’t just want to get a little bit better at reporting, they want to use analytics to create a data-driven culture throughout the company. In fact, in the NewVantage survey, a full 98.6 percent of executives said that their firms were in the process of creating this new type of corporate culture. However, changing culture is a tall order. So far, only 32.4 percent were reporting success on this front.
Rapid change: Another potential drawback to big data analytics is that the technology is changing rapidly. Organizations face the very real possibility that they will invest in a particular technology only to have something much better come along a few months later. Syncsort respondents ranked this disadvantage of big data fourth among all the potential challenges they faced.
Hardware needs: Another significant issue for organizations is the IT infrastructure necessary to support big data analytics initiatives. Storage space to house the data, networking bandwidth to transfer it to and from analytics systems, and compute resources to perform those analytics are all expensive to purchase and maintain. Some organizations can offset this problem by using cloud-based analytics, but that usually doesn’t eliminate the infrastructure problems entirely.
Costs: Many of today’s big data tools rely on open source technology, which dramatically reduces software costs, but enterprises still face significant expenses related to staffing, hardware, maintenance and related services. It’s not uncommon for big data analytics initiatives to run significantly over budget and to take more time to deploy than IT managers had originally anticipated.
Main Challenges of Machine Learning:
In short, since our main task is to select a learning algorithm and train it on some data, the two things that can go wrong are “bad algorithm” and “bad data.” Machine Learning is not quite there yet; it takes a lot of data for most Machine Learning algorithms to work properly.
Poor-Quality Data: Obviously, if your training data is full of errors, outliers, and noise (e.g., due to poor quality measurements), it will make it harder for the system to detect the underlying patterns, so your system is less likely to perform well.
Irrelevant Features: Your system will only be capable of learning if the training data contains enough relevant features and not too many irrelevant ones.
Testing and Validating
The only way to know how well a model will generalize to new cases is to try it out on new cases. The recommended option is to split your data into two sets: the training set and the test set. As these names imply, you train the model using the training set, and you test it using the test set. The error rate on new cases is called the generalization error and by evaluating your model on the test set, you get an estimate of this error. This value tells you how well your model will perform on instances it has never seen before.
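The recipe above can be sketched as follows, with synthetic data and a deliberately trivial one-parameter model; the 80/20 split and the held-out error estimate are the point, not the model itself.

```python
# Train/test split and generalization-error estimate on synthetic data.
import random

random.seed(0)
# Invented labeled data: feature x, label "is x above 0.5"
data = [(x, x > 0.5) for x in (random.random() for _ in range(100))]

random.shuffle(data)
cut = int(len(data) * 0.8)                 # 80/20 train/test split
train_set, test_set = data[:cut], data[cut:]

# "Train" a one-parameter model: threshold halfway between class means
pos = [x for x, y in train_set if y]
neg = [x for x, y in train_set if not y]
threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

# Estimate the generalization error on held-out data only
errors = sum((x > threshold) != y for x, y in test_set)
generalization_error = errors / len(test_set)
```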
In this section, we highlight in detail three abilities of machine learning techniques that are useful for dealing with big data problems: sparse representation and feature selection, mining structured relations, and high scalability and high speed.
Sparse Representation
For the high-dimensional data, it is difficult to handle by using traditional data processing methods. Therefore, effective dimension reduction is increasingly viewed as a necessary step in dealing with these problems. In terms of high-dimensional big data, we highlight the feature selection and sparse representation methods for machine learning techniques, which are two commonly adopted approaches in dealing with high-dimensional data.
Feature selection is a key issue in building robust data processing models through the process of selecting a subset of meaningful features. It should be able to help visualize the data, to construct better statistical models, and improve prediction accuracy through mapping the high dimensional data into the underlying low dimensional manifold. And for high-dimensional big data, a sparse data representation is more and more important for many algorithms.
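A minimal filter-style sketch of feature selection: drop columns whose variance across samples is near zero, since they carry no information about the differences between samples. The toy matrix and the variance cutoff are illustrative assumptions, not a recommended method for real high-dimensional data.

```python
# Variance-threshold feature selection on a toy sample matrix.
import statistics

# Rows are samples; columns are features. Columns 1 and 2 are constant.
samples = [
    [1.0, 5.0, 0.0],
    [2.0, 5.0, 0.0],
    [3.0, 5.0, 0.0],
    [4.0, 5.0, 0.0],
]

def select_features(rows, min_variance=1e-9):
    """Keep only columns whose population variance exceeds the cutoff,
    and return the kept column indices plus the reduced matrix."""
    cols = list(zip(*rows))
    keep = [i for i, col in enumerate(cols)
            if statistics.pvariance(col) > min_variance]
    return keep, [[row[i] for i in keep] for row in rows]

kept, reduced = select_features(samples)
```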
Mining Structured Relations
Big data generally comes from different sources with obviously heterogeneous types, including structured, unstructured, and semi-structured representation forms. Dealing with such a heterogeneous dataset poses a perceivable challenge, so a machine learning system needs to infer the structure behind the data when it is not known beforehand. One way of structuring data is to discover relevance based on inherent data properties through structured learning and structured prediction.
The main purpose of mining structured relations from a set of data is to aggregate massive amounts of data and divide it into smaller chunks which can be easily handled by machine learning systems.
High Scalability and High Speed
The unprecedented volumes of big data demand very high scalability from data mining and processing tools. In current research, techniques to enhance the scalability of machine learning algorithms mainly focus on two aspects: i) the scalability of cloud computing makes it possible to analyze enormous datasets by aggregating multiple workloads with varying performance goals into multi-tenanted computing clusters; machine learning with cloud computing offers higher efficiency and performance for processing and analyzing big data; ii) distributed storage and parallel computing have helped solve the scalability problems of machine learning algorithms. A useful way to boost the speed of big data processing is to maximally identify and exploit the potential parallelism in the machine learning algorithms. High scalability and high speed give machine learning the power to handle big data.
pre-processing phase
Because data sources cover almost all kinds of domains, raw big data collected from the environment is highly complex and contains tremendous redundancy. Therefore, in the pre-processing phase we first need to delete invalid and dirty data. In addition, we frequently face massive amounts of uncertain and incomplete data in real life, so we need to append some important attributes to improve processing practicability.
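A sketch of this cleaning step, with invented record fields and an invented sensor-error sentinel: drop dirty and duplicate records, and repair a missing attribute with a default value.

```python
# Toy pre-processing pass: remove invalid and redundant records,
# then fill in missing attributes. All field names are made up.
raw = [
    {"id": 1, "temp": 21.5},
    {"id": 2, "temp": None},      # incomplete: fill with a default
    {"id": 1, "temp": 21.5},      # duplicate: redundancy to remove
    {"id": 3, "temp": -999.0},    # sentinel for a bad sensor read: invalid
]

def preprocess(records, default_temp=20.0):
    seen, clean = set(), []
    for r in records:
        if r["temp"] is not None and r["temp"] < -100:   # dirty data
            continue
        key = (r["id"], r["temp"])
        if key in seen:                                  # redundancy
            continue
        seen.add(key)
        if r["temp"] is None:                            # repair attribute
            r = {**r, "temp": default_temp}
        clean.append(r)
    return clean

cleaned = preprocess(raw)
```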
analysis phase
After the raw-data pre-processing phase, we need to analyze the valid and useful data to find out how to utilize it through trial and error. Data visualization is a fundamental problem in the analysis of big data, and we can adopt sparse representation to achieve effective dimension reduction for high-dimensional data.
model establishment phase
Through analysis of the essential parameters, we should be able to select some important features to establish a feasible model for dealing with real problems. In the model establishment phase, we first try to mine the structured relations between data to obtain statistical information and trends, and then split the data into training and testing sets.
model updating phase
In the end, we can decide what kind of model should be generated for utilization and build the corresponding model. Once the model is established, we need to configure its parameters and apply the generated model from the model establishment phase to actual operations to test the performance of the big data processing model. In this phase, we emphasize that the input data is real-time, so we should make dynamic adjustments to update the model based on the effects of model application.
For the four phases in the big data processing procedure, the first three phases are offline processing. In these phases, we can adopt offline learning methods, which fall into the two categories of supervised and unsupervised learning. In the model testing and updating phase, we mainly focus on the real-time characteristic of the input data. To deal with the problem of real-time processing, online learning methods are necessary, and reinforcement learning is preferred.
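The online-learning idea behind the updating phase can be sketched with the simplest possible incremental model: a running mean used as a predictor, updated one observation at a time without storing or revisiting past data. The class and the observation stream are invented for illustration.

```python
# Minimal online-learning sketch: the model state is updated per
# observation, which suits real-time input streams.
class OnlineMean:
    def __init__(self):
        self.n = 0
        self.value = 0.0

    def update(self, x):
        # Incremental mean update: O(1) per observation, no data kept
        self.n += 1
        self.value += (x - self.value) / self.n
        return self.value

model = OnlineMean()
for observation in [10.0, 12.0, 11.0, 13.0]:   # simulated real-time stream
    model.update(observation)
```

The same one-observation-at-a-time pattern underlies practical online methods such as stochastic gradient descent, where the update adjusts model weights instead of a mean.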