Machine learning platforms are among the fastest-growing services in the public cloud. ML, a set of approaches and technologies built on Artificial Intelligence (AI) concepts, is closely tied to pattern recognition and computational learning. Early adopters of AI have now rolled out cloud-based services that are bringing AI to the masses.
How are AI, deep learning, machine learning, big data, and cloud related? Can machine learning algorithms enable the use of an individual’s comprehensive biological information to predict or diagnose diseases, and to find or develop the best therapy for that individual? How is Quantum Computing in the Cloud related to the use of AI and Cybersecurity?
Join this webinar to learn more about:
- Machine Learning, Data Discovery and Cloud
- Cloud-Based ML Applications and ML services from AWS and Google Cloud
- How to Automate Machine Learning
Slide 2
Ulf Mattsson
• Head of Innovation at TokenEx
• Chief Technology Officer at Protegrity
• Chief Technology Officer at Atlantic BT Security Solutions
• Chief Technology Officer at Compliance Engineering
• Developer at IBM Research and Development
• Inventor of 70+ issued US patents
• Providing products and services for Data Encryption and Tokenization, Data Discovery, Cloud Access Security Broker, Web Application Firewall, Managed Security Services and Security Operation Center
Slide 5
Artificial Intelligence
• At the core of most, if not all, advanced artificial intelligence or machine learning systems are optimization problems.
• Machine learning is an intensely iterative process that uses huge data sets to learn, evolve, and find improved approaches to the problem at hand.
• Novel quantum algorithms could dramatically accelerate the underlying processing required for machine learning.
• The strange, nearly metaphysical nature that governs how qubits operate in quantum computing not only holds the key to better and faster artificial intelligence, but may also be the secret to true artificial intelligence.
Slide 6
The Difference Between Artificial Intelligence and Machine Learning
• Artificial Intelligence describes the ability of machines to perform tasks that are typically associated with human activity and intelligence: reasoning, learning, natural language processing, perception, etc. Any “smart” activity performed by a machine falls under AI.
• Artificial Intelligence is the capability of a machine to imitate intelligent human behavior.
• Machine Learning is a subset of AI.
• ML is a set of algorithms built to achieve AI: those algorithms require the ability to learn from data, modify themselves when exposed to more data, and achieve a goal without being explicitly programmed.
Source: BigID and Groundlabs
Slide 10
Supervised vs Unsupervised Learning
• ML tasks are often classified into two main categories: Supervised Learning and Unsupervised Learning.
• You often need both to adequately analyze and draw value from large data sets.
• A Supervised Learning algorithm infers a prediction model from a training set: it learns the typical mapping from input to output.
• The goal of supervised learning is that when you have new data (input), you can (accurately) predict the output for that data.
• A straightforward data classification task, for instance, can be approached with a supervised learning algorithm if it starts with enough data: [x] type of data falls into [z] category.
• When you add new data, that algorithm will then be able to identify that [x] type of data, based on [y] identifiers, can be classified into [z] category. Supervised learning algorithms usually require human assistance to label the data.
• An Unsupervised Learning algorithm, on the other hand, tries to find commonalities in data (without any human labeling of the data) to gain insights.
• Given a set of files, an unsupervised algorithm can group data into sub-groups based on the features of the documents: their content or metadata.
• Clustering, for example, is a type of unsupervised learning algorithm: clustering algorithms scan through data to discover and identify natural clusters that indicate they are the same type of data – that might mean specific types of personal information, formats that typically contain personal information, and more (see the sketch below).
Source: BigID and Groundlabs
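To make the unsupervised side concrete, here is a minimal sketch using scikit-learn (the sample documents are hypothetical): TF-IDF turns the files into feature vectors, and KMeans groups them into natural clusters without any human-provided labels.

```python
# Minimal unsupervised clustering sketch: vectorize documents with
# TF-IDF and let KMeans discover natural groupings, with no labels.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "Invoice #4521, card number on file, billing address",
    "Patient discharge summary, prescription list",
    "Invoice #4522, payment due, billing address",
    "Clinical notes, lab results, prescription refill",
]

# Turn raw text into numeric feature vectors
X = TfidfVectorizer().fit_transform(docs)

# Ask KMeans to discover two natural groupings in the data
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for doc, label in zip(docs, labels):
    print(label, doc[:40])
```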
Slide 11
Deep Learning
• Deep Learning is a set of algorithms that aims to perform both supervised and unsupervised machine learning tasks.
• These algorithms are modeled after the way that humans process data and recognize patterns.
• It adds another layer of classification and clustering to help make sense of independent and unlabeled data.
• Deep learning brings another level of sophistication to mapping and analyzing large data sets.
• It brings a layered architecture to better approach complex data challenges like processing natural language, making sense of big data, and processing otherwise unstructured and diverse sets of data (see the sketch below).
Source: BigID and Groundlabs
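As a rough illustration of that layered architecture (a minimal sketch, assuming TensorFlow/Keras is installed; the input width and the binary "contains PII" output are invented for illustration), each stacked layer learns a progressively more abstract representation of its input:

```python
# Minimal layered-network sketch in Keras: stacked dense layers
# learn increasingly abstract representations of the input.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(100,)),            # e.g. 100 input features
    tf.keras.layers.Dense(64, activation="relu"),   # first representation layer
    tf.keras.layers.Dense(32, activation="relu"),   # deeper, more abstract layer
    tf.keras.layers.Dense(1, activation="sigmoid"), # binary output, e.g. "contains PII"
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```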
Slide 16
Amazon AWS has a broad and deep set of machine learning and AI services
Slide 17
AI and machine learning products
1. AI Hub, our hosted repository of plug-and-play AI components, encourages experimentation and collaboration within your organization.
2. AI building blocks make it easy for developers to add sight, language, conversation, and structured data to their applications.
3. AI Platform, our code-based data science development environment, lets ML developers and data scientists quickly take projects from ideation to deployment.
Cloud AI
1. Cloud AutoML - Service to train and deploy custom machine learning models.
2. Cloud TPU - Accelerators used by Google to train machine learning models.
3. Cloud Machine Learning Engine - Managed service for training and building machine learning models based on mainstream frameworks.
4. Cloud Job Discovery - Service based on Google's search and machine learning capabilities for the recruiting ecosystem.
5. Dialogflow Enterprise - Development environment based on Google's machine learning for building conversational interfaces.
6. Cloud Natural Language - Text analysis service based on Google Deep Learning models (see the sketch below).
7. Cloud Speech-to-Text - Speech-to-text conversion service based on machine learning.
8. Cloud Text-to-Speech - Text-to-speech conversion service based on machine learning.
9. Cloud Translation API - Service to dynamically translate between thousands of available language pairs.
10. Cloud Vision API - Image analysis service based on machine learning.
11. Cloud Video Intelligence - Video analysis service based on machine learning.
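As one concrete example, a call to Cloud Natural Language from Python might look like the following sketch (assuming the google-cloud-language client library is installed and application-default credentials are configured; the exact call shape can vary by client version):

```python
# Hedged sketch: sentiment analysis with Cloud Natural Language.
from google.cloud import language_v1

client = language_v1.LanguageServiceClient()
document = language_v1.Document(
    content="The new data discovery service works remarkably well.",
    type_=language_v1.Document.Type.PLAIN_TEXT,
)
response = client.analyze_sentiment(request={"document": document})
print(response.document_sentiment.score)  # -1.0 (negative) .. 1.0 (positive)
```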
Slide 18
AWS has a broad and deep set of machine learning and AI services
AWS pre-trained AI Services provide ready-made intelligence for your applications and workflows.
AI Services easily integrate with your applications to address common use cases such as personalized recommendations, modernizing your contact center, improving safety and security, and increasing customer engagement.
Because we use the same deep learning technology that powers Amazon.com and our ML Services, you get quality and accuracy from continuously learning APIs.
And best of all, AI Services on AWS don't require machine learning experience.
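For instance, one of these pre-trained services, Amazon Comprehend, exposes PII entity detection through boto3. The sketch below assumes configured AWS credentials and a recent boto3; the region and sample text are illustrative.

```python
# Hedged sketch: detect PII entities with Amazon Comprehend via boto3.
import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")
response = comprehend.detect_pii_entities(
    Text="Contact Joe Smith at joe.smith@surferdude.org or 760-278-3389.",
    LanguageCode="en",
)
for entity in response["Entities"]:
    # Each entity carries a type (EMAIL, PHONE, NAME, ...) and a confidence score
    print(entity["Type"], round(entity["Score"], 3))
```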
Slide 20
Pseudonymisation Under the GDPR
Within the text of the GDPR, there are multiple references to pseudonymisation as an appropriate mechanism for protecting personal data.
Pseudonymisation (replacing identifying or sensitive data with pseudonyms) is synonymous with tokenization (replacing identifying or sensitive data with tokens); see the sketch below.
What is Personal Data according to EU GDPR?
Article 4 – Definitions
• (1) ‘personal data’ means any information relating to an identified or identifiable natural person (‘data subject’); …such as a name, an identification number, location data, an online identifier…
• (5) ‘pseudonymisation’ means the processing of personal data in such a manner that the data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately…
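A minimal sketch of Article 4(5) in code, with hypothetical field names: identifying values are replaced by random tokens, and the lookup table (the "additional information") would be kept in a separately secured store.

```python
# Minimal pseudonymisation/tokenization sketch: random tokens replace
# identifying values; the vault is the separately kept information
# needed for re-identification.
import secrets

token_vault = {}   # in practice this lives in a separately secured store

def tokenize(value: str) -> str:
    token = secrets.token_hex(8)   # random, carries no information about the value
    token_vault[token] = value     # re-identification requires access to the vault
    return token

record = {"name": "Joe Smith", "ssn": "076-39-2778"}
pseudonymised = {field: tokenize(v) for field, v in record.items()}
print(pseudonymised)   # data can no longer be attributed to the data subject
```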
Slide 22
Automate Consent Tracking And Data Governance
• Advanced ML PII & PI Discovery & Access Intelligence for Security & Privacy. Petabyte Scale. ML Driven. Structured & Unstructured. Data Subject Rights.
• Artificial Intelligence (AI) is prevalent in everything from autocorrect to music recommendations, from Frankenstein’s monster to replicants and paranoid robots.
• Formalized in the 1950s, AI has moved past speculative fiction and is an inescapable part of our everyday lives.
• The past few years have seen a significant rise in software projects that use Artificial Intelligence and Machine Learning.
• Although often used interchangeably, Artificial Intelligence (AI) and Machine Learning (ML) are not the same.
• Think of AI as intelligence, and ML as knowledge.
Source: BigID and Groundlabs
Slide 23
Gartner Hype Cycle for Data Security
[Chart: Gartner Hype Cycle for Data Security, highlighting Data Classification and Privacy by Design (EU GDPR)]
Slide 24
Use AI and ML to get more value from your data
• BigID leverages artificial intelligence and machine learning to discover, classify, analyze, and protect identity and entity data.
• BigID uses AI, ML, and deep learning applications to add context to independent data points, correlate data into individual identities, build a personal data catalog, and get full visibility across data stores of personal, consumer, and entity data.
• Learn more about how BigID uses AI and ML to redefine personal data protection and privacy across your enterprise data stores.
ML Based PI Discovery
• The problem of discovering sensitive information is not new. However, traditional data discovery solutions haven’t been able to deal with big data volumes, lack the identity-awareness essential for privacy, and don’t determine data ownership or provenance.
• BigID provides a purpose-built data discovery tool for identity-oriented information like customer, employee or client data.
• Using BigID, organizations can find and inventory data subject information, accurately and at scale.
Source: BigID (TokenEx partner)
Slide 25
Data Discovery Beyond DLP - Heat Maps
• The challenge that many organizations first face when they look to enhance data security or meet proliferating privacy regulations is accurately identifying what is and isn’t personally identifiable information.
• Legacy tools like DLP, with their dependence on Regular Expression matching and lack of context awareness, struggle to distinguish between similar-looking data, can’t always scan both structured and unstructured data sources, fail to measure data sensitivity, and can’t discern residency (see the sketch below).
• BigID's data-science-based discovery, correlation and classification gives organizations a first-of-its-kind protection and privacy solution, purpose-built for identity data.
• Without understanding where personal data is stored, it’s difficult to implement policies and controls for how that data is moved, used and secured. Knowing where to look, and what to look for, is critical in the PII discovery process.
• For that reason, in addition to performing detailed attribute-level discovery, BigID can also provide fine-grained analysis to rapidly ‘heat map’ PII concentrations across data center endpoints and the cloud.
• This has benefits in multiple use cases, including cloud migrations, where assessment of server data sensitivity is essential, as well as developer environments, where data stores and micro-services should be monitored for potential PII contamination.
Source: BigID (TokenEx partner)
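The following sketch illustrates the Regular Expression limitation (the keyword heuristic is purely illustrative, not how any particular product works): the same digit pattern matches both an SSN and a non-SSN string, so surrounding context is needed to tell them apart.

```python
# Minimal sketch: a regex alone cannot distinguish an SSN from
# another string with the same shape; a crude context check helps.
import re

pattern = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

lines = [
    "SSN: 076-39-2778",
    "Conference session 123-45-6789 runs Tuesday",  # same shape, not an SSN
]
for line in lines:
    match = pattern.search(line)
    if match:
        has_context = bool(re.search(r"ssn|social security", line, re.I))
        print(match.group(), "-> likely SSN" if has_context else "-> ambiguous without context")
```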
Slide 26
PII Inventory
• Locating sensitive PII is essential to protecting it.
• However, data maps alone can't provide a complete protection or privacy picture.
• New privacy protection regulations mandate an individual's right to access their own data, the right to be forgotten, the right to port their data and the right to be notified of a breach.
• All of these require knowing what data belongs to whom.
• BigID’s data discovery technology determines which data belongs to which data subject, and with what level of correlation.
Source: BigID (TokenEx partner)
Slide 27
Accuracy at Scale
• Personally identifiable data is a legal concept, not a technical concept.
• The challenge with pattern matching based on regular expressions is that the approach doesn’t integrate enough context to distinguish between data that may have similar formats (such as a Social Security number and a telephone number) but very different implications for privacy protection.
• BigID utilizes a combination of machine learning and analytics to automatically calculate the identifiability of attributes and sets of attributes (including pseudo-identifiers), improving the accuracy of what gets classified as personal data (see the sketch below).
• Moreover, by distributing search across big data endpoints, limiting discovery scope to known or learned PII, and pre-sampling the data using heat map surveys, BigID is able to find PII both more accurately and at scale.
Source: BigID (TokenEx partner)
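A minimal sketch of the identifiability idea, in the spirit of k-anonymity and with hypothetical column names (this is not BigID's algorithm): count how many records are unique on a combination of quasi-identifiers.

```python
# Minimal identifiability sketch: records unique on a quasi-identifier
# combination are highly identifiable, even without a direct identifier.
import pandas as pd

df = pd.DataFrame({
    "zip":        ["94107", "94107", "10001", "10001"],
    "birth_year": [1966,     1966,    1980,    1981],
    "gender":     ["M",      "M",     "F",     "F"],
})

quasi_identifiers = ["zip", "birth_year", "gender"]
group_sizes = df.groupby(quasi_identifiers).size()

# Fraction of records that are unique on this attribute combination
fraction_unique = (group_sizes == 1).sum() / len(df)
print(f"{fraction_unique:.0%} of records are unique on {quasi_identifiers}")
```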
Slide 28
Data Minimization
• Increasingly, organizations are adopting data minimization strategies for security and privacy reasons. By deleting or reducing inessential, duplicate or unused data, organizations can minimize potential attack vectors (see the sketch below).
• Unlike prior discovery tools, BigID can not only quickly report on duplicate data but also provide residency and usage detail, so minimization strategies can be based on secondary factors like jurisdiction and activity history.
• BigID is transforming enterprise protection and privacy of personal data.
• Organizations are facing record breaches of personal information and proliferating global privacy regulations, with fines reaching 10% of annual revenue.
• Today, enterprises lack dedicated, purpose-built technology to help them track and govern their customer data.
• By bringing data science to data privacy, BigID aims to give enterprises the software to safeguard and steward the most important asset organizations manage: their customer data.
Source: BigID (TokenEx partner)
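A minimal sketch of one data minimization building block: flagging byte-identical duplicate files by content hash (the file-share path is hypothetical).

```python
# Minimal duplicate-detection sketch: group files by SHA-256 digest;
# groups with more than one path are deletion candidates under a
# minimization policy.
import hashlib
from pathlib import Path
from collections import defaultdict

def sha256_of(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

duplicates = defaultdict(list)
for path in Path("/data/share").rglob("*"):   # hypothetical file share
    if path.is_file():
        duplicates[sha256_of(path)].append(path)

for digest, paths in duplicates.items():
    if len(paths) > 1:
        print(digest[:12], [str(p) for p in paths])
```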
Slide 29
ML Driven Data Classification
• The definition of sensitive data is no longer readily encapsulated in a regular expression.
• Increasingly, companies need to classify data that is sensitive based on context: its relation to a person, or to a thing like a patent or an account.
• This requires a new approach to classification that can identify contextually sensitive data across all modern data stores: unstructured, structured, Big Data, Cloud and enterprise applications like SAP.
• BigID provides a first-of-its-kind approach that combines Machine Learning and Contextual Intelligence to deliver advanced data classification, categorization, cataloging and correlation for privacy.
Source: BigID (TokenEx partner)
Slide 30
ML-Driven Classification
• Traditional pattern matching approaches to discovery and classification still struggle with accurately identifying contextually sensitive data like Personal Information (PI) and disambiguating similar-looking information.
• Moreover, the regular-expression-based classifiers that predominate in data loss prevention, database activity monitoring, and data access governance products tend to operate on a limited number of data sources, like relational databases or on-prem unstructured file shares.
• BigID leverages machine learning to classify, categorize and compare data and files across structured, unstructured, semi-structured and Big Data sources, in the cloud or on-prem (see the sketch below).
• BigID can resolve similar-looking entities and build association graphs to correlate data back to a specific entity or person - essential for meeting emerging privacy use cases like personal data rights.
Source: BigID (TokenEx partner)
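As a toy illustration of ML-driven text classification (not BigID's actual model; the training examples are invented), a TF-IDF plus logistic regression pipeline can learn to flag fragments that contain personal information:

```python
# Minimal supervised classification sketch: learn "contains personal
# information" from labeled examples instead of a fixed regex.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "Joe Smith, 100 Main Street, DOB 12/25/1966",
    "Quarterly revenue grew by twelve percent",
    "Patient email joe.smith@surferdude.org on file",
    "The printer on floor three needs toner",
]
train_labels = [1, 0, 1, 0]   # 1 = contains personal information

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(train_texts, train_labels)

print(clf.predict(["Send the invoice to 476 Elm Ave, attn: Jane Doe"]))
```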
Slide 31
Correlation plus classification
• Even with AI and ML classification approaches like clustering or random forests, classifiers can improve accuracy through smarter matching and comparison analysis, but they lack the context to understand who the data relates to.
• This is a common problem for privacy requirements and regulated industries. The capability to build a graph of connected or relevant data can be characterized as a correlation problem (see the sketch below).
• Correlation helps an organization find sensitive data through its association with other sensitive data.
• BigID provides a first-of-its-kind model that can not only match similar data within the same class based on ML analysis, but also match connected data of different classes based on relevancy and connectedness.
• This correlation-based classification is critical to privacy.
Source: BigID (TokenEx partner)
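A minimal sketch of the correlation idea (not BigID's algorithm; the records are invented): link attribute values that co-occur in records into a graph, then read connected components off as candidate identities.

```python
# Minimal association-graph sketch: values seen together in a record
# are connected; connected components approximate individual identities.
import networkx as nx

records = [
    {"name": "Joe Smith", "email": "joe.smith@surferdude.org"},
    {"email": "joe.smith@surferdude.org", "phone": "760-278-3389"},
    {"name": "Jane Doe", "phone": "555-010-7788"},
]

G = nx.Graph()
for record in records:
    values = list(record.values())
    # connect every value found in the same record
    for a, b in zip(values, values[1:]):
        G.add_edge(a, b)

# each connected component groups data that likely belongs to one person
for component in nx.connected_components(G):
    print(component)
```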
Slide 32
Cataloging plus Classification
• BigID's ML-based classifiers use advanced AI techniques to match data within a class and also correlate data of different classes that have a common sensitivity level owing to a shared association.
• But there is a third way sensitivity can be measured. Most data also has certain attributes associated with it, such as date of creation, last modification, ownership and access details.
• Unlike traditional classifiers, BigID can also integrate metadata analysis to provide a richer view of the data and its usage (see the sketch below).
• This metadata input can be used to better and more automatically catalog data for easier discovery via search, as well as to measure sensitivity risk.
• The combination of intelligent classification, correlation and cataloging gives organizations the unique ability to find, inventory and map sensitive data by dimensions beyond just data class or category.
• These include finding data by person, residency, application and ownership.
Source: BigID (TokenEx partner)
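A minimal sketch of the metadata side of cataloging, with a hypothetical file and a POSIX-only ownership field: pull modification and ownership details from the file system into a catalog record.

```python
# Minimal metadata-catalog sketch: file-system attributes become
# fields of a catalog entry for search and sensitivity scoring.
import datetime
from pathlib import Path

def catalog_entry(path: Path) -> dict:
    st = path.stat()
    return {
        "path": str(path),
        "size_bytes": st.st_size,
        "modified": datetime.datetime.fromtimestamp(st.st_mtime).isoformat(),
        "owner_uid": st.st_uid,   # resolve to a username via the pwd module on POSIX
    }

print(catalog_entry(Path("customer_export.csv")))   # hypothetical file
```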
Slide 33
Intelligent labeling and tagging
• Enforcement of security protection and privacy compliance requires knowledge of data risk and sensitivity.
• BigID helps organizations understand data sensitivity through advanced ML-based classification, correlation and cataloging to provide a complete view of data.
• To simplify enforcement on classified data, BigID enables customers to automatically assign data tags to files and objects (see the sketch below).
• These classification tags can be consumed through Microsoft's Azure Information Protection framework as policy labels, through BigID's labeling APIs, or through additional frameworks like Box.
• Using these labels, organizations can classify or categorize data - such as Highly Sensitive, or as Personal Data under privacy, health or financial services compliance mandates.
• These tags can then be used for more granular policy enforcement actions by DLP, information rights management, database activity monitoring or other enforcement products.
Source: BigID (TokenEx partner)
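As a rough illustration of machine-readable tagging (a sketch only: it uses POSIX extended attributes, which are Linux-specific, on a hypothetical file; Azure Information Protection or Box labeling would go through their own APIs):

```python
# Minimal tagging sketch: attach a sensitivity tag to a file so that
# downstream enforcement tooling can read it (Linux xattrs only).
import os

def tag_file(path: str, sensitivity: str) -> None:
    # user-namespace extended attribute, readable by other tooling
    os.setxattr(path, "user.sensitivity", sensitivity.encode())

tag_file("customer_export.csv", "highly-sensitive")   # hypothetical file
print(os.getxattr("customer_export.csv", "user.sensitivity"))
```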
Slide 36
AI, Big Data and Quantum Computing
• In 2016, Northeastern University stated that globally, we create 2.5 exabytes, or 2.5 quintillion bytes, of data every single day.
• Realistically, we probably create much more data now than we did in 2016, but to put 2.5 exabytes in perspective, it’s equivalent to 250,000 times the entirety of the Library of Congress.
• This virtual flood of data from laptops, computers, phones, and other technology has given rise to Big Data: unfathomably large data sets on everything from how many people like to wear red polo shirts on a sunny day to what they’re likely to eat after a hard session at the gym.
• The problem is that we now have so much data that it’s hard to know what to do with it. Analysis on a higher level is nearly impossible due to just how much information is contained.
• It’s an ideal problem for quantum computing, which can handle massive data sets with ease, providing insights to artificial intelligence, which can then analyze them further.
Slide 37
Why is Machine Learning so Useful in Security?
Insider Threats and Behavior Security Analytics
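A minimal sketch of behavior security analytics with an unsupervised anomaly detector (the session features and values are invented): scikit-learn's IsolationForest scores user sessions by how unusual they look, which is one common way ML is applied to insider-threat detection.

```python
# Minimal behavior-analytics sketch: IsolationForest flags the
# session whose activity profile deviates from the rest.
from sklearn.ensemble import IsolationForest
import numpy as np

# columns: [files accessed per hour, off-hours logins this week]
sessions = np.array([
    [12, 0], [15, 1], [11, 0], [14, 0],
    [420, 7],   # bulk access at odd hours: a potential insider threat
])

detector = IsolationForest(contamination=0.2, random_state=0).fit(sessions)
print(detector.predict(sessions))   # -1 marks the anomalous session
```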
Slide 52
Enterprises Losing Ground Against Cyber-attacks
• Verizon Data Breach Investigations Report
  • Enterprises are losing ground in the fight against persistent cyber-attacks.
  • We simply cannot catch the bad guys until it is too late, and this picture is not improving.
  • Verizon reports concluded that less than 14% of breaches are detected by internal monitoring tools.
• JP Morgan Chase data breach
  • Hackers were in the bank’s network for months undetected.
  • Network configuration errors are inevitable, even at the largest banks.
• Capital One data breach
  • A hacker gained access to 100 million credit card applications and accounts.
  • The data was hosted on Amazon Web Services, the cloud hosting company that Capital One was using.
Slide 54
Tokenization for Cross Border Data-centric Security (EU GDPR)
[Diagram: data sources feeding a data warehouse in Italy, with complete policy-enforced de-identification of sensitive data across all bank entities]
• Protecting Personally Identifiable Information (PII), including names, addresses, phone, email, policy and account numbers
• Compliance with EU Cross Border Data Protection Laws
• Utilizing Data Tokenization, and centralized policy, key management, auditing, and reporting
Slide 55
How Should I Secure Different Types of Data?
[Diagram: data types mapped to protection methods, with use cases ranging from simple to complex]
Type of Data | Examples | Typical Protection
Structured | PCI (Card Holder Data), PHI (Protected Health Information), PII (Personally Identifiable Information) | Tokenization of fields
Un-structured | Files | Encryption of files
Slide 57
Examples of Tokenized Data

Field | Real Data | Tokenized / Pseudonymized
Name | Joe Smith | csu wusoj
Address | 100 Main Street, Pleasantville, CA | 476 srta coetse, cysieondusbak, CA
Date of Birth | 12/25/1966 | 01/02/1966
Telephone | 760-278-3389 | 760-389-2289
E-Mail Address | joe.smith@surferdude.org | eoe.nwuer@beusorpdqo.org
SSN | 076-39-2778 | 076-28-3390
CC Number | 3678 2289 3907 3378 | 3846 2290 3371 3378
Business URL | www.surferdude.com | www.sheyinctao.com
Fingerprint | (image) | Encrypted
Photo | (image) | Encrypted
X-Ray | (image) | Encrypted

Healthcare / Financial Services examples: Dr. visits, prescriptions, hospital stays and discharges, clinical and billing data; financial services consumer products and activities.
Protection methods can be equally applied to the actual data, but are not needed with de-identification. Note that the tokens preserve the format of the original values (see the sketch below).
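A minimal sketch of how such format-preserving tokens could be generated (illustrative only: production tokenization uses a vault or format-preserving encryption so that tokens are consistent and reversible): map each digit to a random digit and each letter to a random letter, keeping separators intact.

```python
# Minimal format-preserving token sketch: digits map to digits,
# letters to letters (case kept), separators like '-' and '@' pass through.
import random
import string

_rng = random.Random(0)  # fixed seed so the sketch is reproducible

def format_preserving_token(value: str) -> str:
    out = []
    for ch in value:
        if ch.isdigit():
            out.append(_rng.choice(string.digits))
        elif ch.isalpha():
            pick = _rng.choice(string.ascii_lowercase)
            out.append(pick.upper() if ch.isupper() else pick)
        else:
            out.append(ch)   # keep separators intact
    return "".join(out)

print(format_preserving_token("076-39-2778"))            # SSN-shaped token
print(format_preserving_token("joe.smith@surferdude.org"))
```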
Slide 58
Cloud Data Security
[Diagram: layered security stack – external and internal network, application server, application framework and source code, database, OS file system, operating system security controls, and application data – with security separation between the public cloud and a secure cloud]
Source: Armor.com
Slide 59
Security Separation in Cloud
[Diagram: administrators and internal users on the internal network, plus remote users, reach public cloud examples through a cloud gateway; each authorized field is in the clear; data security includes encryption, tokenization or masking of fields or files, in transit and at rest, with security separation provided by a secure cloud]
Source: Armor.com
Slide 60
Total Cost and Risk of Tokenization
On-Premise tokenization
• Limited PCI DSS scope reduction - must still maintain a CDE with PCI data
• Higher risk - sensitive data still resident in the environment
• Associated personnel and hardware costs
Cloud-Based tokenization
• Significant reduction in PCI DSS scope
• Reduced risk - sensitive data removed from the environment
• Platform-focused security
• Lower associated costs - cyber insurance, PCI audit, maintenance