Machine Learning in a Flash: An Introduction to Natural Language Processing
Kory Becker
June 2017, http://primaryobjects.com
Sponsored by
AI !== Machine Learning
Logical AI, Symbolic, Knowledge-based
Pattern Recognition, Representation
Inference, Common Sense, Planning
Heuristics, Ontology, Artificial Life, Genetic
Machine Learning, Statistics
Machine Learning Algorithms
Supervised:
Linear Regression
Logistic Regression
Support Vector Machines
Neural Networks
Unsupervised:
K-means Clustering
Principal Component Analysis (Dimensionality Reduction)
Linear Regression
Logistic Regression
Logistic Regression: Linear Classification
Support Vector Machine: Non-Linear Classification
Support Vector Machine: Gaussian Kernel
Pop Quiz!
Question 1: Supervised or Unsupervised?
 You are designing an agent for The Matrix.
 Its task is to classify people who are threats to the system.
 Feature Set:
 Age
 IQ
 Level of Education
 # of Times They Watched the Movie The Matrix
 Training Set of 100,000 people: 50k threats, 50k non-threats
Question 2: Supervised or Unsupervised?
 You are designing the brain of a battle robot.
 Its primary attack is hand-to-hand combat. Your task is to find the most effective move combos.
 Feature Set:
 # of Kicks
 # of Punches
 # of Head-butts
 # of Leg Sweeps
 Training Set of 100,000 winning battles
Natural Language Processing
Convert text into a numerical representation
Find commonalities within data (Clustering)
Make predictions from data (Classification)
Category, Popularity, Sentiment, Relationships
Bag of Words Model
Corpus
Cats like to chase mice.
Dogs like to eat big bones.
Create a Dictionary
0 - cats
1 - like
2 - chase
3 - mice
4 - dogs
5 - eat
6 - big
7 - bones
Digitize Text
Cats like to chase mice.
1 1 1 1 0 0 0 0
Dogs like to eat big bones.
0 1 0 0 1 1 1 1
Vector Length = 8
Classify Documents (eating)
Cats like to chase mice.
1 1 1 1 0 0 0 0 → 0
Dogs like to eat big bones.
0 1 0 0 1 1 1 1 → 1
Predict on New Data
Cats like to chase mice.
1 1 1 1 0 0 0 0 → 0
Dogs like to eat big bones.
0 1 0 0 1 1 1 1 → 1
Bats eat bugs.
0 0 0 0 0 1 0 0 → ?
Predict on New Data
Cats like to chase mice.
1 1 1 1 0 0 0 0 → 0
Dogs like to eat big bones.
0 1 0 0 1 1 1 1 → 1
Bats eat bugs.
0 0 0 0 0 1 0 0 → 1
Does it Really Work?
> data
[1] "Cats like to chase mice." "Dogs like to eat big bones."
> train
big bone cat chase dog eat like mice y
1 0 0 1 1 0 0 1 1 0
2 1 1 0 0 1 1 1 0 1
> predict(fit, newdata = train)
[1] 0 1
> data2
[1] "Bats eat bugs."
> test
big bone cat chase dog eat like mice
1 0 0 0 0 0 1 0 0
> predict(fit, newdata = test)
[1] 1
Document-Term Matrix
100% Training Accuracy
Test Case: Success!
Source code: https://goo.gl/UxjPBs


Editor's Notes

  1. Hi everyone! I'm really excited to be presenting one of my favorite topics: artificial intelligence. Specifically, I thought it would be interesting to present a “crash course” on machine learning, which is a small subset of AI. In this presentation, we’ll go over a handful of really quick machine learning algorithms. We’ll cover the difference between unsupervised and supervised learning, classification, clustering, and a little bit of natural language processing to classify sentences as being about “eating”. Sound like fun? Let’s get started!
  2. I want to briefly start this presentation off by just clearing up some media hype that has been steadily growing over the past year or two, surrounding what AI is and all of the amazing things it's going to do. The news likes to say things like "chatbots are going to take over everyone's jobs, machine learning is changing everything, etc". Not to belittle machine learning, as it truly is an amazing branch of AI that has made significant leaps and bounds in accuracy over the past few years (largely due to massive online datasets, increased computing speed, and deep learning). However, AI is not just machine learning. There is a lot more to it! AI encompasses many different branches. There is logical AI, which deals with representing knowledge as logical sentences. There is Symbolic AI (also called Classical AI), which uses human-readable representations of problems (my STRIPS planning library is an example of this). There is knowledge-based AI like the Cyc database, pattern recognition such as image recognition of cats, dogs, and the CIFAR dataset. There is also AI planning (see http://stripsfiddle.herokuapp.com for an example of this, where I demonstrate AI for solving Starcraft build orders! How cool is that?). There are heuristics like A* Search. But the focus of this presentation will be specifically on “machine learning”.
  3. Machine learning is a statistics-based approach to artificial intelligence. It focuses on algorithms that provide a distinct and measurable learning component. This focus on “measurability” is what makes machine learning so appealing, as we can understand whether an algorithm is actually “learning” anything. Machine learning consists of two areas: supervised and unsupervised algorithms. Supervised algorithms largely deal with classification, and include regression, support vector machines, and neural networks, just to name a few. Unsupervised algorithms deal with clustering: finding similarities within large sets of data and grouping accordingly. Examples of unsupervised learning algorithms include k-means and PCA.
  4. One of the most basic machine learning algorithms is linear regression. In its simplest form, this algorithm can dictate a trend line through data, allowing you to predict values for data when given specific features as input. For example, looking at the chart above, imagine we’re trying to predict home sale prices in your neighborhood. The x-axis represents the square footage of a house, while the y-axis represents the price. The blue dots are houses and the red line is a linear regression. Notice how as the square footage for a house increases, so too does the sale price. The linear regression plot shows this as the red trend line. Now, imagine you have a completely new house with a particular square footage and you want to predict the sale price. You could look at the linear regression plot to guess what a sale price might be, based upon other houses with similar features.
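The trend-line idea in the note above can be sketched with a one-variable least-squares fit. The talk's charts are images (not reproduced here), so the square-footage and price numbers below are made up for illustration; only the technique matches the slide.

```python
# Toy version of the house-price example: fit y = slope * x + intercept
# by the closed-form least-squares solution. Data is illustrative only.
sqft  = [1000, 1500, 2000, 2500, 3000]
price = [200, 290, 410, 500, 590]   # sale prices in $1000s, roughly linear

n = len(sqft)
mx, my = sum(sqft) / n, sum(price) / n
slope = sum((x - mx) * (y - my) for x, y in zip(sqft, price)) / \
        sum((x - mx) ** 2 for x in sqft)
intercept = my - slope * mx

# Predict the sale price of a new 1750 sq ft house from the trend line.
print(round(slope * 1750 + intercept, 1))
```

Just as the note describes, a new house's price is read off the fitted line from its square footage.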
  5. Another basic machine learning algorithm is logistic regression. This is a classification algorithm and is often the de-facto “go to” algorithm when initially trying to classify data. In the above chart, imagine that we have college students that we’re trying to classify as to whether they might be hired at our company. We have a bunch of historical data from college students, including 2 exams, and the result of whether they were hired (the blue circles) or not (the x’s). This could just as easily be cancer diagnosis or any other classification topic, but we’ll go with new hires here. Can we determine whether a student will get hired, based upon these two exam scores? Now, you could probably eyeball the data and see a rough boundary in the data that you might be able to use to classify the students. This is where logistic regression comes in.
  6. Logistic regression divides the data into an optimal decision boundary. As you can see in the above chart, it draws a diagonal line through the data, which separates students that were hired versus those that were not. If you look at the data points on each side of the line, you can see how this separation is pretty good at halving the hires versus non-hires according to their exam scores. So, based on this result, we can probably predict on a completely new student whether that would be hired or not, by plotting their point on the chart from their exam scores, and seeing which side of the boundary they lie upon. This is specifically called “linear classification”, because we’re predicting a yes/no or 0/1 classification for a data point. There is also multi-class classification, which can label data according to any number of classifications (for example, classifying images to a corresponding digit 0-9, types of fruit, categories for a movie, etc.).
  7. Support vector machines are another form of classification, specifically for non-linear classification. The above chart might represent cancer diagnoses. The red x’s represent positives, while the blue circles negatives. Now, we could probably draw a straight line through this data, using logistic regression, and get some degree of accuracy in diagnosing cancer in these patients. However, a better fit (and higher accuracy) could probably be achieved through a support vector machine. In the chart above, you can see how an SVM is able to draw a non-linear classification circle around the group of data. We can then predict on a new patient as to whether they have cancer or not, by seeing if they fall within the classification boundary.
  8. Support vector machines are pretty powerful for classification. You can use different kernels to classify the data in different ways. The above chart shows an example of a Gaussian kernel, which uses a sort-of concentric circle approach to finding the optimal boundary in the classification of data. In this case, all of the white circles will be classified under one topic, while the black circles in the other topic. You can adjust the Gaussian kernel values to shrink or expand the boundary to fine-tune accuracy.
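The Gaussian kernel mentioned above is just a similarity function between two points; the "shrink or expand the boundary" knob is its sigma parameter. A minimal Python sketch (the talk doesn't give this formula explicitly, but this is the standard RBF kernel):

```python
import math

def gaussian_kernel(x1, x2, sigma=1.0):
    # Similarity decays with squared distance; sigma controls how fast.
    sq_dist = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-sq_dist / (2 * sigma ** 2))

print(gaussian_kernel([0, 0], [0, 0]))            # 1.0 (identical points)
print(round(gaussian_kernel([0, 0], [1, 1]), 3))  # 0.368
```

A larger sigma makes distant points look more similar, which expands the decision boundary; a smaller sigma tightens it.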
  9. Let’s try a quick quiz.
  10. Supervised or Unsupervised? You are designing an agent for The Matrix. Its task is to classify people who are threats to the system. The feature set includes: age, IQ, level of education, the number of times they’ve watched the movie “The Matrix”. The training set consists of 100,000 people, divided into 50k threats and 50k non-threats. Answer: Supervised Reason: You can train a classification algorithm, such as logistic regression or a neural network, by providing the 4 features as input, with a single output of 0 or 1 – corresponding to threat or non-threat. With an equally split training set, there is a better chance of accuracy.
  11. Supervised or Unsupervised? You are designing the brain of a battle robot. Its primary attack is hand-to-hand combat. Your task is to find the most effective move combos. The feature set includes: the number of kicks, the number of punches, the number of head-butts, and the number of leg sweeps. The training set consists of 100,000 winning battles and their associated moves. Answer: Unsupervised Reason: Since we’re looking for move combinations (i.e., sets of moves that were used in winning battles), we can use an unsupervised clustering algorithm that can group the data and identify common move patterns. From these clusters, we can identify winning move combinations.
  12. Natural Language Processing The most basic form of natural language processing is to simply convert text into a numerical representation. This gives you an array of numbers. So, each document becomes a same-sized array of numbers. With this, you can apply machine learning algorithms, such as clustering and classification. This allows you to build unique insights into a set of documents, determining characteristics like category, popularity, sentiment, and relationships. This is the same type of processing that many popular online machine learning APIs use to classify data. For example, IBM Watson, Microsoft, Amazon, and Google, all include NLP APIs for working with data.
  13. Bag of Words Model Let’s take a look at a quick example. Here are two documents: “Cats like to chase mice.” and “Dogs like to eat big bones”. We’re going to try to categorize these documents as being about “eating”. To do this, we’ll build a bag-of-words model and then apply a classification algorithm. Now, the first thing to note is that the two documents are of different lengths. If you think about it, most documents will practically always be of different lengths. This is fine, because after we digitize the corpus, you’ll see that the resulting data fits neatly within same-sized vectors.
  14. Create a Dictionary So, the first step is to create a dictionary from our corpus. First, we apply a stemming algorithm on the corpus. This will remove the stop-word “to”. Next, we find each unique term and add it to our dictionary. You can see the resulting list on the right-side of this slide. Our dictionary contains 8 terms.
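The dictionary-building step above can be sketched in a few lines. The talk's actual code is in R; this Python version is illustrative only, assuming a simple whitespace/letters tokenizer and a one-word stop list ("to"), which reproduces the 8-term dictionary on the slide.

```python
import re

corpus = ["Cats like to chase mice.", "Dogs like to eat big bones."]
stop_words = {"to"}

def tokenize(text):
    # Lowercase, strip punctuation, split into words, drop stop words.
    return [t for t in re.findall(r"[a-z]+", text.lower())
            if t not in stop_words]

# Assign each unique term the next available index, in order of appearance.
dictionary = {}
for doc in corpus:
    for term in tokenize(doc):
        if term not in dictionary:
            dictionary[term] = len(dictionary)

print(dictionary)
# {'cats': 0, 'like': 1, 'chase': 2, 'mice': 3,
#  'dogs': 4, 'eat': 5, 'big': 6, 'bones': 7}
```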
  15. Digitize Text With our dictionary created, we can now digitize the documents. Since our dictionary has 8 terms, each document will be encoded into a vector of length 8. This ensures that all documents end up having the same length. This makes it easier to process with machine learning algorithms. Let’s look at the first document. We’ll take the first term in the dictionary and see if it exists in the first document. The term is “cats”, which does indeed exist in the first document. Therefore, we’ll set a 1 as the first bit. The next term is “like”. Again, it exists in the first document, so we’ll set a 1 as the next bit. This repeats until we see the term “dogs”. This does not exist in the first document, so we set a “0”. Finally, we run through all terms in the dictionary and end up with a vector of length 8 for the first document. We repeat the same steps for the second document, going through each term in the dictionary and checking if it exists in the document.
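The term-by-term presence check described above is exactly a binary encoding over the dictionary. A Python sketch (again, the talk uses R; the dictionary below is copied from the slide):

```python
import re

# Dictionary from the slide; insertion order fixes the vector layout.
dictionary = {"cats": 0, "like": 1, "chase": 2, "mice": 3,
              "dogs": 4, "eat": 5, "big": 6, "bones": 7}

def vectorize(text):
    # 1 if the dictionary term appears in the document, else 0.
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return [1 if term in tokens else 0 for term in dictionary]

print(vectorize("Cats like to chase mice."))     # [1, 1, 1, 1, 0, 0, 0, 0]
print(vectorize("Dogs like to eat big bones."))  # [0, 1, 0, 0, 1, 1, 1, 1]
```

Every document, whatever its length, comes out as a vector of length 8, which is what makes the encoding machine-learning friendly.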
  16. Classify Documents (Eating) Once the data is digitized, we can classify the documents with regard to “eating”. Since the first document is about chasing mice, maybe playing, we’ll assign a 0. It doesn’t really have to do with eating. The second document is clearly about eating. So, we’ll assign it a 1. At this point, we can train the data with logistic regression, a neural network, a support vector machine, etc.
  17. Predict on New Data Once our model has finished training, we can try predicting on new data to see if it’s classified correctly. Here you can see we have a new document, “Bats eat bugs.”. This document has never been seen by our machine learning algorithm yet. We want to try and categorize it as being about “eating” or not. We’ll first digitize the document, just like we did with our training corpus. In this case, we only have 1 term found in the dictionary.
  18. Predict on New Data The machine learning algorithm is probably going to find a relationship with this particular bit, highlighted in red above. This bit corresponds to the term “eat”, and is found in the training document that was classified as 1 for the category “eating”. Based on this similarity, our model is probably going to predict our new document as … ?
  19. Predict on New Data So this is the general idea behind natural language processing. Now, we didn’t have to classify just on “eating”. We could have just as easily classified based upon sentiment. In fact, this is a common method for performing sentiment analysis with machine learning. (Another non-machine learning method for sentiment analysis is using the AFINN word-list approach). This was a very basic example of natural language processing. In a real-world case, you could have tens of thousands of documents, with perhaps, multiple classifications. There are also various ways to encode the corpus, such as the count of the term within the sentence, tf-idf, and more.
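The slides use binary presence/absence encoding; the note above mentions alternatives such as raw counts and tf-idf. A minimal tf-idf sketch over the same two-document corpus (one common unsmoothed variant; the talk doesn't specify a formula):

```python
import math
import re

corpus = ["Cats like to chase mice.", "Dogs like to eat big bones."]
docs = [re.findall(r"[a-z]+", d.lower()) for d in corpus]

def tf_idf(term, doc):
    tf = doc.count(term) / len(doc)          # term frequency in this doc
    df = sum(term in d for d in docs)        # document frequency
    idf = math.log(len(docs) / df)           # rarer terms weigh more
    return tf * idf

# Terms appearing in every document ("like", "to") get idf = 0,
# so tf-idf automatically down-weights uninformative words.
weights = {t: round(tf_idf(t, docs[0]), 3) for t in docs[0]}
print(weights)
```

Unlike the binary encoding, tf-idf already encodes a crude notion of which terms are informative, without any stop-word list.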
  20. Does it Really Work? Here is an actual example in R. The code takes the original sentences from this example and builds a document-term-matrix. Notice how the 1’s and 0’s align perfectly with what we’ve seen in the previous slides. The order of the terms is a little different, but otherwise the values are the same. The ‘y’ column is the classification (eating). We train on the data using a generalized linear model, with 100% accuracy. It’s only 2 training cases, so it’s not all that difficult to train. You can see the results of training when we call “predict”. It outputs the same ‘y’ values as the training data. We then run the model on our test sentence, that the AI has never seen before, and call “predict”. It outputs a 1, which is correct, since this sentence is indeed about “eating”. There is a link to the source code in this slide https://goo.gl/UxjPBs for anyone that is curious and wants to try running it.
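The R example trains a generalized linear model; as a dependency-free stand-in, the sketch below trains a simple perceptron (not the talk's glm) on the same two binary vectors and predicts on the unseen sentence. The dictionary order follows the slides (cats, like, chase, mice, dogs, eat, big, bones).

```python
X = [[1, 1, 1, 1, 0, 0, 0, 0],   # "Cats like to chase mice."    -> 0
     [0, 1, 0, 0, 1, 1, 1, 1]]   # "Dogs like to eat big bones." -> 1
y = [0, 1]

w, b = [0.0] * 8, 0.0
for _ in range(10):                   # a few epochs is plenty for 2 samples
    for xi, yi in zip(X, y):
        pred = 1 if sum(wj * xj for wj, xj in zip(w, xi)) + b >= 0 else 0
        for j in range(8):            # perceptron update, only on error
            w[j] += (yi - pred) * xi[j]
        b += (yi - pred)

# "Bats eat bugs." digitized: only the "eat" bit is set.
test = [0, 0, 0, 0, 0, 1, 0, 0]
pred = 1 if sum(wj * xj for wj, xj in zip(w, test)) + b >= 0 else 0
print(pred)  # 1 -> classified as "eating"
```

As in the slides, the shared "eat" feature between the second training document and the test sentence is what drives the prediction to 1.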