SlideShare ist ein Scribd-Unternehmen logo
1 von 7
Downloaden Sie, um offline zu lesen
PyData London 2014
Jonathan Sedar, Applied AI
@jonsedar
Text mining to correct
missing CRM information
A practical data science project
LIGHTNING EDITION
In a nutshell
● A CRM dataset (100k business accounts) belonging to
a national energy supplier
● A knotty problem: multiple accounts per company,
without any grouping ids
● How can we to find groups of accounts (larger
company structures), using just the CRM data?
● Machine Learning (ML) and Natural Language
Processing (NLP) tools and techniques in Python.
● Import: Scikit Learn and TextBlob (NLTK & Pattern)
Show me the data!
Company
ID
Account name Contact
name
Premises address
lines 1 - 4
Billing address
lines 1 - 4
1 Bob’s Pizza Big Bob 5 High St, Wexford 5 High St, Wexford
1 Bob’s Pizza Big Bob Temple Bar, D2 5 High St, Wexford
1 Mike’s Kebabs Mad Mike 3 Upper St, Dublin 5 High St, Wexford
2 Mark’s Kebabs Mild Mark 8 Upper St, Dublin Main St, Waterford
3 Fred’s Falafel Fat Fred 9 Henry St, Cork 9 Henry st, cork
3 Fred Fallafell Freddie Bridges St, Galway Henrys St, Cork
… x100,000
This crucial bit of info groups the separately-
recorded accounts into companies…
and was missing from the dataset
Transforming text into useful structures
Account holder
Business name
Premises address
Billing address
...
x100,000
TF-IDF
2-D MATRIX
Cleaned, parsed,
tokenised text
strings
SVD N-D
MATRIX
NTLK.PunktTokeniser
sklearn.TfidfVectorizer
sklearn.TruncatedSVD
Now we can identify similar accounts:
1. Suggest similar
accounts to be grouped
sklearn.MiniBatchKMeans
sklear.AffinityPropagation
sklearn.RadiusNeighborsClassifier
2. Human
validation &
verification 3. Incorporate & propagate
valid groupings
A pragmatic process using OOTB Python
machine learning and human expertise
● A very quick turnaround from raw data to tagged
companies to 93% accuracy
● ~40% of accounts found to belong to a company, ~3.5
accounts per company
● NLP toolkits and scikit-learn allowed rapid
development and testing of solution
● Incorporated human identification at critical stages:
no ML problem is an island
PyData London 2014
Jonathan Sedar, Applied AI
@jonsedar
Thank you
Any questions?

Weitere ähnliche Inhalte

Andere mochten auch

Requirements for Processing Datasets for Recommender Systems
Requirements for Processing Datasets for Recommender SystemsRequirements for Processing Datasets for Recommender Systems
Requirements for Processing Datasets for Recommender Systems
Stoitsis Giannis
 
Recommendation techniques
Recommendation techniques Recommendation techniques
Recommendation techniques
sun9413
 
Role-Based Contextual Recommendation
Role-Based Contextual RecommendationRole-Based Contextual Recommendation
Role-Based Contextual Recommendation
phonecom
 
Recommender system algorithm and architecture
Recommender system algorithm and architectureRecommender system algorithm and architecture
Recommender system algorithm and architecture
Liang Xiang
 

Andere mochten auch (12)

Requirements for Processing Datasets for Recommender Systems
Requirements for Processing Datasets for Recommender SystemsRequirements for Processing Datasets for Recommender Systems
Requirements for Processing Datasets for Recommender Systems
 
Customer Relationship Management in Ireland Managing your Customers for Busin...
Customer Relationship Management in Ireland Managing your Customers for Busin...Customer Relationship Management in Ireland Managing your Customers for Busin...
Customer Relationship Management in Ireland Managing your Customers for Busin...
 
Recommendation Engine Demystified
Recommendation Engine DemystifiedRecommendation Engine Demystified
Recommendation Engine Demystified
 
Recommendation techniques
Recommendation techniques Recommendation techniques
Recommendation techniques
 
Data mining
Data miningData mining
Data mining
 
Data Mining Techniques for CRM
Data Mining Techniques for CRMData Mining Techniques for CRM
Data Mining Techniques for CRM
 
Association rule mining
Association rule miningAssociation rule mining
Association rule mining
 
The comparative study of apriori and FP-growth algorithm
The comparative study of apriori and FP-growth algorithmThe comparative study of apriori and FP-growth algorithm
The comparative study of apriori and FP-growth algorithm
 
Role-Based Contextual Recommendation
Role-Based Contextual RecommendationRole-Based Contextual Recommendation
Role-Based Contextual Recommendation
 
Lecture13 - Association Rules
Lecture13 - Association RulesLecture13 - Association Rules
Lecture13 - Association Rules
 
Apriori Algorithm
Apriori AlgorithmApriori Algorithm
Apriori Algorithm
 
Recommender system algorithm and architecture
Recommender system algorithm and architectureRecommender system algorithm and architecture
Recommender system algorithm and architecture
 

Ähnlich wie Text Mining to Correct Missing CRM Information by Jonathan Sedar

All india data in our hands
All india data in our handsAll india data in our hands
All india data in our hands
quickdatasale
 
Frakture demo slides 4.1
Frakture demo slides 4.1Frakture demo slides 4.1
Frakture demo slides 4.1
Chris Lundberg
 
E Marketplace Strategy
E Marketplace StrategyE Marketplace Strategy
E Marketplace Strategy
rlynes
 
IT Vertical Overview
IT Vertical OverviewIT Vertical Overview
IT Vertical Overview
Tom Burns
 

Ähnlich wie Text Mining to Correct Missing CRM Information by Jonathan Sedar (20)

SystemT: Declarative Information Extraction (invited talk at MIT CSAIL)
SystemT: Declarative Information Extraction (invited talk at MIT CSAIL)SystemT: Declarative Information Extraction (invited talk at MIT CSAIL)
SystemT: Declarative Information Extraction (invited talk at MIT CSAIL)
 
Into dq ed wrazen
Into dq ed wrazenInto dq ed wrazen
Into dq ed wrazen
 
Data Operations for CRM and Marketing - OpenRefine Demo - Webinar
Data Operations for CRM and Marketing - OpenRefine Demo - WebinarData Operations for CRM and Marketing - OpenRefine Demo - Webinar
Data Operations for CRM and Marketing - OpenRefine Demo - Webinar
 
Time Difference: How Tomorrow's Companies Will Outpace Today's
Time Difference: How Tomorrow's Companies Will Outpace Today'sTime Difference: How Tomorrow's Companies Will Outpace Today's
Time Difference: How Tomorrow's Companies Will Outpace Today's
 
DownUnder Dreaming - 5 steps to dreamy data
DownUnder Dreaming - 5 steps to dreamy dataDownUnder Dreaming - 5 steps to dreamy data
DownUnder Dreaming - 5 steps to dreamy data
 
Customer Data Operations: Your hidden growth engine (CXL Live 2018)
Customer Data Operations: Your hidden growth engine (CXL Live 2018)Customer Data Operations: Your hidden growth engine (CXL Live 2018)
Customer Data Operations: Your hidden growth engine (CXL Live 2018)
 
Communauté UiPath Suisse romande - Séance de janvier 2024
Communauté UiPath Suisse romande - Séance de janvier 2024Communauté UiPath Suisse romande - Séance de janvier 2024
Communauté UiPath Suisse romande - Séance de janvier 2024
 
Communauté UiPath Suisse romande - Séance de janvier 2024
Communauté UiPath Suisse romande - Séance de janvier 2024Communauté UiPath Suisse romande - Séance de janvier 2024
Communauté UiPath Suisse romande - Séance de janvier 2024
 
How Retail Banks Use MongoDB
How Retail Banks Use MongoDBHow Retail Banks Use MongoDB
How Retail Banks Use MongoDB
 
What it takes to predict the future
What it takes to predict the futureWhat it takes to predict the future
What it takes to predict the future
 
SAP Forum Ankara 2017 - "Verinin Merkezine Seyahat"
SAP Forum Ankara 2017 - "Verinin Merkezine Seyahat"SAP Forum Ankara 2017 - "Verinin Merkezine Seyahat"
SAP Forum Ankara 2017 - "Verinin Merkezine Seyahat"
 
[Conference] Cognitive Graph Analytics on Company Data and News
[Conference] Cognitive Graph Analytics on Company Data and News[Conference] Cognitive Graph Analytics on Company Data and News
[Conference] Cognitive Graph Analytics on Company Data and News
 
All india data in our hands
All india data in our handsAll india data in our hands
All india data in our hands
 
Frakture demo slides 4.1
Frakture demo slides 4.1Frakture demo slides 4.1
Frakture demo slides 4.1
 
ML Business Data Guide
ML Business Data GuideML Business Data Guide
ML Business Data Guide
 
E Marketplace Strategy
E Marketplace StrategyE Marketplace Strategy
E Marketplace Strategy
 
IT Vertical Overview
IT Vertical OverviewIT Vertical Overview
IT Vertical Overview
 
D3 IDP Slides.pdf
D3 IDP Slides.pdfD3 IDP Slides.pdf
D3 IDP Slides.pdf
 
Apics12
Apics12Apics12
Apics12
 
Using graph technologies to fight fraud
Using graph technologies to fight fraudUsing graph technologies to fight fraud
Using graph technologies to fight fraud
 

Mehr von PyData

Mehr von PyData (20)

Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
 
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif WalshUnit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
 
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake BolewskiThe TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
 
Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...
 
Deploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne BauerDeploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne Bauer
 
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam LermaGraph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
 
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
 
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo MazzaferroRESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
 
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
 
Avoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven LottAvoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
 
Words in Space - Rebecca Bilbro
Words in Space - Rebecca BilbroWords in Space - Rebecca Bilbro
Words in Space - Rebecca Bilbro
 
End-to-End Machine learning pipelines for Python driven organizations - Nick ...
End-to-End Machine learning pipelines for Python driven organizations - Nick ...End-to-End Machine learning pipelines for Python driven organizations - Nick ...
End-to-End Machine learning pipelines for Python driven organizations - Nick ...
 
Pydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica PuertoPydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica Puerto
 
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
 
Extending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will AydExtending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will Ayd
 
Measuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen HooverMeasuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen Hoover
 
What's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper SeaboldWhat's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper Seabold
 
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
 
Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-WardSolving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward
 
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
 

Kürzlich hochgeladen

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 

Kürzlich hochgeladen (20)

presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 

Text Mining to Correct Missing CRM Information by Jonathan Sedar

  • 1. PyData London 2014 Jonathan Sedar, Applied AI @jonsedar Text mining to correct missing CRM information A practical data science project LIGHTNING EDITION
  • 2. In a nutshell ● A CRM dataset (100k business accounts) belonging to a national energy supplier ● A knotty problem: multiple accounts per company, without any grouping ids ● How can we to find groups of accounts (larger company structures), using just the CRM data? ● Machine Learning (ML) and Natural Language Processing (NLP) tools and techniques in Python. ● Import: Scikit Learn and TextBlob (NLTK & Pattern)
  • 3. Show me the data! Company ID Account name Contact name Premises address lines 1 - 4 Billing address lines 1 - 4 1 Bob’s Pizza Big Bob 5 High St, Wexford 5 High St, Wexford 1 Bob’s Pizza Big Bob Temple Bar, D2 5 High St, Wexford 1 Mike’s Kebabs Mad Mike 3 Upper St, Dublin 5 High St, Wexford 2 Mark’s Kebabs Mild Mark 8 Upper St, Dublin Main St, Waterford 3 Fred’s Falafel Fat Fred 9 Henry St, Cork 9 Henry st, cork 3 Fred Fallafell Freddie Bridges St, Galway Henrys St, Cork … x100,000 This crucial bit of info groups the separately- recorded accounts into companies… and was missing from the dataset
  • 4. Transforming text into useful structures Account holder Business name Premises address Billing address ... x100,000 TF-IDF 2-D MATRIX Cleaned, parsed, tokenised text strings SVD N-D MATRIX NTLK.PunktTokeniser sklearn.TfidfVectorizer sklearn.TruncatedSVD
  • 5. Now we can identify similar accounts: 1. Suggest similar accounts to be grouped sklearn.MiniBatchKMeans sklear.AffinityPropagation sklearn.RadiusNeighborsClassifier 2. Human validation & verification 3. Incorporate & propagate valid groupings
  • 6. A pragmatic process using OOTB Python machine learning and human expertise ● A very quick turnaround from raw data to tagged companies to 93% accuracy ● ~40% of accounts found to belong to a company, ~3.5 accounts per company ● NLP toolkits and scikit-learn allowed rapid development and testing of solution ● Incorporated human identification at critical stages: no ML problem is an island
  • 7. PyData London 2014 Jonathan Sedar, Applied AI @jonsedar Thank you Any questions?