3. Agenda
• Text Analytics concepts and terms
• Azure ML capabilities for text classification
• Implementation Details
• Spam Detection Model – binary classification
• Model for classifying issues from SR text descriptions – multi-class classification
• Operationalization of the model
4. Text Analytics
Def: The term text analytics describes a set of linguistic, statistical, and machine learning
techniques that model and structure the information content of textual sources for
business intelligence, exploratory data analysis, research, or investigation.
Text Classification
• Binary Classification (for example: Spam Detection)
• Multiclass Classification (for example: Product classification by text description)
Text Clustering
• Grouping identical or similar text documents based on a distance/similarity function (usually cosine similarity in a vector-space model; see the sketch below)
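For illustration (not part of the original deck), a minimal sketch of cosine similarity between two term-frequency vectors; the whitespace tokenization is a simplifying assumption:

from collections import Counter
import math

def cosine_similarity(doc_a, doc_b):
    # Build term-frequency vectors from whitespace-tokenized text
    tf_a, tf_b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    # Dot product over the shared vocabulary
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    return dot / (norm_a * norm_b)

print(cosine_similarity("text analytics is fun", "text analytics is useful"))  # 0.75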
Sentiment Analysis
• Identify and extract subjective information in source materials
• Positive, Negative, Neutral
Named Entity Recognition
• Subtask of information extraction that seeks to locate and classify elements in text into
pre-defined categories such as the names of persons, organizations, locations,
expressions of times, quantities, monetary values, percentages, etc.
5. Text Representation – transform text into numerical vectors
Bag Of Words Model (Vector Space Model)
• Each dimension (axis) corresponds to a document feature.
• Features: words or phrases (bag of words model)
• TF (term frequency): number of occurrences of each word in a document
• TF-IDF (term frequency – inverse document frequency) table: weight assigned to each term describing a document:
W_ij = TF × IDF = tf_ij × log(N / df_i)
TF – term frequency
IDF – inverse document frequency
W_ij – weight of the i-th term in the j-th document
tf_ij – frequency of the i-th term in document j
N – total number of documents in the collection
df_i – number of documents containing the i-th term
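A minimal worked sketch of the W_ij = tf_ij × log(N / df_i) weighting above, in plain Python over a hypothetical three-document collection (real systems typically use a library such as scikit-learn, whose smoothing differs slightly):

import math
from collections import Counter

docs = ["this is spam", "this is not spam", "buy now"]  # toy collection
N = len(docs)
tokenized = [d.split() for d in docs]

# df_i: number of documents containing term i
df = Counter(t for doc in tokenized for t in set(doc))

# W_ij = tf_ij * log(N / df_i) for each term i in document j
for j, doc in enumerate(tokenized):
    tf = Counter(doc)
    weights = {t: tf[t] * math.log(N / df[t]) for t in tf}
    print(j, weights)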
• N-grams – representing text features
Example: Text classification is an important area in text analytics
2-grams:
Text classification | classification is | is an | an important | important area | area in | in text | text analytics
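A quick sketch of how the 2-grams above can be produced (whitespace tokenization is a simplification):

def ngrams(text, n=2):
    tokens = text.split()
    # Slide a window of size n over the token list
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("Text classification is an important area in text analytics"))
# ['Text classification', 'classification is', 'is an', ...]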
7. Azure ML Text Classification Workflow
Step 1. Data Preparation
SQL queries, Excel, R, …
Result:
Label | Text
1     | This is a spam
0     | This is a text that is not a spam
1     | This is another spam
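Besides SQL/Excel/R, the same label/text table could be assembled with a few lines of Python/pandas (the file name and column names here are hypothetical):

import pandas as pd

# Hypothetical raw export; keep only a label column (1 = spam) and the message text
df = pd.read_csv("forum_messages.csv")
train = df[["IsSpam", "MessageText"]].rename(
    columns={"IsSpam": "Label", "MessageText": "Text"})
train.to_csv("training_set.csv", index=False)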
9. Step 3. Feature Representation and Extraction – 2 AML modules
- Feature Hashing
Parameters: hashing bit size, N-grams (a sketch of the idea follows below)
- Filter Based Feature Selection
Parameters: set in the module's Properties pane
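Outside AML, the effect of the Feature Hashing module can be approximated with scikit-learn's HashingVectorizer (an illustration under assumptions: 2**18 plays the role of an 18-bit hashing size, and the input texts are toy examples):

from sklearn.feature_extraction.text import HashingVectorizer

# n_features = 2**bits plays the role of the module's "hashing bit size"
vectorizer = HashingVectorizer(n_features=2**18, ngram_range=(1, 2))
X = vectorizer.transform(["This is a spam", "This is not spam"])
print(X.shape)  # (2, 262144) sparse matrix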
Step 4. Train Model
- Many binary and multiclass learners: logistic regression, boosted decision tree, SVM, decision forest, …
Step 5. Evaluate model
- Cross Validate Model
- Score Model
- Evaluate Model
Step 6. Visualization of the results and numerical metrics
- Binary classifiers – precision, recall, F1 score, AUC, ROC curve charts
- Multiclass – confusion table, custom script for precision/recall calculations (sketched below)
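For the multiclass case, the "custom script" for per-class precision/recall could look roughly like this with scikit-learn (labels and predictions are placeholders):

from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

y_true = [0, 1, 2, 2, 1, 0]  # placeholder gold labels
y_pred = [0, 1, 2, 1, 1, 0]  # placeholder model output

print(confusion_matrix(y_true, y_pred))
# Per-class precision, recall and F1 (average=None keeps one value per class)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average=None)
print(precision, recall, f1)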
10. Spam Detection in answers.microsoft.com forums
Business Scenario:
Automatic spam detection in answers.microsoft.com threads.
Today, many volunteers and MS FTEs spend a lot of time and
effort cleaning spam messages out of the forums. The
solution is automatic spam detection.
Example POC: spam detection in AML, based on the message
content.
14. Predicting Products/Issues by SR problem description
Business Scenario:
The Azure support portal (Ibiza) wants to get rid of the user selections for the product and the
problem/issue, because users make mistakes or select “Other” when they are confused about what to
select. This leads to SR misrouting and hence slows down issue resolution. (We have seen up to 9
SR transfers during the SR life cycle.)
16. Customer ‘accuracy’ compared to SE selection:
• ~ 75% - Service (level 0)
• ~ 50% - Feature (level 1)
• ~ 25% - Issue (level 2)
Why?
• Too many topics; customers cannot discriminate
• Poorly defined topics
• Customers seldom traverse up the tree to find more relevant topics
• Customers don’t know how to classify their symptoms
• Customers enter anything just to reach assisted support
Consequence
• Less self-help, more support volume
• Poor routing, more MPI
Support Topic Taxonomy
17. Current Office 365 online support experience
Note: current MOP experience. The POR is for this UX to be
replaced with a text-input-only ‘Maven’ UI.
24. Predicting O365 Issues by Problem Description – Analysis
1. Accuracy is not high enough to rely on the problem descriptions alone (i.e., to drop the manual selections completely)
2. Idea for functionality based on the results from the ML model:
- Sort/rank the products and the problems/issues in the selection list boxes by the probability returned from the
ML model (a sketch follows below).
Expected result: fewer wrong selections, based on the assumption that the user will find the correct
options at the top of the list.
This is an example of how ML models can be useful even when they cannot solve a problem completely.
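A sketch of the ranking idea, assuming the model returns one probability per candidate topic (the topic names and scores are hypothetical):

# Hypothetical scores returned by the model for each candidate product/issue
scores = {"Exchange Online": 0.62, "SharePoint": 0.21, "Other": 0.05, "Teams": 0.12}

# Show the list box entries in descending order of model probability
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # ['Exchange Online', 'SharePoint', 'Teams', 'Other']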
25. Operationalization
1. Azure ML automatically creates a REST web service (a call sketch follows below)
2. Azure ML provides an easy way to deploy the production version of the model to a production environment
3. Performance – slower than TLC
4. Poor debugging capabilities
5. Poor code instrumentation/troubleshooting capabilities
6. Scalability – deployment on a limited set of machines (16)
Consider all of the above pros/cons when deciding whether to put an AML model in production.
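A minimal sketch of calling the auto-generated request/response web service (the endpoint URL and API key are placeholders; the body follows the classic AML Studio request format):

import requests

url = "https://ussouthcentral.services.azureml.net/workspaces/<id>/services/<id>/execute?api-version=2.0"
api_key = "<your-api-key>"

body = {
    "Inputs": {
        "input1": {
            "ColumnNames": ["Text"],
            "Values": [["Cannot send email from Outlook"]],
        }
    },
    "GlobalParameters": {},
}
resp = requests.post(url, json=body,
                     headers={"Authorization": "Bearer " + api_key})
print(resp.json())  # scored labels and probabilities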
But the reality is, we ask the customer to select from too many topics, many of which are confused with others. Customers’ ability to reliably select the right symptom falls to 25% when compared with what the Support Engineer would choose. (NOTE: we are moving to PFAs for a better ‘ground truth’.)
While the symptom tree (called support topics) is only 3 levels deep, it is very broad and growing as new products and features are introduced to O365. As you can see, the top ‘Service’ level has 19 classes. Each service has between 3 and 23 issue groups (which we call features), and each feature bucket contains anywhere from 4 to 38 issues. Overall, there are 1,300 possible topics to choose from! The dilemma is how to surface a reduced but relevant taxonomy to a customer.
What’s the cost? When customers do not properly self-classify, they can’t be given the best self-help. They consequently submit a service request or call assisted support, where the cost per service request is high. Incorrect symptom self-classification also increases the chance of misroutes – the wrong team getting the request. Even if one argues that customers self-select at 80% accuracy rather than 25%, the cost, at a volume of 100,000 cases per month, runs into the millions per year.
Here is a screenshot of the existing customer experience in Office 365, where the customer first selects the top service level, then the feature and symptom level, and then describes their issue.