INDIA BUILD
DATASET AND CASES
CIB ANALYTICS CONSULTING
Paris, July 19th 2016
A public e-mail dataset: The case of Enron
Where can one find a corporate e-mail dataset?
The story of Enron
Enron was a U.S. company that was once one of the biggest corporations in the world. It specialized in energy and commodity
services, mostly trading energy and commodity financial derivatives (such as gas futures or weather derivatives). It filed for
bankruptcy on the 2nd of December 2001, after it came to light that its revenues and profits had been exaggerated, making it
one of the biggest accounting frauds in recorded history. If you want to find out more about the company, there is quite a
comprehensive Wikipedia page: https://en.wikipedia.org/wiki/Enron.
From bankruptcy to an open-source corporate e-mail dataset…
In the course of its investigation after the collapse, the Federal Energy Regulatory Commission chose to make public the
e-mails of about 150 Enron users (mostly senior management). As a result, this dataset is the largest freely accessible
corporate e-mail database. For more general information about the dataset (how it was assembled, cleaned and made available
to the public), please visit: https://www.cs.cmu.edu/~./enron/ (Carnegie Mellon University, Computer Science).
Who’s who?
The individuals that can be seen sending and receiving e-mails in this database can be divided into three groups:
o The 150 individuals targeted by this data release
o Other individuals working at Enron with whom they have conversations
o All other individuals that are outside of Enron but talk to people working for Enron
Academic papers
Here are some of the many academic papers describing the data and presenting findings from various analyses:
Proceedings of the Sixth SIAM International Conference on Data Mining
Available at this link: https://books.google.fr/
The 2001 Annotated (by Topic) Enron Email Data Set
Dr. Michael W. Berry and Murray Browne, April 10, 2007
The Enron Email Dataset - Database Schema and Brief Statistical Report
Jitesh Shetty, Jafar Adibi
Introducing the Enron Corpus
B Klimt, Y Yang - CEAS, 2004
Discovering important nodes through graph entropy: the case of Enron email database
J Shetty, J Adibi - Proceedings of the 3rd international workshop on Link discovery, 2005
Description of the dataset
What can we say about the data?
Network of users
Are 150 users enough?
One might ask whether it is possible to build a proper, complex network of individuals from only 150 users. We believe it is.
Here is a graph representation of only a small part (1%) of the communications:
[Figure: graph of the communication network]
THE DATASET
GENERAL DATA
- 700,000+ e-mails
- Sent, received, deleted items
- All e-mails from the users apart from some private e-mails
- Most of the traffic occurs between 2000 and 2002
METADATA
- Subject, sender, receiver, copied members…
- Name of attachments
- Date and time
INTEGRITY ISSUES
- E-mails are not unique
- Addresses in irregular formats
- Attachments not included
CONTENT
- Full body as string
- Signatures and footers
- Phone numbers
- Name of attachments
- Reply/Forward tags
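The two integrity issues above (duplicate e-mails and irregular address formats) could be handled along the following lines. This is only a minimal sketch: the dictionary keys `sender`, `date`, `subject` and `body`, the sample addresses, and the choice of a SHA-1 fingerprint over those four fields are assumptions made for illustration, not something prescribed by the dataset.

```python
import hashlib
import re

# Assumed pattern: grabs the first thing that looks like an address.
ADDRESS_RE = re.compile(r"[\w.+'-]+@[\w-]+(?:\.[\w-]+)+")


def normalize_address(raw):
    """Strip display names, quotes and brackets, and lower-case the address."""
    match = ADDRESS_RE.search(raw)
    return match.group(0).lower() if match else None


def message_key(msg):
    """Fingerprint built from fields that should be identical across copies."""
    parts = [
        normalize_address(msg["sender"]) or "",
        msg["date"].strip(),
        " ".join(msg["subject"].split()).lower(),
        " ".join(msg["body"].split()),  # collapse whitespace differences
    ]
    return hashlib.sha1("\x1f".join(parts).encode("utf-8")).hexdigest()


def deduplicate(messages):
    """Keep the first occurrence of each fingerprint, preserving order."""
    seen = set()
    unique = []
    for msg in messages:
        key = message_key(msg)
        if key not in seen:
            seen.add(key)
            unique.append(msg)
    return unique


# Hypothetical sample: the first two entries are copies of the same e-mail.
sample = [
    {"sender": '"Doe, Jane" <Jane.Doe@ENRON.com>', "date": "Mon, 14 May 2001 09:00:00 -0700",
     "subject": "Gas  futures", "body": "See attached.\n"},
    {"sender": "jane.doe@enron.com", "date": "Mon, 14 May 2001 09:00:00 -0700 ",
     "subject": "gas futures", "body": "See   attached."},
    {"sender": "jane.doe@enron.com", "date": "Tue, 15 May 2001 10:30:00 -0700",
     "subject": "gas futures", "body": "See attached."},
]
unique_messages = deduplicate(sample)
```

A stricter or looser fingerprint (for example ignoring the date) may suit the data better; that is a judgment call to make after looking at real duplicates.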
The challenge: broken down into 3 steps
Identify
o Contacts
o Phone numbers
o E-mail addresses
Link
o Who knows who?
o Which members are close?
o Visual representation of the network
Analyze
o Sentiment between members
o Discussed topics
o Shared interests
Ideally, those three steps will be embedded into one comprehensive, fully integrated tool.
Step 1: Build a clean and consolidated address book
Main Objective
Clean and consolidated address book extracted from e-mail headers (recipients, Ccs, sender), content and footers
(signatures)
What to do
o Extract first names, last names, company, entity
o Enrich your database with:
  - Phone numbers (mobile, landline…)
  - E-mail addresses (a contact may have several addresses)
  - Job title and position in the company
o Clean the database, remove duplicates, special characters…
o Check completeness of the output
Nice to have:
o Find a way of dealing with issues such as homonymy
Suggested Techniques
o Named Entity Recognition (NER)
o Contextual data analysis
o Regular expressions (regex)
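As an illustration of the regular-expression technique suggested for Step 1, the sketch below pulls e-mail addresses and phone numbers out of free text such as a signature, then merges them into a small address book keyed by address. The patterns only cover simple North-American style numbers, and the record layout `(display_name, address, signature_text)` and the sample contacts are assumptions made for the example.

```python
import re
from collections import defaultdict

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# Assumed format: 713-555-0100, 713.555.0100 or (713) 555-0100.
PHONE_RE = re.compile(r"\(?\b\d{3}\)?[\s.-]\d{3}[\s.-]\d{4}\b")


def extract_contacts(text):
    """Return the distinct e-mail addresses and phone numbers found in text."""
    emails = sorted({e.lower() for e in EMAIL_RE.findall(text)})
    # Keep digits only so that differently formatted copies compare equal.
    phones = sorted({re.sub(r"\D", "", p) for p in PHONE_RE.findall(text)})
    return {"emails": emails, "phones": phones}


def build_address_book(records):
    """Merge (display_name, address, signature_text) records per address."""
    book = defaultdict(lambda: {"names": set(), "phones": set()})
    for name, address, signature in records:
        entry = book[address.strip().lower()]
        if name.strip():
            entry["names"].add(" ".join(name.split()).title())
        entry["phones"].update(extract_contacts(signature)["phones"])
    return dict(book)


# Hypothetical records for one person seen under two spellings.
records = [
    ("jane  DOE", "Jane.Doe@Enron.com", "Jane Doe\nTel: (713) 555-0100"),
    ("Jane Doe", "jane.doe@enron.com ", "office 713.555.0100 fax 713-555-0199"),
]
address_book = build_address_book(records)
```

Keying on the address alone does not solve homonymy or people with several addresses; those cases would need an extra matching step on names, company and context.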
Step 2: Map the contact network within the dataset
Main Objective
Contact network and its analysis (edge weighting, recommendation engine, ...)
What to do
o Use the e-mail headers or implied metadata to link recipients, sender, cc…
o Weight the edges of the output graph according to the closeness between two members
o Try to build some part of the graph as a hierarchy (e.g. company org chart using job titles)
o Propose your own graph analysis, using for example frequency, how recent the last communication was…
o Display a visual representation of this network
o Propose a new contact recommendation engine
Suggested Techniques
o NetworkX, Linkurious
o PageRank
o Graph mining
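One possible way to weight edges by frequency and recency, and to derive a simple contact recommendation from them, is sketched below using only the standard library. The 90-day half-life, the message keys (`sender`, `to`, `cc`, `date`) and the "shared contacts" scoring rule are all assumptions chosen for illustration; the resulting weights could equally be loaded into a graph library such as NetworkX.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def build_weighted_edges(messages, now, half_life_days=90.0):
    """Each e-mail adds a recency-decayed unit to the sender-recipient edge."""
    weights = defaultdict(float)
    for msg in messages:
        age_days = (now - msg["date"]).total_seconds() / 86400.0
        decay = 0.5 ** (age_days / half_life_days)
        for recipient in set(msg["to"]) | set(msg.get("cc", [])):
            if recipient != msg["sender"]:
                weights[frozenset((msg["sender"], recipient))] += decay
    return dict(weights)


def recommend(weights, person, top_n=3):
    """Rank people `person` has no edge with by the strength of shared contacts."""
    adjacency = defaultdict(dict)
    for pair, weight in weights.items():
        a, b = tuple(pair)
        adjacency[a][b] = weight
        adjacency[b][a] = weight
    scores = defaultdict(float)
    for contact, w1 in adjacency[person].items():
        for candidate, w2 in adjacency[contact].items():
            if candidate != person and candidate not in adjacency[person]:
                # A path is only as strong as its weaker edge.
                scores[candidate] += min(w1, w2)
    return sorted(scores, key=lambda c: (-scores[c], c))[:top_n]


# Hypothetical traffic between five members a..e.
now = datetime(2002, 1, 1)
messages = [
    {"sender": "a", "to": ["b"], "date": now},
    {"sender": "b", "to": ["c"], "date": now},
    {"sender": "a", "to": ["d"], "cc": [], "date": now - timedelta(days=90)},
    {"sender": "d", "to": ["e"], "date": now},
]
weights = build_weighted_edges(messages, now)
suggestions = recommend(weights, "a")
```

Here `c` is suggested to `a` ahead of `e` because the path through `b` is more recent than the path through `d`.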
Step 3: Enrich the contact network
Main Objective
Some enriched social networks (sentiment, interests, discussed topics, …)
What to do
o Continue enriching the social network by adding knowledge to the edges
o Try to identify interests shared by several members by scanning e-mail content
o Try to infer the sentiment people have toward one another from e-mail content and communication patterns
o Propose your own analysis…
Suggested Techniques
o TextBlob
o NLTK (Natural Language Toolkit)
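As a very rough starting point for identifying shared interests, the sketch below counts in how many of a member's e-mails each term appears and reports the terms that two members both use repeatedly. It deliberately avoids TextBlob and NLTK to stay dependency-free; the tiny stop-word list, the three-letter minimum and the `min_mails` threshold are assumptions, and a real solution would likely use proper tokenization and topic modelling.

```python
import re
from collections import Counter

# Assumed, deliberately tiny stop-word list for the example.
STOPWORDS = {"the", "and", "this", "that", "for", "with", "are", "you", "more", "less"}


def tokens(text):
    """Lower-case alphabetic words of three letters or more, minus stop words."""
    return [w for w in re.findall(r"[a-z]{3,}", text.lower()) if w not in STOPWORDS]


def interest_profile(bodies):
    """Count, for each term, the number of e-mails in which it appears."""
    counts = Counter()
    for body in bodies:
        counts.update(set(tokens(body)))  # each term counted once per e-mail
    return counts


def shared_interests(profile_a, profile_b, min_mails=2):
    """Terms that both members use in at least `min_mails` e-mails."""
    return sorted(
        term for term in profile_a
        if profile_a[term] >= min_mails and profile_b.get(term, 0) >= min_mails
    )


# Hypothetical e-mail bodies for two members.
member_a = interest_profile(["The gas futures desk is busy", "Gas prices and golf on Friday"])
member_b = interest_profile(["Golf this weekend?", "More golf, less gas", "gas pipeline update"])
common = shared_interests(member_a, member_b)
```

The resulting term list can be stored as an attribute on the edge between the two members, which is one way of "adding knowledge to the edges".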
Academic papers and useful sites
References that you might find useful specifically when dealing with Step 1
A survey of named entity recognition and classification
D Nadeau, S Sekine - Lingvisticae Investigationes, 2007
Duplicate record detection: A survey
AK Elmagarmid, PG Ipeirotis… - IEEE Transactions on …, 2007
Open sourcing our email signature parsing library
http://blog.mailgun.com/open-sourcing-our-email-signature-parsing-library/
References that you might find useful specifically when dealing with Step 2
NetworkX
https://networkx.github.io/
Linkurious
https://linkurio.us/
Building a Recommendation Engine
https://neo4j.com/developer/guide-build-a-recommendation-engine/
References that you might find useful specifically when dealing with Step 3
Opinion mining and sentiment analysis
B Pang, L Lee - Foundations and trends in information retrieval, 2008
Sentiment analysis algorithms and applications: A survey
W Medhat, A Hassan, H Korashy - Ain Shams Engineering Journal, 2014
General guidelines
Philosophy
Open source material
You are encouraged to use open source material as much as you can in order to leverage the power of your own application.
Open mindedness
For each of the three challenges, we have defined what we consider to be the minimum that has to be done (flagged by
color). However, we encourage you to be as creative as you can: do not hesitate to propose disruptive methods or to
expand as much as possible on the original subject.
Technical requirements
Program architecture
Think modules. Please remember that your code could eventually be used on various datasets that come in different formats, so it
has to be easy to interface with on both the input side and the output side. For example, on the input side there should be a
procedure that transforms the data into a standard format, such as a list of dictionaries, and the main part of the algorithm
should work from that input. That way, if we run the code on another dataset in the future, we only have to write a procedure
that produces the same standard format. The same goes for the output: the program has to produce an easily convertible
result.
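The input-side idea described above could look like the following sketch: one adapter turns raw RFC 822 style messages into a list of dictionaries, and the core logic only ever sees that standard format. The function names and dictionary keys are hypothetical; only the calls into Python's standard `email` package are real library functions.

```python
from collections import Counter
from email import message_from_string
from email.utils import getaddresses


def parse_raw_email(raw):
    """Adapter: turn one raw message into the assumed standard dictionary."""
    msg = message_from_string(raw)

    def addresses(header):
        pairs = getaddresses(msg.get_all(header, []))
        return [addr.lower() for _, addr in pairs if addr]

    sender = addresses("From")
    return {
        "sender": sender[0] if sender else None,
        "to": addresses("To"),
        "cc": addresses("Cc"),
        "subject": msg.get("Subject", ""),
        "date": msg.get("Date", ""),
        "body": msg.get_payload(),
    }


def load_standard_format(raw_messages):
    """Input procedure: only this part changes for another dataset."""
    return [parse_raw_email(raw) for raw in raw_messages]


def sender_counts(records):
    """Core logic: depends on the standard format only, not on the source."""
    return Counter(r["sender"] for r in records if r["sender"])


# Hypothetical raw message.
raw = (
    "From: Jane Doe <Jane.Doe@enron.com>\n"
    "To: john.roe@enron.com, Sam <sam@example.com>\n"
    "Subject: Gas futures\n"
    "Date: Mon, 14 May 2001 09:00:00 -0700\n"
    "\n"
    "See attached.\n"
)
records = load_standard_format([raw])
counts = sender_counts(records)
```

Running the same analysis on a dataset stored as CSV or JSON would then only require a different `load_standard_format`, leaving `sender_counts` and any other core function untouched.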
Best practices
In terms of guidelines for code, it would be sensible to follow the development guidelines that apply to Python, which can be found at:
https://readthedocs.org/projects/python-guide/downloads/pdf/latest/.
