India build problem

INDIA BUILD
DATASET AND CASES
CIB ANALYTICS CONSULTING
Paris, July 19th 2016

A public e-mail dataset: The case of Enron
Where can one find a corporate e-mail dataset?
The story of Enron
Enron was a U.S. company that used to be one of the biggest corporations in the world. It specialized in energy and commodity
services, mostly trading energy and commodity financial derivatives (such as gas futures or weather derivatives). It filed for
bankruptcy on the 2nd of December 2001, after it came to light that its revenues and profits had been exaggerated, making it
one of the biggest accounting frauds in recorded history. If you want to find out more about the company, there is quite a
comprehensive Wikipedia page: https://en.wikipedia.org/wiki/Enron.
From bankruptcy to an open-source corporate e-mail dataset…
In the course of its investigation following the bankruptcy, the Federal Energy Regulatory Commission chose to make public the
e-mails of about 150 Enron users (mostly senior management). This makes it the largest freely accessible corporate e-mail database.
For more general information about the dataset (how it was assembled, cleaned and made available to the public), please visit:
https://www.cs.cmu.edu/~./enron/ (Carnegie Mellon University, Computer Science).
Who’s who?
The individuals that can be seen sending and receiving e-mails in this database can be divided into three groups:
o The 150 individuals targeted by this data release
o Other individuals working at Enron with whom they have conversations
o All other individuals that are outside of Enron but talk to people working for Enron

Academic papers
Here are some of the many academic papers describing the data and presenting findings on various analytical inquiries:
Proceedings of the Sixth SIAM International Conference on Data Mining
Available on this link: https://books.google.fr/
The 2001 Annotated (by Topic) Enron Email Data Set
Dr. Michael W. Berry and Murray Browne, April 10, 2007
The Enron Email Dataset - Database Schema and Brief Statistical Report
Jitesh Shetty, Jafar Adibi
Introducing the Enron Corpus
B Klimt, Y Yang - CEAS, 2004
Discovering important nodes through graph entropy: the case of Enron email database
J Shetty, J Adibi - Proceedings of the 3rd international workshop on Link discovery, 2005

Description of the dataset
What can we say about the data? Network of users
Are 150 users enough?
One might ask whether it is possible to build a proper and complex network of individuals from only 150 users. We believe
that it is. Here is a graph representation of only a small part (1%) of the communications:
[Figure: graph representation of 1% of the communications]
THE DATASET
GENERAL DATA
- 700,000+ e-mails
- Sent, received, deleted items
- All e-mails from the users apart from some private e-mails
- Most of the traffic occurs between 2000 and 2002
METADATA
- Subject, sender, receiver, copied members…
- Name of attachments
- Date and time
INTEGRITY ISSUES
- E-mails are not unique
- Addresses in irregular formats
- Attachments not included
CONTENT
- Full body as string
- Signatures and footers
- Phone numbers
- Name of attachments
- Reply/Forward tags
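Two of the integrity issues listed above (e-mails that are not unique, addresses in irregular formats) can be handled early in the pipeline. The following is only a minimal sketch: the record layout (a dictionary with "from", "to", "date" and "body" keys) and the function names are assumptions made for illustration, not something defined by the dataset.

```python
import hashlib
import re

# Deliberately simple pattern; real-world addresses can be more exotic.
ADDRESS_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def normalize_address(raw):
    """Return the first address found in `raw`, lower-cased, or None."""
    match = ADDRESS_RE.search(raw)
    return match.group(0).lower() if match else None


def fingerprint(msg):
    """Hash the fields that identify a message whatever folder it sits in."""
    recipients = sorted(filter(None, (normalize_address(r) for r in msg["to"])))
    key = "\x1f".join([
        normalize_address(msg["from"]) or "",
        ",".join(recipients),
        msg["date"].strip(),
        " ".join(msg["body"].split()),  # collapse whitespace differences
    ])
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


def deduplicate(messages):
    """Keep the first occurrence of each distinct message."""
    seen, unique = set(), []
    for msg in messages:
        fp = fingerprint(msg)
        if fp not in seen:
            seen.add(fp)
            unique.append(msg)
    return unique
```

Hashing normalized fields rather than the raw file means that the same e-mail stored in a "sent" folder and in a "deleted items" folder collapses to one record, even if the address formatting differs between the two copies.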

The challenge: broken down into 3 steps
Identify
o Contacts
o Phone numbers
o E-mail addresses
Link
o Who knows who?
o Which members are close?
o Visual representation of the network
Analyze
o Sentiment between members
o Discussed topics
o Shared interests
Ideally, these three steps will be embedded into one comprehensive and fully integrated tool.

Step 1: Build a clean and consolidated address book
Main Objective
Clean and consolidated address book extracted from e-mail headers (recipients, Ccs, sender), content and footers (signatures)
What to do
- Extract first names, last names, company, entity
- Enrich your database with:
o Phone numbers (mobile, landline…)
o E-mail addresses (a contact may have several addresses)
o Job title and position in the company
- Clean the database, remove duplicates, special characters…
- Check completeness of the output
Nice to have:
- Find a way of dealing with issues such as homonymy
Suggested Techniques
- Named Entity Recognition (NER)
- Contextual data analysis
- REGular EXpressions (REGEX)
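As an illustration of the REGEX technique suggested above, here is a minimal sketch that pulls e-mail addresses and phone numbers out of a signature block and merges them into an address book. The patterns assume North American phone formats, and the function names and the address book layout are hypothetical choices made for this example.

```python
import re

# Assumed patterns: simple addresses and 10-digit North American numbers.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")


def extract_contact_details(text):
    """Return the de-duplicated e-mail addresses and phone numbers in `text`."""
    emails = sorted({m.lower() for m in EMAIL_RE.findall(text)})
    # Keep digits only, and the last 10 of them, so formats compare equal.
    phones = sorted({re.sub(r"\D", "", m)[-10:] for m in PHONE_RE.findall(text)})
    return {"emails": emails, "phones": phones}


def add_to_address_book(book, name, text):
    """Merge the details found in `text` into the entry for `name`."""
    key = " ".join(name.lower().split())  # case and spacing insensitive key
    entry = book.setdefault(
        key, {"name": name.strip(), "emails": set(), "phones": set()}
    )
    details = extract_contact_details(text)
    entry["emails"].update(details["emails"])
    entry["phones"].update(details["phones"])
    return entry
```

Keying the book on the normalized name keeps the sketch short, but it would merge two different people who share a name. Dealing with homonymy needs extra context (company, job title, usual correspondents) on top of this.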

Step 2: Map the contact network within the dataset
Main Objective
Contact network and its analysis (edge weighting, recommendation engine, ...)
What to do
- Use the e-mail headers or implied metadata to link recipients, sender, cc…
- Weight the edges of the output graph according to the closeness between two members
- Try to build some part of the graph as a hierarchy (e.g. company org chart using job titles)
- Propose your own graph analysis, using for example frequency, how recent the last communication was…
- Display a visual representation of this network
- Propose a new contact recommendation engine.
Suggested Techniques
- NetworkX, Linkurious
- PageRank
- Graph mining
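One possible way to weight edges by frequency and recency, and to derive a simple contact recommendation from them, is sketched below using only the standard library. The exponential decay, the 180-day half-life, the scoring rule and the record layout ("from", "to", "cc", "date") are all assumptions chosen for illustration; other definitions of closeness are equally valid.

```python
from collections import defaultdict
from datetime import datetime
from itertools import chain


def build_weighted_edges(messages, now, half_life_days=180.0):
    """Undirected edge weights: each e-mail adds 0.5 ** (age / half_life)."""
    weights = defaultdict(float)
    for msg in messages:
        age_days = (now - msg["date"]).total_seconds() / 86400.0
        decay = 0.5 ** (max(age_days, 0.0) / half_life_days)
        for other in set(chain(msg["to"], msg["cc"])):
            if other != msg["from"]:
                weights[frozenset((msg["from"], other))] += decay
    return dict(weights)


def recommend_contacts(weights, person, top_n=3):
    """Rank people `person` has no edge with, via shared neighbours."""
    neighbours = defaultdict(dict)
    for pair, weight in weights.items():
        a, b = tuple(pair)
        neighbours[a][b] = weight
        neighbours[b][a] = weight
    scores = defaultdict(float)
    for friend, w1 in neighbours[person].items():
        for candidate, w2 in neighbours[friend].items():
            if candidate != person and candidate not in neighbours[person]:
                scores[candidate] += w1 * w2  # strong paths count more
    return sorted(scores, key=lambda c: (-scores[c], c))[:top_n]


# Tiny made-up sample; dates must be all naive or all timezone-aware.
now = datetime(2001, 12, 1)
sample = [
    {"from": "a@example.com", "to": ["b@example.com"], "cc": [], "date": now},
    {"from": "b@example.com", "to": ["c@example.com"], "cc": [], "date": now},
]
sample_weights = build_weighted_edges(sample, now)
print(recommend_contacts(sample_weights, "a@example.com"))
```

The resulting weights can then be loaded into a graph library such as NetworkX (for instance with Graph.add_edge(u, v, weight=w)) for PageRank, visualisation or further graph mining.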

Step 3: Enrich the contact network
Main Objective
Some enriched social networks (sentiment, interests, discussed topics, …)
What to do
- Continue enriching the social network by adding knowledge to the edges
- Try to identify interests shared by several members by scanning e-mail content
- Try to infer the sentiment people have toward one another from e-mail content and communication patterns
- Propose your own analysis…
Suggested Techniques
- TextBlob
- NLTK (Natural Language Toolkit)
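As a starting point for shared interests, the sketch below compares the most frequent terms used by two members and keeps the overlap, which could then be stored as an attribute on the edge between them. It is a deliberately naive, standard-library-only illustration: the stopword list, the term-frequency ranking and the function names are assumptions, and a real solution would more likely rely on NLTK or TextBlob for tokenisation, topic extraction and sentiment.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = frozenset(
    "and are for from has have our that the this was will with you your".split()
)


def top_terms(bodies, n=5):
    """Return the n most frequent non-stopword terms across `bodies`."""
    counts = Counter()
    for body in bodies:
        for token in re.findall(r"[a-z]{3,}", body.lower()):
            if token not in STOPWORDS:
                counts[token] += 1
    # Sort by descending count, then alphabetically for a stable result.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]]


def shared_interests(bodies_by_sender, a, b, n=5):
    """Terms that appear among the top n terms of both members."""
    terms_a = set(top_terms(bodies_by_sender[a], n))
    terms_b = set(top_terms(bodies_by_sender[b], n))
    return sorted(terms_a & terms_b)
```

Raw term frequency favours generic business vocabulary, so weighting schemes such as TF-IDF are a natural next step once the basic pipeline works.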

Academic papers and useful sites
References that you might find useful specifically when dealing with Step 1
A survey of named entity recognition and classification
D Nadeau, S Sekine - Lingvisticae Investigationes, 2007
Duplicate record detection: A survey
AK Elmagarmid, PG Ipeirotis… - IEEE Transactions on …, 2007
Open sourcing our email signature parsing library
http://blog.mailgun.com/open-sourcing-our-email-signature-parsing-library/
NetworkX
https://networkx.github.io/
Linkurious
https://linkurio.us/
Building a Recommendation Engine
https://neo4j.com/developer/guide-build-a-recommendation-engine/
Opinion mining and sentiment analysis
B Pang, L Lee - Foundations and trends in information retrieval, 2008
Sentiment analysis algorithms and applications: A survey
W Medhat, A Hassan, H Korashy - Ain Shams Engineering Journal, 2014

General guidelines
Philosophy
Open source material
You are encouraged to use open source material as much as you can in order to leverage the power of your own application.
Open mindedness
For each one of the three challenges we have defined what we consider to be the minimum that has to be done (flagged with the
color ). However, we encourage you to be as creative as you can be: do not hesitate to promote disruptive methods or to try to
expand as much as possible on the original subject.
Technical requirements
Program architecture
Think modules. Please remember that your code could eventually be used on various datasets that come in different formats, so it
has to interface easily on the input side as well as on the output side. For example, on the input side there should be a procedure
that transforms the data into a standard format, such as a list of dictionaries, and the main part of the algorithm should work only
on that standard format. This way, if we run the code on another dataset in the future, we just have to write a procedure that
produces the same standard format. The same goes for the output: the program has to produce an easily convertible
result.
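The input-side idea can be sketched as follows: one small adapter per dataset turns raw messages into the standard format, and the main algorithm only ever sees that format. The field names of the record and both function names are assumptions made for this example. The adapter relies on Python's standard email package and assumes each raw message is a plain, non-multipart RFC 822 style text.

```python
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime


def enron_to_record(raw_text):
    """Adapter: one raw message -> one standard-format dictionary."""
    msg = message_from_string(raw_text)

    def addresses(header):
        pairs = getaddresses(msg.get_all(header, []))
        return [addr.lower() for _, addr in pairs if addr]

    sender = addresses("From")
    date = msg.get("Date")
    return {
        "from": sender[0] if sender else None,
        "to": addresses("To"),
        "cc": addresses("Cc"),
        "date": parsedate_to_datetime(date) if date else None,
        "subject": msg.get("Subject", ""),
        "body": msg.get_payload(),  # a string for non-multipart messages
    }


def count_messages_per_sender(records):
    """Stand-in for the main algorithm: it only uses the standard format."""
    counts = {}
    for rec in records:
        counts[rec["from"]] = counts.get(rec["from"], 0) + 1
    return counts
```

Running the same analysis on another mail archive would then only require a new adapter that returns dictionaries with the same keys. The output side can follow the same pattern, for example by returning plain dictionaries and lists that convert easily to JSON or CSV.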
Best practices
In terms of coding guidelines, it would be sensible to follow the development guidelines that apply to Python, which can be found at:
https://readthedocs.org/projects/python-guide/downloads/pdf/latest/.
