This document discusses how web data can reveal information about employees, business partners, and persons of interest. It outlines the business case for using web data to conduct background checks and screenings. It also discusses challenges like collecting good data from various sources and analyzing large amounts of unstructured data. Advanced text analytics solutions that use entity resolution and relationship extraction are presented as helping to understand web data. The document concludes by describing how these techniques were applied in a project with Thorn to detect child sex trafficking online.
Employees, Business Partners and Bad Guys: What Web Data Reveals About Persons of Interest
1. Employees, Business Partners and Bad Guys:
What Web data reveals about persons of interest
Presenters: Gina Cerami, VP of Marketing, Connotate
Dave Danielson, VP of Marketing, Digital Reasoning
Claire Schmidt, Director of Programs,
Thorn: Digital Defenders of Children
(formerly DNA Foundation)
Date: November 28, 2012
2. Today’s Discussion
• What Web Data Reveals: The Fundamentals
The business case
Employee background check – business partner screening – persons of interest
• Collecting Good Data: Not That Easy
Where to start? Best practices
Differences in data sources – the automation process
• Analyzing Data: A Difficult Problem
Why advanced text analytics matters
Making sense of big data
• Automation and Advanced Analytics: A Powerful Combination
Background check accuracy enhanced with Entity Resolution
• Thorn: Working to End Child Sexual Exploitation
Combined solution applied to detecting child sex trafficking online
• Q&A
3. What Web Data Reveals:
The Fundamentals
4. The Business Case
news – blogs – social media
trillions of URLs
court records – registries – sanctions lists
5. What Web Data Reveals About Persons of Interest
Prospective Employees
• 3-minute screening using public records in 1,500 jurisdictions
• Eliminate human error
• Save time, money; hire right the first time
Business Partners
• Check sanctions lists
• Identify politically exposed persons
• Reduce business risk
• Avoid fines – comply with AML/KYC rules
Bad Guys
• Extract precise data from 10,000+ records on URLs linked to illegal activities
• Use advanced analytics to narrow the scope of investigations
Automated, precise data collection is key to success
6. The Cost of Not Knowing Your Employees
The cost of fraud in the workplace:
• $400 to 600 billion/year in the U.S. (Harvard)
• 5% revenue on average (Association of Fraud Examiners)
• $3.2B in Canada in 2011 (Certified General Accountants Assoc. of Canada)
The cost of re-hiring (not getting it right the first time)
• From $3.5K (U.S. average cost-per-hire) to $millions for CEOs
How does this happen?
• 50% of resumes have factual errors
• 1 in 5 job applications have a major lie or discrepancy (UK 2009 survey)
• Many background checks are manual (error prone) or incomplete
7. Solution: Comprehensive Search
Regular monitoring of all levels of government sites
• National, state, county and local
• If you outsource – make sure your screening service continually monitors
these sites for updates
• If you already do it yourself – consider automating the Web data collection
process to ensure accuracy and timeliness
Connotate’s software powers over
250,000 background checks per month;
3 million to date
8. The Cost of Not Knowing Business Partners
Recent Bank Secrecy Act (BSA) Penalties
• $1.2 B – Citibank, April 2012
• $7 M – Pacific National Bank, March 2011
BSA / Anti-Money Laundering (AML) Penalties
• $10.9 M – Ocean Bank (FL), August 2011
Reputation Risk – substantial
9. Solution: Comprehensive Search
Comprehensive searches by third-party services are
available for specific vertical industries
If you wish to conduct customized searches on a
regular basis, consider automated data collection
• Sanctions lists (Treasury.gov, ICE, EU Terrorism List, FBI Most Wanted,
OCC Shell Bank, etc.)
• PACER, national and state lists
• Social media may reveal that the person of interest is associating with others
on sanctions lists
13. Polling Question: Web Data Collection
Are you currently collecting background data
from the Web?
Yes – we are doing this using an automated process
Yes – however, we are collecting Web data using a manual process
No – we outsource background checks to a third-party service
14. An Overview of the Automation Process
Collect Data
Internal Sources
• Databases
• Interviews
• Resumes
External Sources
• Social Media
• Surface Web
• Hidden Web
• Secured Sites
Transform
• Structure
• Classify
• Prep for Analysis
Deliver
• Reports
• Dashboards
• Workflow
• BI Plug-ins
16. New Content Sources
Require Advanced Analytics
Collect Data
• Unstructured and Structured Data
• Variety of Sources
• Scalable
• Automated
Transform
• Remove Formatting
• Text Only
Advanced Analytics
• Resolving the Unique Individual
• Associating Time and Geographic data
• Fact/Assertion Extraction
• Relationship Identification and Extraction
Outputs
• Reports
• Dashboards
• Workflow
• BI Plug-ins
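The "Remove Formatting / Text Only" transform step can be illustrated with a minimal sketch. This is not Connotate's implementation; it only assumes the collected pages are HTML and uses Python's standard html.parser module. The names TextOnly and strip_formatting are hypothetical.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects visible text and ignores <script> and <style> content."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._parts.append(data)

    def text(self):
        # Join the fragments, then collapse runs of whitespace.
        return " ".join(" ".join(self._parts).split())


def strip_formatting(html):
    """Hypothetical helper: return the text-only version of an HTML page."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return parser.text()


print(strip_formatting("<p>Case <b>filed</b> in 2011.</p><script>x=1</script>"))
```

Joining fragments with a space keeps words from adjacent block elements apart, at the cost of occasionally splitting a word that is interrupted by inline markup.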
17. Synthesys Overview:
A software platform for making sense of big data
DISPARATE DATA
• News
• Web
• Email
• Research
• Instant Messages
Synthesys Platform: READ, RESOLVE, REASON
• READ: Deep processing of unstructured data
• RESOLVE: Assemble, organize, and relate
• REASON: Uncover relationships, compare & correlate
APPLICATION-READY ANALYTICS
• App Integration
• Events/Alarms
• Network Analysis
Analytic Primitives
• Natural Language Processing
• Extraction
• Geocoding
• Time normalization
• Entity Resolution
• Synonym Generation
• Similarity Algorithms
• Connectivity
Machine Learning
Distributed Processing (Hadoop MapReduce)
Distributed Storage (HBase, Cassandra, Cloudbase)
Synthesys reads, resolves and reasons about entities and relationships in space and time.
18. Other solutions are flawed
and don’t make automated understanding possible
Historically, the market has built tools to help find reading material
Search
Google, Fast, Autonomy, Recommind,
Lucene
Entity Extraction
Basis, Janya, Aerotext, Attensity,
SAP/Inxight, Lexalytics, SRA NetOwl
Comprehensive Ontologies
or Data Models
Clarabridge, Endeca, Expert Systems, IBM
Entity Analytics, Informatica
Other text analytics solutions still require the human to read to understand
19. Synthesys turns data into “knowledge objects”
Example text: “President Masayoshi Son wants to repeat the success he had while building Softbank into Japan’s third-largest wireless carrier. Son wants to take market share from entrenched giants and deliver more data to smartphones, tablets, cars and even bicycles.”
[Figure: the example text annotated with part-of-speech tags (NNP, VBZ, NN, ...) and phrase chunks (NP, VP, PP)]
Extracted knowledge objects:
• PERSON – PROPER NOUN: President Masayoshi Son
• LOCATION: Japan
• ORGANIZATION: Softbank
• PREDICATE: Built, To Take, To Deliver
• NOUN – ENTITY: Market Share, More Data
• ENTITY: Smartphones, Tablets
NLP Extraction stages: EOS, TOK, POS, CHUNK, NER, SREX
20. Resolution makes “Concept” or “Semantic”
understanding possible
Concept: California-based Apple
References/Mentions:
Apple
Apple inc
Apple, inc.
California-based Apple
Secretive Apple
iPhone inventor
Steve Jobs’s Company
AAPL
Technology Innovator Apple
Synthesys resolves multiple, varied mentions across the entire data set
as being part of the same concept based on their usage in context.
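The idea of resolving mentions by their usage in context can be shown with a toy sketch. This is not how Synthesys works internally; it is a simple greedy clustering over bag-of-words context vectors. The function names, the 0.3 threshold and the sample contexts are assumptions made for illustration.

```python
import re
from collections import Counter
from math import sqrt


def tokens(text):
    """Lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def resolve(mentions, threshold=0.3):
    """Greedy clustering: a (mention, context) pair joins the first concept
    whose accumulated context is similar enough, else starts a new concept."""
    clusters = []  # list of (context profile, member mentions)
    for mention, context in mentions:
        vec = Counter(tokens(mention + " " + context))
        for profile, members in clusters:
            if cosine(vec, profile) >= threshold:
                profile.update(vec)  # grow the concept's context profile
                members.append(mention)
                break
        else:
            clusters.append((vec, [mention]))
    return [members for _, members in clusters]


# Invented sample contexts, for illustration only.
MENTIONS = [
    ("Apple Inc", "cupertino iphone maker reports earnings"),
    ("AAPL", "iphone maker earnings beat cupertino forecasts"),
    ("apple", "fruit orchard harvest pie recipe"),
    ("iPhone inventor", "cupertino company earnings iphone"),
]

print(resolve(MENTIONS))
```

In this sample, "iPhone inventor" shares no name string with "Apple Inc" yet lands in the same concept because its context overlaps, while the fruit sense of "apple" stays separate.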
21. Synthesys is “Software that Learns”
new languages, patterns, categories, etc.
Supervised machine learning techniques and patent-pending workflows allow content
experts to train models and achieve quality improvement without any programming.
1. User uploads example of new document domain/language
2. Synthesys predicts annotation
3. Operator corrects annotation and adds categories
4. Completed annotation is submitted to server
5. Completed model training is submitted to Synthesys
22. Synthesys Powers Tools:
Providing a Common Global View
Leading Visualization Platforms
Platform (Data Organized & Application-Ready)
Relational Database Management System (RDBMS)
Data Sources: Structured Data, Unstructured Data
23. Polling Question: Data Analysis
Are you looking to use analytics on Web data to
resolve entities or understand relationships that might
help in background investigations?
Yes – we are analyzing Web data manually today
Yes – we are analyzing with text extractors or other text mining tools
Yes – we have a near-term project to analyze Web data
No – but we may have a need to analyze Web data in the future
No – we have no plans to analyze Web data
24. Web Data and Advanced Analytics:
A Powerful Combination
25. Employee Screening: A Delicate Balance
The cost of mistaken identity (incomplete screening)
• Class action suits have been filed over erroneous sex offender reporting
• Digital Reasoning’s Solution: Entity Resolution with Synthesys®
Positions of trust – Safe workplace – Right hire the first time
balanced against
Employee privacy – Libel: Impact = job loss – EEOC / FCRA
26. Business Partner Screening:
Avoiding Legal and Reputation Risk
• Do we have the right person? (Nicknames, misspellings, etc.)
• Do we know who is connected with this company?
• What about Foreign Language Sources?
Anti-Money Laundering – Political Corruption – Foreign Corrupt Practices Act – Terrorist Financing – Suspicious Activity Report
Increasingly, the sources of the information you need
are in unstructured web content
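The "right person" question (nicknames, misspellings) can be sketched with approximate string matching from Python's standard difflib module. This is only an illustration, not the presenters' method: the nickname table, the 0.85 cutoff and the watchlist names are all invented, and a production screening process would need far richer name handling.

```python
from difflib import SequenceMatcher

# Hypothetical, deliberately tiny nickname table.
NICKNAMES = {"bill": "william", "bob": "robert", "liz": "elizabeth"}


def normalize(name):
    """Lowercase, drop simple punctuation, expand known nicknames."""
    words = name.lower().replace(".", " ").replace(",", " ").split()
    return " ".join(NICKNAMES.get(w, w) for w in words)


def screen(candidate, watchlist, cutoff=0.85):
    """Return (listed_name, score) pairs at or above the cutoff, best first."""
    target = normalize(candidate)
    hits = []
    for listed in watchlist:
        score = SequenceMatcher(None, target, normalize(listed)).ratio()
        if score >= cutoff:
            hits.append((listed, round(score, 3)))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Fictional names, for illustration only.
WATCHLIST = ["William Harker", "John Smith", "Maria Delgado"]

print(screen("Bill Harker", WATCHLIST))  # nickname expanded before matching
print(screen("Jon Smith", WATCHLIST))    # misspelling tolerated by the cutoff
print(screen("Alice Wong", WATCHLIST))   # no hit
```

A lower cutoff catches more misspellings but raises the risk of mistaken identity, which is why string similarity alone is usually paired with additional attributes before anyone is flagged.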
28. Thorn Overview
Thorn’s focus: The role technology plays in crimes
involving the sexual exploitation of children.
Thorn’s goal: To disrupt and deflate predatory behavior
in the fight to end child sexual exploitation.
Thorn creates tools, policies and programs to bring
an end to illicit activities that could harm children.
Technology Task Force consists of over 25 top tech
companies that collaborate on technology initiatives to
fight child sexual exploitation.
Works closely with law enforcement, NGOs, private
sector and its Technology Task Force
Part of the White House’s Office of Science and
Technology’s commitment to end trafficking
Claire Schmidt is the Director of Programs for Thorn
29. The Challenge
• The explosive growth of online media has made it more
difficult to monitor and identify illicit activities, including
child sex trafficking
• Traditional analytics tools do a poor job of monitoring these
forms of online media
• Data is “messy” and unstructured
• Data is often false
• Real age is difficult to determine from online data content
• Law enforcement has few, if any, tools to combat this
problem
30. Combating Sex Trafficking: Project Overview
• Thorn desired to determine the feasibility of using advanced text
analytics to detect child sex trafficking in online media.
• Connotate built a process to automatically download data from
selected websites.
• Digital Reasoning developed analytics to detect potential child sex
trafficking activity within the collected data.
Widely Varied Data Sources → Data Aggregation and Cleansing → Analytics, resolution, and pattern matching → Analytics results, reports, charts
31. Project Methodology
Interview Law Enforcement Experts
• Interviewed Law Enforcement officials and determined three major focal
points for automated understanding
Isolate and Map Semantic Features
• Interview results were mapped into semantic features (“signatures”)
Develop models for use in Synthesys
• Analytic models were created by training Synthesys on the semantic
signatures
Identify sources of Internet data
• Sources were then configured in Connotate for automated collection, cleansing and
transformation of data
32. Key Innovative Developments
• Accurate telephone number extractor
• Unique profiles for people posting ads
• Analytic assessment models for text
• Achieved High Level of Accuracy
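The slides do not describe how the telephone number extractor works. As a rough sketch of the general problem, the example below pulls ten-digit numbers out of free text, including numbers with digits spelled out as words. The ten-digit, US-style assumption, the word table and the function name extract_phones are all illustrative choices, not the project's design.

```python
import re

# Spelled-out digits sometimes used to obscure numbers in ad text (assumed list).
WORD_DIGITS = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}

_WORD_RE = re.compile(r"\b(" + "|".join(WORD_DIGITS) + r")\b", re.IGNORECASE)

# Exactly ten digits, each optionally wrapped in parentheses and followed by
# spaces, dots or dashes; not embedded in a longer run of digits.
_PHONE_RE = re.compile(r"(?<!\d)(?:\(?\d\)?[\s.\-]*){10}(?!\d)")


def extract_phones(text):
    """Return unique ten-digit numbers found in text, in order of appearance."""
    digits_text = _WORD_RE.sub(lambda m: WORD_DIGITS[m.group(1).lower()], text)
    found = []
    for match in _PHONE_RE.finditer(digits_text):
        number = re.sub(r"\D", "", match.group())
        if number not in found:
            found.append(number)
    return found


print(extract_phones("Call (732) 555-0142 now"))
print(extract_phones("text seven three two 555 oh one four two"))
```

This sketch ignores country codes and will miss numbers written with a leading 1; a real extractor would also have to cope with deliberate obfuscation beyond spelled-out digits.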
33. Web Data Collection and Advanced Analytics
Collect Data
• Unstructured and Structured Data
• Variety of Sources
• Scalable
• Automated
Transform
• Remove Formatting
• Text Only
Advanced Analytics
Outputs
• Reports
• Dashboards
• Workflow
• BI Plug-ins
Connotate provides precise quality
data, formatted for delivery to your
analysis tools
Digital Reasoning applies advanced
analytics to resolve identities, enrich
data and develop unique profiles of
individuals targeted for investigation
34. Web Data Can Reveal Insights of
Tremendous Value
• Valid insights require precise, quality data
• Avoid mistaken identity with entity resolution
• Automation is the key to extracting precise, quality data
• Obtain a deeper understanding of partner operations and key relationships
35. Q & A
Connotate will email a link to this presentation as well as a
copy of the slides to you within 2 business days.
If you would like to use an advanced Web data collection solution
to support background checks of employees or business
partners in-house, please call (+1) 732-296-8844 or visit
www.connotate.com or www.connotate.co.uk
For more information about law enforcement applications and
advanced analytics, please visit www.digitalreasoning.com.
36. Thank You
If you have an immediate need and would like us to contact
you about a forthcoming project, please check the appropriate
box in the last polling question or call (+1) 732-296-8844.
For more information, visit
www.connotate.com or www.connotate.co.uk
and
www.digitalreasoning.com