© Hortonworks Inc. 2015
PageRank for Anomaly Detection
Hadoop Summit – SF Data Mining Meetup
San Jose, 2015
Ofer Mendelevitch, Hortonworks
About Us
Ofer Mendelevitch
Director, Data Science @ Hortonworks
Previously: Nor1, Yahoo!, Risk Insight, Quiver
Blog: http://hortonworks.com/blog/author/ofermend/

Joint work with Jiwon Seo
Ph.D. Candidate @ Stanford
Software Engineer @ Pinterest
Designed SociaLite (w/ Professor Monica Lam)
What is this talk about?
• Why is fraud detection important in healthcare?
• The Medicare-B dataset
• Our approach: Similarity and PageRank
• Implementation: Apache Pig and SociaLite
• Some Results
Fraud prevention is important in healthcare
Recovery rates are still low, e.g., 3-4%
Source: https://fullfact.org/wp-content/uploads/2014/03/The-Financial-Cost-of-Healthcare-Fraud-Report-2014-11.3.14a.pdf
[Bar chart: Healthcare Expenditures (Billions), US vs. EU, split into Fraud and Non-fraud. Data labels: US $2,270 and $171; EU $940 and $71.]
Example fraud cases in healthcare…
• A doctor billing too often for the most expensive office visits
http://www.dallasnews.com/investigations/20140515-medicare-data-reveals-unusual-billing-patterns-by-nearly-80-texas-doctors-medical-practitioners.ece
• Medical supply stores paid off local doctors to prescribe motorized wheelchairs worth $7500 but instead provided scooters worth $1500
http://blog.operasolutions.com/bid/388511/Data-Science-As-the-Panacea-for-Healthcare-Fraud-Waste-and-Abuse
What are some fraud patterns?
• Billing for services that were not actually performed
• Performing unnecessary services
• Using stolen patient IDs to submit claims
• Unbundling: billing each stage of a procedure as if it is performed separately
• Upcoding: billing for more expensive services than were actually performed
• Billing cosmetic surgeries as necessary repairs
• Etc…
Most healthcare providers have some type of system in place to identify such fraud
• Rules based:
– Business rules catch known fraud patterns
• Machine-learning based:
– Automated learning catches difficult to characterize fraud patterns
• What are “good features” in the model that increase the accuracy?
– Claim features, e.g., total amount
– Provider features, e.g., total payment last year
– Patient features, e.g., current set of diagnoses
Why PageRank for fraud detection?
• Most approaches apply supervised learning
– Graph algorithms not as widely used
• The main idea:
– Produce new “features” for the existing model
– Specifically, a score per provider reflecting its degree of anomaly relative to a medical specialty
Our Dataset
• Medicare-B – real world public healthcare dataset
– Released by CMS (US Centers for Medicare and Medicaid Services) in 2014
– Includes provider payment information for 2012
– 9.5M records; 880K+ providers; 5616 CPT (procedure) codes
• We will only use 4 fields:
– NPI: provider ID
– Specialty: e.g. Internal Medicine, Dentist, etc.
– CPT code: medical procedure code
– Count: # of procedures performed (normalized)
Example rows from the dataset
1003000126 ENKESHAFI ARDALAN M.D. M I 900 SETON DR CUMBERLAND 215021854 MD US Internal Medicine Y F99222 Initial hospital care 115 112 115 135.25 0 199 0 108.11565217 0.9005883395
1003000126 ENKESHAFI ARDALAN M.D. M I 900 SETON DR CUMBERLAND 215021854 MD US Internal Medicine Y F99223 Initial hospital care 93 88 93 198.59 0 291 9.5916630466 158.87 0
1003000134 CIBULL THOMAS L M.D. M I 2650 RIDGE AVE EVANSTON HOSPITAL EVANSTON 602011718 IL US Pathology Y F88304 Tissue exam by pathologist 226 207 209 11.64 0 115 0 8.9804424779 1.7203407716
1003000134 CIBULL THOMAS L M.D. M I 2650 RIDGE AVE EVANSTON HOSPITAL EVANSTON 602011718 IL US Pathology Y F88305 Tissue exam by pathologist 6070 3624 4416 37.729960461 0.0012569747 170 0 28.984504119 5.6268316462
1003000134 CIBULL THOMAS L M.D. M I 2650 RIDGE AVE EVANSTON HOSPITAL EVANSTON 602011718 IL US Pathology Y F88311 Decalcify tissue 13 13 13 12.7 0 39 0 7.8153846154 4.2806624494
We use only 4 fields: NPI, specialty, CPT code and count:
1003000126, Internal Medicine, Initial hospital care (F99222), 115
1003000126, Internal Medicine, Initial hospital care (F99223), 88
1003000134, Pathology, Tissue exam by pathologist (F88304), 209
1003000134, Pathology, Tissue exam by pathologist (F88305), 4416
1003000134, Pathology, Decalcify tissue (F88311), 13
Our approach – the steps
• Step 1: Data Preparation/cleansing
• Step 2: Compute similarities, build graph
• Step 3: Compute PageRank, identify anomalies
	
  
Step 1: Data cleansing
Raw rows:
1003000126 ENKESHAFI ARDALAN M.D. M I 900 SETON DR CUMBERLAND 215021854 MD US Internal Medicine Y F99222 Initial hospital care 115 112 115 135.25 0 199 0 108.11565217 0.9005883395
1003000126 ENKESHAFI ARDALAN M.D. M I 900 SETON DR CUMBERLAND 215021854 MD US Internal Medicine Y F99223 Initial hospital care 93 88 93 198.59 0 291 9.5916630466 158.87 0
1003000134 CIBULL THOMAS L M.D. M I 2650 RIDGE AVE EVANSTON HOSPITAL EVANSTON 602011718 IL US Pathology Y F88304 Tissue exam by pathologist 226 207 209 11.64 0 115 0 8.9804424779 1.7203407716

Filter columns, data cleansing:
10030126 Internal Medicine Initial care(F99222) 115
10030126 Internal Medicine Initial care(F99223) 88
10030134 Pathology Tissue exam(F88304) 209
• Extract needed data fields from dataset
– NPI (National Provider ID), Specialty, CPT (procedure) code, count
– For count, we chose: “bene_day_srvc_cnt” (number of distinct Medicare beneficiary per day services)
• Re-compute “specialty” due to data quality issues
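A minimal Python sketch of the column-filtering step, assuming a tab-delimited input with a header row. Only bene_day_srvc_cnt is named on the slide; the other column names (npi, provider_type, hcpcs_code) and the helper extract_fields are assumptions made for illustration. The deck itself implements this step in Apache Pig.

```python
import csv
import io

# Column names other than bene_day_srvc_cnt are assumed for illustration.
KEEP = ("npi", "provider_type", "hcpcs_code", "bene_day_srvc_cnt")

def extract_fields(lines):
    """Yield (npi, specialty, cpt_code, count) tuples from tab-delimited rows."""
    reader = csv.DictReader(lines, delimiter="\t")
    for row in reader:
        npi, specialty, cpt, count = (row[name].strip() for name in KEEP)
        if not (npi and specialty and cpt and count):
            continue  # skip rows with a missing field
        yield npi, specialty, cpt, float(count)

# Two made-up rows in the assumed layout.
sample = io.StringIO(
    "npi\tprovider_type\thcpcs_code\tline_srvc_cnt\tbene_day_srvc_cnt\n"
    "1003000126\tInternal Medicine\t99222\t115\t115\n"
    "1003000134\tPathology\t88304\t226\t209\n"
)
rows = list(extract_fields(sample))
```

Reading by column name rather than by position keeps the sketch independent of the exact column order in the file.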
Specialty Lookup: NPI and NUCC datasets
• Problem:
– Some “specialty” values are inaccurate or not specific enough
• Solution: pre-processing step
– NPI data: maps NPI to specialty code
– NUCC data: maps specialty code to taxonomy
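The two-step lookup described above can be sketched as a pair of dictionary lookups. This is an illustration only: the function name lookup_specialty, the placeholder codes CODE_A and CODE_B, and the in-memory dictionaries are hypothetical stand-ins for the real NPI and NUCC files.

```python
def lookup_specialty(npi, npi_to_code, code_to_taxonomy, default=None):
    """Resolve an NPI to a specialty name via its specialty code."""
    code = npi_to_code.get(npi)
    if code is None:
        return default
    return code_to_taxonomy.get(code, default)

# Hypothetical stand-ins for the NPI and NUCC lookup tables.
npi_to_code = {"1003000126": "CODE_A", "1003000134": "CODE_B"}
code_to_taxonomy = {"CODE_A": "Internal Medicine", "CODE_B": "Pathology"}

specialty = lookup_specialty("1003000126", npi_to_code, code_to_taxonomy)
missing = lookup_specialty("0000000000", npi_to_code, code_to_taxonomy)
```

Returning a default for unknown NPIs lets the caller decide whether to drop the provider or fall back to the specialty value in the original dataset.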
  
	
  
Step 2: build graph by similarities
10030126 Internal Medicine Initial care(F99222) 115
10030126 Internal Medicine Initial care(F99223) 88
10030134 Pathology Tissue exam(F88304) 209

• Two providers are “similar” if they have the same “procedure code patterns”
• We use “Cosine Similarity”
– Each provider represented as vector of 5949 CPT codes
	
  
Example: similar providers
• NPI1
CPT    93042  99283  99284  99285  99291
Count  280    29     265    410    28

• NPI2
CPT    99283  99284  99285  99291
Count  118    151    270    37

CPT    Description
93042  Rhythm Ecg report
99283  Emergency dept visit (1)
99284  Emergency dept visit (2)
99285  Emergency dept visit (3)
99291  Critical care first hour
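A short Python sketch of cosine similarity on sparse vectors, applied to the two example providers. It assumes the first count table belongs to NPI1 and the second to NPI2, and it uses the raw counts shown on the slide; the deck says its counts are normalized, so the value the actual pipeline computes may differ. The function name cosine_similarity is chosen here for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors stored as {cpt_code: count}."""
    dot = sum(count * b[cpt] for cpt, count in a.items() if cpt in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_a * norm_b)

# Counts copied from the two example providers.
npi1 = {"93042": 280, "99283": 29, "99284": 265, "99285": 410, "99291": 28}
npi2 = {"99283": 118, "99284": 151, "99285": 270, "99291": 37}

similarity = cosine_similarity(npi1, npi2)
```

Storing only the non-zero CPT counts keeps each dot product proportional to the number of codes a provider actually bills, not to the full vector length.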
  
Computing similarity at large scale…
• Number of providers: ~880,000
• 880K * 880K = 774,400,000,000 similarity computations
• Each one a “dot product” between vectors of length 5949 (but sparse)
How do we address scalability?
• Our Implementation:
– Heuristics:
– Only compute similarity between NPI1 and NPI2 if they share their most important CPT codes
– Filter out NPIs with fewer than 3 CPT codes
– Use Apache Pig on a Hadoop cluster (with UDFs) to compute in parallel
• Alternatives:
– DIMSUM (MapReduce or Spark)
– Locality Sensitive Hashing (DataFu)
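The two heuristics above amount to a blocking step: providers are bucketed by their top CPT codes and only providers in the same bucket are compared. The Python sketch below is one possible reading, not the deck's code. The function names, the parameter k (how many top codes count as "most important"), and the toy vectors are all assumptions.

```python
from collections import defaultdict
from itertools import combinations

def top_cpts(vec, k=1):
    """Return the k CPT codes with the largest counts (k is an assumed knob)."""
    return sorted(vec, key=lambda cpt: (-vec[cpt], cpt))[:k]

def candidate_pairs(vectors, k=1, min_codes=3):
    """Return NPI pairs that share at least one top CPT code."""
    buckets = defaultdict(list)
    for npi, vec in vectors.items():
        if len(vec) < min_codes:
            continue  # drop providers with fewer than min_codes CPT codes
        for cpt in top_cpts(vec, k):
            buckets[cpt].append(npi)
    pairs = set()
    for npis in buckets.values():
        pairs.update(combinations(sorted(npis), 2))
    return pairs

# Toy vectors; provider names are placeholders.
vectors = {
    "A": {"99283": 10, "99284": 50, "99285": 5},
    "B": {"99284": 40, "99285": 8, "99291": 2},
    "C": {"88304": 90, "88305": 30, "88311": 1},
    "D": {"99284": 70, "99285": 1},  # only two codes, filtered out
}
pairs = candidate_pairs(vectors)
```

Only the pairs returned here would go on to the full similarity computation, which is what cuts the quadratic comparison count down to something tractable.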
  
PIG code: compute similarity
-- Build one sparse (cpt_inx, count) vector per provider.
GRP = group DATA by npi parallel 10;
PTS = foreach GRP generate group as npi, DATA.(cpt_inx, count) as cpt_vec;
-- Emit one row per provider and top CPT code (top_cpt is a UDF).
PTS_TOP = foreach PTS generate npi, cpt_vec, FLATTEN(udfs.top_cpt(cpt_vec)) as (cpt_inx: int, count: int);
PTS_TOP_CPT = foreach PTS_TOP generate npi, cpt_vec, cpt_inx;
-- Cluster providers by shared top CPT code and number the clusters.
CPT_CLUST = foreach (group PTS_TOP_CPT by cpt_inx parallel 10) generate PTS_TOP_CPT.(npi, cpt_vec) as clust_bag;
RANKED = RANK CPT_CLUST;
ID_WITH_CLUST = foreach RANKED generate $0 as clust_id, clust_bag;
-- Data skew: break large clusters into smaller bags, then shuffle them across reducers.
ID_WITH_SMALL_CLUST = foreach ID_WITH_CLUST generate clust_id, FLATTEN(udfs.breakLargeBag(clust_bag, 2000)) as clust_bag;
ID_WITH_SMALL_CLUST_RAND = foreach ID_WITH_SMALL_CLUST generate clust_id, clust_bag, RANDOM() as r;
ID_WITH_SMALL_CLUST_SHUF = foreach (GROUP ID_WITH_SMALL_CLUST_RAND by r parallel 240)
    generate FLATTEN($1) as (clust_id, clust_bag, r);
-- Pair each provider with the bags of its cluster using a replicated (map-side) join.
NPI_AND_CLUST_ID = foreach ID_WITH_CLUST generate FLATTEN(clust_bag) as (npi: int, cpt_vec), clust_id;
CLUST_JOINED = join ID_WITH_SMALL_CLUST_SHUF by clust_id, NPI_AND_CLUST_ID by clust_id using 'replicated';
-- The similarNpi UDF emits the NPIs it considers similar (0.85 is presumably the similarity threshold).
PAIRS = foreach CLUST_JOINED generate npi as npi1, FLATTEN(udfs.similarNpi(npi, cpt_vec, clust_bag, 0.85)) as npi2;
OUT = distinct PAIRS parallel 20;
Things to highlight:
• Using “replicated” joins (map-side joins) where possible
• Handling Data Skew
• Using Python UDFs to compute similarity, break large bags, etc.
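The slides do not show the UDF bodies. As an illustration of the skew-handling idea, here is a hypothetical pure-Python equivalent of a "break large bag" helper; the name break_large_bag and its exact behavior (consecutive chunks of at most max_size items) are assumptions, not the code behind udfs.breakLargeBag.

```python
def break_large_bag(items, max_size):
    """Split a list into consecutive chunks of at most max_size items."""
    if max_size <= 0:
        raise ValueError("max_size must be positive")
    return [items[i:i + max_size] for i in range(0, len(items), max_size)]

chunks = break_large_bag(list(range(5)), 2)
```

Capping the bag size bounds the work any single task does for one popular CPT code, at the cost of comparing each provider against several smaller bags.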
Step 3: Personalized PageRank
Run Personalized PageRank with SociaLite
• Compute specialty-centric “Personalized PageRank” for each node (provider)
• Anomaly candidate: high score but wrong specialty
[Figure: provider graph with a score on each node, ranging from 0.002 to 0.3]
PageRank – a quick overview
•  Random walk over the graph
•  Start from any (randomly selected) node
•  At each step, walker can:
–  Move to an adjacent node (probability d = 80%)
–  Randomly jump (or “teleport”) to any node in the graph (probability 1-d = 20%)
All doctor names are fictitious
Dr. Miller 9.6%
Dr. Jones 9.6%
Dr. Lam 9.6%
Dr. Ng 12.5%
Dr. Cheng 6.7%
Dr. Das 12.0%
Dr. Seo 9.2%
Dr. Page 6.6%
Dr. Ortega 12.0%
Dr. Padian 12.1%
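The random-walk description can be simulated directly: the fraction of steps the walker spends on a node estimates that node's PageRank. This is a Monte Carlo sketch for intuition, not the method the deck uses to compute scores. The toy graph, the function name simulate_walk, and the fixed seed are assumptions; d = 0.8 follows the slide.

```python
import random

def simulate_walk(graph, steps, d=0.8, seed=0):
    """Estimate PageRank as the fraction of steps a random walker spends on each node."""
    rng = random.Random(seed)
    nodes = sorted(graph)
    visits = dict.fromkeys(nodes, 0)
    current = rng.choice(nodes)  # start from a randomly selected node
    for _ in range(steps):
        visits[current] += 1
        neighbors = graph[current]
        if neighbors and rng.random() < d:
            current = rng.choice(neighbors)  # follow an edge (probability d)
        else:
            current = rng.choice(nodes)      # teleport (probability 1 - d)
    return {node: count / steps for node, count in visits.items()}

# Toy directed graph; node names are placeholders.
graph = {"a": ["h"], "b": ["h"], "c": ["h"], "h": ["a"]}
scores = simulate_walk(graph, steps=200_000)
```

In this toy graph three nodes point at "h", so the walker spends the largest share of its time there.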
  
Personalized PageRank
Focused on a given specialty
•  Random walk over the graph
•  Start from any random node IN THE SPECIALTY GROUP
•  At each step, walker can:
–  Move to an adjacent node (probability d = 80%)
–  Randomly jump (or “teleport”) to any node OF THE GIVEN SPECIALTY GROUP (probability 1-d = 20%)
All doctor names are fictitious
Dr. Miller 1.6%
Dr. Jones 1.6%
Dr. Lam 1.6%
Dr. Ng 3.3%
Dr. Cheng 4.6%
Dr. Das 20.7%
Dr. Seo 15.7%
Dr. Page 11.8%
Dr. Ortega 20.7%
Dr. Padian 18.2%
  
Personalized PageRank
Focused on a given specialty
•  Random walk over the graph
•  Start from any random node IN THE SPECIALTY GROUP
•  At each step, walker can:
–  Move to an adjacent node (probability d = 80%)
–  Randomly jump (or “teleport”) to any node OF THE GIVEN SPECIALTY GROUP (probability 1-d = 20%)
All doctor names are fictitious
Dr. Miller 20.1%
Dr. Jones 20.1%
Dr. Lam 20.1%
Dr. Ng 23.2%
Dr. Cheng 5.8%
Dr. Das 2.2%
Dr. Seo 1.7%
Dr. Page 0.9%
Dr. Ortega 2.2%
Dr. Padian 3.9%
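Once specialty-centric scores exist, the "high score but wrong specialty" rule can be sketched as a simple filter. Everything specific here is hypothetical: the provider IDs, the scores, the threshold of 0.05, and the function name anomaly_candidates. The deck does not state how it picks a cutoff.

```python
def anomaly_candidates(scores, specialty_of, target, threshold):
    """Return providers outside `target` whose personalized score is >= threshold."""
    flagged = [
        (npi, score)
        for npi, score in scores.items()
        if specialty_of.get(npi) != target and score >= threshold
    ]
    # Highest-scoring outsiders first.
    return sorted(flagged, key=lambda item: item[1], reverse=True)

# Hypothetical scores from a PageRank run personalized to "Ophthalmology".
scores = {"p1": 0.21, "p2": 0.18, "p3": 0.12, "p4": 0.01}
specialty_of = {
    "p1": "Ophthalmology",
    "p2": "Ophthalmology",
    "p3": "Internal Medicine",
    "p4": "Internal Medicine",
}

candidates = anomaly_candidates(scores, specialty_of, "Ophthalmology", threshold=0.05)
```

A flagged provider is a candidate for review, not a confirmed case: a high score only says the billing pattern resembles the target specialty.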
  
Personalized PageRank with SociaLite
`Rank(int npi:0..$MAX_NPI_ID, int i:iter, float rank).`
`Rank(source_npi, 0, pr) :- Source(source_npi), pr=1.0f/$N.`
for i in range(10):
    `Rank(node, $i+1, $sum(pr)) :- Source(node), pr = 0.2f*1.0f/$N ;
                               :- Rank(src, $i, pr1), pr1>1e-8, EdgeCnt(src, cnt),
                                  pr = 0.8f*pr1/cnt, Graph(src, node).`
Initialize PageRank value of source providers to be 1/N
In each iteration:
•  Teleport to source providers (w/ probability 0.2) ;
•  Random walk to one of neighbors (w/ probability 0.8)
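For readers unfamiliar with SociaLite, the same update can be written in plain Python. This is a sketch that mirrors the rules above as read from the slide, not a translation checked against SociaLite's semantics. It assumes $N is the number of source providers, that the graph is given as adjacency lists, and that nodes without outgoing edges simply pass nothing on. Passing every node as a source gives ordinary (non-personalized) PageRank.

```python
def personalized_pagerank(graph, sources, d=0.8, iterations=10, eps=1e-8):
    """Iterate the teleport-plus-walk update for a fixed number of rounds."""
    n = len(sources)  # assumed meaning of $N: number of source providers
    rank = {node: 1.0 / n for node in sources}
    for _ in range(iterations):
        nxt = {node: (1.0 - d) / n for node in sources}  # teleport share
        for src, pr in rank.items():
            neighbors = graph.get(src, [])
            if pr <= eps or not neighbors:
                continue  # skip negligible mass and nodes without out-edges
            share = d * pr / len(neighbors)
            for node in neighbors:
                nxt[node] = nxt.get(node, 0.0) + share
        rank = nxt
    return rank

# Toy undirected graph stored as adjacency lists; names are placeholders.
graph = {
    "s1": ["s2", "x"],
    "s2": ["s1"],
    "x": ["s1", "y"],
    "y": ["x"],
}
rank = personalized_pagerank(graph, sources={"s1", "s2"})
```

Because teleports land only on source nodes, score decays with distance from the specialty group: "x" and "y" receive rank only through edges leading out of it.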
What’s so cool about SociaLite?
• PageRank in 3 lines of code
• Python integration
• You don’t have to “think like a node”. Declarative
language – “looks like” the formula
Using the PageRank scores?
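[Diagram: a Claim (Provider, Patient, Amount, Date/time, etc.), together with patient information, provider information, etc., passes through “Generate Features”. The output is Feature 1 … Feature N plus PR Feature 1 … PR Feature M derived from the PageRank Scores. These features feed the Rules and the Fraud Model, which produce a Decision.]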
Example result #1: Ophthalmology
Found internist with high score, but these CPT codes:
•  Internal eye photography
•  Cmptr ophth img optic nerve
•  Echo exam of eye thickness
•  Cptr ophth dx img post segmt
•  Revise eyelashes
•  Ophthalmic biometry
•  Eye exam new patient
•  Eye exam established pat
•  After cataract laser surgery
•  Eye exam & treatment
•  Eye exam with photos
•  Cataract surg w/iol 1 stage
•  Visual field examination(s)
Example result #2: Plastic Surgery
Found Otolaryngologist with high score, but these CPT codes:
•  Skin tissue rearrangement (multiple variants)
•  Biopsy skin lesion
Thank you!
Any Questions?
Ofer Mendelevitch, ofer@hortonworks.com, @ofermend

Code available here:
https://github.com/ofermend/medicare-demo/

Blog post series:
http://hortonworks.com/blog/using-pagerank-detect-anomalies-fraud-healthcare/

Weitere ähnliche Inhalte

Ähnlich wie Page rank for anomaly detection

PP for HHA providers
PP for HHA providersPP for HHA providers
PP for HHA providersbootchalk
 
Using the Right Metrics to Improve Physician Practice Management
Using the Right Metrics to Improve Physician Practice ManagementUsing the Right Metrics to Improve Physician Practice Management
Using the Right Metrics to Improve Physician Practice ManagementWarren E. Corprew, Jr. MBA CMA CHFP
 
Health IT Summit in Seattle 2014 – “Think Big, Act Small” with Deborah Dahl, ...
Health IT Summit in Seattle 2014 – “Think Big, Act Small” with Deborah Dahl, ...Health IT Summit in Seattle 2014 – “Think Big, Act Small” with Deborah Dahl, ...
Health IT Summit in Seattle 2014 – “Think Big, Act Small” with Deborah Dahl, ...Health IT Conference – iHT2
 
ICD-10 Testing
ICD-10 TestingICD-10 Testing
ICD-10 TestingQualitest
 
The Analytics Opportunity in Healthcare
The Analytics Opportunity in HealthcareThe Analytics Opportunity in Healthcare
The Analytics Opportunity in HealthcareDATA360US
 
October 2014 ICD-10 Open Line Friday
October 2014 ICD-10 Open Line FridayOctober 2014 ICD-10 Open Line Friday
October 2014 ICD-10 Open Line FridayFlorida Blue
 
From the Archives, 2008:Clinical and Economic Advantages Implantable Defibril...
From the Archives, 2008:Clinical and Economic Advantages Implantable Defibril...From the Archives, 2008:Clinical and Economic Advantages Implantable Defibril...
From the Archives, 2008:Clinical and Economic Advantages Implantable Defibril...David Lee Scher, MD
 
Post-Market-Surveillance_Lesotho-Case-Study_11Oct0217.pptx
Post-Market-Surveillance_Lesotho-Case-Study_11Oct0217.pptxPost-Market-Surveillance_Lesotho-Case-Study_11Oct0217.pptx
Post-Market-Surveillance_Lesotho-Case-Study_11Oct0217.pptxtribowofauzan
 
Post-Market-Surveillance_Lesotho-Case-Study_11Oct0217.pptx
Post-Market-Surveillance_Lesotho-Case-Study_11Oct0217.pptxPost-Market-Surveillance_Lesotho-Case-Study_11Oct0217.pptx
Post-Market-Surveillance_Lesotho-Case-Study_11Oct0217.pptxbiruktesfaye27
 
New Ways to Improve Hospital Flow with Predictive Analytics
New Ways to Improve Hospital Flow with Predictive AnalyticsNew Ways to Improve Hospital Flow with Predictive Analytics
New Ways to Improve Hospital Flow with Predictive AnalyticsHealth Catalyst
 
2009 Partners In Hope
2009 Partners In Hope2009 Partners In Hope
2009 Partners In Hopebobjay
 
2012 to 2013 Australian Hospital Digital Scanning Survey
2012 to 2013 Australian Hospital Digital Scanning Survey2012 to 2013 Australian Hospital Digital Scanning Survey
2012 to 2013 Australian Hospital Digital Scanning Surveysquareearth
 
Financing Healthcare (Part 2) Lecture A
Financing Healthcare (Part 2) Lecture AFinancing Healthcare (Part 2) Lecture A
Financing Healthcare (Part 2) Lecture ACMDLearning
 
Focus on Post Acute Care: Lower Costs, Fewer Readmissions, Happier Patients (...
Focus on Post Acute Care: Lower Costs, Fewer Readmissions, Happier Patients (...Focus on Post Acute Care: Lower Costs, Fewer Readmissions, Happier Patients (...
Focus on Post Acute Care: Lower Costs, Fewer Readmissions, Happier Patients (...U.S. News Healthcare of Tomorrow
 
freipresleyecqmsnurse16
freipresleyecqmsnurse16freipresleyecqmsnurse16
freipresleyecqmsnurse16Bill Presley
 
McGrath Health Data Analyst SXSW
McGrath Health Data Analyst SXSWMcGrath Health Data Analyst SXSW
McGrath Health Data Analyst SXSWRobert McGrath
 
Health language siemens presentation
Health language siemens presentationHealth language siemens presentation
Health language siemens presentationChris Cummins
 

Ähnlich wie Page rank for anomaly detection (20)

PP for HHA providers
PP for HHA providersPP for HHA providers
PP for HHA providers
 
Using the Right Metrics to Improve Physician Practice Management
Using the Right Metrics to Improve Physician Practice ManagementUsing the Right Metrics to Improve Physician Practice Management
Using the Right Metrics to Improve Physician Practice Management
 
Health IT Summit in Seattle 2014 – “Think Big, Act Small” with Deborah Dahl, ...
Health IT Summit in Seattle 2014 – “Think Big, Act Small” with Deborah Dahl, ...Health IT Summit in Seattle 2014 – “Think Big, Act Small” with Deborah Dahl, ...
Health IT Summit in Seattle 2014 – “Think Big, Act Small” with Deborah Dahl, ...
 
ICD-10 Testing
ICD-10 TestingICD-10 Testing
ICD-10 Testing
 
The Analytics Opportunity in Healthcare
The Analytics Opportunity in HealthcareThe Analytics Opportunity in Healthcare
The Analytics Opportunity in Healthcare
 
Tim Pletcher Presentation
Tim Pletcher PresentationTim Pletcher Presentation
Tim Pletcher Presentation
 
Tim Pletcher Presentation
Tim Pletcher PresentationTim Pletcher Presentation
Tim Pletcher Presentation
 
October 2014 ICD-10 Open Line Friday
October 2014 ICD-10 Open Line FridayOctober 2014 ICD-10 Open Line Friday
October 2014 ICD-10 Open Line Friday
 
From the Archives, 2008:Clinical and Economic Advantages Implantable Defibril...
From the Archives, 2008:Clinical and Economic Advantages Implantable Defibril...From the Archives, 2008:Clinical and Economic Advantages Implantable Defibril...
From the Archives, 2008:Clinical and Economic Advantages Implantable Defibril...
 
Post-Market-Surveillance_Lesotho-Case-Study_11Oct0217.pptx
Post-Market-Surveillance_Lesotho-Case-Study_11Oct0217.pptxPost-Market-Surveillance_Lesotho-Case-Study_11Oct0217.pptx
Post-Market-Surveillance_Lesotho-Case-Study_11Oct0217.pptx
 
Post-Market-Surveillance_Lesotho-Case-Study_11Oct0217.pptx
Post-Market-Surveillance_Lesotho-Case-Study_11Oct0217.pptxPost-Market-Surveillance_Lesotho-Case-Study_11Oct0217.pptx
Post-Market-Surveillance_Lesotho-Case-Study_11Oct0217.pptx
 
New Ways to Improve Hospital Flow with Predictive Analytics
New Ways to Improve Hospital Flow with Predictive AnalyticsNew Ways to Improve Hospital Flow with Predictive Analytics
New Ways to Improve Hospital Flow with Predictive Analytics
 
2009 Partners In Hope
2009 Partners In Hope2009 Partners In Hope
2009 Partners In Hope
 
2012 to 2013 Australian Hospital Digital Scanning Survey
2012 to 2013 Australian Hospital Digital Scanning Survey2012 to 2013 Australian Hospital Digital Scanning Survey
2012 to 2013 Australian Hospital Digital Scanning Survey
 
Financing Healthcare (Part 2) Lecture A
Financing Healthcare (Part 2) Lecture AFinancing Healthcare (Part 2) Lecture A
Financing Healthcare (Part 2) Lecture A
 
Focus on Post Acute Care: Lower Costs, Fewer Readmissions, Happier Patients (...
Focus on Post Acute Care: Lower Costs, Fewer Readmissions, Happier Patients (...Focus on Post Acute Care: Lower Costs, Fewer Readmissions, Happier Patients (...
Focus on Post Acute Care: Lower Costs, Fewer Readmissions, Happier Patients (...
 
freipresleyecqmsnurse16
freipresleyecqmsnurse16freipresleyecqmsnurse16
freipresleyecqmsnurse16
 
Medical Claims Compliance
Medical Claims ComplianceMedical Claims Compliance
Medical Claims Compliance
 
McGrath Health Data Analyst SXSW
McGrath Health Data Analyst SXSWMcGrath Health Data Analyst SXSW
McGrath Health Data Analyst SXSW
 
Health language siemens presentation
Health language siemens presentationHealth language siemens presentation
Health language siemens presentation
 

Kürzlich hochgeladen

定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一ffjhghh
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改atducpo
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxMohammedJunaid861692
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 

Kürzlich hochgeladen (20)

定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
B2 Creative Industry Response Evaluation.docx

Page rank for anomaly detection

  • 1. © Hortonworks Inc. 2015. PageRank for Anomaly Detection. Hadoop Summit – SF Data Mining Meetup, San Jose, 2015. Ofer Mendelevitch, Hortonworks
  • 2. About Us • Ofer Mendelevitch: Director, Data Science @ Hortonworks. Previously: Nor1, Yahoo!, Risk Insight, Quiver. Blog: http://hortonworks.com/blog/author/ofermend/ • Joint work with Jiwon Seo: Ph.D. Candidate @ Stanford; Software Engineer @ Pinterest; designed SociaLite (w/ Professor Monica Lam)
  • 3. What is this talk about? • Why is fraud detection important in healthcare? • The Medicare-B dataset • Our approach: Similarity and PageRank • Implementation: Apache Pig and SociaLite • Some Results
  • 4. Fraud prevention is important in healthcare. Recovery rates are still low, e.g., 3-4%. Source: https://fullfact.org/wp-content/uploads/2014/03/The-Financial-Cost-of-Healthcare-Fraud-Report-2014-11.3.14a.pdf [Chart: Healthcare Expenditures (Billions), split into Fraud and Non fraud. US: $2,270 and $171; EU: $940 and $71.]
  • 5. Example fraud cases in healthcare… • A doctor billing too often for the most expensive office visits: http://www.dallasnews.com/investigations/20140515-medicare-data-reveals-unusual-billing-patterns-by-nearly-80-texas-doctors-medical-practitioners.ece • Medical supply stores paid off local doctors to prescribe motorized wheelchairs worth $7500 but instead provided scooters worth $1500: http://blog.operasolutions.com/bid/388511/Data-Science-As-the-Panacea-for-Healthcare-Fraud-Waste-and-Abuse
  • 6. What are some fraud patterns? • Billing for services that were not actually performed • Performing unnecessary services • Using stolen patient IDs to submit claims • Unbundling: billing each stage of a procedure as if it were performed separately • Upcoding: billing for more expensive services than were actually performed • Billing cosmetic surgeries as necessary repairs • Etc…
  • 7. Most healthcare providers have some type of system in place to identify such fraud • Rules based: – Business rules catch known fraud patterns • Machine-learning based: – Automated learning catches difficult-to-characterize fraud patterns • What are “good features” in the model that increase the accuracy? – Claim features, e.g., total amount – Provider features, e.g., total payment last year – Patient features, e.g., current set of diagnoses
  • 8. Why PageRank for fraud detection? • Most approaches apply supervised learning – Graph algorithms are not as widely used • The main idea: – Produce new “features” for the existing model – Specifically, a score per provider reflecting its degree of anomaly relative to a medical specialty
  • 9. Our Dataset • Medicare-B: a real-world public healthcare dataset – Released by CMS (US Centers for Medicare and Medicaid Services) in 2014 – Includes provider payment information for 2012 – 9.5M records; 880K+ providers; 5616 CPT (procedure) codes • We will only use 4 fields: – NPI: provider ID – Specialty: e.g., Internal Medicine, Dentist, etc. – CPT code: medical procedure code – Count: # of procedures performed (normalized)
  • 10. Example rows from the dataset
      1003000126 ENKESHAFI ARDALAN M.D. M I 900 SETON DR CUMBERLAND 215021854 MD US Internal Medicine Y F99222 Initial hospital care 115 112 115 135.25 0 199 0 108.11565217 0.9005883395
      1003000126 ENKESHAFI ARDALAN M.D. M I 900 SETON DR CUMBERLAND 215021854 MD US Internal Medicine Y F99223 Initial hospital care 93 88 93 198.59 0 291 9.5916630466 158.87 0
      1003000134 CIBULL THOMAS L M.D. M I 2650 RIDGE AVE EVANSTON HOSPITAL EVANSTON 602011718 IL US Pathology Y F88304 Tissue exam by pathologist 226 207 209 11.64 0 115 0 8.9804424779 1.7203407716
      1003000134 CIBULL THOMAS L M.D. M I 2650 RIDGE AVE EVANSTON HOSPITAL EVANSTON 602011718 IL US Pathology Y F88305 Tissue exam by pathologist 6070 3624 4416 37.729960461 0.0012569747 170 0 28.984504119 5.6268316462
      1003000134 CIBULL THOMAS L M.D. M I 2650 RIDGE AVE EVANSTON HOSPITAL EVANSTON 602011718 IL US Pathology Y F88311 Decalcify tissue 13 13 13 12.7 0 39 0 7.8153846154 4.2806624494
      We use only 4 fields: NPI, specialty, CPT code and count:
      1003000126, Internal Medicine, Initial hospital care (F99222), 115
      1003000126, Internal Medicine, Initial hospital care (F99223), 88
      1003000134, Pathology, Tissue exam by pathologist (F88304), 209
      1003000134, Pathology, Tissue exam by pathologist (F88305), 4416
      1003000134, Pathology, Decalcify tissue (F88311), 13
  • 11. Our approach – the steps • Step 1: Data preparation/cleansing • Step 2: Compute similarities, build graph • Step 3: Compute PageRank, identify anomalies
  • 12. Step 1: Data cleansing
      Raw rows:
      1003000126 ENKESHAFI ARDALAN M.D. M I 900 SETON DR CUMBERLAND 215021854 MD US Internal Medicine Y F99222 Initial hospital care 115 112 115 135.25 0 199 0 108.11565217 0.9005883395
      1003000126 ENKESHAFI ARDALAN M.D. M I 900 SETON DR CUMBERLAND 215021854 MD US Internal Medicine Y F99223 Initial hospital care 93 88 93 198.59 0 291 9.5916630466 158.87 0
      1003000134 CIBULL THOMAS L M.D. M I 2650 RIDGE AVE EVANSTON HOSPITAL EVANSTON 602011718 IL US Pathology Y F88304 Tissue exam by pathologist 226 207 209 11.64 0 115 0 8.9804424779 1.7203407716
      After filtering columns and data cleansing:
      10030126 Internal Medicine Initial care(F99222) 115
      10030126 Internal Medicine Initial care(F99223) 88
      10030134 Pathology Tissue exam(F88304) 209
      • Extract needed data fields from dataset – NPI (National Provider ID), Specialty, CPT (procedure) code, count – For count, we chose “bene_day_srvc_cnt” (number of distinct Medicare beneficiary per day services) • Re-compute “specialty” due to data quality issues
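The column-filtering step can be sketched in Python as follows. This is a minimal illustration, not the Pig code used in the talk. It assumes a tab-separated file with a header row; the column names `npi`, `provider_type` and `hcpcs_code` are assumptions (only `bene_day_srvc_cnt` is named on the slide), and `extract_fields` is a hypothetical helper.

```python
import csv
import io

# Assumed column names; only bene_day_srvc_cnt is named in the slides.
FIELDS = ("npi", "provider_type", "hcpcs_code", "bene_day_srvc_cnt")

def extract_fields(tsv_file):
    """Yield (npi, specialty, cpt_code, count) from a tab-separated file with a header."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    for row in reader:
        npi, specialty, cpt, count = (row[name].strip() for name in FIELDS)
        if not (npi and cpt and count):
            continue  # skip rows that are missing a key field
        yield npi, specialty, cpt, float(count)

# Tiny in-memory sample standing in for the real file (values taken from the slide).
sample = io.StringIO(
    "npi\tprovider_type\thcpcs_code\tline_srvc_cnt\tbene_day_srvc_cnt\n"
    "1003000126\tInternal Medicine\t99222\t115\t115\n"
    "1003000134\tPathology\t88304\t226\t209\n"
)
rows = list(extract_fields(sample))
```

Reading by column name rather than by position keeps the sketch robust to rows where optional fields (such as an organization name) are empty.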
  • 13. Specialty Lookup: NPI and NUCC datasets • Problem: – Some “specialty” values are inaccurate or not specific enough • Solution: pre-processing step – NPI data: maps NPI to specialty code – NUCC data: maps specialty code to taxonomy
  • 14. Step 2: build graph by similarities • Two providers are “similar” if they have the same “procedure code patterns” • We use “Cosine Similarity” – Each provider is represented as a vector of 5949 CPT codes
      Input rows:
      10030126 Internal Medicine Initial care(F99222) 115
      10030126 Internal Medicine Initial care(F99223) 88
      10030134 Pathology Tissue exam(F88304) 209
  • 15. Example: similar providers
      NPI1   CPT:    93042  99283  99284  99285  99291
             Count:    280     29    265    410     28
      NPI2   CPT:    99283  99284  99285  99291
             Count:    118    151    270     37
      CPT descriptions:
      93042  Rhythm ECG report
      99283  Emergency dept visit (1)
      99284  Emergency dept visit (2)
      99285  Emergency dept visit (3)
      99291  Critical care first hour
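As a worked illustration of cosine similarity on the two example providers, here is a minimal Python sketch. It stores each provider as a sparse dictionary of CPT code to count and assumes the first count table on the slide belongs to NPI1 and the second to NPI2. The function name is illustrative and is not the UDF used in the talk.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as {cpt_code: count}."""
    # Only CPT codes present in both vectors contribute to the dot product.
    dot = sum(count * b[cpt] for cpt, count in a.items() if cpt in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Counts copied from the example slide.
npi1 = {"93042": 280, "99283": 29, "99284": 265, "99285": 410, "99291": 28}
npi2 = {"99283": 118, "99284": 151, "99285": 270, "99291": 37}

similarity = cosine_similarity(npi1, npi2)
print(round(similarity, 3))
```

Because the vectors are sparse, the cost of each comparison depends on the number of codes a provider actually bills, not on the full vector length.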
  • 16. Computing similarity at large scale… • Number of providers: ~880,000 • 880K * 880K = 774,400,000,000 similarity computations • Each one is a “dot product” between vectors of length 5949 (but sparse)
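The size of the all-pairs problem can be checked with a few lines of Python. The unordered-pair count is an addition not on the slide: since cosine similarity is symmetric, each pair only needs to be computed once.

```python
n = 880_000  # approximate number of providers

ordered_pairs = n * n               # every provider against every provider
unordered_pairs = n * (n - 1) // 2  # each distinct pair once, no self-pairs

print(f"{ordered_pairs:,}")    # 774,400,000,000
print(f"{unordered_pairs:,}")  # 387,199,560,000
```

Even the halved count is far too large for a brute-force pass, which motivates the candidate-filtering heuristics.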
  • 17. How do we address scalability? • Our implementation: – Heuristics: – Only compute similarity between NPI1 and NPI2 if they share their most important CPT codes – Filter out NPIs with fewer than 3 CPT codes – Use Apache Pig on a Hadoop cluster (with UDFs) to compute in parallel • Alternatives: – DIM-SUM (map-reduce or Spark) – Locality Sensitive Hashing (DataFu)
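The two heuristics can be sketched in plain Python as a blocking step that generates candidate pairs before any similarity is computed. This is an illustration only, not the Pig implementation from the talk. It assumes "most important" means the CPT code(s) with the highest count; `candidate_pairs`, `min_codes` and `top_k` are hypothetical names.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(vectors, min_codes=3, top_k=1):
    """Return provider pairs that share at least one of their top-k CPT codes.

    vectors: {npi: {cpt_code: count}}
    """
    blocks = defaultdict(set)
    for npi, vec in vectors.items():
        if len(vec) < min_codes:
            continue  # heuristic: drop providers with fewer than min_codes CPT codes
        # Assumption: "most important" = highest count.
        top = sorted(vec, key=vec.get, reverse=True)[:top_k]
        for cpt in top:
            blocks[cpt].add(npi)
    pairs = set()
    for members in blocks.values():
        # Only providers in the same block are ever compared.
        pairs.update(combinations(sorted(members), 2))
    return pairs

vectors = {
    "A": {"99285": 410, "99284": 265, "99283": 29},
    "B": {"99285": 270, "99284": 151, "99283": 118},
    "C": {"88305": 4416, "88304": 209, "88311": 13},
    "D": {"99285": 5, "99284": 1},  # only 2 codes, filtered out
}
pairs = candidate_pairs(vectors)
```

In this toy input only A and B land in the same block, so a single similarity computation is needed instead of six.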
  • 18. PIG code: compute similarity
      -- Build one sparse CPT-count vector per provider (NPI).
      GRP = group DATA by npi parallel 10;
      PTS = foreach GRP generate group as npi, DATA.(cpt_inx, count) as cpt_vec;
      -- Cluster providers by the top CPT code(s) returned by the top_cpt UDF.
      PTS_TOP = foreach PTS generate npi, cpt_vec, FLATTEN(udfs.top_cpt(cpt_vec)) as (cpt_inx: int, count: int);
      PTS_TOP_CPT = foreach PTS_TOP generate npi, cpt_vec, cpt_inx;
      CPT_CLUST = foreach (group PTS_TOP_CPT by cpt_inx parallel 10) generate PTS_TOP_CPT.(npi, cpt_vec) as clust_bag;
      -- Give each cluster an ID, break up large clusters, and shuffle to reduce skew.
      RANKED = RANK CPT_CLUST;
      ID_WITH_CLUST = foreach RANKED generate $0 as clust_id, clust_bag;
      ID_WITH_SMALL_CLUST = foreach ID_WITH_CLUST generate clust_id, FLATTEN(udfs.breakLargeBag(clust_bag, 2000)) as clust_bag;
      ID_WITH_SMALL_CLUST_RAND = foreach ID_WITH_SMALL_CLUST generate clust_id, clust_bag, RANDOM() as r;
      ID_WITH_SMALL_CLUST_SHUF = foreach (GROUP ID_WITH_SMALL_CLUST_RAND by r parallel 240) generate FLATTEN($1) as (clust_id, clust_bag, r);
      -- Join each provider with its cluster and emit the pairs found by similarNpi (threshold 0.85).
      NPI_AND_CLUST_ID = foreach ID_WITH_CLUST generate FLATTEN(clust_bag) as (npi: int, cpt_vec), clust_id;
      CLUST_JOINED = join ID_WITH_SMALL_CLUST_SHUF by clust_id, NPI_AND_CLUST_ID by clust_id using 'replicated';
      PAIRS = foreach CLUST_JOINED generate npi as npi1, FLATTEN(udfs.similarNpi(npi, cpt_vec, clust_bag, 0.85)) as npi2;
      OUT = distinct PAIRS parallel 20;
      Things to highlight: • Using “replicated” joins (map-side joins) where possible • Handling data skew • Using Python UDFs to compute similarity, break large bags, etc.
  • 19. Step 3: Personalized PageRank. Run Personalized PageRank with SociaLite • Compute specialty-centric “Personalized PageRank” for each node (provider) • Anomaly candidate: high score but wrong specialty [Figure: example provider graph with a Personalized PageRank score on each node]
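The "high score but wrong specialty" rule can be sketched as a simple filter over the PageRank output. This is a hypothetical illustration: the function name, the `top_n` cutoff, and the sample scores and provider IDs are all made up, and the talk does not say how the cutoff was chosen.

```python
def anomaly_candidates(scores, specialty_of, target_specialty, top_n=3):
    """Return top-scoring providers whose listed specialty differs from the target.

    scores: {npi: personalized PageRank score for target_specialty}
    specialty_of: {npi: listed specialty}
    """
    ranked = sorted(scores, key=scores.get, reverse=True)[:top_n]
    return [npi for npi in ranked if specialty_of[npi] != target_specialty]

# Made-up scores for a run personalized on Ophthalmology.
scores = {"p1": 0.30, "p2": 0.20, "p3": 0.15, "p4": 0.02}
specialty_of = {
    "p1": "Ophthalmology",
    "p2": "Internal Medicine",
    "p3": "Ophthalmology",
    "p4": "Internal Medicine",
}
flagged = anomaly_candidates(scores, specialty_of, "Ophthalmology")
```

Here `p2` is flagged because it scores among the top providers for the specialty while being listed under a different one; `p4` is ignored because its score is low.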
  • 20. PageRank – a quick overview • Random walk over the graph • Start from any (randomly selected) node • At each step, the walker can: – Move to an adjacent node (probability d = 80%) – Randomly jump (or “teleport”) to any node in the graph (probability 1-d = 20%) [Figure: example graph with PageRank scores; all doctor names are fictitious. Dr. Miller 9.6%, Dr. Jones 9.6%, Dr. Lam 9.6%, Dr. Ng 12.5%, Dr. Cheng 6.7%, Dr. Das 12.0%, Dr. Seo 9.2%, Dr. Page 6.6%, Dr. Ortega 12.0%, Dr. Padian 12.1%]
  • 21. Personalized PageRank: focused on a given specialty • Random walk over the graph • Start from any random node IN THE SPECIALTY GROUP • At each step, the walker can: – Move to an adjacent node (probability d = 80%) – Randomly jump (or “teleport”) to any node OF THE GIVEN SPECIALTY GROUP (probability 1-d = 20%) [Figure: example graph with Personalized PageRank scores; all doctor names are fictitious. Dr. Miller 1.6%, Dr. Jones 1.6%, Dr. Lam 1.6%, Dr. Ng 3.3%, Dr. Cheng 4.6%, Dr. Das 20.7%, Dr. Seo 15.7%, Dr. Page 11.8%, Dr. Ortega 20.7%, Dr. Padian 18.2%]
  • 22. Personalized PageRank: focused on a given specialty • Random walk over the graph • Start from any random node IN THE SPECIALTY GROUP • At each step, the walker can: – Move to an adjacent node (probability d = 80%) – Randomly jump (or “teleport”) to any node OF THE GIVEN SPECIALTY GROUP (probability 1-d = 20%) [Figure: the same example graph scored for a different specialty group; all doctor names are fictitious. Dr. Miller 20.1%, Dr. Jones 20.1%, Dr. Lam 20.1%, Dr. Ng 23.2%, Dr. Cheng 5.8%, Dr. Das 2.2%, Dr. Seo 1.7%, Dr. Page 0.9%, Dr. Ortega 2.2%, Dr. Padian 3.9%]
  • 23. Personalized PageRank with SociaLite
      `Rank(int npi:0..$MAX_NPI_ID, int i:iter, float rank).`
      `Rank(source_npi, 0, pr) :- Source(source_npi), pr=1.0f/$N.`
      for i in range(10):
          `Rank(node, $i+1, $sum(pr)) :- Source(node), pr = 0.2f*1.0f/$N ;
                                      :- Rank(src, $i, pr1), pr1>1e-8, EdgeCnt(src, cnt), pr = 0.8f*pr1/cnt, Graph(src, node).`
      Initialize the PageRank value of source providers to 1/N. In each iteration: • Teleport to source providers (w/ probability 0.2); • Random walk to one of the neighbors (w/ probability 0.8)
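For readers unfamiliar with SociaLite, the same iteration can be written as a plain-Python sketch. It mirrors the two rules above under stated assumptions: `N` is taken to be the number of source (specialty) providers, the graph is an adjacency-list dictionary, and the function and variable names are illustrative rather than part of the talk's code.

```python
def personalized_pagerank(graph, sources, d=0.8, iterations=10):
    """Personalized PageRank by fixed-count iteration.

    graph: {node: [neighbor, ...]}; sources: nodes of the chosen specialty.
    """
    n = len(sources)  # assumption: N is the number of source providers
    rank = {s: 1.0 / n for s in sources}  # iteration 0: mass only on sources
    for _ in range(iterations):
        # Teleport rule: every source node receives (1 - d) / N.
        new_rank = {s: (1.0 - d) / n for s in sources}
        # Random-walk rule: each node passes d * rank / out_degree to each neighbor.
        for src, pr in rank.items():
            neighbors = graph.get(src, [])
            if pr <= 1e-8 or not neighbors:
                continue  # same cutoff as pr1 > 1e-8 in the SociaLite rule
            share = d * pr / len(neighbors)
            for node in neighbors:
                new_rank[node] = new_rank.get(node, 0.0) + share
        rank = new_rank
    return rank

# Toy undirected chain a - b - c - d, personalized on {a, b}.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
rank = personalized_pagerank(graph, sources=["a", "b"])
```

In the toy chain, nodes close to the source set keep most of the score, while `d`, two hops away from any source, receives the least.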
  • 24. What’s so cool about SociaLite? • PageRank in 3 lines of code • Python integration • You don’t have to “think like a node”. Declarative language – “looks like” the formula
  • 25. Using the PageRank scores? [Diagram: a Claim (Provider, Patient, Amount, Date/time, etc.), Patient information, Provider information, etc. feed a “Generate Features” step that outputs Feature 1, Feature 2, … Feature N. PageRank Scores supply PR Feature 1, PR Feature 2, … PR Feature M. Rules and the Fraud Model produce the Decision.]
  • 26. Example result #1: Ophthalmology. Found an internist with a high score, but these CPT codes: • Internal eye photography • Cmptr ophth img optic nerve • Echo exam of eye thickness • Cptr ophth dx img post segmt • Revise eyelashes • Ophthalmic biometry • Eye exam new patient • Eye exam established pat • After cataract laser surgery • Eye exam & treatment • Eye exam with photos • Cataract surg w/iol 1 stage • Visual field examination(s)
  • 27. Example result #2: Plastic Surgery. Found an otolaryngologist with a high score, but these CPT codes: • Skin tissue rearrangement (multiple variants) • Biopsy skin lesion
  • 28. Thank you! Any questions? Ofer Mendelevitch, ofer@hortonworks.com, @ofermend. Code available here: https://github.com/ofermend/medicare-demo/ Blog post series: http://hortonworks.com/blog/using-pagerank-detect-anomalies-fraud-healthcare/