Process Mining for ERP Systems
1. Erik Nooijen, Boudewijn v. Dongen, Dirk Fahland
2. Process Discovery
event log → process discovery algorithm → process model

event log:
c1: A B C D E
c2: A C B D E
c3: A F D E
…

assumptions
• case = sequence of events of this case
• cases are isolated:
  event A in c1 happens only in c1 (and not in c2)
• cases of the same process
• one unique case id
• each event associated to exactly one case id
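The assumptions above amount to grouping a flat list of events by one case id. A minimal sketch, assuming events arrive as (case id, activity, position) tuples; the tuple layout and the name `build_log` are illustrative and not taken from the slides:

```python
from collections import defaultdict

# Hypothetical flat event list for the three example cases: (case_id, activity, position).
events = [
    ("c1", "A", 1), ("c1", "B", 2), ("c1", "C", 3), ("c1", "D", 4), ("c1", "E", 5),
    ("c2", "A", 1), ("c2", "C", 2), ("c2", "B", 3), ("c2", "D", 4), ("c2", "E", 5),
    ("c3", "A", 1), ("c3", "F", 2), ("c3", "D", 3), ("c3", "E", 4),
]

def build_log(events):
    """Group events by their single case id and order them within each case."""
    cases = defaultdict(list)
    for case_id, activity, position in events:
        cases[case_id].append((position, activity))
    # Each case becomes one trace: the activities sorted by position.
    return {cid: [activity for _, activity in sorted(evs)] for cid, evs in cases.items()}

log = build_log(events)
```

Because every event carries exactly one case id, each event ends up in exactly one trace; this is the property that relational ERP data does not guarantee.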
3. Typical Process in an ERP System
[Figure: build-to-order scenario. Customers Alice and Bob order products X and Y from the Manufacturer, which orders Materials A, B and C from the suppliers ACME Inc. and Mega Corp.]
4. n-to-m relations → database
database → process discovery algorithm → process model
(callouts in the figure: id attributes, time-stamp attributes, data attributes, relations)

ProductOrder
poID  cust.  …  created      processed    built        shipped
po1   Alice  …  30-08 9:22   30-08 13:12  01-09 15:12  03-09 10:15
po2   Bob    …  30-08 10:15  30-08 13:14  01-09 16:13  03-09 17:18

Customer
cust.  address  …
Alice  …        …
Bob    …        …

OrderedMaterial
poID  moID  type  added
po1   mo3   B     30-08 13:13
po1   mo4   A     30-08 13:14
po2   mo3   B     30-08 13:15
po2   mo4   C     30-08 13:16

MaterialOrder
moID  suppl.  …  completed    sent         received
mo3   ACME    …  30-08 13:15  30-08 14:15  01-09 9:05
mo4   MEGA    …  30-08 13:17  30-08 16:12  01-09 10:13
5. Process Discovery for ERP Systems
database → process discovery algorithm → process model

reality: data in a relational DB
• events stored as time-stamped attributes in tables
• multiple primary keys → multiple notions of case
• tables are related → one event related to multiple cases

schema (class diagram; associations carry the multiplicities 0..*, 1 and 1..*):
Customer: cust, …
ProductOrder: poID, cust, created, processed, built, shipped
OrderedMat.: poID, moID, type, added
MaterialOrder: moID, supplier, completed, sent, received
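To make "one event related to multiple cases" concrete, the sketch below uses the OrderedMaterial rows of the running example. Each `added` timestamp is a single event, yet it belongs to a ProductOrder case and to a MaterialOrder case. The function names are illustrative assumptions, not part of the presented approach:

```python
# OrderedMaterial rows from the running example: (poID, moID, type, added).
ordered_material = [
    ("po1", "mo3", "B", "30-08 13:13"),
    ("po1", "mo4", "A", "30-08 13:14"),
    ("po2", "mo3", "B", "30-08 13:15"),
    ("po2", "mo4", "C", "30-08 13:16"),
]

def cases_of_event(row):
    """Return the case id of one 'added' event under each notion of case."""
    po_id, mo_id, _type, _added = row
    return {"ProductOrder": po_id, "MaterialOrder": mo_id}

def events_per_case(rows, notion):
    """Collect the 'added' timestamps per case for one chosen notion of case."""
    index = 0 if notion == "ProductOrder" else 1
    grouped = {}
    for row in rows:
        grouped.setdefault(row[index], []).append(row[3])
    return grouped

by_product_order = events_per_case(ordered_material, "ProductOrder")
by_material_order = events_per_case(ordered_material, "MaterialOrder")
```

Grouping by poID and grouping by moID yield two different logs over the same four events, so a classical discovery algorithm that expects exactly one case id per event cannot be applied directly.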
7. Outline
[Figure: database → decompose by primary keys → logs for PO and MO → discovery → models for PO and MO → related by primary/foreign-key relations → process model]
8. Find Artifact Schemas
[Figure: database → decompose by primary keys → logs for PO and MO → discovery → models for PO and MO → related by primary/foreign-key relations → process model]
9. Step 0: discover database schema
▪ documented schema vs. actual schema → identify
  • column types (esp. time-stamped columns)
  • primary keys
  • foreign keys
▪ various (non-trivial) techniques available
▪ key discovery is NP-complete in the size of the table(s)
▪ result: [figure: the discovered schema]
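To illustrate why key discovery is expensive, here is a brute-force sketch that tests every column subset for uniqueness. It is not one of the techniques the slide alludes to; the name `candidate_keys` and the tuple-based table layout are assumptions made for illustration:

```python
from itertools import combinations

def candidate_keys(columns, rows):
    """Return all minimal column sets whose value combinations are unique across rows."""
    keys = []
    for size in range(1, len(columns) + 1):
        for combo in combinations(range(len(columns)), size):
            # Supersets of an already found key are unique too, but not minimal.
            if any(set(key) <= set(combo) for key in keys):
                continue
            projected = [tuple(row[i] for i in combo) for row in rows]
            if len(set(projected)) == len(rows):
                keys.append(combo)
    return [tuple(columns[i] for i in key) for key in keys]

# OrderedMaterial rows from the running example.
columns = ["poID", "moID", "type", "added"]
rows = [
    ("po1", "mo3", "B", "30-08 13:13"),
    ("po1", "mo4", "A", "30-08 13:14"),
    ("po2", "mo3", "B", "30-08 13:15"),
    ("po2", "mo4", "C", "30-08 13:16"),
]
keys = candidate_keys(columns, rows)
```

The number of column subsets grows exponentially with the number of columns. On these four rows the search also reports accidental keys such as the `added` timestamp, which happens to be unique here, so the output is a list of candidates rather than a definitive answer.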
10. Step 1: decompose schema into processes
= schema summarization
find:
1. sets of corresponding tables
2. links between those
[Figure: schema decomposed into the table sets ProductOrder and MaterialOrder]
11. Automatic Schema Summarization
= group similar tables through clustering
▪ define a distance between any 2 tables
  • by relations
  • by information content
▪ tables that are close to each other → same cluster
▪ # of clusters: user input
12. Automatic Schema Summarization
1. structural distance between tables
fanout ~ avg. # of child records related to the same parent record

parent table A has the records 1 and 2; three example child tables (columns A, B):
child records              fanout
(1,X) (2,Y)                1
(1,X) (1,Y)                1 = (2+0)/2
(1,X) (1,Y) (2,Z) (2,U)    2
13. Automatic Schema Summarization
1. structural distance between tables
fanout ~ avg. # of child records related to the same parent record
matched fraction ~ 1 / (fraction of records in parent with matching child record)

parent table A has the records 1 and 2; three example child tables (columns A, B):
child records              fanout   m.fr
(1,X) (2,Y)                1        1
(1,X) (1,Y)                1        2 = 1/(1/2)
(1,X) (1,Y) (2,Z) (2,U)    2        1
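The two structural measures can be sketched directly from their definitions. The function names and the representation of a child table as a list of foreign-key values are assumptions for illustration; parent keys are assumed to be unique:

```python
def fanout(parent_keys, child_fk_values):
    """Average number of child records per parent record (parents without children count as 0)."""
    parents = set(parent_keys)
    related = sum(1 for value in child_fk_values if value in parents)
    return related / len(parents)

def matched_fraction(parent_keys, child_fk_values):
    """1 / (fraction of parent records that have at least one matching child record)."""
    parents = set(parent_keys)
    matched = parents & set(child_fk_values)
    if not matched:
        return float("inf")  # no parent record has a child: treated as infinitely far apart
    return len(parents) / len(matched)

parent = [1, 2]                 # records of parent table A
one_child_each = [1, 2]         # child rows (1,X) (2,Y)
only_first_parent = [1, 1]      # child rows (1,X) (1,Y)
two_children_each = [1, 1, 2, 2]  # child rows (1,X) (1,Y) (2,Z) (2,U)
```

For the three example child tables this gives fanout 1, 1 and 2 and matched fraction 1, 2 and 1, matching the values on the slide.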
14. Grouping by Clustering
1. structural distance
2. information distance
   importance of each table = entropy (is maximal if all records are different)
   distance: 2 tables with high entropies → large distance
3. weighted distance by structure + information
4. k-means clustering: k clusters based on weighted distance

most important table of cluster = table with least distance to all
→ key attribute of the cluster
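A minimal sketch of two ingredients named above: the entropy of a table and the choice of a cluster's most important table. The slide does not give the exact formulas, so Shannon entropy over whole records and a summed pairwise distance are assumptions, and the distance values are made up for illustration:

```python
import math
from collections import Counter

def table_entropy(records):
    """Shannon entropy (in bits) of the distribution of distinct records in a table."""
    counts = Counter(records)
    total = len(records)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def most_important_table(cluster, distance):
    """Pick the table with the least summed distance to all other tables of the cluster."""
    return min(cluster, key=lambda t: sum(distance[t][u] for u in cluster if u != t))

# Hypothetical weighted distances between three tables of one cluster.
distance = {
    "ProductOrder": {"Customer": 0.2, "OrderedMaterial": 0.3},
    "Customer": {"ProductOrder": 0.2, "OrderedMaterial": 0.9},
    "OrderedMaterial": {"ProductOrder": 0.3, "Customer": 0.9},
}
center = most_important_table(["ProductOrder", "Customer", "OrderedMaterial"], distance)
```

With four records, the entropy reaches its maximum of 2 bits when all four differ and drops to 0 when they are identical, which is the behaviour the slide describes.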
15. Artifact Schema → Artifact Log
[Figure: database → decompose by primary keys → logs for PO and MO → discovery → models for PO and MO → related by primary/foreign-key relations → process model]
16. Log Extraction
cluster = set of related tables + primary key of most important table

log for PO (case id: poID)
poID  cust.  …  created      processed    built        shipped
po1   Alice  …  30-08 9:22   30-08 13:12  01-09 15:12  03-09 10:15
po2   Bob    …  30-08 10:15  30-08 13:14  01-09 16:13  03-09 17:18

poID  moID  type  added
po1   mo3   B     30-08 13:13
po1   mo4   A     30-08 13:14
po2   mo3   B     30-08 13:15
po2   mo4   C     30-08 13:16

po1:
po2:
20. Log Extraction
cluster = set of related tables + primary key of most important table
time-stamped attribute → event
related attributes → event attributes

log for PO (case id: poID)
poID  cust.  …  created      processed    built        shipped
po1   Alice  …  30-08 9:22   30-08 13:12  01-09 15:12  03-09 10:15
po2   Bob    …  30-08 10:15  30-08 13:14  01-09 16:13  03-09 17:18

poID  moID  type  added
po1   mo3   B     30-08 13:13
po1   mo4   A     30-08 13:14
po2   mo3   B     30-08 13:15
po2   mo4   C     30-08 13:16

po1: (created, poID=po1, time=30-08 9:22, cust.=Alice, …)
     (processed, poID=po1, time=30-08 13:12, …)
     (added, poID=po1, time=30-08 13:13, moID=mo3, …)
     (moID=mo3 refers to artifact “MaterialOrder”)
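The extraction rule (primary key → case id, time-stamped attribute → event, related attributes → event attributes) can be sketched as follows. The dictionary layout, the function names and the assumption that all timestamps fall in the same year are illustrative choices, not part of the presented tool:

```python
# Tables of the ProductOrder cluster from the running example.
product_order = [
    {"poID": "po1", "cust": "Alice", "created": "30-08 9:22", "processed": "30-08 13:12",
     "built": "01-09 15:12", "shipped": "03-09 10:15"},
    {"poID": "po2", "cust": "Bob", "created": "30-08 10:15", "processed": "30-08 13:14",
     "built": "01-09 16:13", "shipped": "03-09 17:18"},
]
ordered_material = [
    {"poID": "po1", "moID": "mo3", "type": "B", "added": "30-08 13:13"},
    {"poID": "po1", "moID": "mo4", "type": "A", "added": "30-08 13:14"},
    {"poID": "po2", "moID": "mo3", "type": "B", "added": "30-08 13:15"},
    {"poID": "po2", "moID": "mo4", "type": "C", "added": "30-08 13:16"},
]

def time_key(stamp):
    """Sort key for 'DD-MM H:MM' stamps, assuming all stamps lie in the same year."""
    date, clock = stamp.split()
    day, month = date.split("-")
    hour, minute = clock.split(":")
    return (int(month), int(day), int(hour), int(minute))

def extract_log(tables, case_id_col):
    """Turn each time-stamped attribute of each row into one event of the row's case."""
    log = {}
    for rows, timestamp_cols in tables:
        for row in rows:
            # All non-timestamp columns of the row become event attributes.
            attributes = {k: v for k, v in row.items() if k not in timestamp_cols}
            for activity in timestamp_cols:
                event = {"activity": activity, "time": row[activity], **attributes}
                log.setdefault(row[case_id_col], []).append(event)
    for events in log.values():
        events.sort(key=lambda e: time_key(e["time"]))
    return log

log = extract_log(
    [(product_order, ["created", "processed", "built", "shipped"]),
     (ordered_material, ["added"])],
    case_id_col="poID",
)
```

Each `added` event keeps its `moID` attribute, which is the reference to the MaterialOrder artifact that can later be used to relate the two models.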
21. Outline
[Figure: database → decompose by primary keys → logs for order and quote → discovery → models for order and quote → compose by primary/foreign-key relations → process model]
22. Resulting Model(s)
Product Order: create → processed → added (1..*) → built → shipped
Material Order: added (1..*) → completed → sent → received
example event: (added, poID=po1, …, moID=mo3)
26. Open issues
▪ performance
  • key discovery: NP-complete in R (# of records)
  • foreign key discovery: NP-complete in R²
  • problem is in the “hard part” of NP
  • → sampling of data, domain knowledge, semi-automatic
▪ requires good database structure
  • proper relations, proper keys
  • otherwise wrong clusters are formed
  • events don't get right attributes
  • → semi-automatic approach
▪ events shared by multiple cases … working on it …
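One way to read the "sampling of data" remedy is to rule out candidate keys cheaply on a random sample before any full check. This is a sketch under that assumption; `prune_by_sample` and its parameters are hypothetical names, not part of the presented work:

```python
import random

def prune_by_sample(rows, candidates, sample_size, seed=0):
    """Drop candidate keys that already show duplicate values on a random sample.

    A column set with duplicates in the sample cannot be a key of the full table;
    a survivor is only a candidate and still needs a full check or domain knowledge.
    """
    sample = random.Random(seed).sample(rows, min(sample_size, len(rows)))
    survivors = []
    for combo in candidates:
        projected = {tuple(row[i] for i in combo) for row in sample}
        if len(projected) == len(sample):
            survivors.append(combo)
    return survivors

# OrderedMaterial rows from the running example: (poID, moID, type, added).
rows = [
    ("po1", "mo3", "B", "30-08 13:13"),
    ("po1", "mo4", "A", "30-08 13:14"),
    ("po2", "mo3", "B", "30-08 13:15"),
    ("po2", "mo4", "C", "30-08 13:16"),
]
# Candidates given as column indexes: poID, moID, and the pair (poID, moID).
survivors = prune_by_sample(rows, [(0,), (1,), (0, 1)], sample_size=4)
```

Sampling can only reject candidates, never confirm them, which fits the slide's suggestion to combine it with domain knowledge in a semi-automatic approach.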