Business rules are widely used by enterprises to apply logic to their constantly growing data sets. Many business rule management systems (BRMS) facilitate this process, but they take a long time to process large-scale datasets. Today, with information volumes measured in terabytes, standalone business rule engines simply cannot keep up. With the advent of distributed computing technologies such as Hadoop, running jobs in parallel has become a much simpler and less stressful task. Many business rules are "embarrassingly parallel," which makes them perfect candidates for a parallel computing environment: most rules need only a single record in order to execute and enrich that record. Even business rules that lack this property can be adapted to run in a parallel environment. In this presentation, I will use the Drools BRMS to show how to use Hadoop and the MapReduce paradigm to scale business rules to massive datasets.
What we do
- Analyze call center data
  - Next call prevention
  - Call volume reduction
- Big data platform (that's me)
  - Applications on top of it
What are business rules
- Decision logic
  - Automate processes
  - Enforce policies
  - Make decisions
- ETL
- Business Rule Management Systems (BRMS)
  - ILOG, Drools, etc.
  - Write, manage, deploy, execute, monitor
How business rules work
- Java beans to hold data
  - Create one object for every record
- Rules to describe logic
- Insert all beans into engine
- Execute rules against objects
  - Modify objects in place
- Return new objects
- Write out to file/database/etc.
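In code, this flow looks roughly like the minimal standalone sketch below, using the Drools KIE API (the RuleRunner wrapper and loading rules from a classpath kmodule are my assumptions, not code from the talk):

    import java.util.List;
    import org.kie.api.KieServices;
    import org.kie.api.runtime.KieSession;

    public class RuleRunner {
        // Insert every record's bean, fire the rules, and let them
        // modify the beans in place.
        public static List<AgentSalesBean> run(List<AgentSalesBean> records) {
            KieSession session = KieServices.Factory.get()
                    .getKieClasspathContainer()
                    .newKieSession();
            for (AgentSalesBean bean : records) {
                session.insert(bean);   // one object per record goes into working memory
            }
            session.fireAllRules();     // rules execute against the inserted objects
            session.dispose();          // free working memory
            return records;             // enriched beans, ready to write out
        }
    }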
Why business rules
- Non-programmers who want to analyze data
  - Don't need to write code
  - Available GUIs for rule writing
- One-time infrastructure setup
- Rules are plug and play
Why business rules on Hadoop
- Memory intensive
  - Hard calculations, medium data
  - Complicated decision logic
  - Pseudo-joins
- Easy calculations, huge data
  - Row-by-row if-then
  - Aggregation
- Scaling existing solutions
  - Very relevant at NICE
Don't get too excited
- Calculations that require access to the full data set
  - Too much data for one key
- Serialization of objects
  - Only if you have a reducer
  - Custom objects only
Examples
- Agents make sales
- Fake generated data
- Different types of calculations
- Compare performance between standalone and clustered
Test Scenarios
1. How much bonus should the agent get based on the current sale?
   - Bonus = sale > 0 ? sale/100 : 0
   - All work in mapper, no reducer
2. How much did the agent sell in total?
   - Total = sum of all sales by this agent
   - Pass-through mapper, work in reducer
Details of Examples

    import org.apache.hadoop.io.Writable;

    public class AgentSalesBean implements Writable {
        private String name;
        private String office;
        private int salesTotal;
        private double bonus;

        // getters, setters, and serializer/deserializer (sketched below)
    }
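Fleshing out the bean above, a minimal sketch of the serializer/deserializer pair that the Writable contract requires; the field order here is my choice, and it just has to match between write() and readFields():

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Writable;

    public class AgentSalesBean implements Writable {
        private String name;
        private String office;
        private int salesTotal;
        private double bonus;

        @Override
        public void write(DataOutput out) throws IOException {
            // Serialize the fields in a fixed order...
            out.writeUTF(name);
            out.writeUTF(office);
            out.writeInt(salesTotal);
            out.writeDouble(bonus);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            // ...and deserialize them in exactly the same order.
            name = in.readUTF();
            office = in.readUTF();
            salesTotal = in.readInt();
            bonus = in.readDouble();
        }

        // getters and setters omitted
    }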
Details of Examples
- Read in a file with records
  - 1 row = 1 AgentSalesBean
- 5M–50M records in increments of 5M for the first test
- 1M–5M records in increments of 1M for the second test
- Extra runs on Hadoop with more data
- 1M records ≈ 23 MB
Details of Examples
- Standalone machine
  - 3.3 GHz (1 CPU × 4 cores × 1 thread/core)
  - 9 GB RAM allocated to the JVM
- Cluster
  - 3 data nodes
  - 2.4 GHz (2 CPUs × 4 cores × 2 threads/core)
  - 2 GB/mapper (32 GB available/node)
First Scenario
- How much bonus should the agent get based on the current sale?
- Bonus = sale > 0 ? sale/100 : 0
- All work in mapper, no reducer (see the mapper sketch below)
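A minimal sketch of a mapper for this scenario; the CSV layout, the setter names, and running a Drools session inside the mapper are my assumptions. Note the fact is deleted after firing so working memory doesn't grow with every record:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.kie.api.KieServices;
    import org.kie.api.runtime.KieSession;
    import org.kie.api.runtime.rule.FactHandle;

    public class BonusMapper extends Mapper<LongWritable, Text, NullWritable, AgentSalesBean> {
        private KieSession session;

        @Override
        protected void setup(Context context) {
            // One rule session per mapper JVM; rules load from the classpath kmodule.
            session = KieServices.Factory.get().getKieClasspathContainer().newKieSession();
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");  // hypothetical layout: name,office,sale
            AgentSalesBean bean = new AgentSalesBean();
            bean.setName(fields[0]);
            bean.setOffice(fields[1]);
            bean.setSalesTotal(Integer.parseInt(fields[2]));

            FactHandle handle = session.insert(bean);
            session.fireAllRules();    // the bonus rule sets bean.bonus in place
            session.delete(handle);    // clear working memory between records
            context.write(NullWritable.get(), bean);
        }

        @Override
        protected void cleanup(Context context) {
            session.dispose();
        }
    }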
Results – First Scenario

[Chart: runtime in seconds vs. number of records in millions (0–50M). Trendlines: Hadoop y = 3.374x + 31.6 (R² = 0.967); standalone y = 6.517x + 7.466 (R² = 0.993).]
Results – First Scenario
- Neither implementation ran out of RAM
  - Have to clear out working memory
- Both implementations grow linearly
  - Standalone grows twice as fast
Second Scenario
- How much did the agent sell in total?
- Total = sum of all sales by this agent
- Pass-through mapper, work in reducer (see the reducer sketch below)
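A minimal sketch of the reducer side, assuming the pass-through mapper emits (agent name, sale amount) pairs; the key/value types are my choice. Because all sales for one agent arrive in a single reduce() call, per-agent aggregation logic, whether plain code or rules, sees the whole group at once:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class TotalSalesReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text agent, Iterable<IntWritable> sales, Context context)
                throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable sale : sales) {
                total += sale.get();    // Total = sum of all sales by this agent
            }
            context.write(agent, new IntWritable(total));
        }
    }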
Results – Second Scenario

[Chart: runtime in seconds vs. number of records in millions (0–5M). Trendlines: Hadoop y = 6x + 32.8 (R² = 0.992); standalone y = 15.6x - 3 (R² = 0.999).]
Results – Second Scenario
- Standalone ran out of RAM (9 GB) at 5M records
- Hadoop never did; ran up to 50M
  - Hadoop ran with 24 reducers at 600 MB each
  - Getting the data there took a while, though
- MapReduce scaled very well
  - Due to how Hadoop MapReduce is implemented
  - Can be run on just one reducer
Conclusions
- Lots of memory required for complicated rules
  - Read up on the implementation of your engine
- Hadoop only for huge datasets
  - JVM startup time
  - Cutoff size depends on rule complexity and object sizes
- Hadoop scales very well
  - Especially with subset calculations
How do I actually do this?
- You will have a custom solution
  - Need to know your data
  - Need to know what you want to find out
  - Different ways of writing rules for the same thing
- Write your own MapReduce jobs (see the driver sketch below)
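As a starting point, a hypothetical driver for the map-only bonus scenario from earlier; the class names and argument layout are illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class BonusJob {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "agent-bonus-rules");
            job.setJarByClass(BonusJob.class);
            job.setMapperClass(BonusMapper.class);
            job.setNumReduceTasks(0);               // map-only: all work happens in the mapper
            job.setOutputKeyClass(NullWritable.class);
            job.setOutputValueClass(AgentSalesBean.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }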
How do I actually do this?
- Figure out execution of rules
  - What does the work (mapper vs. reducer vs. both)
- Make sure your beans can be serialized
  - Recursive serialization
- Your mileage will vary
  - Every organization has different needs and capacities