This is the YouTube video for "Spark" in our Global Innovation Nights series.
In the Global Innovation Nights workshops, our engineers talk about our technology and knowledge.
The first topic in this series is "Spark". We introduce how we use Spark in the development of "AI WORKS". Using distributed computing, especially Spark, AI WORKS can process large-scale, complex payroll and accounting workloads much faster than legacy ERP systems.
We also introduce the history of research and development of distributed computing at Works Applications.
We hope you enjoy this look at our technology and knowledge.
3. Agenda
- Profile
- About Works Applications
- Our R&D history of distributed processing
- Case 1: Accounting summary
- Case 2: Salary calculation
4. Profile
Kawakami Tomoki, Engineer
Career
2012 Joined Works Applications
2014 Joined the new ERP project, AI WORKS
2015 Joined the distributed processing team
Work
- Platform development for distributed processing
- Speeding up business processes
- Helping communication between Japanese and foreign engineers
# Not a science major
# Hobbies: travel & foreign languages
9. A. I'm in the middle of global development
(19 English speakers, 2 Japanese speakers)
11. We develop and sell ERP package software: COMPANY & AI WORKS
Development offices in Tokyo, Shanghai, Singapore, and India
Founded in 1996; 4,917 employees (as of 1 April 2016)
13. High usability & high speed = 100 ms
Using Cassandra, Spark, YARN, Kafka, Elasticsearch, etc.
Cloud-native application
A.I. built in
18. Our R&D history of distributed processing
2005
Release multi-node, parallel COBOL process for the salary calculation batch
2010
COBOL-to-Java conversion
Multi-thread processing framework for batches
2012
Hadoop verification for salary calculation (RDB)
2013
Start R&D for AI WORKS; choose Cassandra
2014
Hadoop verification for financial summary
2015
Choose Spark
Develop salary calculation batch and financial summary batch with Spark
Develop platform for distributed processing
2016
More batches developed with Spark
19. Our R&D history of distributed processing (cont.)
On the 2005 release: users had requested that the batch be sped up, and its structure is similar to the current Spark batch.
20. Our R&D history of distributed processing (cont.)
On the 2010 shift to multi-threading: the number of CPU cores per machine had increased, it was more efficient to process on a single machine, and users could not manage a cluster themselves.
21. Our R&D history of distributed processing (cont.)
On the 2012 Hadoop verification: Hadoop did not fit well with the RDB.
22. Our R&D history of distributed processing (cont.)
On the 2013 return to distributed processing with Cassandra: on the cloud it is easy to get resources on demand, performance is more stable across multiple nodes than on one node, and OSS was becoming popular in the enterprise.
23. Our R&D history of distributed processing (cont.)
On the 2015 choice of Spark: it has an easy-to-learn interface, better performance than Hadoop, and it was the trend.
27. Behind the summary
Summarize journal records to create financial reports (B/S, P/L, trial balance).
In SQL: “select sum(amount) from journals group by item, section, product, xxx”
For each summary type, all combinations need to be calculated.
Example:
Number of financial items = 1,000
Summary by section (7,000 sections) = 1,000 * 7,000 combinations
Summary by section & product (3,000 products) = 1,000 * 7,000 * 3,000 combinations
30. Process image
RecordRdd
  .flatMapToPair()  ← add key by item and type
  .reduceByKey()    ← sum up
  .mapToPair()      ← add key by type and term
  .groupByKey()     ← collect to the same partition
  .foreach()        ← update DB
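As a rough illustration only, here is a minimal, self-contained sketch of that pipeline against the Spark 2.x Java API. The Journal class, the string-based summary keys such as "BY_SECTION", and the two summary types shown are simplified stand-ins for the real AI WORKS data model, and the database update is left as a stub.

import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;

import java.io.Serializable;
import java.util.Arrays;
import java.util.List;

public class SummaryBatchSketch {

    // One journal line: an amount plus the dimensions it can be summarized by.
    public static class Journal implements Serializable {
        public final String term, item, section, product;
        public final long amount;
        public Journal(String term, String item, String section, String product, long amount) {
            this.term = term; this.item = item; this.section = section;
            this.product = product; this.amount = amount;
        }
    }

    public static void run(JavaRDD<Journal> recordRdd) {
        recordRdd
            // Add key by item & type: fan each record out to one pair per summary type
            .flatMapToPair(j -> {
                List<Tuple2<Tuple2<String, String>, Long>> pairs = Arrays.asList(
                    new Tuple2<>(new Tuple2<>("BY_SECTION|" + j.term,
                            j.item + "|" + j.section), j.amount),
                    new Tuple2<>(new Tuple2<>("BY_SECTION_PRODUCT|" + j.term,
                            j.item + "|" + j.section + "|" + j.product), j.amount));
                return pairs.iterator();
            })
            // Sum up: one total per (summary type, term, dimension values) key
            .reduceByKey(Long::sum)
            // Add key by type & term so all rows of one report meet in one partition
            .mapToPair(kv -> new Tuple2<>(kv._1()._1(), kv))
            .groupByKey()
            // Update DB: write each report's rows in one pass per group
            .foreach(group -> {
                for (Tuple2<Tuple2<String, String>, Long> row : group._2()) {
                    // e.g. persist (row._1()._2(), row._2()) to the summary table for group._1()
                }
            });
    }
}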
31. Techniques
1. Split tasks into the proper unit:
   one record in one journal
   < one journal
   < all journals for one summary type
   < all journals for all summary types
2. Use the Cassandra counter (see the sketch after this list)
   Pros: higher concurrency
   Cons: not idempotent
3. Combine with stream processing
   A Spark batch alone would make the UX worse.
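As a rough sketch of technique 2 (not the actual AI WORKS code), the snippet below increments a Cassandra counter column through the DataStax Java driver 3.x. The keyspace "summary", the table "totals", and the key format are invented for illustration. Counter updates are commutative, so many Spark executors can add to the same summary cell without a read-modify-write cycle, but they are not idempotent: a retried task adds its amount again.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

public class CounterSketch {
    public static void main(String[] args) {
        // Assumed schema:
        //   CREATE TABLE summary.totals (summary_key text PRIMARY KEY, amount counter);
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("summary")) {

            PreparedStatement add = session.prepare(
                "UPDATE totals SET amount = amount + ? WHERE summary_key = ?");

            // Pro: no read-modify-write, so concurrent increments from many
            // executors do not conflict with each other.
            // Con: not idempotent; if a Spark task is retried, the same amount
            // is added a second time.
            session.execute(add.bind(1200L, "BY_SECTION|2016Q1|travel|sales-dept"));
        }
    }
}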
34. Behind the calculation
Calculating the many types of salary items requires many kinds of information.
Calculate:
- Fixed salary
- Overtime
- Health insurance
- Resident tax
- Pension
...etc.
Information:
- Class
- Evaluation
- Attendance
- Overtime
- Paid leaves
- Family
- Address
- Previous income
...etc.
38. Techniques
1. Use a de-normalized table in Cassandra
   Easier to get all the data for an employee at once
2. Find the best number of employees per partition (see the sketch below)
   To maximize Cassandra throughput
   Found by trial and error: about 2,000 employees
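As a rough sketch of technique 2 (again with the Spark Java API, not the actual AI WORKS code): repartition the employee IDs so that each Spark partition holds roughly 2,000 of them, the figure found by trial and error above, and then read from Cassandra and calculate salaries partition by partition. The employee ID type and the calculation body are placeholders.

import org.apache.spark.api.java.JavaRDD;

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class SalaryPartitionSketch {
    // Found by trial and error to give the best Cassandra throughput.
    static final int EMPLOYEES_PER_PARTITION = 2000;

    public static JavaRDD<String> run(JavaRDD<String> employeeIds) {
        long total = employeeIds.count();
        int partitions = (int) Math.max(1,
            (total + EMPLOYEES_PER_PARTITION - 1) / EMPLOYEES_PER_PARTITION);

        return employeeIds
            .repartition(partitions)
            .mapPartitions((Iterator<String> ids) -> {
                List<String> results = new ArrayList<>();
                while (ids.hasNext()) {
                    String id = ids.next();
                    // Read the de-normalized Cassandra row for this employee and
                    // calculate the salary here; the result format is a placeholder.
                    results.add(id + ":calculated");
                }
                return results.iterator();
            });
    }
}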