2. Today’s talk
> What’s Embulk?
> Why do our customers use Embulk?
> Data Connector
> The architecture
> The use case with the MapReduce Executor
> How do we configure the MapReduce Executor?
3. What’s Embulk?
> An open-source parallel bulk data loader
> loads records from “A” to “B”
> using plugins
> for various kinds of “A” and “B”
> to make data integration easy
> (which used to be very painful…)
[“A” / “B”: Storage, RDBMS, NoSQL, Cloud Service, etc.]
[pain points: broken records, transactions (idempotency), performance, …]
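The “A” to “B” idea maps directly onto an Embulk config. A minimal sketch (paths and the output target are hypothetical; the file input and csv parser are bundled with Embulk, while outputs such as PostgreSQL come from separate plugins):

```yaml
in:                                 # “A”: where records come from
  type: file
  path_prefix: ./data/sample_       # hypothetical path prefix
  parser:
    type: csv
    columns:
      - {name: id, type: long}
      - {name: name, type: string}
      - {name: time, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
out:                                # “B”: where records go
  type: postgresql                  # hypothetical target (embulk-output-postgresql)
  host: localhost
  database: mydb
  table: sample
```

Swapping “A” or “B” is just a matter of changing the `in:` or `out:` plugin section.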
5. Why do our customers use Embulk?
> They upload various types of their data to TD with Embulk
> Various file formats
> CSV, TSV, JSON, XML, …
> Various data sources
> Local disk, RDBMS, SFTP, …
> Various network environments
> embulk-output-td
> https://github.com/treasure-data/embulk-output-td
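An `out:` section using embulk-output-td might look roughly like this (the key names follow the plugin; all values are placeholders, see the plugin README for the full option list):

```yaml
out:
  type: td
  apikey: "YOUR_TD_APIKEY"          # placeholder credential
  endpoint: api.treasuredata.com
  database: my_database             # hypothetical database name
  table: my_table                   # hypothetical table name
  time_column: time
```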
6. Out of scope for Embulk
> They develop scripts for
> generating Embulk configs
> changing schema on a regular basis
> logic to select some files but not others
> managing cron settings
> e.g. some users want to upload yesterday’s data as a daily batch
> Embulk is just a “bulk loader”
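The config-generating scripts mentioned above can be quite small. A minimal sketch in Python (the template, paths, and plugin choice are hypothetical) that renders yesterday’s date into a `path_prefix` for a daily batch:

```python
from datetime import date, timedelta

# Hypothetical Embulk config template: the daily batch loads only
# yesterday's directory by substituting the date into path_prefix.
TEMPLATE = """\
in:
  type: file
  path_prefix: /data/logs/{day}/access_
  parser:
    type: csv
out:
  type: td
"""

def render_config(today=None):
    """Return an Embulk config string targeting yesterday's data."""
    today = today or date.today()
    yesterday = today - timedelta(days=1)
    return TEMPLATE.format(day=yesterday.isoformat())
```

A cron job would then write this string to a file and invoke `embulk run` on it.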
7. Best practice to manage Embulk!!
http://www.slideshare.net/GONNakaTaka/embulk5
17. Timestamp parsing
> Implemented strptime in Java
> Ported from the CRuby implementation
> Can precompile the format
> Faster than JRuby’s strptime
> Has been maintained in the Embulk repo, somewhat obscurely…
> It will be merged into JRuby
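The win from precompiling is that the format string is analyzed once, and every record afterwards only pays for a match. An illustrative Python sketch of the idea (not Embulk’s actual Java implementation, which supports the full strptime directive set):

```python
import re
from datetime import datetime

# Illustrative subset of strptime directives, each compiled to a named
# regex group; the real implementation handles many more.
_DIRECTIVES = {
    "%Y": r"(?P<Y>\d{4})",
    "%m": r"(?P<m>\d{2})",
    "%d": r"(?P<d>\d{2})",
    "%H": r"(?P<H>\d{2})",
    "%M": r"(?P<M>\d{2})",
    "%S": r"(?P<S>\d{2})",
}

def precompile(fmt):
    """Analyze the format string once, returning a compiled regex."""
    pattern = re.escape(fmt)
    for directive, rx in _DIRECTIVES.items():
        pattern = pattern.replace(re.escape(directive), rx)
    return re.compile(pattern)

def parse(compiled, text):
    """Parse one record with a precompiled format: a single regex match."""
    g = {k: int(v) for k, v in compiled.match(text).groupdict().items()}
    return datetime(g["Y"], g["m"], g["d"],
                    g.get("H", 0), g.get("M", 0), g.get("S", 0))
```

Calling `precompile` once per column and `parse` once per record avoids re-interpreting the format string for every row.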
18. How we use Data Connector at TD
> a. Monitoring access to our S3 buckets
> e.g. “Which IAM users accessed our S3 buckets?”,
“How frequently are they accessed?”
> {in: {type: s3}} and {parser: {type: csv}}
> b. Measuring KPIs for our development process
> e.g. “Which phases of the process take a long time?”
> {in: {type: jira}}
> c. Measuring business & support performance
> {in: {type: Salesforce, Marketo, ZenDesk, …}}
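For use case (a), the S3 input and CSV parser combine roughly like this (bucket name, paths, and credentials are placeholders; the option names follow the embulk-input-s3 plugin):

```yaml
in:
  type: s3
  bucket: my-s3-access-logs         # hypothetical bucket name
  path_prefix: logs/access_         # hypothetical key prefix
  access_key_id: YOUR_ACCESS_KEY    # placeholder credential
  secret_access_key: YOUR_SECRET    # placeholder credential
  parser:
    type: csv
out:
  type: td
```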
19. Scaling Embulk
> Requests from users for massive data loading
> e.g. “Upload 150GB of data as an hourly batch”,
“Start a PoC and upload 500GB of data today”
> The Local Executor cannot handle this scale
> The MapReduce Executor enables us to scale
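Switching executors is a config change rather than a code change: an `exec:` section selects the MapReduce executor. A sketch following the embulk-executor-mapreduce plugin (the Hadoop config paths and state path are placeholders):

```yaml
exec:
  type: mapreduce
  config_files:                         # Hadoop client configuration
    - /etc/hadoop/conf/core-site.xml    # placeholder path
    - /etc/hadoop/conf/mapred-site.xml  # placeholder path
  state_path: /tmp/embulk/              # placeholder: where task state is kept
```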
26. Grouping input files
- {in: {min_task_size}}
[diagram: map tasks → shuffle → reduce tasks]
It can also reduce the mappers’ launch cost.
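Per the slide, the grouping is controlled by a `min_task_size` option under `in:` (the value below is a hypothetical byte threshold): small input files are bundled together until each task reaches at least that size, so fewer mappers are launched.

```yaml
in:
  type: file
  path_prefix: /data/logs/access_   # hypothetical path prefix
  min_task_size: 134217728          # hypothetical: group files up to ~128MB per task
  parser:
    type: csv
```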
27. One partition into multi-reducers
- {exec: {partitioning: {map_side_split}}}
[diagram: map tasks → shuffle → reduce tasks]
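Again per the slide, splitting one partition across multiple reducers is enabled under `exec:` → `partitioning:`. A sketch with the `map_side_split` key from the slide (the other keys and values shown are hypothetical placeholders for the partitioning setup):

```yaml
exec:
  type: mapreduce
  partitioning:
    type: timestamp        # hypothetical partitioning type
    column: time           # hypothetical partitioning column
    map_side_split: true   # split one partition across multiple reducers
```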
28. Conclusion
> What’s Embulk?
> Why do we use Embulk?
> Data Connector
> The architecture of Data Connector
> The use case with the MapReduce Executor