SlideShare ist ein Scribd-Unternehmen logo
1 von 28
Downloaden Sie, um offline zu lesen
Adding Forks to Your
Perly
Tableware
Handling fixed-width records in parallel
S Lteven embark
W Corkhorse omputing
lembark@wrkhors.com
You may hear things like:
Perl is too slow for processing fixed-width records so I
wrote the latest version in C++ since it can handle the
data in parallel.
There are a few misunderstandings here:
That Perl is slow.
That Perl does not handle parallel processing well.
That Perl does not handle fixed-width records.
It isn't, does, and can.
Comparing C++ and Perl:
Perl's I/O is a fairly thin layer over unistd.h library calls.
Perl and C++ block at the same rate.
Forked processes can easily share data stream with
separate file handles by letting the O/S buffer data.
unpack is reasonably efficient and more dynamic than
using struct's in C++.
Most of us use variable-width, delimited records.
These are the usual newline-terminated data we all know
and [lh][oa][vt]e.
Perl handles these via $, readline, and split or regexen.
Common examples: logs or FASTA and FASTQ.
Read using the buffered I/O with readline or read.
Used for large records, variable or self-described data.
Fixed-Width Records
Fixed number of bytes per record.
Small records with space- or zero-padding per field.
Common in financial data – derived from card images
used on mainframe systems.
Record sizes tend to be small.
Files with lots of rows leaving them “tall & narrow”.
Fixed-width reads
Perl can read them with readline, read, or sysread.
read() uses Perl's buffered I/O to read characters.
sysread() bypasses buffered I/O and reads bytes.
$/ does record-based character I/O with maximum
record size if the O/S supports it.
The thinnest layer over the O/S is sysread.
UTF8: When fixed-width isn't
Traditional fixed-width data has fixed bytes per record.
Records read into a fixed-width buffer N bytes at a time.
UTF8 has characters not bytes.
Result: Use read() with I/O disciplines to deal with data
that may contain UTF8-encoded strings.
Buffered I/O system deals with layered I/O and
disciplines.
Using sysread
Copy up to N bytes from file handle as-is into a buffer:
sysread FILEHANDLE,SCALAR,LENGTH,OFFSET
Bypasses process buffers and file handles.
On *NIX, Perl's sysread is a thin layer over read(2).
Copies data from the kernel buffer to process space.
Sharing a buffer
Common view: threads & shared memory “right” way.
Threads share the filehandle and in-process buffer.
Threads are a lot of work to program & test
Locking overhead may kill any time advantage.
Small records can share a kernel buffer.
Reads from the data stream pull in record-size chunks.
Traditional view of file handles
Handles connect to hardware.
Buffer the data in their own
space for efficiency.
readline and read use their
own buffers for this reason.
Kernel buffers make a big difference
Modern O/S use memory-
mapped I/O to a kernel buffer.
Data transfer via memcpy to
userspace.
Data faulted into the buffer.
File handles read from the
kernel buffer, not the device.
Example: Stock Trading Data
From the NYX daily files documentation.
Kernel buffers hold ~42 Quotes in an 4KB page.
Reads from multiple file handles will hit the buffer more
often than the disk.
Data
Format
Most of the
fields are
single-char
tags.
Record has
small
memory
footprint.
Field Offset Size Format
time 0 9 hhmmssXXX msec
exchange 9 1 char dictionary
symbol 10 16 char 6+10
bid price 26 11 float %.04f
bid size 37 7 int %7d
ask price 44 11 float %.04f
ask size 55 11 int %7d
quote condition 62 1 char dictionary
filler 63 4 text four blanks
bid exchange 67 1 char dictionary
ask exchange 68 1 char dictionary
sequence no 69 16 int %d
national bbo 85 1 digit dictionary
nasdaq bbo 86 1 digit dictionary
cancel/correct 87 1 char dictionary
source 88 1 char dictionary
retal int flag 89 1 char dictionary
short sale flag 90 1 char dictionary
CQS 91 1 char dictionary
UTP 92 1 char dictionary
finra adf 93 1 char dictionary
line 94 2 text cMcJ
Handling records
Unpack is the fastest way to split this up.
“A” is a space-padded ascii character sequence.
“x” skips a number of bytes.
No math on the values: don't convert to C int or float.
DBI stringifies all values anyway.
Also need to discard fixed with header record.
Parallel processing
Forks are actually simpler than they sound.
CPAN makes them even simpler:
I'll use Parallel::Queue for the examples here.
This takes a list of closures (or a job-creator object).
It forks them N-way parallel, aborting the queue if any
jobs exit non-zero.
Deals with exit (vs. return) to avoid phorktosis.
Simplest approach: Manager + Worker
A manager process forks off workers to read the data and
then cleans up after them.
Workers are given a filehandle.
Each worker reads a single record with sysread.
The natural order of reads will have most of the proc's
doing buffer copies most of the time.
Describing record
Template uses space-
padded values.
“A” and a width for
data loaded with DBI.
“x” ignores filler.
Read size == sum of A
& x fields.
my @fieldz =
(
[ qw( time A 9 ) ],
[ qw( exch A 1 ) ],
...
[ qw( fill x 4 ) ],
...
);
my $template
= join '',
map
{
join '', @{ $_ }[ 1, 2 ];
}
@fieldz;
my $size = sum map { $_->[2] } @fieldz;
#!/bin/env perl
use v5.22;
use autodie;
use Parallel::Queue;
my ( $path, $jobs ) = getopt ...;
my @fieldz = ... ;
my $template = ... ;
my $size = ... ;
my $buffer = '';
open my $fh, '<', $path;
sysread $fh, $buffer, 92 # fixed header
// die "Failed header: $!";
sub read_recs { ... }
my @queue = ( sub { read_recs ) x $jobs;
my @failed = runqueue $jobs, @queue
or die 'Failed:', @unfinished;
Closures dispatch reads.
Pass the queue to P::Q.
Failed jobs are returned
as a list of closures.
Production code needs
check on $. for restarts!
Dispatching reads.
Reading buffers
sub read_recs
{
my $dbh = DBI->connect( ... );
my $sth = $dbh->prepare( ... );
my $i = 0
for(;;)
{
$i = sysread $fh, $buffer, $size
// die;
$i or last;
$i == $size
or die "Runt: $i at $.n";
$sth->execute
(
unpack $template => $buffer
);
}
return
}
Not much code.
sysread pulls whole
records.
Sync via blocking.
Each process gets its
own $dbh, $sth.
What if the data is zipped?
Same basic process: share a filehandle.
open my $fh, '-|', “gzip -dc $path”;
After that fork and share a file handle.
Named pipes or filesystem sockets also be useful.
Network sockets have packet issues with record
boundaries – server needs to pre-chunk the data.
Improving DBI performance
Calling $sth->execute( @_ ) is expensive.
The faster approach is calling bind and using lexical
variables for the output...
But that doesn't fit onto a single slide.
Quick fix: bind array elements with “$row[$i]”
Saves managing a dozen+ lexical variables.
Half-billion records makes this worth benchmarking.
Multiple blocks help reduce overhead
Read N records < system page size.
Kernel call overhead more than larger memcopy.
Multiple records avoid starving jobs.
Check read with ! ( $i % $size ).
Apply template with substr or multiply template and
chunk array.
Using a job-object with P::Q
Generate the jobs as needed.
Blessed queue entry that can( 'next_job').
This can be handy for processing multiple files:
The original “queue” has an object for each file.
N jobs generated for each file by handler object.
Simple benchmark
open my $fh, '<', $path;
my $size = ... ;
my $buffer = ' ' $size;
my $i = 0;
my $j = 0;
my $wall = 0;
sysread $fh, $buffer, 92; # discard header
for(;;)
{
$i = sysread $fh, $buffer, $size
or last;
$i == $size
or die "Runt read: $i at $.";
@row = unpack $template => $buffer;
++$j % 65_536
and next;
$wall = tv_interval $t0, [ gettimeofday ];
say "$$ $j / $wall = " . int( $j / $wall );
}
$i // die "Failed read: $!";
say "Final: $$ $j / $wall = " . int( $j / $wall );
Read & split rows.
Ignore DBI overhead.
Vary buffer size to check
chunked read speed.
Loop with fork to check
number of jobs.
Running a million records through...
~ 4.4 sec wallclock
for 1Mrec.
About 3Ksec for the
daily file's full
550Mrec.
Un-zipping this
many records to
/dev/null takes about
1.6sec.
13455 4 65536 / 1.077347 = 60830
13452 1 65536 / 1.127445 = 58127
13453 2 65536 / 1.324299 = 49487
13454 3 65536 / 1.739874 = 37667
13455 4 131072 / 2.139257 = 61269
13452 1 131072 / 2.188887 = 59880
13453 2 131072 / 2.383899 = 54982
13454 3 131072 / 3.157574 = 41510
13455 4 196608 / 3.19973 = 61445
13452 1 196608 / 3.251169 = 60473
13453 2 196608 / 3.445435 = 57063
13454 3 196608 / 4.21482 = 46646
13455 4 262144 / 4.274705 = 61324
13452 1 262144 / 4.303842 = 60909
Final: 13455 4 268606 / 4.380693 = 61315
Final: 13454 3 207351 / 4.389554 = 47237
Final: 13452 1 267971 / 4.398593 = 60921
Final: 13453 2 256075 / 4.397525 = 58231
10Mrec looks about the same
Better per-process
balance with
longer running
tasks.
Still ~60KHz per
process.
13182 4 2424832 / 39.453094 = 61461
13181 3 2490368 / 39.666493 = 62782
13179 1 2490368 / 40.221195 = 61916
13180 2 2490368 / 40.26747 = 61845
Summary: it's easier than you think.
sysread makes reading fixed records easy.
Parallel::Queue makes it simple to fork.
Kernel buffering makes the streaming data efficient.
No need for shared memory or locking.
Net result: parallel processing of fixed-width data in Perl
is actually pretty easy.
References
As always, perldoc is your friend:
perlpacktut
perlperf
perlfork
Try “perldoc perl” just to see how much there is!

Weitere ähnliche Inhalte

Was ist angesagt?

Parquet Twitter Seattle open house
Parquet Twitter Seattle open houseParquet Twitter Seattle open house
Parquet Twitter Seattle open houseJulien Le Dem
 
20141111 파이썬으로 Hadoop MR프로그래밍
20141111 파이썬으로 Hadoop MR프로그래밍20141111 파이썬으로 Hadoop MR프로그래밍
20141111 파이썬으로 Hadoop MR프로그래밍Tae Young Lee
 
Top 10 Hadoop Shell Commands
Top 10 Hadoop Shell Commands Top 10 Hadoop Shell Commands
Top 10 Hadoop Shell Commands SimoniShah6
 
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...CloudxLab
 
Introduction to HDFS and MapReduce
Introduction to HDFS and MapReduceIntroduction to HDFS and MapReduce
Introduction to HDFS and MapReduceUday Vakalapudi
 
Cassandra by example - the path of read and write requests
Cassandra by example - the path of read and write requestsCassandra by example - the path of read and write requests
Cassandra by example - the path of read and write requestsgrro
 
What's new in Redis v3.2
What's new in Redis v3.2What's new in Redis v3.2
What's new in Redis v3.2Itamar Haber
 
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
Hadoop 20111117
Hadoop 20111117Hadoop 20111117
Hadoop 20111117exsuns
 
InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...
InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...
InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...InfluxData
 
Writing and using php streams and sockets
Writing and using php streams and socketsWriting and using php streams and sockets
Writing and using php streams and socketsElizabeth Smith
 
MongoDB & Hadoop: Flexible Hourly Batch Processing Model
MongoDB & Hadoop: Flexible Hourly Batch Processing ModelMongoDB & Hadoop: Flexible Hourly Batch Processing Model
MongoDB & Hadoop: Flexible Hourly Batch Processing ModelTakahiro Inoue
 
multi-line record grep
multi-line record grepmulti-line record grep
multi-line record grepRyoichi KATO
 
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like systemAccelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like systemShuai Yuan
 
Interactive big data analytics
Interactive big data analyticsInteractive big data analytics
Interactive big data analyticsViet-Trung TRAN
 
(Julien le dem) parquet
(Julien le dem)   parquet(Julien le dem)   parquet
(Julien le dem) parquetNAVER D2
 

Was ist angesagt? (20)

Anatomy of file write in hadoop
Anatomy of file write in hadoopAnatomy of file write in hadoop
Anatomy of file write in hadoop
 
Parquet Twitter Seattle open house
Parquet Twitter Seattle open houseParquet Twitter Seattle open house
Parquet Twitter Seattle open house
 
20141111 파이썬으로 Hadoop MR프로그래밍
20141111 파이썬으로 Hadoop MR프로그래밍20141111 파이썬으로 Hadoop MR프로그래밍
20141111 파이썬으로 Hadoop MR프로그래밍
 
Top 10 Hadoop Shell Commands
Top 10 Hadoop Shell Commands Top 10 Hadoop Shell Commands
Top 10 Hadoop Shell Commands
 
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
 
Introduction to HDFS and MapReduce
Introduction to HDFS and MapReduceIntroduction to HDFS and MapReduce
Introduction to HDFS and MapReduce
 
Cassandra by example - the path of read and write requests
Cassandra by example - the path of read and write requestsCassandra by example - the path of read and write requests
Cassandra by example - the path of read and write requests
 
What's new in Redis v3.2
What's new in Redis v3.2What's new in Redis v3.2
What's new in Redis v3.2
 
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
 
Hadoop 20111117
Hadoop 20111117Hadoop 20111117
Hadoop 20111117
 
InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...
InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...
InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...
 
Writing and using php streams and sockets
Writing and using php streams and socketsWriting and using php streams and sockets
Writing and using php streams and sockets
 
MongoDB & Hadoop: Flexible Hourly Batch Processing Model
MongoDB & Hadoop: Flexible Hourly Batch Processing ModelMongoDB & Hadoop: Flexible Hourly Batch Processing Model
MongoDB & Hadoop: Flexible Hourly Batch Processing Model
 
Understanding Hadoop
Understanding HadoopUnderstanding Hadoop
Understanding Hadoop
 
Anatomy of file read in hadoop
Anatomy of file read in hadoopAnatomy of file read in hadoop
Anatomy of file read in hadoop
 
multi-line record grep
multi-line record grepmulti-line record grep
multi-line record grep
 
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like systemAccelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
 
Interactive big data analytics
Interactive big data analyticsInteractive big data analytics
Interactive big data analytics
 
Gur1009
Gur1009Gur1009
Gur1009
 
(Julien le dem) parquet
(Julien le dem)   parquet(Julien le dem)   parquet
(Julien le dem) parquet
 

Andere mochten auch

Our Friends the Utils: A highway traveled by wheels we didn't re-invent.
Our Friends the Utils: A highway traveled by wheels we didn't re-invent. Our Friends the Utils: A highway traveled by wheels we didn't re-invent.
Our Friends the Utils: A highway traveled by wheels we didn't re-invent. Workhorse Computing
 
Investor Seminar in San Francisco March 7th, 2015
Investor Seminar in San Francisco March 7th, 2015Investor Seminar in San Francisco March 7th, 2015
Investor Seminar in San Francisco March 7th, 2015Joe Pryor
 
Low and No cost real estate marketing plan for Enid Oklahoma
Low and No cost real estate marketing plan for Enid OklahomaLow and No cost real estate marketing plan for Enid Oklahoma
Low and No cost real estate marketing plan for Enid OklahomaJoe Pryor
 

Andere mochten auch (7)

Our Friends the Utils: A highway traveled by wheels we didn't re-invent.
Our Friends the Utils: A highway traveled by wheels we didn't re-invent. Our Friends the Utils: A highway traveled by wheels we didn't re-invent.
Our Friends the Utils: A highway traveled by wheels we didn't re-invent.
 
Investor Seminar in San Francisco March 7th, 2015
Investor Seminar in San Francisco March 7th, 2015Investor Seminar in San Francisco March 7th, 2015
Investor Seminar in San Francisco March 7th, 2015
 
Object Exercise
Object ExerciseObject Exercise
Object Exercise
 
parallel processing
parallel processingparallel processing
parallel processing
 
A-Walk-on-the-W-Side
A-Walk-on-the-W-SideA-Walk-on-the-W-Side
A-Walk-on-the-W-Side
 
Clustering Genes: W-curve + TSP
Clustering Genes: W-curve + TSPClustering Genes: W-curve + TSP
Clustering Genes: W-curve + TSP
 
Low and No cost real estate marketing plan for Enid Oklahoma
Low and No cost real estate marketing plan for Enid OklahomaLow and No cost real estate marketing plan for Enid Oklahoma
Low and No cost real estate marketing plan for Enid Oklahoma
 

Ähnlich wie Perly Parallel Processing of Fixed Width Data Records

Perl at SkyCon'12
Perl at SkyCon'12Perl at SkyCon'12
Perl at SkyCon'12Tim Bunce
 
GC free coding in @Java presented @Geecon
GC free coding in @Java presented @GeeconGC free coding in @Java presented @Geecon
GC free coding in @Java presented @GeeconPeter Lawrey
 
Understanding and building big data Architectures - NoSQL
Understanding and building big data Architectures - NoSQLUnderstanding and building big data Architectures - NoSQL
Understanding and building big data Architectures - NoSQLHyderabad Scalability Meetup
 
Top 10 Perl Performance Tips
Top 10 Perl Performance TipsTop 10 Perl Performance Tips
Top 10 Perl Performance TipsPerrin Harkins
 
Hadoop Summit 2015: Performance Optimization at Scale, Lessons Learned at Twi...
Hadoop Summit 2015: Performance Optimization at Scale, Lessons Learned at Twi...Hadoop Summit 2015: Performance Optimization at Scale, Lessons Learned at Twi...
Hadoop Summit 2015: Performance Optimization at Scale, Lessons Learned at Twi...Alex Levenson
 
Hadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at TwitterHadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at TwitterDataWorks Summit
 
A Domain-Specific Embedded Language for Programming Parallel Architectures.
A Domain-Specific Embedded Language for Programming Parallel Architectures.A Domain-Specific Embedded Language for Programming Parallel Architectures.
A Domain-Specific Embedded Language for Programming Parallel Architectures.Jason Hearne-McGuiness
 
Memory Optimization
Memory OptimizationMemory Optimization
Memory Optimizationguest3eed30
 
Memory Optimization
Memory OptimizationMemory Optimization
Memory OptimizationWei Lin
 
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos LinardosApache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos LinardosEuangelos Linardos
 
Introduction to source{d} Engine and source{d} Lookout
Introduction to source{d} Engine and source{d} Lookout Introduction to source{d} Engine and source{d} Lookout
Introduction to source{d} Engine and source{d} Lookout source{d}
 
Learning Puppet basic thing
Learning Puppet basic thing Learning Puppet basic thing
Learning Puppet basic thing DaeHyung Lee
 
Language-agnostic data analysis workflows and reproducible research
Language-agnostic data analysis workflows and reproducible researchLanguage-agnostic data analysis workflows and reproducible research
Language-agnostic data analysis workflows and reproducible researchAndrew Lowe
 
Bottom to Top Stack Optimization with LAMP
Bottom to Top Stack Optimization with LAMPBottom to Top Stack Optimization with LAMP
Bottom to Top Stack Optimization with LAMPkatzgrau
 
Bottom to Top Stack Optimization - CICON2011
Bottom to Top Stack Optimization - CICON2011Bottom to Top Stack Optimization - CICON2011
Bottom to Top Stack Optimization - CICON2011CodeIgniter Conference
 
Managing Your Content with Elasticsearch
Managing Your Content with ElasticsearchManaging Your Content with Elasticsearch
Managing Your Content with ElasticsearchSamantha Quiñones
 
Tips
TipsTips
Tipsmclee
 

Ähnlich wie Perly Parallel Processing of Fixed Width Data Records (20)

Perl at SkyCon'12
Perl at SkyCon'12Perl at SkyCon'12
Perl at SkyCon'12
 
GC free coding in @Java presented @Geecon
GC free coding in @Java presented @GeeconGC free coding in @Java presented @Geecon
GC free coding in @Java presented @Geecon
 
Understanding and building big data Architectures - NoSQL
Understanding and building big data Architectures - NoSQLUnderstanding and building big data Architectures - NoSQL
Understanding and building big data Architectures - NoSQL
 
Top 10 Perl Performance Tips
Top 10 Perl Performance TipsTop 10 Perl Performance Tips
Top 10 Perl Performance Tips
 
Hadoop Summit 2015: Performance Optimization at Scale, Lessons Learned at Twi...
Hadoop Summit 2015: Performance Optimization at Scale, Lessons Learned at Twi...Hadoop Summit 2015: Performance Optimization at Scale, Lessons Learned at Twi...
Hadoop Summit 2015: Performance Optimization at Scale, Lessons Learned at Twi...
 
Hadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at TwitterHadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at Twitter
 
Vmfs
VmfsVmfs
Vmfs
 
Demo 0.9.4
Demo 0.9.4Demo 0.9.4
Demo 0.9.4
 
A Domain-Specific Embedded Language for Programming Parallel Architectures.
A Domain-Specific Embedded Language for Programming Parallel Architectures.A Domain-Specific Embedded Language for Programming Parallel Architectures.
A Domain-Specific Embedded Language for Programming Parallel Architectures.
 
Memory Optimization
Memory OptimizationMemory Optimization
Memory Optimization
 
Memory Optimization
Memory OptimizationMemory Optimization
Memory Optimization
 
pm1
pm1pm1
pm1
 
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos LinardosApache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
 
Introduction to source{d} Engine and source{d} Lookout
Introduction to source{d} Engine and source{d} Lookout Introduction to source{d} Engine and source{d} Lookout
Introduction to source{d} Engine and source{d} Lookout
 
Learning Puppet basic thing
Learning Puppet basic thing Learning Puppet basic thing
Learning Puppet basic thing
 
Language-agnostic data analysis workflows and reproducible research
Language-agnostic data analysis workflows and reproducible researchLanguage-agnostic data analysis workflows and reproducible research
Language-agnostic data analysis workflows and reproducible research
 
Bottom to Top Stack Optimization with LAMP
Bottom to Top Stack Optimization with LAMPBottom to Top Stack Optimization with LAMP
Bottom to Top Stack Optimization with LAMP
 
Bottom to Top Stack Optimization - CICON2011
Bottom to Top Stack Optimization - CICON2011Bottom to Top Stack Optimization - CICON2011
Bottom to Top Stack Optimization - CICON2011
 
Managing Your Content with Elasticsearch
Managing Your Content with ElasticsearchManaging Your Content with Elasticsearch
Managing Your Content with Elasticsearch
 
Tips
TipsTips
Tips
 

Mehr von Workhorse Computing

Wheels we didn't re-invent: Perl's Utility Modules
Wheels we didn't re-invent: Perl's Utility ModulesWheels we didn't re-invent: Perl's Utility Modules
Wheels we didn't re-invent: Perl's Utility ModulesWorkhorse Computing
 
Paranormal statistics: Counting What Doesn't Add Up
Paranormal statistics: Counting What Doesn't Add UpParanormal statistics: Counting What Doesn't Add Up
Paranormal statistics: Counting What Doesn't Add UpWorkhorse Computing
 
The $path to knowledge: What little it take to unit-test Perl.
The $path to knowledge: What little it take to unit-test Perl.The $path to knowledge: What little it take to unit-test Perl.
The $path to knowledge: What little it take to unit-test Perl.Workhorse Computing
 
Generating & Querying Calendar Tables in Posgresql
Generating & Querying Calendar Tables in PosgresqlGenerating & Querying Calendar Tables in Posgresql
Generating & Querying Calendar Tables in PosgresqlWorkhorse Computing
 
Hypers and Gathers and Takes! Oh my!
Hypers and Gathers and Takes! Oh my!Hypers and Gathers and Takes! Oh my!
Hypers and Gathers and Takes! Oh my!Workhorse Computing
 
BSDM with BASH: Command Interpolation
BSDM with BASH: Command InterpolationBSDM with BASH: Command Interpolation
BSDM with BASH: Command InterpolationWorkhorse Computing
 
BASH Variables Part 1: Basic Interpolation
BASH Variables Part 1: Basic InterpolationBASH Variables Part 1: Basic Interpolation
BASH Variables Part 1: Basic InterpolationWorkhorse Computing
 
The W-curve and its application.
The W-curve and its application.The W-curve and its application.
The W-curve and its application.Workhorse Computing
 
Keeping objects healthy with Object::Exercise.
Keeping objects healthy with Object::Exercise.Keeping objects healthy with Object::Exercise.
Keeping objects healthy with Object::Exercise.Workhorse Computing
 
Perl6 Regexen: Reduce the line noise in your code.
Perl6 Regexen: Reduce the line noise in your code.Perl6 Regexen: Reduce the line noise in your code.
Perl6 Regexen: Reduce the line noise in your code.Workhorse Computing
 
Neatly Hashing a Tree: FP tree-fold in Perl5 & Perl6
Neatly Hashing a Tree: FP tree-fold in Perl5 & Perl6Neatly Hashing a Tree: FP tree-fold in Perl5 & Perl6
Neatly Hashing a Tree: FP tree-fold in Perl5 & Perl6Workhorse Computing
 

Mehr von Workhorse Computing (20)

Wheels we didn't re-invent: Perl's Utility Modules
Wheels we didn't re-invent: Perl's Utility ModulesWheels we didn't re-invent: Perl's Utility Modules
Wheels we didn't re-invent: Perl's Utility Modules
 
mro-every.pdf
mro-every.pdfmro-every.pdf
mro-every.pdf
 
Paranormal statistics: Counting What Doesn't Add Up
Paranormal statistics: Counting What Doesn't Add UpParanormal statistics: Counting What Doesn't Add Up
Paranormal statistics: Counting What Doesn't Add Up
 
The $path to knowledge: What little it take to unit-test Perl.
The $path to knowledge: What little it take to unit-test Perl.The $path to knowledge: What little it take to unit-test Perl.
The $path to knowledge: What little it take to unit-test Perl.
 
Unit Testing Lots of Perl
Unit Testing Lots of PerlUnit Testing Lots of Perl
Unit Testing Lots of Perl
 
Generating & Querying Calendar Tables in Posgresql
Generating & Querying Calendar Tables in PosgresqlGenerating & Querying Calendar Tables in Posgresql
Generating & Querying Calendar Tables in Posgresql
 
Hypers and Gathers and Takes! Oh my!
Hypers and Gathers and Takes! Oh my!Hypers and Gathers and Takes! Oh my!
Hypers and Gathers and Takes! Oh my!
 
BSDM with BASH: Command Interpolation
BSDM with BASH: Command InterpolationBSDM with BASH: Command Interpolation
BSDM with BASH: Command Interpolation
 
Findbin libs
Findbin libsFindbin libs
Findbin libs
 
Memory Manglement in Raku
Memory Manglement in RakuMemory Manglement in Raku
Memory Manglement in Raku
 
BASH Variables Part 1: Basic Interpolation
BASH Variables Part 1: Basic InterpolationBASH Variables Part 1: Basic Interpolation
BASH Variables Part 1: Basic Interpolation
 
Effective Benchmarks
Effective BenchmarksEffective Benchmarks
Effective Benchmarks
 
Metadata-driven Testing
Metadata-driven TestingMetadata-driven Testing
Metadata-driven Testing
 
The W-curve and its application.
The W-curve and its application.The W-curve and its application.
The W-curve and its application.
 
Keeping objects healthy with Object::Exercise.
Keeping objects healthy with Object::Exercise.Keeping objects healthy with Object::Exercise.
Keeping objects healthy with Object::Exercise.
 
Perl6 Regexen: Reduce the line noise in your code.
Perl6 Regexen: Reduce the line noise in your code.Perl6 Regexen: Reduce the line noise in your code.
Perl6 Regexen: Reduce the line noise in your code.
 
Smoking docker
Smoking dockerSmoking docker
Smoking docker
 
Getting Testy With Perl6
Getting Testy With Perl6Getting Testy With Perl6
Getting Testy With Perl6
 
Neatly Hashing a Tree: FP tree-fold in Perl5 & Perl6
Neatly Hashing a Tree: FP tree-fold in Perl5 & Perl6Neatly Hashing a Tree: FP tree-fold in Perl5 & Perl6
Neatly Hashing a Tree: FP tree-fold in Perl5 & Perl6
 
Neatly folding-a-tree
Neatly folding-a-treeNeatly folding-a-tree
Neatly folding-a-tree
 

Kürzlich hochgeladen

How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 

Kürzlich hochgeladen (20)

How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 

Perly Parallel Processing of Fixed Width Data Records

  • 1. Adding Forks to Your Perly Tableware Handling fixed-width records in parallel S Lteven embark W Corkhorse omputing lembark@wrkhors.com
  • 2. You may hear things like: Perl is too slow for processing fixed-width records so I wrote the latest version in C++ since it can handle the data in parallel. There are a few misunderstandings here: That Perl is slow. That Perl does not handle parallel processing well. That Perl does not handle fixed-width records. It isn't, does, and can.
  • 3. Comparing C++ and Perl: Perl's I/O is a fairly thin layer over unistd.h library calls. Perl and C++ block at the same rate. Forked processes can easily share data stream with separate file handles by letting the O/S buffer data. unpack is reasonably efficient and more dynamic than using struct's in C++.
  • 4. Most of us use variable-width, delimited records. These are the usual newline-terminated data we all know and [lh][oa][vt]e. Perl handles these via $, readline, and split or regexen. Common examples: logs or FASTA and FASTQ. Read using the buffered I/O with readline or read. Used for large records, variable or self-described data.
  • 5. Fixed-Width Records Fixed number of bytes per record. Small records with space- or zero-padding per field. Common in financial data – derived from card images used on mainframe systems. Record sizes tend to be small. Files with lots of rows leaving them “tall & narrow”.
  • 6. Fixed-width reads Perl can read them with readline, read, or sysread. read() uses Perl's buffered I/O to read characters. sysread() bypasses buffered I/O and reads bytes. $/ does record-based character I/O with maximum record size if the O/S supports it. The thinnest layer over the O/S is sysread.
  • 7. UTF8: When fixed-width isn't Traditional fixed-width data has fixed bytes per record. Records read into a fixed-width buffer N bytes at a time. UTF8 has characters not bytes. Result: Use read() with I/O disciplines to deal with data that may contain UTF8-encoded strings. Buffered I/O system deals with layered I/O and disciplines.
  • 8. Using sysread Copy up to N bytes from file handle as-is into a buffer: sysread FILEHANDLE,SCALAR,LENGTH,OFFSET Bypasses process buffers and file handles. On *NIX, Perl's sysread is a thin layer over read(2). Copies data from the kernel buffer to process space.
  • 9. Sharing a buffer Common view: threads & shared memory “right” way. Threads share the filehandle and in-process buffer. Threads are a lot of work to program & test Locking overhead may kill any time advantage. Small records can share a kernel buffer. Reads from the data stream pull in record-size chunks.
  • 10. Traditional view of file handles Handles connect to hardware. Buffer the data in their own space for efficiency. readline and read use their own buffers for this reason.
  • 11. Kernel buffers make a big difference Modern O/S use memory- mapped I/O to a kernel buffer. Data transfer via memcpy to userspace. Data faulted into the buffer. File handles read from the kernel buffer, not the device.
  • 12. Example: Stock Trading Data From the NYX daily files documentation. Kernel buffers hold ~42 Quotes in an 4KB page. Reads from multiple file handles will hit the buffer more often than the disk.
  • 13. Data Format Most of the fields are single-char tags. Record has small memory footprint. Field Offset Size Format time 0 9 hhmmssXXX msec exchange 9 1 char dictionary symbol 10 16 char 6+10 bid price 26 11 float %.04f bid size 37 7 int %7d ask price 44 11 float %.04f ask size 55 11 int %7d quote condition 62 1 char dictionary filler 63 4 text four blanks bid exchange 67 1 char dictionary ask exchange 68 1 char dictionary sequence no 69 16 int %d national bbo 85 1 digit dictionary nasdaq bbo 86 1 digit dictionary cancel/correct 87 1 char dictionary source 88 1 char dictionary retal int flag 89 1 char dictionary short sale flag 90 1 char dictionary CQS 91 1 char dictionary UTP 92 1 char dictionary finra adf 93 1 char dictionary line 94 2 text cMcJ
  • 14. Handling records Unpack is the fastest way to split this up. “A” is a space-padded ascii character sequence. “x” skips a number of bytes. No math on the values: don't convert to C int or float. DBI stringifies all values anyway. Also need to discard fixed with header record.
  • 15. Parallel processing Forks are actually simpler than they sound. CPAN makes them even simpler: I'll use Parallel::Queue for the examples here. This takes a list of closures (or a job-creator object). It forks them N-way parallel, aborting the queue if any jobs exit non-zero. Deals with exit (vs. return) to avoid phorktosis.
  • 16. Simplest approach: Manager + Worker A manager process forks off workers to read the data and then cleans up after them. Workers are given a filehandle. Each worker reads a single record with sysread. The natural order of reads will have most of the proc's doing buffer copies most of the time.
  • 17. Describing record Template uses space- padded values. “A” and a width for data loaded with DBI. “x” ignores filler. Read size == sum of A & x fields. my @fieldz = ( [ qw( time A 9 ) ], [ qw( exch A 1 ) ], ... [ qw( fill x 4 ) ], ... ); my $template = join '', map { join '', @{ $_ }[ 1, 2 ]; } @fieldz; my $size = sum map { $_->[2] } @fieldz;
  • 18. #!/bin/env perl use v5.22; use autodie; use Parallel::Queue; my ( $path, $jobs ) = getopt ...; my @fieldz = ... ; my $template = ... ; my $size = ... ; my $buffer = ''; open my $fh, '<', $path; sysread $fh, $buffer, 92 # fixed header // die "Failed header: $!"; sub read_recs { ... } my @queue = ( sub { read_recs ) x $jobs; my @failed = runqueue $jobs, @queue or die 'Failed:', @unfinished; Closures dispatch reads. Pass the queue to P::Q. Failed jobs are returned as a list of closures. Production code needs check on $. for restarts! Dispatching reads.
  • 19. Reading buffers sub read_recs { my $dbh = DBI->connect( ... ); my $sth = $dbh->prepare( ... ); my $i = 0 for(;;) { $i = sysread $fh, $buffer, $size // die; $i or last; $i == $size or die "Runt: $i at $.n"; $sth->execute ( unpack $template => $buffer ); } return } Not much code. sysread pulls whole records. Sync via blocking. Each process gets its own $dbh, $sth.
  • 20. What if the data is zipped? Same basic process: share a filehandle. open my $fh, '-|', “gzip -dc $path”; After that fork and share a file handle. Named pipes or filesystem sockets also be useful. Network sockets have packet issues with record boundaries – server needs to pre-chunk the data.
  • 21. Improving DBI performance Calling $sth->execute( @_ ) is expensive. The faster approach is calling bind and using lexical variables for the output... But that doesn't fit onto a single slide. Quick fix: bind array elements with “$row[$i]” Saves managing a dozen+ lexical variables. Half-billion records makes this worth benchmarking.
  • 22. Multiple blocks help reduce overhead Read N records < system page size. Kernel call overhead more than larger memcopy. Multiple records avoid starving jobs. Check read with ! ( $i % $size ). Apply template with substr or multiply template and chunk array.
  • 23. Using a job-object with P::Q Generate the jobs as needed. Blessed queue entry that can( 'next_job'). This can be handy for processing multiple files: The original “queue” has an object for each file. N jobs generated for each file by handler object.
  • 24. Simple benchmark open my $fh, '<', $path; my $size = ... ; my $buffer = ' ' $size; my $i = 0; my $j = 0; my $wall = 0; sysread $fh, $buffer, 92; # discard header for(;;) { $i = sysread $fh, $buffer, $size or last; $i == $size or die "Runt read: $i at $."; @row = unpack $template => $buffer; ++$j % 65_536 and next; $wall = tv_interval $t0, [ gettimeofday ]; say "$$ $j / $wall = " . int( $j / $wall ); } $i // die "Failed read: $!"; say "Final: $$ $j / $wall = " . int( $j / $wall ); Read & split rows. Ignore DBI overhead. Vary buffer size to check chunked read speed. Loop with fork to check number of jobs.
  • 25. Running a million records through... ~ 4.4 sec wallclock for 1Mrec. About 3Ksec for the daily file's full 550Mrec. Un-zipping this many records to /dev/null takes about 1.6sec. 13455 4 65536 / 1.077347 = 60830 13452 1 65536 / 1.127445 = 58127 13453 2 65536 / 1.324299 = 49487 13454 3 65536 / 1.739874 = 37667 13455 4 131072 / 2.139257 = 61269 13452 1 131072 / 2.188887 = 59880 13453 2 131072 / 2.383899 = 54982 13454 3 131072 / 3.157574 = 41510 13455 4 196608 / 3.19973 = 61445 13452 1 196608 / 3.251169 = 60473 13453 2 196608 / 3.445435 = 57063 13454 3 196608 / 4.21482 = 46646 13455 4 262144 / 4.274705 = 61324 13452 1 262144 / 4.303842 = 60909 Final: 13455 4 268606 / 4.380693 = 61315 Final: 13454 3 207351 / 4.389554 = 47237 Final: 13452 1 267971 / 4.398593 = 60921 Final: 13453 2 256075 / 4.397525 = 58231
  • 26. 10Mrec looks about the same Better per-process balance with longer running tasks. Still ~60KHz per process. 13182 4 2424832 / 39.453094 = 61461 13181 3 2490368 / 39.666493 = 62782 13179 1 2490368 / 40.221195 = 61916 13180 2 2490368 / 40.26747 = 61845
  • 27. Summary: it's easier than you think. sysread makes reading fixed records easy. Parallel::Queue makes it simple to fork. Kernel buffering makes the streaming data efficient. No need for shared memory or locking. Net result: parallel processing of fixed-width data in Perl is actually pretty easy.
  • 28. References As always, perldoc is your friend: perlpacktut perlperf perlfork Try “perldoc perl” just to see how much there is!