1. Memory Unmanglement With Perl
How to do what you do
without getting hit in the memory.
Steven Lembark
Workhorse Computing
2. In Our Last Episode...
● We saw our hero battling the forces of RAM-bloat in
long-running, heavily-forked, or large-scale
processes.
● Learned the golden rule: Nothing Shrinks.
● Observed memory benchmarks using Devel::Peek,
Devel::Size, and perl -d.
● peek() shows the structure & hash efficiency.
● size() & total_size() show memory usage.
3. Time vs. Space
● The classic tradeoff is resolved in favor of time in
perl's implementation.
● More efficient data structures can help both sides.
● Avoiding wasted space can help avoid thrashing, heap
management, and system call overhead.
● Faster access for arrays can make them more compact
and faster than hashes in some situations.
● Benchmarks are not only for time: include checks of
size(), total_size(), and peek() to see what is
really going on.
4. Nothing Ever Shrinks
● perl maintains strings and arrays as pointers to
memory allocations.
● Adjusting the size of a scalar with substr or a regex
changes its start and length.
● shift and pop adjust an array's initial offset and count.
● None of these will reduce the memory overhead of
the 'scaffolding' perl uses to manage the data.
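● A sketch of the rule in action (assuming the CPAN module Devel::Size is installed): shrinking a string leaves most of its buffer allocated.

```perl
#!/usr/bin/env perl
# Sketch: shrinking a value does not shrink its allocation.
# Assumes Devel::Size (CPAN, not core) is available.
use strict;
use warnings;
use Devel::Size qw( total_size );

my $str    = 'x' x 1_000_000;
my $before = total_size( \$str );

substr $str, 1, length( $str ) - 1, '';   # keep only the first byte

my $after = total_size( \$str );

print "before: $before, after: $after\n";
# $after typically stays close to $before: the megabyte buffer is
# still allocated, only the string's recorded length has changed.
```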
5. Look Deep Into Your Memory
● Devel::Peek
● peek() at the structure
● Shows efficiency of hashing.
● Devel::Size
● size() shows memory usage of “scaffolding”.
● total_size() includes contents along with skeleton.
● size() can be useful in loops for managing the size of
recycled buffers.
6. Size & Structure
● Scalars
● Reference allocations for strings with offset & length.
● size() of the scalar is small, total_size() can be large.
● Arrays
● Allocated list of Scalars, also with offset & length.
● size() reports space for list, total_size() includes contents.
● Hashes
● Hash chains are an array of arrays with min. 8 chains.
● size() reports space for hash chains.
7. Taming the Beast
● There are tools for managing the memory, most of
which involve some sort of time/space tradeoff.
● undef can help – probably less than you think.
● You can manage the lifetime of variables with lexical or
local values.
● Recycling buffers localizes the bloat to one structure.
● Adapting your code to use more effective data structures
offers the best solution for large data.
● Here are some ideas.
8. undef() is somewhat helpful
● Marks the variable for reclamation.
● Space may not be immediately reclaimed – it is up to
perl whether to add heap or recycle the undef'ed variables.
● Structures are discarded, not reduced.
● This can have a significant performance overhead on
nested, reused data structures.
● Tradeoff: space for time for rebuilding the skeleton
of discarded structures.
● Most useful for recycling single-level structures.
9. undef-ing an Array Doesn't Zero It
● The contents are discarded & reallocated:
my @a = ();
$#a = 999_999;
print "Full \@a:\t", size( \@a ), "\n";
undef @a;
print "Post \@a:\t", size( \@a ), "\n";
Full @a: 4000200
Post @a: 100
● For a large, nested structure this may not save the
amount of space you expect.
10. Recycling Buffers
● Use size() to discard and reallocate the buffer if it
grows too large.
● Preallocate to avoid the margin-of-error padding added
by perl when the initial allocation grows.
● Decent tradeoff between reallocating a buffer
frequently and having it grow without bounds.
● Avoids one record botching the entire processing
cycle.
11. Scalar Buffer
● Recycle buffer, clean it up, then copy by value.
● Easiest with scalars since they don't have any nested
structure.
while( $buffer = get_data )
{
    $buffer =~ s/^\s+//;
    ...
    push @data, $buffer;

    if( size( \$buffer ) > $max_buff )
    {
        undef $buffer;
        $buffer = ' ' x $max_buff;
    }
}
12. Array Buffer
● This works well for single-level buffers; multi-level
buffers often require too much work to rebuild.
my @buff = ();
$#buff = $buff_count;

while( @buff = get_data )
{
    ...   # clean up buffer

    $data{ $key } = [ @buff ];   # store values

    if( size( \@buff ) > $buff_max )
    {
        undef @buff;
        $#buff = $buff_count;
    }
}
13. Assign Arrays Single-Pass
● Say you have to store a large number of items:
my( @a, @b );
push @a, '' for ( 1 .. 1_000_000 );
@b = map { '' } ( 1 .. 1_000_000 );
print 'Size of @a: ', size( \@a ), "\n";
print 'Size of @b: ', size( \@b ), "\n";
● Push ends up with a larger structure:
Size of @a: 4194388
Size of @b: 4000100
14. Hashes are Huge
● Incremental assignment doesn't make hashes larger:
they are 8x larger than arrays in both cases.
my %a = ();
my %b = ();
$a{ $_ } = '' for ( 1 .. 1_000_000 );
%b = map { $_ => '' } ( 1 .. 1_000_000 );
print 'Size of %a: ', size( \%a ), "\n";
print 'Size of %b: ', size( \%b ), "\n";
Size of %a: 32083244 # vs. 4000100
Size of %b: 32083244 # in an array!
15. Two Ways of Storing Nothing
● There are two common ways of storing nothing in
the values of a hash:
● Assign an empty list: $hash{ $key } = ();
● Assign an empty string: $hash{ $key } = '';
● Question:
Which would take less space: empty list or empty string?
16. TMTOWTDN
● size() gives the same result for both values. Why?
my %a = ();
my %b = ();
$a{ $_ } = () for ( 'aaa' .. 'zzz' );
$b{ $_ } = '' for ( 'aaa' .. 'zzz' );
print "Size of %a:\t", size( \%a ), "\n";
print "Size of %b:\t", size( \%b ), "\n";
Size of %a: 570516 # same size for '' & ()?
Size of %b: 570516
17. TMTOWTDN
● total_size() benchmarks the values:
my %a = ();
my %b = ();
$a{ $_ } = () for ( 'aaa' .. 'zzz' );
$b{ $_ } = '' for ( 'aaa' .. 'zzz' );
print "Size of %a:\t", size( \%a ), "\n";
print "Size of %b:\t", size( \%b ), "\n";
print "Total in %a:\t", total_size( \%a ), "\n";
print "Total in %b:\t", total_size( \%b ), "\n";
Size of %a: 570516 # size() doesn't always
Size of %b: 570516 # matter!
Total in %a: 851732
Total in %b: 1203252
18. Replace Hashes With Arrays
● The smartmatch operator (“~~”) is fast.
● Pushing onto an array:
$a ~~ @uniq or push @uniq, $a
uses about 1/8 the space of assigning hash keys:
$uniq{ $a } = ();
...
keys %uniq
● The extra space used by array growth in push is
dwarfed by the savings of an array over a hash.
● sort @uniq is much faster than sort keys %uniq.
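● A sketch of the space difference (Devel::Size from CPAN is assumed; grep stands in for the smartmatch test so the sketch also runs on perls where "~~" is deprecated):

```perl
use strict;
use warnings;
use Devel::Size qw( size );

my( @uniq, %uniq );

for my $item ( map { $_ % 1_000 } 1 .. 10_000 )
{
    # membership test via grep, standing in for: $item ~~ @uniq
    grep { $_ == $item } @uniq
        or push @uniq, $item;

    $uniq{ $item } = '';
}

# both now track the same 1_000 unique values
printf "array: %d bytes, hash: %d bytes\n",
    size( \@uniq ), size( \%uniq );
```

The array's scaffolding is only the pointer list; the hash also carries its chains and stored keys.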
19. Example: Taxonomy Trees
● The NCBI Taxonomy is delivered with each entry
having a full tree.
● These must be reduced to a single tree for data entry
and validation.
● There are several ways to do this...
20. Worst Solution: Parent Tree
● Since the tree is often used from the bottom up,
some people store it as a child:parent relationship:
$parentz{ $child_id } = $parent_id;
● Unfortunately, this allocates a full hash table for
each 1:1 relationship between a child and parent.
21. Another Bad Solution: Child Tree
● Another alternative is storing the children in a hash
for each parent:
$childz{ $parent_id }{ $child_id } = ();
$childz{ '' } = [ $root_id ];
● This works via depth-first search to generate the
trees and has space to store the tree depth.
● Hashes are bulky and slow for storing a single-level
structure like this.
22. Another Solution: Single-Level Hash
● One oft-forgotten bit of Perly lore in the age of
references: multi-part hash keys.
$childz{ $parent_id, $child_id } = $depth;
$childz{ '' } = [ $root_id ];
● Trades wasted space in thousands of anon hashes for
split /$;/o, $key and greps.
● Usable for moderate trees.
● Obviously painful for really large trees.
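● For illustration, a minimal sketch of multi-part keys: perl joins the subscripts with $; (default "\034"), and split /$;/ recovers them.

```perl
use strict;
use warnings;

my %childz;

# multi-part keys: perl joins these with $; behind the scenes
$childz{ 'root', 'node1' } = 1;
$childz{ 'root', 'node2' } = 2;

for my $key ( sort keys %childz )
{
    # split the composite key back into its parts
    my( $parent, $child ) = split /$;/, $key;
    print "$parent -> $child : $childz{ $key }\n";
}
```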
23. Q: Why Nest Hashes?
● Hashes are nice for the top-level lookup, but why
nest them?
my $c = $childz{ $parent_id } ||= [];
$new_id ~~ @$c
or push @$c, $new_id;
● Arrays save about 85% of the overhead below the
top level.
● Any wasted space from the arrays growing via push
is more than saved by avoiding hashes.
● The arrays only need to be sorted once if the tree is
used multiple times.
24. Nested Lists
● List::Util has first() which saves grepping entire lists.
● A key and payload on an array can be handled
quickly.
first { $_->[0] eq $key } @data;
● For shorter lists this saves space and can be faster
than a hash.
● This is best for numerics, which don't have to be
converted to text in order to be hashed: $_->[0] ==
$value is the least amount of work to compare
integers.
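● A self-contained sketch of the key-and-payload lookup with first() (the @data layout here is invented for illustration):

```perl
use strict;
use warnings;
use List::Util qw( first );

# each entry is [ key, payload ]
my @data =
(
    [ 1, 'alpha' ],
    [ 2, 'beta'  ],
    [ 3, 'gamma' ],
);

my $key = 2;

# stop at the first match instead of grepping the whole list
my $hit = first { $_->[0] == $key } @data;

print $hit->[1], "\n";   # beta
```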
25. Manage Lifespans
● Lexical variables are an obvious place.
● Local values are another.
● Saves reallocating a set of values within tight loops in
the called code.
● Local hash keys are a good way to manage storage in
reused hashes handled with recursion.
● Use delete to remove hash keys in multilevel
structures instead of assigning an empty list or “”.
● This preserves the skeleton for recycling.
● Saves storing the keys.
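● The difference shows in a small sketch: assigning leaves the key and its slot in place, delete removes both.

```perl
use strict;
use warnings;

my %h = ( a => 1, b => 2 );

$h{ a } = '';     # key 'a' stays, now holding an empty string
delete $h{ b };   # key 'b' and its storage are gone

print exists $h{ a } ? "a stays\n" : "a gone\n";   # a stays
print exists $h{ b } ? "b stays\n" : "b gone\n";   # b gone
```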
26. Use Simpler Objects
● If you're using inside-out objects, why bless a hash?
● Users aren't supposed to diddle around inside your
objects anyway.
● The only thing you care about is the address.
● Bless something smaller:
my $obj = bless \( my $a ), $package;
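● A minimal inside-out sketch along these lines (the Counter class and its attribute hash are invented for illustration): the blessed scalar supplies only an identity, and attributes live in a lexical hash keyed by refaddr.

```perl
package Counter;
use strict;
use warnings;
use Scalar::Util qw( refaddr );

my %count_of;   # attribute storage, keyed by object address

sub new
{
    my $class = shift;

    # bless a scalar: all we need is an address, not a whole hash
    my $obj = bless \( my $scalar ), $class;

    $count_of{ refaddr $obj } = 0;
    $obj;
}

sub bump  { ++$count_of{ refaddr $_[0] } }
sub count { $count_of{ refaddr $_[0] } }

package main;

my $obj = Counter->new;
$obj->bump for 1 .. 3;
print $obj->count, "\n";   # 3
```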
27. Use Linked Lists for Queues
● Automatically frees discarded nodes without having
to modify the entire list.
● Based on an array, they don't use much extra space:
$node = [ $ref_to_next, @node_data ];
● Walking the list is simple enough:
( $node, my @data ) = @$node;
● So is removing a node:
$node->[0] = $node->[0][0];
● These are quite convenient for threading.
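● The snippets above combine into a runnable sketch (the three-node list here is invented for illustration):

```perl
use strict;
use warnings;

# build [ $ref_to_next, @node_data ] nodes back to front
my $node = [ undef, 'c' ];
$node = [ $node, $_ ] for 'b', 'a';

# walk the list; each node left behind is freed automatically
my @seen;
while( $node )
{
    ( $node, my @data ) = @$node;
    push @seen, @data;
}

print "@seen\n";   # a b c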
28. Use Hashes for Sparse Arrays
● OK, time to stop beating up on hashes.
● They beat out arrays for sparse lists.
● Even lists of integers.
● Say a collection of DNA sequences runs from 15 to
10_000 bases, with only about 10% of the possible
lengths present.
● You could store it as:
$dnaz[ $length ] = [ qw( dna dna dna ) ];
● But this is probably better stored in a hash:
$dnaz{ $length } = [ qw( dna dna dna ) ];
29. Accessing Hash Keys: Integer Slices
● Numeric sequences work fine as hash keys.
● Say you want to find all of the sequences within
+/-10% of the current length:
my $min = int( 0.9 * $length );
my $max = int( 1.1 * $length );
my @found = grep { $_ } @dnaz{ $min .. $max };
● For non-trivial, sparse lists this saves scaffolding by
only storing the structure necessary.
● This doesn't change the data storage, just the
overhead for accessing it by length.
30. Store Upper-Triangular Comparisons
● Saves more than half the space.
● Accessor can look for $i > $j ? [$i][$j] : [$j][$i] and
get the same results.
● Requires designing symmetric comparison
algorithms (values can be returned as-is or just
negated).
● Also saves about half the processing time by
generating a single comparison for each pair.
● Requires access to the algorithm.
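● A sketch of such an accessor (names are invented for illustration): swap the indices so only one triangle is ever touched.

```perl
use strict;
use warnings;

my @cmp;   # only cells with $i >= $j are ever populated

sub set_cmp
{
    my( $i, $j, $value ) = @_;

    # normalize the order so both (i,j) and (j,i) hit one cell
    ( $i, $j ) = ( $j, $i ) if $j > $i;

    $cmp[ $i ][ $j ] = $value;
}

sub get_cmp
{
    my( $i, $j ) = @_;
    ( $i, $j ) = ( $j, $i ) if $j > $i;
    $cmp[ $i ][ $j ];
}

set_cmp( 2, 5, 0.75 );
print get_cmp( 5, 2 ), "\n";   # 0.75, from either order
```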
31. Example: DNA Analysis
● Our W-curve analysis is used to compare large
groups of DNA to one another.
● The original algorithm compared the curves until the
first one was exhausted.
● Changing that to use the longer sequence in all cases
saved us over half the comparison time.
32. Summary
● Devel::Size can be useful in your code.
● Managing the lifespan of values helps.
● Using efficient structures helps even more.
● Use arrays instead of hash structures where they make
sense.
● Bless smaller structures: scalars, regexen, globs make
perfectly good objects and take less space than hashes.
● Use XS or Inline where necessary.
● And, yes, size() still matters.