1. Memory Unmanglement With Perl
How to do what you do
without getting hit in the memory.
Steven Lembark
Workhorse Computing
2. In Our Last Episode...
● We saw our hero battling the forces of RAM-bloat in
long-running, heavily-forked, or large-scale
processes.
● Learned the golden rule: Nothing Shrinks.
● Observed memory benchmarks using Devel::Peek,
Devel::Size, and perl -d.
● peek() shows the structure & hash efficiency.
● size() & total_size() show memory usage.
3. Time vs. Space
● The classic tradeoff is resolved in favor of time in
perl's implementation.
● More efficient data structures can help both sides.
● Avoiding wasted space can help avoid thrashing, heap
management, and system call overhead.
● Faster access for arrays can make them more compact
and faster than hashes in some situations.
● Benchmarks are not only for time: include checks of
size(), total_size(), and peek() to see what is
really going on.
4. Nothing Ever Shrinks
● perl maintains strings and arrays as pointers to
memory allocations.
● Adjusting the size of a scalar with substr or a regex
changes its start and length.
● shift and pop adjust an array's initial offset and count.
● None of these will reduce the memory overhead of
the 'scaffolding' perl uses to manage the data.
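● A sketch of the rule in action (assuming the CPAN module Devel::Size is installed): shrinking a string leaves most of its buffer allocated.

```perl
#!/usr/bin/env perl
# Sketch: shrinking a value does not shrink its allocation.
# Assumes Devel::Size (CPAN, not core) is available.
use strict;
use warnings;
use Devel::Size qw( total_size );

my $str    = 'x' x 1_000_000;
my $before = total_size( \$str );

substr $str, 1, length( $str ) - 1, '';   # keep only the first byte

my $after = total_size( \$str );

print "before: $before, after: $after\n";
# $after typically stays close to $before: the megabyte buffer is
# still allocated, only the string's recorded length has changed.
```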
5. Look Deep Into Your Memory
● Devel::Peek
● peek() at the structure
● Shows efficiency of hashing.
● Devel::Size
● size() shows memory usage of “scaffolding”.
● total_size() includes contents along with skeleton.
● size() can be useful in loops for managing the size of
recycled buffers.
6. Size & Structure
● Scalars
● Reference allocations for strings with offset & length.
● size() of the scalar is small, total_size() can be large.
● Arrays
● Allocated list of Scalars, also with offset & length.
● size() reports space for list, total_size() includes contents.
● Hashes
● Hash chains are an array of arrays with min. 8 chains.
● size() reports space for hash chains.
7. Taming the Beast
● There are tools for managing the memory, most of
which involve some sort of time/space tradeoff.
● undef can help – probably less than you think.
● You can manage the lifetime of variables with lexical or
local values.
● Recycling buffers localizes the bloat to one structure.
● Adapting your code to use more effective data structures
offers the best solution for large data.
● Here are some ideas.
8. undef() is somewhat helpful
● Marks the variable for reclamation.
● Space may not be immediately reclaimed – it is up to
perl whether to add heap or recycle the undef'ed variables.
● Structures are discarded, not reduced.
● This can have a significant performance overhead on
nested, reused data structures.
● Tradeoff: space for time for rebuilding the skeleton
of discarded structures.
● Most useful for recycling single-level structures.
9. undef-ing an Array Doesn't Zero It
● The contents are discarded & reallocated:
my @a = ();
$#a = 999_999;
print "Full \@a:\t", size( \@a ), "\n";
undef @a;
print "Post \@a:\t", size( \@a ), "\n";
Full @a: 4000200
Post @a: 100
● For a large, nested structure this may not save the
amount of space you expect.
10. Recycling Buffers
● Use size() to discard and reallocate the buffer if it
grows too large.
● Preallocate to avoid the margin-of-error padding added
by perl when the initial allocation grows.
● Decent tradeoff between reallocating a buffer
frequently and having it grow without bounds.
● Avoids one record botching the entire processing
cycle.
11. Scalar Buffer
● Recycle buffer, clean it up, then copy by value.
● Easiest with scalars since they don't have any nested
structure.
while( $buffer = get_data )
{
    $buffer =~ s/^\s+//;
    ...
    push @data, $buffer;

    if( size( \$buffer ) > $max_buff )
    {
        undef $buffer;
        $buffer = ' ' x $max_buff;
    }
}
12. Array Buffer
● This works well for single-level buffers; multi-level
buffers often require too much work to rebuild.
my @buff = ();
$#buff = $buff_count;

while( @buff = get_data )
{
    ...   # clean up buffer

    $data{ $key } = [ @buff ];   # store values

    if( size( \@buff ) > $buff_max )
    {
        undef @buff;
        $#buff = $buff_count;
    }
}
13. Assign Arrays Single-Pass
● Say you have to store a large number of items:
my( @a, @b );
push @a, '' for ( 1 .. 1_000_000 );
@b = map { '' } ( 1 .. 1_000_000 );
print 'Size of @a: ', size( \@a ), "\n";
print 'Size of @b: ', size( \@b ), "\n";
● Push ends up with a larger structure:
Size of @a: 4194388
Size of @b: 4000100
14. Hashes are Huge
● Incremental assignment doesn't make hashes larger:
they are 8x larger than arrays in both cases.
my %a = ();
my %b = ();
$a{ $_ } = '' for ( 1 .. 1_000_000 );
%b = map { $_ => '' } ( 1 .. 1_000_000 );
print 'Size of %a: ', size( \%a ), "\n";
print 'Size of %b: ', size( \%b ), "\n";
Size of %a: 32083244 # vs. 4000100
Size of %b: 32083244 # in an array!
15. Two Ways of Storing Nothing
● There are two common ways of storing nothing in
the values of a hash:
● Assign an empty list: $hash{ $key } = ();
● Assign an empty string: $hash{ $key } = '';
● Question:
Which would take less space: empty list or empty string?
16. TMTOWTDN
● size() gives the same result for both values. Why?
my %a = ();
my %b = ();
$a{ $_ } = () for ( 'aaa' .. 'zzz' );
$b{ $_ } = '' for ( 'aaa' .. 'zzz' );
print "Size of %a:\t", size( \%a ), "\n";
print "Size of %b:\t", size( \%b ), "\n";
Size of %a: 570516 # same size for '' & ()?
Size of %b: 570516
17. TMTOWTDN
● total_size() benchmarks the values:
my %a = ();
my %b = ();
$a{ $_ } = () for ( 'aaa' .. 'zzz' );
$b{ $_ } = '' for ( 'aaa' .. 'zzz' );
print "Size of %a:\t", size( \%a ), "\n";
print "Size of %b:\t", size( \%b ), "\n";
print "Total in %a:\t", total_size( \%a ), "\n";
print "Total in %b:\t", total_size( \%b ), "\n";
Size of %a: 570516 # size() doesn't always
Size of %b: 570516 # matter!
Total in %a: 851732
Total in %b: 1203252
18. Replace Hashes With Arrays
● The smartmatch operator (“~~”) is fast.
● Pushing onto an array:
$a ~~ @uniq or push @uniq, $a
uses about 1/8 the space of assigning hash keys:
$uniq{ $a } = ();
...
keys %uniq
● The extra space used by array growth in push is
dwarfed by the savings of an array over a hash.
● sort @uniq is much faster than sort keys %uniq.
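● A sketch of the space difference (Devel::Size from CPAN is assumed; grep stands in for the smartmatch test so the sketch also runs on perls where "~~" is deprecated):

```perl
use strict;
use warnings;
use Devel::Size qw( size );

my( @uniq, %uniq );

for my $item ( map { $_ % 1_000 } 1 .. 10_000 )
{
    # membership test via grep, standing in for: $item ~~ @uniq
    grep { $_ == $item } @uniq
        or push @uniq, $item;

    $uniq{ $item } = '';
}

# both now track the same 1_000 unique values
printf "array: %d bytes, hash: %d bytes\n",
    size( \@uniq ), size( \%uniq );
```

The array's scaffolding is only the pointer list; the hash also carries its chains and stored keys.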
19. Example: Taxonomy Trees
● The NCBI Taxonomy is delivered with each entry
having a full tree.
● These must be reduced to a single tree for data entry
and validation.
● There are several ways to do this...
20. Worst Solution: Parent Tree
● Since the tree is often used from the bottom up,
some people store it as a child:parent relationship:
$parentz{ $child_id } = $parent_id;
● Unfortunately, this allocates a full hash table for
each 1:1 relationship between a child and parent.
21. Another Bad Solution: Child Tree
● Another alternative is storing the children in a hash
for each parent:
$childz{ $parent_id }{ $child_id } = ();
$childz{ '' } = [ $root_id ];
● This works via depth-first search to generate the
trees and has space to store the tree depth.
● Hashes are bulky and slow for storing a single-level
structure like this.
22. Another Solution: Single-Level Hash
● One oft-forgotten bit of Perly lore in the age of
references: multi-part hash keys.
$childz{ $parent_id, $child_id } = $depth;
$childz{ '' } = [ $root_id ];
● Trades wasted space in thousands of anon hashes for
split /$;/o, $key and greps.
● Usable for moderate trees.
● Obviously painful for really large trees.
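● For illustration, a minimal sketch of multi-part keys: perl joins the subscripts with $; (default "\034"), and split /$;/ recovers them.

```perl
use strict;
use warnings;

my %childz;

# multi-part keys: perl joins these with $; behind the scenes
$childz{ 'root', 'node1' } = 1;
$childz{ 'root', 'node2' } = 2;

for my $key ( sort keys %childz )
{
    # split the composite key back into its parts
    my( $parent, $child ) = split /$;/, $key;
    print "$parent -> $child : $childz{ $key }\n";
}
```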
23. Q: Why Nest Hashes?
● Hashes are nice for the top-level lookup, but why
nest them?
my $c = $childz{ $parent_id } ||= [];
$new_id ~~ @$c
or push @$c, $new_id;
● Arrays save about 85% of the overhead below the
top level.
● Any wasted space from the arrays growing via push
is more than saved by avoiding hashes.
● The arrays only need to be sorted once if the tree is
used multiple times.
24. Nested Lists
● List::Util has first() which saves grepping entire lists.
● A key and payload on an array can be handled
quickly.
first { $_->[0] eq $key } @data;
● For shorter lists this saves space and can be faster
than a hash.
● This is best for numerics, which don't have to be
converted to text in order to be hashed: $_->[0] ==
$value is the least amount of work to compare
integers.
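● A self-contained sketch of the key-and-payload lookup with first() (the @data layout here is invented for illustration):

```perl
use strict;
use warnings;
use List::Util qw( first );

# each entry is [ key, payload ]
my @data =
(
    [ 1, 'alpha' ],
    [ 2, 'beta'  ],
    [ 3, 'gamma' ],
);

my $key = 2;

# stop at the first match instead of grepping the whole list
my $hit = first { $_->[0] == $key } @data;

print $hit->[1], "\n";   # beta
```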
25. Manage Lifespans
● Lexical variables are an obvious place.
● Local values are another.
● Saves reallocating a set of values within tight loops in
the called code.
● Local hash keys are a good way to manage storage in
reused hashes handled with recursion.
● Use delete to remove hash keys in multilevel
structures instead of assigning an empty list or “”.
● This preserves the skeleton for recycling.
● Saves storing the keys.
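● The difference shows in a small sketch: assigning leaves the key and its slot in place, delete removes both.

```perl
use strict;
use warnings;

my %h = ( a => 1, b => 2 );

$h{ a } = '';     # key 'a' stays, now holding an empty string
delete $h{ b };   # key 'b' and its storage are gone

print exists $h{ a } ? "a stays\n" : "a gone\n";   # a stays
print exists $h{ b } ? "b stays\n" : "b gone\n";   # b gone
```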
26. Use Simpler Objects
● If you're using inside-out objects, why bless a hash?
● Users aren't supposed to diddle around inside your
objects anyway.
● The only thing you care about is the address.
● Bless something smaller:
my $obj = bless \( my $a ), $package;
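● A minimal inside-out sketch along these lines (the Counter class and its attribute hash are invented for illustration): the blessed scalar supplies only an identity, and attributes live in a lexical hash keyed by refaddr.

```perl
package Counter;
use strict;
use warnings;
use Scalar::Util qw( refaddr );

my %count_of;   # attribute storage, keyed by object address

sub new
{
    my $class = shift;

    # bless a scalar: all we need is an address, not a whole hash
    my $obj = bless \( my $scalar ), $class;

    $count_of{ refaddr $obj } = 0;
    $obj;
}

sub bump  { ++$count_of{ refaddr $_[0] } }
sub count { $count_of{ refaddr $_[0] } }

package main;

my $obj = Counter->new;
$obj->bump for 1 .. 3;
print $obj->count, "\n";   # 3
```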
27. Use Linked Lists for Queues
● Automatically frees discarded nodes without having
to modify the entire list.
● Based on an array, they don't use much extra space:
$node = [ $ref_to_next, @node_data ];
● Walking the list is simple enough:
( $node, my @data ) = @$node;
● So is removing a node:
$node->[0] = $node->[0][0];
● These are quite convenient for threading.
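● The snippets above combine into a runnable sketch (the three-node list here is invented for illustration):

```perl
use strict;
use warnings;

# build [ $ref_to_next, @node_data ] nodes back to front
my $node = [ undef, 'c' ];
$node = [ $node, $_ ] for 'b', 'a';

# walk the list; each node left behind is freed automatically
my @seen;
while( $node )
{
    ( $node, my @data ) = @$node;
    push @seen, @data;
}

print "@seen\n";   # a b c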
28. Use Hashes for Sparse Arrays
● OK, time to stop beating up on hashes.
● They beat out arrays for sparse lists.
● Even lists of integers.
● Say a collection of DNA sequences runs from 15 to
10_000 bases, with only about 10% of the possible
lengths present.
● You could store it as:
$dnaz[ $length ] = [ qw( dna dna dna ) ];
● But this is probably better stored in a hash:
$dnaz{ $length } = [ qw( dna dna dna ) ];
29. Accessing Hash Keys: Integer Slices
● Numeric sequences work fine as hash keys.
● Say you want to find all of the sequences within
+/-10% of the current length:
my $min = int( 0.9 * $length );
my $max = int( 1.1 * $length );
my @found = grep { $_ } @dnaz{ $min .. $max };
● For non-trivial, sparse lists this saves scaffolding by
only storing the structure necessary.
● This doesn't change the data storage, just the
overhead for accessing it by length.
30. Store Upper-Triangular Comparisons
● Saves more than half the space.
● Accessor can look for $i > $j ? [$i][$j] : [$j][$i] and
get the same results.
● Requires designing symmetric comparison
algorithms (values can be returned as-is or just
negated).
● Also saves about half the processing time by
generating a single comparison for each pair.
● Requires access to the algorithm.
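● A sketch of such an accessor (names are invented for illustration): swap the indices so only one triangle is ever touched.

```perl
use strict;
use warnings;

my @cmp;   # only cells with $i >= $j are ever populated

sub set_cmp
{
    my( $i, $j, $value ) = @_;

    # normalize the order so both (i,j) and (j,i) hit one cell
    ( $i, $j ) = ( $j, $i ) if $j > $i;

    $cmp[ $i ][ $j ] = $value;
}

sub get_cmp
{
    my( $i, $j ) = @_;
    ( $i, $j ) = ( $j, $i ) if $j > $i;
    $cmp[ $i ][ $j ];
}

set_cmp( 2, 5, 0.75 );
print get_cmp( 5, 2 ), "\n";   # 0.75, from either order
```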
31. Example: DNA Analysis
● Our W-curve analysis is used to compare large
groups of DNA to one another.
● The original algorithm compared the curves until the
first one was exhausted.
● Changing that to use the longer sequence in all cases
saved us over half the comparison time.
32. Summary
● Devel::Size can be useful in your code.
● Managing the lifespan of values helps.
● Using efficient structures helps even more.
● Use arrays instead of hash structures where they make
sense.
● Bless smaller structures: scalars, regexen, globs make
perfectly good objects and take less space than hashes.
● Use XS or Inline where necessary.
● And, yes, size() still matters.