Memory Un-manglement With Perl



           How to do what you do
      without getting hit in the memory.

             Steven Lembark
           Workhorse Computing
In Our Last Episode...
●   We saw our hero battling the forces of ram-bloat in 
    long-running, heavily-forked, or large-scale 
    processes.
●   Learned the golden rule: Nothing Shrinks.
●   Observed memory benchmarks using Devel::Peek, 
     Devel::Size, and perl -d.
    ●   peek() shows the structure & hash efficiency.
    ●   size() & total_size() show memory usage.
Time vs. Space
●   The classic trade-off is handled in favor of time in 
    the perl implementations.
●   More efficient data structures can help both sides.
    ●   Avoiding wasted space can help avoid thrashing, heap 
        management, and system call overhead.
    ●   In some situations arrays can be both more compact 
        and faster to access than hashes.
●   Benchmarks are not only for time: include checks of 
    size(), total_size(), and peek() to see what is 
    really going on.
Nothing Ever Shrinks
●   perl maintains strings and arrays as pointers to 
    memory allocations.
    ●   Adjusting the size of a scalar with substr or a regex 
        changes its start and length.
    ●   shift and pop adjust an array's initial offset and count.
●   None of these will reduce the memory overhead of 
    the 'scaffolding' perl uses to manage the data.
Look Deep Into Your Memory
●   Devel::Peek
    ●   peek() at the structure
    ●   Shows efficiency of hashing.
●   Devel::Size
    ●   size() shows memory usage of “scaffolding”.
    ●   total_size() includes contents along with skeleton.
●   size() can be useful in loops for managing the size of 
    re-cycled buffers.
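●   A minimal sketch of both modules in use (the hash here 
    is arbitrary). Devel::Peek's exported function is Dump(); 
    Devel::Size exports size() and total_size() on request:

    use Devel::Peek;                        # exports Dump()
    use Devel::Size qw( size total_size );

    my %h = map { $_ => [ 1 .. 10 ] } 'aa' .. 'zz';

    Dump( \%h );                            # structure dumped to STDERR
    print 'size:       ', size( \%h ),       "\n";  # scaffolding only
    print 'total_size: ', total_size( \%h ), "\n";  # plus contents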
Size & Structure
●   Scalars
    ●   Reference allocations for strings with offset & length.
    ●   size() of the scalar is small, total_size() can be large.
●   Arrays
    ●   Allocated list of Scalars, also with offset & length.
    ●   size() reports space for list, total_size() includes contents.
●   Hashes
    ●   Hash chains are an array of arrays with min. 8 chains.
    ●   size() reports space for hash chains.
Taming the Beast
●   There are tools for managing the memory, most of 
    which involve some sort of time/space tradeoff.
    ●   undef can help – probably less than you think.
    ●   You can manage the lifetime of variables with lexical or 
        local values.
    ●   Re-cycling buffers localizes the bloat to one structure.
    ●   Adapting your code to use more effective data structures 
        offers the best solution for large data.
●   Here are some ideas.
undef() is somewhat helpful
●   Marks the variable for reclamation.
    ●   Space may not be immediately reclaimed – it is up to perl 
        whether to add heap or recycle the undef-ed variables.
●   Structures are discarded, not reduced.
    ●   This can have a significant performance overhead on 
        nested, re-used data structures.
●   Trade-off: space saved now vs. time spent re-building 
    the skeleton of discarded structures.
●   Most useful for recycling single-level structures.
undef-ing an Array Doesn't Zero It
●   The contents are discarded & re-allocated:

     my @a   = ();
     $#a     = 999_999;
     print "Full \@a:\t", size( \@a ), "\n";

     undef @a;
     print "Post \@a:\t", size( \@a ), "\n";

     Full @a:4000200
     Post @a:    100


●   For a large, nested structure this may not save the 
    amount of space you expect.
Recycling Buffers
●   Use size() to discard and re-allocate the buffer if it 
    grows too large.
●   Pre-allocate to avoid the margin-of-error perl adds 
    when the initial allocation grows.
●   Decent trade-off between re-allocating a buffer 
    frequently and having it grow without bounds.
●   Avoids one record botching the entire processing 
    cycle.
Scalar Buffer
●   Recycle buffer, clean it up, then copy by value.
●   Easiest with scalars since they don't have any nested 
    structure.
    while( $buffer = get_data )
    {
        $buffer =~ s/^\s+//;
        ...
        push @data, $buffer;

        if( size( $buffer ) > $max_buff )
        {
            undef $buffer;
            $buffer = ' ' x $max_buff;
        }
    }
Array Buffer
●   This works well for single-level buffers -- multi-level 
    buffers often require too much work to rebuild.
    my @buff = ();
    $#buff   = $buff_count;

    while( @buff = get_data )
    {
        ...                           # clean up buffer
        $data{ $key } = [ @buff ];    # store values

        if( size( \@buff ) > $buff_max )
        {
            undef @buff;
            $#buff = $buff_count;
        }
        }
    }
Assign Arrays Single-Pass
●   Say you have to store a large number of items:

     my( @a, @b );

     push @a, '' for ( 1 .. 1_000_000 );
     @b = map { '' } ( 1 .. 1_000_000 );

     print 'Size of @a: ', size( \@a ), "\n";
     print 'Size of @b: ', size( \@b ), "\n";


●   Push ends up with a larger structure:
    Size of @a: 4194388
    Size of @b: 4000100
Hashes are Huge
●   Incremental assignment doesn't make hashes any larger 
    than single-pass assignment: either way they are about 
    8x larger than the equivalent array.
    my %a = ();
    my %b = ();

    $a{ $_ } = '' for ( 1 .. 1_000_000 );
    %b = map { $_ => '' } ( 1 .. 1_000_000 );

    print 'Size of %a: ', size( \%a ), "\n";
    print 'Size of %b: ', size( \%b ), "\n";


     Size of %a: 32083244    # vs. 4000100
     Size of %b: 32083244    # in an array!
Two Ways of Storing Nothing
●   There are two common ways of storing nothing in 
    the values of a hash:
    ●   Assign an empty list:       $hash{ $key } = ();
    ●   Assign an empty string:     $hash{ $key } = '';
●   Question:
        Which would take less space: empty list or empty string?
TMTOWTDN
●   size() gives the same result for both values. Why?
my %a = ();
my %b = ();

$a{ $_ } = () for( 'aaa' .. 'zzz' );
$b{ $_ } = '' for( 'aaa' .. 'zzz' );

print "Size of %a:\t", size( \%a ), "\n";
print "Size of %b:\t", size( \%b ), "\n";




Size of %a:        570516     # same size for “” & ()?
Size of %b:        570516
TMTOWTDN
●   total_size() benchmarks the values:
my %a = ();
my %b = ();

$a{ $_ } = () for( 'aaa' .. 'zzz' );
$b{ $_ } = '' for( 'aaa' .. 'zzz' );

print "Size of %a:\t",  size( \%a ), "\n";
print "Size of %b:\t",  size( \%b ), "\n";

print "Total in %a:\t", total_size( \%a ), "\n";
print "Total in %b:\t", total_size( \%b ), "\n";

Size of %a:        570516     # size() doesn't always
Size of %b:        570516     # matter!

Total in %a:       851732
Total in %b:      1203252
Replace Hashes With Arrays
●   The smart-match operator (~~) is fast.
●   Pushing onto an array:
      $a ~~ @uniq or push @uniq, $a

    uses about 1/8 the space of assigning hash keys:
      $uniq{ $a } = ();
      ...
      keys %uniq
●   The extra space used by array growth in push is 
    dwarfed by the savings of an array over a hash.
●   sort @uniq is much faster than sort keys %uniq.
Example: Taxonomy Trees
●   The NCBI Taxonomy is delivered with each entry 
    having a full tree.
●   These must be reduced to a single tree for data entry 
    and validation.
●   There are several ways to do this...
Worst Solution: Parent tree.
●   Since the tree is often used from the bottom up, 
    some people store it as a child:parent relationship:
      $parentz{ $child_id } = $parent_id;
●   Unfortunately, this pays full hash-entry overhead for 
    every 1:1 relationship between a child and its parent.
Another Bad Solution: Child Tree
●   Another alternative is storing the children in a hash 
    for each parent:
      $childz{ $parent_id }{ $child_id } = ();
      $childz{ '' }       = [ $root_id ];
●   This works via depth-first search to generate the 
    trees and has space to store the tree-depth.
●   Hashes are bulky and slow for storing a single-level 
    structure like this.
Another Solution: Single-Level Hash
●   One oft-forgotten bit of Perly lore in the age of 
    references: multi-part hash keys.
      $childz{ $parent_id, $child_id } = $depth;
      $childz{ '' } = [ $root_id ];
●   Trades the wasted space of thousands of anon hashes 
    for split /$;/o, $key and greps.
●   Usable for moderate trees.
●   Obviously painful for really large trees.
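●   A sketch of pulling one parent's children back out of 
    the flat hash ($; is the subscript separator perl uses 
    to build multi-part keys; the ids here are made up):

    my %childz;

    $childz{ 'p1', $_ } = 1 for qw( c1 c2 c3 );     # sample data
    $childz{ 'p2', $_ } = 1 for qw( c4 );

    my $parent = 'p1';

    my @kidz =
        map  { ( split /$;/ )[1] }                  # keep the child part
        grep { 0 == index $_, $parent . $; }        # keys for this parent
        keys %childz;

    print "@kidz\n";                                # c1 c2 c3 (any order)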
Q: Why Nest Hashes?
●   Hashes are nice for the top-level lookup, but why 
    nest them?
    my $c = $childz{ $parent_id } ||= [];

    $new_id ~~ @$c
        or push @$c, $new_id;

●   Arrays save about 85% of the overhead below the 
    top level.
●   Any wasted space from the arrays growing via push 
    is more than saved by avoiding hashes.
●   The arrays only need to be sorted once if the tree is 
    used multiple times.
Nested Lists
●   List::Util has first(), which saves grep-ing entire lists.
●   A key and payload on an array can be handled 
    quickly.
      first { $_->[0] eq $key } @data;
●   For shorter lists this saves space and can be faster 
    than a hash.
●   This is best for numerics, which don't have to be 
    converted to text in order to be hashed: $_->[0] ==
    $value is the least amount of work to compare 
    integers.
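●   A small sketch of the key-plus-payload layout with 
    first() (the records here are made up):

    use List::Util qw( first );

    my @data =
    (
        [ 1001, 'alpha' ],
        [ 1002, 'beta'  ],
        [ 1003, 'gamma' ],
    );

    my $want  = 1002;
    my $found = first { $_->[0] == $want } @data;

    print $found->[1], "\n" if $found;      # 'beta'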
Manage Lifespans
●   Lexical variables are the obvious way to limit a lifespan.
●   Local values are another.
    ●   Saves re-allocating a set of values within tight loops in 
        the called code.
    ●   Local hash keys are a good way to manage storage in 
        re-used hashes handled with recursion.
●   Use delete to remove hash keys in multi-level 
    structures instead of assigning an empty list or ''.
    ●   This preserves the skeleton for re-cycling.
    ●   Saves storing the keys.
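●   A sketch of both ideas -- a localized hash value in a 
    recursive walk, and delete instead of assigning '' 
    (the names are hypothetical; local needs a package hash):

    our %seen;                              # local() needs a package hash

    sub walk
    {
        my $node = shift;

        local $seen{ $node->{ id } } = 1;   # restored when walk() returns

        walk( $_ ) for @{ $node->{ kidz } || [] };
    }

    my %tree = ( live => [], dead => [] );

    delete $tree{ 'dead' };                 # drops key & value, keeps %tree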
Use Simpler Objects
●   If you're using inside-out objects, why bless a hash?
    ●   Users aren't supposed to diddle around inside your 
        objects anyway.
●   The only thing you care about is the address.
●   Bless something smaller:
        my $obj = bless \( my $a ), $package;
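●   A minimal inside-out sketch built on that idea (the class 
    and attribute names are made up): the blessed scalar is 
    only an address, the data lives in lexical hashes.

    package Tiny;
    use Scalar::Util qw( refaddr );

    my %namz;                               # attributes, keyed by address

    sub new
    {
        my ( $class, $name ) = @_;

        my $obj = bless \( my $scratch ), $class;

        $namz{ refaddr $obj } = $name;

        $obj
    }

    sub name    { $namz{ refaddr $_[0] } }

    sub DESTROY { delete $namz{ refaddr shift } }

    package main;

    my $thing = Tiny->new( 'fred' );
    print $thing->name, "\n";               # 'fred'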
Use Linked Lists for Queues
●   Automatically frees discarded nodes without having 
    to modify the entire list. 
●   Based on an array, they don't use much extra data:
      $node = [ $ref_to_next, @node_data ];
●   Walking the list is simple enough:
      ( $node, my @data ) = @$node;
●   So is removing a node:
      $node->[0] = $node->[0][0];
●   These are quite convenient for threading.
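●   Putting those pieces together (the payloads are made up):

    my $head;

    # build: each node is [ link-to-next, payload ]

    $head = [ $head, $_ ] for qw( gamma beta alpha );

    # walk without destroying the list

    for( my $node = $head ; $node ; )
    {
        ( $node, my @data ) = @$node;

        print "@data\n";                    # alpha, beta, gamma
    }

    # unlink the node after the head; perl frees it automatically

    $head->[0] = $head->[0][0] if $head->[0];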
Use Hashes for Sparse Arrays
●   OK, time to stop beating up on hashes.
    ●   They beat out arrays for sparse lists.
    ●   Even lists of integers.
●   Say a collection of DNA runs from 15 to 10_000 
    bases, with only about 10% of the possible lengths present.
●   You could store it as:
        $dnaz[ $length ] = [ qw( dna dna dna ) ];
●   But this is probably better stored in a hash:
        $dnaz{ $length } = [ qw( dna dna dna ) ];
Accessing Hash Keys: Integer Slices
●   Numeric sequences work fine as hash keys.
●   Say you want to find all of the sequences within 
    +/‑10% of the current length:
    my $min = 0.9 * $length;
    my $max = 1.1 * $length;
    my @found = grep{ $_ } @dnaz{ ( $min .. $max ) };

●   For non-trivial, sparse lists this saves scaffolding by 
    only storing the structure necessary.
●   This doesn't change the data storage, just the 
    overhead for accessing it by length.
Store Upper-triangular Comparisons
●   Saves more than half the space.
●   The accessor can use $i > $j ? [$i][$j] : [$j][$i] and 
    get the same results.
●   Requires designing symmetric comparison 
    algorithms (values can be returned as-is or just 
    negated).
●   Also saves about half the processing time by only 
    generating a single comparison for each pair.
●   Requires access to the algorithm.
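●   One way the accessor might look (the array and sub 
    names are hypothetical, assuming a symmetric comparison):

    my @cmp;                        # $cmp[$i][$j] filled only for $i > $j

    sub compare
    {
        my ( $i, $j ) = @_;

        return 0 if $i == $j;       # item vs. itself

        $i > $j ? $cmp[$i][$j] : $cmp[$j][$i];
    }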
Example: DNA Analysis
●   Our W-curve analysis is used to compare large 
    groups of DNA to one another.
●   The original algorithm compared the curves until the 
    first one was exhausted.
●   Changing that to use the longer sequence in all cases 
    saved us over half the comparison time.
Summary
●   Devel::Size can be useful in your code.
●   Managing the lifespan of values helps.
●   Using efficient structures helps even more.
    ●   Use arrays instead of hash structures where they make 
        sense.
    ●   Bless smaller structures: scalars, regexen, globs make 
        perfectly good objects and take less space than hashes.
●   Use XS or Inline where necessary.
●   And, yes, size() still matters.
