Perl: When is unneeded memory of a scalar freed without going out of scope?

Question

I have an app which reads a giant chunk of textual data into a scalar, sometimes even GBs in size. I use substr on that scalar to read most of the data into another scalar and replace the extracted data with an empty string, because it is not needed in the first scalar anymore. What I've found recently was that Perl is not freeing the memory of the first scalar, while it is recognizing that its logical length has changed. So what I need to do is extract the data from the first scalar into a third again, undef the first scalar und put the extracted data back in place. Only this way the memory occupied by the first scalar is really freed up. Assigning undef to that scalar or some other value less than the allocated block of memory doesn't change anything about the allocated memory.

The following is what I do now:

     $$extFileBufferRef = substr($$contentRef, $offset, $length, '');
     $length            = length($$contentRef);
  my $content           = substr($$contentRef, 0, $length);
     $$contentRef       = undef( $$contentRef) || $content;

$$contentRef might be e.g. 5 GBs in size in the first line, I extract 4,9 GB of data and replace the extracted data. The second line would now report e.g. 100 MBs of data as the length of the string, but e.g. Devel::Size::total_size would still output that 5 GB of data are allocated for that scalar. And assigning undef or such to $$contentRef doesn't seem to change a thing about that, I need to call undef as a function on that scalar.

I would have expected that the memory behind $$contentRef is already at least partially freed after substr was applied. Doesn't seem to be the case...

So, is memory only freed if variables go out of scope? And if so, why is assigning undef different to calling undef as a function on the same scalar?

Yes, I have multiple copies of the data around for different reasons and additionally the whole process might be executed in parallel. So some wasted GBs of memory during the whole process is something I need to care about. And yes, might be bad design and all, but that's the way it is currently... — Thorsten Schöning, Sep 15 '16 at 16:07

ikegami · Accepted Answer · 2016-09-16T05:36:21.490

Your analysis is correct.

$ perl -MDevel::Peek -e'
   my $x; $x .= "x" for 1..100;
   Dump($x);
   substr($x, 50, length($x), "");
   Dump($x);
'
SV = PV(0x24208e0) at 0x243d550
  ...
  CUR = 100       # length($x) == 100
  LEN = 120       # 120 bytes are allocated for the string buffer.

SV = PV(0x24208e0) at 0x243d550
  ...
  CUR = 50        # length($x) == 50
  LEN = 120       # 120 bytes are allocated for the string buffer.

Not only does Perl overallocate strings, it doesn't even free variables that go out of scope, instead reusing them the next time the scope is entered.

$ perl -MDevel::Peek -e'
   sub f {
      my ($set) = @_;
      my $x;
      if ($set) { $x = "abc"; $x .= "def"; }
      Dump($x);
   }

   f(1);
   f(0);
'
SV = PV(0x3be74b0) at 0x3c04228   # PV: Scalar may contain a string
  REFCNT = 1
  FLAGS = (POK,pPOK)              # POK: Scalar contains a string
  PV = 0x3c0c6a0 "abcdef"\0       # The string buffer
  CUR = 6
  LEN = 10                        # Allocated size of the string buffer

SV = PV(0x3be74b0) at 0x3c04228   # Could be a different scalar at the same address,
  REFCNT = 1                      #   but it's truly the same scalar
  FLAGS = ()                      # No "OK" flags: undef
  PV = 0x3c0c6a0 "abcdef"\0       # The same string buffer
  CUR = 6
  LEN = 10                        # Allocated size of the string buffer

The logic is that if you needed the memory once, there's a strong chance you'll need it again.

For the same reason, assigning undef to a scalar doesn't free its string buffer. But Perl gives you a chance to free the buffers if you want, so passing a scalar to undef does force the freeing of the scalar's internal buffers.

$ perl -MDevel::Peek -e'
   my $x = "abc"; $x .= "def";  Dump($x);
   $x = undef;                  Dump($x);
   undef $x;                    Dump($x);
'
SV = PV(0x37d1fb0) at 0x37eec98   # PV: Scalar may contain a string
  REFCNT = 1
  FLAGS = (POK,pPOK)              # POK: Scalar contains a string
  PV = 0x37e8290 "abcdef"\0       # The string buffer
  CUR = 6
  LEN = 10                        # Allocated size of the string buffer

SV = PV(0x37d1fb0) at 0x37eec98   # PV: Scalar may contain a string
  REFCNT = 1
  FLAGS = ()                      # No "OK" flags: undef
  PV = 0x37e8290 "abcdef"\0       # The string buffer is still allcoated
  CUR = 6
  LEN = 10                        # Allocated size of the string buffer

SV = PV(0x37d1fb0) at 0x37eec98   # PV: Scalar may contain a string
  REFCNT = 1
  FLAGS = ()                      # No "OK" flags: undef
  PV = 0                          # The string buffer has been freed.

Thanks, wasn't aware of that out of scope-behaviour. Is that caching approach per OS-process or Perl interpreter? Because I'm using mod_perl and recognized that large amounts of memory are kept in the process after a request. Thought of a memory leak somewhere, but might be that "clever" caching. A process can have many interpreters in memory and if all of those cache some GBs of data I have a problem. — Thorsten Schöning, Sep 15 '16 at 16:28
Sorry, don't think I understand your answer: Many threads per process choose arbitrary interpreters available in memory in a pool in mod_perl if I'm correct. So the cached/allocated memory needs to be assigned to the interpreters and is used only if some thread executes some interpreter with such cached/allocated memory? Other interpreters don't benefit, but allocate themselfs instead. A thread leaving a interpreter doesn't frees the memory. 10 interpreters, 10 * 2 GB of data allocated for example. Only cleared when interpreters are deleted from process. Right? — Thorsten Schöning, Sep 15 '16 at 17:01
mod_perl-interpreters are not freed after usage, they stay in memory and are by default used many times by different threads. Else mod_perl itself wouldn't make sense because you don't gain any performance if all interpreters would be freed directly after requests. The code would need to be compiled over and over again... Even the docs say otherwise: https://perl.apache.org/docs/2.0/user/config/config.html#Threads_Mode_Specific_Directives — Thorsten Schöning, Sep 16 '16 at 06:23
Is that clearly stated in the docs somewhere? Doesn't make sense to me and mod_perl src has functions like `modperl_interp_[un]select` which don't seem to take any thread into account, only requests, connections or the server itself. Putting interpreters back into a pool wouldn't be needed at all if only ever used with one thread. — Thorsten Schöning, Sep 16 '16 at 07:02
Additionally, Apache has spare threads and such itself and can't know if a request needs mod_perl or not when decision is made to process it by one arbitrary thread in the pool of free threads. On fixed 1:1 relation over thread lifetime for thread and mod_perl interpreter, unnecessary interpreters would need to be cloned only because a thread is chosen with no interpreter assigned in the past, while interpreters might be freely available in the pool which were used in the past already, but not currently. — Thorsten Schöning, Sep 16 '16 at 07:09
Interpreters can be shared amongst threads, PERL_SET_CONTEXT is the keyword as described in the following source. Those calls can be find in the mod_perl src as well. http://perldoc.perl.org/perlguts.html#Should-I-do-anything-special-if-I-call-perl-from-multiple-threads%3f — Thorsten Schöning, Sep 16 '16 at 08:20

Perl: When is unneeded memory of a scalar freed without going out of scope?

1 Answers1

Linked