[Bioperl-l] Question regarding while loops for reading files

Thu Feb 14 20:04:44 UTC 2013

I think that it's important to get feedback on code that one has
written and to try to understand how/what/why someone else has done in
their code.  To that end....

Since Tiago's using this to learn the language better I can't resist
some comments beyond resetting the file handle.

For grins I rewrote it using Text::CSV_XS and Statistics::Basic and to
take a single pass through the data file using a multilevel data
structure.

I resisted the urge to rewrite it in Moose.  Didn't even have an urge
to rewrite it in R.  Funny, that....

The script is here

  Tiago.pl
    https://gist.github.com/hartzell/4955401

With something like what I think the data looks like here:

    https://gist.github.com/hartzell/4955570

Even without that big of a rewrite, I had a bunch of local comments
which are inline below.

Daisie Huang writes:
 > [...]
 > On Wednesday, February 13, 2013 6:46:41 PM UTC-8, Tiago Hori wrote:
 > >
 > > Hey Guys,
 > >
 > > I am still at the same place. I am writing these little pieces of code to 
 > > try to learn the language better, so any advice would be useful.
 > > [...]
 > > here is the code:
 > >
 > > #! /usr/bin/perl
 > > use strict;
 > > use warnings;
 > >
 > > my $file = $ARGV[0];

Slightly better would be $filename, so that when you step up to
Path::Class you can differentiate a file object from a file name
string.

 > > my @family = ('AS5','AS9');

Better would be @families, plural.  See the use of $family below.

 > > my $i;
 > > my $ii;

As far as I can tell, these are just counting the number of things
that you push onto the various arrays.  You don't need them, referring
to the list in scalar context will give you its size.

 > > my $test;

You use this to hold the name of the family, so it's not particularly
evocative.  You should also restrict it's scope to within the loop.
See the comment for the foreach loop.

 > > open (my $fh, "<", $file) or die ("Can't open $file: $!");

You made my day, three arg. open *and* you checked for errors.  Nice!

 > > foreach (@family){

Better would be

  for my $family (@families) {

which is evocative and restricts the scope of $family to the for loop
(and for is 4 characters shorter than foreach...).

 > >     $test = $_;

No longer need this, using $family declared in the for loop with the
proper scoping.

 > >     my @data_weight_2N = ();
 > >     my @data_weight_3N = ();
 > >     while (<$fh>){
 > >         chomp;  
 > >         my $line = $_;
 > >         my @data  = split ("\t", $line);

Don't parse CSV (TSV) files yourself.  Get in the habit of using
Text::CSV_XS.

 > >         if ($data[0] !~ /[0-9]*/){
 > >         next;}
 > >         elsif ($data[1] eq "ABF09-$test"){
 > >             $i += 1; 

You don't need the counter.

 > >             push (@data_weight_2N,  $data[6]);
 > >         }elsif ($data[1] eq "ABF09-".$test."PS"){
 > >         $ii += 1;

You don't need the counter.

 > >             push (@data_weight_3N,$data[6]);
 > >     }
 > > }
 > >     my $mean_2N = &average (\@data_weight_2N);
 > >     my $stdev_2N = &stdev (\@data_weight_2N);

You don't need the ampersands on the subroutine calls.  They're old
school <joke> and just encourage people to make fun of our language for its
use of all those funny punctuation marks </joke>.

 > >     my $stderr_2N = ($stdev_2N/sqrt($i));

Unless I'm mistaken, this is equivalent

    my $stderr_2N = ($stdev_2N/sqrt(scalar @data_weight_2N));

and you don't need the counter, the explicit use of scalar there might
even be redundant (I'm a coward).  You use the same trick in your
subroutine defn's below.

 > >
 > >     print "These are the the avearge weight, stdev and stderr for $test 
 > > 2N:\t", $mean_2N,"\t",$stdev_2N,"\t",$stderr_2N, "\n";
 > >
 > >     my $mean_3N = &average (\@data_weight_3N);
 > >     my $stdev_3N = &stdev (\@data_weight_3N);
 > >     my $stderr_3N = ($stdev_3N/sqrt($i));
 > >
 > >     print "These are the the avearge weight, stdev and stderr for $test 
 > > 3N:\t", $mean_3N,"\t",$stdev_3N,"\t",$stderr_3N, "\n";
 > > }
 > >
 > > close ($fh);

Ah, rats.  You checked whether open worked, you need to do the same
thing on close too!

  close ($fh) or die !$;

Or you could just

  use autodie qw(open close);

and then they'll die appropriately when they have to and you don't
have to bother with the checking.

 > > sub average{
 > >         my($data) = @_;
 > >         if (not @$data) {
 > >                 print ("Empty array\n");
 > >                 return 0;
 > >         }
 > >         my $total = 0;
 > >         foreach (@$data) {
 > >                 $total += $_;
 > >         }

  use List::AllUtils qw(sum); # somewhere up at the top of the script...

  my $total = sum(@$data);
  if (not defined $total) {
     print "Empty array\n";
     return;
  }

List::AllUtils is your friend.  Learn to use it.

Your returning 0 for an empty list is probably the wrong thing, isn't
it possible to the total to actually be 0?  Just return instead.
Don't return undef, just return (and let perl take context into
account for you).

You probably don't actually want to spew "Empty array" out into your
output stream, imagine writing a script that postprocesses your output
and having to deal with it.  If you really need to say it, send it to
standard error with

  print STDERR "Empty array\n";

 > >         my $average = $total / @$data;
 > >         return $average;

If you don't really need the error message, then you can get to

  my $total = sum(@$data);
  return unless $total;
  return $total / @$data;

And if an empty data array is *truly* unexpected, maybe you should
just die/carp.

 > > }
 > >
 > > sub stdev{
 > >         my($data) = @_;
 > >         if(@$data == 1){
 > >                 return 0;
 > >         }
 > >         my $average = &average($data);
 > >         my $sqtotal = 0;
 > >         foreach(@$data) {
 > >                 $sqtotal += ($average-$_) ** 2;
 > >         }
 > >         my $std = ($sqtotal / (@$data-1)) ** 0.5;
 > >         return $std;
 > > }

Ditto on the use of List::AllUtils, etc...

Phew.

The only other thing I'd like to see would be an arrangement that
let's you write simple tests.  A simple sol'n would be to package the
entire main part of the code up into e.g. a subroutine that returns a
hashref keyed by family, containing a hashref keyed by 2N/3N/... and
then you could just:

  use Test::More;

  use Tiago qw(summarize);

  my $output = summarize("test_data.tsv");

  is($output->{AS5}->{'2N}, "42", "Got the magic number")

  # etc...

  done_testing;

Thanks for sharing your code.  Keep practicing!

g.