[Bioperl-l] SeqIO::embl/SeqIO::genbank/SeqIO::swiss

James Smith js5@sanger.ac.uk
Mon, 17 Dec 2001 14:55:49 +0000 (GMT)


While developing an EMBL/GenBank exporter for EnsEMBL, we
noticed that these modules can get rather slow when a
"_post_sort()" function is used to sort the features we are
dumping and there are a large number of features.

To speed this up we have modified our local copy to also
have an _index_function() which generates an indexed version
of the features array for use in a "Schwartzian transform"
to sort the features.
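
As a rough illustration of why the transform helps (a generic sketch,
not the EnsEMBL code; expensive_key() and the sample data are made up
for the example), the following counts how often the key is computed
when sorting with a plain comparator versus a Schwartzian transform:

```perl
use strict;
use warnings;

my $calls = 0;

# Hypothetical stand-in for an expensive per-feature key computation.
sub expensive_key {
    my ($item) = @_;
    $calls++;
    my ($start) = $item =~ /(\d+)/;
    return $start;
}

# 200 made-up items with distinct numeric positions in scrambled order.
my @items = map { "feature_at_" . ( ( $_ * 7919 ) % 1000 ) } 1 .. 200;

# Plain sort: the key is recomputed for both operands of every comparison.
$calls = 0;
my @plain = sort { expensive_key($a) <=> expensive_key($b) } @items;
my $plain_calls = $calls;

# Schwartzian transform: compute each key once, sort on it, then unwrap.
$calls = 0;
my @transformed =
    map  { $_->[1] }
    sort { $a->[0] <=> $b->[0] }
    map  { [ expensive_key($_), $_ ] } @items;
my $st_calls = $calls;

print "plain sort: $plain_calls key computations\n";
print "transform:  $st_calls key computations\n";
```

The transform computes the key exactly once per item, whereas the
plain comparator computes it twice per comparison, so the gap widens
as the number of features grows.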

In tests on a 1MB VC, we have observed speed-ups of the
script from 200 seconds to about 30 seconds (or even
faster) for these exports.

If no-one objects I will add these into the 0.7 and live
branches of EnsEMBL by the end of the week. The extra code
is written in such a way that it won't break any use of the
_post_sort functionality.

James Smith

(EnsEMBL web developer)

Code sample:

This sorts features by type in a given order (e.g. similarity
features first, then repeats, then gene features, ...) and, within
each type, in base pair order.

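    # Schwartzian transform: compute each sort index once, sort on it,
    # then print the FTHelper stored alongside it.
    $self->_print_GenBank_FTHelper($_->[1]) foreach (
        sort { $a->[0] <=> $b->[0] }
          map { [ $index_func->($_), $_ ] }
            map { Bio::SeqIO::FTHelper::from_SeqFeature($_,$seq) }
              $seq->top_SeqFeatures
    );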

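where $index_func is a reference to:

    # %sort_order (feature key => rank) and $last (rank for unlisted
    # keys) are not shown here.
    sub sort_Indexer_function {
        my $fth = shift;    # not "my $a", which would mask sort's $a
        my ($five_prime) = $fth->loc =~ /(\d+)/;    # first number in the location
        return ($sort_order{$fth->key} || $last)*1e9 + $five_prime;
    }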
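
For anyone who wants to try the idea outside of SeqIO, here is a
minimal stand-alone sketch of the same composite index. It uses plain
hashes instead of FTHelper objects, and the contents of %sort_order,
the value of $last, the name index_of() and the sample features are
all assumptions made for the example, not the values used in EnsEMBL:

```perl
use strict;
use warnings;

# Assumed ranking of feature keys; lower ranks are printed first.
my %sort_order = ( similarity => 1, repeat => 2, gene => 3 );
my $last       = 99;    # assumed rank for any key missing from %sort_order

# Composite index: the rank dominates, the 5' position breaks ties.
sub index_of {
    my ($feature) = @_;
    my ($five_prime) = $feature->{loc} =~ /(\d+)/;
    return ( $sort_order{ $feature->{key} } || $last ) * 1e9 + $five_prime;
}

# Made-up features standing in for FTHelper objects.
my @features = (
    { key => 'gene',       loc => '500..900' },
    { key => 'repeat',     loc => '120..180' },
    { key => 'misc',       loc => '10..20' },
    { key => 'similarity', loc => 'complement(300..400)' },
    { key => 'repeat',     loc => '40..60' },
);

my @sorted =
    map  { $_->[1] }
    sort { $a->[0] <=> $b->[0] }
    map  { [ index_of($_), $_ ] } @features;

print "$_->{key}\t$_->{loc}\n" for @sorted;
```

This prints the similarity feature first, then the two repeats in
position order (40, then 120), then the gene, and finally the unlisted
"misc" feature. Note that the 1e9 multiplier only keeps the types
apart as long as coordinates stay below 1e9.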