[Biojava-l] Different implementation of Sequence?

Wed Jun 4 19:38:57 EDT 2003

Once upon a time, Y D Sun wrote:
> Hi,
> 
> It seems that the implementation for a sequence that is read from a plain
> text file (e.g. Embl file) or from a BioSQL database is different.
> 
> I apply a feature filter to a sequence seq like:
> 
>             //make a Filter for "CDS" types
>             FeatureFilter ff = new FeatureFilter.ByType("CDS");
> 
>             //get the filtered Features
>             FeatureHolder fh = seq.filter(ff);
> 
> The feature filtering takes longer for a sequence from the database. In
> my experiment, for example,
> 
> If seq is read from an Embl file, the time cost of seq.filter(ff) is 54
> ms;
> If the same seq is read from a BioSQL database, the time is 51518 ms
> (roughly 1000 times longer).
> 
> The latter also requires more memory space in execution.
> 
> Could anybody give some justification for this phenomenon?

If you load a sequence from a file, it's all loaded into memory,
so the filtering process is a simple in-memory operation.  When
a sequence is fetched from BioSQL, it's just a lazy reference
to the database.  The features are only fetched when you
perform the filter operation, so that call includes the database
access and will be slower.  I'm surprised it uses more memory,
though -- certainly when you're working with large numbers of
sequences, BioSQL should be more efficient.
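
To illustrate the difference, here is a minimal self-contained
sketch.  It does not use BioJava's actual classes: EagerHolder,
LazyHolder and the Supplier standing in for the database query are
all hypothetical names, and features are reduced to their type
strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

public class LazyFeatureDemo {

    /** Features were parsed up front, as when reading a flat file. */
    static class EagerHolder {
        private final List<String> featureTypes;

        EagerHolder(List<String> featureTypes) {
            this.featureTypes = featureTypes;
        }

        /** Pure in-memory scan over the already-loaded features. */
        List<String> filterByType(String type) {
            List<String> out = new ArrayList<>();
            for (String t : featureTypes) {
                if (t.equals(type)) {
                    out.add(t);
                }
            }
            return out;
        }
    }

    /** Features are fetched only when first needed (hypothetical model). */
    static class LazyHolder {
        private final Supplier<List<String>> loader; // stands in for a DB query
        private List<String> featureTypes;           // null until first use
        int loadCount = 0;                           // how often we hit the "database"

        LazyHolder(Supplier<List<String>> loader) {
            this.loader = loader;
        }

        List<String> filterByType(String type) {
            if (featureTypes == null) {
                // The first filter call pays for the fetch as well as the scan.
                featureTypes = loader.get();
                loadCount++;
            }
            return new EagerHolder(featureTypes).filterByType(type);
        }
    }

    public static void main(String[] args) {
        List<String> types = Arrays.asList("CDS", "gene", "CDS", "source");

        EagerHolder eager = new EagerHolder(types);
        LazyHolder lazy = new LazyHolder(() -> types);

        System.out.println(eager.filterByType("CDS").size()); // 2
        System.out.println(lazy.loadCount);                   // 0, nothing fetched yet
        System.out.println(lazy.filterByType("CDS").size());  // 2
        System.out.println(lazy.loadCount);                   // 1, fetched during filter
    }
}
```

In this model, timing only the filter call on the lazy holder also
times the fetch, which is one reason the two numbers you measured
are not directly comparable.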

That said, the time you quote is very, very slow.  Where
did you get the BioSQL schema from?  Some versions are circulating
which seem to be missing some critical "CREATE INDEX" statements,
which makes feature filtering substantially slower than it should
be...
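
As a rough in-memory analogy for why a missing index hurts (this
is not the BioSQL schema or any real database code; IndexAnalogy,
scan and buildIndex are hypothetical names), compare checking every
row with looking rows up through a prebuilt map:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class IndexAnalogy {

    /** Without an index: inspect every row to find the matching ones. */
    static List<Integer> scan(List<String> rowTypes, String type) {
        List<Integer> hits = new ArrayList<>();
        for (int row = 0; row < rowTypes.size(); row++) {
            if (rowTypes.get(row).equals(type)) {
                hits.add(row);
            }
        }
        return hits;
    }

    /** Build the index once: feature type -> row numbers of that type. */
    static Map<String, List<Integer>> buildIndex(List<String> rowTypes) {
        Map<String, List<Integer>> index = new HashMap<>();
        for (int row = 0; row < rowTypes.size(); row++) {
            index.computeIfAbsent(rowTypes.get(row), k -> new ArrayList<>()).add(row);
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> rowTypes = Arrays.asList("CDS", "gene", "CDS", "source");
        Map<String, List<Integer>> index = buildIndex(rowTypes);

        System.out.println(scan(rowTypes, "CDS")); // [0, 2], after touching all 4 rows
        System.out.println(index.get("CDS"));      // [0, 2], via a single map lookup
    }
}
```

Both calls return the same rows; the scan's cost grows with the
size of the table, while the lookup does not.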

   Thomas.