[Biojava-l] Different implementation of Sequence?

Thu Jun 5 08:58:58 EDT 2003

Just to add my two cents' worth.

I'm using the latest version of the BioSQL schema with MySQL and the 
filters are quite fast.  On a database containing 18 complete bacterial 
genomes, fetching a given gene by name, which in my case uses a 
combination of 5 filters, takes approx. 1-2 seconds.

Alas, the current version of BioJava doesn't support the latest schema, 
but I have modified all of the BioSQL classes to handle most of the new 
schema, and they add, remove and filter sequences correctly according 
to all my tests so far.  Now that I have CVS access (thanks Thomas, it 
works), I hope to check in these updates within the next day or two.

If you want them sooner, I can email them to you directly.  Let me know.

Cheers,
Simon Foote

-- 
Bioinformatics Specialist
Institute for Biological Sciences
National Research Council of Canada
[T] 613-990-0561  [F] 613-952-9092
simon.foote at nrc-cnrc.gc.ca

Thomas Down wrote:

>Once upon a time, Y D Sun wrote:
>  
>
>>Hi,
>>
>>It seems that the implementation of a sequence differs depending on
>>whether it is read from a plain text file (e.g. an EMBL file) or from a
>>BioSQL database.
>>
>>I apply a feature filter to a sequence seq like:
>>
>>            //make a Filter for "CDS" types
>>            FeatureFilter ff = new FeatureFilter.ByType("CDS");
>>
>>            //get the filtered Features
>>            FeatureHolder fh = seq.filter(ff);
>>
>>Feature filtering takes much longer for a sequence from the database. In
>>my experiment, for example:
>>
>>If seq is read from an EMBL file, seq.filter(ff) takes 54 ms;
>>if the same seq is read from a BioSQL database, it takes 51518 ms
>>(roughly 1000 times longer).
>>
>>The latter also requires more memory at run time.
>>
>>Could anybody give some justification for this phenomenon?
>>    
>>
>
>If you load a sequence from a file, it's all loaded into memory.
>The filtering process is a simple in-memory operation.  When
>a sequence is fetched from BioSQL, it's just a lazy reference
>to the database.  The features are only being fetched when you
>perform the filter operation.  This will be slower.  I'm
>surprised it uses more memory, though -- certainly when you're
>working with large numbers of sequences, BioSQL should be more
>efficient.
>
>That said, the time you quote is very, very slow.  Where
>did you get the BioSQL schema from?  Some versions are circulating
>which seem to be missing some critical "CREATE INDEX" statements,
>which makes feature-filtering substantially slower than it should
>be...
>
>   Thomas.
>
>
>_______________________________________________
>Biojava-l mailing list  -  Biojava-l at biojava.org
>http://biojava.org/mailman/listinfo/biojava-l
>  
>
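
To illustrate Thomas's point about lazy loading, here is a minimal plain-Java
sketch.  It is not how the BioJava BioSQL classes are actually written; the
names Feat, LazyFeatures and LazyFilterDemo are made up for illustration, and
the "database" is just a Supplier that copies a list.  It only shows why the
first filter call on a database-backed sequence pays the loading cost, while
a sequence parsed from a file does not.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical stand-in for a sequence feature; not a BioJava class. */
final class Feat {
    final String type;
    final String name;

    Feat(String type, String name) {
        this.type = type;
        this.name = name;
    }
}

/** Hypothetical feature holder that loads its features on first use. */
final class LazyFeatures {
    private final Supplier<List<Feat>> loader; // e.g. a database query
    private List<Feat> cache;                  // null until first access
    int loads = 0;                             // how often the loader ran

    LazyFeatures(Supplier<List<Feat>> loader) {
        this.loader = loader;
    }

    /** File-style holder: the features are already in memory. */
    static LazyFeatures inMemory(List<Feat> features) {
        LazyFeatures holder = new LazyFeatures(() -> features);
        holder.cache = features;
        return holder;
    }

    /** Returns the features of the given type, loading them if needed. */
    List<Feat> filterByType(String type) {
        if (cache == null) {
            cache = loader.get(); // the expensive step for a database
            loads++;
        }
        List<Feat> hits = new ArrayList<>();
        for (Feat f : cache) {
            if (f.type.equals(type)) {
                hits.add(f);
            }
        }
        return hits;
    }
}

class LazyFilterDemo {
    public static void main(String[] args) {
        List<Feat> data = new ArrayList<>();
        data.add(new Feat("CDS", "geneA"));
        data.add(new Feat("tRNA", "trnB"));
        data.add(new Feat("CDS", "geneC"));

        // Assumption: copying the list stands in for a database fetch.
        LazyFeatures fromDb = new LazyFeatures(() -> new ArrayList<>(data));

        long start = System.nanoTime();
        int hits = fromDb.filterByType("CDS").size();
        long micros = (System.nanoTime() - start) / 1000;
        System.out.println(hits + " CDS features, first call: " + micros + " us");
    }
}
```

In this sketch only the first filterByType call triggers the loader; later
calls reuse the cached list.  With a real database, that first call is where
missing indexes would hurt.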