[Biojava-l] Getting a part of a sequence

Richard Holland holland at eaglegenomics.com
Tue Oct 7 23:05:54 UTC 2008


Your code is pretty good already - but you're right, it will load the
whole chromosome into memory before you can chop out the interesting
bit you actually need.

As you observed, using ThinRichSequence in your query loads only the
initial shell of a sequence object to start with, but the moment you
try to sub-sequence it, it immediately loads the whole sequence data
into memory in order to perform the operation.

If you only want the sequence data, as a string, you can do this by
specifying the sequence attribute in the query and bypassing the
sequence object entirely:

 select rs.stringSequence from Sequence as rs where rs.description
like '%hromosome :num%'

This will return a String instead of a RichSequence object. You can
use HQL operators to perform substrings etc. on the string inside the
query itself - see the Hibernate reference documentation on HQL,
particularly section 14.9.
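
As a rough sketch of the substring idea (not tested against a real BioSQL
instance), the query text and the coordinate arithmetic could be kept in a
small helper like the one below. The class and method names are made up for
illustration. It assumes HQL's substring(), which takes a 1-based start and
a length, and a 1-based inclusive start/end for the region of interest.
Whether your database will apply substring to the sequence column type is
something you would need to check.

```java
/**
 * Hypothetical helper: holds an HQL string that asks the database for a
 * slice of the stored sequence text, so that only the slice is returned
 * to the program rather than the whole chromosome.
 */
final class SequenceSliceQuery {

    // HQL substring() takes a 1-based start position and a length.
    // The three named parameters are bound by the caller.
    static final String HQL =
        "select substring(rs.stringSequence, :start, :len) "
      + "from Sequence as rs "
      + "where rs.description like :desc";

    /** Length of the closed, 1-based interval [start, end]. */
    static int sliceLength(int start, int end) {
        if (start < 1 || end < start) {
            throw new IllegalArgumentException(
                "expected 1 <= start <= end, got " + start + ".." + end);
        }
        return end - start + 1;
    }

    /** LIKE pattern mirroring the description match used in this thread. */
    static String descriptionPattern(String chrNo) {
        return "MT".equals(chrNo)
            ? "%mitochondrion%"
            : "%hromosome " + chrNo + "%";
    }
}
```

With a Hibernate session you would pass SequenceSliceQuery.HQL to
session.createQuery(), bind start, len and desc as named parameters, and
read the single result as a String. Binding the description as a parameter
also avoids pasting user input into the query text. Note that a pattern
such as '%hromosome 1%' will also match chromosomes 10 to 19, so a tighter
match may be needed.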

If you only want the features, you can do this by using the
BioSQLFeatureFilter technique. In particular you will want the
BySequenceName filter, the And filter, and the OverlapsRichLocation
filter. You construct a filter then pass it to the filter() method in
BioSQLRichSequenceDB. The database will return to you all the
RichFeature objects that match your criteria. Note that it searches
the whole database so you really must use a BySequenceName filter at
the very least in order to make the results useful!
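
To make it concrete what the And of a BySequenceName filter and an
OverlapsRichLocation filter selects, here is a plain-Java model of the
predicate. It does not use the BioJava classes at all: the Feature class
and its fields are hypothetical stand-ins, strand and any other details the
real filter may check are ignored, and the real filter is evaluated by the
database rather than in memory.

```java
import java.util.ArrayList;
import java.util.List;

/** In-memory model of "feature is on sequence X and overlaps min..max". */
final class FeatureOverlapModel {

    /** Hypothetical stand-in for a stored feature; not a BioJava class. */
    static final class Feature {
        final String sequenceName;
        final int min;
        final int max;

        Feature(String sequenceName, int min, int max) {
            this.sequenceName = sequenceName;
            this.min = min;
            this.max = max;
        }
    }

    /** Two closed intervals overlap when each starts no later than the other ends. */
    static boolean overlaps(int minA, int maxA, int minB, int maxB) {
        return minA <= maxB && minB <= maxA;
    }

    /** Keeps features on the named sequence that overlap the region min..max. */
    static List<Feature> select(List<Feature> all, String name, int min, int max) {
        List<Feature> hits = new ArrayList<>();
        for (Feature f : all) {
            if (f.sequenceName.equals(name) && overlaps(f.min, f.max, min, max)) {
                hits.add(f);
            }
        }
        return hits;
    }
}
```

Dropping the sequence-name test from select() shows why the BySequenceName
filter matters: every feature on every chromosome that happens to cover
those coordinates would come back.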

However, you can't use HQL to construct a complete slice of a sequence
directly in the database before returning it to the program for use as
a ready-made RichSequence object. This would require Hibernate to know
what a BioJava sub-sequence object is and how it behaves in relation
to an 'unsliced' one, which is beyond the scope of its job as a
persistence framework.


2008/10/7 Gabrielle Doan <gabrielle_doan at gmx.net>:
> Hi all,
> I have a BioSQL database which contains all human chromosomes. My intention
> is to get the information about a particular gene. How can I get a part of a
> particular chromosome with all associated features? At the moment I use
> the following code to create my new sequence:
> <code>
> RichSequence subSeq = RichSequence.Tools.subSequence(parent,
>        position[0], position[1], ns, geneName, parent.getAccession(),
>        parent.getIdentifier(), parent.getVersion() + 1,
>        (Double) (parent.getVersion() + 1.0));
> <\code>
> Here is the part how I get the parent sequence:
> <code>
>        public static RichSequence getChromosome(String chrNo) {
>                Transaction tx = session.beginTransaction();
>                RichSequence ret = null;
>                String query;
>                try {
>                        if (chrNo.equals("MT")) {
>                                query = "from BioEntry as be where "
>                                        + "be.description like '%:num%'";
>                                query = query.replaceAll(":num", "mitochondrion");
>                        } else {
>                                query = "from BioEntry as be where "
>                                        + "be.description like '%hromosome :num%'";
>                                query = query.replaceAll(":num", chrNo);
>                        }
>                        Query q = session.createQuery(query);
>                        ret = (RichSequence) q.list().get(0);
>                        tx.commit();
>                } catch (Exception e) {
>                        tx.rollback();
>                        e.printStackTrace();
>                }
>                return ret;
>        }
> <\code>
> I always have to load the whole chromosome to get a part of it, so it takes
> a very long time and I get a lot of unused information (waste of memory). I
> also tried to use <code>ThinRichSequence<\code> instead of
> <code>RichSequence<\code>, but I didn't notice any difference.
> Can you give me a hint how to accelerate the code?
> I am grateful for any hints.
> cheers,
> Gabrielle

Richard Holland, BSc MBCS
Finance Director, Eagle Genomics Ltd
M: +44 7500 438846 | E: holland at eaglegenomics.com
