[BioSQL-l] Re: SeqWithQuality and biosql
Hilmar Lapp
hlapp at gnf.org
Tue Jul 5 14:55:10 EDT 2005
(I don't think posting to bioperl was a mistake, so I'm including it
here again)
I think I like Mark's proposal best, i.e., the fundamental model of at
most one sequence for each bioentry (e.g., Bio::SeqI object) is left
intact, and the problem is reformulated as how to encode/decode
sequences from alphabet cross-products as strings.
Encoding/decoding wouldn't be difficult to implement, even such that
the encoded string is still human-readable. Biojava has a natural
provision for doing this (SymbolTokenizer?), but Bioperl does not:
in Bioperl the object model assumes that the sequence is a flat
string and that the alphabet is also a flat string, so there is no
object you could ask for an encoder/decoder appropriate for either
the alphabet or the type of sequence object.
I'd like to hear some feedback from the Bioperl folks as to whether
you'd consider this capability a generally useful addition to Bioperl.
(It could be designed in a number of ways ranging from more intrusive
to completely neutral - e.g., adding this as a method to SeqI [like
$seq->seq_encoder()], or making $seq->alphabet() return an object with
this and other capabilities, or creating a separate factory class that
would return the appropriate encoder known to [or registered with] it
based on a given alphabet and type of sequence object.)
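A rough sketch of the last option (a separate factory with registered
encoders) might look like the following. It is written in Python purely
for brevity; every name in it (EncoderFactory, register, encoder_for, the
alphabet and type labels) is hypothetical and nothing like it exists in
Bioperl or Biojava as far as this thread establishes.

```python
# Hypothetical sketch: none of these names exist in Bioperl or Biojava.
from typing import Callable, Dict, Tuple

# An encoder turns a sequence object into a storable string.
Encoder = Callable[[object], str]


class EncoderFactory:
    """Registry of encoders keyed by (alphabet name, sequence type name)."""

    def __init__(self) -> None:
        self._registry: Dict[Tuple[str, str], Encoder] = {}

    def register(self, alphabet: str, seq_type: str, encoder: Encoder) -> None:
        """Make an encoder known to the factory for this alphabet/type pair."""
        self._registry[(alphabet, seq_type)] = encoder

    def encoder_for(self, alphabet: str, seq_type: str) -> Encoder:
        """Return the registered encoder, or fail loudly if there is none."""
        try:
            return self._registry[(alphabet, seq_type)]
        except KeyError:
            raise LookupError(
                f"no encoder registered for {alphabet!r} / {seq_type!r}"
            )


factory = EncoderFactory()
# A flat DNA string needs no transformation before storage.
factory.register("dna", "PrimarySeq", lambda seq: str(seq))
```

The point of this variant is that existing sequence classes stay
untouched; the cost is that callers have to know to ask the factory.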
As for Bio::Seq::MetaI, this could certainly be the interface for
SeqWithQuality, but wouldn't solve the de/serialization problem. Also,
at least conceptually MetaI-derived classes could represent
multi-dimensional meta-information, right? That is, the problem of how
to encode/decode the meta-information isn't trivial or restricted to
two dimensions here either.
As for creating a specialized adaptor in Bioperl-db, that would
certainly work too and would most likely be the fastest way to get
something that works. However, long-term it would solve the problem
only for SeqWithQuality and not for the more general problem of how to
store sequences that are based on cross-product alphabets. BTW if you
do implement a specialized adaptor, then instead of storing two
bioentries and connecting them you might as well implement the sequence
encoding/decoding for this particular object in the adaptor - you'd
gain speed because instead of increasing the number of database
operations you'd spend a couple more CPU cycles in Perl code, and you
wouldn't be burdened with two bioentries that aren't coupled by a foreign
key constraint.
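As an illustration of doing the encoding inside the adaptor, here is a
minimal sketch in Python using an in-memory SQLite database. The
one-table layout is a simplified stand-in and NOT the real BioSQL
schema, and the store/fetch functions are hypothetical; the string
format is the slash-delimited one suggested in this message.

```python
import sqlite3
from typing import List, Optional, Tuple

Pairs = List[Tuple[str, int]]


def encode(pairs: Pairs) -> str:
    """Encode (base, quality) pairs as e.g. 'A/22 T/30'."""
    return " ".join(f"{base}/{qual}" for base, qual in pairs)


def decode(text: str) -> Pairs:
    """Inverse of encode(); split() with no argument tolerates an empty string."""
    return [(b, int(q)) for b, q in (tok.split("/") for tok in text.split())]


conn = sqlite3.connect(":memory:")
# Simplified stand-in table (an assumption, not the actual BioSQL DDL).
conn.execute(
    "CREATE TABLE biosequence ("
    "bioentry_id INTEGER PRIMARY KEY, alphabet TEXT, seq TEXT)"
)


def store(conn: sqlite3.Connection, bioentry_id: int, pairs: Pairs) -> None:
    """One INSERT per sequence-with-quality: a single bioentry, a single row."""
    conn.execute(
        "INSERT INTO biosequence (bioentry_id, alphabet, seq) VALUES (?, ?, ?)",
        (bioentry_id, "dna x integer", encode(pairs)),
    )


def fetch(conn: sqlite3.Connection, bioentry_id: int) -> Optional[Pairs]:
    """One SELECT, then decode in application code."""
    row = conn.execute(
        "SELECT seq FROM biosequence WHERE bioentry_id = ?", (bioentry_id,)
    ).fetchone()
    return None if row is None else decode(row[0])
```

Bases and qualities live in the same row here, so they cannot drift
apart or be deleted independently.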
As for a consensus on how to encode sequences with quality values, I'd
include a delimiter between the alphabet operands in the cross-product,
e.g., using a slash as the delimiter: 'A/22 T/30 A/32 G/35 C/35'.
This can be easily extended to multi-dimensional cross-products so long
as the delimiter between them isn't a symbol in any of the alphabets.
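For concreteness, a minimal sketch of that encoding and its inverse,
in Python. The function names and the default separators are
assumptions for illustration only; the one hard requirement carried
over from the text is that neither separator may be a symbol in any of
the alphabets.

```python
# Sketch only: function names and default separators are assumptions.
from typing import List, Tuple


def encode(pairs: List[Tuple[str, int]],
           operand_sep: str = "/", symbol_sep: str = " ") -> str:
    """Join the operands of each symbol with operand_sep, symbols with symbol_sep."""
    return symbol_sep.join(f"{base}{operand_sep}{qual}" for base, qual in pairs)


def decode(text: str,
           operand_sep: str = "/", symbol_sep: str = " ") -> List[Tuple[str, int]]:
    """Inverse of encode(); assumes the separators are not alphabet symbols."""
    if not text:
        return []
    pairs = []
    for token in text.split(symbol_sep):
        base, qual = token.split(operand_sep)
        pairs.append((base, int(qual)))
    return pairs


phred = [("A", 22), ("T", 30), ("A", 32), ("G", 35), ("C", 35)]
encoded = encode(phred)   # 'A/22 T/30 A/32 G/35 C/35'
```

A third dimension would simply add another operand_sep-separated field
per token.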
-hilmar
On Jul 5, 2005, at 12:39 AM, Marc Logghe wrote:
> Thanks for the feedback.
> Good to know I am not alone in this ;-)
> I totally agree with Mark that there should be a kind of consensus on
> how to store this in Bio*.
> Yesterday I mistakenly posted my original mail to the bioperl list.
> Heikki responded to that; it might be a good starting point but I am
> not familiar with it:
> http://portal.open-bio.org/pipermail/bioperl-l/2005-July/019271.html
> So much for the long-term solution.
> In the short term, to have at least something that works, I'll experiment a
> little with storing separate objects. I remember one of the
> presentations of Hilmar, where he gave the example of making an adaptor
> and storing 2 sequence objects that interacted with each other as a
> result of a Two Hybrid experiment in yeast.
> Cheers,
> Marc
>
>
>>
>> I'd think storing it in BioSQL as 2-byte pairs would be good.
>> First byte is the base (an ASCII character), second byte is
>> the quality (an 8-bit integer). Sure it wastes a few bits but
>> so does normal DNA...
>>
>>
>> Richard Holland
>> Bioinformatics Specialist
>> GIS extension 8199
>>
>>
>>> -----Original Message-----
>>> From: biosql-l-bounces at portal.open-bio.org
>>> [mailto:biosql-l-bounces at portal.open-bio.org] On Behalf Of
>>> mark.schreiber at novartis.com
>>> Sent: Tuesday, July 05, 2005 1:44 PM
>>> To: Marc Logghe
>>> Cc: biosql-l-bounces at portal.open-bio.org; biosql-l at open-bio.org
>>> Subject: Re: [BioSQL-l] FW: SeqWithQuality and biosql
>>>
>>>
>>> Hello -
>>>
>>> I was wondering about similar issues with biojava. As you may (or may
>>> not) know biojava can make sequences from symbols in any alphabet, two
>>> examples are DNA and the integer alphabet (a collection of Symbols
>>> that are integers). Biojava can also make compound alphabets, one such
>>> example is the Phred alphabet which is the multiplication of DNA x
>>> Integer (technically a subset of Integer from 0 to 99).
>>>
>>> Because sequence in BioSQL is stored in a CLOB, if you can encode your
>>> SeqWithQuality as a String of characters you can store it.
>>> With the case above (which is probably similar to yours) you would
>>> need 400 distinct characters (4 bases x 100 quality values) to encode
>>> it, which is too many for ASCII but could be done in Unicode. The
>>> downside is that your persistence layer needs to know how to encode and
>>> decode your SeqWithQuality. I'm not familiar with how BioPerl would do
>>> this. BioJava would need to implement a SymbolTokenizer for the
>>> alphabet and then persistence would happen automatically (assuming
>>> your DB is OK with Unicode). An alternative would be to make a
>>> tokenizer that uses more than single character tokens for encoding
>>> (e.g. A23 G40 T34 C22 etc).
>>>
>>> The alternative you suggest of storing two sequences with a
>>> relationship is also nice (because you can retrieve each part
>>> separately) but also requires your persistence layer to know about it.
>>> However, it has big disadvantages because they are not strongly tied
>>> to each other. If you manipulate one you might invalidate the other.
>>> Also if you delete one the other will probably not be deleted in a
>>> cascade.
>>>
>>> Not sure if any of this helps but a consensus on how to store this
>>> kind of information would be good so the bio* projects do it the same
>>> way.
>>> Consensus in this case will probably mean whatever the first
>>> implementation is.
>>>
>>> - Mark
>>>
>>>
>>>
>>>
>>>
>>> "Marc Logghe" <Marc.Logghe at devgen.com> Sent by:
>>> biosql-l-bounces at portal.open-bio.org
>>> 07/04/2005 05:56 PM
>>>
>>>
>>> To: <biosql-l at open-bio.org>
>>> cc: (bcc: Mark Schreiber/GP/Novartis)
>>> Subject: [BioSQL-l] FW: SeqWithQuality and biosql
>>>
>>>
>>> Apologies for cross posting, I had picked the wrong mail address :-(
>>>
>>> -----Original Message-----
>>> From: Marc Logghe
>>> Sent: Monday, July 04, 2005 11:43 AM
>>> To: bioperl-l at portal.open-bio.org
>>> Subject: SeqWithQuality and biosql
>>>
>>> Hi all,
>>> I am currently exploring the possibility to store a
>>> Bio::Seq::SeqWithQuality object in biosql.
>>> Has anyone ever tried this ?
>>> One possibility would be to
>>> 1) split up the Bio::Seq::SeqWithQuality object into a plain
>>> Bio::Seq::RichSeq and a Bio::Seq::PrimaryQual
>>> 2) store them separately in biosql; different namespaces
>>> 3) link them with a relation term.
>>> 4) make a custom adaptor to fetch the persistent objects from biosql
>>> and reconstruct the Bio::Seq::SeqWithQuality
>>>
>>> Does that make sense ? Any other suggestions/possibilities ?
>>> As a test I tried to load a Bio::Seq::PrimaryQual in biosql using the
>>> load_seqdatabase.pl but it fails because Bio::Seq::PrimaryQual does
>>> not have a namespace method.
>>> I hope I'm wrong but I have the impression there is a long way to go
>>> ;-)
>>>
>>> Marc
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> BioSQL-l mailing list
>>> BioSQL-l at open-bio.org
>>> http://open-bio.org/mailman/listinfo/biosql-l
>>>
>>>
>>>
>>>
>>
>
>
--
-------------------------------------------------------------
Hilmar Lapp email: lapp at gnf.org
GNF, San Diego, Ca. 92121 phone: +1-858-812-1757
-------------------------------------------------------------