[BioRuby] [Wg-phyloinformatics] GSOC: phyloXML for BioRuby: Mapping sequence

Tue Jun 9 19:18:20 UTC 2009

Hi:

Thank you for the detailed comments.

I think this is a crucial point, since sequence and taxonomy are 
the two most important elements.

At this point, I would recommend creating a dedicated class for the 
phyloXML sequence, with methods/constructors that make conversion 
to and from Bio::Sequence easy.
But I can definitely see the advantages of using Bio::Sequence 
directly, too.
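
A minimal sketch of that arrangement, under stated assumptions: 
PhyloSeq, GenericSeq, to_generic and from_generic are hypothetical 
names, and a plain Struct stands in for Bio::Sequence so the example 
runs without the bio gem. Only the two mappings from the list quoted 
below (type -> molecule type, mol_seq -> seq) are carried over; 
everything else stays on the phyloXML side.

```ruby
# Stand-in for Bio::Sequence (assumption): only the two attributes
# used in this sketch are modelled.
GenericSeq = Struct.new(:seq, :molecule_type, keyword_init: true)

# Hypothetical class that mirrors the phyloXML <sequence> element.
class PhyloSeq
  attr_accessor :type, :symbol, :name, :location, :mol_seq

  def initialize(attrs = {})
    attrs.each { |key, value| public_send("#{key}=", value) }
  end

  # Export only the fields that have a generic counterpart.
  def to_generic
    GenericSeq.new(seq: mol_seq, molecule_type: type)
  end

  # Build a phyloXML-style object from the generic one; fields with
  # no generic counterpart (symbol, name, location) stay nil.
  def self.from_generic(generic)
    new(type: generic.molecule_type, mol_seq: generic.seq)
  end
end
```

Converting to the generic form and back keeps type and mol_seq but 
loses symbol, name and location, which is the trade-off behind the 
question of adding attributes to Bio::Sequence.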

Also, please don't forget that, should a consensus or strong opinion 
emerge, we could also add features to the phyloXML sequence definition 
so that it matches the BioRuby and Biopython sequence classes more closely.

Christian

Naohisa GOTO wrote:
> Hi,
>
> Sorry for the delay.
>
> On Sat, 30 May 2009 17:27:52 -0400
> Diana Jaunzeikare <rozziite at gmail.com> wrote:
>
>   
>> Hi all,
>>
>> I looked more carefully at the sequence element of phyloXML, and it
>> contains information that cannot be mapped to a Bio::Sequence object. I
>> suggest having a sequence class that closely resembles the phyloXML
>> structure, plus a method that extracts the relevant elements and returns
>> a Bio::Sequence object.  What do you think?
>>     
>
> In this case, the method to convert from Bio::Sequence to the
> phyloXML sequence class is also needed.
>
> If some of the attributes are really essential and not specific
> to phyloXML, but will also be needed by other data types, it is
> also possible to add new attributes to Bio::Sequence.
>
>   
>> Below, the phyloXML sequence tag elements are listed on the left, and
>> after the arrow (->) is the possible corresponding attribute of Bio::Sequence:
>> * type
>> ** rna, dna  -> Bio::Sequence::NA -> molecule type
>> ** aa -> Bio::Sequence::AA
>> * id_source (string ?) -> id_namespace
>> * id_ref (string ) -> entry_id
>>     
id_source and id_ref are actually used to describe relations between 
sequences, for example orthology relationships.

>> * symbol (string ?)
>> * accession
>> ** source (example: "UniProtKB") ->
>> ** id (example: "P17304") ->  primary_accession
>>     
** source -> id_namespace
** id -> primary_accession (or entry_id)
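
To make the distinction concrete, here is a hedged sketch: the 
accession supplies the identifier attributes, while id_source is 
only a document-internal label that a relation points to. The Struct 
names and both helper methods are hypothetical; the id_ref_0 / 
id_ref_1 / type fields assume the layout of the phyloXML 
sequence_relation element.

```ruby
# Hypothetical record types; field names follow the phyloXML tags.
Accession = Struct.new(:source, :id)
SeqRecord = Struct.new(:id_source, :accession)
Relation  = Struct.new(:id_ref_0, :id_ref_1, :type)

# Map an accession onto the two attribute names suggested above.
def accession_attributes(record)
  { id_namespace: record.accession.source,
    primary_accession: record.accession.id }
end

# Resolve a relation by finding the records whose id_source label
# matches each id_ref; returns a two-element array.
def resolve(relation, records)
  by_label = records.map { |rec| [rec.id_source, rec] }.to_h
  [by_label[relation.id_ref_0], by_label[relation.id_ref_1]]
end
```

Here id_source never reaches the identifier attributes; it is only 
used as a lookup key inside the document.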

>> * name (string )
>> * location (string ? )
>> * mol_seq (string) -> seq / Bio::Sequence::NA/AA
>> * uri
>> ** desc (string)
>> ** type (string )
>> ** uri
>>
>> * annotation []
>> ** ref
>> ** source
>> ** evidence
>> ** type
>> ** desc
>> ** confidence
>> ** property []
>> ** uri
>>
>> * domain_architecture
>> ** length
>> ** domain []
>> *** from
>> *** to
>> *** confidence
>> *** id
>>     
>
> The annotations and domain architecture could be mapped to the features
> in Bio::Sequence.  But, in some cases, the mapping is difficult,
> depending on the vocabulary used in the annotations/domain_architecture.
>
>
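
For the simple case, mapping domain_architecture onto features could 
look roughly like the sketch below. Feature is a stand-in Struct 
(a type label, a "from..to" position string and a qualifier hash), 
not the real BioRuby feature class, and Domain and 
domains_to_features are hypothetical names. The vocabulary problem 
mentioned above is not addressed: every domain simply gets the 
generic label "domain".

```ruby
# Stand-in for a generic sequence feature (assumption).
Feature = Struct.new(:feature, :position, :qualifiers)

# Hypothetical record mirroring the phyloXML <domain> fields.
Domain = Struct.new(:from, :to, :confidence, :id)

# Turn each domain into one feature with a "from..to" position.
def domains_to_features(domains)
  domains.map do |domain|
    Feature.new("domain", "#{domain.from}..#{domain.to}",
                { "id" => domain.id, "confidence" => domain.confidence })
  end
end
```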