[Bioperl-l] Entrez Gene ASN

Stefan Kirov skirov at utk.edu
Fri Mar 11 14:02:16 EST 2005



Hilmar Lapp wrote:

> Gene shouldn't be fundamentally different from LocusLink, and 
> LocusLink was represented as an annotated SeqI within bioperl.

It is not, you are right.

>
> If at all possible I'd still like it to remain that way for Gene in 
> order to allow for a smooth transition from LL to Gene for code that's 
> been using the former.
>
Hmm, backward compatibility is a good thing, but sometimes it may be hard to 
achieve.

> If you want to emphasize the fact that it's a container for sequences, 
> then that sounds like a ClusterI to me, which can be richly annotated 
> too.

Let me disagree here. Cluster is designed for independent sequences, 
whereas Gene should deal with sequences that have a hierarchical 
relationship among themselves. This is one of the issues I think the Seq 
object is not designed to deal with. What we need is:
genome--(Bio::Seq)-
                   |--transcript(Bio::Seq)
                   |                      |--protein(Bio::Seq)
                   |--transcript(Bio::Seq)
                                          |--protein(Bio::Seq)
etc. As an alternative, one could store the relationships in a separate 
ontology object, but I don't think this is really effective.
As many genome and transcript entries exist, it would be easy to lose 
the relations.
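
A rough sketch of the hierarchy drawn above, written in Python rather 
than Perl purely for illustration. SeqNode and its methods are 
hypothetical names, not existing Bioperl classes, and the accessions are 
made up. Each node keeps a link to its parent and its children, so the 
genome/transcript/protein relations travel with the objects instead of 
living in a separate ontology:

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass(eq=False)
class SeqNode:
    """Hypothetical stand-in for a Bio::Seq-like object inside a gene record."""
    accession: str
    kind: str  # "genome", "transcript" or "protein"
    parent: Optional["SeqNode"] = field(default=None, repr=False)
    children: List["SeqNode"] = field(default_factory=list, repr=False)

    def add_child(self, child: "SeqNode") -> "SeqNode":
        # Record the relation in both directions so it cannot get lost.
        child.parent = self
        self.children.append(child)
        return child

    def walk(self, depth: int = 0) -> Iterator[Tuple[int, "SeqNode"]]:
        # Depth-first traversal yielding (depth, node) pairs.
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)


# Made-up accessions, only to show the shape of the structure.
genome = SeqNode("genomic_1", "genome")
t1 = genome.add_child(SeqNode("transcript_1", "transcript"))
t1.add_child(SeqNode("protein_1", "protein"))
t2 = genome.add_child(SeqNode("transcript_2", "transcript"))
t2.add_child(SeqNode("protein_2", "protein"))

for depth, node in genome.walk():
    print("  " * depth + f"{node.kind}: {node.accession}")
```

Whether the real object should hold full sequence objects or only 
accessions at each node is left open here.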

Another significant concern I have is that if we store everything as 
SeqFeature, the overhead may become huge (some records have hundreds 
of different features) and any user of the parser will have to do quite 
a bit of data mining to find the relevant feature. One approach would be to 
add more Bio::Annotation:: objects (for example Bio::Annotation::STS, 
Bio::Annotation::GRIF, etc). And one last thing: orthology (which again 
could be based on ontology) and synteny are things that should be in the 
Gene (or LocusLink) object.
We may decide to create a simplified (Bio::Seq, no relationships) or 
more complex object (Gene), based on the user request.
I hope this does not sound too confusing, as I am buried in the Gene 
ASN structure and I am quickly approaching quiet madness.

>
> Note also that NCBI is working on an ASN.1->XML converter. Personally, 
> I'm inclined to wait for that converter to appear, but other 
> priorities may prevail.
>
I have waited for a while. If they cannot parse their own data...? 
Anyway, some issues will still be there even if we have the XML.
Stefan

> Let me know what you think.
>
>     -hilmar
>
> On Thursday, March 10, 2005, at 06:14  AM, Stefan Kirov wrote:
>
>> Hi guys!
>> I have done some (mostly) serious thinking about ASN Entrez Gene 
>> parsing and I propose we do my favorite thing: postpone everything we 
>> cannot deal with right now. If you want it to sound better: take a 
>> gradual approach where we store the data we can deal with in the 
>> existing Bioperl objects and skip the rest for now.
>> In detail:
>> An ASN gene record can be correctly represented as a tree. I have 
>> written a simple parser for my own purposes which stores the 
>> following for each node:
>> node_id---|
>>           |--parent
>>           |--level
>>           |--tag
>>           |--values
>> What I do then is get specific levels and tags and build different 
>> objects. So level 2 with parent EntrezGene (which is the root level 
>> and has no information) is the gene description and has tags such as 
>> gene, name, etc.; at levels 3, 5 and 6 you can get the complete species 
>> definition by looking for orgname and org as tags and for records with 
>> parent mod (which is a value for orgname; descend down the branch).
>> I am using this approach to store most of the data in a relational 
>> database without going through Bioperl. What I ultimately want to do 
>> is use standard Bioperl modules. However, I don't think we have an 
>> object that can efficiently represent the structure (correct me if I 
>> am wrong). I think it may be a good idea to have a container object, 
>> possibly Bio::Gene, that may contain multiple Bio::Seq objects (with 
>> or without real sequence). I believe we can borrow some structure and 
>> code from the EnsEMBL gene representation (the way it contains multiple 
>> transcripts, etc.; certainly not the database interactions).
>> Please let me know what you think.
>> Stefan
>>
>>
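
The flat node table described in the quoted message (node_id, parent, 
level, tag, values) could be sketched roughly as follows. This is Python 
for illustration only, not the parser itself. The rows are invented toy 
data, and select() and path_to_root() are hypothetical helper names. 
Only the field names and the "level 2 under EntrezGene" example come 
from the description.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    node_id: int
    parent: Optional[int]  # node_id of the parent; None for the root
    level: int
    tag: str
    values: List[str]


# Invented toy rows; a real record has far more nodes and other tags.
nodes = [
    Node(1, None, 1, "EntrezGene", []),
    Node(2, 1, 2, "gene", ["example-symbol"]),
    Node(3, 1, 2, "name", ["example gene name"]),
    Node(4, 2, 3, "org", ["example organism"]),
]
by_id = {n.node_id: n for n in nodes}


def select(level: int, parent_tag: str) -> List[Node]:
    """Return the nodes at a given level whose parent carries parent_tag."""
    return [
        n for n in nodes
        if n.level == level
        and n.parent is not None
        and by_id[n.parent].tag == parent_tag
    ]


def path_to_root(node: Node) -> List[str]:
    """Follow parent links upward and return the tags from root to node."""
    tags = [node.tag]
    while node.parent is not None:
        node = by_id[node.parent]
        tags.append(node.tag)
    return list(reversed(tags))


gene_description = select(2, "EntrezGene")
print([n.tag for n in gene_description])
print(path_to_root(by_id[4]))
```

Selecting by level and parent tag is what makes it possible to build 
different objects from different parts of the tree, at the cost of 
having to know the layout of the record in advance.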

-- 
Stefan Kirov, Ph.D.
University of Tennessee/Oak Ridge National Laboratory
5700 bldg, PO BOX 2008 MS6164
Oak Ridge TN 37831-6164
USA
tel +865 576 5120
fax +865-576-5332
e-mail: skirov at utk.edu
sao at ornl.gov

"And the wars go on with brainwashed pride
For the love of God and our human rights
And all these things are swept aside"


