[Bioperl-l] EntrezGene ASN parser

Stefan Kirov skirov at utk.edu
Fri Apr 1 08:20:57 EST 2005



Hilmar Lapp wrote:

>
> On Wednesday, March 30, 2005, at 02:31  PM, Stefan Kirov wrote:
>
>> I just finished a Bioperl EntrezGene Parser based on Mingyi Liu's ASN 
>> Gene parser. It creates two main objects: a Bio::Seq object which 
>> contains most of the data such as references, description, map 
>> location, etc; and a Bio::Cluster::SequenceFamily object, which 
>> contains the refseqs and the gene structure (through NT/NC 
>> annotation, represented as Bio::SeqFeature::Gene objects).
>
>
> You added Bio::SeqFeature::Gene objects to a 
> Bio::Cluster::SequenceFamily instance?

>
> Bio::Cluster::SequenceFamily as a Bio::ClusterI should accept only 
> Bio::PrimarySeqI as members ... I.e., originally these clusters were 
> meant to hold sequences.
>
> I'm not sure it's a good idea to mix bags of sequences with bags of 
> features.
>
> Or I misunderstood and you meant something else?

Nope. The Bio::SeqFeature::Gene objects are attached to a Bio::Seq, which 
then goes into the Bio::Cluster::SequenceFamily. Sorry, my description may 
have been misleading.

>
>> I also make the uncaptured data available. Each time some data is 
>> transferred from the hash which represents the parsed data, I delete 
>> the respective key. Everything left over is considered uncaptured. 
>> I am doing this since some records could be non-compliant, or NCBI 
>> may simply supply new data. There will naturally be some data which 
>> is not interesting and therefore is not captured (there is a lot of 
>> redundant data in EntrezGene). So the parser would act like this:
>> my ($egene,$assoc_seq,$uncaptured)=$egparser->next_seq;
>
>
> Be careful here, this is non-compliant with Bio::SeqIO which mandates 
> that next_seq() return a sequence object.
>
> You could use wantarray to determine whether to return a single object 
> (supposedly $egene?) or three elements, but if someone does
>
>     my $seq = $egparser->next_seq();
>
> the result should not be the scalar 3 (i.e., number of elements).

Hmm, I see... So unless you ask for all the data as a list (the 3 objects), 
you will get only the Bio::Seq object with the immediate EntrezGene data (no 
genomic coordinates, etc.). OK, I will change that.
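
A minimal sketch of the wantarray approach, assuming a stand-in parser 
class (My::EntrezGeneParser and its plain-string records are hypothetical 
and only illustrate the calling contexts; this is not the actual 
Bio::SeqIO::entrezgene code):

```perl
use strict;
use warnings;

# Hypothetical stand-in for the parser, for illustration only.
package My::EntrezGeneParser;

sub new {
    my ($class, @records) = @_;
    return bless { records => [@records] }, $class;
}

# List context:   returns ($egene, $assoc_seq, $uncaptured).
# Scalar context: returns only $egene, as a Bio::SeqIO caller expects.
sub next_seq {
    my ($self) = @_;
    my $rec = shift @{ $self->{records} };
    return unless defined $rec;    # undef (scalar) or empty list (list) at end of input
    my ($egene, $assoc_seq, $uncaptured) = @$rec;
    return wantarray ? ($egene, $assoc_seq, $uncaptured) : $egene;
}

package main;

# Placeholder records; real ones would be Bio::Seq etc. objects.
my $parser = My::EntrezGeneParser->new(
    [ 'gene1', 'family1', { extra => 1 } ],
    [ 'gene2', 'family2', {} ],
);

my $seq = $parser->next_seq;                            # scalar context
my ($egene, $assoc, $uncaptured) = $parser->next_seq;   # list context

print "scalar call: $seq\n";
print "list call:   $egene, $assoc\n";
```

Note that my ($seq) = $parser->next_seq; (with parentheses) is a list-context 
call and would still receive all three elements, keeping only the first.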

>
>> There are a few things I still need to add (Markers and GO are not yet 
>> in these objects), but most of the work is done. Unless somebody 
>> objects, I will commit the code (Bio::SeqIO::entrezgene?) once I have 
>> written the documentation to match the standard.
>
>
> Sounds like a good name. I suggest you commit so that interested 
> others (i.e., me :) can have a look.
>
> Also, if you have certain use cases driving your work that expect 
> certain things in certain places, it'd be good if you start writing 
> test cases that check for those things. I certainly have such a use 
> case as I probably indicated earlier; so if I need things in different 
> places than you put them it'd be good to see where changes can be made 
> easily and where not. I depend(ed) a lot on the LocusLink annotation 
> and that will be no different for its successor.
>
OK... I will take another look at LocusLink and try to adjust as much as 
possible. Once I commit the code you can tell me if there is a critical 
part that needs additional work or changes.

>> A few notes:
>> 1. It would be nice if there were a Bio::Annotation::DBLink::url 
>> method. It makes sense (I think) since most DB links would also refer 
>> to a webpage.
>
>
> Feel free to add, but don't expect e.g. bioperl-db to (de)serialize this.
>
>> 2. It now takes 45 minutes to parse the whole human ASN file, which 
>> is 4 times slower. Keeping uncaptured data slows things down a bit, 
>> so I will introduce a -debug option. Anyway, I think speed is not 
>> going to be an issue.
>
>
> What would -debug do?
>
> I think there should be an option to disable the keeping of what you 
> call uncaptured data. Also, as I said before the standard way of 
> calling is to ask for a sequence object, so if I know in advance that 
> that's all I'm ever going to do I should have the option to disable 
> construction of those other 2 objects you propose to return from 
> next_seq.

That is exactly what -debug would do (-debug => 'off' as the default).
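
A rough sketch of the delete-as-you-capture idea with such a switch, 
assuming a hypothetical helper capture_fields() and made-up field names 
(neither is taken from the parser itself); the $keep_uncaptured flag 
stands in for the proposed -debug option:

```perl
use strict;
use warnings;

# Hypothetical helper: move the wanted keys out of the parsed hash.
# Whatever is left in the hash afterwards counts as "uncaptured".
# Note that this modifies the hash passed in.
sub capture_fields {
    my ($parsed, $wanted, $keep_uncaptured) = @_;
    my %captured;
    for my $key (@$wanted) {
        # delete returns the removed value, so one step both captures
        # the value and removes the key.
        $captured{$key} = delete $parsed->{$key} if exists $parsed->{$key};
    }
    # Copy the leftovers only when asked to (the slow path).
    my $uncaptured = $keep_uncaptured ? { %$parsed } : undef;
    return (\%captured, $uncaptured);
}

# Made-up record, for illustration only.
my %record = (symbol => 'GENE1', description => 'example', 'new-field' => 'x');

my ($captured, $uncaptured) =
    capture_fields(\%record, [qw(symbol description)], 1);

print "captured:   ", join(',', sort keys %$captured),   "\n";
print "uncaptured: ", join(',', sort keys %$uncaptured), "\n";
```

With the flag off, the second return value is undef and the copy of the 
leftover keys is skipped.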

>
> Sounds like the Entrez Gene parser is coming along without me having 
> to write it. I'm thrilled Stefan!!

Thanks...

>
>     -hilmar
>
>
>> 3. Due to the cyclic reference in the GeneStructure object I am 
>> removing the Transcript->{parent} in the parser. This code should be 
>> deleted once the Transcript object is fixed.
>> There are also some other minor issues, but I think I will be able to 
>> fix them by the end of the week.
>> Please let me know what you think.
>> Stefan
>>
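
To illustrate the cyclic reference in point 3: a small sketch using plain 
hashes as hypothetical stand-ins for the gene structure and transcript 
objects. How the Transcript object will eventually be fixed is not decided 
here; weakening the back-reference with Scalar::Util::weaken is shown only 
as one common option, alongside the removal the parser currently does:

```perl
use strict;
use warnings;
use Scalar::Util qw(weaken isweak);

# Hypothetical stand-ins; the real objects are Bio::SeqFeature::Gene classes.
my $gene       = { name => 'gene', transcripts => [] };
my $transcript = { name => 'transcript', parent => $gene };
push @{ $gene->{transcripts} }, $transcript;   # gene -> transcript -> gene: a cycle

# Option 1 (the parser's current workaround): drop the back-reference.
#   delete $transcript->{parent};

# Option 2 (an assumption, not the agreed fix): keep the back-reference
# but weaken it, so it no longer keeps the parent alive.
weaken($transcript->{parent});

print "parent ref is weak\n" if isweak($transcript->{parent});

# Dropping the last strong reference now frees the gene hash, and the
# weak back-reference becomes undef.
undef $gene;
print "parent freed\n" unless defined $transcript->{parent};
```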

-- 
Stefan Kirov, Ph.D.
University of Tennessee/Oak Ridge National Laboratory
5700 bldg, PO BOX 2008 MS6164
Oak Ridge TN 37831-6164
USA
tel +865 576 5120
fax +865-576-5332
e-mail: skirov at utk.edu
sao at ornl.gov

"And the wars go on with brainwashed pride
For the love of God and our human rights
And all these things are swept aside"


