[Bioperl-l] EntrezGene ASN parser
Hilmar Lapp
hlapp at gmx.net
Fri Apr 1 03:52:10 EST 2005
On Wednesday, March 30, 2005, at 02:31 PM, Stefan Kirov wrote:
> I just finished a Bioperl EntrezGene Parser based on Mingyi Liu's ASN
> Gene parser. It creates two main objects: a Bio::Seq object which
> contains most of the data such as references, description, map
> location, etc; and a Bio::Cluster::SequenceFamily object, which
> contains the refseqs and the gene structure (through NT/NC annotation,
> represented as Bio::SeqFeature::Gene objects).
You added Bio::SeqFeature::Gene objects to a
Bio::Cluster::SequenceFamily instance?
Bio::Cluster::SequenceFamily as a Bio::ClusterI should accept only
Bio::PrimarySeqI as members ... I.e., originally these clusters were
meant to hold sequences.
I'm not sure it's a good idea to mix bags of sequences with bags of
features.
Or did I misunderstand, and you meant something else?
> Another piece of data I make available is the uncaptured data. Each
> time some data is transferred from the hash which represents the parsed
> data, I am deleting the respective key. Everything else is considered
> uncaptured. I am doing this since some records could be non-compliant
> or simply there may be new data supplied by NCBI. There will be
> naturally some data, which is not interesting, and therefore is not
> captured (a lot of redundant data in the EntrezGene). So the parser
> would act like that:
> my ($egene,$assoc_seq,$uncaptured)=$egparser->next_seq;
Be careful here, this is non-compliant with Bio::SeqIO which mandates
that next_seq() return a sequence object.
You could use wantarray to determine whether to return a single object
(supposedly $egene?) or three elements, but if someone does
my $seq = $egparser->next_seq();
the result should not be the scalar 3 (i.e., number of elements).
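The wantarray approach suggested above could look roughly like the following. This is only a minimal sketch: the package My::EntrezGeneParser and its plain-string records are hypothetical stand-ins for illustration, not the parser under discussion and not part of Bio::SeqIO.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the parser; NOT the actual Bio::SeqIO code.
package My::EntrezGeneParser;

sub new {
    my ($class, @records) = @_;
    return bless { records => [@records] }, $class;
}

sub next_seq {
    my ($self) = @_;
    # Bare return yields undef (scalar) or the empty list when exhausted.
    my $rec = shift @{ $self->{records} } or return;
    my ($egene, $assoc_seq, $uncaptured) = @$rec;
    # List context: hand back all three values.
    # Scalar context: hand back only the sequence object, never a count.
    return wantarray ? ($egene, $assoc_seq, $uncaptured) : $egene;
}

package main;

# Strings stand in for the real objects to keep the sketch self-contained.
my $egparser = My::EntrezGeneParser->new(
    [ 'gene1', 'family1', { leftover => 1 } ],
    [ 'gene2', 'family2', {} ],
);

my ($egene, $assoc_seq, $uncaptured) = $egparser->next_seq;  # list context
my $seq = $egparser->next_seq;                               # scalar context

print "$egene $assoc_seq\n";
print "$seq\n";
```

With this shape, a caller that only ever writes my $seq = $egparser->next_seq(); gets the sequence object, while callers who ask for a list still receive all three values.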
> There are a few things I need to add (Markers and GO are not yet in
> these objects), but most of the work is done. Unless somebody objects, I
> will commit the code (Bio::SeqIO::entrezgene?) when I write the
> documentation to match the standard.
Sounds like a good name. I suggest you commit so that interested others
(i.e., me :) can have a look.
Also, if you have certain use cases driving your work that expect
certain things in certain places, it'd be good if you start writing
test cases that check for those things. I certainly have such a use
case as I probably indicated earlier; so if I need things in different
places than you put them it'd be good to see where changes can be made
easily and where not. I depend(ed) a lot on the LocusLink annotation
and that will be no different for its successor.
> A few notes:
> 1. It would be nice if there were a Bio::Annotation::DBLink::url method.
> It makes sense (I think) since most DB links would refer also to a
> webpage.
Feel free to add, but don't expect e.g. bioperl-db to (de)serialize
this.
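For illustration, a url method of the kind proposed could be derived from the database name and the primary ID. The sketch below is an assumption throughout: My::DBLink is a hypothetical stand-in class, and the template table and the example.org address are invented for the example, not anything Bio::Annotation::DBLink provides.

```perl
use strict;
use warnings;

package My::DBLink;    # hypothetical stand-in, NOT Bio::Annotation::DBLink

# Assumed lookup table: database name => URL template (placeholder address).
my %URL_TEMPLATE = (
    ExampleDB => 'http://example.org/entry/%s',
);

sub new {
    my ($class, %args) = @_;
    return bless {%args}, $class;
}

sub database   { return $_[0]{database} }
sub primary_id { return $_[0]{primary_id} }

# Build the URL on demand instead of storing it, so nothing extra
# needs to be serialized along with the link itself.
sub url {
    my ($self) = @_;
    my $tmpl = $URL_TEMPLATE{ $self->database } or return;
    return sprintf $tmpl, $self->primary_id;
}

package main;

my $link = My::DBLink->new(database => 'ExampleDB', primary_id => '12345');
my $url  = $link->url;
print "$url\n";
```

Computing the URL rather than storing it is one way to sidestep the (de)serialization concern, since the value can always be rebuilt from fields that are already persisted.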
> 2. It takes now 45 minutes to parse the whole human ASN file, which is
> 4 times slower. Keeping uncaptured data slows things down a bit, so I
> will introduce -debug option. Anyway I think the speed is not going to
> be an issue.
What would -debug do?
I think there should be an option to disable the keeping of what you
call uncaptured data. Also, as I said before the standard way of
calling is to ask for a sequence object, so if I know in advance that
that's all I'm ever going to do I should have the option to disable
construction of those other 2 objects you propose to return from
next_seq.
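The delete-as-you-capture bookkeeping from the quoted message, together with a switch for turning it off, might be sketched as below. The function name capture_record, the keep_uncaptured option, and the field names are all hypothetical, chosen only for this example.

```perl
use strict;
use warnings;

# Hypothetical helper; names and fields are illustrative assumptions.
sub capture_record {
    my ($parsed, %opt) = @_;
    my %work = %$parsed;    # shallow copy so the caller's hash is untouched
    my %captured;
    for my $key (qw(description map_location)) {
        # delete returns the removed value, so capturing a field and
        # striking it from the working hash happen in one step.
        $captured{$key} = delete $work{$key} if exists $work{$key};
    }
    # Whatever remains was not recognised; keep it only when asked to.
    my $uncaptured = $opt{keep_uncaptured} ? \%work : undef;
    return (\%captured, $uncaptured);
}

# Placeholder record, not real Entrez Gene data.
my %record = (
    description  => 'example gene',
    map_location => 'example location',
    new_field    => 'something the parser does not know about',
);

my ($captured, $left) = capture_record(\%record, keep_uncaptured => 1);
my ($only_captured, $nothing) = capture_record(\%record);

print join(',', sort keys %$left), "\n";    # prints "new_field"
```

When the option is off, the leftover hash is simply dropped, so callers who only want the sequence object do not carry the extra data around.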
Sounds like the Entrez Gene parser is coming along without me having to
write it. I'm thrilled, Stefan!
-hilmar
> 3. Due to the cyclic reference in the GeneStructure object I am
> removing the Transcript->{parent} in the parser. This code should be
> deleted once the Transcript object is fixed.
> There are also some other minor issues, but I think I will be able to
> fix them by the end of the week.
> Please let me know what you think.
> Stefan
--
-------------------------------------------------------------
Hilmar Lapp email: lapp at gnf.org
GNF, San Diego, Ca. 92121 phone: +1-858-812-1757
-------------------------------------------------------------