From darin.london at duke.edu Mon Jul 3 08:41:33 2006
From: darin.london at duke.edu (Darin London)
Date: Mon, 03 Jul 2006 08:41:33 -0400
Subject: [Bioperl-l] Call For Birds of a Feather Suggestions
Message-ID: <44A9107D.2050304@duke.edu>

The BOSC organizing committee is currently seeking suggestions for Birds
of a Feather meeting ideas. Birds of a Feather meetings are one of the
more popular activities at BOSC, occurring at the end of each day's
session. These are free-form meetings organized by the attendees
themselves to discuss one or a few topics of interest in greater detail.
BOFs have been formed to allow developers and users of individual OBF
software to meet each other face-to-face to discuss the project, or to
discuss completely new ideas, and even start new software development
projects. These meetings offer a unique opportunity for individuals to
explore more about the activities of the various Open Source Projects,
and, in some cases, even take an active role influencing the future of
Open Source Software development.

If you would like to create a BOF, just sign up for a wiki account,
log in, and edit the BOSC 2006 Birds of a Feather page.

From bix at sendu.me.uk Wed Jul 5 08:37:34 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Wed, 05 Jul 2006 13:37:34 +0100
Subject: [Bioperl-l] checkout_all fails on biodata
Message-ID: <44ABB28E.2000203@sendu.me.uk>

I'm doing:

cvs -d:ext:sendu at dev.open-bio.org:/home/repository/bioperl co bioperl_all

to check out all the bioperl packages at once.
However it only checks out core, db, pedigree, pipeline and run before
failing on biodata:

cvs checkout: Updating biodata
cvs checkout: failed to create lock directory for
`/home/repository/bioperl/biodata'
(/home/repository/bioperl/biodata/#cvs.lock): Permission denied
cvs checkout: failed to obtain dir lock in repository
`/home/repository/bioperl/biodata'
cvs [checkout aborted]: read lock failed - giving up

This failure is consistent for me (had it multiple times, different
days, never worked).

Biodata isn't even mentioned as a possible package at
http://bioperl.org/wiki/Using_CVS. What is it? Could it be moved to the
end of the alias list so it is checked out last, letting all the other
packages be checked out before failure?

PS. neither biodata nor pipeline is mentioned as a package on that wiki
page or at http://bioperl.org/wiki/Category:BioPerl_Packages. Are there
yet more packages?

Cheers,
Sendu.

From hlapp at gmx.net Wed Jul 5 08:55:42 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Wed, 5 Jul 2006 08:55:42 -0400
Subject: [Bioperl-l] checkout_all fails on biodata
In-Reply-To: <44ABB28E.2000203@sendu.me.uk>
References: <44ABB28E.2000203@sendu.me.uk>
Message-ID: <9C3FE120-6B3D-4C81-99E1-921E81926CCA@gmx.net>

Should have been fixed - I can cvs update. Did you try again?

--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From bix at sendu.me.uk Wed Jul 5 09:03:50 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Wed, 05 Jul 2006 14:03:50 +0100
Subject: [Bioperl-l] checkout_all fails on biodata
In-Reply-To: <9C3FE120-6B3D-4C81-99E1-921E81926CCA@gmx.net>
References: <44ABB28E.2000203@sendu.me.uk> <9C3FE120-6B3D-4C81-99E1-921E81926CCA@gmx.net>
Message-ID: <44ABB8B6.5040707@sendu.me.uk>

Hilmar Lapp wrote:
> Should have been fixed - I can cvs update. did you try again?

Still doesn't work, no change. I can manually check out the other
packages, I just can't do it with the bioperl_all alias.
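
The manual workaround mentioned here (checking out packages one at a time rather than through the alias) could be scripted roughly as follows. This is only a sketch under stated assumptions: `checkout_each`, `CVS_CMD` and `USERNAME` are hypothetical names introduced for illustration, the repository path is the one quoted in this thread, and the module names are left to the caller because the thread does not list which names can be checked out individually.

```shell
#!/bin/sh
# Sketch only: check out CVS modules one at a time so that a failure on one
# module (for example a lock-directory permission error) does not abort the
# others. checkout_each, CVS_CMD and USERNAME are hypothetical names.

# Repository path as quoted in this thread; USERNAME is a placeholder.
CVSROOT_URL=':ext:USERNAME@dev.open-bio.org:/home/repository/bioperl'

checkout_each() {
    failed=''
    for module in "$@"; do
        # CVS_CMD lets a caller substitute another command (e.g. a stub for a dry run).
        if ${CVS_CMD:-cvs} -d "$CVSROOT_URL" co "$module"; then
            echo "ok: $module"
        else
            echo "FAILED: $module" >&2
            failed="$failed $module"
        fi
    done
    # Succeed only if every module checked out.
    [ -z "$failed" ]
}
```

It would be called as `checkout_each MODULE1 MODULE2 ...` with whichever module names check out individually; failed modules are reported on stderr and the function returns non-zero if any of them failed, while the remaining modules are still attempted.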
co bioperl-biodata fails because:
cvs server: cannot find module `bioperl-biodata' - ignored
cvs [checkout aborted]: cannot expand modules

(not that I want it - if it's no longer a bioperl package can it be
removed from the alias?)

From hlapp at gmx.net Wed Jul 5 09:41:27 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Wed, 5 Jul 2006 09:41:27 -0400
Subject: [Bioperl-l] checkout_all fails on biodata
In-Reply-To: <44ABB8B6.5040707@sendu.me.uk>
References: <44ABB28E.2000203@sendu.me.uk> <9C3FE120-6B3D-4C81-99E1-921E81926CCA@gmx.net> <44ABB8B6.5040707@sendu.me.uk>
Message-ID:

The idea was once that Bioperl, Biojava, etc had all those unit tests
that use specific sample data which take up quite a bit of space.
Unifying the largely redundant test data into a single shared
repository would save quite a bit of space and therefore download/update
time for people who work on/use more than one Bio* project.

However, this was never fully implemented AFAIK. I.e., you don't need
biodata. I guess it could be removed from the alias since it's not
integrated anyway.

Any other opinions?

I also forwarded your report to root-l as I couldn't find the
offending (stale) lock file.

-hilmar

--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From cjfields at uiuc.edu Wed Jul 5 09:48:03 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Wed, 5 Jul 2006 08:48:03 -0500
Subject: [Bioperl-l] checkout_all fails on biodata
In-Reply-To: <44ABB8B6.5040707@sendu.me.uk>
Message-ID: <000f01c6a039$a7a24f10$15327e82@pyrimidine>

Bioperl-data was a directory started up a few years ago to hold various data
files for testing and as examples (BLAST file examples, GenBank files, etc),
somewhat like the t/data directory but cleaned up a bit more. It hasn't been
updated in a while. Regardless, you should be able to check it out.

As for the problem, looks like Hilmar's checking up on a possible lock file
issue.

Chris

From cjfields at uiuc.edu Wed Jul 5 11:06:30 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Wed, 5 Jul 2006 10:06:30 -0500
Subject: [Bioperl-l] checkout_all fails on biodata
In-Reply-To:
Message-ID: <001901c6a044$999a14b0$15327e82@pyrimidine>

I use TortoiseCVS via WinXP and I'm getting the same issue that Sendu has:

---------------------------
In C:\Perl\src: "C:\Program Files\TortoiseCVS\cvs.exe" "-q" "--lf"
"checkout" "-P" "bioperl_all"
CVSROOT=:ext:cjfields at dev.open-bio.org:/home/repository/bioperl

...

cvs checkout: failed to create lock directory for
`/home/repository/bioperl/biodata'
(/home/repository/bioperl/biodata/#cvs.lock): Permission denied
cvs checkout: failed to obtain dir lock in repository
`/home/repository/bioperl/biodata'
cvs [checkout aborted]: read lock failed - giving up
cvs.exe checkout: in directory bioperl:
cvs.exe checkout: cannot open CVS/Entries for reading: No such file or
directory
---------------------------

I had the same problem with schema (BioSQL) a while back. I tried again,
and...

---------------------------
cvs checkout: failed to create lock directory for
`/home/repository/bioperl/biosql-schema'
(/home/repository/bioperl/biosql-schema/#cvs.lock): Permission denied
cvs checkout: failed to obtain dir lock in repository
`/home/repository/bioperl/biosql-schema'
cvs [checkout aborted]: read lock failed - giving up
cvs.exe checkout: in directory .:
cvs.exe checkout: cannot open CVS/Entries for reading: No such file or
directory
---------------------------

I believe it had something to do with CVS commit privileges (i.e. I had none
for schema, which was fine). So maybe this is a permissions issue via the
lock file?
Looking at the alias:

bioperl_all -d bioperl &core &db &run &pipeline &pedigree &biodata &schema
&network &microarray

This may mean if anyone w/o commit privs for any of the above (specifically
schema and biodata) tries checkout/update using bioperl_all, they may run
into this problem.

Since it's not integrated I don't see the problem with removing it from the
alias, but if we follow the same line of logic (and privileges are the
issue) then schema must be removed as well. To me it doesn't make much
sense to not include schema though since we can checkout/update bioperl-db.

BTW, I like the idea of biodata as you've outlined it. Would be nice to
gear the test suite towards a more general set of data for all the Bio*
projects versus having each one come with their own, and the data could be
updated a bit more frequently than t/data is. Seems like it would
definitely save a large chunk of real estate for the distributions. If one
wanted to run the full test suite then they would have to download biodata
separately, though, but not a bad compromise. Though, if this is/was its
intent, why would it need a lock file?

Chris

From cjfields at uiuc.edu Wed Jul 5 11:36:33 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Wed, 5 Jul 2006 10:36:33 -0500
Subject: [Bioperl-l] Bio::SearchIO::hmmer hsp behaviour
In-Reply-To:
Message-ID: <001a01c6a048$cb802420$15327e82@pyrimidine>

Okay, I managed to figure out what the problem was. I committed a fix in
CVS for the initial bug (Selvi's missing hits). Still has one HSP per hit
for now; I think it will take a bit more effort to get a BLAST-like multi
HSP/hit up and running.

Selvi, update from CVS to see if that works.

Chris

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Chris Fields
> Sent: Friday, June 30, 2006 12:44 PM
> To: Sendu Bala; Jason Stajich
> Cc: bioperl-l at lists.open-bio.org list
> Subject: Re: [Bioperl-l] Bio::SearchIO::hmmer hsp behaviour
>
> I'll try looking at it this weekend. A suggested workaround is to
> either try setting -A for no alignments or setting it to a high
> number to retrieve all of them. It's pretty serious as the error
> silently dumps those domains, so for those using automated annotation
> pipelines would miss it unless they are also checking the raw output.
>
> You could design a SearchIO::hmmpfam parser then expand it to take in
> hmmsearch output at a later point, or keep them separate. I like the
> idea of having modules that are more specific about what they parse;
> seems at some point you reach serious code bloat and maintenance
> becomes an issue. Look at SearchIO::blast; it parses various text
> BLAST output very well but with some serious obfuscation. Just don't
> know how productive it would be to separate out the PSI-BLAST and
> bl2seq stuff since they are pretty close to a standard BLAST
> report... oh well.
>
> To Jason : good luck on your move. Drop us a line here to let us
> know everything went well.
>
> Chris
>
> On Jun 30, 2006, at 11:14 AM, Sendu Bala wrote:
>
> > Chris Fields wrote:
> >> It may have been just simpler to have it be one HSP (domain) per Hit
> >> (model) as that's how the reports are generated. My reasoning was that
> >> using the one domain per model made sense based on what you are actually
> >> trying to do, which is annotate the sequence based on the order the
> >> domain appears. Most others may not view it that way, which is fine.
> >> One can always gather the relevant HSP's, convert to seqfeatures, then
> >> sort them if order is important, I suppose.
> >>
> >> I would say, if the overall consensus is to modify it to have multiple
> >> domain hits per model (similar to BLAST) then Sendu should go ahead and
> >> make those changes then announce it on the list so no one can gripe
> >> about it later. My main concern was not changing things so dramatically
> >> that it'll break for someone
> >
> > Going on your earlier suggestion, I was thinking about making
> > SearchIO::hmmpfam instead, which would get used if you set the format to
> > 'hmmpfam' instead of the generic 'hmmer' when making a SearchIO. I
> > suppose I would make a SearchIO::hmmsearch as well, if necessary.
> >
> >
> > [...]
> >> that the reported bug about missing hits (Bug 2036) is fixed as well.
> >
> > However, having never made a SearchIO plugin before, it will be some
> > time before I get my head around it. I'll want to make one the current
> > HOWTO:SearchIO way before I can think about doing it a better way
> > (hashes) as well. So I can say I'll make a move on this at some point in
> > the future, but if someone wants to fix Bug 2036 in the mean time, they
> > are welcome to. Again as suggested, my priority is Bio::Map right now.
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
> Christopher Fields
> Postdoctoral Researcher
> Lab of Dr.
> Robert Switzer
> Dept of Biochemistry
> University of Illinois Urbana-Champaign
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

From arareko at campus.iztacala.unam.mx Wed Jul 5 11:38:14 2006
From: arareko at campus.iztacala.unam.mx (Mauricio Herrera Cuadra)
Date: Wed, 05 Jul 2006 10:38:14 -0500
Subject: [Bioperl-l] checkout_all fails on biodata
In-Reply-To: <001901c6a044$999a14b0$15327e82@pyrimidine>
References: <001901c6a044$999a14b0$15327e82@pyrimidine>
Message-ID: <44ABDCE6.7090906@campus.iztacala.unam.mx>

Same problem here. I've never used the bioperl_all alias before (I
always check out dirs individually), but to me it seems like a
privileges issue as Chris suggests. I also browsed through the whole
repository on dev.open-bio.org and didn't find such a lock file. I guess
Chris D. or Jason will know better what's happening here.

Mauricio.

--
MAURICIO HERRERA CUADRA
arareko at campus.iztacala.unam.mx
Laboratorio de Genética
Unidad de Morfofisiología y Función
Facultad de Estudios Superiores Iztacala, UNAM

From bix at sendu.me.uk Thu Jul 6 04:41:57 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Thu, 06 Jul 2006 09:41:57 +0100
Subject: [Bioperl-l] Bio::Map changes
In-Reply-To: <449A9AF9.2000305@sendu.me.uk>
References: <44985915.8010607@sendu.me.uk> <449A9AF9.2000305@sendu.me.uk>
Message-ID: <44ACCCD5.3030309@sendu.me.uk>

Sendu Bala wrote:
> The next step is to tidy up all of Bio::Map*, which involves a major
> reimplementation of the whole system [...]
> The reimplementation will make Position central to the model, allowing
> for lots of other things to work properly without anything becoming
> inconsistent (as is currently the case).

This is now done. It uses a new PositionHandler class behind the scenes.

The next step is to introduce relative positioning across the board,
possibly in a way that makes OrderedPosition redundant or an implementer
of the system.

Has anyone here ever used Bio::Map* modules for anything?
I would appreciate you sending me your code, especially if you've used
MapIO, Physical (encompassing Clone, Contig, FPCMarker,
OrderedPositionWithDistance) or LinkageMap (encompassing LinkagePosition,
OrderedPosition, Microsatellite) since these have insufficient tests at
the moment.

From nidage at yahoo.com Thu Jul 6 14:13:12 2006
From: nidage at yahoo.com (sss lll)
Date: Thu, 6 Jul 2006 11:13:12 -0700 (PDT)
Subject: [Bioperl-l] PrimarySeqI object Exception
Message-ID: <20060706181312.63581.qmail@web35607.mail.mud.yahoo.com>

Hi there,

I encountered a problem while calling module PrimarySeqI, with the
following code:

my $db=Bio::DB::Fasta->new($fafile);
my $obj=$db->get_Seq_by_id($array_gene_name[$p]);
$seqio->write_seq($obj);

The error message was:

MSG: Did not provide a valid Bio::PrimarySeqI object
STACK Bio::SeqIO::fasta::write_seq
/usr/lib/perl5/site_perl/5.8.0/Bio/SeqIO/fasta.pm:178

We think it had to do with the length of the gene name. For example the
following gene name was a problem:

gi|59711891|ref|YP_204667.1| NAD-specific glutamate dehydrogenase [Vibrio fischeri ES114]*VIB*COG2902*E

Any ideas on what happened?

Thanks

From rmb32 at cornell.edu Thu Jul 6 19:11:00 2006
From: rmb32 at cornell.edu (Robert Buels)
Date: Thu, 06 Jul 2006 16:11:00 -0700
Subject: [Bioperl-l] parser for GeneSeqer
In-Reply-To: <2FB066C7-12E6-46D8-8F4A-BD096BE2A0CA@uiuc.edu>
References: <44A558F2.2050304@cornell.edu> <2FB066C7-12E6-46D8-8F4A-BD096BE2A0CA@uiuc.edu>
Message-ID: <44AD9884.6040507@cornell.edu>

The Annotation/Annotatable stuff was going to be talked about at the
GMOD meeting that just happened, wasn't it? What's the scoop on that?

Rob

Chris Fields wrote:
> If you plan on generating seqfeatures from this output you could check
> out the Bio::Tools core modules for examples.
> There are a few there
> that take program output and convert them to Bio::SeqFeature::Generic
> objects, including Bio::Tools::RNAMotif and Bio::Tools::tRNAscanSE. If
> alignments are involved you might want something like
> Bio::SeqFeature::FeaturePair. Not sure about using the
> SeqFeature::Annotation or others; I thought that some of the
> Annotation/Annotatable stuff might be changing soon but I may be wrong.
>
> Chris
>
> On Jun 30, 2006, at 12:01 PM, Robert Buels wrote:
>
>> Hi all,
>>
>> I find myself needing a parser for GeneSeqer output, so I'm writing one
>> (which I will submit for your consideration when it's working). In a
>> nutshell, GeneSeqer is a (kind of old) program for aligning a bunch of
>> ESTs to genomic sequence, then using those alignments to predict where
>> in the genomic sequence the genes are. So really what you get from this
>> is a bunch of hierarchical features.
>>
>> I don't really know where I should put it in the bioperl hierarchy
>> though. Probably FeatureIO?
>>
>> And what's the current fashion for objects it should emit?
>> Bio::SeqFeature::Generic? Bio::SeqFeature::Annotated?
>>
>> Rob
>>
>> --Robert Buels
>> SGN Bioinformatics Analyst
>> 252A Emerson Hall, Cornell University
>> Ithaca, NY 14853
>> Tel: 503-889-8539
>> rmb32 at cornell.edu
>> http://www.sgn.cornell.edu
>>
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
> Christopher Fields
> Postdoctoral Researcher
> Lab of Dr.
> Robert Switzer
> Dept of Biochemistry
> University of Illinois Urbana-Champaign
>

--
Robert Buels
SGN Bioinformatics Analyst
252A Emerson Hall, Cornell University
Ithaca, NY 14853
Tel: 503-889-8539
rmb32 at cornell.edu
http://www.sgn.cornell.edu

From hlapp at gmx.net Thu Jul 6 19:27:31 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Thu, 6 Jul 2006 19:27:31 -0400
Subject: [Bioperl-l] parser for GeneSeqer
In-Reply-To: <44AD9884.6040507@cornell.edu>
References: <44A558F2.2050304@cornell.edu> <2FB066C7-12E6-46D8-8F4A-BD096BE2A0CA@uiuc.edu> <44AD9884.6040507@cornell.edu>
Message-ID: <6B530ED6-5825-47C4-A677-2C75E0F97E26@gmx.net>

No scoop b/c no time. I am busy w/ a grant and Lincoln had to leave
early as well on Friday. Sorry.

--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From cjfields at uiuc.edu Thu Jul 6 19:28:09 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 6 Jul 2006 18:28:09 -0500
Subject: [Bioperl-l] parser for GeneSeqer
In-Reply-To: <44AD9884.6040507@cornell.edu>
Message-ID: <000001c6a153$d78b83c0$15327e82@pyrimidine>

Not any word yet. Been pretty quiet, likely b/c everybody was there
planning a roadmap for v1.6.

Chris

From hlapp at gmx.net Thu Jul 6 19:41:44 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Thu, 6 Jul 2006 19:41:44 -0400
Subject: [Bioperl-l] parser for GeneSeqer
In-Reply-To: <000001c6a153$d78b83c0$15327e82@pyrimidine>
References: <000001c6a153$d78b83c0$15327e82@pyrimidine>
Message-ID:

Uhm - roadmap - I guess yes, but more that of the Golden State, or other
states on the way, for Jason.

On Jul 6, 2006, at 7:28 PM, Chris Fields wrote:

> Not any word yet. Been pretty quiet, likely b/c everybody was there
> planning a roadmap for v1.6.
>
> Chris
>
>> -----Original Message-----
>> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Robert Buels
>> Sent: Thursday, July 06, 2006 6:11 PM
>> To: bioperl-l at bioperl.org
>> Subject: Re: [Bioperl-l] parser for GeneSeqer
>>
>> The Annotation/Annotatable stuff was going to be talked about at the
>> GMOD meeting that just happened, wasn't it? What's the scoop on that?
>> >> Rob >> >> >> Chris Fields wrote: >>> If you plan on generating seqfeatures from this output you could >>> check >>> out the Bio::Tools core modules for examples. There are a few there >>> that take program output and convert them to >>> Bio::SeqFeature::Generic >>> objects, including Bio::Tools:RNAMotif and >>> Bio::Tools::tRNAscanSE. If >>> alignments are involved you might want something like >>> Bio::SeqFeature::FeaturePair. Not sure about using the >>> SeqFeature::Annotation or others; I thought that the some of the >>> Annotation/Annotatable stuff might be changing soon but I may be >>> wrong. >>> >>> Chris >>> >>> On Jun 30, 2006, at 12:01 PM, Robert Buels wrote: >>> >>>> Hi all, >>>> >>>> I find myself needing a parser for GeneSeqer output, so I'm >>>> writing one >>>> (which I will submit for your consideration when it's working). >>>> In a >>>> nutshell, GeneSeqer is a (kind of old) program for aligning a >>>> bunch of >>>> ESTs to genomic sequence, then using those alignments to predict >>>> where >>>> in the genomic sequence the genes are. So really what you get from >> this >>>> is a bunch of hierarchical features. >>>> >>>> I don't really know where I should put it in the bioperl hierarchy >>>> though. Probably FeatureIO? >>>> >>>> And what's the current fashion for objects it should emit? >>>> Bio::SeqFeature::Generic? Bio::SeqFeature::Annotated? >>>> >>>> Rob >>>> >>>> --Robert Buels >>>> SGN Bioinformatics Analyst >>>> 252A Emerson Hall, Cornell University >>>> Ithaca, NY 14853 >>>> Tel: 503-889-8539 >>>> rmb32 at cornell.edu >>>> http://www.sgn.cornell.edu >>>> >>>> >>>> _______________________________________________ >>>> Bioperl-l mailing list >>>> Bioperl-l at lists.open-bio.org >>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >>> >>> Christopher Fields >>> Postdoctoral Researcher >>> Lab of Dr. 
Robert Switzer >>> Dept of Biochemistry >>> University of Illinois Urbana-Champaign >>> >>> >>> >> >> -- >> Robert Buels >> SGN Bioinformatics Analyst >> 252A Emerson Hall, Cornell University >> Ithaca, NY 14853 >> Tel: 503-889-8539 >> rmb32 at cornell.edu >> http://www.sgn.cornell.edu >> >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From cjfields at uiuc.edu Thu Jul 6 19:49:23 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 6 Jul 2006 18:49:23 -0500 Subject: [Bioperl-l] parser for GeneSeqer In-Reply-To: Message-ID: <000101c6a156$cee60bc0$15327e82@pyrimidine> Oh well. There's always BOSC... Chris > -----Original Message----- > From: Hilmar Lapp [mailto:hlapp at gmx.net] > Sent: Thursday, July 06, 2006 6:42 PM > To: Chris Fields > Cc: 'Robert Buels'; bioperl-l at bioperl.org > Subject: Re: [Bioperl-l] parser for GeneSeqer > > Uhm - roadmap - I guess yes, but more that of the Golden State, or > other states on the way, for Jason. > > On Jul 6, 2006, at 7:28 PM, Chris Fields wrote: > > > Not any word yet. Been pretty quiet, likely b/c everybody was there > > planning a roadmap for v1.6. 
> > > > Chris > > > >> -----Original Message----- > >> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > >> bounces at lists.open-bio.org] On Behalf Of Robert Buels > >> Sent: Thursday, July 06, 2006 6:11 PM > >> To: bioperl-l at bioperl.org > >> Subject: Re: [Bioperl-l] parser for GeneSeqer > >> > >> The Annotation/Annotatable stuff was going to be talked about at the > >> GMOD meeting that just happened, wasn't it? What's the scoop on > >> that? > >> > >> Rob > >> > >> > >> Chris Fields wrote: > >>> If you plan on generating seqfeatures from this output you could > >>> check > >>> out the Bio::Tools core modules for examples. There are a few there > >>> that take program output and convert them to > >>> Bio::SeqFeature::Generic > >>> objects, including Bio::Tools:RNAMotif and > >>> Bio::Tools::tRNAscanSE. If > >>> alignments are involved you might want something like > >>> Bio::SeqFeature::FeaturePair. Not sure about using the > >>> SeqFeature::Annotation or others; I thought that the some of the > >>> Annotation/Annotatable stuff might be changing soon but I may be > >>> wrong. > >>> > >>> Chris > >>> > >>> On Jun 30, 2006, at 12:01 PM, Robert Buels wrote: > >>> > >>>> Hi all, > >>>> > >>>> I find myself needing a parser for GeneSeqer output, so I'm > >>>> writing one > >>>> (which I will submit for your consideration when it's working). > >>>> In a > >>>> nutshell, GeneSeqer is a (kind of old) program for aligning a > >>>> bunch of > >>>> ESTs to genomic sequence, then using those alignments to predict > >>>> where > >>>> in the genomic sequence the genes are. So really what you get from > >> this > >>>> is a bunch of hierarchical features. > >>>> > >>>> I don't really know where I should put it in the bioperl hierarchy > >>>> though. Probably FeatureIO? > >>>> > >>>> And what's the current fashion for objects it should emit? > >>>> Bio::SeqFeature::Generic? Bio::SeqFeature::Annotated? 
> >>>> > >>>> Rob > >>>> > >>>> --Robert Buels > >>>> SGN Bioinformatics Analyst > >>>> 252A Emerson Hall, Cornell University > >>>> Ithaca, NY 14853 > >>>> Tel: 503-889-8539 > >>>> rmb32 at cornell.edu > >>>> http://www.sgn.cornell.edu > >>>> > >>>> > >>>> _______________________________________________ > >>>> Bioperl-l mailing list > >>>> Bioperl-l at lists.open-bio.org > >>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l > >>> > >>> Christopher Fields > >>> Postdoctoral Researcher > >>> Lab of Dr. Robert Switzer > >>> Dept of Biochemistry > >>> University of Illinois Urbana-Champaign > >>> > >>> > >>> > >> > >> -- > >> Robert Buels > >> SGN Bioinformatics Analyst > >> 252A Emerson Hall, Cornell University > >> Ithaca, NY 14853 > >> Tel: 503-889-8539 > >> rmb32 at cornell.edu > >> http://www.sgn.cornell.edu > >> > >> > >> _______________________________________________ > >> Bioperl-l mailing list > >> Bioperl-l at lists.open-bio.org > >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > From osborne1 at optonline.net Thu Jul 6 21:06:32 2006 From: osborne1 at optonline.net (Brian Osborne) Date: Thu, 06 Jul 2006 21:06:32 -0400 Subject: [Bioperl-l] PrimarySeqI object Exception In-Reply-To: <20060706181312.63581.qmail@web35607.mail.mud.yahoo.com> Message-ID: sss lll, What this error means is that $obj is not a valid Sequence object, this is what's passed to the write_seq method. What identifier is $array_gene_name[$p]? Brian O. 
On 7/6/06 2:13 PM, "sss lll" wrote: > Hi there, > > I encountered a problem while calling module > PrimarySeqI, with the following code: > > my $db=Bio::DB::Fasta->new($fafile); > my $obj=$db->get_Seq_by_id($array_gene_name[$p]); > $seqio->write_seq($obj); > > The error message was: > MSG: Did not provide a valid Bio::PrimarySeqI object > STACK Bio::SeqIO::fasta::write_seq > /usr/lib/perl5/site_perl/5.8.0/Bio/SeqIO/fasta.pm:178 > > We think it had to do with the lengh of the gene name. > For example the following gene name was a problem: > > gi|59711891|ref|YP_204667.1| NAD-specific glutamate > dehydrogenase [Vibrio fischeri ES114]*VIB*COG2902*E > > Any ideas on what happened? > > Thanks > > __________________________________________________ > Do You Yahoo!? > Tired of spam? Yahoo! Mail has the best spam protection around > http://mail.yahoo.com > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From rmb32 at cornell.edu Thu Jul 6 21:24:40 2006 From: rmb32 at cornell.edu (Robert Buels) Date: Thu, 06 Jul 2006 18:24:40 -0700 Subject: [Bioperl-l] FeatureIO tests failing on linux debian sarge Message-ID: <44ADB7D8.7080102@cornell.edu> I am stumped. On a fresh checkout from cvs (as of like 10 seconds ago): rob at rubisco:/usr/local/lib/site_perl/bioperl-live$ perl -v This is perl, v5.8.4 built for i386-linux-thread-multi Copyright 1987-2004, Larry Wall Perl may be copied only under the terms of either the Artistic License or the GNU General Public License, which may be found in the Perl 5 source kit. Complete documentation for Perl, including FAQ lists, should be found on this system using `man perl' or `perldoc perl'. If you have access to the Internet, point your browser at http://www.perl.com/, the Perl Home Page. 
rob at rubisco:/usr/local/lib/site_perl/Bio$ perl t/FeatureIO.t
1..22
ok 1
ok 2
ok 3
ok 4
ok 5
ok 6
Can't locate object method "get_Annotations" via package "Bio::SeqFeature::Annotated" at /usr/local/lib/site_perl/Bio/SeqFeature/Annotated.pm line 292, line 2.
ok 7 # Cannot complete FeatureIO tests
ok 8 # Cannot complete FeatureIO tests
ok 9 # Cannot complete FeatureIO tests
ok 10 # Cannot complete FeatureIO tests
ok 11 # Cannot complete FeatureIO tests
ok 12 # Cannot complete FeatureIO tests
ok 13 # Cannot complete FeatureIO tests
ok 14 # Cannot complete FeatureIO tests
ok 15 # Cannot complete FeatureIO tests
ok 16 # Cannot complete FeatureIO tests
ok 17 # Cannot complete FeatureIO tests
ok 18 # Cannot complete FeatureIO tests
ok 19 # Cannot complete FeatureIO tests
ok 20 # Cannot complete FeatureIO tests
ok 21 # Cannot complete FeatureIO tests
ok 22 # Cannot complete FeatureIO tests

However, the same code runs fine on my debian unstable machine (perl 5.8.8). Perhaps this is a bug in debian stable's perl?

I did some poking around through the code, changing @ISA = qw/.../ to use base, switching the order of inclusion in the ISA at the top of Bio::SeqFeature::Annotated, no dice.

Anybody able to reproduce this? Anyone have any ideas?

Rob

--
Robert Buels
SGN Bioinformatics Analyst
252A Emerson Hall, Cornell University
Ithaca, NY 14853
Tel: 503-889-8539
rmb32 at cornell.edu
http://www.sgn.cornell.edu

From cjfields at uiuc.edu Thu Jul 6 22:30:25 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 6 Jul 2006 21:30:25 -0500
Subject: [Bioperl-l] FeatureIO tests failing on linux debian sarge
In-Reply-To: <44ADB7D8.7080102@cornell.edu>
Message-ID: <000001c6a16d$4dd7e6e0$15327e82@pyrimidine>

I don't get any issues (all tests pass), except a few warning messages, which is normal; some ontology handling is not implemented. Usually when running tests I use 'perl -I. t/test.t' to force it to use the core directory first. You might try that to see if it 'fixes' the problem.
If it does, there may be another bioperl installation in @INC being used instead of your current directory. Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Robert Buels > Sent: Thursday, July 06, 2006 8:25 PM > To: bioperl-l at bioperl.org > Subject: [Bioperl-l] FeatureIO tests failing on linux debian sarge > > I am stumped. On a fresh checkout from cvs (as of like 10 seconds ago): > > > rob at rubisco:/usr/local/lib/site_perl/bioperl-live$ perl -v > > This is perl, v5.8.4 built for i386-linux-thread-multi > > Copyright 1987-2004, Larry Wall > > Perl may be copied only under the terms of either the Artistic License > or the > GNU General Public License, which may be found in the Perl 5 source kit. > > Complete documentation for Perl, including FAQ lists, should be found on > this system using `man perl' or `perldoc perl'. If you have access to the > Internet, point your browser at http://www.perl.com/, the Perl Home Page. > > rob at rubisco:/usr/local/lib/site_perl/Bio$ perl t/FeatureIO.t > 1..22 > ok 1 > ok 2 > ok 3 > ok 4 > ok 5 > ok 6 > Can't locate object method "get_Annotations" via package > "Bio::SeqFeature::Annotated" at > /usr/local/lib/site_perl/Bio/SeqFeature/Annotated.pm line 292, > line 2. 
> ok 7 # Cannot complete FeatureIO tests > ok 8 # Cannot complete FeatureIO tests > ok 9 # Cannot complete FeatureIO tests > ok 10 # Cannot complete FeatureIO tests > ok 11 # Cannot complete FeatureIO tests > ok 12 # Cannot complete FeatureIO tests > ok 13 # Cannot complete FeatureIO tests > ok 14 # Cannot complete FeatureIO tests > ok 15 # Cannot complete FeatureIO tests > ok 16 # Cannot complete FeatureIO tests > ok 17 # Cannot complete FeatureIO tests > ok 18 # Cannot complete FeatureIO tests > ok 19 # Cannot complete FeatureIO tests > ok 20 # Cannot complete FeatureIO tests > ok 21 # Cannot complete FeatureIO tests > ok 22 # Cannot complete FeatureIO tests > > However, same code runs fine on my debian unstable machine (perl > 5.8.8). Perhaps this is a bug in debian stable's perl? > > I did some poking around through the code, changing @ISA = qw/.../ to > use base, switching the order of inclusion in the ISA at the top of > Bio::SeqFeature::Annotated, no dice. > > Anybody able to reproduce this? Anyone have any ideas? > > Rob > > -- > Robert Buels > SGN Bioinformatics Analyst > 252A Emerson Hall, Cornell University > Ithaca, NY 14853 > Tel: 503-889-8539 > rmb32 at cornell.edu > http://www.sgn.cornell.edu > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From chandan.kr.singh at gmail.com Fri Jul 7 01:23:40 2006 From: chandan.kr.singh at gmail.com (CHANDAN SINGH) Date: Fri, 7 Jul 2006 10:53:40 +0530 Subject: [Bioperl-l] PrimarySeqI object Exception In-Reply-To: References: <20060706181312.63581.qmail@web35607.mail.mud.yahoo.com> Message-ID: <2d4f320607062223y520a1375lb30cf40c1c883702@mail.gmail.com> Hi By default , id is the first word encountered i.e, the first string after ">" seperated from the rest by a space. 
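A minimal sketch of the first-word rule described above, in plain Perl with no BioPerl calls. It derives the default ID from a FASTA header and shows why a lookup that uses the whole description line would miss. The helper name default_fasta_id and the %index hash are hypothetical stand-ins for illustration only; they are not how Bio::DB::Fasta is implemented.

```perl
use strict;
use warnings;

# Hypothetical helper: take the first whitespace-delimited word after '>'.
sub default_fasta_id {
    my ($header) = @_;
    my ($id) = $header =~ /^>\s*(\S+)/;
    return $id;    # undef if the line is not a FASTA header
}

my $header = '>gi|59711891|ref|YP_204667.1| NAD-specific glutamate '
           . 'dehydrogenase [Vibrio fischeri ES114]*VIB*COG2902*E';
my $id = default_fasta_id($header);

# Stand-in for an index keyed by the derived ID (assumption, not the real index).
my %index = ( $id => 'sequence record' );

# The full description line, which is what the array in the question may hold.
my $full_name = substr( $header, 1 );

print "derived id: $id\n";
print "lookup by derived id: ", ( exists $index{$id}        ? 'found' : 'missing' ), "\n";
print "lookup by full name:  ", ( exists $index{$full_name} ? 'found' : 'missing' ), "\n";
```

If the lookup key and the indexed key differ in this way, the fetch returns nothing, and that empty value is what ends up being passed to write_seq.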
The sample ID you mentioned in your first mail contains spaces and, as I mentioned in my previous mail, I am sure the IDs made by indexing and the ones you are using in the array are different. You can see the IDs used in indexing with:

    my @ids = $db->ids();
    print join("\n", @ids);

Cheers
Chandan

On 7/7/06, Brian Osborne wrote:
>
> sss lll,
>
> What this error means is that $obj is not a valid Sequence object, this is
> what's passed to the write_seq method. What identifier is
> $array_gene_name[$p]?
>
> Brian O.
>
>
> On 7/6/06 2:13 PM, "sss lll" wrote:
>
> > Hi there,
> >
> > I encountered a problem while calling module
> > PrimarySeqI, with the following code:
> >
> > my $db=Bio::DB::Fasta->new($fafile);
> > my $obj=$db->get_Seq_by_id($array_gene_name[$p]);
> > $seqio->write_seq($obj);
> >
> > The error message was:
> > MSG: Did not provide a valid Bio::PrimarySeqI object
> > STACK Bio::SeqIO::fasta::write_seq
> > /usr/lib/perl5/site_perl/5.8.0/Bio/SeqIO/fasta.pm:178
> >
> > We think it had to do with the lengh of the gene name.
> > For example the following gene name was a problem:
> >
> > gi|59711891|ref|YP_204667.1| NAD-specific glutamate
> > dehydrogenase [Vibrio fischeri ES114]*VIB*COG2902*E
> >
> > Any ideas on what happened?
> >
> > Thanks
> >
> > __________________________________________________
> > Do You Yahoo!?
> > Tired of spam? Yahoo!
Mail has the best spam protection around > > http://mail.yahoo.com > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > From selvik at ufl.edu Fri Jul 7 12:07:03 2006 From: selvik at ufl.edu (Selvi Kadirvel) Date: Fri, 7 Jul 2006 12:07:03 -0400 Subject: [Bioperl-l] Bio::SearchIO::hmmer hsp behaviour In-Reply-To: <001a01c6a048$cb802420$15327e82@pyrimidine> References: <001a01c6a048$cb802420$15327e82@pyrimidine> Message-ID: <1A5235F4-87E6-42D7-9796-7FEB8F7C04E5@ufl.edu> Chris: I just tried it out, and it looks like this solution works fine for me. Thank you for the fix! -Selvi On Jul 5, 2006, at 11:36 AM, Chris Fields wrote: > Okay, I managed to figure out what the problem was. I committed a > fix in > CVS for the initial bug (Selvi's missing hits). Still has one HSP > per hit > for now; I think it will take a bit more effort to get a BLAST-like > multi > HSP/hit up and running. > > Selvi, update from CVS to see if that works. > > Chris > >> -----Original Message----- >> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- >> bounces at lists.open-bio.org] On Behalf Of Chris Fields >> Sent: Friday, June 30, 2006 12:44 PM >> To: Sendu Bala; Jason Stajich >> Cc: bioperl-l at lists.open-bio.org list >> Subject: Re: [Bioperl-l] Bio::SearchIO::hmmer hsp behaviour >> >> I'll try looking at it this weekend. A suggested workaround is to >> either try setting -A for no alignments or setting it to a high >> number to retrieve all of them. It's pretty serious as the error >> silently dumps those domains, so for those using automated annotation >> pipelines would miss it unless they are also checking the raw output. 
>> >> You could design a SearchIO::hmmpfam parser then expand it to take in >> hmmsearch output at a later point, or keep them separate. I like the >> idea of having modules that are more specific about what they parse; >> seems at some point you reach serious code bloat and maintenance >> becomes an issue. Look at SearchIO::blast; it parses various text >> BLAST output very well but with some serious obfuscation. Just don't >> know how productive it would be to separate out the PSI-BLAST and >> bl2seq stuff since they are pretty close to a standard BLAST >> report... oh well. >> >> To Jason : good luck on your move. Drop us a line here to let us >> know everything went well. >> >> Chris >> >> On Jun 30, 2006, at 11:14 AM, Sendu Bala wrote: >> >>> Chris Fields wrote: >>>> It may have been just simpler to have it be one HSP (domain) per >>>> Hit >>>> (model) as that's how the reports are generated. My reasoning was >>>> that >>>> using the one domain per model made sense based on what you are >>>> actually >>>> trying to do, which is annotate the sequence based on the order the >>>> domain appears. Most others may not view it that way, which is >>>> fine. >>>> One can always gather the relevant HSP's, convert to seqfeatures, >>>> then >>>> sort them if order is important, I suppose. >>>> >>>> I would say, if the overall consensus is to modify it to have >>>> multiple >>>> domain hits per model (similar to BLAST) then Sendu should go >>>> ahead and >>>> make those changes then announce it on the list so no one can gripe >>>> about it later. My main concern was not changing things so >>>> dramatically >>>> that it'll break for someone >>> >>> Going on your earlier suggestion, I was thinking about making >>> SearchIO::hmmpfam instead, which would get used if you set the >>> format to >>> 'hmmpfam' instead of the generic 'hmmer' when making a SearchIO. I >>> suppose I would make a SearchIO::hmmsearch as well, if necessary. >>> >>> >>> [...] 
>>>> that the reported bug about missing hits (Bug 2036) is fixed as >>>> well. >>> >>> However, having never made a SearchIO plugin before, it will be some >>> time before I get my head around it. I'll want to make one the >>> current >>> HOWTO:SearchIO way before I can think about doing it a better way >>> (hashes) as well. So I can say I'll make a move on this at some >>> point in >>> the future, but if someone wants to fix Bug 2036 in the mean time, >>> they >>> are welcome to. Again as suggested, my priority is Bio::Map right >>> now. >>> _______________________________________________ >>> Bioperl-l mailing list >>> Bioperl-l at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> >> Christopher Fields >> Postdoctoral Researcher >> Lab of Dr. Robert Switzer >> Dept of Biochemistry >> University of Illinois Urbana-Champaign >> >> >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > > From cjfields at uiuc.edu Fri Jul 7 12:16:30 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Fri, 7 Jul 2006 11:16:30 -0500 Subject: [Bioperl-l] Bio::SeqFeatureI spliced_seq Message-ID: <002a01c6a1e0$b4e2b360$15327e82@pyrimidine> There is a reported bug (Bug 2039) which I found an easy fix for; the issue is that spliced_seq, as currently implemented, has two optional arguments: my ($self, $db, $nosort) = @_; $db is-a Bio::DB::RandomAccessI; $nosort is a flag so that locations aren't sorted before splicing, which is crux of the bug. So, to set $nosort you must also set $db to either undef or a Bio::DB::RandomAccessI (a point not made in the docs and not immediately clear to the user). Would it make more sense to have something like this (using $self->_rearrange to get the options)? 
    my $seq = $sf->spliced_seq(-nosort => 1);
    my $seq = $sf->spliced_seq(-db => $db);
    my $seq = $sf->spliced_seq(-nosort => 1, -db => $db);

Christopher Fields
Postdoctoral Researcher - Switzer Lab
Dept. of Biochemistry
University of Illinois Urbana-Champaign

From vebaev at gmail.com Sat Jul 8 16:59:40 2006
From: vebaev at gmail.com (Vesselin Baev)
Date: Sat, 08 Jul 2006 23:59:40 +0300
Subject: [Bioperl-l] BLAST running options
Message-ID: <44B01CBC.9070404@gmail.com>

Hi,
I'm parsing BLAST results, but I need a BLAST option to limit the number of results.
I'm blasting an oligo of about 40 nt and I need only hits that cover the full length of the query (40), either matching exactly or with mismatches (not more than 10).
I do not want the large number of results that BLAST gives me for shorter matches.

Does anyone know what kind of BLAST option to use?
Thanks

--
------------------------------------------------

University of Plovdiv
Faculty of Biology
Dept. Molecular Biology and Plant Physiology
Tzar Asen 24
Plovdiv 4000, BULGARIA
vebaev at gmail.com
tel.00359889034044

From cjfields at uiuc.edu Sat Jul 8 19:15:29 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Sat, 8 Jul 2006 18:15:29 -0500
Subject: [Bioperl-l] BLAST running options
In-Reply-To: <44B01CBC.9070404@gmail.com>
References: <44B01CBC.9070404@gmail.com>
Message-ID: <95D47990-9B63-444D-B386-04219D21DC39@uiuc.edu>

There were some posts about this a few months back.

http://bioperl.org/pipermail/bioperl-l/2006-April/021341.html

Essentially, most responders suggested not using BLAST, but I believe there were a few who gave pointers.

Chris

On Jul 8, 2006, at 3:59 PM, Vesselin Baev wrote:

> Hi,
> I'm parsing Blast results, but I need an Blast option to limit
> limit and
> decrease the Blast number of results.
> I'm blasting an oligo about 40nt and I need only results which are
> with
> mismatches (not more than 10) or exactly matching but in the length as
> the query - 40.
> I do not want all the big amount of results that blast gave me about > shorter matching. > > Do anyone knows what king of BLAST option to use? > Thanks > > -- > ------------------------------------------------ > > University of Plovdiv > Faculty of Biology > Dept. Molecular Biology and Plant Physiology > Tzar Asen 24 > Plovdiv 4000, BULGARIA > vebaev at gmail.com > tel.00359889034044 > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From cjfields at uiuc.edu Mon Jul 10 17:09:12 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 10 Jul 2006 16:09:12 -0500 Subject: [Bioperl-l] How to use gi2taxonid Message-ID: <000301c6a465$182025d0$15327e82@pyrimidine> Hubert, In case you didn't get this going, there may be another option now. I have started work on a new set of modules called Bio::DB::EUtilities in bioperl-live, intended as a back-end for NCBI database searches. It can be used directly if needed though. You can use EPost/Elink to directly retrieve the taxonIDs using the following script (pass a file containing the protein/nucleotide primary ID on command line). The below retrieves taxonid's using protein GI's: use Bio::DB::EUtilities; my @ids; while (my $id = <>) { chomp $id; push @ids, $id; } my $epost = Bio::DB::EUtilities->new( -eutil => 'epost', -db => 'protein', -id => \@ids, ); $epost->get_response; my $elink = Bio::DB::EUtilities->new( -eutil => 'elink', -cookie => $epost->next_cookie, -db => 'taxonomy', ); $elink->get_response; my @tax_ids = $elink->get_db_ids; Chris > hi, > I have downloaded the gi2taxonid file to get the taxonid for a GI > number > taken from a report as recommended here, but I don't know how to > use the > gi2taxonid file. 
> Jason wrote in a previous post that you have to make a DB_File out of > it, but I don't know how....and finally tie it to a hash.... > Can anybody give me a hint how to use it..... my final goal is to get > the taxonomy. > > thanks > Hubert Christopher Fields Postdoctoral Researcher - Switzer Lab Dept. of Biochemistry University of Illinois Urbana-Champaign From hubert.prielinger at gmx.at Mon Jul 10 19:53:26 2006 From: hubert.prielinger at gmx.at (Hubert Prielinger) Date: Mon, 10 Jul 2006 17:53:26 -0600 Subject: [Bioperl-l] How to use gi2taxonid In-Reply-To: <000301c6a465$182025d0$15327e82@pyrimidine> References: <000301c6a465$182025d0$15327e82@pyrimidine> Message-ID: <44B2E876.2020200@gmx.at> Hi Chris, thanks for your response, actually I have done it with the EUtils, because I have only accession ids and there is no possibility to retrieve the taxonomy directly for an accession id. Because the xml files you retrieve are very small, I first assign accession id to esearch, parse the Uid from the xml file, assign Uid to esummary, parse tax id from xml and finally, assign tax id to esummary again and retrieve taxonomy and parse it..... I know a little bit intricatley, but it works fine.....thanks regards Hubert Chris Fields wrote: > Hubert, > > In case you didn't get this going, there may be another option now. I have > started work on a new set of modules called Bio::DB::EUtilities in > bioperl-live, intended as a back-end for NCBI database searches. It can be > used directly if needed though. You can use EPost/Elink to directly > retrieve the taxonIDs using the following script (pass a file containing the > protein/nucleotide primary ID on command line). 
The below retrieves > taxonid's using protein GI's: > > > use Bio::DB::EUtilities; > my @ids; > > while (my $id = <>) { > chomp $id; > push @ids, $id; > } > > my $epost = Bio::DB::EUtilities->new( > -eutil => 'epost', > -db => 'protein', > -id => \@ids, > ); > > $epost->get_response; > > my $elink = Bio::DB::EUtilities->new( > -eutil => 'elink', > -cookie => $epost->next_cookie, > -db => 'taxonomy', > ); > > $elink->get_response; > > my @tax_ids = $elink->get_db_ids; > > > > Chris > > >> hi, >> I have downloaded the gi2taxonid file to get the taxonid for a GI >> number >> taken from a report as recommended here, but I don't know how to >> use the >> gi2taxonid file. >> Jason wrote in a previous post that you have to make a DB_File out of >> it, but I don't know how....and finally tie it to a hash.... >> Can anybody give me a hint how to use it..... my final goal is to get >> the taxonomy. >> >> thanks >> Hubert >> > > Christopher Fields > Postdoctoral Researcher - Switzer Lab > Dept. of Biochemistry > University of Illinois Urbana-Champaign > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > From MEC at stowers-institute.org Mon Jul 10 20:25:11 2006 From: MEC at stowers-institute.org (Cook, Malcolm) Date: Mon, 10 Jul 2006 19:25:11 -0500 Subject: [Bioperl-l] Bio::Tools::Fgenesh bug? and fix? Message-ID: I am finding the Bio::Tools::Fgenesh parser to incorrectly handle the feature coordinates on - strand predictions. In particular, start & end are deliberately reversed if the strand is '-'. I guess this was a holdover from Genscan.pm and wasn't really tested !?!?! Or, perhaps fgenesh v 2.4 which I am running has different output in this respect compared to the version 2.0, against which this module was written. Or, perhaps my understanding is blotto (known to happen). Does anyone know for sure? If I comment out selected lines... 
    # if($predobj->strand() == 1) {
        $predobj->start($start);
        $predobj->end($end);
    # } else {
    #     $predobj->end($start);
    #     $predobj->start($end);
    # }

... then GFF produced by my naive fgenesh2gff script below is correct (at least w.r.t. strand and coordinates - GFF compatibility purists might wince).

Should I commit this change to head?

Malcolm Cook
Database Applications Manager, Bioinformatics
Stowers Institute for Medical Research

    #!/usr/bin/env perl

    # fgenesh2gff
    # PURPOSE: parse fgenesh output into gff
    # USAGE: fgenesh fish somefish.dna | fgenesh2gff > somefish.dna.fgenesh.gff

    use strict;
    use warnings;
    use Bio::Tools::Fgenesh;
    use Bio::Tools::GFF;    # the writer used below (the posted script loaded Bio::FeatureIO instead)

    # Remaining options should name files to process, but if none, process
    # standard input:
    @ARGV = ('-') unless @ARGV;
    my $fgenesh = Bio::Tools::Fgenesh->new(-fh => \*ARGV);

    my $featureout = Bio::Tools::GFF->new(
        -gff_version => 2,    # whatever ;)
    );

    my $IDNUM = 0;
    while (my $gene = $fgenesh->next_prediction()) {
        # Tag each gene with an ID and each of its exons with a matching Parent
        my $ID = "fgenesh" . ++$IDNUM;
        $gene->add_tag_value('ID', $ID);
        $featureout->write_feature($gene);
        foreach ($gene->exons()) {
            $_->add_tag_value('Parent', $ID);
            $_->seq_id($gene->seq_id);
            $featureout->write_feature($_);
        }
    }
    $fgenesh->close();

    exit 0;

From chris at dwan.org Mon Jul 10 22:06:41 2006
From: chris at dwan.org (Christopher Dwan)
Date: Mon, 10 Jul 2006 22:06:41 -0400
Subject: [Bioperl-l] Bio::Tools::Fgenesh bug? and fix?
In-Reply-To:
References:
Message-ID:

I'm not surprised that there are parts that don't work right. I copied genscan.pm and made the absolute minimal changes required to get what I needed working. Haven't touched it since.

Please feel free to do what needs to be done, and sorry about the mess.

-Chris Dwan

On Jul 10, 2006, at 8:25 PM, Cook, Malcolm wrote:

> I am finding the Bio::Tools::Fgenesh parser to incorrectly handle the
> feature coordinates on - strand predictions.
>
> In particular, start & end are deliberately reversed if the strand is
> '-'.
> > I guess this was a holdover from Genscan.pm and wasn't really tested > !?!?! > > Or, perhaps fgenesh v 2.4 which I am running has different output in > this respect compared to the version 2.0, against which this module > was > written. > > Or, perhaps my understanding is blotto (known to happen). > > Does anyone know for sure? > > If I comment out selected lines... > > # if($predobj->strand() == 1) { > $predobj->start($start); > $predobj->end($end); > # } else { > # $predobj->end($start); > # $predobj->start($end); > # } > > ... then GFF produced by my naive fgenesh2gff script below is correct > (at least w.r.t. strand and coordinates - GFF compatibility purists > might wince). > > Should I commit this change to head? > > > Malcolm Cook > Database Applications Manager, Bioinformatics > Stowers Institute for Medical Research > > > > #!/usr/bin/env perl > > # fgenesh2gff > # PURPOSE: parse fgenesh output into gff > # USAGE: fgenesh fish somefish.dna | fgenesh2gff > > somefish.dna.fgenesh.gff > > use strict; > use warnings; > use Bio::Tools::Fgenesh; > use Bio::FeatureIO; > > # Remaining options should name files to process, but if none, process > # standard input: > @ARGV = ('-') unless @ARGV; > my $fgenesh = Bio::Tools::Fgenesh->new(-fh => \*ARGV); > > my $featureout = new Bio::Tools::GFF( > -gff_version => 2, #whatever ;) > ); > my $IDNUM = 0; > while (my $gene = $fgenesh->next_prediction()) { > my $ID = "fgenesh" . ++ $IDNUM; > $gene->add_tag_value('ID', $ID); > $featureout->write_feature($gene); > foreach ($gene->exons()) { > $_->add_tag_value('Parent', $ID); > $_->seq_id($gene->seq_id); > $featureout->write_feature($_); > } > } > $fgenesh->close(); > > exit 0; > From rvosa at sfu.ca Tue Jul 11 04:58:46 2006 From: rvosa at sfu.ca (Rutger Vos) Date: Tue, 11 Jul 2006 01:58:46 -0700 Subject: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading? 
Message-ID: <44B36846.8070103@sfu.ca>

Dear all,

would it be possible to overload Bio::Root::RootI's 'throw' method to
accept an additional, optional (positional) argument to define the
exception class, e.g. using Exception::Class:

# ...somewhere ...

sub makefh {
    my ( $self, $filename ) = @_;
    open my $fh, '<', $filename or $self->throw("Can't open file: $!",
        'Bio::Exceptions::FileIO'); # NOTE second argument
    return $fh;
}

#.... somewhere else
my $fh;
eval {
    $fh = $obj->makefh( 'data.txt');
};
if ( $@ and $@->isa('Bio::Exceptions::FileIO') ) {
    # something's wrong with the file?
}

--
++++++++++++++++++++++++++++++++++++++++++++++++++++
Rutger Vos, PhD. candidate
Department of Biological Sciences
Simon Fraser University
8888 University Drive
Burnaby, BC, V5A1S6
Phone: 604-291-5625
Fax: 604-291-3496
Personal site: http://www.sfu.ca/~rvosa
FAB* lab: http://www.sfu.ca/~fabstar
Bio::Phylo: http://search.cpan.org/~rvosa/Bio-Phylo/
++++++++++++++++++++++++++++++++++++++++++++++++++++

From khoiwal_tara at yahoo.co.in Tue Jul 11 08:19:17 2006
From: khoiwal_tara at yahoo.co.in (Khoiwal Tara)
Date: Tue, 11 Jul 2006 05:19:17 -0700 (PDT)
Subject: [Bioperl-l] Need help in needle parser
Message-ID: <20060711121918.64429.qmail@web8513.mail.in.yahoo.com>

Hi,
I want to parse the output of needle. I tried, but wasn't able to get
the expected output.

My code is as follows:

#!/usr/local/bin/perl

use strict;
use warnings;
use Bio::AlignIO;
my $needleReport = $ARGV[0];

my $in = new Bio::AlignIO(-format => 'emboss',-file =>$needleReport);

while(my $align = $in->next_aln()){
    print "Alignment Length:".$align->length()."\n";
    print "Percentage Identity:".$align->percentage_identity()."\n";
    print "Consensus string:".$align->consensus_string()."\n";
    print "Number of sequences:".$align->no_sequence()."\n";
    print "Number of residues:".$align->no_residues()."\n";
}

But it doesn't go inside the while loop.
Please help me.
How do I find the alignment position of the query sequence on the
target sequence from the needle output?
Where can I find good documentation on the needle parser and its usage?
I am also looking for a good document on bioperl for beginners.

Regards,
Tara Khoiwal.

From cjfields at uiuc.edu Tue Jul 11 08:59:07 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Tue, 11 Jul 2006 07:59:07 -0500
Subject: [Bioperl-l] Need help in needle parser
In-Reply-To: <20060711121918.64429.qmail@web8513.mail.in.yahoo.com>
References: <20060711121918.64429.qmail@web8513.mail.in.yahoo.com>
Message-ID: <250EEE60-48D0-4844-B0C0-13E17E60965C@uiuc.edu>

perldoc Bio::AlignIO
perldoc Bio::AlignIO::needle

http://www.bioperl.org/wiki/FAQ
http://www.bioperl.org/wiki/HOWTO:Beginners
http://www.bioperl.org/wiki/Bptutorial.pl
http://www.catb.org/~esr/faqs/smart-questions.html

Google is your friend!

If it isn't entering the while loop, there are two possibilities:

1) Something is wrong with the file
2) The parser isn't reading the file correctly

In order to know which, we will need to see the alignment itself.

Chris

On Jul 11, 2006, at 7:19 AM, Khoiwal Tara wrote:

> Hi,
> I want to parse the output of needle.I tried but didn't able to
> get expected output.
>
> My code is as follows:
>
> #!/usr/local/bin/perl
>
> use strict;
> use warnings;
> use Bio::AlignIO;
> my $needleReport = $ARGV[0];
>
> my $in = new Bio::AlignIO(-format => 'emboss',-file =>$needleReport);
>
> while(my $align = $in->next_aln()){
> print "Alignment Length:".$align->length()."\n";
> print "Percentage Identity:".$align->percentage_identity()."\n";
> print "Consensus string:".$align->consensus_string()."\n";
> print "Number of sequences:".$align->no_sequence()."\n";
> print "Number of residues:".$align->no_residues()."\n";
> }
>
> But it doesn't go inside the while loop.
> Pls help me.
> How to find the alignment position for the query sequence on the > target sequence from the needle output? > Where can i find the good documentation on needle parser and its > usage? > Good document on bioperl for beginners. > > Regards, > Tara Khoiwal. > > > --------------------------------- > Sneak preview the all-new Yahoo.com. It's not radically different. > Just radically better. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From cjfields at uiuc.edu Tue Jul 11 09:13:23 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 11 Jul 2006 08:13:23 -0500 Subject: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading? In-Reply-To: <44B36846.8070103@sfu.ca> References: <44B36846.8070103@sfu.ca> Message-ID: <954D76EB-F8D3-4652-957B-6ED4BBAEE209@uiuc.edu> I suppose you could; Bio::Root::Root does that using Error.pm (if it is installed). It almost sounds like what Bio::Root::Root does is what you want, but you want a little more information when exceptions are thrown maybe? from perldoc Bio::Root::Root: ... # Alternatively, using the new typed exception syntax in the throw() call: $obj->throw( -class => 'Bio::Root::BadParameter', -text => "Can not open file $file", -value => $file); ... Typed Exception Syntax The typed exception syntax of throw() has the advantage of plainly indicating the nature of the trouble, since the name of the class is included in the title of the exception output. To take advantage of this capability, you must specify arguments as named parameters in the throw() call. Here are the parameters: -class name of the class of the exception. 
This should be one of the classes defined in Bio::Root::Exception, or a custom error of yours that extends one of the exceptions defined in Bio::Root::Exception. -text a sensible message for the exception -value the value causing the exception or $!, if appropriate. Note that Bio::Root::Exception does not need to be imported into your module (or script) namespace in order to throw exceptions via Bio::Root::Root::throw(), since Bio::Root::Root imports it. Chris On Jul 11, 2006, at 3:58 AM, Rutger Vos wrote: > Dear all, > > would it be possible to overload Bio::Root::RootI's 'throw' method to > accept an additional, optional (positional) argument to define the > exception class, e.g. using Exception::Class: > > # ...somewhere ... > > sub makefh { > my ( $self, $filename ) = @_; > open my $fh, '<' $filename or $self->throw("Can't open file: $!", > 'Bio::Exceptions::FileIO'); # NOTE second argument > return $fh; > } > > #.... somewhere else > my $fh; > eval { > $fh = $obj->makefh( 'data.txt'); > } > if ( $@ and $@->isa('Bio::Exceptions::FileIO') ) { > # something's wrong with the file? > } > > -- > ++++++++++++++++++++++++++++++++++++++++++++++++++++ > Rutger Vos, PhD. candidate > Department of Biological Sciences > Simon Fraser University > 8888 University Drive > Burnaby, BC, V5A1S6 > Phone: 604-291-5625 > Fax: 604-291-3496 > Personal site: http://www.sfu.ca/~rvosa > FAB* lab: http://www.sfu.ca/~fabstar > Bio::Phylo: http://search.cpan.org/~rvosa/Bio-Phylo/ > ++++++++++++++++++++++++++++++++++++++++++++++++++++ > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. 
Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign

From cjfields at uiuc.edu Tue Jul 11 11:25:32 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Tue, 11 Jul 2006 10:25:32 -0500
Subject: [Bioperl-l] Need help in needle parser
In-Reply-To: <20060711132601.46368.qmail@web8510.mail.in.yahoo.com>
Message-ID: <001601c6a4fe$3ff7ca10$15327e82@pyrimidine>

There are a few odd things about the data you sent; the FASTA files aren't
FASTA format (they are raw) and the needle output doesn't have sequence
names. You could try running these through needle with descriptors to see
if that helps, but it is very likely my option #2 (i.e. the parser doesn't
recognize the format).

There is a thread on the mail list about this issue:

http://thread.gmane.org/gmane.comp.lang.perl.bio.general/8926/focus=8935

Basically, it looks like the needle output has changed dramatically in
EMBOSS v3. Jason's suggested options from the above thread (as well as
mine):

...
I think the "emboss" format changed in 3.0.0

solutions:
a) fix the AlignIO::emboss parser to handle both flavors (old and new)
b) have it output MSF format and use AlignIO::msf.
...

So, as a workaround, use MSF output. I won't have time to look at this
anytime soon as I'm busy at $job and getting ready for a summer institute;
I'll submit this as a bug to see if someone else can tackle it before I get
back in early August.

Chris

_____

From: Khoiwal Tara [mailto:khoiwal_tara at yahoo.co.in]
Sent: Tuesday, July 11, 2006 8:26 AM
To: Chris Fields
Subject: Re: [Bioperl-l] Need help in needle parser

I am sending my testing data to you.
I have two fasta files "GenomicSeq.fasta" and "TranscriptSeq.fasta".
I ran needle on these files as follows:

$ needle GenomicSeq.fasta TranscriptSeq.fasta outfile.needle

So the output of needle will get stored in outfile.needle.
I am attaching the output file also.
Please check it and tell me if it has any problem.
Is my output file correct?

Thanks and Regards,
Tara.
Chris Fields wrote: perldoc Bio::AlignIO perldoc Bio::AlignIO::needle http://www.bioperl.org/wiki/FAQ http://www.bioperl.org/wiki/HOWTO:Beginners http://www.bioperl.org/wiki/Bptutorial.pl http://www.catb.org/~esr/faqs/smart-questions.html Google is your friend! If it isn't entering the while loop, there are two possibilities: 1) Something is wrong with the file 2) The parser isn't reading the file correctly In order to know which, we will need to see the alignment itself. Chris On Jul 11, 2006, at 7:19 AM, Khoiwal Tara wrote: > Hi, > I want to parse the output of needle.I tried but didn't able to > get expected output. > > My code is as follows: > > #!/usr/local/bin/perl > > use strict; > use warnings; > use Bio::AlignIO; > my $needleReport = $ARGV[0]; > > my $in = new Bio::AlignIO(-format => 'emboss',-file =>$needleReport); > > while(my $align = $in->next_aln()){ > print "Alignment Length:".$align->length()."\n"; > print "Percentage Identity:".$align->percentage_identity()."\n"; > print "Consensus string:".$align->consensus_string()."\n"; > print "Number of sequences:".$align->no_sequence()."\n"; > print "Number of residues:".$align->no_residues()."\n"; > } > > But it doesn't go inside the while loop. > Pls help me. > How to find the alignment position for the query sequence on the > target sequence from the needle output? > Where can i find the good documentation on needle parser and its > usage? > Good document on bioperl for beginners. > > Regards, > Tara Khoiwal. > > > --------------------------------- > Sneak preview the all-new Yahoo.com. It's not radically different. > Just radically better. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign _____ Do you Yahoo!? Next-gen email? Have it all with the all-new Yahoo! 
Mail Beta. From MEC at stowers-institute.org Tue Jul 11 11:56:40 2006 From: MEC at stowers-institute.org (Cook, Malcolm) Date: Tue, 11 Jul 2006 10:56:40 -0500 Subject: [Bioperl-l] Bio::Tools::Fgenesh bug? and fix? Message-ID: Got it. Commits made. Thanks for the history lesson. Cheers, Malcolm Cook >-----Original Message----- >From: Christopher Dwan [mailto:chris at dwan.org] >Sent: Monday, July 10, 2006 9:07 PM >To: Cook, Malcolm >Cc: bioperl-l >Subject: Re: Bio::Tools::Fgenesh bug? and fix? > > >I'm not surprised that there are parts that don't work right, I coped >genscan.pm and made the absolute minimal changes required to get what >I needed working. Haven't touched it since. > >Please feel free to do what needs to be done, and sorry about the mess. > >-Chris Dwan > >On Jul 10, 2006, at 8:25 PM, Cook, Malcolm wrote: > >> I am finding the Bio::Tools::Fgenesh parser to incorrectly handle the >> feature coordinates on - strand predictions. >> >> In particular, start & end are deliberately reversed if the strand is >> '-'. >> >> I guess this was a holdover from Genscan.pm and wasn't really tested >> !?!?! >> >> Or, perhaps fgenesh v 2.4 which I am running has different output in >> this respect compared to the version 2.0, against which this module >> was >> written. >> >> Or, perhaps my understanding is blotto (known to happen). >> >> Does anyone know for sure? >> >> If I comment out selected lines... >> >> # if($predobj->strand() == 1) { >> $predobj->start($start); >> $predobj->end($end); >> # } else { >> # $predobj->end($start); >> # $predobj->start($end); >> # } >> >> ... then GFF produced by my naive fgenesh2gff script below is correct >> (at least w.r.t. strand and coordinates - GFF compatibility purists >> might wince). >> >> Should I commit this change to head? 
>> >> >> Malcolm Cook >> Database Applications Manager, Bioinformatics >> Stowers Institute for Medical Research >> >> >> >> #!/usr/bin/env perl >> >> # fgenesh2gff >> # PURPOSE: parse fgenesh output into gff >> # USAGE: fgenesh fish somefish.dna | fgenesh2gff > >> somefish.dna.fgenesh.gff >> >> use strict; >> use warnings; >> use Bio::Tools::Fgenesh; >> use Bio::FeatureIO; >> >> # Remaining options should name files to process, but if >none, process >> # standard input: >> @ARGV = ('-') unless @ARGV; >> my $fgenesh = Bio::Tools::Fgenesh->new(-fh => \*ARGV); >> >> my $featureout = new Bio::Tools::GFF( >> -gff_version => 2, #whatever ;) >> ); >> my $IDNUM = 0; >> while (my $gene = $fgenesh->next_prediction()) { >> my $ID = "fgenesh" . ++ $IDNUM; >> $gene->add_tag_value('ID', $ID); >> $featureout->write_feature($gene); >> foreach ($gene->exons()) { >> $_->add_tag_value('Parent', $ID); >> $_->seq_id($gene->seq_id); >> $featureout->write_feature($_); >> } >> } >> $fgenesh->close(); >> >> exit 0; >> > > From cjfields at uiuc.edu Tue Jul 11 12:04:49 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 11 Jul 2006 11:04:49 -0500 Subject: [Bioperl-l] Need help in needle parser In-Reply-To: <20060711132601.46368.qmail@web8510.mail.in.yahoo.com> Message-ID: <000101c6a503$bd982eb0$15327e82@pyrimidine> Okay, I take that back. Bio::AlignIO::emboss does parse EMBOSS v3 needle output! The fact that it doesn't parse your alignment is b/c there are no sequence descriptors in the file for the sequences (your FASTA files didn't have them either). Modifying the file to contain descriptions for both the alignment and the 'Aligned_sequences:' section gets your test alignment to work. I consider this a feature and not a bug; how would others be able to distinguish between numerous sequences in an alignment w/o identifiers of some sort? It shouldn't just toss this out without a warning however; I'll try to add a little exception handling. 
BTW, one line is incorrect in your script; it should be print "Number of sequences:".$align->no_sequences()."\n"; you have print "Number of sequences:".$align->no_sequence()."\n"; Chris _____ From: Khoiwal Tara [mailto:khoiwal_tara at yahoo.co.in] Sent: Tuesday, July 11, 2006 8:26 AM To: Chris Fields Subject: Re: [Bioperl-l] Need help in needle parser I am sending my testing data to you. I have two fasta files "GenomicSeq.fasta" and "TranscriptSeq.fasta". I ran needle on these files as follows: $ needle GenomicSeq.fasta TranscriptSeq.fasta outfile.needle So the out put of the needle will get stored in outfile.needle. I am attaching the output file also. Please check it and tell me if it has any problem. Is my output file is correct? Thanks and Regards, Tara. Chris Fields wrote: perldoc Bio::AlignIO perldoc Bio::AlignIO::needle http://www.bioperl.org/wiki/FAQ http://www.bioperl.org/wiki/HOWTO:Beginners http://www.bioperl.org/wiki/Bptutorial.pl http://www.catb.org/~esr/faqs/smart-questions.html Google is your friend! If it isn't entering the while loop, there are two possibilities: 1) Something is wrong with the file 2) The parser isn't reading the file correctly In order to know which, we will need to see the alignment itself. Chris On Jul 11, 2006, at 7:19 AM, Khoiwal Tara wrote: > Hi, > I want to parse the output of needle.I tried but didn't able to > get expected output. > > My code is as follows: > > #!/usr/local/bin/perl > > use strict; > use warnings; > use Bio::AlignIO; > my $needleReport = $ARGV[0]; > > my $in = new Bio::AlignIO(-format => 'emboss',-file =>$needleReport); > > while(my $align = $in->next_aln()){ > print "Alignment Length:".$align->length()."\n"; > print "Percentage Identity:".$align->percentage_identity()."\n"; > print "Consensus string:".$align->consensus_string()."\n"; > print "Number of sequences:".$align->no_sequence()."\n"; > print "Number of residues:".$align->no_residues()."\n"; > } > > But it doesn't go inside the while loop. 
> Pls help me. > How to find the alignment position for the query sequence on the > target sequence from the needle output? > Where can i find the good documentation on needle parser and its > usage? > Good document on bioperl for beginners. > > Regards, > Tara Khoiwal. > > > --------------------------------- > Sneak preview the all-new Yahoo.com. It's not radically different. > Just radically better. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign _____ Do you Yahoo!? Next-gen email? Have it all with the all-new Yahoo! Mail Beta. From wrp at virginia.edu Tue Jul 11 14:05:29 2006 From: wrp at virginia.edu (William R. Pearson) Date: Tue, 11 Jul 2006 14:05:29 -0400 Subject: [Bioperl-l] Course announcement: CSHL Computational Genomics Course In-Reply-To: References: Message-ID: <45D80228-35DE-44B0-9E11-48EC76CE0DE7@virginia.edu> Course announcement - Application deadline, July 15, 2006 ================================================================ Cold Spring Harbor COMPUTATIONAL & COMPARATIVE GENOMICS November 8 - 14, 2006 Application Deadline: July 15, 2006 INSTRUCTORS: Pearson, William, Ph.D., University of Virginia, Charlottesville, VA Smith, Randall, Ph.D., SmithKline Beecham Pharmaceuticals, King of Prussia, PA Beyond BLAST and FASTA - Alignment: from proteins to genomes - This course presents a comprehensive overview of the theory and practice of computational methods for extracting the maximum amount of information from protein and DNA sequence similarity through sequence database searches, statistical analysis, and multiple sequence alignment, and genome scale alignment. 
Additional topics include gene finding, identifying signals in unaligned
sequences, and integration of genetic and sequence information in
biological databases. The course combines lectures with hands-on
exercises; students are encouraged to pose challenging sequence analysis
problems using their own data. The course makes extensive use of local
WWW pages to present problem sets and the computing tools to solve them.
Students use Windows and Mac workstations attached to a UNIX server;
participants should be comfortable using the Unix operating system and a
Unix text editor.

The course is designed for biologists seeking advanced training in
biological sequence analysis, computational biology core resource
directors and staff, and for scientists in other disciplines, such as
computer science, who wish to survey current research problems in
biological sequence analysis and comparative genomics.

The primary focus of the Computational and Comparative Genomics Course
is the theory and practice of algorithms used in computational biology,
with the goal of using current methods more effectively and developing
new algorithms. Cold Spring Harbor also offers a "Programming for
Biology" course, which focuses more on software development.

Over the past few years, the course has been expanded to cover more
algorithms and exercises on comparative genomics and genome databases.

For additional information and the lecture schedule and problem sets for
the 2005 course, see:

http://fasta.bioch.virginia.edu/cshl05

================================================================
To apply to the course, fill out the form at:

http://meetings.cshl.edu/courses/courseapplication.asp
================================================================

From rvosa at sfu.ca Tue Jul 11 14:58:25 2006
From: rvosa at sfu.ca (Rutger Vos)
Date: Tue, 11 Jul 2006 11:58:25 -0700
Subject: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading?
In-Reply-To: <954D76EB-F8D3-4652-957B-6ED4BBAEE209@uiuc.edu>
References: <44B36846.8070103@sfu.ca> <954D76EB-F8D3-4652-957B-6ED4BBAEE209@uiuc.edu>
Message-ID: <44B3F4D1.7090804@sfu.ca>

I must have overlooked this. I think it does what I want. So could I do
something like:

$obj->throw_not_implemented( -class => 'Bio::Root::NotImplemented' );

...in interfaces?

Chris Fields wrote:
> I suppose you could; Bio::Root::Root does that using Error.pm (if it
> is installed). It almost sounds like what Bio::Root::Root does is
> what you want, but you want a little more information when exceptions
> are thrown maybe?
>
> from perldoc Bio::Root::Root:
>
> ...
> # Alternatively, using the new typed exception syntax in
> the throw() call:
>
> $obj->throw( -class => 'Bio::Root::BadParameter',
> -text => "Can not open file $file",
> -value => $file);
> ...
>
> Typed Exception Syntax
>
> The typed exception syntax of throw() has the advantage of plainly
> indicating the nature of the trouble, since the name of the class is
> included in the title of the exception output.
>
> To take advantage of this capability, you must specify arguments as
> named parameters in the throw() call. Here are the parameters:
>
> -class
> name of the class of the exception. This should be one of the
> classes defined in Bio::Root::Exception, or a custom error of yours
> that extends one of the exceptions defined in Bio::Root::Exception.
>
> -text
> a sensible message for the exception
>
> -value
> the value causing the exception or $!, if appropriate.
>
> Note that Bio::Root::Exception does not need to be imported into your
> module (or script) namespace in order to throw exceptions via
> Bio::Root::Root::throw(), since Bio::Root::Root imports it.
> > > Chris > > On Jul 11, 2006, at 3:58 AM, Rutger Vos wrote: > > >> Dear all, >> >> would it be possible to overload Bio::Root::RootI's 'throw' method to >> accept an additional, optional (positional) argument to define the >> exception class, e.g. using Exception::Class: >> >> # ...somewhere ... >> >> sub makefh { >> my ( $self, $filename ) = @_; >> open my $fh, '<' $filename or $self->throw("Can't open file: $!", >> 'Bio::Exceptions::FileIO'); # NOTE second argument >> return $fh; >> } >> >> #.... somewhere else >> my $fh; >> eval { >> $fh = $obj->makefh( 'data.txt'); >> } >> if ( $@ and $@->isa('Bio::Exceptions::FileIO') ) { >> # something's wrong with the file? >> } >> >> -- >> ++++++++++++++++++++++++++++++++++++++++++++++++++++ >> Rutger Vos, PhD. candidate >> Department of Biological Sciences >> Simon Fraser University >> 8888 University Drive >> Burnaby, BC, V5A1S6 >> Phone: 604-291-5625 >> Fax: 604-291-3496 >> Personal site: http://www.sfu.ca/~rvosa >> FAB* lab: http://www.sfu.ca/~fabstar >> Bio::Phylo: http://search.cpan.org/~rvosa/Bio-Phylo/ >> ++++++++++++++++++++++++++++++++++++++++++++++++++++ >> >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> > > Christopher Fields > Postdoctoral Researcher > Lab of Dr. Robert Switzer > Dept of Biochemistry > University of Illinois Urbana-Champaign > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > -- ++++++++++++++++++++++++++++++++++++++++++++++++++++ Rutger Vos, PhD. 
candidate Department of Biological Sciences Simon Fraser University 8888 University Drive Burnaby, BC, V5A1S6 Phone: 604-291-5625 Fax: 604-291-3496 Personal site: http://www.sfu.ca/~rvosa FAB* lab: http://www.sfu.ca/~fabstar Bio::Phylo: http://search.cpan.org/~rvosa/Bio-Phylo/ ++++++++++++++++++++++++++++++++++++++++++++++++++++ From hlapp at gmx.net Tue Jul 11 15:05:03 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Tue, 11 Jul 2006 15:05:03 -0400 Subject: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading? In-Reply-To: <44B36846.8070103@sfu.ca> References: <44B36846.8070103@sfu.ca> Message-ID: <18C839F9-B099-4A4A-9957-2BF4EB7CFB85@gmx.net> I think it does this already, except that I believe you need to create the exception object and initialize with the message upfront. Steve, can you comment? Is this at least somewhat right? -hilmar On Jul 11, 2006, at 4:58 AM, Rutger Vos wrote: > Dear all, > > would it be possible to overload Bio::Root::RootI's 'throw' method to > accept an additional, optional (positional) argument to define the > exception class, e.g. using Exception::Class: > > # ...somewhere ... > > sub makefh { > my ( $self, $filename ) = @_; > open my $fh, '<' $filename or $self->throw("Can't open file: $!", > 'Bio::Exceptions::FileIO'); # NOTE second argument > return $fh; > } > > #.... somewhere else > my $fh; > eval { > $fh = $obj->makefh( 'data.txt'); > } > if ( $@ and $@->isa('Bio::Exceptions::FileIO') ) { > # something's wrong with the file? > } > > -- > ++++++++++++++++++++++++++++++++++++++++++++++++++++ > Rutger Vos, PhD. 
candidate > Department of Biological Sciences > Simon Fraser University > 8888 University Drive > Burnaby, BC, V5A1S6 > Phone: 604-291-5625 > Fax: 604-291-3496 > Personal site: http://www.sfu.ca/~rvosa > FAB* lab: http://www.sfu.ca/~fabstar > Bio::Phylo: http://search.cpan.org/~rvosa/Bio-Phylo/ > ++++++++++++++++++++++++++++++++++++++++++++++++++++ > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at gmx.net Tue Jul 11 15:05:54 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Tue, 11 Jul 2006 15:05:54 -0400 Subject: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading? In-Reply-To: <954D76EB-F8D3-4652-957B-6ED4BBAEE209@uiuc.edu> References: <44B36846.8070103@sfu.ca> <954D76EB-F8D3-4652-957B-6ED4BBAEE209@uiuc.edu> Message-ID: <297D4770-A963-4039-8D90-987CC570BA94@gmx.net> Alright - well spotted Chris. This is what I was looking for. On Jul 11, 2006, at 9:13 AM, Chris Fields wrote: > I suppose you could; Bio::Root::Root does that using Error.pm (if it > is installed). It almost sounds like what Bio::Root::Root does is > what you want, but you want a little more information when exceptions > are thrown maybe? > > from perldoc Bio::Root::Root: > > ... > # Alternatively, using the new typed exception syntax in > the throw() call: > > $obj->throw( -class => 'Bio::Root::BadParameter', > -text => "Can not open file $file", > -value => $file); > ... > > Typed Exception Syntax > > The typed exception syntax of throw() has the advantage of > plainly > indicating the nature of the trouble, since the name of the > class is > included in the title of the exception output. 
--
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================

From cjfields at uiuc.edu Tue Jul 11 16:42:35 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Tue, 11 Jul 2006 15:42:35 -0500
Subject: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading?
In-Reply-To: <44B3F4D1.7090804@sfu.ca>
Message-ID: <000b01c6a52a$8c77f8c0$15327e82@pyrimidine>

Bio::Root::Root doesn't overload throw_not_implemented from Bio::Root::RootI; from the comments, it looks like Steve C and Ewan B couldn't work out some of the Error.pm issues.

Judging by the POD for Bio::Root::RootI, throw_not_implemented doesn't accept arguments; it throws a Bio::Root::NotImplemented exception automatically.

Chris

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org
> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Rutger Vos
> Sent: Tuesday, July 11, 2006 1:58 PM
> To: Chris Fields
> Cc: 'Bioperl List'
> Subject: Re: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading?
>
> I must have overlooked this. I think it does what I want. So could I do
> something like:
>
> $obj->throw_not_implemented( -class => 'Bio::Root::NotImplemented' );
>
> ...in interfaces?

From frederick.partridge at st-johns.oxford.ac.uk Tue Jul 11 17:23:28 2006
From: frederick.partridge at st-johns.oxford.ac.uk (Frederick Partridge)
Date: Tue, 11 Jul 2006 22:23:28 +0100 (BST)
Subject: [Bioperl-l] Get nucleotide sequence when expecting protein from genpept
Message-ID:

I am trying to retrieve various protein sequences from GenPept using get_Seq_by_acc. All of them work OK except one, T16005.

If I try to retrieve it with a reduced program:

#!/usr/bin/perl -w

use strict;

use Bio::Perl;
use Bio::SeqIO;
use Bio::DB::GenPept;    # explicit import of the class used below

my $genpept = Bio::DB::GenPept->new;

my $seq = $genpept->get_Seq_by_acc('T16005');

# double quotes, so that "\n" is a newline and not a literal backslash-n
print $seq->seq(), "\n";

I get back a nucleotide sequence, which is another sequence at NCBI with the same accession number. (I thought these were meant to be unique, but evidently not.)

I am using bioperl 1.5.1, perl 5.8.1, Mac OS 10.3.

Could anyone help me to get this protein sequence with my program?

Many thanks,

Freddie Partridge
University of Oxford

From qfdong at iastate.edu Tue Jul 11 17:32:56 2006
From: qfdong at iastate.edu (Qunfeng)
Date: Tue, 11 Jul 2006 16:32:56 -0500
Subject: [Bioperl-l] Get nucleotide sequence when expecting protein from genpept
In-Reply-To:
References:
Message-ID: <6.1.2.0.2.20060711163128.08086570@qfdong.mail.iastate.edu>

This particular protein record (acc# T16005) was imported from PIR. In other words, this is not an original GenBank protein record. When GenBank imports protein records from other databases, it keeps their original acc#.
However, gi# should be unique.

Q

From cjfields at uiuc.edu Tue Jul 11 18:05:09 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Tue, 11 Jul 2006 17:05:09 -0500
Subject: [Bioperl-l] Get nucleotide sequence when expecting protein from genpept
In-Reply-To:
Message-ID: <000001c6a536$141befb0$15327e82@pyrimidine>

It's an imported PIR record, so there probably is no accession recorded in the database. I think NCBI falls back to nucleotide if it can't find a particular accession via protein. Using the primary ID (the GI#, 7498730) works.

Chris

From cjfields at uiuc.edu Tue Jul 11 18:47:38 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Tue, 11 Jul 2006 17:47:38 -0500
Subject: [Bioperl-l] Get nucleotide sequence when expecting protein from genpept
In-Reply-To: <000001c6a536$141befb0$15327e82@pyrimidine>
Message-ID: <000201c6a53c$03970ed0$15327e82@pyrimidine>

Okay, now try this:

use Bio::DB::GenPept;
use Bio::SeqIO;

# ask for FASTA and stream every record returned for the accession
my $factory = Bio::DB::GenPept->new(-format => 'fasta');
my $seqin   = $factory->get_Stream_by_acc('T16005');
my $seqout  = Bio::SeqIO->new(-fh => \*STDOUT, -format => 'fasta');

# write out each record in the stream, not just the first one
while (my $seq = $seqin->next_seq) {
    $seqout->write_seq($seq);
}

This returns both the nucleotide sequence and the correct protein sequence; the protein was returned second for some reason, so get_Seq_by_acc misses it while get_Stream_by_acc doesn't. I have notified NCBI about this issue, but they will likely just tell me to use the GI number for searches, as GIs are unique. Probably a good warning for anyone using accessions for all their work (I use the GI myself).
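
The difference between taking the first record and walking the whole stream can be sketched without BioPerl or a network connection. What follows is only a stand-in under stated assumptions: the two records and their sequences are made up, and guess_alphabet and first_protein are hypothetical helpers, not BioPerl calls. The alphabet guess is deliberately crude; a short peptide made only of A, C, G, T, U or N would be misread as nucleotide.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a record stream: two records sharing one
# accession, nucleotide first and protein second (made-up sequences).
my @records = (
    { id => 'T16005', seq => 'ATGGCGTTAGCCGATTACGGA' },
    { id => 'T16005', seq => 'MALADYGKRQWESTHLVIF'   },
);

# Crude alphabet guess: a sequence made only of A, C, G, T, U or N is
# treated as nucleotide; anything else counts as protein.
sub guess_alphabet {
    my ($seq) = @_;
    return $seq =~ /\A[ACGTUN]+\z/i ? 'dna' : 'protein';
}

# Walk the whole list instead of stopping at the first record.
sub first_protein {
    my (@recs) = @_;
    for my $rec (@recs) {
        return $rec if guess_alphabet( $rec->{seq} ) eq 'protein';
    }
    return;    # no protein record found
}

my $hit = first_protein(@records);
print $hit ? "$hit->{seq}\n" : "no protein record\n";
```

With real BioPerl sequence objects, the same filter would presumably test the object's own alphabet rather than guess from the residues.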
Chris

From Steve_Chervitz at affymetrix.com Tue Jul 11 20:21:16 2006
From: Steve_Chervitz at affymetrix.com (Steve Chervitz)
Date: Tue, 11 Jul 2006 17:21:16 -0700
Subject: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading?
In-Reply-To: <18C839F9-B099-4A4A-9957-2BF4EB7CFB85@gmx.net>
Message-ID:

The Bio::Root::Root object is rigged to use the Error.pm module if available, so you can throw and catch exception objects derived from Error. The motivation here was to provide a recommended path for folks who want to use more structured exception handling logic in their bioperl code. There are a number of predefined exception subclasses that cover common problems (such as FileOpenException), but you can also define your own.

See a list of the predefined exceptions, as well as some how-to docs, in the POD for Bio::Root::Exception:

http://search.cpan.org/~birney/bioperl-1.4/Bio/Root/Exception.pm

There's a bunch more info about Bioperl exception fun available from the bioperl distribution under the examples/root directory. See the README in that directory to get oriented. There are a number of demo scripts there, too.
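
For anyone who wants to see the general shape of typed exceptions before digging into those demo scripts, here is a minimal sketch in plain Perl. It is an illustration under assumptions, not how Bio::Root::Exception or Error.pm are implemented: the My::Exception classes, their accessors and make_fh are hypothetical names, and the file path is made up so that the open fails.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);

# Hypothetical exception classes, standing in for the hierarchy that
# Bio::Root::Exception provides; the names are illustrative only.
package My::Exception;

sub new {
    my ( $class, %args ) = @_;
    return bless { text => $args{-text}, value => $args{-value} }, $class;
}
sub throw { my $class = shift; die $class->new(@_) }
sub text  { $_[0]{text} }
sub value { $_[0]{value} }

package My::Exception::FileOpen;
our @ISA = ('My::Exception');

package main;

# Open a file for reading, or throw a typed exception carrying the name.
sub make_fh {
    my ($filename) = @_;
    open my $fh, '<', $filename
        or My::Exception::FileOpen->throw(
            -text  => "Can't open file: $!",
            -value => $filename,
        );
    return $fh;
}

# Catch by class: handle file problems here, rethrow anything else.
my $fh = eval { make_fh('/no/such/file.txt') };
if ( my $err = $@ ) {
    if ( blessed($err) && $err->isa('My::Exception::FileOpen') ) {
        print "file problem with ", $err->value, "\n";
    }
    else {
        die $err;
    }
}
```

The point of the class check is that the caller decides what to handle by type rather than by matching on the message text.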
Bio::Root::Root doesn't know anything about Exception::Class, but I see you can use it with Error.pm as described here:

http://search.cpan.org/~drolsky/Exception-Class-1.23/lib/Exception/Class.pm#OTHER_EXCEPTION_MODULES_(try%2Fcatch_syntax)

Cheers,
Steve

> From: Hilmar Lapp
> Date: Tue, 11 Jul 2006 15:05:03 -0400
> To: Rutger Vos
> Cc: Bioperl, Steve Chervitz
> Subject: Re: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading?
>
> I think it does this already, except that I believe you need to
> create the exception object and initialize with the message upfront.
>
> Steve, can you comment? Is this at least somewhat right?
>
> -hilmar

From Steve_Chervitz at affymetrix.com Tue Jul 11 21:07:06 2006
From: Steve_Chervitz at affymetrix.com (Steve_Chervitz)
Date: Tue, 11 Jul 2006 18:07:06 -0700
Subject: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading?
In-Reply-To: <000b01c6a52a$8c77f8c0$15327e82@pyrimidine>
References: <000b01c6a52a$8c77f8c0$15327e82@pyrimidine>
Message-ID: <337D69E3-4075-426B-A53E-31D37ED6CA2E@affymetrix.com>

On Jul 11, 2006, at 1:42 PM, Chris Fields wrote:

> Bio::Root::Root doesn't overload throw_not_implemented from
> Bio::Root::RootI; from the comments looks like Steve C and Ewan B
> couldn't work out some of the Error.pm issues.

The issue (I believe) was that Bio::Root::RootI::throw_not_implemented was doing some checking for the presence of Error.pm and calling Error::throw. I changed it so that this fanciness only happens in Root.pm.

> Judging by the POD for Bio::Root::RootI, throw_not_implemented doesn't
> accept arguments; it throws a Bio::Root::NotImplemented exception
> automatically.

Looking at the code now, throw_not_implemented() does not throw a Bio::Root::NotImplemented exception. It just throws a simple, unclassed message.
We could allow it to throw an exception of class Bio::Root::NotImplemented by changing this code:

    if( $self->can('throw') ) {
        $self->throw($message);
    }...

to this:

    if( $self->can('throw') ) {
        $self->throw(-text  => $message,
                     -class => 'Bio::Root::NotImplemented');
    }...

This does not create any dependency on Error.pm, but permits it to be used if available. If Error.pm is not loaded, the only change is that the class string is included in the error message, which is kind of handy.

Trouble would occur if the implementing class:

* does not derive from Bio::Root::Root,
* does not import Bio::Root::Exception,
* fails to implement a method which gets called, and
* Error.pm is available.

I don't know if such implementations exist in bioperl now, but I suspect they would be rare (and discouraged).

Steve

From cjfields at uiuc.edu Tue Jul 11 23:27:37 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Tue, 11 Jul 2006 22:27:37 -0500
Subject: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading?
In-Reply-To: <337D69E3-4075-426B-A53E-31D37ED6CA2E@affymetrix.com>
References: <000b01c6a52a$8c77f8c0$15327e82@pyrimidine> <337D69E3-4075-426B-A53E-31D37ED6CA2E@affymetrix.com>
Message-ID: <18B66484-FE46-4F77-BDD1-B97085D04C95@uiuc.edu>

Makes sense to keep most of the magic in Root instead of RootI.pm. The POD for RootI does state that the exception class thrown is Bio::Root::NotImplemented, so we should probably either change the POD to reflect what really happens or change throw_not_implemented like you suggest (my vote is the latter). I don't think many (if any) implementing classes fall into your 'trouble' category, though I can't be sure how many actually import Bio::Root::Exception.

Chris

Christopher Fields
Postdoctoral Researcher
Lab of Dr.
Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign

From frederick.partridge at st-johns.oxford.ac.uk Wed Jul 12 11:16:33 2006
From: frederick.partridge at st-johns.oxford.ac.uk (Frederick Partridge)
Date: Wed, 12 Jul 2006 16:16:33 +0100 (BST)
Subject: [Bioperl-l] Get nucleotide sequence when expecting protein from genpept
In-Reply-To: <000201c6a53c$03970ed0$15327e82@pyrimidine>
References: <000201c6a53c$03970ed0$15327e82@pyrimidine>
Message-ID:

On Tue, 11 Jul 2006, Chris Fields wrote:
> This returns both the nucleotide sequence and the correct protein sequence;
> the protein was returned second for some reason, so get_Seq_by_acc misses it
> while get_Stream_by_acc doesn't. I have notified NCBI about this issue, but
> they will likely just tell me to use the GI number for searches as they are
> unique. Probably a good warning for anyone using accessions for all their
> work (I use the GI myself).

Thank you both for your help. I have converted to GIs and it works much better.

As an aside, it might be nice to have a $hit->gi method as well as $hit->accession for parsing blast reports. (I now realise that you can derive the gi from $hit->name, but this might have encouraged me to start off using gi instead of accession numbers).

Freddie Partridge
University of Oxford

From cjfields at uiuc.edu Wed Jul 12 11:39:39 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Wed, 12 Jul 2006 10:39:39 -0500
Subject: [Bioperl-l] Get nucleotide sequence when expecting protein from genpept
In-Reply-To:
Message-ID: <000b01c6a5c9$635a7540$15327e82@pyrimidine>

Problem is, you may or may not have GIs for a BLAST hit depending on how you retrieve the BLAST report, what interface you use, etc. NCBI is pretty ambiguous when it comes to GI vs. accession; the sequence database guys want you to use the GI for searches (since that's the unique ID for NCBI's databases) and don't promise getting the correct sequence using the accession.
However, the BLAST interface guys have set up the BLAST CGI server (accessible through Bio::Tools::Run::RemoteBlast) to not return the GI by default. Even more confusing, if you use the NCBI BLAST web interface, this option is turned on by default. I don't know what blastcl3 or blastall does; I haven't checked in a while.

Anyway, this could be why there is no $hit->gi method for GenericHit/BlastHit. It could be added; I will need to look at SearchIO::blast/blastxml/blasttable to see how this is parsed out.

BTW, what I do as a work-around when using RemoteBlast is below (you could use the while loop to grab the GIs using SearchIO::blast if they are present in the BLAST report). This grabs all the GIs from the description line (not just the best hit).

    # sets retrieval header to include the GI always
    $Bio::Tools::Run::RemoteBlast::RETRIEVALHEADER{'NCBI_GI'} = 'yes';
    ...
    while ( my $hit = $result->next_hit) {
        my $description = $hit->description;
        # each "gi|...|" token in the description contributes one GI
        while ($description =~ /gi\|(.*?)\|/g) {
            my $gi = $1;
            push @gis, $gi;
        }
    }

Chris

From Steve_Chervitz at affymetrix.com Wed Jul 12 14:53:22 2006
From: Steve_Chervitz at affymetrix.com (Steve_Chervitz)
Date: Wed, 12 Jul 2006 11:53:22 -0700
Subject: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading?
In-Reply-To: <18B66484-FE46-4F77-BDD1-B97085D04C95@uiuc.edu>
References: <000b01c6a52a$8c77f8c0$15327e82@pyrimidine> <337D69E3-4075-426B-A53E-31D37ED6CA2E@affymetrix.com> <18B66484-FE46-4F77-BDD1-B97085D04C95@uiuc.edu>
Message-ID: <3E119694-68C5-47A6-971B-8E035CBB6429@affymetrix.com>

For modules that derive from Bio::Root::Root, there's no need to import Bio::Root::Exception since the Root object does it.

I also favor adding the -class parameter to throw_not_implemented in RootI. I just committed this change in bioperl-live. I also added a test for it in t/RootI.t.

I haven't run the complete suite of tests after making this change, but I don't suspect there'll be any trouble (famous last words). Really, if any test leads to the calling of throw_not_implemented (besides the test I just added), that in itself is trouble.

Steve

On Jul 11, 2006, at 8:27 PM, Chris Fields wrote:

> Makes sense to keep most of the magic in Root instead of RootI.pm.
> The POD for RootI does state that the class exception thrown is
> Bio::Root::NotImplemented, so we should probably either change the
> POD to reflect what really happens or change throw_not_implemented
> like you suggest (my vote is the latter). I don't think many (if
> any) implementing classes fall into your 'trouble' category, though I
> can't be sure how many actually import Bio::Root::Exception.
> > Chris > > On Jul 11, 2006, at 8:07 PM, Steve_Chervitz wrote: > >> On Jul 11, 2006, at 1:42 PM, Chris Fields wrote: >> >>> Bio::Root::Root doesn't overload throw_not_implemented from >>> Bio::Root::RootI; from the comments looks like Steve C and Ewan B >>> couldn't >>> work out some of the Error.pm issues. >> >> The issue (I believe) was that >> Bio::Root::RootI::throw_not_implemented was doing some checking for >> the presence of Error.pm and calling Error::throw. I changed it so >> that this fanciness only happens in Root.pm. >> >>> Judging by the POD for Bio::Root::RootI, throw_not_implemented >>> doesn't >>> accept arguments; it throws a Bio::Root::NotImplemented exception >>> automatically. >> >> Looking at the code now, throw_not_implemented() does not throw a >> Bio::Root::NotImplemented exception. It just throws a simple, >> unclassed message. We could allow it to throw an exception of class >> Bio::Root:NotImplemented by changing this code: >> >> if( $self->can('throw') ) { >> $self->throw($message); >> }... >> >> to this >> >> if( $self->can('throw') ) { >> $self->throw(-text=>$message, - >> class=>'Bio::Root::NotImplemented'); >> }... >> >> This does not create any dependency on Error.pm, but permits it to >> be used if available. If Error.pm is not loaded, the only change is >> that the class string is included in the error message, which is >> kind of handy. >> >> Trouble would occur if the implementing class: >> >> * does not derive from Bio::Root::Root, >> * does not import Bio::Root::Exception, >> * fails to implement a method which gets called, and >> * Error.pm is available. >> >> I don't know if such implementations exist in bioperl now, but I >> suspect they would be rare (and discouraged). 
>> >> Steve >> >> >>> Chris >>> >>>> -----Original Message----- >>>> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- >>>> bounces at lists.open-bio.org] On Behalf Of Rutger Vos >>>> Sent: Tuesday, July 11, 2006 1:58 PM >>>> To: Chris Fields >>>> Cc: 'Bioperl List' >>>> Subject: Re: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) >>>> overloading? >>>> >>>> I must have overlooked this. I think it does what I want. So >>>> could I do >>>> something like: >>>> >>>> $obj->thow_not_implemented( -class => >>>> 'Bio::Root::NotImplemented' ); >>>> >>>> ...in interfaces? >>>> >>>> Chris Fields wrote: >>>>> I suppose you could; Bio::Root::Root does that using Error.pm >>>>> (if it >>>>> is installed). It almost sounds like what Bio::Root::Root does is >>>>> what you want, but you want a little more information when >>>>> exceptions >>>>> are thrown maybe? >>>>> >>>>> from perldoc Bio::Root::Root: >>>>> >>>>> ... >>>>> # Alternatively, using the new typed exception syntax in >>>>> the throw() call: >>>>> >>>>> $obj->throw( -class => 'Bio::Root::BadParameter', >>>>> -text => "Can not open file $file", >>>>> -value => $file); >>>>> ... >>>>> >>>>> Typed Exception Syntax >>>>> >>>>> The typed exception syntax of throw() has the advantage of >>>>> plainly >>>>> indicating the nature of the trouble, since the name of >>>>> the >>>>> class is >>>>> included in the title of the exception output. >>>>> >>>>> To take advantage of this capability, you must specify >>>>> arguments as >>>>> named parameters in the throw() call. Here are the >>>>> parameters: >>>>> >>>>> -class >>>>> name of the class of the exception. This should be >>>>> one >>>>> of the >>>>> classes defined in Bio::Root::Exception, or a custom >>>>> error of yours >>>>> that extends one of the exceptions defined in >>>>> Bio::Root::Exception. >>>>> >>>>> -text >>>>> a sensible message for the exception >>>>> >>>>> -value >>>>> the value causing the exception or $!, if appropriate. 
>>>>> >>>>> Note that Bio::Root::Exception does not need to be >>>>> imported >>>>> into your >>>>> module (or script) namespace in order to throw >>>>> exceptions via >>>>> Bio::Root::Root::throw(), since Bio::Root::Root imports >>>>> it. >>>>> >>>>> >>>>> Chris >>>>> >>>>> On Jul 11, 2006, at 3:58 AM, Rutger Vos wrote: >>>>> >>>>> >>>>>> Dear all, >>>>>> >>>>>> would it be possible to overload Bio::Root::RootI's 'throw' >>>>>> method to >>>>>> accept an additional, optional (positional) argument to define >>>>>> the >>>>>> exception class, e.g. using Exception::Class: >>>>>> >>>>>> # ...somewhere ... >>>>>> >>>>>> sub makefh { >>>>>> my ( $self, $filename ) = @_; >>>>>> open my $fh, '<' $filename or $self->throw("Can't open >>>>>> file: $!", >>>>>> 'Bio::Exceptions::FileIO'); # NOTE second argument >>>>>> return $fh; >>>>>> } >>>>>> >>>>>> #.... somewhere else >>>>>> my $fh; >>>>>> eval { >>>>>> $fh = $obj->makefh( 'data.txt'); >>>>>> } >>>>>> if ( $@ and $@->isa('Bio::Exceptions::FileIO') ) { >>>>>> # something's wrong with the file? >>>>>> } >>>>>> >>>>>> -- >>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++ >>>>>> Rutger Vos, PhD. candidate >>>>>> Department of Biological Sciences >>>>>> Simon Fraser University >>>>>> 8888 University Drive >>>>>> Burnaby, BC, V5A1S6 >>>>>> Phone: 604-291-5625 >>>>>> Fax: 604-291-3496 >>>>>> Personal site: http://www.sfu.ca/~rvosa >>>>>> FAB* lab: http://www.sfu.ca/~fabstar >>>>>> Bio::Phylo: http://search.cpan.org/~rvosa/Bio-Phylo/ >>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++ >>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> Bioperl-l mailing list >>>>>> Bioperl-l at lists.open-bio.org >>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >>>>>> >>>>> >>>>> Christopher Fields >>>>> Postdoctoral Researcher >>>>> Lab of Dr. 
Robert Switzer >>>>> Dept of Biochemistry >>>>> University of Illinois Urbana-Champaign >>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> Bioperl-l mailing list >>>>> Bioperl-l at lists.open-bio.org >>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >>>>> >>>>> >>>>> >>>>> >>>> >>>> -- >>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++ >>>> Rutger Vos, PhD. candidate >>>> Department of Biological Sciences >>>> Simon Fraser University >>>> 8888 University Drive >>>> Burnaby, BC, V5A1S6 >>>> Phone: 604-291-5625 >>>> Fax: 604-291-3496 >>>> Personal site: http://www.sfu.ca/~rvosa >>>> FAB* lab: http://www.sfu.ca/~fabstar >>>> Bio::Phylo: http://search.cpan.org/~rvosa/Bio-Phylo/ >>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++ >>>> >>>> >>>> _______________________________________________ >>>> Bioperl-l mailing list >>>> Bioperl-l at lists.open-bio.org >>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >>> >>> _______________________________________________ >>> Bioperl-l mailing list >>> Bioperl-l at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> > > Christopher Fields > Postdoctoral Researcher > Lab of Dr. Robert Switzer > Dept of Biochemistry > University of Illinois Urbana-Champaign > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Wed Jul 12 15:23:33 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Wed, 12 Jul 2006 14:23:33 -0500 Subject: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading? In-Reply-To: <3E119694-68C5-47A6-971B-8E035CBB6429@affymetrix.com> Message-ID: <000901c6a5e8$aaca53e0$15327e82@pyrimidine> Thanks Steve! 
Chris

From dsche at uga.edu Thu Jul 13 14:55:03 2006
From: dsche at uga.edu (Dongsheng Che)
Date: Thu, 13 Jul 2006 14:55:03 -0400 (EDT)
Subject: [Bioperl-l] remoteBlast problem
Message-ID: <20060713145503.CIV61560@punts2.cc.uga.edu>

To whom it may concern:

I'm trying to do a BLAST search remotely, so I downloaded bioperl-1.5 and followed the installation procedure, i.e., perl Makefile.PL, make, make test, make install. I know there were some failures during the installation.

Since my main purpose is to get RemoteBlast working, I don't want to bother figuring out all the failures. But when I run remote BLAST, it gives me some errors from the examples (bptutorial).

-------------------------------------------------------------
Beginning run_remoteblast example...
Use of uninitialized value in numeric lt (<) at bptutorial.pl line 3303.

**Warning**: Couldn't connect to NCBI with Bio::Tools::Run::StandAloneBlast.pm!
Probably no network access.
Skipping Test
----------------------------------------------------------------

I'm wondering what causes the problem.

Thanks in advance!

Dongsheng

From vrramnar at student.cs.uwaterloo.ca Thu Jul 13 18:39:19 2006
From: vrramnar at student.cs.uwaterloo.ca (vrramnar at student.cs.uwaterloo.ca)
Date: Thu, 13 Jul 2006 18:39:19 -0400
Subject: [Bioperl-l] Remote Blast - Blast Human Genome
Message-ID: <1152830359.44b6cb97ef16c@www.nexusmail.uwaterloo.ca>

Hello Again,

I have another question regarding Remote blast but this time using Genome Blast.
Here is the link:

http://www.ncbi.nlm.nih.gov/genome/seq/BlastGen/BlastGen.cgi?taxid=9606

which again uses the main Blast web site:

http://www.ncbi.nlm.nih.gov/BLAST/Blast.cgi

Again I am not sure what to add or what HEADER information to change within my script.

Here is my program, which is the same as in the last email:

    #!/usr/bin/perl -w

    use Bio::Perl;
    use Bio::Tools::Run::RemoteBlast;

    my $prog  = "blastn";
    my $db    = "refseq_genomic";
    my $e_val = 0.01;

    my @params = ( '-prog'   => $prog,
                   '-data'   => $db,
                   '-expect' => $e_val );

    my $factory = Bio::Tools::Run::RemoteBlast->new(@params);
    $Bio::Tools::Run::RemoteBlast::HEADER{'WWW_BLAST_TYPE'} = '????';  # <----- what do I put here
    #$Bio::Tools::Run::RemoteBlast::HEADER{'?????'} = '????';  # <--- Do I need to add any other values to the form inputs

    $factory->submit_blast("blast.in");
    my $v = 1;

    while ( my @rids = $factory->each_rid ) {
        foreach my $rid (@rids) {
            my $rc = $factory->retrieve_blast($rid);
            if ( !ref($rc) ) {
                if ( $rc < 0 ) {
                    $factory->remove_rid($rid);
                }
                print STDERR "." if ( $v > 0 );
                sleep 5;
            }
            else {
                my $result   = $rc->next_result();
                my $filename = $result->query_name() . ".out";
                $factory->save_output($filename);
                $factory->remove_rid($rid);
                print "\nQuery Name: ", $result->query_name(), "\n";
            }
        }
    }

Both of my questions are very similar, in that I know how to use remote blast but am not sure what to change to access the specific blast I want.

Again, any help would be very appreciated!!

Rohan

From vrramnar at student.cs.uwaterloo.ca Thu Jul 13 18:31:38 2006
From: vrramnar at student.cs.uwaterloo.ca (vrramnar at student.cs.uwaterloo.ca)
Date: Thu, 13 Jul 2006 18:31:38 -0400
Subject: [Bioperl-l] Remote Blast - SNP data base
Message-ID: <1152829898.44b6c9cab7a3a@www.nexusmail.uwaterloo.ca>

Hello,

1. I was wondering if anyone knew how to use SNP Blast via the Remote Blast module??
Basically I want to blast my sequence against the dbSNP database; you can normally do this through NCBI's website:

http://www.ncbi.nlm.nih.gov/SNP/snp_blastByOrg.cgi

The site basically takes your info and submits it to the main blast site:

http://www.ncbi.nlm.nih.gov/BLAST/Blast.cgi

I am just not sure what settings to change within my script. I have something like this:

    #!/usr/bin/perl -w

    use Bio::Perl;
    use Bio::Tools::Run::RemoteBlast;

    my $prog  = "blastn";
    my $db    = "refseq_genomic";  # <--- What db should I use??
    my $e_val = 0.01;

    my @params = ( '-prog'   => $prog,
                   '-data'   => $db,
                   '-expect' => $e_val );

    my $factory = Bio::Tools::Run::RemoteBlast->new(@params);

    $factory->submit_blast("blast.in");  # <--- Name of my file in fasta format
    my $v = 1;

    while ( my @rids = $factory->each_rid ) {
        foreach my $rid (@rids) {
            my $rc = $factory->retrieve_blast($rid);
            if ( !ref($rc) ) {
                if ( $rc < 0 ) {
                    $factory->remove_rid($rid);
                }
                print STDERR "." if ( $v > 0 );
                sleep 5;
            }
            else {
                my $result   = $rc->next_result();
                my $filename = $result->query_name() . ".out";
                $factory->save_output($filename);
                $factory->remove_rid($rid);
                print "\nQuery Name: ", $result->query_name(), "\n";
            }
        }
    }

I think something like this should be added to have the correct form inputs but I am unsure:

    $Bio::Tools::Run::RemoteBlast::HEADER{'???'} = '????';

Any help on this topic would greatly be appreciated!!

Rohan

From cjfields at uiuc.edu Thu Jul 13 20:42:57 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 13 Jul 2006 19:42:57 -0500
Subject: [Bioperl-l] remoteBlast problem
In-Reply-To: <20060713145503.CIV61560@punts2.cc.uga.edu>
Message-ID: <000401c6a6de$737fe570$15327e82@pyrimidine>

1) Before I get wound up in the obvious here, you need to upgrade to CVS; RemoteBlast and SearchIO::blast were fixed post v.-1.5.1 (i.e.
in CVS) to account for changes in BLAST output at the NCBI.

2) The Bio::Tools::Run::StandAloneBlast.pm bit worried me a little, so I did a little digging; that's a typo. Now corrected in CVS, along with some BPLite cruft left over.

3) Speaking bluntly? Come on. The error is stated as plainly as possible. No? How about this (note the arrows):

-----------> **Warning**: Couldn't connect to NCBI with
-----------> Bio::Tools::Run::StandAloneBlast.pm!
-----------> Probably no network access. Skipping Test

Check your network connections, preferably AFTER you update to CVS. It's possible that it's a proxy issue, but that should also be fixed in CVS.

Chris

From cjfields at uiuc.edu Thu Jul 13 21:56:30 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 13 Jul 2006 20:56:30 -0500
Subject: [Bioperl-l] Remote Blast - Blast Human Genome
In-Reply-To: <1152830359.44b6cb97ef16c@www.nexusmail.uwaterloo.ca>
Message-ID: <000501c6a6e8$b9c24d20$15327e82@pyrimidine>

I added a method to RemoteBlast in bioperl-live (CVS) if you want to play with changing the URL. I have been thinking about doing this for a bit now but I already see problems.

Here's the issue: the BLAST page you see is NOT the NCBI BLAST page (note the differences in the URL) but a user-friendly request page, generated on the fly by Genome, to submit BLAST requests for the relevant database. So changing the URL will not work (even by adding extra parameters); you only get the original HTML web page.

You could try changing the database or limiting the search using an Entrez term (which you should be able to include in the request, probably by adding it to the HEADER).

Chris

From smart_bioit at yahoo.com Fri Jul 14 13:25:51 2006
From: smart_bioit at yahoo.com (raj sharma)
Date: Fri, 14 Jul 2006 10:25:51 -0700 (PDT)
Subject: [Bioperl-l] advice
Message-ID: <20060714172551.60978.qmail@web37309.mail.mud.yahoo.com>

I have a problem in Perl: I want to make a program which, when run online, can download required data from a data bank to a local server. Where should I start?

From charlesh at stedwards.edu Sat Jul 15 15:29:46 2006
From: charlesh at stedwards.edu (Charles Hauser)
Date: Sat, 15 Jul 2006 14:29:46 -0500
Subject: [Bioperl-l] Finding locations of a string within a fasta file
Message-ID: <040D0FF5-0BD1-49DA-B66C-32E359F254BA@stedwards.edu>

All,

I'm trying to determine where (the start .. end positions) within a genomic scaffold sequence gaps occur. The gaps are denoted as runs of N's.

Suggestions on how to easily retrieve this would be appreciated.

ch

From cjfields at uiuc.edu Sat Jul 15 17:22:15 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Sat, 15 Jul 2006 16:22:15 -0500
Subject: [Bioperl-l] Finding locations of a string within a fasta file
In-Reply-To: <040D0FF5-0BD1-49DA-B66C-32E359F254BA@stedwards.edu>
Message-ID: <000001c6a854$bee47400$15327e82@pyrimidine>

You can retrieve the original GenBank CONTIG file using Bio::DB::GenBank if the format is set to 'gb' (it is now set to 'gbwithparts' by default). The CONTIG lines are currently stored in a series of Bio::Annotation::SimpleValue objects; get the accessions using the following script.

    use strict;
    use warnings;

    use Bio::DB::GenBank;
    use Bio::SeqIO;

    my $factory = Bio::DB::GenBank->new(-format => 'gb');

    my $seq = $factory->get_Seq_by_id(shift);

    my $seqout = Bio::SeqIO->new(-fh     => \*STDOUT,
                                 -format => 'genbank');

    # greps only annotations with CONTIG tagname, joins all together
    my $contig = join '', grep {$_->tagname eq 'CONTIG'} $seq->get_Annotations();

    # split each region, getting rid of gaps and join(), then split into acc/span
    for (grep {$_ !~ m{gap|join}} split ',', $contig) {
        my ($acc, $span) = split ':', $_;
        $span =~ s{\)}{}g;    # spurious ')'
        print "ACC: $acc\n\tSpan:$span\n";
    }

From sudhaneti at yahoo.com Sat Jul 15 15:26:01 2006
From: sudhaneti at yahoo.com (Sudha Gunturu)
Date: Sat, 15 Jul 2006 12:26:01 -0700 (PDT)
Subject: [Bioperl-l] BLOSUM matrix
Message-ID: <20060715192601.36517.qmail@web53315.mail.yahoo.com>

Being a beginner in Perl programming, I was wondering if anyone can help me with an implementation of the BLOSUM 65 matrix for the following alignments, or in general. Any inputs or websites to help with this are appreciated.

AILCAA
ALLLAA
ILIICL

Thanks
Sudha
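
One way to read the BLOSUM question above is as a request for how to start building a BLOSUM-style matrix from a block of aligned sequences. The sketch below covers only the first step of that procedure, counting the residue pairs observed in each column, using the three sequences from the post. It is plain Perl with no BioPerl calls; the helper name count_pairs is made up for illustration, the sequences are assumed to be ungapped and of equal length, and the clustering of similar sequences and the log-odds calculation that a real BLOSUM matrix needs are left out.

```perl
use strict;
use warnings;

# Count unordered residue pairs in every column of an ungapped block.
# Returns a hash reference keyed by the two residues in sorted order,
# e.g. "IL" covers both I/L and L/I.
sub count_pairs {
    my @seqs = @_;
    my $len  = length $seqs[0];
    for my $s (@seqs) {
        die "sequences must all have length $len\n" if length($s) != $len;
    }

    my %count;
    for my $col ( 0 .. $len - 1 ) {
        # residues found in this column, one per sequence
        my @residues = map { substr( $_, $col, 1 ) } @seqs;

        # every pair of sequences contributes one residue pair
        for my $i ( 0 .. $#residues - 1 ) {
            for my $j ( $i + 1 .. $#residues ) {
                my $pair = join '', sort( $residues[$i], $residues[$j] );
                $count{$pair}++;
            }
        }
    }
    return \%count;
}

my $pairs = count_pairs(qw(AILCAA ALLLAA ILIICL));
print "$_\t$pairs->{$_}\n" for sort keys %$pairs;
```

With three sequences of six columns there are 18 pairs in total; for this block the I/L pair is the most frequent, with a count of 5. These counts would next be turned into pair frequencies and compared with the frequencies expected by chance.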
From charlesh at stedwards.edu Sun Jul 16 19:32:38 2006
From: charlesh at stedwards.edu (Charles Hauser)
Date: Sun, 16 Jul 2006 18:32:38 -0500
Subject: [Bioperl-l] Finding locations of a string within a fasta file
In-Reply-To: <000001c6a854$bee47400$15327e82@pyrimidine>
References: <000001c6a854$bee47400$15327e82@pyrimidine>
Message-ID:

Hi Chris,

Thanks for the info. Unfortunately, I was not clear that the sequence is unannotated, i.e. there is no GenBank record. I need to extract the locations of the gaps from a raw fasta file.

ch

From torsten.seemann at infotech.monash.edu.au Sun Jul 16 22:23:51 2006
From: torsten.seemann at infotech.monash.edu.au (Torsten Seemann)
Date: Mon, 17 Jul 2006 12:23:51 +1000
Subject: [Bioperl-l] advice
In-Reply-To: <20060714172551.60978.qmail@web37309.mail.mud.yahoo.com>
References: <20060714172551.60978.qmail@web37309.mail.mud.yahoo.com>
Message-ID: <44BAF4B7.8090508@infotech.monash.edu.au>

raj sharma wrote:
> i have one problem in perl

is this Bio::Perl related?

> i want to make one program which whn run online

do you mean runs on a web server as a CGI script, or access on-line data?

> can download required data from data bank to local server

which databank - genbank or ... ?

> frm where i shld start

http://www.oreilly.com/catalog/lperl3/

--
Dr Torsten Seemann http://www.vicbioinformatics.com
Victorian Bioinformatics Consortium, Monash University, Australia

From torsten.seemann at infotech.monash.edu.au Sun Jul 16 22:21:31 2006
From: torsten.seemann at infotech.monash.edu.au (Torsten Seemann)
Date: Mon, 17 Jul 2006 12:21:31 +1000
Subject: [Bioperl-l] Finding locations of a string within a fasta file
In-Reply-To: <040D0FF5-0BD1-49DA-B66C-32E359F254BA@stedwards.edu>
References: <040D0FF5-0BD1-49DA-B66C-32E359F254BA@stedwards.edu>
Message-ID: <44BAF42B.8080102@infotech.monash.edu.au>

> I'm trying to determine where (the start .. end positions) within a
> genomic scaffold sequence gaps occur.
> The gaps are denoted as runs of N's.
> Suggestions on how to easily retrieve this would be appreciated.

First you need to get the sequence into a string within Perl. As your email Subject: says it is in the Fasta file, you need to

1. open the fasta file - see Bio::SeqIO
2. read first sequence (as an object) - see next_seq()
3. get the string of the sequence in the object - see seq()

Then you could just use the inbuilt Perl function index() to loop through all the occurrences of 'N' - type 'perldoc -f index' for help. Alternatively use regexp matching eg, m/(N+)/g and the pos() function.

--
Dr Torsten Seemann http://www.vicbioinformatics.com
Victorian Bioinformatics Consortium, Monash University, Australia

From sudhaneti at yahoo.com Sun Jul 16 22:33:20 2006
From: sudhaneti at yahoo.com (Sudha Gunturu)
Date: Sun, 16 Jul 2006 19:33:20 -0700 (PDT)
Subject: [Bioperl-l] BLOSUM matrix
In-Reply-To: <44BAF316.9020301@infotech.monash.edu.au>
Message-ID: <20060717023320.6402.qmail@web53313.mail.yahoo.com>

Sorry for not being clear with my question. Let me try to explain. I want to Implement dynamic programing using Blosum as scoring matrix.

1. I want to know how to define the values of Blosum in an array.
2. What functions are suitable for global alignment of two sequences. Etc.,

Being a beginer programer want some direction, books, and good websites which can help me in achieving the implementation. It would be great if someone can walk me through this.

Thanks
Sudha
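For the gap question in the 'Finding locations of a string within a fasta file' thread, the regexp route Torsten mentions (m/(N+)/g together with pos()) might look roughly like the sketch below. It assumes the sequence is already in a plain Perl string; the sub name find_n_runs and the toy scaffold are made up for illustration. With BioPerl the string would come from calling seq() on the object returned by next_seq(), as in Torsten's three steps.

```perl
use strict;
use warnings;

# Return a list of [start, end] pairs (1-based, inclusive), one pair
# for every run of N (upper or lower case) in the sequence string.
# find_n_runs is a hypothetical helper name, not a BioPerl method.
sub find_n_runs {
    my ($seq) = @_;
    my @runs;
    while ($seq =~ /(N+)/gi) {
        my $end   = pos($seq);               # offset just past the match, i.e. the 1-based end
        my $start = $end - length($1) + 1;   # 1-based start of the run
        push @runs, [$start, $end];
    }
    return @runs;
}

# Toy scaffold standing in for the string returned by $seq_obj->seq()
my $scaffold = 'ACGTNNNNACGTNNACGN';

for my $run (find_n_runs($scaffold)) {
    printf "gap: %d..%d\n", @$run;
}
```

For the toy string this reports the runs 5..8, 13..14 and 18..18. A multi-sequence fasta file would need the same loop applied once per sequence.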
From torsten.seemann at infotech.monash.edu.au Sun Jul 16 22:16:54 2006
From: torsten.seemann at infotech.monash.edu.au (Torsten Seemann)
Date: Mon, 17 Jul 2006 12:16:54 +1000
Subject: [Bioperl-l] BLOSUM matrix
In-Reply-To: <20060715192601.36517.qmail@web53315.mail.yahoo.com>
References: <20060715192601.36517.qmail@web53315.mail.yahoo.com>
Message-ID: <44BAF316.9020301@infotech.monash.edu.au>

Sudha,

> Being a beginner perl programming, was wondering if anyone can help me with implementation of BLOSUM 65 matrix for the following alignments or in general. Any inputs, websites to help with this are appreciated.
> AILCAA
> ALLLAA
> ILIICL

The BLOSUM65 matrix does not define a method for alignment, it just provides some parameters. Perhaps you should read this first:

http://en.wikipedia.org/wiki/Sequence_alignment

--
Dr Torsten Seemann http://www.vicbioinformatics.com
Victorian Bioinformatics Consortium, Monash University, Australia

From smart_bioit at yahoo.com Mon Jul 17 00:21:41 2006
From: smart_bioit at yahoo.com (raj sharma)
Date: Sun, 16 Jul 2006 21:21:41 -0700 (PDT)
Subject: [Bioperl-l] advice
In-Reply-To: <44BAF4B7.8090508@infotech.monash.edu.au>
Message-ID: <20060717042141.9069.qmail@web37307.mail.mud.yahoo.com>

hi trston
well here i will make u clear my problem

i want to make one data base of marine species u can say this as mirror of data

so at present whn i click there on line data base of ncbi gets open

so i want to dowload data of marine species (ny one)
nd whn ever i click on tht link local data which i have downloaded shld open
nd data shld also b updated online after some time

waiting for ur reply
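On the BLOSUM thread: Torsten's point that the matrix only supplies parameters, while the alignment method is separate, can be illustrated with a small sketch. Everything here is an assumption made for illustration: the table holds only the four residues that occur in AILCAA and ALLLAA, its values are taken from BLOSUM62 rather than the BLOSUM 65 Sudha asks about, the linear gap penalty of -4 is arbitrary, and the method is plain Needleman-Wunsch global alignment reporting the score only, with no traceback.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Partial substitution table as a hash of hashes (residues A, C, I, L only).
# Values are from BLOSUM62; swap in whichever matrix you actually need.
my %score = (
    A => { A =>  4, C =>  0, I => -1, L => -1 },
    C => { A =>  0, C =>  9, I => -1, L => -1 },
    I => { A => -1, C => -1, I =>  4, L =>  2 },
    L => { A => -1, C => -1, I =>  2, L =>  4 },
);
my $gap = -4;    # linear gap penalty, picked arbitrarily for this sketch

# Needleman-Wunsch global alignment score of two strings.
# nw_score is a hypothetical helper name, not a BioPerl function.
sub nw_score {
    my ($s, $t) = @_;
    my @a = split //, $s;
    my @b = split //, $t;
    my ($m, $n) = (scalar @a, scalar @b);

    my @F;    # $F[$i][$j] = best score aligning first $i of @a with first $j of @b
    $F[$_][0] = $_ * $gap for 0 .. $m;
    $F[0][$_] = $_ * $gap for 0 .. $n;

    for my $i (1 .. $m) {
        for my $j (1 .. $n) {
            $F[$i][$j] = max(
                $F[$i - 1][$j - 1] + $score{ $a[$i - 1] }{ $b[$j - 1] },  # match/mismatch
                $F[$i - 1][$j] + $gap,                                    # gap in second sequence
                $F[$i][$j - 1] + $gap,                                    # gap in first sequence
            );
        }
    }
    return $F[$m][$n];
}

print nw_score('AILCAA', 'ALLLAA'), "\n";    # prints 17
```

A hash of hashes is used instead of a plain array so that scores can be looked up directly by residue letter. A real program would read the full matrix from a file and add the traceback step to recover the alignment itself.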
From cjfields at uiuc.edu Mon Jul 17 00:51:20 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Sun, 16 Jul 2006 23:51:20 -0500
Subject: [Bioperl-l] BLOSUM matrix
In-Reply-To: <20060717023320.6402.qmail@web53313.mail.yahoo.com>
References: <20060717023320.6402.qmail@web53313.mail.yahoo.com>
Message-ID:

Hmm, beginner programmer, wants to learn perl? Here are some directions:

http://learn.perl.org/

Start with Schwartz's latest incarnation of Learning Perl, then work your way up to Intermediate Perl (I think Mastering Perl is on the horizon...)

For some pointers using Perl and bioinformatics, pick up Tisdall's books Beginning/Mastering Perl for Bioinformatics.

This is really a list for bioperl, not perl and bioinformatics (though the two cross here all the time!). We normally don't mind answering questions but we typically don't do people's homework unless we're unusually bored. And we can be excessively cranky when someone repeatedly posts requests for something that shouldn't take much reading and Googling to find out. Again, we're not into that homework gig, i.e. 'walking you through it' is tantamount to 'doing it for you.'

1) Arrays and how to use them are in Learning Perl; there are probably better ways to do this than an array, though...
2) Use Torsten's link to get you started.

Chris

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign

From cjfields at uiuc.edu Mon Jul 17 01:01:53 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Mon, 17 Jul 2006 00:01:53 -0500
Subject: [Bioperl-l] advice
In-Reply-To: <20060717042141.9069.qmail@web37307.mail.mud.yahoo.com>
References: <20060717042141.9069.qmail@web37307.mail.mud.yahoo.com>
Message-ID: <82C51420-A18B-4DEA-A519-CE1D7B9C7B10@uiuc.edu>

This is a Bioperl list. If you don't have a Bioperl-related question, you will very likely get testy replies. I don't believe that you quite understand Torsten's response, so I'll just copy-and-paste from a reply I just gave a second ago to save myself the typing:

Hmm, beginner programmer, wants to learn perl? Here are some directions:

http://learn.perl.org/

Start with Schwartz's latest incarnation of Learning Perl, then work your way up to Intermediate Perl (I think Mastering Perl is on the horizon...)
For some pointers using Perl and bioinformatics, pick up Tisdall's books Beginning/Mastering Perl for Bioinformatics.

This is really a list for bioperl, not perl and bioinformatics (though the two cross here all the time!). We normally don't mind answering questions but we typically don't do people's homework unless we're unusually bored. And we can be excessively cranky when someone repeatedly posts requests for something that shouldn't take much reading and Googling to find out. Again, we're not into that homework gig, i.e. 'walking you through it' is tantamount to 'doing it for you.'

For your particular instance, you might want to brush up on web services, CGI, and a little web etiquette.

http://catb.org/esr/faqs/smart-questions.html

I think you may be waiting for a long time for a reply!

Chris

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign

From bmoore at genetics.utah.edu Mon Jul 17 01:25:32 2006
From: bmoore at genetics.utah.edu (Barry Moore)
Date: Sun, 16 Jul 2006 23:25:32 -0600
Subject: [Bioperl-l] advice
In-Reply-To: <20060714172551.60978.qmail@web37309.mail.mud.yahoo.com>
Message-ID:

By reading this:

http://catb.org/esr/faqs/smart-questions.html

-----Original Message-----
From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of raj sharma
Sent: Friday, July 14, 2006 11:26 AM
To: bioperl-l at lists.open-bio.org
Subject: [Bioperl-l] advice

i have one problem in perl
i want to make one program which whn run online can download required data from data bank to local server
frm where i shld start

From bmoore at genetics.utah.edu Mon Jul 17 01:34:58 2006
From: bmoore at genetics.utah.edu (Barry Moore)
Date: Sun, 16 Jul 2006 23:34:58 -0600
Subject: [Bioperl-l] advice
In-Reply-To: <20060717042141.9069.qmail@web37307.mail.mud.yahoo.com>
Message-ID:

If you're on a unix type system look at wget --mirror and its variations.
B

From bix at sendu.me.uk Mon Jul 17 10:32:13 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Mon, 17 Jul 2006 15:32:13 +0100
Subject: [Bioperl-l] Bio::Map changes
In-Reply-To: <44ACCCD5.3030309@sendu.me.uk>
References: <44985915.8010607@sendu.me.uk> <449A9AF9.2000305@sendu.me.uk> <44ACCCD5.3030309@sendu.me.uk>
Message-ID: <44BB9F6D.10005@sendu.me.uk>

Sendu Bala wrote:
> Sendu Bala wrote:
>> The reimplementation will make Position central to the model, allowing
>> for lots of other things to work properly without anything becoming
>> inconsistent (as is currently the case).
>
> This is now done. It uses a new PositionHandler class behind the scenes.
>
> The next step is to introduce relative positioning across the board

This is now done. It uses a new Relative class to describe what a given position is relative to. I also made Bio::Map::MapI an AnnotatableI and SimpleMap an implementor.

I think this pretty much brings an end to my changes to Bio::Map. Unless anyone thinks the changes lack sanity, I think the API of the new things should be somewhat stable.

> possibly in a way that makes OrderedPosition redundant or an implementer
> of the system.

I haven't yet touched the other kinds of Positions to update/remove them.

Docs in general could probably do with an update/improvement. I plan to do this 'soon'.

From golharam at umdnj.edu Mon Jul 17 10:13:20 2006
From: golharam at umdnj.edu (Ryan Golhar)
Date: Mon, 17 Jul 2006 10:13:20 -0400
Subject: [Bioperl-l] advice
In-Reply-To:
Message-ID: <004301c6a9ab$2896c8b0$2f01a8c0@GOLHARMOBILE1>

I apologize that this is off-topic, but it is an interesting email. Notice the lack of vowels (whn, ny, nd, shld, b) however in other words, the vowels are clearly included.

Am I getting old or is "internet spelling" starting to differ from "english spelling"? Or is it that the younger generation (not that I'm old...a mere 32 is not old), using shorthand for frequently used words?
From arareko at campus.iztacala.unam.mx Mon Jul 17 11:31:09 2006
From: arareko at campus.iztacala.unam.mx (Mauricio Herrera Cuadra)
Date: Mon, 17 Jul 2006 10:31:09 -0500
Subject: [Bioperl-l] advice
In-Reply-To: <004301c6a9ab$2896c8b0$2f01a8c0@GOLHARMOBILE1>
References: <004301c6a9ab$2896c8b0$2f01a8c0@GOLHARMOBILE1>
Message-ID: <44BBAD3D.2040203@campus.iztacala.unam.mx>

Maybe it's a new "obscure" perl6 syntax :)

--
MAURICIO HERRERA CUADRA
arareko at campus.iztacala.unam.mx
Laboratorio de Genética
Unidad de Morfofisiología y Función
Facultad de Estudios Superiores Iztacala, UNAM

From cjfields at uiuc.edu Mon Jul 17 12:09:27 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Mon, 17 Jul 2006 11:09:27 -0500
Subject: [Bioperl-l] advice
In-Reply-To: <004301c6a9ab$2896c8b0$2f01a8c0@GOLHARMOBILE1>
Message-ID: <000a01c6a9bb$6478ba90$15327e82@pyrimidine>

Ha ! I *almost* added something about that. I thought his vowel keys were broken for a bit, maybe from pounding the keyboard with extreme frustration!
As an aside, doesn't Damian Conway say something about the non-use of vowels in 'Perl Best Practices?' I think it was in relation to variables, though...

Chris
From bix at sendu.me.uk Mon Jul 17 12:31:37 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Mon, 17 Jul 2006 17:31:37 +0100
Subject: [Bioperl-l] Bio::*Taxonomy* changes
Message-ID: <44BBBB69.6000906@sendu.me.uk>

I see strange node names via Bio::DB::Taxonomy::flatfile:

use Bio::DB::Taxonomy;

my $db = new Bio::DB::Taxonomy(-source => 'flatfile',
                               -directory => $taxonomy_dir,
                               -nodesfile => $taxonomy_dir.'nodes.dmp',
                               -namesfile => $taxonomy_dir.'names.dmp');

my $tax_id = 89593;
my $node = $db->get_Taxonomy_Node($tax_id);

print "node $tax_id has name '", @{$node->name('common')}, "' and rank '", $node->rank, "'\n";

Results in:
node 89593 has name 'Craniata ' and rank 'subphylum'

Other examples:
node 2 has name 'Bacteria ' and rank 'superkingdom'
node 1386 has name 'Bacillus ' and rank 'genus'
node 7776 has name 'Gnathostomata ' and rank 'superclass'
etc.

For me the bits in <> are inappropriate and shouldn't be there. The NCBI website agrees, and you won't see these things if you use -source => 'entrez'. Should they be removed by the flatfile parser as a matter of course, with no warnings or option? Or do people want them? Typically they are just the name of the parent node, so I don't see why anyone would /need/ them, and I argue it's invalid for parent node information to be duplicated here.

If there are no objections I'll strip the <> bits. I also plan to make $node->name('scientific', 'sapiens'); set and get the node name, and have flatfile and entrez store all common names with $obj->name('common', 'human', 'man');. As these changes will make the implementation match the docs I don't see any problems, except that flatfile users will now find the node name in a different place (@{$node->name('scientific')} instead of @{$node->name('common')}).

I'll also fix the problem with node names for ranks species and lower, as discussed in thread 'Bio::DB::Taxonomy:: mishandles species, subspecies/variant names', in the way I suggested there.

If anyone can see a problem with any of these changes, let me know asap.

From hlapp at gmx.net Mon Jul 17 13:53:17 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Mon, 17 Jul 2006 13:53:17 -0400
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44BBBB69.6000906@sendu.me.uk>
References: <44BBBB69.6000906@sendu.me.uk>
Message-ID: <8C935CDE-BEFF-48BC-8CC8-0C3FFEF2EF17@gmx.net>

Sound good to me.

BTW NCBI guarantees (well, promises) that there will only be one node name of class 'scientific'.

-hilmar

--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From cjfields at uiuc.edu Mon Jul 17 14:31:08 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Mon, 17 Jul 2006 13:31:08 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <8C935CDE-BEFF-48BC-8CC8-0C3FFEF2EF17@gmx.net>
Message-ID: <001d01c6a9cf$2cf50f60$15327e82@pyrimidine>

I agree. Would be nice to get this to play well with weird bacterial names!
I plan on doing some behind-the-scenes work on Bio::DB::Taxonomy::entrez at some point soon to test out Bio::DB::EUtilities as the user agent; it currently uses Bio::Root::HTTPget, I think. Reason I'm doing this is to quickly get tax info based on any primary ID, primarily for grabbing related Tax information from the sequence GI w/o parsing the sequence for the TaxID; this uses NCBI's ELink which I've now implemented. I'll make sure everything passes tests before I commit. Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Hilmar Lapp > Sent: Monday, July 17, 2006 12:53 PM > To: Sendu Bala > Cc: bioperl-l at lists.open-bio.org > Subject: Re: [Bioperl-l] Bio::*Taxonomy* changes > > Sound good to me. > > BTW NCBI guarantees (well, promises) that there will only be one node > name of class 'scientific'. > > -hilmar > > On Jul 17, 2006, at 12:31 PM, Sendu Bala wrote: > > > I see strange node names via Bio::DB::Taxonomy::flatfile: > > > > use Bio::DB::Taxonomy; > > > > my $db = new Bio::DB::Taxonomy(-source => 'flatfile', -directory => > > $taxonomy_dir, -nodesfile => $taxonomy_dir.'nodes.dmp', -namesfile => > > $taxonomy_dir.'names.dmp'); > > > > my $tax_id = 89593; > > my $node = $db->get_Taxonomy_Node($tax_id); > > > > print "node $tax_id has name '", @{$node->name('common')}, "' and rank > > '", $node->rank, "'\n"; > > > > Results in: > > node 89593 has name 'Craniata ' and rank 'subphylum' > > > > Other examples: > > node 2 has name 'Bacteria ' and rank 'superkingdom' > > node 1386 has name 'Bacillus ' and rank 'genus' > > node 7776 has name 'Gnathostomata ' and rank 'superclass' > > etc. > > > > For me the bits in <> are inappropriate and shouldn't be there. The > > NCBI > > website agrees, and you won't see these things if you use -source => > > 'entrez'. Should they be removed by the flatfile parser as a matter of > > course, with no warnings or option? 
Or do people want them? Typically > > they are just the name of the parent node, so I don't see why anyone > > would /need/ them, and I argue it's invalid for parent node > > information > > to be duplicated here. > > > > If there are no objections I'll strip the <> bits. I also plan to make > > $node->name('scientific', 'sapiens'); set and get the node name, and > > have flatfile and entrez store all common names with > > $obj->name('common', 'human', 'man');. As these changes will make the > > implementation match the docs I don't see any problems, except that > > flatfile users will now find the node name in a different place > > (@{$node->name('scientific')} instead of @{$node->name('common')}). > > > > I'll also fix the problem with node names for ranks species and lower, > > as discussed in thread 'Bio::DB::Taxonomy:: mishandles species, > > subspecies/variant names', in the way I suggested there. > > > > If anyone can see a problem with any of these changes, let me know > > asap. 
From bix at sendu.me.uk Mon Jul 17 14:09:44 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Mon, 17 Jul 2006 19:09:44 +0100
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <8C935CDE-BEFF-48BC-8CC8-0C3FFEF2EF17@gmx.net>
References: <44BBBB69.6000906@sendu.me.uk> <8C935CDE-BEFF-48BC-8CC8-0C3FFEF2EF17@gmx.net>
Message-ID: <44BBD268.2060308@sendu.me.uk>

Hilmar Lapp wrote:
>> I also plan to make $node->name('scientific', 'sapiens'); set and get the node name, [...] users will now find the node name in [...] @{$node->name('scientific')}
>
> BTW NCBI guarantees (well, promises) that there will only be one node name of class 'scientific'.

Yes, which is why I feel the API for name() isn't ideal, but I thought it would be best to play along. Would a new scientific_name() method that gets/sets a single value be better? Perhaps it could just be a more 'sane' shorthand for setting @{$node->name('scientific')} to a list containing only the supplied name, and for getting ${$node->name('scientific')}[0]?
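
A minimal sketch of the shorthand proposed above, for illustration only. My::Node is a hypothetical stand-in class, not the real Bio::Taxonomy::Node, and its name() is only assumed to behave as described in this thread (set a list of names for a name class, return an array ref).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for a taxonomy node; NOT the real Bio::Taxonomy::Node.
package My::Node;

sub new { return bless { names => {} }, shift }

# name($class, @names): store the names for a name class when any are given;
# always return an array ref of the names held for that class.
sub name {
    my ($self, $class, @names) = @_;
    $self->{names}{$class} = [@names] if @names;
    return $self->{names}{$class} || [];
}

# scientific_name([$name]): single-value get/set layered on name('scientific').
# Setting replaces the list with just the supplied name; getting returns
# the first (and, by NCBI's promise, only) element.
sub scientific_name {
    my ($self, $name) = @_;
    $self->name('scientific', $name) if defined $name;
    return $self->name('scientific')->[0];
}

package main;

my $node = My::Node->new;
$node->scientific_name('Craniata');
print "scientific name: ", $node->scientific_name, "\n";
print "names in class:  ", scalar @{ $node->name('scientific') }, "\n";
```

Layering the shorthand on name() keeps a single place where names are stored, so callers of the existing name('scientific') form and callers of the shorthand see the same value.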
From hlapp at gmx.net Mon Jul 17 15:31:55 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Mon, 17 Jul 2006 15:31:55 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BBD268.2060308@sendu.me.uk> References: <44BBBB69.6000906@sendu.me.uk> <8C935CDE-BEFF-48BC-8CC8-0C3FFEF2EF17@gmx.net> <44BBD268.2060308@sendu.me.uk> Message-ID: <5B62229C-BAB7-4320-BBAE-87A483B0EC15@gmx.net> Yes I think $node->scientific_name() as shorthand would be good to have. Same BTW for $node->common_names() (which would return an array). -hilmar On Jul 17, 2006, at 2:09 PM, Sendu Bala wrote: > Hilmar Lapp wrote: >>> I also plan to make $node->name('scientific', 'sapiens'); set and >>> get the node name, [...] users will now find the node name in [...] >>> @{$node->name('scientific')} >> >> BTW NCBI guarantees (well, promises) that there will only be one node >> name of class 'scientific'. > > Yes, which is why I feel the API for name() isn't ideal, but > thought it > would be best to play along. Would having a new scientific_name() > method > be better, which gets/sets a single value? Perhaps it could just be a > more 'sane' shorthand to setting @{$node->name('scientific')} to a > list > with only the supplied name, and getting ${$node->name > ('scientific')}[0] ? 
> _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From cjfields at uiuc.edu Mon Jul 17 16:44:18 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 17 Jul 2006 15:44:18 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <5B62229C-BAB7-4320-BBAE-87A483B0EC15@gmx.net> Message-ID: <000001c6a9e1$c6b51610$15327e82@pyrimidine> There was some interest in getting Bio::Species to delegate to Bio::Taxonomy::Node, so having scientific_name() would help quite a bit since the name used on the ORGANISM line is the scientific name (well, is supposed to be; famous last words). Don't know about SwissProt, EMBL, and others though... Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Hilmar Lapp > Sent: Monday, July 17, 2006 2:32 PM > To: Sendu Bala > Cc: bioperl-l at lists.open-bio.org > Subject: Re: [Bioperl-l] Bio::*Taxonomy* changes > > Yes I think $node->scientific_name() as shorthand would be good to > have. Same BTW for $node->common_names() (which would return an array). > > -hilmar > > On Jul 17, 2006, at 2:09 PM, Sendu Bala wrote: > > > Hilmar Lapp wrote: > >>> I also plan to make $node->name('scientific', 'sapiens'); set and > >>> get the node name, [...] users will now find the node name in [...] > >>> @{$node->name('scientific')} > >> > >> BTW NCBI guarantees (well, promises) that there will only be one node > >> name of class 'scientific'. > > > > Yes, which is why I feel the API for name() isn't ideal, but > > thought it > > would be best to play along. Would having a new scientific_name() > > method > > be better, which gets/sets a single value? 
Perhaps it could just be a more 'sane' shorthand to setting @{$node->name('scientific')} to a list with only the supplied name, and getting ${$node->name('scientific')}[0] ?

From vrramnar at student.cs.uwaterloo.ca Mon Jul 17 16:46:32 2006
From: vrramnar at student.cs.uwaterloo.ca (vrramnar at student.cs.uwaterloo.ca)
Date: Mon, 17 Jul 2006 16:46:32 -0400
Subject: [Bioperl-l] Remote Blast - Blast Human Genome
In-Reply-To: <000501c6a6e8$b9c24d20$15327e82@pyrimidine>
References: <000501c6a6e8$b9c24d20$15327e82@pyrimidine>
Message-ID: <1153169192.44bbf728056fd@www.nexusmail.uwaterloo.ca>

Hi Chris,

1. I have tried changing the database to snp or dbSNP, but neither works. It seems that, depending on which type of BLAST you use (i.e., Genome Blast, Blast SNP, normal blast such as blastn, etc.), you see a different listing of databases available for queries. Since you mention that the BLAST page I see was generated by Genome, where could I go to see a complete listing of the databases I can query? Or do you know offhand which database to search if I only want dbSNP hits?

2. You also mention that I can limit the search by using Entrez terms. Do you mean like:

$Bio::Tools::Run::RemoteBlast::HEADER{'ENTREZ_QUERY'} = 'abc';

where 'abc' is the name of the subject you would like to restrict the results to? For example, if you set it to 'Homo sapiens[Organism]', then only human sequences would be in the hit lists.
If this is what you mean, what would I change it to, to see only hits from dbSNP? Thanks for the ongoing help, Rohan Quoting Chris Fields : > I added a method to RemoteBlast in bioperl-live (CVS) if you want to play > with changing the URL. I have been thinking about doing this for a bit now > but I already see problems. > > Here's the issue: the BLAST page you see is NOT the NCBI BLAST page (note > the differences in the URL) but a user-friendly request page, generated on > the fly by Genome, to submit BLAST requests for the relevant database. So > changing the URL will not work (even by adding extra parameters); you only > get the original HTML web page. > > You could try changing the database or limiting the search using an Entrez > term (which you should be able to include in the request, probably by adding > it to the HEADER). > > Chris > > > -----Original Message----- > > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > > bounces at lists.open-bio.org] On Behalf Of vrramnar at student.cs.uwaterloo.ca > > Sent: Thursday, July 13, 2006 5:39 PM > > To: bioperl-l at lists.open-bio.org > > Subject: [Bioperl-l] Remote Blast - Blast Human Genome > > > > > > Hello Again, > > > > I have another question regarding Remote blast but this time using Genome > > Blast. > > > > Here is the link: > > > > http://www.ncbi.nlm.nih.gov/genome/seq/BlastGen/BlastGen.cgi?taxid=9606 > > > > which again uses the main Blast web site: > > > > http://www.ncbi.nlm.nih.gov/BLAST/Blast.cgi > > > > Again I am not sure what to add or what HEADER information to change > > within my > > script. 
> > > > Here is my program, which was the same as the last email: > > > > #!/usr/bin/perl -w > > > > use Bio::Perl; > > use Bio::Tools::Run::RemoteBlast; > > > > my $prog = "blastn"; > > my $db = "refseq_genomic"; > > my $e_val = 0.01; > > > > my @params = ( '-prog' => $prog, > > '-data' => $db, > > '-expect' => $e_val); > > > > my $factory = new Bio::Tools::Run::RemoteBlast->new(@params); > > $Bio::Tools::Run::RemoteBlast::HEADER{'WWW_BLAST_TYPE'} = '????'; <----- > > what > > do I put here > > #$Bio::Tools::Run::RemoteBlast::HEADER{'?????'} = '????'; <--- Do I need > > to add > > any other values to the form inputs > > > > $factory->submit_blast("blast.in"); > > $v = 1; > > > > while (my @rids = $factory->each_rid) > > { foreach my $rid ( @rids ) > > { my $rc = $factory->retrieve_blast($rid); > > if( !ref($rc) ) > > { if( $rc < 0 ) > > { $factory->remove_rid($rid); > > } > > print STDERR "." if ( $v > 0 ); > > sleep 5; > > } > > else > > { my $result = $rc->next_result(); > > my $filename = $result->query_name()."\.out"; > > $factory->save_output($filename); > > $factory->remove_rid($rid); > > print "\nQuery Name: ", $result->query_name(), "\n"; > > } > > } > > } > > > > > > Both of my questions are very similiar as in I know how to use remote > > blast but > > not sure what to change to access the specific blast I want. > > > > Again, any help would be very appreciated!! 
> > Rohan

From cjfields at uiuc.edu Mon Jul 17 17:25:54 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Mon, 17 Jul 2006 16:25:54 -0500
Subject: [Bioperl-l] Remote Blast - Blast Human Genome
In-Reply-To: <1153169192.44bbf728056fd@www.nexusmail.uwaterloo.ca>
Message-ID: <001001c6a9e7$962b56c0$15327e82@pyrimidine>

Okay, I think I may know a little more now about what's going on with NCBI's BLAST interface. It looks like any NCBI BLAST query must use the default URL, so the request must be set up with the proper GET/PUT commands to retrieve everything correctly. Here's the API description for it all:

http://www.ncbi.nlm.nih.gov/BLAST/Doc/urlapi.html

You could try setting the database to 'snp' or something along those lines instead of 'nr'; or you could see what the name of the database is when you use the web form and try setting it to that. According to this page, this should be possible:

http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=helpsnpfaq.section.SearchdbSNP_test._Search_dbSNP_Using_B

The Entrez Query limit was a recommendation for limiting your search to a set of sequences, for human, for instance.

I'll try looking into it a bit more, but I'm pretty busy. If you find anything out, you should probably post it here.

Chris

> Hi Chris,
>
> 1. I have tried changing the database to snp or dbSNP but neither works. It seems that depending on which type of blast you use(ie, Genome Blast, Blast SNP, normal blast such as blastn, etc...) you see a different listing of databases available for querys.
Since you mention that the Blast page I see was > generated > by Genome, where could I go to see a complete listing of databases I can > query?? > Or if you knew off hand which database to search if I only wanted dbSNP > hits? > > 2. You also mention, I can limit the search by using Entrez terms. Do you > mean > like: > $Bio::Tools::Run::RemoteBlast::HEADER{'ENTREZ_QUERY'} = 'abc'; > where 'abc' is the name of the subject with which you would only like to > see > result of. For example if you put it as 'Homo sapiens[Organism]' then only > human > sequences would be in hit lists. > If this is what you mean, what would I change it to, to see only hits from > dbSNP? > > Thanks for the ongoing help, > > Rohan > > Quoting Chris Fields : > > > I added a method to RemoteBlast in bioperl-live (CVS) if you want to > play > > with changing the URL. I have been thinking about doing this for a bit > now > > but I already see problems. > > > > Here's the issue: the BLAST page you see is NOT the NCBI BLAST page > (note > > the differences in the URL) but a user-friendly request page, generated > on > > the fly by Genome, to submit BLAST requests for the relevant database. > So > > changing the URL will not work (even by adding extra parameters); you > only > > get the original HTML web page. > > > > You could try changing the database or limiting the search using an > Entrez > > term (which you should be able to include in the request, probably by > adding > > it to the HEADER). > > > > Chris > > > > > -----Original Message----- > > > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > > > bounces at lists.open-bio.org] On Behalf Of > vrramnar at student.cs.uwaterloo.ca > > > Sent: Thursday, July 13, 2006 5:39 PM > > > To: bioperl-l at lists.open-bio.org > > > Subject: [Bioperl-l] Remote Blast - Blast Human Genome > > > > > > > > > Hello Again, > > > > > > I have another question regarding Remote blast but this time using > Genome > > > Blast. 
> > > > > > Here is the link: > > > > > > > http://www.ncbi.nlm.nih.gov/genome/seq/BlastGen/BlastGen.cgi?taxid=9606 > > > > > > which again uses the main Blast web site: > > > > > > http://www.ncbi.nlm.nih.gov/BLAST/Blast.cgi > > > > > > Again I am not sure what to add or what HEADER information to change > > > within my > > > script. > > > > > > Here is my program, which was the same as the last email: > > > > > > #!/usr/bin/perl -w > > > > > > use Bio::Perl; > > > use Bio::Tools::Run::RemoteBlast; > > > > > > my $prog = "blastn"; > > > my $db = "refseq_genomic"; > > > my $e_val = 0.01; > > > > > > my @params = ( '-prog' => $prog, > > > '-data' => $db, > > > '-expect' => $e_val); > > > > > > my $factory = new Bio::Tools::Run::RemoteBlast->new(@params); > > > $Bio::Tools::Run::RemoteBlast::HEADER{'WWW_BLAST_TYPE'} = '????'; <-- > --- > > > what > > > do I put here > > > #$Bio::Tools::Run::RemoteBlast::HEADER{'?????'} = '????'; <--- Do I > need > > > to add > > > any other values to the form inputs > > > > > > $factory->submit_blast("blast.in"); > > > $v = 1; > > > > > > while (my @rids = $factory->each_rid) > > > { foreach my $rid ( @rids ) > > > { my $rc = $factory->retrieve_blast($rid); > > > if( !ref($rc) ) > > > { if( $rc < 0 ) > > > { $factory->remove_rid($rid); > > > } > > > print STDERR "." if ( $v > 0 ); > > > sleep 5; > > > } > > > else > > > { my $result = $rc->next_result(); > > > my $filename = $result->query_name()."\.out"; > > > $factory->save_output($filename); > > > $factory->remove_rid($rid); > > > print "\nQuery Name: ", $result->query_name(), "\n"; > > > } > > > } > > > } > > > > > > > > > Both of my questions are very similiar as in I know how to use remote > > > blast but > > > not sure what to change to access the specific blast I want. > > > > > > Again, any help would be very appreciated!! 
> > > Rohan

From bix at sendu.me.uk Mon Jul 17 17:33:26 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Mon, 17 Jul 2006 22:33:26 +0100
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <000001c6a9e1$c6b51610$15327e82@pyrimidine>
References: <000001c6a9e1$c6b51610$15327e82@pyrimidine>
Message-ID: <44BC0226.1080605@sendu.me.uk>

Chris Fields wrote:
> There was some interest in getting Bio::Species to delegate to Bio::Taxonomy::Node, so having scientific_name() would help quite a bit since the name used on the ORGANISM line is the scientific name (well, is supposed to be; famous last words).

Can you clarify exactly what you mean here? Preferably with an example? ORGANISM line of which file format?

The reason I ask is that I still feel we need to do parsing of the names for species rank and lower:

# The 'scientific name' for humans could be considered to be 'Homo sapiens'.
# Taxid 9606 in the NCBI taxonomy database has rank 'species' and ScientificName 'Homo sapiens'.
# For sanity, Bio::*Taxonomy* likes to interpret this ScientificName as 'sapiens' so that the genus is not held redundantly. It provides a binomial() method to give you 'Homo sapiens' again if you want it.
# I plan on maintaining this; scientific_name() would give you the non-redundant sibling-unique name 'sapiens'. binomial() on a species rank and lower would give you 'Homo sapiens' (presumably grabbing the 'Homo' from the parent node with rank 'genus', or similar).

Good, bad or ugly?
I would prefer it works like this and we agree to differ with NCBI on what the 'scientific name' of a species node should be. Bio::Species can still delegate to Bio::Taxonomy::Node by calling binomial() (which I propose will actually give the correct answer, even for bacteria and viruses). Perhaps the short-hand (and the classifier used in name()) shouldn't mention the word 'scientific' to avoid confusion? But a) what else would we call it?, and b) for all ranks above species it /is/ the scientific name. From hlapp at gmx.net Mon Jul 17 19:47:24 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Mon, 17 Jul 2006 19:47:24 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BC0226.1080605@sendu.me.uk> References: <000001c6a9e1$c6b51610$15327e82@pyrimidine> <44BC0226.1080605@sendu.me.uk> Message-ID: <64AB2D48-B342-4FC4-BB4E-9927B24C972C@gmx.net> I don't think we should differ from NCBI in places where the connection between a method name and the NCBI data file is obvious or otherwise we will confuse people and send them into traps. $node->scientific_name() should simply report what NCBI reports. For simple species this will be identical to what $node->binomial() returns, but for others it may not, e.g., strains, varieties, etc or the weird world of viri and bacteria. This will also absolve us from retaining the business logic for how to construct the scientific name from genus, species, and possibly strain or whatever. binomial() isn't part of the NCBI taxonomy definition, so you have freedom there to report what suits you. -hilmar On Jul 17, 2006, at 5:33 PM, Sendu Bala wrote: > Chris Fields wrote: >> There was some interest in getting Bio::Species to delegate to >> Bio::Taxonomy::Node, so having scientific_name() would help quite >> a bit >> since the name used on the ORGANISM line is the scientific name >> (well, is >> supposed to be; famous last words). > > Can you clarify exactly what you mean here? Preferably with an > example? 
> ORGANISM line of which file format? > The reason I ask is that I still feel we need to do parsing of the > names > for species rank and lower: > > # The 'scientific name' for humans could be considered to be 'Homo > sapiens'. > # Taxid 9606 in the NCBI taxonomy database has rank 'species' and > ScientificName 'Homo sapiens'. > # For sanity, Bio::*Taxonomy* likes to interpret this > ScientificName as > 'sapiens' so that the genus is not held redundantly. It provides a > binomial() method to give you 'Homo sapiens' again if you want it. > # I plan on maintaining this; scientific_name() would give you the > non-redundant sibling-unique name 'sapiens'. binomial() on a species > rank and lower would give you 'Homo sapiens' (presumably grabbing the > 'Homo' from the parent node with rank 'genus', or similar). > > Good, bad or ugly? I would prefer it works like this and we agree to > differ with NCBI on what the 'scientific name' of a species node > should > be. Bio::Species can still delegate to Bio::Taxonomy::Node by calling > binomial() (which I propose will actually give the correct answer, > even > for bacteria and viruses). > > Perhaps the short-hand (and the classifier used in name()) shouldn't > mention the word 'scientific' to avoid confusion? But a) what else > would > we call it?, and b) for all ranks above species it /is/ the > scientific name. 
> _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From osborne1 at optonline.net Mon Jul 17 20:52:04 2006 From: osborne1 at optonline.net (Brian Osborne) Date: Mon, 17 Jul 2006 20:52:04 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BC0226.1080605@sendu.me.uk> Message-ID: Sendu, The string "sapiens" is not what a biology textbook would call a scientific name. You're going to have to respect decades of convention and have scientific_name() return the genus and species name. Brian O. On 7/17/06 5:33 PM, "Sendu Bala" wrote: > # I plan on maintaining this; scientific_name() would give you the > non-redundant sibling-unique name 'sapiens'. binomial() on a species > rank and lower would give you 'Homo sapiens' (presumably grabbing the > 'Homo' from the parent node with rank 'genus', or similar). From cjfields at uiuc.edu Mon Jul 17 21:36:12 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 17 Jul 2006 20:36:12 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BC0226.1080605@sendu.me.uk> References: <000001c6a9e1$c6b51610$15327e82@pyrimidine> <44BC0226.1080605@sendu.me.uk> Message-ID: <1345AB61-E7AB-447A-AB40-2170244404B2@uiuc.edu> On Jul 17, 2006, at 4:33 PM, Sendu Bala wrote: > Chris Fields wrote: >> There was some interest in getting Bio::Species to delegate to >> Bio::Taxonomy::Node, so having scientific_name() would help quite >> a bit >> since the name used on the ORGANISM line is the scientific name >> (well, is >> supposed to be; famous last words). > > Can you clarify exactly what you mean here? Preferably with an > example? > ORGANISM line of which file format? 
> The reason I ask is that I still feel we need to do parsing of the names for species rank and lower:

Sorry, should have clarified; GenBank sequence format. Here's the link:

http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html

The ORGANISM annotation line for a GenBank record contains the formal scientific name for the organism along with the lineage. I believe SwissProt/EMBL and several other RichSeq formats do the same. The lineage that is also present is almost always abbreviated, so it's not always possible to determine the formal rankings strictly from the file with any real degree of reliability (hence the past problems with Bio::Species).

> # The 'scientific name' for humans could be considered to be 'Homo sapiens'.
> # Taxid 9606 in the NCBI taxonomy database has rank 'species' and ScientificName 'Homo sapiens'.
> # For sanity, Bio::*Taxonomy* likes to interpret this ScientificName as 'sapiens' so that the genus is not held redundantly. It provides a binomial() method to give you 'Homo sapiens' again if you want it.
> # I plan on maintaining this; scientific_name() would give you the non-redundant sibling-unique name 'sapiens'. binomial() on a species rank and lower would give you 'Homo sapiens' (presumably grabbing the 'Homo' from the parent node with rank 'genus', or similar).

I think you should use scientific_name to designate the full formal scientific name for an organism according to the way NCBI describes it for that particular node (nothing more, except removing the <> stuff you mentioned earlier) and as it would appear for the ORGANISM line. Otherwise you'll run into serious species/subspecies/strain headaches (see below). If you want real genus/species (i.e. nothing extra, like strains or subspecies), separate them out and store them using a genus/species get/set if possible; binomial() will then give back the two-name genus/species designation.

Here are a couple of examples (taken from XML retrieved using EUtilities).
These were retrieved by NCBI TaxID, with the TaxIDs obtained via ELink from a list of protein GIs (~700 in total), so they represent the actual NCBI TaxID linked with the sequence file. If you try breaking these apart into species, what happens to the strain/subspecies stuff? Notice that many of these nodes, which come directly from protein GIs, also have no rank.

...
TaxID:            376686
Scientific name:  Flavobacterium johnsoniae UW101
Other names:      Flavobacterium johnsoniae NBRC 14942
                  Flavobacterium johnsoniae IFO 14942
                  Flavobacterium johnsoniae IAM 14304
                  Flavobacterium johnsoniae MYX.1.1.1
                  Flavobacterium johnsoniae NCIB 11054
                  Flavobacterium johnsoniae DSM 2064
                  Flavobacterium johnsoniae LMG 1341
                  Flavobacterium johnsoniae ATCC 17061
                  Flavobacterium johnsoniae strain UW101
                  Flavobacterium johnsoniae str. UW101
Parent TaxID:     986
Rank:             no rank
Division:         Bacteria
...
TaxID:            370552
Scientific name:  Streptococcus pyogenes MGAS10270
Other names:      Streptococcus pyogenes strain MGAS10270
                  Streptococcus pyogenes str. MGAS10270
Parent TaxID:     301448
Rank:             no rank
Division:         Bacteria
...
TaxID:            224308
Scientific name:  Bacillus subtilis subsp. subtilis str. 168
Other names:      Bacillus subtilis subsp. subtilis 168
Parent TaxID:     135461
Rank:             no rank
Division:         Bacteria

> Good, bad or ugly? I would prefer it works like this and we agree to differ with NCBI on what the 'scientific name' of a species node should be. Bio::Species can still delegate to Bio::Taxonomy::Node by calling binomial() (which I propose will actually give the correct answer, even for bacteria and viruses).

This is where I would strongly disagree (though I agree that the way NCBI uses 'scientific name' is a bit off). We are using the NCBI tax database, and as such we are somewhat at the mercy of the NCBI tax nomenclature, unfortunately. If NCBI decides to change their official definition for the scientific name to something that makes a bit more sense, the XML and dump data will reflect that and we won't have many problems adapting, since the scientific name will always conform to their definition. But if we split the information up ad hoc then we are bound for disaster; it's just way too much headache to worry about.
We could always point to the official NCBI definition as the one we adopt and then assign the tagged information from the node directly to scientific_name (no globbing together at all). Bio::Species could delegate likewise for the ORGANISM line, so there are no piecemeal attempts to get Humpty Dumpty to fit back together again.

You could go through and get the lineage from the XML/dump file data and try to sort the genus/species out, then paste it all back together (fingers crossed!), but I think it's more headache than it's worth to split these up, then hope that you can paste them back together again and always expect to get the same results.

Chris

> Perhaps the short-hand (and the classifier used in name()) shouldn't mention the word 'scientific' to avoid confusion? But a) what else would we call it?, and b) for all ranks above species it /is/ the scientific name.

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign

From cjfields at uiuc.edu Mon Jul 17 21:55:28 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Mon, 17 Jul 2006 20:55:28 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To:
References:
Message-ID: <07041A82-BE01-4998-AF26-35212CA1F0F2@uiuc.edu>

I agree with Hilmar's assessment, not b/c I disagree with your definition of scientific name or the reasoning Sendu proposes. I think we are somewhat bound to NCBI's nomenclature for their tax database. If we veer away from NCBI's definition for 'scientific name' it will just confuse users and lead to more trouble than it's worth, frankly. If we stick with it then any changes NCBI makes should be easier to deal with.
Leaving the scientific_name as NCBI designates it, though it probably disagrees with ~99% of the world's textbooks, may be the most maintainable solution. Now, binomial() on the other hand... Chris On Jul 17, 2006, at 7:52 PM, Brian Osborne wrote: > Sendu, > > The string "sapiens" is not what a biology textbook would call a > scientific > name. You're going to have to respect decades of convention and have > scientific_name() return the genus and species name. > > Brian O. > > > On 7/17/06 5:33 PM, "Sendu Bala" wrote: > >> # I plan on maintaining this; scientific_name() would give you the >> non-redundant sibling-unique name 'sapiens'. binomial() on a species >> rank and lower would give you 'Homo sapiens' (presumably grabbing the >> 'Homo' from the parent node with rank 'genus', or similar). > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From hlapp at gmx.net Mon Jul 17 22:06:01 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Mon, 17 Jul 2006 22:06:01 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <07041A82-BE01-4998-AF26-35212CA1F0F2@uiuc.edu> References: <07041A82-BE01-4998-AF26-35212CA1F0F2@uiuc.edu> Message-ID: On Jul 17, 2006, at 9:55 PM, Chris Fields wrote: > Leaving the scientific_name as NCBI designates it, though it probably > disagrees with ~99% of the world's textbooks, may be the most > maintainable solution. It doesn't disagree, it's quite like what the world's textbooks give you as a 'scientific name'. 
-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From cjfields at uiuc.edu Tue Jul 18 00:24:50 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Mon, 17 Jul 2006 23:24:50 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To:
References: <07041A82-BE01-4998-AF26-35212CA1F0F2@uiuc.edu>
Message-ID: <7BCA093B-90FB-4B0A-91FD-A6E0B34C96DD@uiuc.edu>

If you mean genus-species, then yes. But parent nodes?

If you trust Wikipedia, the scientific name == binomial nomenclature. Which could mean no subspecies, strains, etc. if one were to be really strict about it, though that may be a grey area; I'm no taxonomist.

http://en.wikipedia.org/wiki/Scientific_name

The parent nodes shouldn't have a scientific name if one were to adhere strictly to the standard definition above, but NCBI refers to the names for the parent nodes as 'scientific name' (the XML element is still ScientificName, just like the child node). I'm not sure what the tax dump file is, though, so that may be different.

Here's the lineage for Taxid 312284 (marine actinobacterium PHSC20C1). I cut out the irrelevant bits and just show the lineage with all the parent nodes, taxID, and rank:

TaxID    Name                                          Rank
131567   cellular organisms                            no rank
2        Bacteria                                      superkingdom
201174   Actinobacteria                                phylum
1760     Actinobacteria (class)                        class
52018    unclassified Actinobacteria                   no rank
78537    unclassified Actinobacteria (miscellaneous)   no rank
....

Seems to me the easiest thing to do here, when looking at a particular node, is to use scientific_name() to hold that particular element for the node and have binomial represent the true 'scientific name', much as Sendu proposed.
It would also make life much easier when parsing GenBank/SwissProt/EMBL (SeqIO) to have the data designating the formal scientific name (according to NCBI) be assigned to a scientific_name() get/set method in Bio::Species for later writing; then if we want to delegate this over to Bio::Taxonomy::Node from Bio::Species it would be that much easier. This would also get around some of the problems I have been seeing with bacterial names when passing GenBank data through SeqIO, since you wouldn't be required to glop the name together from the way Bio::Species tried to guess the lineage. Chris On Jul 17, 2006, at 9:06 PM, Hilmar Lapp wrote: > > On Jul 17, 2006, at 9:55 PM, Chris Fields wrote: > >> Leaving the scientific_name as NCBI designates it, though it probably >> disagrees with ~99% of the world's textbooks, may be the most >> maintainable solution. > > It doesn't disagree, it's quite like what the world's textbooks give > you as a 'scientific name'. > > -hilmar > > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. 
Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From bix at sendu.me.uk Tue Jul 18 03:27:49 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Tue, 18 Jul 2006 08:27:49 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <64AB2D48-B342-4FC4-BB4E-9927B24C972C@gmx.net> References: <000001c6a9e1$c6b51610$15327e82@pyrimidine> <44BC0226.1080605@sendu.me.uk> <64AB2D48-B342-4FC4-BB4E-9927B24C972C@gmx.net> Message-ID: <44BC8D75.1080806@sendu.me.uk> Hilmar Lapp wrote: > I don't think we should differ from NCBI in places where the > connection between a method name and the NCBI data file is obvious or > otherwise we will confuse people and send them into traps. > > $node->scientific_name() should simply report what NCBI reports. For > simple species this will be identical to what $node->binomial() > returns, but for others it may not, e.g., strains, varieties, etc or > the weird world of viri and bacteria. Ok, well this certainly seems to be consensus so I'll abide. > This will also absolve us from retaining the business logic for how > to construct the scientific name from genus, species, and possibly > strain or whatever. What about the existing genus(), species(), sub_species() and variant() methods? There would be no need for any logic to join things together, but I would still like to be able to get just 'sapiens' from somewhere. Can I use species() for that purpose (though again, species is strictly 'Homo sapiens')? Likewise sub_species() and variant() could hold the remaining non-redundant names. Or should all of these be deprecated because they don't really have a place in a generic Node class? What about node_name()? Yet another synonym of scientific_name? (right now it grabs the common name(s)). Ugh. What should I do with the classification array? Should it hold the raw ScientificName like: join(',', $node->classification) eq 'Homo sapiens, Homo, Homo/Pan/Gorilla group [...]'? 
Or should it be like: join(',', $node->classification) eq 'sapiens, Homo, Homo/Pan/Gorilla group [...]'? The latter is how it currently works (when it works correctly); I would rather fix it than lose the logic completely, but if we're staying true to proper classification (vs. what a programmer might expect), I guess I must use the raw ScientificName? > binomial() isn't part of the NCBI taxonomy definition, so you have > freedom there to report what suits you. I don't think binomial() would serve any useful purpose now, however. I can either deprecate it or make it a synonym of scientific_name() or both. Or binomial() can be a version of scientific_name() that complains if you use it on a rank higher or lower than species. As for species() et al., it may have no place in a generic Node class. Thoughts? From bix at sendu.me.uk Tue Jul 18 04:43:43 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Tue, 18 Jul 2006 09:43:43 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BBBB69.6000906@sendu.me.uk> References: <44BBBB69.6000906@sendu.me.uk> Message-ID: <44BC9F3F.2040500@sendu.me.uk> Sendu Bala wrote: [snip proposed changes to Bio::DB::Taxonomy::* and Bio::Taxonomy::Node] > If anyone can see a problem with any of these changes, let me know asap. I've just realised that there are currently no tests for Bio::DB::Taxonomy::flatfile, and that the ones for entrez get skipped. Node doesn't get an especially thorough work-out either (in the skipped section). I'm guessing it's not feasible to include the full taxdump from NCBI (~40MB) in t/data... do people think it would be reasonable to create some sort of small subset of the data? I could just pull out the lines from names.dmp and nodes.dmp relevant to a few example organisms. Say, for human and a tricky bacteria and virus? For the purposes of running the test, where should the index files be kept? In t/data with the .dmp files or in /tmp? Should the test script delete them afterwards, or leave them be? 
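
The subsetting idea above (pull out only the names.dmp and nodes.dmp lines relevant to a few example organisms) could be sketched roughly as follows. This is a minimal sketch, assuming the usual NCBI dump layout in which fields are separated by tab-pipe-tab, nodes.dmp starts with tax_id, parent tax_id and rank, and names.dmp starts with tax_id, name, unique name and name class. The helper names (dmp_fields, lineage_ids, subset_lines) are hypothetical, and the inlined rows are abbreviated toy data (Homo is attached directly to the root), not real dump content.

```perl
use strict;
use warnings;

# Toy stand-ins for nodes.dmp and names.dmp (abbreviated, NOT real dump content).
my @nodes_dmp = (
    "1\t|\t1\t|\tno rank\t|\n",
    "9605\t|\t1\t|\tgenus\t|\n",
    "9606\t|\t9605\t|\tspecies\t|\n",
    "562\t|\t1\t|\tspecies\t|\n",
);
my @names_dmp = (
    "1\t|\troot\t|\t\t|\tscientific name\t|\n",
    "9605\t|\tHomo\t|\t\t|\tscientific name\t|\n",
    "9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|\n",
    "9606\t|\thuman\t|\t\t|\tgenbank common name\t|\n",
    "562\t|\tEscherichia coli\t|\t\t|\tscientific name\t|\n",
);

# Split one dump line into its fields.
sub dmp_fields {
    my ($line) = @_;
    $line =~ s/\t\|\n?\z//;              # drop the trailing tab-pipe terminator
    return split /\t\|\t/, $line, -1;    # -1 keeps empty fields
}

# Collect every taxid on the path from each wanted taxid up to the root.
sub lineage_ids {
    my ($nodes, @wanted) = @_;
    my %parent_of;
    for my $line (@$nodes) {
        my ($id, $parent) = dmp_fields($line);
        $parent_of{$id} = $parent;
    }
    my %keep;
    for my $id (@wanted) {
        while (defined $id && !$keep{$id}) {
            $keep{$id} = 1;
            my $parent = $parent_of{$id};
            last if !defined $parent || $parent eq $id;   # root is its own parent
            $id = $parent;
        }
    }
    return \%keep;
}

# Keep only the dump lines whose first field is a wanted taxid.
sub subset_lines {
    my ($lines, $keep) = @_;
    return grep { my ($id) = dmp_fields($_); $keep->{$id} } @$lines;
}

my $keep       = lineage_ids(\@nodes_dmp, 9606);
my @mini_nodes = subset_lines(\@nodes_dmp, $keep);
my @mini_names = subset_lines(\@names_dmp, $keep);
print @mini_nodes, @mini_names;
```

With real dump files the two arrays would be read from disk and the filtered lines written to something like t/data; where exactly they should live is the open question in this thread.
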
The entrez tests are skipped to 'avoid blocking', but the test only makes 2 entrez queries with a sleep(3) in-between. Basically, I don't think there's ever any reason to skip. Shall I remove the skip? Lots of other database-accessing tests in the test suite just go right ahead and access their database, no problem. Cheers, Sendu. From torsten.seemann at infotech.monash.edu.au Mon Jul 17 23:53:02 2006 From: torsten.seemann at infotech.monash.edu.au (Torsten Seemann) Date: Tue, 18 Jul 2006 13:53:02 +1000 Subject: [Bioperl-l] advice In-Reply-To: <000a01c6a9bb$6478ba90$15327e82@pyrimidine> References: <000a01c6a9bb$6478ba90$15327e82@pyrimidine> Message-ID: <44BC5B1E.5080600@infotech.monash.edu.au> > Ha ! I *almost* added something about that. I thought his vowel keys were > broken for a bit, maybe from pounding the keyboard with extreme frustration! The wide variety of pronunciation of English around the world can be mostly blamed on those damned vowels... so perhaps removing them helps one to reach a wider audience :-) > As an aside, doesn't Damian Conway say something about the non-use of vowels > in 'Perl Best Practices?' I think it was in relation to variables, > though... Yeah, on page 46 he says NOT to remove vowels in variable names, use prefixes instead. Apprntly rmvng vwls mks t hrdr t ndrstnd stff. (Actually, I studied at Monash University under Damian Conway, and recall his ridiculing of Perl, so I found it kind of ironic that he ended up changing the Perl landscape so significantly! 
He even wrote an internal publication "theStyle - a guide to C programming style" in about 1990 in which he violates some of his later Perl Best Practices :-) -- Dr Torsten Seemann http://www.vicbioinformatics.com Victorian Bioinformatics Consortium, Monash University, Australia From sharma.animesh at gmail.com Tue Jul 18 03:58:41 2006 From: sharma.animesh at gmail.com (Animesh Sharma) Date: Tue, 18 Jul 2006 13:28:41 +0530 Subject: [Bioperl-l] PDB file parser (Separates chain-sequence and chain-structure) Message-ID: <156674e60607180058r653fa8fesbc654508c9c19b5b@mail.gmail.com> Hi Chris, I have written a small script to separate the chains in a PDB file. It stores the sequence (fasta format) and structure (pdb format) of each chain in separate files, with a middle name according to the chain it contains. If the PDB file has only one chain, it creates a file with 'default' as the middle name. E.g.,
perl pdb_chain_extract.pl 1HCO.pdb
will create 4 files with names:
1HCO.A.fas (Sequence of Chain A in fasta format)
1HCO.A.pdb (Structure of Chain A in pdb format)
1HCO.B.fas (Sequence of Chain B in fasta format)
1HCO.B.pdb (Structure of Chain B in pdb format)
I wrote it in the spirit of your example script given @ http://cvs.bioperl.org/cgi-bin/viewcvs/viewcvs.cgi/bioperl-live/examples/structure/structure-io.pl?rev=1.2&content-type=text/vnd.viewcvs-markup Can this be included in the example scripts too? Thanks and regards, Animesh -- ______________________"The Answer Lies in Genome"______________________ http://fuzzylife.org/animesh/ +919868580004 -------------- next part -------------- A non-text attachment was scrubbed...
Name: pdb_chain_extract.pl Type: application/octet-stream Size: 2593 bytes Desc: not available Url : http://lists.open-bio.org/pipermail/bioperl-l/attachments/20060718/9e98ece2/attachment.obj From bix at sendu.me.uk Tue Jul 18 09:20:34 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Tue, 18 Jul 2006 14:20:34 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BCAE08.8070307@ebi.ac.uk> References: <44BCAE08.8070307@ebi.ac.uk> Message-ID: <44BCE022.5000502@sendu.me.uk> I thought I'd post this here in case anyone wants to discuss the points Nadeem brings up. As far as I can see it is acceptable to remove the <> bits so I still plan to do so. Nadeem Faruque wrote: [off-list, posted here with permission] > In case you didn't realise, odd node names such as 'Gnathostomata > ' are created to uniquify some tax nodes that have identical > scientific names, eg there are 8 entries for Rhodotorula. > > When we parse the ncbi tax dump we store this column as UNIQUE_NAME but > I don't think that we actually use it for anything within the EMBL > nucleotide sequence bank. [...] > Also, I note that there are 548 non-unique NAME_TXT of class 'scientific > name', so the UNIQUE_NAME column may be of use to someone (though given > the strength of using a taxid directly I don't see why you'd want to). Indeed. And given that we are building a taxonomy with nodes, it doesn't matter that two different nodes in the entire taxonomy tree share the same name - the position in the tree is implicitly unique. So if you find yourself with a node called 'Rhodotorula' you can find out which one it is by looking at the closest ranked parent. That said, for 'Rhodotorula ' the closest ranked parent is 'Sporidiobolales' and not 'Sporidiobolaceae'. Is that a problem? Do we need to care about this word 'Sporidiobolaceae' that is effectively just a synonym of 'Sporidiobolales'? [Nadeem later replied "...I can't imagine the <> value to be of any use.".
He also clarified that if species have identical names and you store those, you can't work out what the corresponding taxid is. Without the <> bit you need some other information, like the classification. I think this other information will be present in input file formats and it must be up to the user to store the extra information when outputting from bioperl.] From osborne1 at optonline.net Tue Jul 18 10:50:48 2006 From: osborne1 at optonline.net (Brian Osborne) Date: Tue, 18 Jul 2006 10:50:48 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BC9F3F.2040500@sendu.me.uk> Message-ID: Sendu, The idea to create mini *dmp files is a good one, I think. With respect to temporary files I'm fairly sure that most tests that use them create them somewhere in t/data and then delete them afterwards. Brian O. On 7/18/06 4:43 AM, "Sendu Bala" wrote: > (~40MB) in t/data... do people think it would be reasonable to create > some sort of small subset of the data? I could just pull out the lines > from names.dmp and nodes.dmp relevant to a few example organisms. Say, > for human and a tricky bacteria and virus? > For the purposes of running the test, where should the index files be > kept? In t/data with the .dmp files or in /tmp? Should the test script > delete them afterwards, or leave them be? From cjfields at uiuc.edu Tue Jul 18 11:44:07 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 18 Jul 2006 10:44:07 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BC8D75.1080806@sendu.me.uk> Message-ID: <003201c6aa81$01db9a30$15327e82@pyrimidine> > What about the existing genus(), species(), sub_species() and variant() > methods? There would be no need for any logic to join things together, > but I would still like to be able to get just 'sapiens' from somewhere. > Can I use species() for that purpose (though again, species is strictly > 'Homo sapiens')? Likewise sub_species() and variant() could hold the > remaining non-redundant names.
Or should all of these be deprecated > because they don't really have a place in a generic Node class? This is where Hilmar suggests that you have a bit of freedom in doing what you want, as with binomial(). So species() should return species ('sapiens'), genus return genus, etc. At that level there will need to be some additional data munging since the ranks below species seem to include the entire name, not just the species. But this could be done from the lineage if all nodes are present and tagged as such. > What about node_name()? Yet another synonym of scientific_name? (right > now it grabs the common name(s)). Ugh. I agree things need cleaning up. You could always make node_name() an alias for scientific_name() though it could just be deprecated. > What should I do with the classification array? Should it hold the raw > ScientificName like: > join(',', $node->classification) eq 'Homo sapiens, Homo, > Homo/Pan/Gorilla group [...]'? > Or should it be like: > join(',', $node->classification) eq 'sapiens, Homo, Homo/Pan/Gorilla > group [...]'? Don't know what the dump file gives; the XML output using efetch via entrez has the raw lineage (as appears in a GenBank sequence file) and the actual full lineage with TaxID, rank, 'scientific name,' in the actual lineage order. I think one problem area will be the 'no rank' designations in the lineage. Note that the below example also has a species and no genus; tricky! 312284 marine actinobacterium PHSC20C1 marine actinobacterium strain PHSC20C1 marine actinobacterium str. PHSC20C1 78537 species Bacteria ... 
cellular organisms; Bacteria; Actinobacteria; Actinobacteria (class); unclassified Actinobacteria; unclassified Actinobacteria (miscellaneous)
131567  cellular organisms  no rank
2  Bacteria  superkingdom
201174  Actinobacteria  phylum
1760  Actinobacteria (class)  class
52018  unclassified Actinobacteria  no rank
78537  unclassified Actinobacteria (miscellaneous)  no rank
> The latter is how it currently works (when it works correctly); I would > rather fix it than lose the logic completely, but if we're staying true > to proper classification (vs. what a programmer might expect), I guess I > must use the raw ScientificName? > > > binomial() isn't part of the NCBI taxonomy definition, so you have > > freedom there to report what suits you. > > I don't think binomial() would serve any useful purpose now, however. I > can either deprecate it or make it a synonym of scientific_name() or > both. Or binomial() can be a version of scientific_name() that complains > if you use it on a rank higher or lower than species. As for species() > et al., it may have no place in a generic Node class. Thoughts? The use of scientific_name() in this context would be more to conform with what NCBI defines it as than with the actual definition; this should be explicitly stated as such in POD and is more for long-term maintainability. No matter what is done here, you will have some degree of confusion: those who want strict adherence to the term 'scientific name' and those who want the method to conform to NCBI's definition. Better to document the reasoning for it in some way than risk the random masses complaining. We could use binomial() for the 'scientific name' as the rest of the world knows it (as in binomial nomenclature), having it built from genus-species like you had originally suggested. That's what Hilmar suggested as an 'experimental' area of sorts, since NCBI doesn't use that particular term in its taxonomy definition.
Chris From cjfields at uiuc.edu Tue Jul 18 11:48:36 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 18 Jul 2006 10:48:36 -0500 Subject: [Bioperl-l] advice In-Reply-To: <44BC5B1E.5080600@infotech.monash.edu.au> Message-ID: <003301c6aa81$a34fd8e0$15327e82@pyrimidine> Guess Dr. Conway became a Perl convert. The reviews of the book state that the 'best practices' really come from his experience as a Perl programmer over the last couple of decades, so maybe he learned something since 1990. Chris > > Ha ! I *almost* added something about that. I thought his vowel keys > were > > broken for a bit, maybe from pounding the keyboard with extreme > frustration! > > The wide variety of pronunciation of English around the world can be > mostly blamed on those damned vowels... so perhaps removing them helps > one to reach a wider audience :-) > > > As an aside, doesn't Damian Conway say something about the non-use of > vowels > > in 'Perl Best Practices?' I think it was in relation to variables, > > though... > > Yeah, on page 46 he says NOT to remove vowels in variable names, use > prefixes instead. Apprntly rmvng vwls mks t hrdr t ndrstnd stff. > > (Actually, I studied at Monash University under Damian Conway, and > recall his ridiculing of Perl, so I found it kind of ironic that he > ended up changing the Perl landscape so significantly! 
He even wrote an > internal publication "theStyle - a guide to C programming style" in > about 1990 in which he violates some of his later Perl Best Practices :-) > > -- > Dr Torsten Seemann http://www.vicbioinformatics.com > Victorian Bioinformatics Consortium, Monash University, Australia From cjfields at uiuc.edu Tue Jul 18 12:05:48 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 18 Jul 2006 11:05:48 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BC9F3F.2040500@sendu.me.uk> Message-ID: <003401c6aa84$08ff6c80$15327e82@pyrimidine> > I've just realised that there are currently no tests for > Bio::DB::Taxonomy::flatfile, and that the ones for entrez get skipped. > Node doesn't get an especially thorough work-out either (in the skipped > section). > > I'm guessing it's not feasible to include the full taxdump from NCBI > (~40MB) in t/data... do people think it would be reasonable to create > some sort of small subset of the data? I could just pull out the lines > from names.dmp and nodes.dmp relevant to a few example organisms. Say, > for human and a tricky bacteria and virus? > For the purposes of running the test, where should the index files be > kept? In t/data with the .dmp files or in /tmp? Should the test script > delete them afterwards, or leave them be? I would place a small section in t/data or several individual examples in a subdirectory thereof (t/data/taxonomy). > The entrez tests are skipped to 'avoid blocking', but the test only > makes 2 entrez queries with a sleep(3) in-between. Basically, I don't > think there's ever any reason to skip. Shall I remove the skip? Lots of > other database-accessing tests in the test suite just go right ahead and > access their database, no problem.
Depends on whether there is someone out there who doesn't have a network connection (and there always is). The DB.t tests skip based on testing for the environment variable BIOPERLDEBUG:
1..121
ok 1 # Skipping tests which require remote servers - set env variable BIOPERLDEBUG to test
You could always do something along those lines, or add a test for a network connection using an eval block and skip the tests if the network test fails. But there you run the risk of the tests failing not because of code problems but because of remote server issues; I've seen this happen with SwissProt and GenBank testing before during peak hours. Chris > Cheers, > Sendu. From bix at sendu.me.uk Tue Jul 18 13:03:54 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Tue, 18 Jul 2006 18:03:54 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <003201c6aa81$01db9a30$15327e82@pyrimidine> References: <003201c6aa81$01db9a30$15327e82@pyrimidine> Message-ID: <44BD147A.9020103@sendu.me.uk> Chris Fields wrote: >> What about the existing genus(), species(), sub_species() and variant() >> methods? There would be no need for any logic to join things together, >> but I would still like to be able to get just 'sapiens' from somewhere. >> Can I use species() for that purpose (though again, species is strictly >> 'Homo sapiens')? Likewise sub_species() and variant() could hold the >> remaining non-redundant names. Or should all of these be deprecated >> because they don't really have a place in a generic Node class? > > This is where Hilmar suggests that you have a bit of freedom in doing what > you want, as with binomial(). So species() should return species > ('sapiens'), genus return genus, etc.
[regarding changes to Bio::Taxonomy::Node] Actually, I'm really strongly leaning toward getting rid of the following methods and new() options (and giving up entirely on being able to keep 'sapiens' somewhere): -organelle, organelle() -division, division() -sub_species, sub_species() -variant, variant() species(), validate_species_name() genus() binomial() As far as I can see none of these methods have any place in a generic Node class. If you want to know what your species is you have to be rank() 'species' and you just call scientific_name(). The above kind of methods belong in something like Bio::Species or similar, NOT in Node. Does anyone disagree? Can anyone offer a justification for keeping these methods? Changes I haven't yet discussed but have already made (but not committed): *parent_taxon_id = \&parent_id; *common_name = \&common_names; -factory and factory() removed, since there is no Bio::Taxonomy::FactoryI-implementing module, nothing in Node to make use of a factory once set, and a factory seems redundant when we're a node with a -dbh. validate_name() removed because it just returns 1. >> What about node_name()? Yet another synonym of scientific_name? (right >> now it grabs the common name(s)). Ugh. > > I agree things need cleaning up. You could always make node_name() an alias > for scientific_name() though it could just be deprecated. Actually, I've gone with node_name as the 'pure' and best method to set the name of your node with, and made scientific_name an alias of it (though it behaves as suggested earlier in the thread). >> What should I do with the classification array? Should it hold the raw >> ScientificName like: >> join(',', $node->classification) eq 'Homo sapiens, Homo, >> Homo/Pan/Gorilla group [...]'? (I've decided to do it the above way for consistency with scientific_name) >> Or should it be like: >> join(',', $node->classification) eq 'sapiens, Homo, Homo/Pan/Gorilla >> group [...]'? 
> > Don't know what the dump file gives; the XML output using efetch via entrez > has the raw lineage (as appears in a GenBank sequence file) and the actual > full lineage with TaxID, rank, 'scientific name,' in the actual lineage > order. I think one problem area will be the 'no rank' designations in the > lineage. Note that the below example also has a species and no genus; > tricky! Currently, flatfile and entrez ignore nodes with a rank of 'no rank' when they build the classification array. I had no intention of changing this behaviour. > 1760 > Actinobacteria (class) > class Ugh. I guess my proposal to remove <> bits via flatfile extends to removing () bits via entrez. We don't need unique names; we can use object_id() when uniqueness matters. >> I don't think binomial() would serve any useful purpose now, however. > > We could use binomial() for the 'scientific name' as the rest of the world > knows it (as in binomial nomenclature), having it built from genus-species > like you had originally suggested. No, see above. I don't think it makes the slightest bit of sense for a Node to go around trying to build things from a parent it may or may not have. Again, binomial() is a method for something like Bio::Species, not a generic Node class. From cjfields at uiuc.edu Tue Jul 18 15:34:29 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 18 Jul 2006 14:34:29 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BD147A.9020103@sendu.me.uk> Message-ID: <000901c6aaa1$328dd3d0$15327e82@pyrimidine> ... 
> [regarding changes to Bio::Taxonomy::Node] > > Actually, I'm really strongly leaning toward getting rid of the > following methods and new() options (and giving up entirely on being > able to keep 'sapiens' somewhere): > > -organelle, organelle() > -division, division() > -sub_species, sub_species() > -variant, variant() > species(), validate_species_name() > genus() > binomial() > > As far as I can see none of these methods have any place in a generic > Node class. If you want to know what your species is you have to be > rank() 'species' and you just call scientific_name(). The above kind of > methods belong in something like Bio::Species or similar, NOT in Node. > Does anyone disagree? Can anyone offer a justification for keeping these > methods? Bio::Species and Bio::Taxonomy::Node are closely linked and plans are to have Bio::Species delegate methods to Bio::Taxonomy::Node. So any changes to Node will affect Bio::Species to some degree. If you can get the lineage from XML, you could set many of these based on the rank given. Jason uses XML::Twig in Bio::DB::Taxonomy::entrez to parse out the XML data into Bio::Taxonomy::Node objects; it shouldn't be difficult to leave some methods based on rank (genus, species, etc) as simple get/set methods for the time being and leave the heavy lifting to the modules dealing directly with the data. Bio::Species could then delegate data/methods over to Bio::Taxonomy::Node fairly easily. If there is no genus/species data to be grabbed (either it doesn't exist or isn't present for some reason), then simply leave it as undef. That's also why I thought binomial() could stick around; if you have both the genus() and species() you could grab both using binomial(), building in special cases or error handling in case genus() or species() or both return undef. I don't see the problem in keeping this as long as users know what it means: by detailing the method in POD. If someone complains we tell them to RTFM. 
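
The binomial() behaviour described above (build the name from genus() and species(), and cope with either being undef) could look roughly like the sketch below. It is only an illustration under stated assumptions: ToyNode is a hypothetical standalone class, not Bio::Taxonomy::Node or Bio::Species, genus() and species() are assumed to be plain get/set methods, and species() is assumed to hold just the epithet ('sapiens').

```perl
use strict;
use warnings;

package ToyNode;   # hypothetical stand-in, NOT the real Bio::Taxonomy::Node

sub new {
    my ($class, %args) = @_;
    return bless { genus => $args{-genus}, species => $args{-species} }, $class;
}

# Simple get/set accessors; either value may legitimately stay undef.
sub genus {
    my $self = shift;
    $self->{genus} = shift if @_;
    return $self->{genus};
}

sub species {
    my $self = shift;
    $self->{species} = shift if @_;
    return $self->{species};
}

# Return 'Genus species' only when both parts are known, otherwise undef.
sub binomial {
    my $self = shift;
    my ($genus, $species) = ($self->genus, $self->species);
    return undef unless defined $genus && defined $species;
    return "$genus $species";
}

package main;

my $human   = ToyNode->new(-genus => 'Homo', -species => 'sapiens');
my $unknown = ToyNode->new(-species => 'marine actinobacterium PHSC20C1');

print $human->binomial, "\n";
print defined $unknown->binomial ? "has a binomial\n" : "no binomial available\n";
```

Returning undef rather than throwing keeps the no-genus case (like the marine actinobacterium above) from being an error; whether a warning would be more appropriate is a design choice the thread leaves open.
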
> Changes I haven't yet discussed but have already made (but not committed): > > *parent_taxon_id = \&parent_id; > *common_name = \&common_names; > -factory and factory() removed, since there is no > Bio::Taxonomy::FactoryI-implementing module, nothing in Node to make use > of a factory once set, and a factory seems redundant when we're a node > with a -dbh. > validate_name() removed because it just returns 1. > ... > Actually, I've gone with node_name as the 'pure' and best method to set > the name of your node with, and made scientific_name an alias of it > (though it behaves as suggested earlier in the thread). I don't have any problem with that. As long as it conforms somewhat to the NCBI definition to prevent confusion I think it's okay. > >> What should I do with the classification array? Should it hold the raw > >> ScientificName like: > >> join(',', $node->classification) eq 'Homo sapiens, Homo, > >> Homo/Pan/Gorilla group [...]'? > > (I've decided to do it the above way for consistency with scientific_name) I think that's fine. ... > Currently, flatfile and entrez ignore nodes with a rank of 'no rank' > when they build the classification array. I had no intention of changing > this behaviour. If you ignore nodes with 'no rank' there will be major problems when retrieving certain TaxID's from protein/nucleotide sequences. I had posted some sample XML from many NCBI TaxIDs taken from sequence files and via ELink and a good many of those nodes (most of them from genome projects) have 'no rank'. 376686 Flavobacterium johnsoniae UW101 ... 986 no rank ... 373903 Halothermothrix orenii H 168 ... 31909 no rank These aren't 'edge cases' anymore but now are pretty common from genome sequencing. I would just assign 'no rank' to rank() and have the node retained for DB purposes. It seems that the tax dump loses quite a bit of information somewhere along the way that shows up in the XML. Or am I wrong? > > 1760 > > Actinobacteria (class) > > class > > Ugh. 
I guess my proposal to remove <> bits via flatfile extends to > removing () bits via entrez. We don't need unique names; we can use > object_id() when uniqueness matters. The XML parsing in Taxonomy::entrez will take care of the and retains the character data in between. It would be a matter of setting the parser correctly to grab the relevant data and assign it properly. > >> I don't think binomial() would serve any useful purpose now, however. > > > > We could use binomial() for the 'scientific name' as the rest of the > world > > knows it (as in binomial nomenclature), having it built from genus- > species > > like you had originally suggested. > > No, see above. I don't think it makes the slightest bit of sense for a > Node to go around trying to build things from a parent it may or may not > have. Again, binomial() is a method for something like Bio::Species, not > a generic Node class. Bio::Species, from what I gather, was initially created to hold the tax data from GenBank/EMBL/SwissProt (RichSeq) files and is not DB-aware. Bio::Taxonomy::Node was supposed to be like Bio::Species and also be DB-aware: http://thread.gmane.org/gmane.comp.lang.perl.bio.general/4284/focus=4321 Again, Bio::Species methods are supposed to (eventually) delegate to Bio::Taxonomy::Node, so the two are closely linked along with their methods. Any way we go about it here (keeping certain methods and tossing others, changing the data returned, etc), it looks like there will be API issues down the road which will directly affect anyone using tax data. That affects bioperl-db directly as well as any other bioperl-based DB's which rely on tax data. So we need to tread a bit carefully when making major changes to make sure that they work for bioperl-db and anywhere else that may require it. 
Chris From cjfields at uiuc.edu Tue Jul 18 15:41:31 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 18 Jul 2006 14:41:31 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BD147A.9020103@sendu.me.uk> Message-ID: <000a01c6aaa2$2b4f50c0$15327e82@pyrimidine> Sendu et al, I'll play around with adding a quick method to Bio::Species for scientific_name(); if I can get it to play nice with Bio::SeqIO::genbank and it passes tests I'll commit it. Chris From golharam at umdnj.edu Tue Jul 18 15:36:54 2006 From: golharam at umdnj.edu (Ryan Golhar) Date: Tue, 18 Jul 2006 15:36:54 -0400 Subject: [Bioperl-l] advice In-Reply-To: <003301c6aa81$a34fd8e0$15327e82@pyrimidine> Message-ID: <00a501c6aaa1$86edb620$2f01a8c0@GOLHARMOBILE1> Right. There was a chain letter going around the internet for awhile about how you can leave out certain letters and the human brain will still be able to correctly interpret what the word is supposed to be. Either that or it was something about how Europe was adopting a new variation of English and after many successions it started to sound/look like German. > The wide variety of pronunciation of English around the world can be > mostly blamed on those damned vowels... so perhaps removing them helps > one to reach a wider audience :-) > > > As an aside, doesn't Damian Conway say something about the non-use > > of > vowels > > in 'Perl Best Practices?' I think it was in relation to variables, > > though... > > Yeah, on page 46 he says NOT to remove vowels in variable names, use > prefixes instead. Apprntly rmvng vwls mks t hrdr t ndrstnd stff. 
From cjfields at uiuc.edu Tue Jul 18 17:44:29 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 18 Jul 2006 16:44:29 -0500 Subject: [Bioperl-l] Bio::SeqIO::genbank and Bio::Species Message-ID: <000001c6aab3$58ee7bd0$15327e82@pyrimidine> For a given GenBank file, you'll have the following (this is from NCBI's current flatfile format, http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html):
LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p
            (AXL2) and Rev7p (REV7) genes, complete cds.
ACCESSION   U49845
VERSION     U49845.1  GI:1293613
KEYWORDS    .
SOURCE      Saccharomyces cerevisiae (baker's yeast)
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes;
            Saccharomycetales; Saccharomycetaceae; Saccharomyces.
...
The SOURCE line above, according to NCBI, contains an abbreviated name and a common name (optional); it can also apparently contain additional information, such as organelles and so on. The ORGANISM line contains NCBI's definition of the formal scientific name (see the related thread on Taxonomy proposed changes) along with lineage information. Currently, Bio::SeqIO::genbank and Bio::Species are very inconsistent with bacterial names, so when I process everything through SeqIO I get:
SOURCE      Mycobacterium tuberculosis H37Rv H37Rv
  ORGANISM  Mycobacterium tuberculosis
SOURCE      Mycobacterium tuberculosis CDC1551 CDC1551
  ORGANISM  Mycobacterium tuberculosis
SOURCE      Mycobacterium avium subsp. paratuberculosis K-10 paratuberculosis K-10
  ORGANISM  Mycobacterium avium subsp.
SOURCE      Bacillus sp. NRRL B-14911 NRRL B-14911
  ORGANISM  Bacillus sp.
I have added a scientific_name() method to Bio::Species to contain the string on the ORGANISM line and write it back as is, which seems to work well (doesn't chop the name down). The bigger issue is the mess with the SOURCE line.
This stems from adding back information from sub_species(), which I don't think needs to be done as it's supposed to be an abbreviated name. Anybody mind if I try splitting up the original SOURCE line data into organelle(), abbreviated_name(), and common_name()? This will change common_name a bit (so, instead of 'Saccharomyces cerevisiae' it will give 'baker's yeast') but will also conform more to the NCBI definition of 'common name.' Also, organelle info isn't handled yet; I could toy with adding support for it. Any objections? I may proceed to do the same with EMBL, SwissProt, and others that use Bio::Species if this works out. Christopher Fields Postdoctoral Researcher - Switzer Lab Dept. of Biochemistry University of Illinois Urbana-Champaign From bix at sendu.me.uk Tue Jul 18 18:50:37 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Tue, 18 Jul 2006 23:50:37 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <000901c6aaa1$328dd3d0$15327e82@pyrimidine> References: <000901c6aaa1$328dd3d0$15327e82@pyrimidine> Message-ID: <44BD65BD.4030501@sendu.me.uk> Chris Fields wrote: > ... >> [regarding changes to Bio::Taxonomy::Node] >> >> Actually, I'm really strongly leaning toward getting rid of the >> following methods and new() options (and giving up entirely on being >> able to keep 'sapiens' somewhere): >> >> -organelle, organelle() >> -division, division() >> -sub_species, sub_species() >> -variant, variant() >> species(), validate_species_name() >> genus() >> binomial() > > Bio::Species and Bio::Taxonomy::Node are closely linked and plans are to > have Bio::Species delegate methods to Bio::Taxonomy::Node. So any changes > to Node will affect Bio::Species to some degree. I see from the original postings that Node was intended to be like Species, but I don't think it makes the slightest bit of sense. A /single/ Node need only (must only!) represent the information for a single node in the taxonomy. Or else what do these objects mean?
What is the object model? It's bad bad bad for it to be sensible one way (when you're making your own taxonomy by making your own nodes) and nonsensical another (when we stuff in methods so that Bio::Species is happy). The way Node is written right now, and what you're suggesting, is that we stuff the entire Taxonomy into the Node. Well, except that you don't even have methods for every taxonomic level - there is genus() but no subphylum(). I can't emphasise strongly enough how insane all this is. The correct thing for Bio::Species to interact with is Bio::Taxonomy. Bio::Taxonomy is a collection of Nodes and has the sort of methods that Bio::Species would need to delegate its current functionality. I'm quite willing to do a proper overhaul here so everything makes sense. You either make your own nodes and add these to a Taxonomy or use a factory (which would use a Bio::DB::Taxonomy presumably). A Taxonomy lets you discover the classification of any node it contains. Bio::Species could implement a method like genus() by: $node = $taxonomy->get_node('genus') || return; return $node->scientific_name; Bio::Taxonomy isn't perfect, but I can certainly get it to do its job. I'd probably make it rank-name and order independent for starters. Bio::Taxonomy::Node needs to be reduced right down to just hold data about the node it represents, and possibly its parent node id (or other way of getting to its parent). So now I'm proposing dropping the classification() method from Node as well. It's simply not necessary; Bio::Taxonomy should give you that information. Bio::Taxonomy::FactoryI doesn't make much sense to me at the moment from its docs, but it could be used to build a Taxonomy (that seems to be its intent, I'm just not sure what some of the methods are really supposed to do) such that Node might not even need any methods for getting its parent or child nodes. The Factory or Taxonomy might be able to deal with that.
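As a rough illustration of the split proposed above, here is a minimal sketch in plain Perl. The packages My::Node and My::Taxonomy and all of their methods are hypothetical stand-ins invented for this example; they are not the real Bio::Taxonomy or Bio::Taxonomy::Node API. Each node holds only its own data, and the taxonomy is the thing that answers rank questions.

```perl
use strict;
use warnings;

# Hypothetical minimal node: holds only data about itself.
package My::Node;
sub new {
    my ($class, %args) = @_;
    return bless { %args }, $class;
}
sub rank            { $_[0]{rank} }
sub scientific_name { $_[0]{scientific_name} }

# Hypothetical taxonomy: a collection of nodes that answers rank lookups.
package My::Taxonomy;
sub new { return bless { nodes => [] }, shift }
sub add_node {
    my ($self, $node) = @_;
    push @{ $self->{nodes} }, $node;
    return $self;
}
sub get_node {
    my ($self, $rank) = @_;
    for my $node (@{ $self->{nodes} }) {
        return $node if $node->rank eq $rank;
    }
    return;    # no node with that rank
}
# Same pattern as the genus() example above, but for any rank name.
sub name_at_rank {
    my ($self, $rank) = @_;
    my $node = $self->get_node($rank) || return;
    return $node->scientific_name;
}

package main;

my $taxonomy = My::Taxonomy->new;
$taxonomy->add_node(My::Node->new(rank => 'genus',   scientific_name => 'Homo'));
$taxonomy->add_node(My::Node->new(rank => 'species', scientific_name => 'Homo sapiens'));

print $taxonomy->name_at_rank('genus'),   "\n";
print $taxonomy->name_at_rank('species'), "\n";

# A rank with no node simply yields undef; no per-rank method is needed.
my $subphylum = $taxonomy->name_at_rank('subphylum');
print defined $subphylum ? $subphylum : '(no subphylum node)', "\n";
```

Because the lookup is by rank name, nothing in the node class needs to know which ranks exist, which is the point of keeping Node "just a node".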
In short, I'm proposing a major change to Bio::Taxonomy::Node (make it just a node), and minor changes to (& implementation of) Bio::Taxonomy and Bio::Taxonomy::FactoryI such that they actually get used to do their jobs. > That's also why I thought binomial() could stick around; if you have both > the genus() and species() you could grab both using binomial(), building in > special cases or error handling in case genus() or species() or both return > undef. binomial() would belong in (and is present in) Bio::Taxonomy. But in any case, it's not needed there either; if you want the binomial you just ask for the scientific_name of the species node in your Taxonomy, since this now contains the actual scientific name == binomial. binomial() in Bio::Taxonomy could be reimplemented as: $node = $self->get_node('species') || return; return $node->scientific_name; >> Currently, flatfile and entrez ignore nodes with a rank of 'no rank' >> when they build the classification array. I had no intention of changing >> this behaviour. > > If you ignore nodes with 'no rank' there will be major problems when > retrieving certain TaxID's from protein/nucleotide sequences. This is only for the classification array, which is meaningless anyway (there only for file-format compatibility). If you want the real information you ask your Bio::Taxonomy (which asks each of its nodes). This is the whole point of having Bio::Taxonomy in the first place. It gives you great flexibility to do whatever you want to do. >>> 1760 >>> Actinobacteria (class) >>> class >> Ugh. I guess my proposal to remove <> bits via flatfile extends to >> removing () bits via entrez. We don't need unique names; we can use >> object_id() when uniqueness matters. > > The XML parsing in Taxonomy::entrez will take care of the and retains > the character data in between. You misunderstood. I meant the <> bits I discussed at the very start of this thread, that flatfile gives you. 
Here I'm referring to getting rid of ' (class)' as well. > Any way we go about it here (keeping certain methods and tossing others, > changing the data returned, etc), it looks like there will be API issues > down the road which will directly affect anyone using tax data. That > affects bioperl-db directly as well as any other bioperl-based DB's which > rely on tax data. So we need to tread a bit carefully when making major > changes to make sure that they work for bioperl-db and anywhere else that > may require it. Does anything make serious use of the current Bio::Taxonomy code? Or are they using Bio::Species? From cjfields at uiuc.edu Wed Jul 19 00:38:05 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 18 Jul 2006 23:38:05 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BD65BD.4030501@sendu.me.uk> References: <000901c6aaa1$328dd3d0$15327e82@pyrimidine> <44BD65BD.4030501@sendu.me.uk> Message-ID: I think we should wait a bit for any dramatic changes but implement the ones there seems to be a consensus on. I understand your reasoning for taking this on but I'm not sure completely revamping Bio::Taxonomy w/o input from the core developers is wise, especially since we do NOT know who uses it, why they use it, and how changing/removing methods will affect their code. We are doing nothing productive here by constantly butting heads on this and having different opinions on what we think Bio::Taxonomy/Bio::Species is best suited for, when neither one of us is actually sure about who uses it and why. A reasonable solution is there but we must rely on outside opinions in order to reach it, so I propose a short moratorium on changes to Bio::Taxonomy/Bio::Species that radically redefine the API on either class. BTW, for anybody following, I'm perfectly comfortable if Sendu takes the lead on this and implements his changes; I'm just not sure about stripping the class down to the bare minimum.
So far, the only thing that has been proposed (and accepted by all) is that scientific_name() hold the data for that tag in a node. I think most here would agree that's fine; I've already added a get/set to Bio::Species but haven't committed it yet. However, what you propose doing below is refactoring the code and changing the API. I agree there needs to be an overhaul but we can't do this w/o guidance or input from the GBE (Great Bioperl Elders). I would like some of the 'senior' core developers to chime in a bit more on their thoughts on this. Jason also mentioned somewhere that any changes for Taxonomy/Species should be tracked on the wiki somewhere as well to make sure everything is kosher and keep users up-to-date. I would like his input here but I think he's still incommunicado at the moment. Chris On Jul 18, 2006, at 5:50 PM, Sendu Bala wrote: > Chris Fields wrote: >> ... >>> [regarding changes to Bio::Taxonomy::Node] >>> >>> Actually, I'm really strongly leaning toward getting rid of the >>> following methods and new() options (and giving up entirely on being >>> able to keep 'sapiens' somewhere): >>> >>> -organelle, organelle() >>> -division, division() >>> -sub_species, sub_species() >>> -variant, variant() >>> species(), validate_species_name() >>> genus() >>> binomial() >> >> Bio::Species and Bio::Taxonomy::Node are closely linked and plans >> are to >> have Bio::Species delegate methods to Bio::Taxonomy::Node. So any >> changes >> to Node will affect Bio::Species to some degree. > > I see from the original postings that Node was intended to be like > Species, but I don't think it makes the slightest bit of sense. A > /single/ Node need only (must only!) represent the information for a > single node in the taxonomy. Or else what do these objects mean? > What is > the object model?
It's bad bad bad for it to be sensible one way (when > you're making your own taxonomy by making your own nodes) and > nonsensical another (when we stuff in methods so that Bio::Species is > happy). The way Node is written right now, and what you're suggesting, > is that we stuff the entire Taxonomy into the Node. Well, except that > you don't even have methods for every taxonomic level - there is > genus() > but no subphylum(). I can't emphasise strongly enough how insane all > this is. > > The correct thing for Bio::Species to interact with is Bio::Taxonomy. > Bio::Taxonomy is a collection of Nodes and has the sort of methods > that > Bio::Species would need to delegate its current functionality. > > I'm quite willing to do a proper overhaul here so everything makes > sense. You either make your own nodes and add these to a Taxonomy > or use > a factory (which would use a Bio::DB::Taxonomy presumably). A Taxonomy > lets you discover the classification of any node it contains. > Bio::Species could implement a method like genus() by: > $node = $taxonomy->get_node('genus') || return; > return $node->scientific_name; > > Bio::Taxonomy isn't perfect, but I can certainly get it to do its job. > I'd probably make it rank-name and order independent for starters. > > Bio::Taxonomy::Node needs to be reduced right down to just hold data > about the node it represents, and possibly its parent node id (or > other > way of getting to its parent). So now I'm proposing dropping the > classification() method from Node as well. It's simply not necessary; > Bio::Taxonomy should give you that information. > > Bio::Taxnomoy::FactoryI doesn't make much sense to me at the moment > from > its docs, but it could be used to build a Taxonomy (that seems to > be its > intent, I'm just not sure what some of the methods are really supposed > to do) such that Node might not even need any methods for getting its > parent or child nodes. The Factory or Taxonomy might be able to deal > with that. 
> > In short, I'm proposing a major change to Bio::Taxonomy::Node (make it > just a node), and minor changes to (& implementation of) Bio::Taxonomy > and Bio::Taxonomy::FactoryI such that they actually get used to do > their > jobs. > > >> That's also why I thought binomial() could stick around; if you >> have both >> the genus() and species() you could grab both using binomial(), >> building in >> special cases or error handling in case genus() or species() or >> both return >> undef. > > binomial() would belong in (and is present in) Bio::Taxonomy. But > in any > case, it's not needed there either; if you want the binomial you just > ask for the scientific_name of the species node in your Taxonomy, > since > this now contains the actual scientific name == binomial. > > binomial() in Bio::Taxonomy could be reimplemented as: > $node = $self->get_node('species') || return; > return $node->scientific_name; > > >>> Currently, flatfile and entrez ignore nodes with a rank of 'no rank' >>> when they build the classification array. I had no intention of >>> changing >>> this behaviour. >> >> If you ignore nodes with 'no rank' there will be major problems when >> retrieving certain TaxID's from protein/nucleotide sequences. > > This is only for the classification array, which is meaningless anyway > (there only for file-format compatibility). If you want the real > information you ask your Bio::Taxonomy (which asks each of its nodes). > This is the whole point of having Bio::Taxonomy in the first place. > > It gives you great flexibility to do whatever you want to do. > > >>>> 1760 >>>> Actinobacteria (class) >>>> class >>> Ugh. I guess my proposal to remove <> bits via flatfile extends to >>> removing () bits via entrez. We don't need unique names; we can use >>> object_id() when uniqueness matters. >> >> The XML parsing in Taxonomy::entrez will take care of the >> and retains >> the character data in between. > > You misunderstood. 
I meant the <> bits I discussed at the very > start of > this thread, that flatfile gives you. Here I'm referring to getting > rid > of ' (class)' as well. > > >> Any way we go about it here (keeping certain methods and tossing >> others, >> changing the data returned, etc), it looks like there will be API >> issues >> down the road which will directly affect anyone using tax data. That >> affects bioperl-db directly as well as any other bioperl-based >> DB's which >> rely on tax data. So we need to tread a bit carefully when making >> major >> changes to make sure that they work for bioperl-db and anywhere >> else that >> may require it. > > Does anything make serious use of the current Bio::Taxonomy code? > Or are > they using Bio::Species? > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From ong at embl.de Wed Jul 19 03:51:48 2006 From: ong at embl.de (ong at embl.de) Date: Wed, 19 Jul 2006 09:51:48 +0200 Subject: [Bioperl-l] Fwd: Re: BioPerl query Message-ID: <20060719095148.f71b1v3p7qosk440@webmail.embl.de> HI, Anybody have an answer to the below query? Thanks. Regards, Ong ----- Forwarded message from birney at ebi.ac.uk ----- Date: Wed, 19 Jul 2006 08:16:06 +0100 From: Ewan Birney Reply-To: Ewan Birney Subject: Re: BioPerl query To: ong at embl.de On 18 Jul 2006, at 10:26, ong at embl.de wrote: > Dear Birney, > > Good day i wish to get your advise on how do i print out the PSM > matrix from > the code below. Thanks > I would ask this message on the bioperl list, not to me directly. 
> Regards, > Ong > > use Bio::Matrix::PSM::IO; > > my $psmIO=new Bio::Matrix::PSM::IO(-file=>'matrix.dat',- > format=>'transfac'); > while (my $psm=$psmIO->next_psm) { > my $id=$psm->id; > my $an=$psm->accession_number; > my $re = $psm->regexp; > #my $l=$psm->width; > my $cons=$psm->IUPAC; > print"$id\t$an\t$re\t$l\t$cons\t$psm\n"; > } ----- End forwarded message ----- From rmb32 at cornell.edu Tue Jul 18 20:06:02 2006 From: rmb32 at cornell.edu (Robert Buels) Date: Tue, 18 Jul 2006 17:06:02 -0700 Subject: [Bioperl-l] bioperl pulls, xml parsers push, and things get complicated Message-ID: <44BD776A.1080402@cornell.edu> Hi all, Here's a kind of abstract question about Bioperl and XML parsing: I'm thinking about writing a bioperl parser for genomethreader XML, and I'm sort of mulling over the 'impedence mismatch' between the way bioperl Bio::*IO::* modules work and the way all of the current XML parsers work. Bioperl uses a 'pull' model, where every time you want a new chunk of stuff, you call $io_object->next_thing. All the XML parsers (including XML::SAX, XML::Parser::PerlSAX and XML::Twig) use a 'push' model, where every time they parse a chunk, they call _your_ code, usually via a subroutine reference you've given to the XML parser when you start it up. From what I can tell, current Bioperl IO modules that parse XML are using push parsers to parse the whole document, holding stuff in memory, then spoon-feeding it in chunks to the calling program when it calls next_*(). This is fine until the input XML gets really big, in which case you can quickly run out of memory. Does anybody have good ideas for nice, robust ways of writing a bioperl IO module for really big input XML files? There don't seem to be any perl pull parsers for XML. 
All I've dug up so far would be having the XML push parser running in a different thread or process, pushing chunks of data into a pipe or similar structure that blocks the progress of the push parser until the pulling bioperl code wants the next piece of data, but there are plenty of ugly issues with that, whether one were to use perl threads for it (aaagh!) or fork and push some kind of intermediate format through a pipe or socket between the two processes (eek!). So, um, if you've read this far, do you have any ideas? Rob -- Robert Buels SGN Bioinformatics Analyst 252A Emerson Hall, Cornell University Ithaca, NY 14853 Tel: 503-889-8539 rmb32 at cornell.edu http://www.sgn.cornell.edu From alc at sanger.ac.uk Wed Jul 19 06:55:12 2006 From: alc at sanger.ac.uk (Avril Coghlan) Date: Wed, 19 Jul 2006 11:55:12 +0100 Subject: [Bioperl-l] parsing est2genome output Message-ID: <1153306513.27383.12.camel@deskpro104.dynamic.sanger.ac.uk> An embedded and charset-unspecified text was scrubbed... Name: not available Url: http://lists.open-bio.org/pipermail/bioperl-l/attachments/20060719/67f858ce/attachment.pl From bernd.web at gmail.com Wed Jul 19 07:36:08 2006 From: bernd.web at gmail.com (Bernd Web) Date: Wed, 19 Jul 2006 13:36:08 +0200 Subject: [Bioperl-l] SearchIO HOWTO Message-ID: <716af09c0607190436n5fdd5576m23887051aaf95f8e@mail.gmail.com> Hi, On http://www.bioperl.org/wiki/HOWTO:SearchIO there is a great HOWTO parse your BLAST report. In the Table of methods, the third line from the bottom is: "HSP alignment Not available in this report Bio::SimpleAlign object " Would it not be good to add the get_aln method ( $hsp->get_aln) ? The line in "Using the methods" my $alignment_as_string = $alnIO->write_aln($aln); may be confusing: $alignment_as_string will be "1" on success and the alignment is printed to STDOUT. Should IO::String be introduced here to set up a string filehandle?
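To make the point concrete without depending on BioPerl, here is a small sketch using only core Perl. The write_record() function is a hypothetical stand-in for a writer method such as write_aln: it prints to a filehandle and returns a true value, so the text has to be collected through the handle rather than from the return value. An in-memory filehandle is used here; IO::String provides a comparable string-backed handle.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a writer method: prints to the given handle
# and returns 1 on success (not the text that was written).
sub write_record {
    my ($fh, $text) = @_;
    print {$fh} $text or return 0;
    return 1;
}

# Open a filehandle that writes into a scalar instead of a file.
my $buffer = '';
open(my $fh, '>', \$buffer) or die "Cannot open in-memory handle: $!";

my $status = write_record($fh, "seq1 ACGT\nseq2 AC-T\n");
close($fh);

print "return value: $status\n";
print "captured text:\n$buffer";
```

The same idea should carry over to a real writer by passing the string-backed handle as its output handle, though that part is untested here.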
Best regards, Bernd From hlapp at gmx.net Wed Jul 19 09:40:47 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Wed, 19 Jul 2006 09:40:47 -0400 Subject: [Bioperl-l] bioperl pulls, xml parsers push, and things get complicated In-Reply-To: <44BD776A.1080402@cornell.edu> References: <44BD776A.1080402@cornell.edu> Message-ID: <73755CCF-2966-4580-BBEF-1F8A94CDC55D@gmx.net> In the past the way this was done for potentially big XML files is to use regex-based extraction of chunks that correspond to a object you want to return per call to next_XXX(). That chunk would then be passed on to the XML parser under the hood. This only gets problematic once even the chunks are huge, or the name of the element that encloses your chunk can be ambiguous with what's in your text. The latter is unlikely though if you include the angle brackets. I believe this is how at least some bioperl parsers for XML-based formats were written, and it seemed to work fine. -hilmar On Jul 18, 2006, at 8:06 PM, Robert Buels wrote: > Hi all, > > Here's a kind of abstract question about Bioperl and XML parsing: > > I'm thinking about writing a bioperl parser for genomethreader XML, > and > I'm sort of mulling over the 'impedence mismatch' between the way > bioperl Bio::*IO::* modules work and the way all of the current XML > parsers work. Bioperl uses a 'pull' model, where every time you > want a > new chunk of stuff, you call $io_object->next_thing. All the XML > parsers (including XML::SAX, XML::Parser::PerlSAX and XML::Twig) use a > 'push' model, where every time they parse a chunk, they call _your_ > code, usually via a subroutine reference you've given to the XML > parser > when you start it up. > > From what I can tell, current Bioperl IO modules that parse XML are > using push parsers to parse the whole document, holding stuff in > memory, > then spoon-feeding it in chunks to the calling program when it calls > next_*(). 
This is fine until the input XML gets really big, in which > case you can quickly run out of memory. > > Does anybody have good ideas for nice, robust ways of writing a > bioperl > IO module for really big input XML files? There don't seem to be any > perl pull parsers for XML. All I've dug up so far would be having the > XML push parser running in a different thread or process, pushing > chunks > of data into a pipe or similar structure that blocks the progress > of the > push parser until the pulling bioperl code wants the next piece of > data, > but there are plenty of ugly issues with that, whether one were too > use > perl threads for it (aaagh!) or fork and push some kind of > intermediate > format through a pipe or socket between the two processes (eek!). > > So, um, if you've read this far, do you have any ideas? > > Rob > > -- > Robert Buels > SGN Bioinformatics Analyst > 252A Emerson Hall, Cornell University > Ithaca, NY 14853 > Tel: 503-889-8539 > rmb32 at cornell.edu > http://www.sgn.cornell.edu > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From jay at jays.net Wed Jul 19 09:43:52 2006 From: jay at jays.net (Jay Hannah) Date: Wed, 19 Jul 2006 08:43:52 -0500 (CDT) Subject: [Bioperl-l] Walking multiple bioentries using bioperl-db Message-ID: Howdy -- I'm using bioperl-db + biosql-schema + mySQL. I can now successfully build a biosql-schema instance in mySQL, load taxonomy, then using bioperl-db load a GenBank file from disk, commiting the sequences I want. For a given accession number + version + namespace, I can tell bioperl-db to delete that from mySQL and it does. Yay!! I'll be throwing a "Using bioperl-db" document onto the wiki over the next week. 
What I am currently baffled by: How do I ask bioperl-db to walk over multiple bioentries in my database so I can do things with them? The simplest possible example: print a list of all bioentries in my database. It is trivially easy to just query mySQL directly, but if I'm reading / understanding the documentation correctly bioperl-db intends to be database schema and RDBMS agnostic. In that case, I should use bioperl-db to walk my records. So, how do I do that? Is Bio::DB::Query::BioQuery the way to do this? The only way? If so then can someone help me understand the datacollections() and where() methods? perldoc Bio::DB::Query::BioQuery # all mouse sequences loaded under namespace ensembl that # have receptor in their description $query->datacollections(["Bio::PrimarySeqI e", "Bio::Species=>Bio::PrimarySeqI sp", "BioNamespace=>Bio::PrimarySeqI db"]); $query->where(["sp.binomial like 'Mus *'", "e.desc like '*receptor*'", "db.namespace = 'ensembl'"]); # all mouse sequences loaded under namespace ensembl that # have receptor in their description, and that also have a # cross-reference with SWISS as the database $query->datacollections(["Bio::PrimarySeqI e", "Bio::Species=>Bio::PrimarySeqI sp", "BioNamespace=>Bio::PrimarySeqI db", "Bio::Annotation::DBLink xref", I'm bewildered by this API. Please forgive my ignorance. 1) How do I get *all* bioentries out of my database? 2) Say I did want just the "namespace" 'Pico' (one of my biodatabase.name's). Where did "BioNamespace=>Bio::PrimarySeqI db"]); come from? How was I supposed to figure out the left hand side of that mapping? The right hand side? If that line wasn't sitting in that document was there a way for me to figure it out as a *user* of bioperl-db? Or would I need to be a *programmer* of bioperl-db reading source to figure this out? Where did "db.namespace = 'ensembl'"]); come from? Again, do I have to read source code to know how to invoke that magic? Sorry if I sound like a jerk. That is not my intention.
Hopefully I can document the answers for future bioperl-db'ers. Thanks in advance, j my current plaything: http://openlab.jays.net From cjfields at uiuc.edu Wed Jul 19 10:34:48 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Wed, 19 Jul 2006 09:34:48 -0500 Subject: [Bioperl-l] bioperl pulls, xml parsers push, and things get complicated In-Reply-To: <44BD776A.1080402@cornell.edu> Message-ID: <002801c6ab40$7cfcd980$15327e82@pyrimidine> The Bio::SearchIO modules are supposed to work like a SAX parser, where results are returned as the report is parsed b/c of the occurrence of specific 'events' (start_element, end_element, and so on). However, the actual behaviour for each module changes depending on the report type and the author's intention. There was a thread about a month ago on HMMPFAM report parsing where there was some contention as to how to build hits(models)/HSPs(domains). HMMPFAM output has one HSP per hit and is sorted on the sequence length so a particular hit can appear more than once, depending on how many times it hits along the sequence length itself. So, to gather all the HSPs together under one hit you would have to parse the entire report and build up a Hit/HSP tree, then use the next_hit/next_hsp iterators to parse through everything. Currently it just reports Hit/HSP pairs and it is up to the user to build that tree. In contrast, BLAST output should be capable of throwing hit/HSP clusters on the fly based on the report output, but is quite slow (even the XML output crawls). Jason thinks it's b/c of object inheritance and instantiation; I think it's probably more complicated than that (there are a ton of method calls which tend to slow things down quite a bit as well).
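For the HMMPFAM case mentioned above, where the same hit can turn up several times as separate Hit/HSP pairs, building the tree on the user side could look roughly like the sketch below. It uses plain Perl data only; the @pairs list, the model names, and the coordinates are made up for illustration and do not come from a real report or from the SearchIO API.

```perl
use strict;
use warnings;

# Hypothetical flat stream of [hit name, HSP start, HSP end] pairs, in the
# order a parser might report them (the same hit can appear more than once).
my @pairs = (
    [ 'ModelA',  10,  80 ],
    [ 'ModelB',  95, 150 ],
    [ 'ModelA', 160, 230 ],
);

# Group the HSPs under their hit, remembering first-seen order of the hits.
my (%hsps_for, @hit_order);
for my $pair (@pairs) {
    my ($hit, $start, $end) = @$pair;
    push @hit_order, $hit unless exists $hsps_for{$hit};
    push @{ $hsps_for{$hit} }, { start => $start, end => $end };
}

# Walk the resulting tree: one line per hit, listing all of its HSP ranges.
for my $hit (@hit_order) {
    my @ranges = map { "$_->{start}-$_->{end}" } @{ $hsps_for{$hit} };
    print "$hit: ", join(', ', @ranges), "\n";
}
```

Note that this only works after the whole stream has been read, which is exactly the trade-off described above: grouping by hit costs you the ability to hand results back one at a time.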
I would say try using SearchIO, but instead of relying directly on object handler calls to create Hit/HSP objects using an object factory (which is where I think a majority of the speed is lost), build the data internally on the fly using start_element/end_element, then return hashes instead based on the element type triggered using end_element. As an aside, I'm trying to switch the SearchIO::blastxml over to XML::SAX (using XML::SAX::ExpatXS/expat) and plan on switching it over to using hashes at some point, possibly starting off with a different SearchIO plugin module. If you have other suggestions (XML parser of choice, ways to speed up parsing/retrieve data) we would be glad to hear them. Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Robert Buels > Sent: Tuesday, July 18, 2006 7:06 PM > To: bioperl-l at bioperl.org > Subject: [Bioperl-l] bioperl pulls, xml parsers push,and things get > complicated > > Hi all, > > Here's a kind of abstract question about Bioperl and XML parsing: > > I'm thinking about writing a bioperl parser for genomethreader XML, and > I'm sort of mulling over the 'impedence mismatch' between the way > bioperl Bio::*IO::* modules work and the way all of the current XML > parsers work. Bioperl uses a 'pull' model, where every time you want a > new chunk of stuff, you call $io_object->next_thing. All the XML > parsers (including XML::SAX, XML::Parser::PerlSAX and XML::Twig) use a > 'push' model, where every time they parse a chunk, they call _your_ > code, usually via a subroutine reference you've given to the XML parser > when you start it up. > > From what I can tell, current Bioperl IO modules that parse XML are > using push parsers to parse the whole document, holding stuff in memory, > then spoon-feeding it in chunks to the calling program when it calls > next_*(). 
This is fine until the input XML gets really big, in which > case you can quickly run out of memory. > > Does anybody have good ideas for nice, robust ways of writing a bioperl > IO module for really big input XML files? There don't seem to be any > perl pull parsers for XML. All I've dug up so far would be having the > XML push parser running in a different thread or process, pushing chunks > of data into a pipe or similar structure that blocks the progress of the > push parser until the pulling bioperl code wants the next piece of data, > but there are plenty of ugly issues with that, whether one were too use > perl threads for it (aaagh!) or fork and push some kind of intermediate > format through a pipe or socket between the two processes (eek!). > > So, um, if you've read this far, do you have any ideas? > > Rob > > -- > Robert Buels > SGN Bioinformatics Analyst > 252A Emerson Hall, Cornell University > Ithaca, NY 14853 > Tel: 503-889-8539 > rmb32 at cornell.edu > http://www.sgn.cornell.edu > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Wed Jul 19 10:44:30 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Wed, 19 Jul 2006 09:44:30 -0500 Subject: [Bioperl-l] SearchIO HOWTO In-Reply-To: <716af09c0607190436n5fdd5576m23887051aaf95f8e@mail.gmail.com> Message-ID: <002901c6ab41$d7f61350$15327e82@pyrimidine> The information in that table is referring to the BLAST report example before the table itself. However, I can tell you that using that report works (sorry if the text wrapping here mangles the output), so the table information is erroneous. I'll do some updating on that. 
Chris Here's the script: use Bio::SearchIO; use Bio::AlignIO; my $parser = Bio::SearchIO->new (-file => shift @ARGV, -format => 'blast'); my $aln_out = Bio::AlignIO->new(-fh => \*STDOUT, -format => 'clustalw'); while (my $result = $parser->next_result) { while (my $hit = $result->next_hit) { while (my $hsp = $hit->next_hsp) { $aln_out->write_aln($hsp->get_aln); } } } Output (via STDOUT): ------------------------------------ CLUSTAL W(1.81) multiple sequence alignment gi|20521485|dbj|AP004641.2/2896-3051 DMGRCSSGCNRYPEPMTPDTMIKLYREKEGLGAYIWMPTPDMSTEGRVQMLP gb|443893|124775/197-246 DIVQNSSGCNRYPEPMTPDTMIKLYRE-EGL-AYIWMPTPDMSTEGRVQMLP *: : ********************** *** ******************** ------------------------------------ > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Bernd Web > Sent: Wednesday, July 19, 2006 6:36 AM > To: bioperl-l at bioperl.org > Subject: [Bioperl-l] SearchIO HOWTO > > Hi, > > On http://www.bioperl.org/wiki/HOWTO:SearchIO there is a great HOWTO > parse your BLAST report. > In the Table of methods, the third line from the bottom is: > "HSP alignment Not available in this report Bio::SimpleAlign object " > > Would it not be good to add the get_aln method ( $hsp->get_aln) ? > > The line in "Using the methods" > my $alignment_as_string = $alnIO->write_aln($aln); > > may be confusing: $alignment_as_string will be "1" on success and the > alignment is printed to STDIO. Should IO::String be introduced here > too set up a string filehandle? 
> > > Best regards, > Bernd > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Wed Jul 19 10:55:02 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Wed, 19 Jul 2006 09:55:02 -0500 Subject: [Bioperl-l] ListSummaries delay apologies Message-ID: <002a01c6ab43$508aa5a0$15327e82@pyrimidine> Sorry about the delay for the ListSummaries the past couple months; things have been pretty hectic here which has put me really behind on them (it hasn't ever been my top priority, anyway). We're getting papers ready for publication, I going to a summer institute in a few weeks, and research (as always) is full steam ahead. Just so everybody know, I haven't given up on them, and plan on getting caught up after I get back from the institute in Connecticut (beginning of August). Cheers! Christopher Fields Postdoctoral Researcher - Switzer Lab Dept. of Biochemistry University of Illinois Urbana-Champaign From hlapp at gmx.net Wed Jul 19 11:31:50 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Wed, 19 Jul 2006 11:31:50 -0400 Subject: [Bioperl-l] Walking multiple bioentries using bioperl-db In-Reply-To: References: Message-ID: <62DA6CBC-CD0E-46A7-A669-71FFC808041B@gmx.net> On Jul 19, 2006, at 9:43 AM, Jay Hannah wrote: > Howdy -- > > I'm using bioperl-db + biosql-schema + mySQL. > > I can now successfully build a biosql-schema instance in mySQL, load > taxonomy, then using bioperl-db load a GenBank file from disk, > commiting > the sequences I want. For a given accession number + version + > namespace, > I can tell bioperl-db to delete that from mySQL and it does. Yay!! > I'll be > throwing a "Using bioperl-db" document onto the wiki over the next > week. Excellent! > > What I am current baffled by: > > How do I ask bioperl-db to walk over multiple bioentries in my > database so > I can do things with them? 
The simplest possible example: print a > list of > all bioentries in my database. > > It is trivially easy to just query mySQL directly, but if I'm > reading / > understanding the documentation correctly bioperl-db intends to be > database schema and RDBMS agnostic. In that case, I should use > bioperl-db > to walk my records. So, how do I do that? Bioperl-db indeed intends to be schema(-variant) and RDBMS agnostic, but that doesn't mean that you have to be as well. If you find it trivially easy to query your database using SQL and DBI and you don't care about being RDBMS or schema-variant agnostic, then by all means don't feel obligated to go through the bioperl-db API for querying. Note you can obtain the DBI database handle being used by a persistence adaptor by calling dbh(): my $dbh = $adaptor->dbh(); (The advantage of this is that you use the same connection, and therefore the same machinery for obtaining connection parameters and building the DSN that the rest of bioperl-db uses. Also, you have the ability to see transactions in progress that have not been committed yet by the adaptor.) What you should not do through SQL directly is modify (UPDATE & DELETE) entities which bioperl-db also holds in a cache (by default terms, dbxrefs), unless you also take care to clear the cache of the respective adaptor. > > Is Bio::DB::Query::BioQuery the way to do this? The only way? Well, yes, unless you want to use SQL directly (which is not a despised option, see above). > > If so then can someone help me understand the datacollections() and > where() methods? datacollections() in essence corresponds to the FROM clause in a SQL statement, including JOIN statements. '=>' joins two entities in a 1:n relationship, '<=>' joins two entities in an n:n relationship. Instead of the table(s) you give the (Bioperl) objects that are to be joined, and bioperl-db will translate the objects to database entities, i.e., tables. Each object may be followed by an alias.
The alias makes it easier to refer to the object (entity) in the query constraint part (where()). A single alias following a join expression will always apply to the master object (table). > > perldoc Bio::DB::Query::BioQuery > > # all mouse sequences loaded under namespace ensembl that > # have receptor in their description > $query->datacollections(["Bio::PrimarySeqI e", > "Bio::Species=>Bio::PrimarySeqI sp", > "BioNamespace=>Bio::PrimarySeqI > db"]); This is short for $query->datacollections([ # enumerate the objects we need: "Bio::PrimarySeqI e", "Bio::Species sp", "BioNamespace db", # specify master-detail relationships "Bio::Species=>Bio::PrimarySeqI", "BioNamespace=>Bio::PrimarySeqI"]); because the alias following the join statement applies to the master entity. > $query->where(["sp.binomial like 'Mus *'", > "e.desc like '*receptor*'", > "db.namespace = 'ensembl'"]); The where() method corresponds to the WHERE clause in SQL. The default logical operator between constraints is AND. There is more documentation on the syntax of expressing constraints in Bio::DB::Query::QueryConstraint. The column for which to constrain the value is given as the attribute (method) of the (bioperl) object. If there are multiple objects in the 'datacollections' then you need to qualify each attribute by prefixing it with the object, or the alias assigned in datacollections(), followed by a dot, corresponding to typical OO syntax. > > # all mouse sequences loaded under namespace ensembl that > # have receptor in their description, and that also have a > # cross-reference with SWISS as the database > $query->datacollections(["Bio::PrimarySeqI e", > "Bio::Species=>Bio::PrimarySeqI sp", > "BioNamespace=>Bio::PrimarySeqI db", > "Bio::Annotation::DBLink xref", > > I'm bewildered by this API. Please forgive my ignorance. I understand. This part of the API is by far the one with the skimpiest documentation.
There are a considerable number of tests in t/query.t which may serve as examples. They also are known to work if their tests don't fail. The tests don't actually execute any query, instead some internal guts are used to test the translation to SQL, so if you know SQL you may be able to understand better what's going on by seeing the object- level query and the SQL-level query side-by-side. > > 1) How do I get *all* bioentries out of my database? Your datacollections would consist of the single object Bio::SeqI (or Bio::PrimarySeqI if you didn't want any annotation), and there would be no query constraint: my $query = Bio::DB::Query::BioQuery->new(-datacollections=> ["Bio::SeqI"]); > > 2) Say I did want just the "namespace" 'Pico' (one of my > biodatabase.name's). Where did > > "BioNamespace=>Bio::PrimarySeqI db"]); > > come from? How was I supposed to figure out the left hand side of that > mapping? The right hand side? If that line wasn't sitting in that > document > was there a way for me to figure it out as a *user* of bioperl-db? You would not know from Bioperl itself. The right hand side is a Bioperl class. The left hand side is a kludge because Bioperl does not have a namespace class, instead objects that have a namespace implement the Bio::IdentifiableI interface directly. This kind of one class mapping to two database entities (biodatabase is a table separate from, in fact a master for, bioentry) is extremely cumbersome to express in a generic way, so I chose to create a Bio::DB::Persistent::BioNamespace class to represent that for the purpose of queries. > Or would I need to be a *programmer* of bioperl-db reading source > to figure > this out? Where did > > "db.namespace = 'ensembl'"]); > > come from? Again, do I have to read source code to know how to invoke > that magic? 
Well, I'm not sure even reading the source code clears it all up ;) As I said before, the part before the dot is the alias or object, the part after is the attribute (or method) to be constrained. > > Sorry if I sound like a jerk. That is not my intention. Hopefully I > can > document the answers for future bioperl-db'ers. No problem, that's fine - and whatever you would be willing to contribute to documentation would be highly appreciated. -hilmar > > Thanks in advance, > > j > my current plaything: http://openlab.jays.net > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From aaron.j.mackey at gsk.com Wed Jul 19 09:48:55 2006 From: aaron.j.mackey at gsk.com (aaron.j.mackey at gsk.com) Date: Wed, 19 Jul 2006 09:48:55 -0400 Subject: [Bioperl-l] bioperl pulls, xml parsers push, and things get complicated In-Reply-To: <44BD776A.1080402@cornell.edu> Message-ID: There are 3rd generation XML "Pull" parsers (also called "StAX" for Streaming API for XML), but they seem to still be stuck in Java land (e.g. "MXP1") You could probably use POE to setup a state machine that used XML::Twig to "push" units of XML content onto a stack, to be read by your "next_*" pull method (where the XML::Twig push "stalled" until the "next_*" method was called, and vice versa). -Aaron bioperl-l-bounces at lists.open-bio.org wrote on 07/18/2006 08:06:02 PM: > Hi all, > > Here's a kind of abstract question about Bioperl and XML parsing: > > I'm thinking about writing a bioperl parser for genomethreader XML, and > I'm sort of mulling over the 'impedence mismatch' between the way > bioperl Bio::*IO::* modules work and the way all of the current XML > parsers work. 
Bioperl uses a 'pull' model, where every time you want a > new chunk of stuff, you call $io_object->next_thing. All the XML > parsers (including XML::SAX, XML::Parser::PerlSAX and XML::Twig) use a > 'push' model, where every time they parse a chunk, they call _your_ > code, usually via a subroutine reference you've given to the XML parser > when you start it up. > > From what I can tell, current Bioperl IO modules that parse XML are > using push parsers to parse the whole document, holding stuff in memory, > then spoon-feeding it in chunks to the calling program when it calls > next_*(). This is fine until the input XML gets really big, in which > case you can quickly run out of memory. > > Does anybody have good ideas for nice, robust ways of writing a bioperl > IO module for really big input XML files? There don't seem to be any > perl pull parsers for XML. All I've dug up so far would be having the > XML push parser running in a different thread or process, pushing chunks > of data into a pipe or similar structure that blocks the progress of the > push parser until the pulling bioperl code wants the next piece of data, > but there are plenty of ugly issues with that, whether one were too use > perl threads for it (aaagh!) or fork and push some kind of intermediate > format through a pipe or socket between the two processes (eek!). > > So, um, if you've read this far, do you have any ideas? 
> > Rob > > -- > Robert Buels > SGN Bioinformatics Analyst > 252A Emerson Hall, Cornell University > Ithaca, NY 14853 > Tel: 503-889-8539 > rmb32 at cornell.edu > http://www.sgn.cornell.edu > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > From arareko at campus.iztacala.unam.mx Wed Jul 19 12:20:21 2006 From: arareko at campus.iztacala.unam.mx (Mauricio Herrera Cuadra) Date: Wed, 19 Jul 2006 11:20:21 -0500 Subject: [Bioperl-l] bioperl pulls, xml parsers push, and things get complicated In-Reply-To: <002801c6ab40$7cfcd980$15327e82@pyrimidine> References: <002801c6ab40$7cfcd980$15327e82@pyrimidine> Message-ID: <44BE5BC5.5040006@campus.iztacala.unam.mx> There are a lot of different XML processing strategies. Most fall into two categories: stream-based and tree-based. With the stream-based strategy, the parser continuously alerts a program to patterns in the XML. The parser functions like a pipeline, taking XML markup on one end and pumping out processed nuggets of data to your program. With the tree-based strategy, the parser keeps the data to itself until the very end, when it presents a complete model of the document to your program. The whole point to this strategy is that your program can pull out any data it needs, in any order. Most of the times I use tree-based strategies because they place all of the data into a structure which lets me to access any internal node using array/hash references. The simplest parser for this is XML::Simple using XML::Parser as the 'preferred parser' (which is built on top of XML::Parser::Expat, which is a wrapper around the expat library). More advanced parsers (both stream and tree-based) are: * XML::LibXML (a wrapper for libxml2's C library) * XML::Grove (takes a tree and changes it into an object hierarchy. 
Each node type is represented by a different class) * XML::PYX (for repackaging XML as a stream of easily recognizable and transmutable symbols) * XML::SimpleObject (changes a hierarchy of lists into a hierarchy of objects) * XML::XPath (for writing expressions that pinpoint specific pieces of documents) There are also some standards-based solutions like: * XML::SAX (Simple API for XML) for event streams. * XML::DOM (Document Object Model) for tree processing. Your strategy of choice depends a lot on the type of XML files you want to parse. Understanding the structure of the files and deciding which data you want to extract from them is a fundamental step in choosing the appropriate method/parser to use. Just my 2 cents :) Regards, Mauricio. Chris Fields wrote: > The Bio::SearchIO modules are supposed to work like a SAX parser, where results > are returned as the report is parsed b/c of the occurrence of specific > 'events' (start_element, end_element, and so on). However, the actual > behaviour for each module changes depending on the report type and the > author's intention. > > There was a thread about a month ago on HMMPFAM report parsing where there > was some contention as to how to build hits(models)/HSPs(domains). HMMPFAM > output has one HSP per hit and is sorted on the sequence length so a > particular hit can appear more than once, depending on how many times it > hits along the sequence length itself. So, to gather all the HSPs together > under one hit you would have to parse the entire report and build up a > Hit/HSP tree, then use the next_hit/next_hsp iterators to parse through > everything. Currently it just reports Hit/HSP pairs and it is up to the > user to build that tree. > > In contrast, BLAST output should be capable of throwing hit/HSP clusters on > the fly based on the report output, but is quite slow (even the XML output > crawls).
Jason thinks it's b/c of object inheritance and instantiation; I > think it's probably more complicated than that (there are a ton of method > calls which tend to slow things down quite a bit as well). > > I would say try using SearchIO, but instead of relying directly on object > handler calls to create Hit/HSP objects using an object factory (which is > where I think a majority of the speed is lost), build the data internally on > the fly using start_element/end_element, then return hashes instead based on > the element type triggered using end_element. > > As an aside, I'm trying to switch the SearchIO::blastxml over to XML::SAX > (using XML::SAX::ExpatXS/expat) and plan on switching it over to using > hashes at some point, possibly starting off with a different SearchIO plugin > module. If you have other suggestions (XML parser of choice, ways to speed > up parsing/retrieve data) we would be glad to hear them. > > Chris > > > >> -----Original Message----- >> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- >> bounces at lists.open-bio.org] On Behalf Of Robert Buels >> Sent: Tuesday, July 18, 2006 7:06 PM >> To: bioperl-l at bioperl.org >> Subject: [Bioperl-l] bioperl pulls, xml parsers push,and things get >> complicated >> >> Hi all, >> >> Here's a kind of abstract question about Bioperl and XML parsing: >> >> I'm thinking about writing a bioperl parser for genomethreader XML, and >> I'm sort of mulling over the 'impedence mismatch' between the way >> bioperl Bio::*IO::* modules work and the way all of the current XML >> parsers work. Bioperl uses a 'pull' model, where every time you want a >> new chunk of stuff, you call $io_object->next_thing. All the XML >> parsers (including XML::SAX, XML::Parser::PerlSAX and XML::Twig) use a >> 'push' model, where every time they parse a chunk, they call _your_ >> code, usually via a subroutine reference you've given to the XML parser >> when you start it up. 
>> >> From what I can tell, current Bioperl IO modules that parse XML are >> using push parsers to parse the whole document, holding stuff in memory, >> then spoon-feeding it in chunks to the calling program when it calls >> next_*(). This is fine until the input XML gets really big, in which >> case you can quickly run out of memory. >> >> Does anybody have good ideas for nice, robust ways of writing a bioperl >> IO module for really big input XML files? There don't seem to be any >> perl pull parsers for XML. All I've dug up so far would be having the >> XML push parser running in a different thread or process, pushing chunks >> of data into a pipe or similar structure that blocks the progress of the >> push parser until the pulling bioperl code wants the next piece of data, >> but there are plenty of ugly issues with that, whether one were too use >> perl threads for it (aaagh!) or fork and push some kind of intermediate >> format through a pipe or socket between the two processes (eek!). >> >> So, um, if you've read this far, do you have any ideas? 
>> >> Rob >> >> -- >> Robert Buels >> SGN Bioinformatics Analyst >> 252A Emerson Hall, Cornell University >> Ithaca, NY 14853 >> Tel: 503-889-8539 >> rmb32 at cornell.edu >> http://www.sgn.cornell.edu >> >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- MAURICIO HERRERA CUADRA arareko at campus.iztacala.unam.mx Laboratorio de Gen?tica Unidad de Morfofisiolog?a y Funci?n Facultad de Estudios Superiores Iztacala, UNAM From cjfields at uiuc.edu Wed Jul 19 14:45:55 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Wed, 19 Jul 2006 13:45:55 -0500 Subject: [Bioperl-l] bioperl pulls, xml parsers push, and things get complicated In-Reply-To: <44BE5BC5.5040006@campus.iztacala.unam.mx> Message-ID: <000301c6ab63$91d31680$15327e82@pyrimidine> Yeah, we use XML::SAX, with XML::SAX::ExpatXS and expat, for SearchIO::blastxml. It previously used XML::Parser::PerlSAX but that didn't support SAX2-based parsing. XML::Twig is also used quite a bit Jason added his thoughts about this to the wiki: http://www.bioperl.org/wiki/XML_parsers Personally, I use XML::Simple with EUtilities because the XML returned is remarkably simple and normally fairly short. The trick is making sure when parsing data to dereference everything properly since XML::Simple stores everything in an elaborate data structure. I plan on switching to XML::SAX::ExpatXS or XML::Twig soon. Chris > There are a lot of different XML processing strategies. Most fall into > two categories: stream-based and tree-based. > > With the stream-based strategy, the parser continuously alerts a program > to patterns in the XML. 
The parser functions like a pipeline, taking XML > markup on one end and pumping out processed nuggets of data to your > program. > > With the tree-based strategy, the parser keeps the data to itself until > the very end, when it presents a complete model of the document to your > program. The whole point to this strategy is that your program can pull > out any data it needs, in any order. > > Most of the times I use tree-based strategies because they place all of > the data into a structure which lets me to access any internal node > using array/hash references. The simplest parser for this is XML::Simple > using XML::Parser as the 'preferred parser' (which is built on top of > XML::Parser::Expat, which is a wrapper around the expat library). > > More advanced parsers (both stream and tree-based) are: > > * XML::LibXML (a wrapper for libxml2's C library) > * XML::Grove (takes a tree and changes it into an object hierarchy. Each > node type is represented by a different class) > * XML::PYX (for repackaging XML as a stream of easily recognizable and > transmutable symbols) > * XML::SimpleObject (changes a hierarchy of lists into a hierarchy of > objects) > * XML::XPath (for writing expressions that pinpoint specific pieces of > documents) > > There are also some standards-based solutions like: > > * XML::SAX (Simple API for XML) for event streams. > * XML::DOM (Document Object Model) for tree processing. > > Your strategy of choice depends a lot on the type of XML files you want > to parse. Understanding the structure of the files and deciding which is > the data you want to extract from them is a fundamental step to choose > the appropriate method/parser to use. > > Just my 2 cents :) > > Regards, > Mauricio. > > Chris Fields wrote: > > The Bio::SearchIO modules are supposed work like a SAX parser, where > results > > are returned as the report is parsed b/c of the occurrence of specific > > 'events' (start_element, end_element, and so on). 
However, the actual > > behaviour for each module changes depending on the report type and the > > author's intention. > > > > There was a thread about a month ago on HMMPFAM report parsing where > there > > was some contention as to how to build hits(models)/HSPs(domains). > HMMPFAM > > output has one HSP per hit and is sorted on the sequence length so a > > particular hit can appear more than once, depending on how many times it > > hits along the sequence length itself. So, to gather all the HSPs > together > > under one hit you would have to parse the entire report and build up a > > Hit/HSP tree, then use the next_hit/next_hsp oterators to parse through > > everything. Currently it just reports Hit/HSP pairs and it is up to the > > user to build that tree. > > > > In contrast, BLAST output should be capable of throwing hit/HSP clusters > on > > the fly based on the report output, but is quite slow (event the XML > output > > crawls). Jason thinks it's b/c of object inheritance and instantiation; > I > > think it's probably more complicated than that (there are a ton of > method > > calls which tend to slow things down quite a bit as well). > > > > I would say try using SearchIO, but instead of relying directly on > object > > handler calls to create Hit/HSP objects using an object factory (which > is > > where I think a majority of the speed is lost), build the data > internally on > > the fly using start_element/end_element, then return hashes instead > based on > > the element type triggered using end_element. > > > > As an aside, I'm trying to switch the SearchIO::blastxml over to > XML::SAX > > (using XML::SAX::ExpatXS/expat) and plan on switching it over to using > > hashes at some point, possibly starting off with a different SearchIO > plugin > > module. If you have other suggestions (XML parser of choice, ways to > speed > > up parsing/retrieve data) we would be glad to hear them. 
> > > > Chris > > > > > > > >> -----Original Message----- > >> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > >> bounces at lists.open-bio.org] On Behalf Of Robert Buels > >> Sent: Tuesday, July 18, 2006 7:06 PM > >> To: bioperl-l at bioperl.org > >> Subject: [Bioperl-l] bioperl pulls, xml parsers push,and things get > >> complicated > >> > >> Hi all, > >> > >> Here's a kind of abstract question about Bioperl and XML parsing: > >> > >> I'm thinking about writing a bioperl parser for genomethreader XML, and > >> I'm sort of mulling over the 'impedence mismatch' between the way > >> bioperl Bio::*IO::* modules work and the way all of the current XML > >> parsers work. Bioperl uses a 'pull' model, where every time you want a > >> new chunk of stuff, you call $io_object->next_thing. All the XML > >> parsers (including XML::SAX, XML::Parser::PerlSAX and XML::Twig) use a > >> 'push' model, where every time they parse a chunk, they call _your_ > >> code, usually via a subroutine reference you've given to the XML parser > >> when you start it up. > >> > >> From what I can tell, current Bioperl IO modules that parse XML are > >> using push parsers to parse the whole document, holding stuff in > memory, > >> then spoon-feeding it in chunks to the calling program when it calls > >> next_*(). This is fine until the input XML gets really big, in which > >> case you can quickly run out of memory. > >> > >> Does anybody have good ideas for nice, robust ways of writing a bioperl > >> IO module for really big input XML files? There don't seem to be any > >> perl pull parsers for XML. 
All I've dug up so far would be having the > >> XML push parser running in a different thread or process, pushing > chunks > >> of data into a pipe or similar structure that blocks the progress of > the > >> push parser until the pulling bioperl code wants the next piece of > data, > >> but there are plenty of ugly issues with that, whether one were to use > >> perl threads for it (aaagh!) or fork and push some kind of intermediate > >> format through a pipe or socket between the two processes (eek!). > >> > >> So, um, if you've read this far, do you have any ideas? > >> > >> Rob > >> > >> -- > >> Robert Buels > >> SGN Bioinformatics Analyst > >> 252A Emerson Hall, Cornell University > >> Ithaca, NY 14853 > >> Tel: 503-889-8539 > >> rmb32 at cornell.edu > >> http://www.sgn.cornell.edu > >> > >> > >> _______________________________________________ > >> Bioperl-l mailing list > >> Bioperl-l at lists.open-bio.org > >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > -- > MAURICIO HERRERA CUADRA > arareko at campus.iztacala.unam.mx > Laboratorio de Genética > Unidad de Morfofisiología y Función > Facultad de Estudios Superiores Iztacala, UNAM > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From rmb32 at cornell.edu Wed Jul 19 15:30:28 2006 From: rmb32 at cornell.edu (Robert Buels) Date: Wed, 19 Jul 2006 12:30:28 -0700 Subject: [Bioperl-l] bioperl pulls, xml parsers push, and things get complicated In-Reply-To: References: Message-ID: <44BE8854.8010301@cornell.edu> POE is a really neat thing; I didn't know about it before. Something tells me, however, that I would have trouble convincing people to install POE as a dependency for a genomethreader output parser.
;-) I hope I'll have the opportunity to use it sometime. For the curious, here's a nice intro to POE: http://perl.com/pub/a/2001/01/poe.html And the POE main site: http://poe.perl.org/ Rob aaron.j.mackey at GSK.COM wrote: > There are 3rd generation XML "Pull" parsers (also called "StAX" for > Streaming API for XML), but they seem to still be stuck in Java land (e.g. > "MXP1") > > You could probably use POE to setup a state machine that used XML::Twig to > "push" units of XML content onto a stack, to be read by your "next_*" pull > method (where the XML::Twig push "stalled" until the "next_*" method was > called, and vice versa). > > -Aaron > > bioperl-l-bounces at lists.open-bio.org wrote on 07/18/2006 08:06:02 PM: > > >> Hi all, >> >> Here's a kind of abstract question about Bioperl and XML parsing: >> >> I'm thinking about writing a bioperl parser for genomethreader XML, and >> I'm sort of mulling over the 'impedence mismatch' between the way >> bioperl Bio::*IO::* modules work and the way all of the current XML >> parsers work. Bioperl uses a 'pull' model, where every time you want a >> new chunk of stuff, you call $io_object->next_thing. All the XML >> parsers (including XML::SAX, XML::Parser::PerlSAX and XML::Twig) use a >> 'push' model, where every time they parse a chunk, they call _your_ >> code, usually via a subroutine reference you've given to the XML parser >> when you start it up. >> >> From what I can tell, current Bioperl IO modules that parse XML are >> using push parsers to parse the whole document, holding stuff in memory, >> > > >> then spoon-feeding it in chunks to the calling program when it calls >> next_*(). This is fine until the input XML gets really big, in which >> case you can quickly run out of memory. >> >> Does anybody have good ideas for nice, robust ways of writing a bioperl >> IO module for really big input XML files? There don't seem to be any >> perl pull parsers for XML. 
All I've dug up so far would be having the >> XML push parser running in a different thread or process, pushing chunks >> > > >> of data into a pipe or similar structure that blocks the progress of the >> > > >> push parser until the pulling bioperl code wants the next piece of data, >> > > >> but there are plenty of ugly issues with that, whether one were too use >> perl threads for it (aaagh!) or fork and push some kind of intermediate >> format through a pipe or socket between the two processes (eek!). >> >> So, um, if you've read this far, do you have any ideas? >> >> Rob >> >> -- >> Robert Buels >> SGN Bioinformatics Analyst >> 252A Emerson Hall, Cornell University >> Ithaca, NY 14853 >> Tel: 503-889-8539 >> rmb32 at cornell.edu >> http://www.sgn.cornell.edu >> >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> >> > > > -- Robert Buels SGN Bioinformatics Analyst 252A Emerson Hall, Cornell University Ithaca, NY 14853 Tel: 503-889-8539 rmb32 at cornell.edu http://www.sgn.cornell.edu From dwaner at scitegic.com Wed Jul 19 15:47:58 2006 From: dwaner at scitegic.com (dwaner at scitegic.com) Date: Wed, 19 Jul 2006 12:47:58 -0700 Subject: [Bioperl-l] EMBL release 87 format changes. Message-ID: BioPerl Users and Developers, I have updated the EMBL SeqIO parser to work correctly with Release 87 of EMBL (June 19th, 2006). As suggested by Chris Fields in an earlier message, the EMBL parser now reads both new and old formats, but only writes the new format. I don't think that my changes will affect most users, but if you are using the EMBL format can you review the changes described below and speak up if anything looks like it could create a problem for you? If I don't hear any objections soon, I will submit a patch to bugzilla. Thanks, - David Parser changes: - EMBL files no longer contain the "entry name". 
When reading old format files, the EMBL "entry name" from the ID line is used as the Bio::Seq::id and Bio::Seq::display_id, but when reading new format files, the accession number is used for these fields. Changes to output: - The ID line was changed to the new format. - The SV line is never written; SV is now part of the ID line. - "DNA" and "RNA" are no longer valid EMBL molecule types. They are now written as "unassigned DNA" and "unassigned RNA" - Strictly speaking, EMBL format should only be used for nucleotide sequences. If the alphabet is 'protein', write_seq() emits a warning and writes the non-standard molecule type "AA" in the ID line. - Because BioPerl sequences do not have a "data class" attribute, all sequences are written with a data class of "STD" in the ID line. - The ID line contains the Bio::Seq::accession, unless it is missing, in which case the Bio::Seq::id is used. - molecule type is strictly validated. Non-EMBL values are output as "unassigned DNA" or "unassigned RNA", depending on the sequence alphabet. - "taxonomic division" is strictly validated. Non-EMBL values are output as "UNC". - The taxonomic division code "UNK" is now written as "UNC" (unclassified). Possible Gotchas for some users: - Because the EMBL entry name is no longer included anywhere in the file, when round-tripping from old format to new format the entry name will be lost. - In order to ensure that BioPerl writes valid EMBL files, I have added strict validation to the writer for "molecule type" and "taxonomic division". This could present a problem for users who are using non-standard values for these fields, but I felt it was important to write files that adhere to the EMBL spec. 
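For readers who have not seen the two ID-line layouts side by side, here is a minimal sketch of how a reader might tell them apart. It is not the code from the patch: the helper name parse_embl_id is hypothetical, the two sample ID lines are illustrative, and the regular expressions assume the field order "accession; SV; topology; molecule type; data class; division; length" for the new layout and "entry name, data class; molecule; division; length" for the old one.

```perl
use strict;
use warnings;

# Hypothetical helper (NOT part of Bio::SeqIO::embl): split an EMBL ID line
# into named fields, accepting both the pre-87 and the release 87 layout.
sub parse_embl_id {
    my ($line) = @_;

    # Assumed new layout:
    # ID   accession; SV n; topology; molecule; class; division; length BP.
    if ($line =~ /^ID\s+(\S+);\s+SV\s+(\d+);\s+(\w+);\s+([^;]+);\s+(\w+);\s+(\w+);\s+(\d+)\s+BP\./) {
        return {
            format     => 'new',
            id         => $1,    # accession doubles as the display id
            version    => $2,    # SV now lives on the ID line
            topology   => $3,
            molecule   => $4,
            data_class => $5,
            division   => $6,
            length     => $7,
        };
    }

    # Assumed old layout:
    # ID   entryname  class; molecule; division; length BP.
    if ($line =~ /^ID\s+(\S+)\s+(\w+);\s+([^;]+);\s+(\w+);\s+(\d+)\s+BP\./) {
        return {
            format     => 'old',
            id         => $1,    # the entry name, absent from new files
            data_class => $2,
            molecule   => $3,
            division   => $4,
            length     => $5,
        };
    }

    return;    # not an ID line in either layout
}

# Illustrative sample lines, one per layout.
for my $line (
    'ID   SC10H5 standard; DNA; PRO; 4870 BP.',
    'ID   X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP.',
) {
    my $id = parse_embl_id($line) or next;
    print "$id->{format}: id=$id->{id} molecule=$id->{molecule}\n";
}
```

The new-layout pattern is tried first because it is the stricter of the two: an old-style line has no semicolon directly after the identifier, so it cannot match by accident.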
From slenk at emich.edu Wed Jul 19 16:04:16 2006
From: slenk at emich.edu (Stephen Gordon Lenk)
Date: Wed, 19 Jul 2006 16:04:16 -0400
Subject: [Bioperl-l] bioperl pulls, xml parsers push, and things get complicated
Message-ID: <13edac5b13ed8208.13ed820813edac5b@emich.edu>

Hi,

I have found that POE fails to execute a periodic task after 32 iterations in a Perl thread, consistent failure on both XP and OSX - if I knew how to write up a defect for Perl I would do this (hint ? how is this done - I'm *not* asking RTFM etc) - probably remiss for not doing so - I was going to write messages to a Controller Area Network (CAN) to control automotive widgets from Perl - I wound up using a C code exe (piped to from Perl) with its own threads to do this. Oh yes I believe that bio lab systems can be done this way as well.

But ... POE is really neat if you think in state machine terms. I have an alternate architecture for my test harness (Perlizer) that would use POE to run tests with CAN and GPIB.

Steve Lenk

----- Original Message -----
From: Robert Buels
Date: Wednesday, July 19, 2006 3:30 pm
Subject: Re: [Bioperl-l] bioperl pulls, xml parsers push, and things get complicated

> POE is a really neat thing, I didn't know about it before. Something
> tells me, however, that I would have trouble convincing people to
> install POE as a dependency for a genomethreader output parser. ;-)
> I hope I'll have the opportunity to use it sometime.
>
> For the curious, here's a nice intro to POE:
> http://perl.com/pub/a/2001/01/poe.html
> And the POE main site:
> http://poe.perl.org/
>
> Rob
>
> aaron.j.mackey at GSK.COM wrote:
> > There are 3rd generation XML "Pull" parsers (also called "StAX" for
> > Streaming API for XML), but they seem to still be stuck in Java land
> > (e.g. "MXP1")
> >
> > You could probably use POE to setup a state machine that used XML::Twig to
> > "push" units of XML content onto a stack, to be read by your "next_*" pull
> > method (where the XML::Twig push "stalled" until the "next_*" method was
> > called, and vice versa).
> >
> > -Aaron
> >
> > bioperl-l-bounces at lists.open-bio.org wrote on 07/18/2006 08:06:02 PM:
> >
> >> Hi all,
> >>
> >> Here's a kind of abstract question about Bioperl and XML parsing:
> >>
> >> I'm thinking about writing a bioperl parser for genomethreader XML, and
> >> I'm sort of mulling over the 'impedance mismatch' between the way
> >> bioperl Bio::*IO::* modules work and the way all of the current XML
> >> parsers work. Bioperl uses a 'pull' model, where every time you want a
> >> new chunk of stuff, you call $io_object->next_thing. All the XML
> >> parsers (including XML::SAX, XML::Parser::PerlSAX and XML::Twig) use a
> >> 'push' model, where every time they parse a chunk, they call _your_
> >> code, usually via a subroutine reference you've given to the XML parser
> >> when you start it up.
> >>
> >> From what I can tell, current Bioperl IO modules that parse XML are
> >> using push parsers to parse the whole document, holding stuff in memory,
> >> then spoon-feeding it in chunks to the calling program when it calls
> >> next_*(). This is fine until the input XML gets really big, in which
> >> case you can quickly run out of memory.
> >>
> >> Does anybody have good ideas for nice, robust ways of writing a bioperl
> >> IO module for really big input XML files? There don't seem to be any
> >> perl pull parsers for XML. All I've dug up so far would be having the
> >> XML push parser running in a different thread or process, pushing chunks
> >> of data into a pipe or similar structure that blocks the progress of the
> >> push parser until the pulling bioperl code wants the next piece of data,
> >> but there are plenty of ugly issues with that, whether one were to use
> >> perl threads for it (aaagh!) or fork and push some kind of intermediate
> >> format through a pipe or socket between the two processes (eek!).
> >>
> >> So, um, if you've read this far, do you have any ideas?
> >>
> >> Rob
> >>
> >> --
> >> Robert Buels
> >> SGN Bioinformatics Analyst
> >> 252A Emerson Hall, Cornell University
> >> Ithaca, NY 14853
> >> Tel: 503-889-8539
> >> rmb32 at cornell.edu
> >> http://www.sgn.cornell.edu
> >>
> >> _______________________________________________
> >> Bioperl-l mailing list
> >> Bioperl-l at lists.open-bio.org
> >> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >
> --
> Robert Buels
> SGN Bioinformatics Analyst
> 252A Emerson Hall, Cornell University
> Ithaca, NY 14853
> Tel: 503-889-8539
> rmb32 at cornell.edu
> http://www.sgn.cornell.edu

From cjfields at uiuc.edu Wed Jul 19 17:46:43 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Wed, 19 Jul 2006 16:46:43 -0500
Subject: [Bioperl-l] EMBL release 87 format changes.
In-Reply-To:
Message-ID: <000601c6ab7c$d39d8cd0$15327e82@pyrimidine>

You can go ahead and submit the patch to Bugzilla anyway. Comments about the proposed changes from the developers can be added there.

I think there's some confusion here, though: the EMBL SeqIO change you mentioned I committed is actually for Bio::SeqIO::swiss (SwissProt). I haven't touched Bio::SeqIO::embl (yet).
'swiss' format now reads old and new swiss data files and writes only new format; no major changes have been made to SeqIO::embl in about a year (and even that was a small one).

Chris

From stewarta at nmrc.navy.mil Wed Jul 19 18:00:26 2006
From: stewarta at nmrc.navy.mil (Andrew Stewart)
Date: Wed, 19 Jul 2006 18:00:26 -0400
Subject: [Bioperl-l] #bioperl
Message-ID:

Wandering about the new bioperl.org page, I noticed that there's never really been much mention of starting up a bioperl chat channel on IRC for casual bioperl discussion and support. This has worked really well for projects like MediaWiki, etc. I'll sit on the channel for awhile and maybe we can see if the idea picks up.

Point your favorite IRC client to... (windows users I would suggest mIRC, mac I would suggest Colloquy)

server: irc.freenode.net
channel: #bioperl

Hope to see you there.
--
Andrew Stewart
Research Assistant, Genomics Team
Navy Medical Research Center (NMRC)
Biological Defense Research Directorate (BDRD)
BDRD Annex
12300 Washington Avenue, 2nd Floor
Rockville, MD 20852
email: stewarta at nmrc.navy.mil
phone: 301-231-6700 Ext 270

From rmb32 at cornell.edu Wed Jul 19 18:40:52 2006
From: rmb32 at cornell.edu (Robert Buels)
Date: Wed, 19 Jul 2006 15:40:52 -0700
Subject: [Bioperl-l] bioperl pulls, xml parsers push, and things get complicated
In-Reply-To: <002801c6ab40$7cfcd980$15327e82@pyrimidine>
References: <002801c6ab40$7cfcd980$15327e82@pyrimidine>
Message-ID: <44BEB4F4.1060407@cornell.edu>

Hi Chris,

It seems to me the SearchIO framework isn't really appropriate for genomethreader, since it's more of a gene prediction program than a search/alignment program.

Also, w.r.t. XML parsing and buffering, I don't see how Bio::SearchIO is fundamentally different from the other bioperl IO systems; it still has a next_this(), next_that() interface, which means lots of buffering memory if you're doing your actual parsing with a push parser (or a tree parser, of course, which is buffering an expanded form of the entire document). It looks like it just adds another layer of method calls for parser events, allowing the SearchIO to make different kinds of objects and stuff. It looks like none of this changes the fact that these are all push parsers, and bioperl pulls, so you have to buffer a lot of stuff.

I guess the only really general strategies for reducing the buffering are a.) to break up the XML with regexps and such, like Hilmar said, or b.) to put your push parser in another process, and somehow keep it blocking in one of its callbacks until you're ready for its next data.

I think what I'll do with the gthxml parser is find a way to split the input XML into chunks and run a parser separately on each, like Hilmar said. If more performance is needed, maybe a multi-process approach would be appropriate, but not yet.

Anyway, looking at blastxml, I have some ruminations, which fill the rest of this email:

Looking at SearchIO::blastxml, it looks like it's already using XML::SAX, which will use XML::SAX::ExpatXS if installed. Is that recent? Is blastxml faster when using the tempfile option than when putting the whole report in a string in memory?

If you're looking for speed gains, have you tried running some kind of profiling on it? Whenever one is out to optimize code, profiling should be stop number one. Almost every time, you will be surprised at what parts of the code are actually eating up the most time. Here's a perl profiling intro: http://perl.com/pub/a/2004/06/25/profiling.html . The profiling mechanism talked about in that article is kind of old; there are also a bunch of newer code profiling tools available on CPAN. I haven't used any of them though. But yeah, I can't emphasize enough the importance of profiling if you're trying to optimize for speed.

As for memory, the blastxml parser suffers from the same handicap I was pondering at the start of this thread. To see what I mean, think of what would happen if there were somehow 10 million HSPs in one of the reports. It's buffering all of them before returning each result, and your machine could melt. :-) Things would be beautiful (and fast, probably) if next_hsp() would actually parse the next HSP in the report instead of just returning a HSP object that's sitting in memory. But there's not really anything that can be done about that, I don't think.

One nice thing: the blastxml parser's memory footprint doesn't really suffer if you have 100,000 blast reports in your input file, because it splits out the reports and parses each one individually. This I think is a good illustration of what Hilmar was talking about: breaking the input XML into chunks cuts down on the amount of buffering you have to do.
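A minimal sketch of the chunk-splitting idea described above, in plain Perl. It is only an illustration under stated assumptions, not how SearchIO::blastxml or any gthxml parser is actually written: the helper name make_chunk_reader and the element name "record" are hypothetical, and the approach assumes the record element is never nested inside itself and that its closing tag never appears inside CDATA or comments. It relies only on Perl's input record separator ($/), so each call reads just one record from the filehandle.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a closure that hands back one
# <$tag>...</$tag> chunk per call, so only one record is in memory at a time.
sub make_chunk_reader {
    my ($fh, $tag) = @_;
    my $close = "</$tag>";
    return sub {
        local $/ = $close;    # read up to and including the closing tag
        while (defined(my $chunk = <$fh>)) {
            # Drop anything before the record's opening tag (prolog, wrapper).
            next unless $chunk =~ s/\A.*?(?=<\Q$tag\E[\s>])//s;
            # Skip trailing data that does not end with a complete record.
            next unless $chunk =~ /\Q$close\E\z/;
            return $chunk;
        }
        return;               # no more records
    };
}

my $xml = '<?xml version="1.0"?><doc><record id="1">a</record>'
        . '<record id="2">b</record></doc>';
open my $fh, '<', \$xml or die "cannot open in-memory handle: $!";

my $next_record = make_chunk_reader($fh, 'record');
while (defined(my $chunk = $next_record->())) {
    # Each chunk is a small, complete element that could be handed to a
    # push parser (SAX, XML::Twig, ...) from inside a next_*() method.
    print "$chunk\n";
}
```

The closure gives the caller a pull-style interface: the push parser only ever sees one small chunk, so buffering is bounded by the size of a single record rather than the whole file.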
As XML parsers go, I kind of like XML::Twig, because it manages to combine most of the easy use of a DOM/tree parser with the better memory usage and speed of a push parser (like SAX and XML::Parser). Within a parser callback, you have a DOM-like tree that's just the part of your XML document you're interested in at that time, and then you free that structure when you're done picking things out of it. I'm not sure how fast it is, though, probably not as fast as ExpatXS. At any rate, it is definitely a lot more intuitive to use than a more standard push parser, since if you make good choices about what elements to use as the roots of your twigs, you can often do your processing on a self-contained chunk and not have to keep track of a bunch of parse state like you typically need with a straight push parser like XML::Parser or a SAX parser.

Rob

Chris Fields wrote:
> The Bio::SearchIO modules are supposed to work like a SAX parser, where results
> are returned as the report is parsed b/c of the occurrence of specific
> 'events' (start_element, end_element, and so on). However, the actual
> behaviour for each module changes depending on the report type and the
> author's intention.
>
> There was a thread about a month ago on HMMPFAM report parsing where there
> was some contention as to how to build hits(models)/HSPs(domains). HMMPFAM
> output has one HSP per hit and is sorted on the sequence length so a
> particular hit can appear more than once, depending on how many times it
> hits along the sequence length itself. So, to gather all the HSPs together
> under one hit you would have to parse the entire report and build up a
> Hit/HSP tree, then use the next_hit/next_hsp iterators to parse through
> everything. Currently it just reports Hit/HSP pairs and it is up to the
> user to build that tree.
>
> In contrast, BLAST output should be capable of throwing hit/HSP clusters on
> the fly based on the report output, but is quite slow (even the XML output
> crawls). Jason thinks it's b/c of object inheritance and instantiation; I
> think it's probably more complicated than that (there are a ton of method
> calls which tend to slow things down quite a bit as well).
>
> I would say try using SearchIO, but instead of relying directly on object
> handler calls to create Hit/HSP objects using an object factory (which is
> where I think a majority of the speed is lost), build the data internally on
> the fly using start_element/end_element, then return hashes instead based on
> the element type triggered using end_element.
>
> As an aside, I'm trying to switch the SearchIO::blastxml over to XML::SAX
> (using XML::SAX::ExpatXS/expat) and plan on switching it over to using
> hashes at some point, possibly starting off with a different SearchIO plugin
> module. If you have other suggestions (XML parser of choice, ways to speed
> up parsing/retrieve data) we would be glad to hear them.
>
> Chris

--
Robert Buels
SGN Bioinformatics Analyst
252A Emerson Hall, Cornell University
Ithaca, NY 14853
Tel: 503-889-8539
rmb32 at cornell.edu
http://www.sgn.cornell.edu

From skirov at utk.edu Wed Jul 19 17:54:03 2006
From: skirov at utk.edu (Stefan Kirov)
Date: Wed, 19 Jul 2006 17:54:03 -0400
Subject: [Bioperl-l] Accessing TRANSFAC matrices
In-Reply-To: <20060717145000.0aig7ymcmurk4wsk@webmail.embl.de>
References: <20060717145000.0aig7ymcmurk4wsk@webmail.embl.de>
Message-ID: <44BEA9FB.1070009@utk.edu>

I have nothing to do with TFBS (except for using it). I suggest you contact Boris Lenhard who is behind TFBS. Please also send bioperl questions to the list.

Finally, I believe TRANSFAC does not distribute the data files anymore. However, if you find out this is not the case, please let me know.

Stefan

ong at embl.de wrote:
>Hi,
>
> Good day, I am trying to retrieve TRANSFAC matrices via TFBS Perl module, but
>it happens that about 50 matrices are missing after M00359, do you have any idea?
>Also I wish to try using the Bio::Matrix::PSM::IO object, but can you advise how
>do I get the matrix.dat which is a transfac file?
>
> Thanks and hope to hear from you soon.
>
>Regards,
>Ong
>

From bix at sendu.me.uk Thu Jul 20 02:49:45 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Thu, 20 Jul 2006 07:49:45 +0100
Subject: [Bioperl-l] Accessing TRANSFAC matrices
In-Reply-To: <44BEA9FB.1070009@utk.edu>
References: <20060717145000.0aig7ymcmurk4wsk@webmail.embl.de> <44BEA9FB.1070009@utk.edu>
Message-ID: <44BF2789.1090204@sendu.me.uk>

Stefan Kirov wrote:
> Finally, I believe TRANSFAC does not distribute the data files anymore.
> However, if you find out this is not the case, please let me know.

They get distributed as Transfac 'Pro', for which you need a license (money).

> ong at embl.de wrote:
>> good day, i am trying to retrieve TRANSFAC matrices via TFBS Perl module, but
>> it happens that about 50 matrices are missing after M00359 do you have any idea?

What is meant by this? Missing from where? At the least, M00360 is accessible via the website (public database).

>> Also i wish to try using the Bio::Matrix::PSM::IO object, but can you advise how
>> do i get the matrix.dat which is a transfac file?

http://www.biobase-international.com/pages/index.php?id=174

From dhoworth at mrc-lmb.cam.ac.uk Thu Jul 20 05:19:22 2006
From: dhoworth at mrc-lmb.cam.ac.uk (Dave Howorth)
Date: Thu, 20 Jul 2006 10:19:22 +0100
Subject: [Bioperl-l] bioperl pulls, xml parsers push, and things get complicated
In-Reply-To: <13edac5b13ed8208.13ed820813edac5b@emich.edu>
References: <13edac5b13ed8208.13ed820813edac5b@emich.edu>
Message-ID: <44BF4A9A.60100@mrc-lmb.cam.ac.uk>

Stephen Gordon Lenk wrote:
> I have found that POE fails to execute a periodic task after 32
> iterations in a Perl thread, consistent failure on both XP and OSX -
> if I knew how to write up a defect for Perl I would do this (hint ?
> how is this done - I'm *not* asking RTFM etc)

Generally:

Go to http://search.cpan.org and search for the module (POE).
Click on the distribution link, rather than the doc link (i.e. POE-0.3502, which takes you to http://search.cpan.org/~rcaputo/POE-0.3502/).
Click on the View/Report Bugs link.
Check through the existing bugs and if it's not there click on the Report a new bug link.

Cheers, Dave

From georg.otto at tuebingen.mpg.de Thu Jul 20 06:53:53 2006
From: georg.otto at tuebingen.mpg.de (Georg Otto)
Date: Thu, 20 Jul 2006 12:53:53 +0200
Subject: [Bioperl-l] Features in SeqIO GenBank output
Message-ID:

Hi,

this is probably a FAQ but I could not find anything to solve it.

I want to get sequences from GenBank and save them in GenBank format. This works with the script shown below, but the "Features" part is missing and contains references instead (see below). How can I print out the complete GenBank entry?

I am running Bioperl 1.5, Perl 5.8.6, Mac 10.4.7

Best,

Georg

Here is my script:

use strict;
use warnings;

use Bio::Seq;
use Bio::SeqIO;
use Bio::DB::GenBank;

my $acc = 'AB017118';
my $db_obj = Bio::DB::GenBank->new();
my $seq_obj = $db_obj->get_Seq_by_acc($acc);
my $out = Bio::SeqIO->new(-format => 'genbank',
                          -file => '>output.gb');
$out->write_seq($seq_obj);

Here is the output:

LOCUS       AB017118                2038 bp    mRNA    linear   VRT 06-JUN-2006
DEFINITION  Danio rerio mRNA for ornithine decarboxylase antizyme long
            isoform, complete cds.
ACCESSION   AB017118
VERSION     AB017118.1  GI:4239978
KEYWORDS    .
SOURCE      Danio rerio (zebrafish)
  ORGANISM  Danio rerio
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Actinopterygii; Neopterygii; Teleostei; Ostariophysi;
            Cypriniformes; Cyprinidae; Danio.
REFERENCE   1
  AUTHORS   Saito,T., Hascilowicz,T., Ohkido,I., Kikuchi,Y., Okamoto,H.,
            Hayashi,S., Murakami,Y. and Matsufuji,S.
  TITLE     Two zebrafish (Danio rerio) antizymes with different expression
            and activities
  JOURNAL   Biochem. J. 345 PT 1, 99-106 (2000)
  PUBMED    10600644
REFERENCE   2  (bases 1 to 2038)
  AUTHORS   Matsufuji,S. and Saito,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (23-AUG-1998) Senya Matsufuji, Jikei University School
            of Medicine, Department of Biochemistry II; 3-25-8 Nishishinbashi,
            Minato-ku, Tokyo 105-8461, Japan (E-mail:senya at jikei.ac.jp,
            Tel:+81-3-3433-1111(ex.2276), Fax:+81-3-3436-3897)
FEATURES             Location/Qualifiers
     source          1..2038
                     /db_xref="Bio::Annotation::SimpleValue=HASH(0x19b9a28)"
                     /mol_type="Bio::Annotation::SimpleValue=HASH(0x19b9b6c)"
                     /dev_stage="Bio::Annotation::SimpleValue=HASH(0x19b9bb4)"
                     /organism="Bio::Annotation::SimpleValue=HASH(0x19bfe18)"
                     /clone_lib="Bio::Annotation::SimpleValue=HASH(0x19bfe60)"
     CDS             join(45..224,226..702)
                     /db_xref="Bio::Annotation::SimpleValue=HASH(0x19c0960)"
                     /ribosomal_slippage="Bio::Annotation::SimpleValue=HASH(0x1
                     9beecc)"
                     /codon_start=Bio::Annotation::SimpleValue=HASH(0x19bef14)
                     /protein_id="Bio::Annotation::SimpleValue=HASH(0x19bef5c)"
                     /translation="Bio::Annotation::SimpleValue=HASH(0x19befa4)
                     "
                     /product="Bio::Annotation::SimpleValue=HASH(0x19befec)"
                     /note="Bio::Annotation::SimpleValue=HASH(0x19bf034)"
     CDS             45..227
                     /db_xref="Bio::Annotation::SimpleValue=HASH(0x19bee24)"
                     /codon_start=Bio::Annotation::SimpleValue=HASH(0x19bf160)
                     /protein_id="Bio::Annotation::SimpleValue=HASH(0x19bf1cc)"
                     /translation="Bio::Annotation::SimpleValue=HASH(0x19c1830)
                     "
                     /note="Bio::Annotation::SimpleValue=HASH(0x19c1878)"
     polyA_signal    2017..2022
     polyA_site      2038
                     /note="Bio::Annotation::SimpleValue=HASH(0x19bffc8)"
BASE COUNT      439 a    377 c    532 g    690 t
ORIGIN
        1 cagcagccga gcgcacaggc cgccgtgaaa cctcccgagg ccggatggta aaatccaacc

     1981 ttatcctcta tagtggtaca cctttgcttc tgtcataata aaaccattat ttaaagac
//

From cjfields at uiuc.edu Thu Jul 20 08:43:08 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 20 Jul 2006 07:43:08 -0500
Subject: [Bioperl-l] Features in SeqIO GenBank output
In-Reply-To:
References:
Message-ID: <73C89D17-91FE-47E4-80C1-AA6A689FA14E@uiuc.edu>

I'll give it a look. You might try upgrading to Bioperl 1.5.1 to see if this was fixed.

Chris

Christopher Fields
Postdoctoral Researcher
Lab of Dr.
Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From bix at sendu.me.uk Thu Jul 20 09:35:43 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 20 Jul 2006 14:35:43 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BBBB69.6000906@sendu.me.uk> References: <44BBBB69.6000906@sendu.me.uk> Message-ID: <44BF86AF.8080408@sendu.me.uk> Sendu Bala wrote: > node 2 has name 'Bacteria ' and rank 'superkingdom' > node 1386 has name 'Bacillus ' and rank 'genus' > node 7776 has name 'Gnathostomata ' and rank 'superclass' > etc. > > For me the bits in <> are inappropriate and shouldn't be there. > [...] > If there are no objections I'll strip the <> bits. I also plan to make > $node->name('scientific', 'sapiens'); set and get the node name, and > have flatfile and entrez store all common names with > $obj->name('common', 'human', 'man');. I'll describe all the changes I've now made and if no-one complains I'll commit. (I've also made these notes into bug 2047 for easier reference in the future.) Bio::DB::Taxonomy::flatfile --------------------------- # Bug-fixes Removed invalid requirement that all species nodes have at least 7 named-rank parents. The names->id solution used by get_taxonid() only stored that last id associated with a name. However the name used wasn't necessarily unique, such that multiple ids could match. names->id solution now remembers all ids that match a name. API-CHANGE: for this reason I've renamed get_taxonid() to get_taxonids() and it returns an array of ids in list context. For backward compatibility it returns one of the ids in scalar context, and *get_taxonid = \&get_taxonids. Added missing division ENV 'Environmental samples'. # Improvements Like Bio::DB::Taxonomy::entrez, flatfile now retrieves and stores the common names, genetic code and mitochondrial genetic code in each node it makes. 
NOTE: entrez also stores creation, publication and update dates, but this data is not available in the taxdump from NCBI ftp site. NOTE: the common names are stored in no particular order; the genbank common name in particular isn't necessarily the first in the list (cf. old entrez.pm behaviour). BEHAVIOUR-CHANGE: flatfile used to store within the nodes it makes the division as a three letter code, like 'PRI'. However, for consistency with entrez and the scientific_name() of the node the division is supposed to correspond to, it is now stored as the full name, like 'Primates'. The names->id solution also stores the artificially uniqued names like 'Craniata ', allowing you for the first time to retrieve the correct id. Previously the search would have simply failed completely. The names->id solution now handles nodes with scientific names of 'xyz (class)', allowing you to retrieve the id with both get_taxonids('xyz') and get_taxonids('xyz (class)'). Previously only the latter would work. NOTE: the previous 2 changes (and the issues with entrez, see below) make flatfile better at searching the taxonomy database than entrez module or the website, both in terms of speed and completeness of results. BEHAVIOUR-CHANGE: The scientific name field isn't touched in any way, always being sent directly to Bio::Taxonomy::Node->new(-name => $untouched) or the $node->classification() array. Previously, a species node would have its name converted from 'Homo sapiens' to 'sapiens', but the conversion mangled very badly certain other species names. Bio::DB::Taxonomy::entrez ------------------------- # Bug-fixes Special characters like ", ( and ) in the input query string to get_taxonid() result in the failure or inaccuracy of the search. These characters are now removed prior to submission, allowing for correct search results. 
API-CHANGE: entrez has always been able to return multiple ids that match a single input name, so I've renamed get_taxonid() to get_taxonids() and it returns an array of ids in list context. It returns one of the ids in scalar context. For backward compatibility, *get_taxonid = \&get_taxonids. NOTE: entrez modules (and website) cannot cope with '' in the query, failing searches like 'Craniata '. For this reason, if get_taxonids() is given a query with '' it will immediately return undefined, saving a pointless website access. If you want the id of 'Craniata ' you must search for 'Craniata', then get the node for each returned id to see which one has a parent node with a scientific_name() or common_names() case-insensitive matching to 'chordata'. # Improvements BEHAVIOUR-CHANGE: now throws on failure to retrieve data from website. BEHAVIOUR-CHANGE: the ScientificName field isn't touched except for s/ \(class\)$//, being sent directly to Bio::Taxonomy::Node->new(-name => $untouched) or the $node->classification() array. Previously, a species node would have its name converted from 'Homo sapiens' to 'sapiens', but the conversion mangled very badly certain other species names. BEHAVIOUR-CHANGE: all common names of a node are now stored in the resulting Node object with Bio::Taxonomy::Node->new(-common_names => \@names). This means that the Genbank common name is now just one amongst others, and isn't guaranteed to be the first in the list either. Bio::Taxonomy::Node ------------------- # Bug-fixes non-interesting fixes to get get_Children_Nodes(), get_Lineage_Nodes() and get_LCA_Node() to work correctly. classification() has a proper solution to finding the classification when the array wasn't manually set. # Improvements BEHAVIOUR-CHANGE: node_name() used to be an alias to name('common'). Now it is an alias to name('scientific'). 
NOTE: node_name is what is set when ->new(-name => $name) is set, so flatfile and entrez and user-created nodes now implicitly associate the name of the node they create with its scientific name.

BEHAVIOUR-CHANGE: scientific_name() used to be an alias to binomial(). Now it is *scientific_name = \&node_name.

binomial(), in addition to working the old way (assume first two elements of classification array are species and genus, combine them), will shortcut and return the scientific_name() if we are a node with rank 'species' and scientific_name is two words. This makes binomial() an effective synonym of scientific_name() when Nodes were constructed as per flatfile or entrez, and when it is used correctly on a species node.

BEHAVIOUR-CHANGE: *parent_taxon_id = \&parent_id. (Previously, you could assign and retrieve different values to/from each method.)

New method common_names() supersedes common_name(), returning a list of all common_names. For backward compatibility, returns one of the names in scalar context, and *common_name = \&common_names.

-factory and factory() removed, since there is no Bio::Taxonomy::FactoryI-implementing module, nothing in Node to make use of a factory once set, and a factory seems redundant when we're a node with a -dbh.

species() and genus() issue a warning when you try to use them on a node that isn't of rank 'species' (since they interact with the classification array and not names('method') like the other similar methods).

validate_name() removed because it just returns 1.

validate_species_name() removed because species() can (should) now contain the real species name, like 'Homo sapiens', not 'sapiens'. But it could also be any wonderfully complex thing, so there's nothing we can confidently check for as being 'correct'.

t/Taxonomy.t
------------
Runs a slightly more comprehensive set of tests on entrez, which are now only skipped if data retrieval fails. Tests flatfile on a cut-down version of the taxdump.
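
The list/scalar behaviour and the typeglob alias used for backward compatibility (as in *common_name = \&common_names) can be sketched in plain Perl. This is an illustration only: My::Node is a hypothetical stand-in class, not Bio::Taxonomy::Node, and returning the first name in scalar context is an assumption (the changelog only promises "one of the names").

```perl
use strict;
use warnings;

package My::Node;    # hypothetical stand-in, NOT Bio::Taxonomy::Node

sub new {
    my ($class, %args) = @_;
    return bless { common_names => $args{-common_names} || [] }, $class;
}

# List context: all names. Scalar context: one of them
# (the first here, which is an assumption of this sketch).
sub common_names {
    my ($self) = @_;
    my @names = @{ $self->{common_names} };
    return wantarray ? @names : $names[0];
}

# Backward-compatible alias: the old method name now points at the
# very same subroutine, so both names always behave identically.
{
    no warnings 'once';
    *common_name = \&common_names;
}

package main;

my $node = My::Node->new(-common_names => ['human', 'man']);
my @all  = $node->common_names;    # list context
my $one  = $node->common_name;     # scalar context, via the old name

print "all: @all\n";
print "one: $one\n";
```

Because the alias shares one code reference, old callers that wrote my $name = $node->common_name keep working, while new callers can ask for the whole list.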
> I'll also fix the problem with node names for ranks species and lower, > as discussed in thread 'Bio::DB::Taxonomy:: mishandles species, > subspecies/variant names', in the way I suggested there. This hasn't been done per se, because we now store the real ScientificName so there is no 'mishandling' to fix. From bix at sendu.me.uk Thu Jul 20 09:49:04 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 20 Jul 2006 14:49:04 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BF86AF.8080408@sendu.me.uk> References: <44BBBB69.6000906@sendu.me.uk> <44BF86AF.8080408@sendu.me.uk> Message-ID: <44BF89D0.7090103@sendu.me.uk> Sendu Bala wrote: > > Bio::DB::Taxonomy::flatfile > > BEHAVIOUR-CHANGE: The scientific name field isn't touched in any way, > always being sent directly to Bio::Taxonomy::Node->new(-name => > $untouched) or the $node->classification() array. Previously, a species > node would have its name converted from 'Homo sapiens' to 'sapiens', but > the conversion mangled very badly certain other species names. [...] > Bio::DB::Taxonomy::entrez > > BEHAVIOUR-CHANGE: the ScientificName field isn't touched except for s/ > \(class\)$//, being sent directly to Bio::Taxonomy::Node->new(-name => > $untouched) or the $node->classification() array. Previously, a species > node would have its name converted from 'Homo sapiens' to 'sapiens', but > the conversion mangled very badly certain other species names. Oops. In both cases the scientific name has ' (class)' removed from it, but the original name (with ' (class)') is stored as one of the common names. From georg.otto at tuebingen.mpg.de Thu Jul 20 10:29:33 2006 From: georg.otto at tuebingen.mpg.de (Georg Otto) Date: Thu, 20 Jul 2006 16:29:33 +0200 Subject: [Bioperl-l] Features in SeqIO GenBank output References: <73C89D17-91FE-47E4-80C1-AA6A689FA14E@uiuc.edu> Message-ID: This indeed seems to be the case. After upgrading it works fine. Sorry for stealing your time. 
Georg Chris Fields writes: > I'll give it a look. You might try upgrading to Bioperl 1.5.1 to see > if this was fixed. > > Chris > > On Jul 20, 2006, at 5:53 AM, Georg Otto wrote: > >> >> Hi, >> >> this is probably a FAQ but I could not find anything to solve it. >> >> I want to get sequences from GenBank and save them in GenBank >> format. This works with the script shown below, but the "Features" >> part is missing and contains references instead (see below). How can I >> print out the complete GenBank entry? >> >> I am running Bioperl 1.5, Perl 5.8.6, Mac 10.4.7 >> >> Best, >> >> Georg >> >> >> >> Here is my script: >> >> use strict; >> use warnings; >> >> use Bio::Seq; >> use Bio::SeqIO; >> use Bio::DB::GenBank; >> >> >> my $acc = 'AB017118'; >> my $db_obj = Bio::DB::GenBank->new(); >> my $seq_obj = $db_obj-> get_Seq_by_acc($acc); >> my $out = Bio::SeqIO->new(-format => 'genbank', >> -file => '>output.gb'); >> $out->write_seq($seq_obj); >> >> >> >> Here is the output: >> >> LOCUS AB017118 2038 bp mRNA linear VRT >> 06-JUN-2006 >> DEFINITION Danio rerio mRNA for ornithine decarboxylase antizyme long >> isoform, complete cds. >> ACCESSION AB017118 >> VERSION AB017118.1 GI:4239978 >> KEYWORDS . >> SOURCE Danio rerio (zebrafish) >> ORGANISM Danio rerio >> Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; >> Euteleostomi; >> Actinopterygii; Neopterygii; Teleostei; Ostariophysi; >> Cypriniformes; Cyprinidae; Danio. >> REFERENCE 1 >> AUTHORS Saito,T., Hascilowicz,T., Ohkido,I., Kikuchi,Y., >> Okamoto,H., >> Hayashi,S., Murakami,Y. and Matsufuji,S. >> TITLE Two zebrafish (Danio rerio) antizymes with different >> expression >> and activities >> JOURNAL Biochem. J. 345 PT 1, 99-106 (2000) >> PUBMED 10600644 >> REFERENCE 2 (bases 1 to 2038) >> AUTHORS Matsufuji,S. and Saito,T. 
>> TITLE Direct Submission >> JOURNAL Submitted (23-AUG-1998) Senya Matsufuji, Jikei >> University School >> of Medicine, Department of Biochemistry II; 3-25-8 >> Nishishinbashi, >> Minato-ku, Tokyo 105-8461, Japan (E- >> mail:senya at jikei.ac.jp, >> Tel:+81-3-3433-1111(ex.2276), Fax:+81-3-3436-3897) >> FEATURES Location/Qualifiers >> source 1..2038 >> /db_xref="Bio::Annotation::SimpleValue=HASH >> (0x19b9a28)" >> /mol_type="Bio::Annotation::SimpleValue=HASH >> (0x19b9b6c)" >> /dev_stage="Bio::Annotation::SimpleValue=HASH >> (0x19b9bb4)" >> /organism="Bio::Annotation::SimpleValue=HASH >> (0x19bfe18)" >> /clone_lib="Bio::Annotation::SimpleValue=HASH >> (0x19bfe60)" >> CDS join(45..224,226..702) >> /db_xref="Bio::Annotation::SimpleValue=HASH >> (0x19c0960)" >> / >> ribosomal_slippage="Bio::Annotation::SimpleValue=HASH(0x1 >> 9beecc)" >> /codon_start=Bio::Annotation::SimpleValue=HASH >> (0x19bef14) >> /protein_id="Bio::Annotation::SimpleValue=HASH >> (0x19bef5c)" >> /translation="Bio::Annotation::SimpleValue=HASH >> (0x19befa4) >> " >> /product="Bio::Annotation::SimpleValue=HASH >> (0x19befec)" >> /note="Bio::Annotation::SimpleValue=HASH >> (0x19bf034)" >> CDS 45..227 >> /db_xref="Bio::Annotation::SimpleValue=HASH >> (0x19bee24)" >> /codon_start=Bio::Annotation::SimpleValue=HASH >> (0x19bf160) >> /protein_id="Bio::Annotation::SimpleValue=HASH >> (0x19bf1cc)" >> /translation="Bio::Annotation::SimpleValue=HASH >> (0x19c1830) >> " >> /note="Bio::Annotation::SimpleValue=HASH >> (0x19c1878)" >> polyA_signal 2017..2022 >> polyA_site 2038 >> /note="Bio::Annotation::SimpleValue=HASH >> (0x19bffc8)" >> BASE COUNT 439 a 377 c 532 g 690 t >> ORIGIN >> 1 cagcagccga gcgcacaggc cgccgtgaaa cctcccgagg ccggatggta >> aaatccaacc >> >> >> >> >> 1981 ttatcctcta tagtggtaca cctttgcttc tgtcataata aaaccattat >> ttaaagac >> // >> >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> 
http://lists.open-bio.org/mailman/listinfo/bioperl-l > > Christopher Fields > Postdoctoral Researcher > Lab of Dr. Robert Switzer > Dept of Biochemistry > University of Illinois Urbana-Champaign From prabubio at gmail.com Thu Jul 20 12:01:35 2006 From: prabubio at gmail.com (Prabu R) Date: Thu, 20 Jul 2006 21:31:35 +0530 Subject: [Bioperl-l] Blast Output Parsing Message-ID: Dear All! I am now trying to parse a Blast output using PERL. I have to extract each alignment and have to parse the alignment. I mean, I have to check whether a particular part of the given sequence got aligned 100%. Anybody please tell me what module in PERL I have to use for getting this. I've tried Bio::SearchIO. But I didnt get any method to get the alignment. Kindly help. Thanks, R. Prabu From cjfields at uiuc.edu Thu Jul 20 13:03:17 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 20 Jul 2006 12:03:17 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BF86AF.8080408@sendu.me.uk> Message-ID: <002901c6ac1e$66ea3820$15327e82@pyrimidine> These all seem fine to me. Fantastic work! I added some comments but everything seems fine to me. I still plan on switching Bio::DB::Taxonomy::entrez to use Bio::DB::EUtilities at some point but probably won't get around to it until August; I still need to write up tests for the EUtilities modules. I may add a method for retrieving tax data based on protein/nucleotide sequence primary ID and relevant sequence database, so you could directly retrieve the relevant TaxID w/o parsing sequences directly for them. This would mainly be useful if you gather GIs from a BLAST search, for instance. Anyway, I could add this in then base class Bio::DB::Taxonomy directly so one could used the retrieved TaxIDs for flat-file or entrez searches; this requires, of course, access to the remote Entrez database (it would use ELink). Would that be of interest? If so, I'll work on that and add relevant tests to Taxonomy.t when I can. 
> Bio::DB::Taxonomy::flatfile > --------------------------- ... > API-CHANGE: for this reason I've renamed get_taxonid() to get_taxonids() > and it returns an array of ids in list context. For backward > compatibility it returns one of the ids in scalar context, and > *get_taxonid = \&get_taxonids. Returning a scalar makes sense as long as its noted in the POD. I have seen similar methods return an array ref based on wantarray instead of a scalar, but that largely depends on the complexity of the array (an array of hashes, for instance). ... > Bio::DB::Taxonomy::entrez > ------------------------- ... > NOTE: entrez modules (and website) cannot cope with '' in the > query, failing searches like 'Craniata '. For this reason, if > get_taxonids() is given a query with '' it will immediately > return undefined, saving a pointless website access. If you want the id > of 'Craniata ' you must search for 'Craniata', then get the > node for each returned id to see which one has a parent node with a > scientific_name() or common_names() case-insensitive matching to > 'chordata'. It may be something with the esearch interface, though the direct TaxBrowser query also seems to have problems with this: http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/ I'll try looking into it to see if there is a more direct way to get those (there probably isn't). > # Improvements > BEHAVIOUR-CHANGE: now throws on failure to retrieve data from website. > > BEHAVIOUR-CHANGE: the ScientificName field isn't touched except for s/ > \(class\)$//, being sent directly to Bio::Taxonomy::Node->new(-name => > $untouched) or the $node->classification() array. Previously, a species > node would have its name converted from 'Homo sapiens' to 'sapiens', but > the conversion mangled very badly certain other species names. This actually relates to the similar comment made for Bio::DB::Taxonomy::flatfle. The mangling probably depends on the current node and whether using flatfile or XML (entrez). 
Most of the odd XML examples I posted before, where the TaxID associated with a sequence had extra data, were a rank of 'no rank'. The species rank, if present, has a normal binomial name for : Flavobacterium johnsoniae UW101 ... Flavobacterium johnsoniae species Pseudomonas putida F1 ... Pseudomonas putida species Caldicellulosiruptor saccharolyticus DSM 8903 ... Caldicellulosiruptor saccharolyticus species The genus rank has one name; the subspecies rank has the full species name with 'subsp.' followed by the subspecies name. So, if using XML, one could use the taxon subelements stored in the XML element to sort out genus(), species(), subspecies(), and also higher order elements if someone wanted to implement them. This, of course, isn't necessary for the current changes, but down the road if anybody wanted it... ... > Bio::Taxonomy::Node > ------------------- ... > species() and genus() issue a warning when you try to use them on a node > that isn't of rank 'species' (since they interact with the > classification array and not names('method') like the other similar > methods). I would just have genus() and species() issue warnings if they aren't set to a particular value. So, if the current node is at the genus rank, genus() will be set but species() won't be. And no need to do additional checking! Fabulous work Sendu! Chris From cjfields at uiuc.edu Thu Jul 20 13:23:14 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 20 Jul 2006 12:23:14 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BF89D0.7090103@sendu.me.uk> Message-ID: <002a01c6ac21$2ed16190$15327e82@pyrimidine> Just thought of something... You had mentioned using a stripped-down version of Bio::Taxonomy::Node previously, which led to a bit of contention. 
One way to make everybody happy would be to create an interface class that contains the basic shared methods (Bio::Taxonomy::NodeI), then have the currently-named Bio::Taxonomy::Node (which could be renamed to Bio::Taxonomy::Species or something similar) implement those methods along with the current methods. Another class (your stripped down version, which could then be Bio::Taxonomy::Node) would also implement whatever base class methods were needed. They would both be Bio::Taxonomy::NodeI-implementing, so you could use either object type where required. |------Node NodeI----| |------Species Another option would be to have Bio::Taxonomy::Node itself stripped down, then have another class (Bio::Taxonomy::Species) inherit methods from it and also implement additional methods (genus(), species(), etc). Node----Species Would something like that be feasible? I favor the interface version as it sticks with the interface-implementation design that Bioperl has been migrating towards: http://www.bioperl.org/wiki/Advanced_BioPerl#Bioperl_Interface_design This would also help out with the whole Bio::Species issue; just have Bio::Taxonomy::Species replace it. Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Sendu Bala > Sent: Thursday, July 20, 2006 8:49 AM > To: bioperl-l at lists.open-bio.org > Subject: Re: [Bioperl-l] Bio::*Taxonomy* changes > > Sendu Bala wrote: > > > > Bio::DB::Taxonomy::flatfile > > > > BEHAVIOUR-CHANGE: The scientific name field isn't touched in any way, > > always being sent directly to Bio::Taxonomy::Node->new(-name => > > $untouched) or the $node->classification() array. Previously, a species > > node would have its name converted from 'Homo sapiens' to 'sapiens', but > > the conversion mangled very badly certain other species names. > [...] 
> > Bio::DB::Taxonomy::entrez > > > > BEHAVIOUR-CHANGE: the ScientificName field isn't touched except for s/ > > \(class\)$//, being sent directly to Bio::Taxonomy::Node->new(-name => > > $untouched) or the $node->classification() array. Previously, a species > > node would have its name converted from 'Homo sapiens' to 'sapiens', but > > the conversion mangled very badly certain other species names. > > Oops. In both cases the scientific name has ' (class)' removed from it, > but the original name (with ' (class)') is stored as one of the common > names. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Thu Jul 20 13:31:42 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 20 Jul 2006 12:31:42 -0500 Subject: [Bioperl-l] Blast Output Parsing In-Reply-To: Message-ID: <002b01c6ac22$5d75e1f0$15327e82@pyrimidine> Grab the HSPs, then use get_aln() to generate a Bio::SimpleAlign object. You can then use Bio::AlignIO to generate the alignment output if needed, or use the Bio::SimpleAlign methods to get what you want. http://www.bioperl.org/wiki/HOWTO:Beginners http://www.bioperl.org/wiki/HOWTO:SearchIO http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/SimpleAlign .html Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Prabu R > Sent: Thursday, July 20, 2006 11:02 AM > To: bioperl-l at lists.open-bio.org > Subject: [Bioperl-l] Blast Output Parsing > > Dear All! > > I am now trying to parse a Blast output using PERL. > > I have to extract each alignment and have to parse the alignment. I mean, > I > have to check whether a particular part of the given sequence got aligned > 100%. > > Anybody please tell me what module in PERL I have to use for getting this. > > I've tried Bio::SearchIO. 
But I didnt get any method to get the > alignment. > > Kindly help. > > Thanks, > R. Prabu > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From bix at sendu.me.uk Thu Jul 20 13:53:03 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 20 Jul 2006 18:53:03 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <002901c6ac1e$66ea3820$15327e82@pyrimidine> References: <002901c6ac1e$66ea3820$15327e82@pyrimidine> Message-ID: <44BFC2FF.3030704@sendu.me.uk> Chris Fields wrote: > > I still plan on switching Bio::DB::Taxonomy::entrez to use > Bio::DB::EUtilities at some point but probably won't get around to it until > August; If I may make two feature requests (you've probably already done them, if so apologies)? a) Automatically enforce the 3second wait rule when querying via the ncbi website. b) Automatically cache results locally in a reasonable way, such that repeated queries aiming to get the same result don't have to go via the website. > Anyway, I could add this in then base class Bio::DB::Taxonomy directly so > one could used the retrieved TaxIDs for flat-file or entrez searches; this > requires, of course, access to the remote Entrez database (it would use > ELink). Would that be of interest? Sorry, I don't really understand this paragraph. I'm unable to parse '...then base class Bio::DB::Taxonomy directly so...', for starters. >> Bio::Taxonomy::Node >> ------------------- > > ... > >> species() and genus() issue a warning when you try to use them on a node >> that isn't of rank 'species' (since they interact with the >> classification array and not names('method') like the other similar >> methods). > > I would just have genus() and species() issue warnings if they aren't set to > a particular value. So, if the current node is at the genus rank, genus() > will be set but species() won't be. And no need to do additional checking! 
The problem is, genus() and species() are special cases that aren't normally directly set. They get their values from the classification array: genus() returns (classification())[1] and species() returns (classification())[0]. They set the same values. Doing this is only sane (though is still likely to be wrong, given that there can be ranks between species and genus) when the node is of rank 'species', hence the warnings. I imagine this is to work with pesky file formats like genbank, so I can't really change anything here without major overhaul. And my plans for overhaul involve getting rid of genus() and species(), so I'll just leave them be for now. Anyway, thanks for your comments and input into this thread! It's much appreciated. From bix at sendu.me.uk Thu Jul 20 13:55:56 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 20 Jul 2006 18:55:56 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <002a01c6ac21$2ed16190$15327e82@pyrimidine> References: <002a01c6ac21$2ed16190$15327e82@pyrimidine> Message-ID: <44BFC3AC.8010704@sendu.me.uk> Chris Fields wrote: > Just thought of something... > > You had mentioned using a stripped-down version of Bio::Taxonomy::Node > previously, which led to a bit of contention. One way to make everybody > happy would be to create an interface class that contains the basic shared > methods (Bio::Taxonomy::NodeI), then have the currently-named > Bio::Taxonomy::Node (which could be renamed to Bio::Taxonomy::Species or > something similar) implement those methods along with the current methods. > Another class (your stripped down version, which could then be > Bio::Taxonomy::Node) would also implement whatever base class methods were > needed. They would both be Bio::Taxonomy::NodeI-implementing, so you could > use either object type where required. > > |------Node > NodeI----| > |------Species [...] 
> I favor the interface version as it
> sticks with the interface-implementation design that Bioperl has been
> migrating towards:
>
> http://www.bioperl.org/wiki/Advanced_BioPerl#Bioperl_Interface_design
>
> This would also help out with the whole Bio::Species issue; just have
> Bio::Taxonomy::Species replace it.

Yes, this sounds good to me. Should I still wait until Jason/elders are able to comment before I start exploring this avenue?

From cjfields at uiuc.edu Thu Jul 20 14:21:48 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 20 Jul 2006 13:21:48 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44BFC3AC.8010704@sendu.me.uk>
Message-ID: <000601c6ac29$5d533a90$15327e82@pyrimidine>

I would say go ahead, why not? This would likely lead to the eventual deprecation of Bio::Species, which was in the cards anyway.

The only problem I can foresee is which class to use with Bio::DB::Taxonomy*? I guess one could settle on one class by default and have the option to use another Bio::Taxonomy::NodeI-implementing class if you wanted more data/methods available...

Chris

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of Sendu Bala
> Sent: Thursday, July 20, 2006 12:56 PM
> To: bioperl-l at bioperl.org
> Subject: Re: [Bioperl-l] Bio::*Taxonomy* changes
>
> Chris Fields wrote:
> > Just thought of something...
> >
> > You had mentioned using a stripped-down version of Bio::Taxonomy::Node
> > previously, which led to a bit of contention. One way to make everybody
> > happy would be to create an interface class that contains the basic shared
> > methods (Bio::Taxonomy::NodeI), then have the currently-named
> > Bio::Taxonomy::Node (which could be renamed to Bio::Taxonomy::Species or
> > something similar) implement those methods along with the current methods.
> > Another class (your stripped down version, which could then be > > Bio::Taxonomy::Node) would also implement whatever base class methods > were > > needed. They would both be Bio::Taxonomy::NodeI-implementing, so you > could > > use either object type where required. > > > > |------Node > > NodeI----| > > |------Species > [...] > > I favor the interface version as it > > sticks with the interface-implementation design that Bioperl has been > > migrating towards: > > > > http://www.bioperl.org/wiki/Advanced_BioPerl#Bioperl_Interface_design > > > > This would also help out with the whole Bio::Species issue; just have > > Bio::Taxonomy::Species replace it. > > Yes, this sounds good to me. Should I still wait until Jason/elders are > able to comment before I start exploring this avenue? > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From hlapp at gmx.net Thu Jul 20 14:24:19 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 20 Jul 2006 14:24:19 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BFC3AC.8010704@sendu.me.uk> References: <002a01c6ac21$2ed16190$15327e82@pyrimidine> <44BFC3AC.8010704@sendu.me.uk> Message-ID: On Jul 20, 2006, at 1:55 PM, Sendu Bala wrote: > > Yes, this sounds good to me. Should I still wait until Jason/elders > are > able to comment before I start exploring this avenue? Unless you're afraid that your suggestions are going too wild for our palate please do go ahead. The joy of CVS is we can always go back. For my part, I just haven't been able to keep up with the flurry of long emails ... 
I'll have to do some extensive bedtime reading (and then writing ;) soon I guess :-)

-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From saunders at uchicago.edu Thu Jul 20 17:47:08 2006
From: saunders at uchicago.edu (Matthew A. Saunders)
Date: Thu, 20 Jul 2006 16:47:08 -0500 (CDT)
Subject: [Bioperl-l] installing bioperl
Message-ID:

Dear Bioperl representative,

I have been trying to install bioperl (in order to ultimately run some Ensembl APIs) but I seem to be having some problems with the bioperl installation.

I have followed the installation directions and I get to the last steps of the "make" process, yet this stage fails with the error message below. Can you possibly tell me what is the problem. I am not sure that I understand the command "make", but I think that it requires that there be a file named "makefile" in the given folder, when I look in my newly formed "bioperl-1.4" folder there is no "makefile" in there. Perhaps that is a problem. If so, how might I rectify the matter?

Thanks!

Matt

*************************************************************
. . .
Enjoy the rest of bioperl, which you can use after going 'make install'

Checking if your kit is complete...
Looks good
/usr/bin/perl: symbol lookup error: /usr/lib64/perl5/5.8.5/x86_64-linux-thread-multi/auto/DB_File/DB_File.so: undefined symbol: db_version
Running make test
Make had some problems, maybe interrupted? Won't test
Running make install
Make had some problems, maybe interrupted? Won't install
***************************************************************

-----------------------------------------------------
Matthew A. Saunders
UNCF-MERCK Postdoctoral Research Fellow

Dept.
of Ecology and Evolution University of Chicago (773)834-3964 Skype: mattsaunders555 http://home.uchicago.edu/~saunders ------------------------------------------------------- From saunders at uchicago.edu Thu Jul 20 18:01:53 2006 From: saunders at uchicago.edu (Matthew A. Saunders) Date: Thu, 20 Jul 2006 17:01:53 -0500 (CDT) Subject: [Bioperl-l] installing bioperl In-Reply-To: References: Message-ID: In continuation to my described problem, I have just installed the bioperl-run file from the .tar.gz format and that was successful through the "perl Makefile.PL" and the "make" & "make test" phases. It is the "bioperl core" file that is still giving me the problems described below. Thanks! Matt ******************************** On Thu, 20 Jul 2006, Matthew A. Saunders wrote: > Dear Bioperl representative, > > I have been trying to install bioperl (in order to ultimately run some > Ensembl APIs) but I seem to be having some problems with the bioperl > installation. > > I have followed the installation directions and I get to the last steps of > the "make" process, yet this stage fails with the error message below. Can > you possibly tell me what is the problem. I am not sure that I understand > the command "make", but I think that it requires that there be a file named > "makefile" in the given folder, when I look in my newly formed "bioperl-1.4" > folder there is no "makefile" in there. Perhaps that is a problem. If so, > how might I rectify the matter? > > Thanks! > > Matt > > > ************************************************************* . . . > Enjoy the rest of bioperl, which you can use after going 'make install' > > Checking if your kit is complete... > Looks good > /usr/bin/perl: symbol lookup error: > /usr/lib64/perl5/5.8.5/x86_64-linux-thread-multi/auto/DB_File/DB_File.so: > undefined symbol: db_version > Running make test > Make had some problems, maybe interrupted? Won't test > Running make install > Make had some problems, maybe interrupted? 
Won't install > *************************************************************** > > > > ----------------------------------------------------- > Matthew A. Saunders > UNCF-MERCK Postdoctoral Research Fellow > > Dept. of Ecology and Evolution > University of Chicago > (773)834-3964 > Skype: mattsaunders555 > http://home.uchicago.edu/~saunders > ------------------------------------------------------- > > From bix at sendu.me.uk Thu Jul 20 18:47:33 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 20 Jul 2006 23:47:33 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <001001c6ac2f$02c8c3f0$15327e82@pyrimidine> References: <001001c6ac2f$02c8c3f0$15327e82@pyrimidine> Message-ID: <44C00805.7090403@sendu.me.uk> Chris Fields wrote: > As for caching, > do you mean caching of the tax information or the sequence ID information? Anything you get from entrez. > Caching of tax information would be great, but how would you go about it? I > can see how it would be easy to have a cache for the flatfile using a local > index, but not so much for XML data retrieved from Entrez (a > continually-appended local file, maybe, with a n accompanying index file?). I didn't actually mean a stored file (but that would be possible with a tied hash or something: DB_File, just like flatfile), but an in-memory one for use during the course of program execution. Stored file would probably be dangerous because you wouldn't know if the data has become stale or not - and checking to see if it wasn't would defeat the point. >> The problem is, genus() and species() are special cases that aren't >> normally directly set. They get their values from the classification >> array: genus() returns (classification())[1] and species() returns >> (classification())[0]. They set the same values. Doing this is only sane >> (though is still likely to be wrong, given that there can be ranks >> between species and genus) when the node is of rank 'species', hence the >> warnings. 
>>
>> I imagine this is to work with pesky file formats like genbank, so I
>> can't really change anything here without major overhaul. And my plans
>> for overhaul involve getting rid of genus() and species(), so I'll just
>> leave them be for now.
>
> This would all depend on where the information came from; if the information
> came from the Entrez XML element data:
> [snip]
>
> The subspecies(), genus(), and species() could all be set from this instead
> of the classification array. The problem lies then with the flatfile data
> and how it would be parsed out, if that's at all possible with the flatfile
> data. If not, I see why you would rather have this return a stripped-down
> Bio::Taxonomy::Node object.
>
> I would have to look at how everything is indexed in
> Bio::DB::Taxonomy::entrez, but I think it's feasible.

entrez already parses through LineageEx to build the classification array. flatfile walks up all the parents to do the same. Having the information isn't the issue. We have the information. The methods genus() and species() need to work with the genbank fileformat, that is the problem.

From MEC at stowers-institute.org Thu Jul 20 18:40:55 2006
From: MEC at stowers-institute.org (Cook, Malcolm)
Date: Thu, 20 Jul 2006 17:40:55 -0500
Subject: [Bioperl-l] Remote Blast - Blast Human Genome
Message-ID:

Rohan,

'snp/human/human_snp' is the database name you need to use to blast into human snp database at NCBI.

See the following document for the full list (which link was provided to me via personal correspondence with NCBI helpdesk). Very useful...

Hmm, looking again, there appear now to be two versions:

http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/blastdblist.html (last updated 2/7/2006)

http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/remote_accessible_blastdblist.html (last updated 5/29/2006)

Neither is linked to by any other document on the internet (google sez) including anywhere else at NCBI. Go figure.
It should be IMHO since this info is nowhere else collected. Of course it may be out of date, but it always has got me through. Good luck Malcolm Cook - mec at stowers-institute.org - 816-926-4449 Database Applications Manager - Bioinformatics Stowers Institute for Medical Research - Kansas City, MO USA >-----Original Message----- >From: bioperl-l-bounces at lists.open-bio.org >[mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Chris Fields >Sent: Monday, July 17, 2006 4:26 PM >To: vrramnar at student.cs.uwaterloo.ca; bioperl-l at lists.open-bio.org >Subject: Re: [Bioperl-l] Remote Blast - Blast Human Genome > >Okay, I think I may know what's going on a little more now >with NCBI's BLAST >interface. Looks like any NCBI BLAST query must use the >default URL and so >must set up to proper GET/PUT commands to retrieve everything >correctly. > >Here's the API description for it all: > >http://www.ncbi.nlm.nih.gov/BLAST/Doc/urlapi.html > >You could try setting the database to 'snp' or something along >those lines >instead of 'nr'; or you could see what the name of the >database is when you >use the web form and try setting it to that. According to >this page, this >should be possible: > >http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=helpsnpfaq.sectio >n.SearchdbSNP >_test._Search_dbSNP_Using_B > >The Entrez Query limit was a recommendation for limiting your >search to a >set of sequences for human, for instance. > >I'll try looking into it a bit more but I'm pretty busy. If you find >anything out you should probably post it here . > >Chris > >> Hi Chris, >> >> 1. I have tried changing the database to snp or dbSNP but >neither works. >> It >> seems that depending on which type of blast you use(ie, Genome Blast, >> Blast SNP, >> normal blast such as blastn, etc...) you see a different listing of >> databases >> available for querys. 
Since you mention that the Blast page I see was >> generated >> by Genome, where could I go to see a complete listing of >databases I can >> query?? >> Or if you knew off hand which database to search if I only >wanted dbSNP >> hits? >> >> 2. You also mention, I can limit the search by using Entrez >terms. Do you >> mean >> like: >> $Bio::Tools::Run::RemoteBlast::HEADER{'ENTREZ_QUERY'} = 'abc'; >> where 'abc' is the name of the subject with which you would >only like to >> see >> result of. For example if you put it as 'Homo >sapiens[Organism]' then only >> human >> sequences would be in hit lists. >> If this is what you mean, what would I change it to, to see >only hits from >> dbSNP? >> >> Thanks for the ongoing help, >> >> Rohan >> >> Quoting Chris Fields : >> >> > I added a method to RemoteBlast in bioperl-live (CVS) if >you want to >> play >> > with changing the URL. I have been thinking about doing >this for a bit >> now >> > but I already see problems. >> > >> > Here's the issue: the BLAST page you see is NOT the NCBI BLAST page >> (note >> > the differences in the URL) but a user-friendly request >page, generated >> on >> > the fly by Genome, to submit BLAST requests for the >relevant database. >> So >> > changing the URL will not work (even by adding extra >parameters); you >> only >> > get the original HTML web page. >> > >> > You could try changing the database or limiting the search using an >> Entrez >> > term (which you should be able to include in the request, >probably by >> adding >> > it to the HEADER). 
>> > >> > Chris >> > >> > > -----Original Message----- >> > > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- >> > > bounces at lists.open-bio.org] On Behalf Of >> vrramnar at student.cs.uwaterloo.ca >> > > Sent: Thursday, July 13, 2006 5:39 PM >> > > To: bioperl-l at lists.open-bio.org >> > > Subject: [Bioperl-l] Remote Blast - Blast Human Genome >> > > >> > > >> > > Hello Again, >> > > >> > > I have another question regarding Remote blast but this >time using >> Genome >> > > Blast. >> > > >> > > Here is the link: >> > > >> > > >> >http://www.ncbi.nlm.nih.gov/genome/seq/BlastGen/BlastGen.cgi?taxid=9606 >> > > >> > > which again uses the main Blast web site: >> > > >> > > http://www.ncbi.nlm.nih.gov/BLAST/Blast.cgi >> > > >> > > Again I am not sure what to add or what HEADER >information to change >> > > within my >> > > script. >> > > >> > > Here is my program, which was the same as the last email: >> > > >> > > #!/usr/bin/perl -w >> > > >> > > use Bio::Perl; >> > > use Bio::Tools::Run::RemoteBlast; >> > > >> > > my $prog = "blastn"; >> > > my $db = "refseq_genomic"; >> > > my $e_val = 0.01; >> > > >> > > my @params = ( '-prog' => $prog, >> > > '-data' => $db, >> > > '-expect' => $e_val); >> > > >> > > my $factory = new Bio::Tools::Run::RemoteBlast->new(@params); >> > > $Bio::Tools::Run::RemoteBlast::HEADER{'WWW_BLAST_TYPE'} >= '????'; <-- >> --- >> > > what >> > > do I put here >> > > #$Bio::Tools::Run::RemoteBlast::HEADER{'?????'} = >'????'; <--- Do I >> need >> > > to add >> > > any other values to the form inputs >> > > >> > > $factory->submit_blast("blast.in"); >> > > $v = 1; >> > > >> > > while (my @rids = $factory->each_rid) >> > > { foreach my $rid ( @rids ) >> > > { my $rc = $factory->retrieve_blast($rid); >> > > if( !ref($rc) ) >> > > { if( $rc < 0 ) >> > > { $factory->remove_rid($rid); >> > > } >> > > print STDERR "." 
if ( $v > 0 );
>> > >         sleep 5;
>> > >       }
>> > >       else
>> > >       { my $result = $rc->next_result();
>> > >         my $filename = $result->query_name()."\.out";
>> > >         $factory->save_output($filename);
>> > >         $factory->remove_rid($rid);
>> > >         print "\nQuery Name: ", $result->query_name(), "\n";
>> > >       }
>> > >     }
>> > > }
>> > >
>> > > Both of my questions are very similar, as in I know how to use remote
>> > > blast but not sure what to change to access the specific blast I want.
>> > >
>> > > Again, any help would be very appreciated!!
>> > >
>> > > Rohan
>> > >
>> > > ----------------------------------------
>> > > This mail sent through www.mywaterloo.ca
>> > > _______________________________________________
>> > > Bioperl-l mailing list
>> > > Bioperl-l at lists.open-bio.org
>> > > http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>> ----------------------------------------
>> This mail sent through www.mywaterloo.ca
>
>_______________________________________________
>Bioperl-l mailing list
>Bioperl-l at lists.open-bio.org
>http://lists.open-bio.org/mailman/listinfo/bioperl-l

From cjfields at uiuc.edu Thu Jul 20 19:01:02 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 20 Jul 2006 18:01:02 -0500
Subject: [Bioperl-l] installing bioperl
In-Reply-To:
References:
Message-ID: <68C6025D-A9FE-47F0-905C-28B79C4B843A@uiuc.edu>

Did you run:

    perl Makefile.PL
    make
    make install

'perl Makefile.PL' generates the Makefile.

Something screwy with DB_File, apparently, is also going on.

> /usr/lib64/perl5/5.8.5/x86_64-linux-thread-multi/auto/DB_File/DB_File.so:

Try updating or reinstalling DB_File.

Chris

On Jul 20, 2006, at 4:47 PM, Matthew A. Saunders wrote:

> Dear Bioperl representative,
>
> I have been trying to install bioperl (in order to ultimately run some
> Ensembl APIs) but I seem to be having some problems with the
> bioperl installation.
>
> I have followed the installation directions and I get to the last steps of
> the "make" process, yet this stage fails with the error message below.
> Can you possibly tell me what is the problem. I am not sure that I
> understand the command "make", but I think that it requires that there be
> a file named "makefile" in the given folder, when I look in my newly
> formed "bioperl-1.4" folder there is no "makefile" in there. Perhaps that
> is a problem. If so, how might I rectify the matter?
>
> Thanks!
>
> Matt
>
> *************************************************************
> . . .
> Enjoy the rest of bioperl, which you can use after going 'make install'
>
> Checking if your kit is complete...
> Looks good
> /usr/bin/perl: symbol lookup error:
> /usr/lib64/perl5/5.8.5/x86_64-linux-thread-multi/auto/DB_File/DB_File.so:
> undefined symbol: db_version
> Running make test
> Make had some problems, maybe interrupted? Won't test
> Running make install
> Make had some problems, maybe interrupted? Won't install
> ***************************************************************
>
> -----------------------------------------------------
> Matthew A. Saunders
> UNCF-MERCK Postdoctoral Research Fellow
>
> Dept. of Ecology and Evolution
> University of Chicago
> (773)834-3964
> Skype: mattsaunders555
> http://home.uchicago.edu/~saunders
> -------------------------------------------------------
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign

From cjfields at uiuc.edu Thu Jul 20 19:02:08 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 20 Jul 2006 18:02:08 -0500
Subject: [Bioperl-l] Remote Blast - Blast Human Genome
In-Reply-To:
References:
Message-ID:

Nice to know! I'll add this to the wiki.
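
For anyone finding this in the archive, here is a minimal sketch of how such a database name might plug into the RemoteBlast script from earlier in this thread. It only reuses calls that already appear in Rohan's script (new, submit_blast, each_rid, retrieve_blast, remove_rid, save_output). The helper names blast_params and run_remote_blast are made up for illustration, and the database name is the one quoted in Malcolm's message, so check it against NCBI's list before relying on it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: build the RemoteBlast constructor arguments for a
# given database name. The option names (-prog, -data, -expect) are the
# ones used in Rohan's script.
sub blast_params {
    my ($db) = @_;
    return (
        '-prog'   => 'blastn',
        '-data'   => $db,
        '-expect' => 0.01,
    );
}

# Hypothetical helper: submit a query file and poll until every request ID
# has either produced a report or failed.
sub run_remote_blast {
    my ( $db, $query_file ) = @_;

    # Loaded lazily so the file still compiles where BioPerl is missing.
    require Bio::Tools::Run::RemoteBlast;
    my $factory = Bio::Tools::Run::RemoteBlast->new( blast_params($db) );
    $factory->submit_blast($query_file);

    while ( my @rids = $factory->each_rid ) {
        foreach my $rid (@rids) {
            my $rc = $factory->retrieve_blast($rid);
            if ( !ref($rc) ) {
                # Not a report object: a negative code means the request
                # failed, anything else means "not ready yet".
                $factory->remove_rid($rid) if $rc < 0;
                sleep 5;
                next;
            }
            my $result = $rc->next_result();
            $factory->save_output( $result->query_name() . '.out' );
            $factory->remove_rid($rid);
            print "Query Name: ", $result->query_name(), "\n";
        }
    }
    return;
}

# Only contacts NCBI when called with two arguments, e.g.
#   perl snp_blast.pl snp/human/human_snp blast.in
# (database name as quoted in Malcolm's message; assumed, not verified)
run_remote_blast( $ARGV[0], $ARGV[1] ) if @ARGV == 2;
```

Keeping the database name as a command-line argument means the same script can be pointed at any entry on NCBI's list without editing the code.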
Chris

On Jul 20, 2006, at 5:40 PM, Cook, Malcolm wrote:

> Rohan,
>
> 'snp/human/human_snp' is the database name you need to use to blast into
> human snp database at NCBI
>
> See the following document for the full list (which link was provided to
> me via personal correspondence with NCBI helpdesk). Very useful...
>
> Hmm, looking again, there appear now to be two versions:
>
> http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/blastdblist.html (last updated 2/7/2006)
> http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/remote_accessible_blastdblist.html (last updated 5/29/2006)
>
> Neither are linked to by any other document on the internet (google sez)
> including anywhere else at NCBI. Go figure. It should be IMHO since
> this info is nowhere else collected.
>
> Of course it may be out of date, but it always has got me through.
>
> Good luck
>
> Malcolm Cook - mec at stowers-institute.org - 816-926-4449
> Database Applications Manager - Bioinformatics
> Stowers Institute for Medical Research - Kansas City, MO USA

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign

From vrramnar at student.cs.uwaterloo.ca Thu Jul 20 19:07:15 2006
From: vrramnar at student.cs.uwaterloo.ca (vrramnar at student.cs.uwaterloo.ca)
Date: Thu, 20 Jul 2006 19:07:15 -0400
Subject: [Bioperl-l] Remote Blast - Blast Human Genome
In-Reply-To:
References:
Message-ID: <1153436835.44c00ca39f2ee@www.nexusmail.uwaterloo.ca>

Hi Malcolm,

Thanks for the help, I actually figured this out today the same way you did, through discussions with the NCBI help desk.
He mentioned the main site is:

http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/

But specifically:

http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/remote_accessible_blastdblist.html

So all you would need to do while using remoteblast is set your $db to one of the following:

snp/human_9606/human_9606    Human SNPs
snp/human_9606/rs_ch1        Human chr 1 SNPs
snp/human_9606/rs_ch10       Human chr 10 SNPs
snp/human_9606/rs_ch11       Human chr 11 SNPs
snp/human_9606/rs_ch12       Human chr 12 SNPs
snp/human_9606/rs_ch13       Human chr 13 SNPs
snp/human_9606/rs_ch14       Human chr 14 SNPs
snp/human_9606/rs_ch15       Human chr 15 SNPs
snp/human_9606/rs_ch16       Human chr 16 SNPs
snp/human_9606/rs_ch17       Human chr 17 SNPs
snp/human_9606/rs_ch18       Human chr 18 SNPs
snp/human_9606/rs_ch19       Human chr 19 SNPs
snp/human_9606/rs_ch2        Human chr 2 SNPs
snp/human_9606/rs_ch20       Human chr 20 SNPs
snp/human_9606/rs_ch21       Human chr 21 SNPs
snp/human_9606/rs_ch22       Human chr 22 SNPs
snp/human_9606/rs_ch3        Human chr 3 SNPs
snp/human_9606/rs_ch4        Human chr 4 SNPs
snp/human_9606/rs_ch5        Human chr 5 SNPs
snp/human_9606/rs_ch6        Human chr 6 SNPs
snp/human_9606/rs_ch7        Human chr 7 SNPs
snp/human_9606/rs_ch8        Human chr 8 SNPs
snp/human_9606/rs_ch9        Human chr 9 SNPs
snp/human_9606/rs_chMT       Human chr Mitochondrial SNPs
snp/human_9606/rs_chMulti    Human SNPs mapped to multiple locations
snp/human_9606/rs_chNotOn    Human SNPs not mapped
snp/human_9606/rs_chUn       Human SNPs mapped to unplaced contigs
snp/human_9606/rs_chX        Human chr x SNPs
snp/human_9606/rs_chY        Human chr y SNPs

The web site has a more complete list of all other databases available using the remoteblast module.

Rohan

Quoting "Cook, Malcolm" :

> Rohan,
>
> 'snp/human/human_snp' is the database name you need to use to blast into
> human snp database at NCBI
>
> See the following document for the full list (which link was provided to
> me via personal correspondence with NCBI helpdesk). Very useful...
>
> Hmm, looking again, there appear now to be two versions:
>
> http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/blastdblist.html (last updated 2/7/2006)
> http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/remote_accessible_blastdblist.html (last updated 5/29/2006)
>
> Neither are linked to by any other document on the internet (google sez)
> including anywhere else at NCBI. Go figure. It should be IMHO since
> this info is nowhere else collected.
>
> Of course it may be out of date, but it always has got me through.
>
> Good luck
>
> Malcolm Cook - mec at stowers-institute.org - 816-926-4449
> Database Applications Manager - Bioinformatics
> Stowers Institute for Medical Research - Kansas City, MO USA

----------------------------------------
This mail sent through www.mywaterloo.ca

From vrramnar at student.cs.uwaterloo.ca Thu Jul 20 19:18:27 2006
From: vrramnar at student.cs.uwaterloo.ca (vrramnar at student.cs.uwaterloo.ca)
Date: Thu, 20 Jul 2006 19:18:27 -0400
Subject: [Bioperl-l] SNP reference file download
Message-ID: <1153437507.44c00f43b53d4@www.nexusmail.uwaterloo.ca>

Hello All,

I was wondering if anyone knew how to download an entire SNP reference file from NCBI? Or even how to download the sequence data for a particular SNP?

I know how to do this via Bio::DB::GenBank, Bio::DB::SwissProt, etc. when referring to NM_##### records, but when I try to access rs###### records I am unsure of which Bio::DB module to point to, if there is one.

For example, if I had the accession number:

rs4986950

How could I retrieve NCBI's entire reference file for this SNP record, OR just the SNP sequence relating to this accession number?
Any help on this subject would greatly be appreciated, Rohan ---------------------------------------- This mail sent through www.mywaterloo.ca From cjfields at uiuc.edu Fri Jul 21 00:51:30 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 20 Jul 2006 23:51:30 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C00805.7090403@sendu.me.uk> References: <001001c6ac2f$02c8c3f0$15327e82@pyrimidine> <44C00805.7090403@sendu.me.uk> Message-ID: <9777D0A4-5340-4367-879D-F9125393B4FD@uiuc.edu> > I didn't actually mean a stored file (but that would be possible > with a > tied hash or something: DB_File, just like flatfile), but an in-memory > one for use during the course of program execution. Stored file would > probably be dangerous because you wouldn't know if the data has become > stale or not - and checking to see if it wasn't would defeat the > point. Okay, that wouldn't be a problem. I currently use in-memory caches to hold NCBI history information and ELink information for EUtilities. It would just a matter of doing the same for Bio::DB::Taxonomy. ... > entrez already parses through LineageEx to build the classification > array. flatfile walks up all the parents to do the same. Having the > information isn't the issue. We have the information. The methods > genus() and species() need to work with the genbank fileformat, > that is > the problem. The original purpose for Bio::Species was a simple object to hold taxonomic information. This object was then used in an attempt to hold the basic organism information (scientific name, common name, lineage information, etc) contained in a RichSeq file, like GenBank, EMBL, SwissProt, etc. The problem: trying to determine which term in the lineage corresponds to which rank and what part of the organism's scientific name is the genus, the species, and so on based solely on the data in the file, which comes down to a best-guess scenario for many organisms. 
It does work, but not equally well for all RichSeq files, not for every organism, and definitely not all the time. So, yes, genus(), species(), binomial, and other methods are present, but one must realize that parsing out the data into the appropriate object data using the various get/sets, with the obvious exceptions, is not the best way. Unless... you incorporate information available only outside the actual file itself (i.e. NCBI Taxonomy information). This is where Bio::Taxonomy seems to come along, as it's not-species specific (it can represent any rank) and is also DB-aware. Though Bio::Species was originally going to delegate all its data to Bio::Taxonomy::Node, I think the purpose was to eventually replace Bio::Species. So, my question is, why not use a Bio::Taxonomy::Node-like class initially to contain the appropriate data for a GenBank file (just for read/write purposes)? This object, since it implements Bio::Taxonomy::NodeI, is also DB-aware and thus, if set up with a database could also get/set the appropriate object data correctly using the lineage data. So, for instance, if I called $species = $seq->species(); and wanted the classification, scientific_name(), common_name, and other information that is gleaned from the file, then there's no need for a lookup. Once you cross into the bounds of: print $species->species(); print $species->genus(); then there's trouble, since we're working straight from the file (i.e. parsing is mainly correct, but still guesswork and sometimes wrong). But what if you could do something like this: my $db = Bio::DB::Taxonomy->new(-source => 'entrez'); # normally not needed as this is set by default internally, but as a demo here... 
$species->db_handle($db); # reset the appropriate data (genus, species, etc) based on Entrez tax data $species->reset_data(); # this method, BTW, doesn't exist yet but should be easy to implement print $species->species(); my $parent = $species->get_Parent_Node; my @child = $species->get_Children_Nodes; ...and so on Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From prabubio at gmail.com Fri Jul 21 02:17:41 2006 From: prabubio at gmail.com (Prabu R) Date: Fri, 21 Jul 2006 11:47:41 +0530 Subject: [Bioperl-l] Blast Output Parsing In-Reply-To: <002b01c6ac22$5d75e1f0$15327e82@pyrimidine> References: <002b01c6ac22$5d75e1f0$15327e82@pyrimidine> Message-ID: It works great Thanks a lot Mr.Chris. R. Prabu On 7/20/06, Chris Fields wrote: > > Grab the HSPs, then use get_aln() to generate a Bio::SimpleAlign object. > You can then use Bio::AlignIO to generate the alignment output if needed, > or > use the Bio::SimpleAlign methods to get what you want. > > http://www.bioperl.org/wiki/HOWTO:Beginners > > http://www.bioperl.org/wiki/HOWTO:SearchIO > > > http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/SimpleAlign > .html > > Chris > > > -----Original Message----- > > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > > bounces at lists.open-bio.org] On Behalf Of Prabu R > > Sent: Thursday, July 20, 2006 11:02 AM > > To: bioperl-l at lists.open-bio.org > > Subject: [Bioperl-l] Blast Output Parsing > > > > Dear All! > > > > I am now trying to parse a Blast output using PERL. > > > > I have to extract each alignment and have to parse the alignment. I > mean, > > I > > have to check whether a particular part of the given sequence got > aligned > > 100%. > > > > Anybody please tell me what module in PERL I have to use for getting > this. > > > > I've tried Bio::SearchIO. But I didnt get any method to get the > > alignment. > > > > Kindly help. > > > > Thanks, > > R. 
Prabu
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >

--
"Every noble work is at first impossible." - Thomas Carlyle

From mh6 at sanger.ac.uk Fri Jul 21 05:04:57 2006
From: mh6 at sanger.ac.uk (Michael Han)
Date: Fri, 21 Jul 2006 10:04:57 +0100
Subject: [Bioperl-l] PAML parser
Message-ID: <44C098B9.4090003@sanger.ac.uk>

Hi,

I have some questions about the PAML parser (Bio::Tools::Phylo::PAML in
CVS HEAD). Maybe some of you could help.

If you call next_result, $self->_parse_summary might be called, which
loops over $self->_readline .

Later in next_result when "while (defined ($_=$self->_readline))" is used
isn't the filepointer/filehandle already at the end of the output file and
should return undef breaking the parsing?

I added a crude seek($self->{_filehandle},0,0) after the _parse_summary
and it seemed to work, but I wonder if I missed something obvious.

thanks,

Mike

From cjfields at uiuc.edu Fri Jul 21 08:22:01 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Fri, 21 Jul 2006 07:22:01 -0500
Subject: [Bioperl-l] PAML parser
In-Reply-To: <44C098B9.4090003@sanger.ac.uk>
References: <44C098B9.4090003@sanger.ac.uk>
Message-ID:

Normally when you parse a report you use a loop to iterate through
results:

while (my $result = $parser->next_result) {
    # do work here
}

So returning undef is necessary to end the loop. This type of loop
construct is common in BioPerl (and in Perl in general).

There is a HOWTO for PAML:

http://www.bioperl.org/wiki/HOWTO:PAML

Chris

On Jul 21, 2006, at 4:04 AM, Michael Han wrote:

> Hi,
>
> I have some questions about the PAML parser
> (Bio::Tools::Phylo::PAML in CVS HEAD). Maybe some of you could help.
>
> If you call next_result, $self->_parse_summary might be called,
> which loops over $self->_readline .
>
> Later in next_result when "while (defined ($_=$self->_readline))"
> is used isn't the filepointer/filehandle
> already at the end of the output file and should return undef
> breaking the parsing?
>
> I added a crude seek($self->{_filehandle},0,0) after the
> _parse_summary and it seemed to work, but I wonder if I missed
> something obvious.
>
> thanks,
>
> Mike
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign

From cjfields at uiuc.edu Fri Jul 21 11:50:20 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Fri, 21 Jul 2006 10:50:20 -0500
Subject: [Bioperl-l] SNP reference file download
In-Reply-To: <1153437507.44c00f43b53d4@www.nexusmail.uwaterloo.ca>
Message-ID: <000901c6acdd$5f38ddb0$15327e82@pyrimidine>

You'll need the latest code from CVS; you could try (the highly
experimental) Bio::DB::EUtilities to get the raw flatfile XML data, then
pass everything through Bio::ClusterIO. Currently there isn't tempfile,
file, or filehandle support for the EUtilities but I plan on adding this
soon. You could also pipe STDOUT from one SNP retrieval script into STDIN
for the ClusterIO.

BTW, the EFetch object below accepts an array reference of primary IDs if
you want to use them instead, so you don't need to run an ESearch query
first. To do this you'll need to set the database parameter
(-db => 'snp'); the database from the ESearch query is passed to EFetch
via the Cookie object.
Chris

use Bio::DB::EUtilities;
use Bio::ClusterIO;

# save XML to tempfile for read/write
open my $XMLDATA, '+>', 'tempfile.xml';

# ESearch for term, place data in search history
my $esearch= Bio::DB::EUtilities->new(-eutil => 'esearch',
                                      -term => 'dihydroorotase',
                                      -db => 'snp',
                                      -usehistory => 'y');

$esearch->get_response;

print STDERR "Count: ", $esearch->count,"\n";

# efetch is default EUtility
my $efetch = Bio::DB::EUtilities->new(-cookie => $esearch->next_cookie,
                                      -rettype => 'flt'); # SNP flatfile

print $XMLDATA $efetch->get_response->content;

seek ($XMLDATA, 0, 0); # don't forget to rewind...

my $cio = Bio::ClusterIO->new(-format => 'dbsnp',
                              -fh => $XMLDATA);

# $snp is a Bio::Variation::SNP object, see perldoc for methods
while (my $snp = $cio->next_cluster) {
    print "ID : ",$snp->id,"\n";
}

close $XMLDATA;

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org
> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of
> vrramnar at student.cs.uwaterloo.ca
> Sent: Thursday, July 20, 2006 6:18 PM
> To: bioperl-l at lists.open-bio.org
> Subject: [Bioperl-l] SNP reference file download
>
>
> Hello All,
>
> I was wondering if anyone knew how to download an entire SNP reference
> file from NCBI?? Or even downloading the sequence data for a particular
> SNP.
>
> I know how to do this via Bio::DB::GenBank, Bio::DB::SwissP, etc.. when
> referring to NM_##### but when I try to access rs###### files I am
> unsure of what Bio::DB to point to, if there is one.
>
> For example, if I had the accession number: rs4986950 How could I
> retrieve NCBI's entire reference file for this SNP record OR just the
> SNP sequence relating to this accession number.
>
> Any help on this subject would greatly be appreciated,
>
> Rohan
>
>
> ----------------------------------------
> This mail sent through www.mywaterloo.ca
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

From cjfields at uiuc.edu Sun Jul 23 15:09:48 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Sun, 23 Jul 2006 14:09:48 -0500
Subject: [Bioperl-l] obo_parser.t test warnings
Message-ID:

Hilmar, Sohel,

Didn't know who to notify, so sorry in advance about cross-posting this to
the list. I was running through cleaning up some bugs and found that
obo_parser.t is throwing a ton of warnings:

bayou-75:~/Chris/Bioperl/bioperl-live natashacapell$ perl -I. -w t/obo_parser.t
1..40
"my" variable $val masks earlier declaration in same scope at Bio/OntologyIO/obo.pm line 592.
"my" variable $qh masks earlier declaration in same scope at Bio/OntologyIO/obo.pm line 592.
Use of uninitialized value in string eq at Bio/OntologyIO/obo.pm line 239, line 13.
...

Good news: all tests pass! Cheers!

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign

From cjfields at uiuc.edu Sun Jul 23 16:53:32 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Sun, 23 Jul 2006 15:53:32 -0500
Subject: [Bioperl-l] Bio::Species/Bio::Taxonomy changes
Message-ID: <7A7BF302-D84E-427B-ABC0-3EB4D19B33E2@uiuc.edu>

Sendu, Hilmar, et al,

I was looking through SeqIO::genbank and thought I would bring up a couple
of things to think about re: GenBank Taxonomy information. This is how
NCBI defines the names used for SOURCE and ORGANISM according to the
latest GenBank release notes:

SOURCE - Common name of the organism or the name most frequently used in
the literature. Mandatory keyword in all annotated entries/one or more
records/includes one subkeyword.
ORGANISM - Formal scientific name of the organism (first line) and
taxonomic classification levels (second and subsequent lines). Mandatory
subkeyword in all annotated entries/two or more records.

According to their sample file page
(http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html), the SOURCE is
this:

Free-format information including an abbreviated form of the organism
name, sometimes followed by a molecule type. (See section 3.4.10 of the
GenBank release notes for more info.)

The SOURCE can also include the organelle and also may include additional
information, such as an abbreviated name and a common name in parentheses.

...
SOURCE      Saccharomyces cerevisiae (baker's yeast)
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes;
            Saccharomycetales; Saccharomycetaceae; Saccharomyces.
...

Setting scientific_name() isn't a problem; according to the above
definition, it is the full name on the ORGANISM line. The lineage (or
classification() array) is also straightforward. The common_name(),
though, as used by Bio::SeqIO::genbank, is the entire SOURCE line (not
just the abbreviated name, but the name and everything else). No
additional parsing is performed on it. write_seq() also seems to do the
wrong thing when rebuilding the SOURCE line as well, as the method writes
the subspecies to the line.

I plan on using Bio::SeqIO::genbank as a guinea pig of sorts to try using
Bio::Taxonomy::Node objects instead of Bio::Species, then get the parsing
for these lines corrected and simplified. Essentially, the way NCBI
describes it, the main name on the line is actually the free-form
abbreviated name, the name in parentheses is the common name (optionally
present), and the organelle precedes all of these if present.
I want to try getting common_name() to match the common name found for
taxonomy (baker's yeast) rather than have it be a simple container, add an
abbreviated_name() method for the name container for the SOURCE line, and
have the organelle() method actually be used if an organelle is present
(it doesn't seem to be set at the moment in SeqIO::genbank).

Right now, I have NO idea how EMBL, DDBJ, other formats deal with organism
info; I would think that the main three (GenBank/EMBL-SwissProt/DDBJ)
handle them similarly...(Famous Last Words)

I also propose (I'll probably get yelled at here) NOT actively supporting
additional parsing of species, subspecies, etc directly from a file w/o a
DB lookup. As in, leave species, subspecies, genus parsing from the
flatfile as is (no longer support it) or remove it completely and leave
them unset. I haven't looked, but I have a strong feeling that the species
parsing in Bio::SeqIO is different from format to format. It really seems
like more trouble than it's worth to maintain this, especially as there is
perfectly valid Taxonomy database information available either locally
using a flatfile or via Entrez. If people want to have reliable
$species->species or $species->genus for taxonomy information, they will
need to have the db_handle() set for the Bio::Taxonomy::Node object and
have a Node-based method to reset species, genus, etc to the tax database
information (maybe reset_taxon or something along those lines).

Okay, rambled on enough. Any thoughts?

Christopher Fields
Postdoctoral Researcher
Lab of Dr.
Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign

From hlapp at gmx.net Sun Jul 23 19:40:45 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sun, 23 Jul 2006 19:40:45 -0400
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44BF86AF.8080408@sendu.me.uk>
References: <44BBBB69.6000906@sendu.me.uk> <44BF86AF.8080408@sendu.me.uk>
Message-ID:

On Jul 20, 2006, at 9:35 AM, Sendu Bala wrote:

> I'll describe all the changes I've now made and if no-one complains I'll
> commit. (I've also made these notes into bug 2047 for easier reference
> in the future.)
>
> Bio::DB::Taxonomy::flatfile
> ---------------------------
> [...]
>
> BEHAVIOUR-CHANGE: flatfile used to store within the nodes it makes the
> division as a three letter code, like 'PRI'. However, for consistency
> with entrez and the scientific_name() of the node the division is
> supposed to correspond to, it is now stored as the full name, like
> 'Primates'.

What about adding a method division_code() which would return the 3-letter
abbreviation?

The abbreviation may be needed by flat-file writers, so it may be handy to
have in some cases.

>
> The names->id solution also stores the artificially uniqued names like
> 'Craniata ', allowing you for the first time to retrieve the
> correct id. Previously the search would have simply failed completely.
>
> The names->id solution now handles nodes with scientific names of 'xyz
> (class)', allowing you to retrieve the id with both get_taxonids('xyz')
> and get_taxonids('xyz (class)'). Previously only the latter would
> work.

Should angle brackets be allowed too?

>
> NOTE: the previous 2 changes (and the issues with entrez, see below)
> make flatfile better at searching the taxonomy database than entrez
> module or the website, both in terms of speed and completeness of
> results.
>
> BEHAVIOUR-CHANGE: The scientific name field isn't touched in any way,
> always being sent directly to Bio::Taxonomy::Node->new(-name =>
> $untouched)

Maybe there should also be a -names parameter which accepts a hash
reference with keys being the kind of name (scientific, common, etc) and
the values being array references with the set of names of that kind?

> or the $node->classification() array.

Bio::Taxonomy::Node shouldn't have this attribute. It is legacy brought
over from a flawed (because flat) object model in Bio::Species.

> [...]
>
> Bio::DB::Taxonomy::entrez
> -------------------------
>
> # Bug-fixes
> Special characters like ", ( and ) in the input query string to
> get_taxonid() result in the failure or inaccuracy of the search. These
> characters are now removed prior to submission, allowing for correct
> search results.
> API-CHANGE: entrez has always been able to return multiple ids that
> match a single input name, so I've renamed get_taxonid() to
> get_taxonids() and it returns an array of ids in list context. It
> returns one of the ids in scalar context. For backward compatibility,
> *get_taxonid = \&get_taxonids.

Sounds good to me.

> NOTE: entrez modules (and website) cannot cope with '' in the
> query, failing searches like 'Craniata '. For this reason, if
> get_taxonids() is given a query with '' it will immediately
> return undefined, saving a pointless website access.

If there is a 'next-best-thing' that is still semantically compatible with
the API documentation, I would do that. In this case, if there is a in the
query the entrez module should strip it and automatically use the rest for
searching. If indeed multiple IDs match there should be a warning to
inform the user that entrez cannot use the notation to limit the query
results.

In fact, you might as well provide an option to enable an automatic check
for the correct branch for each ID if multiple ones are returned.
I.e., if this option is enabled, the module would automatically query the
parent nodes to see if is in the lineage, and if not will remove the
respective ID from the result set. The reason you may want to make it
optional is because it potentially costs time. (but in reality I'm not
sure why a client will not want to enable the option - so maybe this
should even be default)

> If you want the id
> of 'Craniata ' you must search for 'Craniata', then get the
> node for each returned id to see which one has a parent node with a
> scientific_name() or common_names() case-insensitive matching to
> 'chordata'.

Yep, see above. The more burden you can shield from the user the better.

> [...]
> Bio::Taxonomy::Node
> -------------------
> [...]
> classification() has a proper solution to finding the classification
> when the array wasn't manually set.
>
> # Improvements
> BEHAVIOUR-CHANGE: node_name() used to be an alias to name('common'). Now
> it is an alias to name('scientific').
> NOTE: node_name is what is set when ->new(-name => $name) is set, so
> flatfile and entrez and user-created nodes now implicitly associate the
> name of the node they create with its scientific name.

I'm not even sure node_name() should just be deprecated. The method
falsely suggests that there is only a single and definitive name for the
taxon node. In NCBI reality, this is only true for the scientific name of
the node. In real reality, many nodes have multiple scientific names -
taxonomy isn't static and therefore the scientific naming of nodes isn't
either.

> [...]
>

Thanks for the work, all other changes sound great. Thanks also to Chris
for assisting! Looks like this is in much better shape now than before.
-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From hlapp at gmx.net Sun Jul 23 19:44:23 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sun, 23 Jul 2006 19:44:23 -0400
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44BD147A.9020103@sendu.me.uk>
References: <003201c6aa81$01db9a30$15327e82@pyrimidine> <44BD147A.9020103@sendu.me.uk>
Message-ID: <6701FDC3-7630-48EB-9B21-5BB5C566424E@gmx.net>

On Jul 18, 2006, at 1:03 PM, Sendu Bala wrote:

>
> [regarding changes to Bio::Taxonomy::Node]
>
> Actually, I'm really strongly leaning toward getting rid of the
> following methods and new() options (and giving up entirely on being
> able to keep 'sapiens' somewhere):
>
> -organelle, organelle()
> -division, division()
> -sub_species, sub_species()
> -variant, variant()
> species(), validate_species_name()
> genus()
> binomial()
>
> As far as I can see none of these methods have any place in a generic
> Node class.

I agree. Some of them are a special case for genbank files (organelle()
etc), and the rest is legacy from Bio::Species.

-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From hlapp at gmx.net Sun Jul 23 20:48:22 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sun, 23 Jul 2006 20:48:22 -0400
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <9777D0A4-5340-4367-879D-F9125393B4FD@uiuc.edu>
References: <001001c6ac2f$02c8c3f0$15327e82@pyrimidine> <44C00805.7090403@sendu.me.uk> <9777D0A4-5340-4367-879D-F9125393B4FD@uiuc.edu>
Message-ID: <21C7763A-B8FF-4E91-BAEC-E5F79E462DD3@gmx.net>

On Jul 21, 2006, at 12:51 AM, Chris Fields wrote:

> my $db = Bio::DB::Taxonomy->new(-source => 'entrez');
>
> # normally not needed as this is set by default internally, but as a demo here...
> $species->db_handle($db);
>
> # reset the appropriate data (genus, species, etc) based on Entrez tax data
> $species->reset_data(); # this method, BTW, doesn't exist yet but should be easy to implement

Don't call this reset_data() as it may be misleading (usually reset()
means to revert into a native or original state). Instead, you would use
fetch_from_db() or something.

However, it seems redundant to me to begin with. If we ignore for a second
that the return value in the following isn't exactly compatible, why would
you not just call

$species = $db->get_Taxonomy_Node(-taxonid => $species->ncbi_taxid);

So I think more than anything else, this should be made to work, and you
would have a more seamless interface.

> Short and sweet summary:
>
> Sendu volunteered making changes to Bio::Taxonomy::Node and related
> modules; we disagreed on exactly what changes should be made. Sendu
> wanted a stripped-down version of Bio::Taxonomy::Node; I wanted one
> which would support similar methods as in Bio::Species.

Bio::Species should be considered legacy; I think it is flawed as an
object model because it imposes a flat view on something which in reality
is only a node in a tree and not flat at all. The only real need for the
flat view came from the desire to write sequence files - for all other
purposes the classification() etc attributes are useless anyway.

I.e., binomial() and common_name() (corresponding to scientific_name() and
names('common')) are the only real useful attributes, the rest is baggage
for writing sequence files. The baggage should not be passed on to a
better model ...

Instead, there should be a separate module (in essence a Bio::Species
factory) which can translate a Bio::Taxonomy::Node into a Bio::Species
object - if the rank is 'species' or below.
Alternatively, you could have a Bio::Taxonomy::SpeciesNode object which
implements both APIs and can be initialized with either a
Bio::Taxonomy::Node instance, or the combination of a Bio::Species and a
db handle.

At any rate, I think Bio::Taxonomy::Node should be stripped of legacy
methods that are only there to achieve Bio::Species compatibility.

>
> I suggested having a common interface module, one for Node and another
> for Species; both implement the same interface methods (NodeI maybe),
> so you could use either a bare-bones Node or a full-fledged Species
> object. I then suggested this new version of Species could replace
> Bio::Species. We could worry about which one to use for
> Bio::DB::Taxonomy* later.

I'm not following here... What would this look like? What would the API(s)
be?

>
> We both agreed. Everybody's happy.

Happiness is great, so maybe you shouldn't bother about me not
following...

> I still plan on switching Bio::DB::Taxonomy::entrez to use
> Bio::DB::EUtilities at some point

Wouldn't that rather be Bio::DB::Taxonomy::eutil?

> I may add a method for retrieving tax data based on protein/nucleotide
> sequence primary ID and relevant sequence database, so you could
> directly retrieve the relevant TaxID w/o parsing sequences directly for
> them. This would mainly be useful if you gather GIs from a BLAST
> search, for instance.
>
> Anyway, I could add this in the base class Bio::DB::Taxonomy directly
> so one could use the retrieved TaxIDs for flat-file or entrez searches;
> this requires, of course, access to the remote Entrez database (it
> would use ELink). Would that be of interest?

If you add the API methods for this to the base class (which in this case
is close in concept to an interface, much like Bio/SeqIO.pm), then make
clear that not every database will allow you to implement this.
>
>          |------Node
> NodeI----|
>          |------Species
>
> Another option would be to have Bio::Taxonomy::Node itself stripped
> down, then have another class (Bio::Taxonomy::Species) inherit methods
> from it and also implement additional methods (genus(), species(), etc).

I think this would be the way to go. I.e.,

         |------Node
NodeI----|
         |-|
           |----SpeciesNode
Species----|

This way the NodeI interface and its direct implementors are kept free of
legacy.

-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From cjfields at uiuc.edu Sun Jul 23 20:43:45 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Sun, 23 Jul 2006 19:43:45 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <6701FDC3-7630-48EB-9B21-5BB5C566424E@gmx.net>
References: <003201c6aa81$01db9a30$15327e82@pyrimidine> <44BD147A.9020103@sendu.me.uk> <6701FDC3-7630-48EB-9B21-5BB5C566424E@gmx.net>
Message-ID: <5F6027E0-A504-4019-8DAB-C50DF9EB6E18@uiuc.edu>

As an aside, the 'source' seqfeature in a GenBank file contains some of
the following information as tags; that's where the NCBI tax ID is taken
from in Bio::SeqIO::genbank:

FEATURES             Location/Qualifiers
     source          1..814
                     /organism="Porterinema fluviatile"
                     /organelle="plastid:chloroplast"
                     /mol_type="genomic DNA"
                     /strain="SAG 124.79"
                     /db_xref="taxon:246123"
                     /country="Germany"
...

So, variant(), organelle(), and ncbi_taxid() could all be set from the
same point in Bio::SeqIO::genbank.

I suggested an option to Sendu, but I'd like to hear your thoughts on this
since this will possibly affect bioperl-db. We could have two Node-like
Taxonomy objects using a common interface class (Bio::Taxonomy::NodeI):
Bio::Taxonomy::Node (stripped down version), and Bio::Taxonomy::Species
(the sequence-based NodeI-implementing object, which would retain the
other Bio::Species-like methods).
Bio::Taxonomy::Species would act sort of as an 'entry point' for
Bio::Taxonomy from sequences; moving up or down the tax node hierarchy
gets Tax::Node objects, unless you are specifically at a 'species'-ranked
node (though this could be just a Tax::Node as well).

BTW, I have managed to get Bio::SeqIO::genbank switched over to
Bio::Taxonomy::Node (er... Bio::Taxonomy::Species); all tests pass. I was
quite surprised how easy it was. It shouldn't be too hard to get a
NodeI/Node/Species class hierarchy set up if everybody thinks it's worth
it. Then we could deprecate Bio::Species!

Chris

On Jul 23, 2006, at 6:44 PM, Hilmar Lapp wrote:

>
> On Jul 18, 2006, at 1:03 PM, Sendu Bala wrote:
>
>>
>> [regarding changes to Bio::Taxonomy::Node]
>>
>> Actually, I'm really strongly leaning toward getting rid of the
>> following methods and new() options (and giving up entirely on being
>> able to keep 'sapiens' somewhere):
>>
>> -organelle, organelle()
>> -division, division()
>> -sub_species, sub_species()
>> -variant, variant()
>> species(), validate_species_name()
>> genus()
>> binomial()
>>
>> As far as I can see none of these methods have any place in a generic
>> Node class.
>
> I agree. Some of them are a special case for genbank files (organelle()
> etc), and the rest is legacy from Bio::Species.
>
> -hilmar
> --
> ===========================================================
> : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
> ===========================================================
>
>
>
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

Christopher Fields
Postdoctoral Researcher
Lab of Dr.
Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign

From hlapp at gmx.net Sun Jul 23 20:58:32 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sun, 23 Jul 2006 20:58:32 -0400
Subject: [Bioperl-l] Bio::Species/Bio::Taxonomy changes
In-Reply-To: <7A7BF302-D84E-427B-ABC0-3EB4D19B33E2@uiuc.edu>
References: <7A7BF302-D84E-427B-ABC0-3EB4D19B33E2@uiuc.edu>
Message-ID:

On Jul 23, 2006, at 4:53 PM, Chris Fields wrote:

> I also propose (I'll probably get yelled at here) NOT actively
> supporting additional parsing of species, subspecies, etc directly
> from a file w/o a DB lookup. As in, leave species, subspecies, genus
> parsing from the flatfile as is (no longer support it) or remove it
> completely and leave them unset.

Note that most (as in: most used, not most taxa) cases are actually
straightforward. I don't think removing what's there is desirable, just
everyone needs to understand that it will recognize only a limited number
of syntactical variations, and beyond that if you want correct taxon
attributes you will need a database (be it flatfile, eutil, whatever)
lookup.

> If people want to
> have reliable $species->species or $species->genus for taxonomy
> information, they will need to have the db_handle() set for the
> Bio::Taxonomy::Node object and have a Node-based method to reset
> species, genus, etc to the tax database information (maybe
> reset_taxon or something along those lines).

That's what I've been saying all along.
-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From cjfields at uiuc.edu Sun Jul 23 23:30:07 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Sun, 23 Jul 2006 22:30:07 -0500
Subject: [Bioperl-l] Bio::Species/Bio::Taxonomy changes
In-Reply-To:
References: <7A7BF302-D84E-427B-ABC0-3EB4D19B33E2@uiuc.edu>
Message-ID: <28D3470B-DA8F-4C41-96C7-F0D0DE89BAEE@uiuc.edu>

On Jul 23, 2006, at 7:58 PM, Hilmar Lapp wrote:

>
> On Jul 23, 2006, at 4:53 PM, Chris Fields wrote:
>
>> I also propose (I'll probably get yelled at here) NOT actively
>> supporting additional parsing of species, subspecies, etc directly
>> from a file w/o a DB lookup. As in, leave species, subspecies, genus
>> parsing from the flatfile as is (no longer support it) or remove it
>> completely and leave them unset.
>
> Note that most (as in: most used, not most taxa) cases are actually
> straightforward. I don't think removing what's there is desirable,
> just everyone needs to understand that it will recognize only a
> limited number of syntactical variations, and beyond that if you
> want correct taxon attributes you will need a database (be it flatfile,
> eutil, whatever) lookup.

Aha! We seem to agree on that...

>> If people want to
>> have reliable $species->species or $species->genus for taxonomy
>> information, they will need to have the db_handle() set for the
>> Bio::Taxonomy::Node object and have a Node-based method to reset
>> species, genus, etc to the tax database information (maybe
>> reset_taxon or something along those lines).
>
> That's what I've been saying all along.
>
> -hilmar
> --
> ===========================================================
> : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
> ===========================================================

I thought you had mentioned something about this a few months back on EMBL
format issues with organism data. Anyway, I don't think it was from
anybody disagreeing with you as much as it was one of the project
priorities that sort of got lost in the shuffle. I'm sure Sendu will like
having a bit of freedom with Bio::Taxonomy::Node.

Anyway, I'll do what I can within reason; I have to leave next weekend for
a 5-day conference.

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign

From bix at sendu.me.uk Mon Jul 24 04:21:55 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Mon, 24 Jul 2006 09:21:55 +0100
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <21C7763A-B8FF-4E91-BAEC-E5F79E462DD3@gmx.net>
References: <001001c6ac2f$02c8c3f0$15327e82@pyrimidine> <44C00805.7090403@sendu.me.uk> <9777D0A4-5340-4367-879D-F9125393B4FD@uiuc.edu> <21C7763A-B8FF-4E91-BAEC-E5F79E462DD3@gmx.net>
Message-ID: <44C48323.5060704@sendu.me.uk>

Hilmar Lapp wrote:
> On Jul 21, 2006, at 12:51 AM, Chris Fields wrote:
>
>> my $db = Bio::DB::Taxonomy->new(-source => 'entrez');
>>
>> # normally not needed as this is set by default internally, but as a demo here...
>> $species->db_handle($db);
>>
>> # reset the appropriate data (genus, species, etc) based on Entrez tax data
>> $species->reset_data(); # this method, BTW, doesn't exist yet but should be easy to implement
>
> Don't call this reset_data() as it may be misleading (usually reset()
> means to revert into a native or original state). Instead, you would
> use fetch_from_db() or something.
>
> However, it seems redundant to me to begin with.
If we ignore for a
> second that the return value in the following isn't exactly
> compatible, why would you not just call
>
> $species = $db->get_Taxonomy_Node(-taxonid => $species->ncbi_taxid);

If Bio::Species was a Bio::Taxonomy, and we had FactoryI implementing
classes or similar, we would say:

$species = $factory->fetch(-taxon_id => $species->ncbi_taxid);

> Instead, there should be a separate module (in essence a Bio::Species
> factory) which can translate a Bio::Taxonomy::Node into a
> Bio::Species object - if the rank is 'species' or below.

I don't think a 'translation' module is necessary. Bio::Species can just
be a Bio::Taxonomy.

> At any rate, I think Bio::Taxonomy::Node should be stripped of legacy
> methods that are only there to achieve Bio::Species compatibility.

Yes :)

> I think this would be the way to go. I.e.,
>
>
>          |------Node
> NodeI----|
>          |-|
>            |----SpeciesNode
> Species----|

Actually, if we're changing the name of the module that Species interacts
with, any existing code needs to be re-written. So why not just do it
properly and have Bio::Species interact with Bio::Taxonomy?

                  |----Bio::Taxonomy
Bio::TaxonomyI----|
                  |----Bio::Species

Or

Bio::TaxonomyI----|----Bio::Taxonomy----|----Bio::Species

Leaving Node completely free to be just a node. This way we don't have a
crufty SpeciesNode there simply for the sake of Bio::Species. Bio::Species
itself provides all the legacy stuff it needs for itself, while
interacting with Nodes via TaxonomyI methods in the 'correct' way only.
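
A minimal sketch of the second (chained) layout, using hypothetical
Sketch::* packages: none of these are real BioPerl classes, and the
add_node()/get_node() names are assumptions made only for illustration.
The point is just that the legacy accessors can live in the Species class
and be built purely on the interface methods, so the node class stays a
plain node. The sketch also assumes the species-rank node carries the full
binomial as its name, which is how the NCBI taxonomy names species nodes.

```perl
use strict;
use warnings;

# Hypothetical stand-ins; none of these packages are real BioPerl classes.

package Sketch::Node;         # "just a node": a name and a rank
sub new {
    my ($class, %args) = @_;
    return bless { name => $args{-name}, rank => $args{-rank} }, $class;
}
sub name { $_[0]{name} }
sub rank { $_[0]{rank} }

package Sketch::TaxonomyI;    # interface: implementors must override these
sub add_node { die "add_node() not implemented by " . ref($_[0]) . "\n" }
sub get_node { die "get_node() not implemented by " . ref($_[0]) . "\n" }

package Sketch::Taxonomy;     # generic implementation: nodes indexed by rank
our @ISA = ('Sketch::TaxonomyI');
sub new { my ($class) = @_; return bless { nodes => {} }, $class }
sub add_node {
    my ($self, $node) = @_;
    $self->{nodes}{ $node->rank } = $node;
    return $self;
}
sub get_node {
    my ($self, $rank) = @_;
    return $self->{nodes}{$rank};
}

package Sketch::Species;      # legacy accessors, built only on get_node()
our @ISA = ('Sketch::Taxonomy');
sub genus {
    my ($self) = @_;
    my $node = $self->get_node('genus');
    return $node ? $node->name : undef;
}
sub binomial {
    my ($self) = @_;
    my $node = $self->get_node('species');
    return $node ? $node->name : undef;
}
sub species {
    # species epithet = binomial with the leading genus name removed
    my ($self) = @_;
    my ($genus, $binomial) = ($self->genus, $self->binomial);
    return undef unless defined $genus && defined $binomial;
    (my $epithet = $binomial) =~ s/^\Q$genus\E\s+//;
    return $epithet;
}

package main;

my $sp = Sketch::Species->new;
$sp->add_node(Sketch::Node->new(-name => 'Saccharomyces', -rank => 'genus'));
$sp->add_node(Sketch::Node->new(-name => 'Saccharomyces cerevisiae',
                                -rank => 'species'));

print $sp->genus,    "\n";    # Saccharomyces
print $sp->species,  "\n";    # cerevisiae
print $sp->binomial, "\n";    # Saccharomyces cerevisiae
```

In this sketch Sketch::Node knows nothing about genus or species; anything
Species-specific is derived on demand from the nodes, which is one possible
reading of "interacting with Nodes via TaxonomyI methods".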
From bix at sendu.me.uk Mon Jul 24 03:58:57 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Mon, 24 Jul 2006 08:58:57 +0100
Subject: [Bioperl-l] Bio::Species/Bio::Taxonomy changes
In-Reply-To: <7A7BF302-D84E-427B-ABC0-3EB4D19B33E2@uiuc.edu>
References: <7A7BF302-D84E-427B-ABC0-3EB4D19B33E2@uiuc.edu>
Message-ID: <44C47DC1.8020503@sendu.me.uk>

Chris Fields wrote:
> Sendu, Hilmar, et al,
>
> I was looking through SeqIO::genbank and thought I would bring up a
> couple of things to think about re: GenBank Taxonomy information.
[...]
> SOURCE - Common name of the organism or the name most frequently used
> in the literature. Mandatory keyword in all annotated entries/one or
> more records/includes one subkeyword.
[...]
> Free-format information including an abbreviated form of the organism
> name, sometimes followed by a molecule type. (See section 3.4.10 of
> the GenBank release notes for more info.)
>
> The SOURCE can also include the organelle and also may include
> additional information, such as an abbreviated name and a common name
> in parentheses.

More specifically:
(from 3.4.10 ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt)

The SOURCE field consists of two parts. The first part is found after the
SOURCE keyword and contains free-format information including an
abbreviated form of the organism name followed by a molecule type;
multiple lines are allowed, but the last line must end with a period. The
second part consists of information found after the ORGANISM subkeyword.
The formal scientific name for the source organism (genus and species,
where appropriate) is found on the same line as ORGANISM. The records
following the ORGANISM line list the taxonomic classification levels,
separated by semicolons and ending with a period.

> The common_name(), though, as used by Bio::SeqIO::genbank, is the
> entire SOURCE line (not just the abbreviated name, but the name and
> everything else). No additional parsing is performed on it.
> write_seq() also seems to do the wrong thing when rebuilding the > SOURCE line as well as the method writes the subspecies to the line. > > I plan on using Bio::SeqIO::genbank as a guinea pig of sorts to try > using Bio::Taxonomy::Node objects instead of Bio::Species, then get > the parsing for these lines corrected and simplified. Essentially, > the way NCBI describes it, the main name on the line is actually the > free-form abbreviated name, the name in parentheses is the common > name (optionally present), and the organelle precedes all of these if > present. I want to try getting common_name() to match the common > name found for taxonomy (baker's yeast) rather than have it be a > simple container, add an abbreviated_name() method for the name > container for the SOURCE line, and have the organelle() method > actually be used if an organelle is present (it doesn't seem to be > set at the moment in SeqIO::genbank). This is not how I read the specification. Everything on the same line as 'SOURCE' is free-format text and therefore cannot be parsed. For the purposes of writing out it must be stored as-is, but it serves no other useful purpose. The file also provides the scientific name which can be used to do an accurate database lookup, which in turn gives you access to the common names, like "baker's yeast". On a side note, why would we care about 'organelle' when we're dealing with taxonomy? Why does the NCBI taxonomy db have a slot for organelle? From bix at sendu.me.uk Mon Jul 24 04:45:38 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Mon, 24 Jul 2006 09:45:38 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: References: <44BBBB69.6000906@sendu.me.uk> <44BF86AF.8080408@sendu.me.uk> Message-ID: <44C488B2.5070806@sendu.me.uk> Hilmar Lapp wrote: > On Jul 20, 2006, at 9:35 AM, Sendu Bala wrote: > >> Bio::DB::Taxonomy::flatfile >> --------------------------- >> [...]
>> >> BEHAVIOUR-CHANGE: flatfile used to store within the nodes it makes the >> division as a three letter code, like 'PRI'. However, for consistency >> with entrez and the scientific_name() of the node the division is >> supposed to correspond to, it is now stored as the full name, like >> 'Primates'. > > What about adding a method division_code() which would return the 3- > letter abbreviation? > > The abbreviation may be needed by flat-file writers, so it may be > handy to have in some cases. As far as I know you can't get the 3-letter version via entrez, so no other module can really expect to be able to get it, not knowing which database (flatfile.pm or entrez.pm) the taxonomic information is coming from. But of course it would be somewhat harmless to add division_code() anyway. It might be better done as a -code => 1 option to division()? >> The names->id solution also stores the artificially uniqued names like >> 'Craniata ', allowing you for the first time to retrieve the >> correct id. Previously the search would have simply failed completely. >> >> The names->id solution now handles nodes with scientific names of 'xyz >> (class)', allowing you to retrieve the id with both get_taxonids >> ('xyz') >> and get_taxonids('xyz (class)'). Previously only the latter would >> work. > > Should angle brackets be allowed too? Allowed in what sense? You can indeed search for both get_taxonids('Craniata ') [returns a single id] and get_taxonids('Craniata') [returns multiple ids, one of which is the previous answer]. > Maybe there should also be a -names parameter which accepts a hash > reference with keys being the kind of name (scientific, common, etc) > and the values being array references with the set of names of that > kind? Not sure what you mean. name() has that data structure, though you're not supposed to set its hash ref directly. >> or the $node->classification() array. > > Bio::Taxonomy::Node shouldn't have this attribute.
It is legacy > brought over from a flawed (because flat) object model in Bio::Species. Yes, I agree. >> NOTE: entrez modules (and website) cannot cope with '' >> in the >> query, failing searches like 'Craniata '. For this >> reason, if >> get_taxonids() is given a query with '' it will immediately >> return undefined, saving a pointless website access. > > If there is a 'next-best-thing' that is still semantically compatible > with the API documentation, I would do that. > > In this case, if there is a in the query the entrez > module should strip it and automatically use the rest for searching. > If indeed multiple IDs match there should be a warning to inform the > user that entrez cannot use the notation to limit the > query results. I wouldn't like this. I actually had it working this way initially, but decided that if someone entered 'xyz ' they really didn't want multiple ids, expected to get multiple ids with just 'xyz' and don't want their query made something else and then be warned about it. > In fact, you might as well provide an option to enable an automatic > check for the correct branch for each ID if multiple ones are > returned. I.e., if this option is enabled, the module would > automatically query the parent nodes to see if is in the > lineage, and if not will remove the respective ID from the result > set. The reason you may want to make it optional is because it > potentially costs time. (but in reality I'm not sure why a client > will not want to enable the option - so maybe this should even be > default) I can certainly add that, it seems like a good idea. I don't, however, see any scope for an option at all. What would the option be called? -don't_give_me_the_answer_I_actually_want_to_save_time ? Pointless, imho. If the user queries 'xyz ' with that option, they're just going to have to do for themselves manually what the method would have done for them without that option, in order to get the correct answer. 
It'll be slower that way, if anything. So the option would actually be called -don't_give_me_the_answer_I_actually_want_so_I_can_get_it_myself_a_little_slower (!). >> Bio::Taxonomy::Node >> ------------------- >> [...] >> classification() has a proper solution to finding the classification >> when the array wasn't manually set. >> >> # Improvements >> BEHAVIOUR-CHANGE: node_name() used to be an alias to name >> ('common'). Now >> it is an alias to name('scientific'). >> NOTE: node_name is what is set when ->new(-name => $name) is set, so >> flatfile and entrez and user-created nodes now implicitly associate >> the >> name of the node they create with its scientific name. > > I'm not even sure node_name() should just be deprecated. The methods > falsely suggests that there is only a single and definitive name for > the taxon node. > > In NCBI reality, this is only true for the scientific name of the > node. In real reality, many nodes have multiple scientific names - > taxonomy isn't static and therefore the scientific naming of nodes > isn't either. For the programmer not using any database but just making up his own nodes, I think he needs a node_name() because he may not be thinking about anything fancy or realistic. He just wants to give his node a single name that he invents. node_name() seems like the ideal method name to me. From jaynelvallance at hotmail.com Mon Jul 24 05:45:50 2006 From: jaynelvallance at hotmail.com (Jayne Vallance) Date: Mon, 24 Jul 2006 09:45:50 +0000 Subject: [Bioperl-l] SearchIO - Stop throwing away data Message-ID: Hi I am developing someone else's work. I wondered whether anyone could identify the mistake that the previous coder made? I am not very familiar with SearchIO yet. They are trying to extract filenames from an output report.
This is their code: # store the query name of the mito db blast hits into an array my $searchio = new Bio::SearchIO( -file => $blast_mito_output ); # array to store the mitochondrial BLAST database hits my @mito_hits; # name of query for BLAST hit my $query_name; while ( my $result = $searchio->next_result() ) { # get the hits and their associated name # do not want to include these in the clustering step while( my $hit = $result->next_hit ) { # store the names of these hits into an array # these filenames will not be copied over $query_name = $result->query_name(); #print "\nQuery $query_name\n"; push(@mito_hits, $query_name); } } I think they have based it on the code at http://www.bioperl.org/wiki/HOWTO:SearchIO#Authors use Bio::SearchIO; use Bio::SearchIO::FastHitEventBuilder; my $searchio = new Bio::SearchIO(-format => $format, -file => $file); $searchio->attach_EventHandler(Bio::SearchIO::FastHitEventBuilder->new); while( my $r = $searchio->next_result ) { while( my $h = $r->next_hit ) { # Hits will NOT have HSPs print $h->significance,"\n"; } which "throws away data you don't want"??? I am finding that our code is finding the last file name in the output report, rather than each and every one. I suspect it is overwriting (or throwing away the data). How do I need to change the code to make sure *every* file name goes into @mito_hits? Thank you, Jayne
From simon.andrews at bbsrc.ac.uk Mon Jul 24 07:14:08 2006 From: simon.andrews at bbsrc.ac.uk (simon andrews (BI)) Date: Mon, 24 Jul 2006 12:14:08 +0100 Subject: [Bioperl-l] SearchIO - Stop throwing away data In-Reply-To: Message-ID: > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org > [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of > Jayne Vallance > Sent: 24 July 2006 10:46 > To: bioperl-l at lists.open-bio.org > Subject: [Bioperl-l] SearchIO - Stop throwing away data > > Hi > > I am developing someone > else's work. I wondered whether anyone could identify the > mistake that the previous coder made? > I am not very familiar with SearchIO yet. > > They are trying to extract filenames from an output report. I'm not sure what you mean by filenames here. The value which is being collected in your code snippet is the name of the original query sequence. > This is their code: > while ( my $result = $searchio->next_result() ) { > # get the hits and their associated name > # do not want to include these in the clustering step > while( my $hit = $result->next_hit ) { > # store the names of these hits into an array > # these filenames will not be copied over > $query_name = $result->query_name(); > #print "\nQuery $query_name\n"; > push(@mito_hits, $query_name); OK, this bit is odd. You're collecting the name of the query sequence but you're doing it as you're looping through the hits. Since all the hits come from the same result you're just going to get the same query name put into your array multiple times (once for each hit). This almost certainly isn't what you want. If you just want the name of the query sequence you can miss out the inner loop (the $result->next_hit() loop).
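A minimal sketch of the two readings of the problem, assuming the report is parsed with Bio::SearchIO as in the snippet quoted above. It uses only the iterator methods already shown there (next_result, next_hit, query_name) plus num_hits on the result and name on the hit. The function names queries_with_hits and hit_names are made up for illustration, and the BLAST format in the commented call is an assumption.

```perl
use strict;
use warnings;

# Return the name of each query that produced at least one hit.
# Each query is recorded once per result; there is no inner hit loop,
# so the same name is not pushed once per hit.
sub queries_with_hits {
    my ($searchio) = @_;
    my @names;
    while ( my $result = $searchio->next_result ) {
        push @names, $result->query_name if $result->num_hits > 0;
    }
    return @names;
}

# Return the name of every hit (subject sequence) in the report.
sub hit_names {
    my ($searchio) = @_;
    my @names;
    while ( my $result = $searchio->next_result ) {
        while ( my $hit = $result->next_hit ) {
            push @names, $hit->name;
        }
    }
    return @names;
}

# Hypothetical call, assuming a BLAST-format report:
#   use Bio::SearchIO;
#   my $searchio  = Bio::SearchIO->new( -format => 'blast',
#                                       -file   => $blast_mito_output );
#   my @mito_hits = queries_with_hits($searchio);
```

Both functions consume the parser, so build a fresh Bio::SearchIO object for each pass over the report.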
If you actually want to collect the names of the sequences which were hit then you need to collect $hit->name() rather than $result->query_name(). > } > } > > I think they have based it on the code at > http://www.bioperl.org/wiki/HOWTO:SearchIO#Authors > $searchio->attach_EventHandler(Bio::SearchIO::FastHitEventBuilder->new); > while( my $r = $searchio->next_result ) { while( my $h = > $r->next_hit ) { > # Hits will NOT have HSPs > print $h->significance,"\n"; > } > > which "throws away data you don't want"??? Indeed, but probably not in the way you're thinking. The data it throws away is the details of each individual HSP (mostly the alignment data). You're not using HSP data in your code so it will have no effect (other than making it a bit quicker). It doesn't throw away whole hits or anything like that. > I am finding that our code is finding the last file name in > the output report, rather than each and every one. I suspect > it is overwriting (or throwing away the data). I suspect then that you should be collecting the hit names rather than the query names? Simon. From hlapp at gmx.net Mon Jul 24 08:20:00 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Mon, 24 Jul 2006 08:20:00 -0400 Subject: [Bioperl-l] Bio::Species/Bio::Taxonomy changes In-Reply-To: <44C47DC1.8020503@sendu.me.uk> References: <7A7BF302-D84E-427B-ABC0-3EB4D19B33E2@uiuc.edu> <44C47DC1.8020503@sendu.me.uk> Message-ID: <27E70EEF-28F6-4B54-9CCB-6090E338CEFB@gmx.net> On Jul 24, 2006, at 3:58 AM, Sendu Bala wrote: > On a side note, why would we care about 'organelle' when we're dealing > with taxonomy? Why does the NCBI taxonomy db have a slot for > organelle? Because some sequences are of the organelle DNA, and Genbank needs a way to express this. Highly artificial, but still can't be ignored.
-- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at gmx.net Mon Jul 24 08:27:28 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Mon, 24 Jul 2006 08:27:28 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C488B2.5070806@sendu.me.uk> References: <44BBBB69.6000906@sendu.me.uk> <44BF86AF.8080408@sendu.me.uk> <44C488B2.5070806@sendu.me.uk> Message-ID: <11A2B917-C633-4806-A6F4-920F02F0BF6E@gmx.net> :-) I think we're largely in agreement. As for node_name() I fully understand the motivation, but it needs to be understood that the attribute's value will be based on a largely arbitrary choice unless it is set directly by the user. -hilmar On Jul 24, 2006, at 4:45 AM, Sendu Bala wrote: > Hilmar Lapp wrote: >> On Jul 20, 2006, at 9:35 AM, Sendu Bala wrote: >> >>> Bio::DB::Taxonomy::flatfile >>> --------------------------- >>> [...] >>> >>> BEHAVIOUR-CHANGE: flatfile used to store within the nodes it >>> makes the >>> division as a three letter code, like 'PRI'. However, for >>> consistency >>> with entrez and the scientific_name() of the node the division is >>> supposed to correspond to, it is now stored as the full name, like >>> 'Primates'. >> >> What about adding a method division_code() which would return the 3- >> letter abbreviation? >> >> The abbreviation may be needed by flat-file writers, so it may be >> handy to have in some cases. > > As far as I know you can't get the 3-letter version via entrez, so no > other module can really expect to be able to get it, not knowing which > database (flatfile.pm or entez.pm) the taxonomic information is > coming from. > > But of course it would be somewhat harmless to add division_code() > anyway. It might be better done as a -code => 1 option to division()? 
> > >>> The names->id solution also stores the artificially uniqued names >>> like >>> 'Craniata ', allowing you for the first time to >>> retrieve the >>> correct id. Previously the search would have simply failed >>> completely. >>> >>> The names->id solution now handles nodes with scientific names of >>> 'xyz >>> (class)', allowing you to retrieve the id with both get_taxonids >>> ('xyz') >>> and get_taxonids('xyz (class)'). Previously only the latter would >>> work. >> >> Should angle brackets be allowed too? > > Allowed in what sense? You can indeed search for both > get_taxonids('Craniata ') [returns a single id] and > get_taxonids('Craniata') [returns multipe ids, one of which is the > previous answer]. > > >> Maybe there should also be a -names parameter which accepts a hash >> reference with keys being the kind of name (scientific, common, etc) >> and the values being array references with the set of names of that >> kind? > > Not sure what you mean. name() has that data structure, though you're > not supposed to set its hash ref directly. > > >>> or the $node->classification() array. >> >> Bio::Taxonomy::Node shouldn't have this attribute. It is legacy >> brought over from a flawed (because flat) object model in >> Bio::Species. > > Yes, I agree. > > >>> NOTE: entrez modules (and website) cannot cope with '' >>> in the >>> query, failing searches like 'Craniata '. For this >>> reason, if >>> get_taxonids() is given a query with '' it will >>> immediately >>> return undefined, saving a pointless website access. >> >> If there is a 'next-best-thing' that is still semantically compatible >> with the API documentation, I would do that. >> >> In this case, if there is a in the query the entrez >> module should strip it and automatically use the rest for searching. >> If indeed multiple IDs match there should be a warning to inform the >> user that entrez cannot use the notation to limit the >> query results. > > I wouldn't like this. 
I actually had it working this way initially, > but > decided that if someone entered 'xyz ' they really didn't > want multiple ids, expected to get multiple ids with just 'xyz' and > don't want their query made something else and then be warned about > it. > > >> In fact, you might as well provide an option to enable an automatic >> check for the correct branch for each ID if multiple ones are >> returned. I.e., if this option is enabled, the module would >> automatically query the parent nodes to see if is in the >> lineage, and if not will remove the respective ID from the result >> set. The reason you may want to make it optional is because it >> potentially costs time. (but in reality I'm not sure why a client >> will not want to enable the option - so maybe this should even be >> default) > > I can certainly add that, it seems like a good idea. I don't, however, > see any scope for an option at all. What would the option be called? > -don't_give_me_the_answer_I_actually_want_to_save_time ? Pointless, > imho. If the user queries 'xyz ' with that option, they're > just going to have to do for themselves manually what the method would > have done for them without that option, in order to get the correct > answer. It'll be slower that way, if anything. So the option would > actually be called > - > don't_give_me_the_answer_I_actually_want_so_I_can_get_it_myself_a_litt > le_slower > (!). > > >>> Bio::Taxonomy::Node >>> ------------------- >>> [...] >>> classification() has a proper solution to finding the classification >>> when the array wasn't manually set. >>> >>> # Improvements >>> BEHAVIOUR-CHANGE: node_name() used to be an alias to name >>> ('common'). Now >>> it is an alias to name('scientific'). >>> NOTE: node_name is what is set when ->new(-name => $name) is set, so >>> flatfile and entrez and user-created nodes now implicitly associate >>> the >>> name of the node they create with its scientific name. 
>> >> I'm not even sure node_name() should just be deprecated. The methods >> falsely suggests that there is only a single and definitive name for >> the taxon node. >> >> In NCBI reality, this is only true for the scientific name of the >> node. In real reality, many nodes have multiple scientific names - >> taxonomy isn't static and therefore the scientific naming of nodes >> isn't either. > > For the programmer not using any database but just making up his own > nodes, I think he needs a node_name() because he may not be thinking > about anything fancy or realistic. He just want to give his node a > single name that he invents. node_name() seems like the ideal method > name to me. > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at gmx.net Mon Jul 24 08:31:44 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Mon, 24 Jul 2006 08:31:44 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C48323.5060704@sendu.me.uk> References: <001001c6ac2f$02c8c3f0$15327e82@pyrimidine> <44C00805.7090403@sendu.me.uk> <9777D0A4-5340-4367-879D-F9125393B4FD@uiuc.edu> <21C7763A-B8FF-4E91-BAEC-E5F79E462DD3@gmx.net> <44C48323.5060704@sendu.me.uk> Message-ID: <8B9F12C5-C520-48CF-825F-DD24975B43BB@gmx.net> Sounds good to me, except there is no Bio::TaxonomyI yet, and also Bio::Species shouldn't fully depend on an internet connection or flat file to do anything meaningful. I.e., it should take advantage of a lookup database if there is one, but in the absence of that one should also be able to statically set attribute values to whatever one thinks can be gleaned from a parsed text or whatever. 
-hilmar On Jul 24, 2006, at 4:21 AM, Sendu Bala wrote: >> I think this would be the way to go. I.e., >> >> >> |------Node >> NodeI----| >> |-| >> |----SpeciesNode >> Species----| > > Actually, if we're changing the name of the module that Species > interacts with, any existing code needs to be re-written. So why not > just do it properly and have Bio::Species interact with Bio::Taxonomy? > > |----Bio::Taxonomy > Bio::TaxonomyI----| > |----Bio::Species > > Or > > Bio::TaxonomyI----|----Bio::Taxonomy----|----Bio::Species > > Leaving Node completely free to be just a node. This way we don't > have a > crufty SpeciesNode there simply for the sake of Bio::Species. > Bio::Species itself provides all the legacy stuff it needs for itself, > while interacting with Nodes via TaxonomyI methods in the 'correct' > way > only. > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From bix at sendu.me.uk Mon Jul 24 08:34:45 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Mon, 24 Jul 2006 13:34:45 +0100 Subject: [Bioperl-l] Bio::Species/Bio::Taxonomy changes In-Reply-To: <27E70EEF-28F6-4B54-9CCB-6090E338CEFB@gmx.net> References: <7A7BF302-D84E-427B-ABC0-3EB4D19B33E2@uiuc.edu> <44C47DC1.8020503@sendu.me.uk> <27E70EEF-28F6-4B54-9CCB-6090E338CEFB@gmx.net> Message-ID: <44C4BE65.8080304@sendu.me.uk> Hilmar Lapp wrote: > > On Jul 24, 2006, at 3:58 AM, Sendu Bala wrote: > >> On a side note, why would we care about 'organelle' when we're dealing >> with taxonomy? Why does the NCBI taxonomy db have a slot for organelle? > > Because some sequences are of the organelle DNA, and Genbank needs a way > to express this. Highly artificial, but still can't be ignored. Ok, but why is it stored as part of the taxonomy? Why isn't it stored in its own field? And does /bioperl/ have to store it as part of the taxonomy? 
Maybe the file parser could have its own organelle() method and leave all taxonomic classes without such a method. Or it could stay as is, I don't know. Do different organelles in the same species get unique taxonomy ids? From hlapp at gmx.net Mon Jul 24 08:46:51 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Mon, 24 Jul 2006 08:46:51 -0400 Subject: [Bioperl-l] Bio::Species/Bio::Taxonomy changes In-Reply-To: <44C4BE65.8080304@sendu.me.uk> References: <7A7BF302-D84E-427B-ABC0-3EB4D19B33E2@uiuc.edu> <44C47DC1.8020503@sendu.me.uk> <27E70EEF-28F6-4B54-9CCB-6090E338CEFB@gmx.net> <44C4BE65.8080304@sendu.me.uk> Message-ID: <2C99E56B-84D2-4C51-BBF1-76BAF81205AB@gmx.net> On Jul 24, 2006, at 8:34 AM, Sendu Bala wrote: > Hilmar Lapp wrote: >> >> On Jul 24, 2006, at 3:58 AM, Sendu Bala wrote: >> >>> On a side note, why would we care about 'organelle' when we're >>> dealing >>> with taxonomy? Why does the NCBI taxonomy db have a slot for >>> organelle? >> Because some sequences are of the organelle DNA, and Genbank needs >> a way >> to express this. Highly artificial, but still can't be ignored. > > Ok, but why is it stored as part of the taxonomy? Why isn't it > stored in > its own field? And does /bioperl/ have to store it as part of the > taxonomy? No, but clients need to be able to obtain it. Organelles have their own genome. If we talk about the human genome, for instance, most commonly we refer to the nuclear genome only. > Maybe the file parser could have its own organelle() method > and leave all taxonomic classes without such a method. Or it could > stay > as is, I don't know. Like I said above, at the end of the day there needs to be a way to qualify a sequence by the genome it is part of. > > Do different organelles in the same species get unique taxonomy ids? I would have to confirm, but I believe so. As I said, from a genome/ sequence-centric viewpoint, the organelle and nuclear genomes are two different things. 
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From simon.andrews at bbsrc.ac.uk Mon Jul 24 09:34:10 2006 From: simon.andrews at bbsrc.ac.uk (simon andrews (BI)) Date: Mon, 24 Jul 2006 14:34:10 +0100 Subject: [Bioperl-l] New EMBL format parsing/writing Message-ID: A few weeks ago I saw a couple of messages on this list mentioning the new ID/SV line format used in the latest EMBL release. I'm in the process of moving our database server over to the new format and was looking to update SeqIO::embl.pm. I'm sure someone said they'd made a patch to fix up parsing of the new format, but I can't find it in either CVS or Bugzilla. Rather than do this again myself, can someone point me to an updated SeqIO::embl.pm please? If there isn't one then I'll look into making the patch myself. Since this is such a major change, are there any plans to put out a new release with this fix included? I'm sure this will start to bite more people as the new format becomes more widely adopted. Cheers Simon. -- Simon Andrews PhD Bioinformatics Group The Babraham Institute simon.andrews at bbsrc.ac.uk +44 (0) 1223 496463 From cjfields at uiuc.edu Mon Jul 24 09:44:37 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 24 Jul 2006 08:44:37 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <8B9F12C5-C520-48CF-825F-DD24975B43BB@gmx.net> References: <001001c6ac2f$02c8c3f0$15327e82@pyrimidine> <44C00805.7090403@sendu.me.uk> <9777D0A4-5340-4367-879D-F9125393B4FD@uiuc.edu> <21C7763A-B8FF-4E91-BAEC-E5F79E462DD3@gmx.net> <44C48323.5060704@sendu.me.uk> <8B9F12C5-C520-48CF-825F-DD24975B43BB@gmx.net> Message-ID: <5CE136DF-2443-4AC0-AD12-B333D894A75B@uiuc.edu> Hence the reason to have it be a hybrid of Bio::Species and Tax::Node.
Bio::SeqIO::genbank works very happily with the current Bio::Taxonomy::Node now; if we intend to remove most of the method we need to have a similar DB-aware module to house the flatfile data (like Bio::Species) yet be capable of working with Bio::Taxonomy (like Tax::Node). As for organelle(), that could be made into something else (Bio::Annotation::SimpleValue or similar) but as it's always been included with the tax data, that's where it has been. The TaxID in the 'source' seqfeature doesn't refer to the organelle but the organism. Chris On Jul 24, 2006, at 7:31 AM, Hilmar Lapp wrote: > Sounds good to me, except there is no Bio::TaxonomyI yet, and also > Bio::Species shouldn't fully depend on an internet connection or flat > file to do anything meaningful. > > I.e., it should take advantage of a lookup database if there is one, > but in the absence of that one should also be able to statically set > attribute values to whatever one thinks can be gleaned from a parsed > text or whatever. > > -hilmar > > On Jul 24, 2006, at 4:21 AM, Sendu Bala wrote: > >>> I think this would be the way to go. I.e., >>> >>> >>> |------Node >>> NodeI----| >>> |-| >>> |----SpeciesNode >>> Species----| >> >> Actually, if we're changing the name of the module that Species >> interacts with, any existing code needs to be re-written. So why not >> just do it properly and have Bio::Species interact with >> Bio::Taxonomy? >> >> |----Bio::Taxonomy >> Bio::TaxonomyI----| >> |----Bio::Species >> >> Or >> >> Bio::TaxonomyI----|----Bio::Taxonomy----|----Bio::Species >> >> Leaving Node completely free to be just a node. This way we don't >> have a >> crufty SpeciesNode there simply for the sake of Bio::Species. >> Bio::Species itself provides all the legacy stuff it needs for >> itself, >> while interacting with Nodes via TaxonomyI methods in the 'correct' >> way >> only. 
>> > > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From bix at sendu.me.uk Mon Jul 24 09:49:42 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Mon, 24 Jul 2006 14:49:42 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <8B9F12C5-C520-48CF-825F-DD24975B43BB@gmx.net> References: <001001c6ac2f$02c8c3f0$15327e82@pyrimidine> <44C00805.7090403@sendu.me.uk> <9777D0A4-5340-4367-879D-F9125393B4FD@uiuc.edu> <21C7763A-B8FF-4E91-BAEC-E5F79E462DD3@gmx.net> <44C48323.5060704@sendu.me.uk> <8B9F12C5-C520-48CF-825F-DD24975B43BB@gmx.net> Message-ID: <44C4CFF6.40609@sendu.me.uk> Hilmar Lapp wrote: > Sounds good to me, except there is no Bio::TaxonomyI yet, Indeed, I propose making one. > Bio::Species shouldn't fully depend on an internet connection or flat > file to do anything meaningful. > > I.e., it should take advantage of a lookup database if there is one, but > in the absence of that one should also be able to statically set > attribute values to whatever one thinks can be gleaned from a parsed > text or whatever. Yes, which is why Bio::Taxonomy is appropriate here. Assuming that Bio::Species isa Bio::TaxonomyI: ... SOURCE Saccharomyces cerevisiae (baker's yeast) ORGANISM Saccharomyces cerevisiae Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes; Saccharomycetales; Saccharomycetaceae; Saccharomyces. ... 
## the fully-manual way my $species = new Bio::Species; my $node = new Bio::Taxonomy::Node(-name => 'Saccharomyces cerevisiae', -rank => 'species', -object_id => 1, -parent_id => 2); my $n2 = new Bio::Taxonomy::Node(-name => 'Saccharomyces', -object_id => 2, -parent_id => 3); # (no assumption that 'Saccharomyces' is the genus, so rank() undefined) my $n3 = [etc] $species->add_node($node); $species->add_node($n2); [etc] ## Using a factory without db access # assume that Bio::Taxonomy::GenbankFactory implements # some modified Bio::Taxonomy::FactoryI my $factory = Bio::Taxonomy::GenbankFactory->new(); my $species = $factory->generate(-classification => ['Saccharomyces cerevisiae', 'Saccharomyces', 'Saccharomycetaceae' ...]); # the generate() method above just does the fully-manual way for you ## Using a factory with db access # assume that Bio::Taxonomy::EntrezFactory implements some # modified Bio::Taxonomy::FactoryI and uses Bio::DB::Taxonomy::entrez # to get the nodes my $factory = Bio::Taxonomy::EntrezFactory->new(); my $species = $factory->fetch(-scientific_name => 'Saccharomyces cerevisiae'); # (would probably want to come up with a more generic name for the # fetch() and generate() methods, so that all Factories use the same # method name) It's very clean and flexible this way. Ultimately you always make your Bio::Species the same way - you add nodes to it. You can make those nodes yourself or use a factory. We also solve Chris' earlier quandary: [ in a world where Bio::Taxonomy::Node and Bio::Taxonomy::SpeciesNode exist, and given that Bio::DB::Taxonomy* currently directly make Node objects ] > The only problem I can foresee is which class to use with > Bio::DB::Taxonomy*? I guess one could settle on one class by default and > have the option to use another Bio::Taxonomy::NodeI-implementing class if > you wanted more data/methods available...
The way to do it is to have the Bio::DB::Taxonomy* modules return only the information that a Bio::Taxonomy::FactoryI would need to make a NodeI. The specific Factory that you use could generate whatever type of Node you wanted. But actually I propose there is only one Node and the specific Factory that you use determines the kind of Bio::TaxonomyI made; GenbankFactory might make a Bio::Species, while EntrezFactory might make a Bio::Taxonomy. Bio::Species differs from Bio::Taxonomy only in that it contains all the legacy method names that Bio::Species currently has, for backward compatibility. Setting $species->classification() would delete all nodes of self, use a GenbankFactory to make a new Bio::Species, then pull out all its Nodes and add them to self. Unless anyone can think of a better way of doing things, I'll explore the above ideas and start writing code. To summarise: major changes to Bio::DB::Taxonomy* (make them factory slaves), implementation of some Bio::Taxonomy::FactoryIs, tweak Bio::Taxonomy::FactoryI and make Bio::TaxonomyI, make Bio::Species a Bio::TaxonomyI. Oh, Bio::Taxonomy might need some changes as well. It has a classify() method that does something with a Bio::Species, which would be all wrong in the new way of doing things.
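To make the proposed generate() step concrete, here is a rough sketch in plain Perl of how a classification list could be turned into the same chain of node data that the fully-manual example builds by hand. Nothing here is existing BioPerl API: nodes_from_classification is a hypothetical helper, and the hash keys simply mirror the -name, -rank, -object_id and -parent_id arguments used in the message above. It assumes the list runs from most specific to most general name.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): build one plain hash per
# classification entry, chained by parent ids, most specific name first.
sub nodes_from_classification {
    my (@names) = @_;
    my @nodes;
    for my $i ( 0 .. $#names ) {
        my %node = (
            name      => $names[$i],
            object_id => $i + 1,
        );
        # Every node except the last (most general) one points at the next name.
        $node{parent_id} = $i + 2 if $i < $#names;
        # Only the first entry is assumed to be the species; other ranks
        # are left undefined rather than guessed.
        $node{rank} = 'species' if $i == 0;
        push @nodes, \%node;
    }
    return @nodes;
}

my @nodes = nodes_from_classification(
    'Saccharomyces cerevisiae', 'Saccharomyces', 'Saccharomycetaceae' );

# Print each node with its parent id, or 'none' for the top of the chain.
printf "%d %s (parent %s)\n",
    $_->{object_id}, $_->{name}, $_->{parent_id} // 'none'
    for @nodes;
```

A real factory would wrap each hash in a node object and add it to the Bio::Species, but the id chaining would be the same.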
From bix at sendu.me.uk Mon Jul 24 09:53:23 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Mon, 24 Jul 2006 14:53:23 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <5CE136DF-2443-4AC0-AD12-B333D894A75B@uiuc.edu> References: <001001c6ac2f$02c8c3f0$15327e82@pyrimidine> <44C00805.7090403@sendu.me.uk> <9777D0A4-5340-4367-879D-F9125393B4FD@uiuc.edu> <21C7763A-B8FF-4E91-BAEC-E5F79E462DD3@gmx.net> <44C48323.5060704@sendu.me.uk> <8B9F12C5-C520-48CF-825F-DD24975B43BB@gmx.net> <5CE136DF-2443-4AC0-AD12-B333D894A75B@uiuc.edu> Message-ID: <44C4D0D3.1020506@sendu.me.uk> Chris Fields wrote: > Bio::SeqIO::genbank works very happily with the current > Bio::Taxonomy::Node now; if we intend to remove most of the method we > need to have a similar DB-aware module to house the flatfile data (like > Bio::Species) yet be capable of working with Bio::Taxonomy (like Tax::Node). Can you give code examples of what Bio::SeqIO::genbank is doing and what makes it 'happy'? What are the requirements? Would it be as happy working with a Bio::Taxonomy object? From aramsey at vecna.com Mon Jul 24 10:23:46 2006 From: aramsey at vecna.com (Al Ramsey) Date: Mon, 24 Jul 2006 10:23:46 -0400 Subject: [Bioperl-l] Making BioPerl Faster Message-ID: <44C4D7F2.6020107@vecna.com> I'm interested into following up with a suggestion from the bioperl.org site about making it faster (http://www.bioperl.org/wiki/Why_BioPerl_is_slow). In particular, I wanted to look a little more into how the object instantiations might be more efficient. Is anyone else looking into this actively now? I want to ask if anyone had any additional insights that weren't previously published before I started. Thank you, Al Ramsey -- Alvin Ramsey, PhD. Vecna Technologies, Inc. 
5205 Leesburg Pike Falls Church, VA 22041 aramsey at vecna.com t: 703.998.5333 f: 703.998.5816 From s-merchant at northwestern.edu Mon Jul 24 11:09:49 2006 From: s-merchant at northwestern.edu (Sohel Merchant) Date: Mon, 24 Jul 2006 10:09:49 -0500 Subject: [Bioperl-l] obo_parser.t test warnings In-Reply-To: Message-ID: <004301c6af33$3564a8e0$c2987ca5@pc13> Hey Chris, I usually run perl with all warnings disabled. So I never saw these. I will put a fix to them sometime this week. Thanks, Sohel. _____ From: Chris Fields [mailto:cjfields at uiuc.edu] Sent: Sunday, July 23, 2006 2:10 PM To: bioperl-l List; Hilmar Lapp; s-merchant at northwestern.edu Subject: obo_parser.t test warnings Hilmar, Sohel, Didn't know who to notify, so sorry in advance about cross-posting this to the list. I was running through cleaning up some bugs and found that obo_parser.t is throwing a ton of warnings: bayou-75:~/Chris/Bioperl/bioperl-live natashacapell$ perl -I. -w t/obo_parser.t 1..40 "my" variable $val masks earlier declaration in same scope at Bio/OntologyIO/obo.pm line 592. "my" variable $qh masks earlier declaration in same scope at Bio/OntologyIO/obo.pm line 592. Use of uninitialized value in string eq at Bio/OntologyIO/obo.pm line 239, line 13. ... Good news: all tests pass! Cheers! Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From prabubio at gmail.com Mon Jul 24 11:39:43 2006 From: prabubio at gmail.com (Prabu R) Date: Mon, 24 Jul 2006 21:09:43 +0530 Subject: [Bioperl-l] Remote Blast Execution Message-ID: Dear All! I am trying to run Remote Blast using Bio::Tools::Run::RemoteBlast. I am not able to get the blast result. Upto my knowledge, the Bio::SearchIO::blast hash object does not returns any result. Secondly, I tried 'remote_blast.pl ' a program from CPAN bioperl 1.5release. 
Command: perl bp_remote_blast.pl -p blastn -d est_mouse -e 1e-5 -i /home/prabucn/Blast/mm_test1.fa Error Message: retrieving blasts.. -------------------- WARNING --------------------- MSG: Possible error (1) while parsing BLAST report! --------------------------------------------------- Please help. Thanks, R. Prabu. Please look into my test program. ---------------------------------------------------------------------------------------------- use Bio::Tools::Run::RemoteBlast; use strict; use Bio::SeqIO; use Bio::SearchIO; my $prog = 'blastn'; my $db = 'est'; my $e_val= '1e-10'; my @params = ( '-prog' => $prog, '-data' => $db, '-expect' => $e_val, '-readmethod' => 'SearchIO' ); my $factory = Bio::Tools::Run::RemoteBlast->new(@params) || die "Cant do"; my $v = 1; my $str = Bio::SeqIO->new(-file=>'mm_test2.txt' , '-format' => 'fasta' ); while (my $input = $str->next_seq()){ my $r = $factory->submit_blast($input); print STDERR "waiting..." if( $v > 0 ); while ( my @rids = $factory->each_rid ) { foreach my $rid ( @rids ) { my $rc = $factory->retrieve_blast($rid); if( !ref($rc) ) { if( $rc < 0 ) { $factory->remove_rid($rid); } print STDERR "." if ( $v > 0 ); sleep 5; } else { print "$rc\n"; my $result = $rc->next_result(); my $filename = $result->query_name()."\.out"; $factory->save_output($filename); $factory->remove_rid($rid); print "\nQuery Name: ", $result->query_name(), "\n"; while ( my $hit = $result->next_hit ) { next unless ( $v > 0); print "\thit name is ", $hit->name, "\n"; while( my $hsp = $hit->next_hsp ) { print "\t\tscore is ", $hsp->score, "\n"; } } } } } } ---------------------------------------------------------------------------------------------- -- "Every noble work is at first impossible." 
- Thomas Carlyle From cjfields at uiuc.edu Mon Jul 24 11:48:45 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 24 Jul 2006 10:48:45 -0500 Subject: [Bioperl-l] SearchIO - Stop throwing away data In-Reply-To: Message-ID: <001701c6af38$a81c1580$15327e82@pyrimidine> > Hi > > I developing someone > elses work. I wondered whether anyone could identify the > mistake that the previous coder made? > I am not very familiar with SearchIO yet. > > They are trying to extract filenames from an output report. > This is their code: > > # store the query name of the mito db blast hits into an array > my $searchio = new Bio::SearchIO( -file => $blast_mito_output ); > # array to store the mitochondrial BLAST database hits > my @mito_hits; > # name of query for BLAST hit > my $query_name; > Just as a gripe here: you should always designate the '-format' here to be 'blast' for BLAST text output. my $searchio = new Bio::SearchIO(-file => $blast_mito_output, -format => 'blast' ); The default is still text, so the above works, but that very well may change in the future. Each BLAST report is a Result. Each Result contains one or more hits; each hit contains one or more HSPs. SearchIO only parses the information contained in the BLAST report (i.e. no filenames). From here, it looks like you want Hit information, though. The code below copies the query_name from the BlastResult object, $result (i.e. the name of your query sequence, the one you submitted for BLAST'ing against a database). You need the BlastHit data from $hit. 
Change : $query_name = $result->query_name(); #print "\nQuery $query_name\n"; push(@mito_hits, $query_name); To : $hit_name = $hit->description(); #print "\nHit $hit_name\n"; push(@mito_hits, $hit_name); or, for the hit accession, use $hit_name = $hit->accession(); For all accessions in the description (there may be multiples if sequences are identical), use an array and @hit_name = $hit->get_all_accessions(); You can use a different EventHandler if you want to speed things up: my $searchio = new Bio::SearchIO(-format => $format, -file => $file); $searchio->attach_EventHandler(Bio::SearchIO::FastHitEventBuilder->new); But to have this work you need to update to the latest CVS version of bioperl; this was a recent bug that was fixed. Chris > while ( my $result = $searchio->next_result() ) { > # get the hits and their associated name > # do not want to include these in the clustering step > while( my $hit = $result->next_hit ) { > # store the names of these hits into an array > # these filenames will not be copied over > $query_name = $result->query_name(); > #print "\nQuery $query_name\n"; > push(@mito_hits, $query_name); > } > } > I think they have based it on the code at > http://www.bioperl.org/wiki/HOWTO:SearchIO#Authors > > use Bio::SearchIO; > use Bio::SearchIO::FastHitEventBuilder; > my $searchio = new Bio::SearchIO(-format => $format, -file => $file); > > $searchio->attach_EventHandler(Bio::SearchIO::FastHitEventBuilder->new); > while( my $r = $searchio->next_result ) { > while( my $h = $r->next_hit ) { > # Hits will NOT have HSPs > print $h->significance,"\n"; > } > > which "throws away data you don't want"??? > > I am finding that our code is finding the last file name in the ouput > report, > rather than each and every one. I suspect it is overwriting (or throwing > away the data). > > How do I need to change the code to make sure *every* file name goes > into @mito_hits? 
> > Thankyou > > Jayne > > _________________________________________________________________ > The new MSN Search Toolbar now includes Desktop search! > http://join.msn.com/toolbar/overview > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From dwaner at scitegic.com Mon Jul 24 12:03:21 2006 From: dwaner at scitegic.com (dwaner at scitegic.com) Date: Mon, 24 Jul 2006 09:03:21 -0700 Subject: [Bioperl-l] New EMBL format parsing/writing Message-ID: Simon, I have already updated SeqIO::embl.pm to support release 87. All I have left to do is generate the patch and update the /t test. I will try to get this submitted to bugzilla today (24 July). - David From cjfields at uiuc.edu Mon Jul 24 12:04:40 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 24 Jul 2006 11:04:40 -0500 Subject: [Bioperl-l] Making BioPerl Faster In-Reply-To: <44C4D7F2.6020107@vecna.com> Message-ID: <001901c6af3a$df146ea0$15327e82@pyrimidine> Give it a look, sure! Not sure if this the only problem though when it comes to speed; I think it's more complicated than that. I think that (at least on WinXP) the Perl version used is also partially to blame. It's possible that something modified between v 5.6 and 5.8 slowed everything down considerably. I always wondered if it had something to do with Unicode support in perl 5.8 ... There is a report on Bugzilla about a dramatic slowdown on sequence parsing between v. 1.4 and v. 1.5 (including the latest, v 1.5.1) http://bugzilla.open-bio.org/show_bug.cgi?id=1875 This is unresolved at this time but may be unrelated to the possible perl versioning issue above. I've a feeling you may find regexes and redundant methods calls also add quite a bit of overhead. I've seen several places where accessors are called over and over w/o assigning to a local variable. Or places where a tr/// would work much faster than a s///. 
There was an instance of the latter in SeqIO which sped up parsing about 2-3x faster on WinXP. If you want to look at the impact of object instantiation on speed, check out Bio::SearchIO (parsing of BLAST/FASTA/HMMER reports). Lots of method calls, object creation, etc. Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Al Ramsey > Sent: Monday, July 24, 2006 9:24 AM > To: bioperl-l at lists.open-bio.org > Subject: [Bioperl-l] Making BioPerl Faster > > I'm interested into following up with a suggestion from the bioperl.org > site about making it faster > (http://www.bioperl.org/wiki/Why_BioPerl_is_slow). In particular, I > wanted to look a little more into how the object instantiations might be > more efficient. Is anyone else looking into this actively now? I want > to ask if anyone had any additional insights that weren't previously > published before I started. > > Thank you, > Al Ramsey > > > -- > Alvin Ramsey, PhD. > > Vecna Technologies, Inc. > 5205 Leesburg Pike > Falls Church, VA 22041 > aramsey at vecna.com > t: 703.998.5333 > f: 703.998.5816 > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Mon Jul 24 12:06:03 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 24 Jul 2006 11:06:03 -0500 Subject: [Bioperl-l] Remote Blast Execution In-Reply-To: Message-ID: <001a01c6af3b$10187f50$15327e82@pyrimidine> You need to update to the latest code (bioperl-live) from CVS. BLAST parsing using RemoteBlast is broken in all the latest releases. 
Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Prabu R > Sent: Monday, July 24, 2006 10:40 AM > To: bioperl-l at lists.open-bio.org > Subject: [Bioperl-l] Remote Blast Execution > > Dear All! > > I am trying to run Remote Blast using Bio::Tools::Run::RemoteBlast. > > I am not able to get the blast result. > Upto my knowledge, the Bio::SearchIO::blast hash object does not returns > any > result. > > > Secondly, I tried 'remote_blast.pl ' a program from CPAN bioperl > 1.5release. > > Command: > perl bp_remote_blast.pl -p blastn -d est_mouse -e 1e-5 -i > /home/prabucn/Blast/mm_test1.fa > > Error Message: > > retrieving blasts.. > > -------------------- WARNING --------------------- > MSG: Possible error (1) while parsing BLAST report! > --------------------------------------------------- > > Please help. > > Thanks, > R. Prabu. > > > Please look into my test program. > -------------------------------------------------------------------------- > -------------------- > use Bio::Tools::Run::RemoteBlast; > use strict; > use Bio::SeqIO; > use Bio::SearchIO; > > my $prog = 'blastn'; > my $db = 'est'; > my $e_val= '1e-10'; > > my @params = ( '-prog' => $prog, > '-data' => $db, > '-expect' => $e_val, > '-readmethod' => 'SearchIO' ); > > my $factory = Bio::Tools::Run::RemoteBlast->new(@params) || die "Cant > do"; > > my $v = 1; > > my $str = Bio::SeqIO->new(-file=>'mm_test2.txt' , '-format' => 'fasta' > ); > > while (my $input = $str->next_seq()){ > my $r = $factory->submit_blast($input); > > print STDERR "waiting..." if( $v > 0 ); > while ( my @rids = $factory->each_rid ) { > foreach my $rid ( @rids ) { > my $rc = $factory->retrieve_blast($rid); > > if( !ref($rc) ) { > if( $rc < 0 ) { > $factory->remove_rid($rid); > } > print STDERR "." 
if ( $v > 0 ); > sleep 5; > } else { > print "$rc\n"; > my $result = $rc->next_result(); > my $filename = $result->query_name()."\.out"; > $factory->save_output($filename); > $factory->remove_rid($rid); > print "\nQuery Name: ", $result->query_name(), "\n"; > while ( my $hit = $result->next_hit ) { > next unless ( $v > 0); > print "\thit name is ", $hit->name, "\n"; > while( my $hsp = $hit->next_hsp ) { > print "\t\tscore is ", $hsp->score, "\n"; > } > } > } > } > } > } > -------------------------------------------------------------------------- > -------------------- > > -- > "Every noble work is at first impossible." > - Thomas Carlyle > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Mon Jul 24 12:21:39 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 24 Jul 2006 11:21:39 -0500 Subject: [Bioperl-l] New EMBL format parsing/writing In-Reply-To: Message-ID: <001c01c6af3d$3df2dc70$15327e82@pyrimidine> The only proposed EMBL changes I can remember were for Tax data (organism lines). It shouldn't be hard to change the way these are parsed. We could leave parsing of SV for older files and run a check on the ID line format to accommodate old and new sequences, though I have no problem with only supporting the latest formats. Continual support for old deprecated sequence formats leads to lots of cruft over time; SwissPort parsing has the same issue. You would be surprised how many people out there never bother to update their sequences and use old data... I believe you are referring to this (from the latest EMBL release notes): ... 2 CHANGES IN THIS RELEASE 2.1 Changes to the Feature Table Document: Chapter 3.5 "Location" The use of range (.) descriptor within location spans is no longer legal. 2.2 ID line changes ID line structure underwent the following changes * All tokens are separated by a semicolon. 
* The entry name is not displayed, in its place there is the primary
  accession number.
* The sequence version is indicated.
* The topology is a separate token and is indicated for both circular
  and linear molecules.
* Both the data class and taxonomic divisions will be displayed.

This is an example of the new ID line:

ID   CD789012; SV 4; linear; genomic DNA; HTG; MAM; 500 BP.
     (1)       (2)   (3)     (4)          (5)  (6)  (7)

The tokens represent:
  1. Primary accession number.
  2. 'SV' + sequence version number.
  3. Topology: 'circular' or 'linear'.
  4. Molecule type.
  5. Data class (ANN, CON, PAT, EST, GSS, HTC, HTG, MGA, WGS, TPA, STS,
     STD, "normal" entries will have STD for standard).
  6. Taxonomic division (HUM, MUS, ROD, PRO, MAM, VRT, FUN, PLN, ENV, INV,
     SYN, UNC, VRL, PHG).
  7. Sequence length + 'BP.'.

The entry name is no longer displayed in the ID line. A mapping file (entryname to accession number) ftp://ftp.ebi.ac.uk/pub/databases/embl/misc/entryname_to_acc.mapping is provided for those entries where the entryname is not the same as the accession number.

The SV line has been dropped as sequence version information is now displayed in the ID line.

In order to facilitate the changeover to the new ID line structure, two small utilities have been released: 'new2oldID.pl' and 'old2newID.pl'. They can be used to convert EMBL flat files from the old to the new format and vice-versa. The converters can be found at ftp://ftp.ebi.ac.uk/pub/databases/embl/tools

A new version of the Syncron tools (for maintaining synchronised copies of EMBL database updates) that became the working version with EMBL release 87 can be found in the same directory. In this version the tools were adjusted to cope with the new format of the ID line in EMBL entries and some related changes.

...

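As a rough illustration of the new ID line layout quoted above, the seven semicolon-separated tokens could be pulled apart like this. This is only a sketch: parse_embl_id_line() is a hypothetical helper, not the code in Bio::SeqIO::embl, and it simply rejects any ID line that does not split into exactly seven tokens rather than trying to handle the old format.

```perl
use strict;
use warnings;

# Hypothetical helper: split a new-style EMBL ID line into named fields.
# Returns a hash reference, or nothing if the line does not fit the layout.
sub parse_embl_id_line {
    my ($line) = @_;

    # Strip the 'ID' line code and surrounding whitespace.
    my ($rest) = $line =~ /^ID\s+(.+?)\s*$/ or return;

    # All tokens are separated by a semicolon.
    my @tokens = split /\s*;\s*/, $rest;
    return unless @tokens == 7;

    my ($acc, $sv, $topology, $mol_type, $class, $division, $len) = @tokens;
    my ($version) = $sv  =~ /^SV\s+(\d+)$/    or return;   # token 2
    my ($length)  = $len =~ /^(\d+)\s+BP\.?$/ or return;   # token 7

    return {
        accession  => $acc,
        version    => $version,
        topology   => $topology,
        molecule   => $mol_type,
        data_class => $class,
        division   => $division,
        length     => $length,
    };
}

my $id = parse_embl_id_line(
    'ID   CD789012; SV 4; linear; genomic DNA; HTG; MAM; 500 BP.');
print "$id->{accession}.$id->{version} ($id->{molecule}, $id->{length} bp)\n"
    if $id;
```

A parser that wants to keep reading older files could fall back to its previous ID/SV handling whenever this kind of check returns nothing.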
Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of simon andrews (BI) > Sent: Monday, July 24, 2006 8:34 AM > To: bioperl-l at lists.open-bio.org > Subject: [Bioperl-l] New EMBL format parsing/writing > > I few weeks ago I saw a couple of messages on this list mentioning the > new ID/SV line format used in the latest EMBL release. I'm in the > process of moving our database server over to the new format and was > looking to update SeqIO::embl.pm. > > I'm sure someone said they'd made a patch to fix up parsing of the new > format, but I can't find it either in CVS or bugzilla. > > Rather than do this again myself can someone point me to an updated > SeqIO::embl.pm please? If there isn't one then I'll look into making > the patch myself. > > Since this is such a major change are there any plans to put out a new > release with this fix included? I'm sure this will start to bite more > people as the new format becomes more widely adopted. > > > Cheers > > Simon. > > -- > Simon Andrews PhD > Bioinformatics Group > The Babraham Institute > > simon.andrews at bbsrc.ac.uk > +44 (0) 1223 496463 > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Mon Jul 24 12:37:32 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 24 Jul 2006 11:37:32 -0500 Subject: [Bioperl-l] New EMBL format parsing/writing In-Reply-To: Message-ID: <002001c6af3f$76214490$15327e82@pyrimidine> Great work! Does it support old and new EMBL or only the newest? I don't have a problem with dumping old format support, but if we do we need to note this in POD and elsewhere (wiki, perhaps). 
Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of dwaner at scitegic.com > Sent: Monday, July 24, 2006 11:03 AM > To: bioperl-l at lists.open-bio.org > Subject: [Bioperl-l] New EMBL format parsing/writing > > Simon, > > I have already updated SeqIO::embl.pm to support release 87. All I have > left to do is generate the patch and update the /t test. I will try to > get this submitted to bugzilla today (24 July). > > - David > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Mon Jul 24 14:40:03 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 24 Jul 2006 13:40:03 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C4D0D3.1020506@sendu.me.uk> Message-ID: <002f01c6af50$97242250$15327e82@pyrimidine> I have to do a little catching up on things here; lots of conversation this morning! According to NCBI, the SOURCE line can hold organelle data, an abbreviated version of the scientific name, and the GenBank common name in parentheses. No other information is present. The ORGANISM lines contains the scientific name (NCBI definition) and the lineage, generally only ranked node but not always. I believe it was Nadeem Faruque who indicated that there is some way that NCBI marks the ranks which determines whether or not they appear in the lineage. Here's what Bio::SeqIO::genbank does to get data into and out of GenBank files: ------------------------------------------------------ Bio::SeqIO::genbank in methods next_seq() and _read_GenBank_Species(): 1) Bio::Species acts as a container object 2) The SOURCE data is dumped entirely into common_name() (ughhhh). There is some additional work done as well before instantiating a Bio::Species ; if it is considered an unknown organism there is no Bio::Species object returned. 
We should get rid of that bit; every GenBank SOURCE has a TaxID and therefore has a node, including plasmids and unknowns. There will be no genus/species or anything else set for that group. 3) The ORGANISM name was divided up into genus(), species(), and subspecies(), based on the classification array (again, ughhh). 4) The classification array is split into an array and dumped into classification() 5) No parsing of potential organelle information occurs. None. Zero. Squat. 6) TaxID is grabbed from the 'source' seqfeature and assigned via ncbi_taxid(). We could use this to also grab the organelle, etc. ------------------------------------------------------ Bio::SeqIO::genbank in method write_seq(): 1) SOURCE line : use the common_name data for output, but tag on the subspecies information (?!?!?!). 2) ORGANISM lines : the name is rebuilt from the organelle() (which should be on the SOURCE line) and genus and species, which comes from the classification array (?!?!?!). The classification array is rebuilt from classification() ------------------------------------------------------ Much of this may be cruft from changes in the official GenBank format that we neglected to update. However, I think there's WAY too much hand-wringing about trying to get everything into genus() species() etc without anything more that the (very scant) information in the flatfile, esp. when using the classification array as a basis. The only places where reliable tax information is present in the flatfile are: 1) SOURCE line (organelle, common name, abbreviated name) 2) ORGANISM lines (scientific name, classification array) 3) 'source' seqfeature (strain/variant (!), organelle, TaxID, etc found here). We should assign those accordingly; we could even use the 'source' seqfeature to grab strain, organelle, etc. just like we now do for the TaxID. Beyond that we're really just guessing the ranks and the genus-species names. 
Makes no sense, especially when that is easily available in Bio::Taxonomy using entrez/flatfile. We could have Bio::Taxonomy::Species act as a container for IO purpose, ONLY using the methods in the 'reliable information' list above in Bio::SeqIO::genbank and other SeqIO RichSeqs. Then hold the additional data with warnings attached if a lookup hasn't been run, or not set them at all. Or, use Hilmar's suggestion and force the user to use the db handle and ncbi_taxid() to grab a new Bio::Taxonomy::Node/Species object (based on the rank) which has the correct information. As for the other container get/sets: species(), genus() etc. These methods should be present, but only for species or below (hence Bio::Taxonomy::Species). In a way Bio::Taxonomy::Species is not entirely correct as the sequence file many times the sequence is from an organism at the genus level (unassigned species) or subspecies/strain levels, or is unranked (environmental samples, for instance). All of these seem to have TaxIDs though. Don't think it really matters... We could convert Bio::Species into an abstract interface class (Bio::SpeciesI), moving the implemented methods over to Bio::Taxonomy::Species, and have Bio::Taxonomy::Species implement Bio::Taxonomy::NodeI or Bio::TaxonomyI as well. Bio::Taxonomy::Species could be checked with $obj->isa('Bio::TaxonomyI') && $obj->isa('Bio::SpeciesI') Or, modifying Hilmar's suggestion: |-----Tax::Node NodeI/TaxI -| |-----Tax::Species | SpeciesI -------| So Species doesn't 'contaminate' Node. 
This will allow you to proceed with doing what you want to Bio::Taxonomy::Node; both Node and Species could be checked simultaneously though they need to be changed at some point to implement the same base class, so you could check using : if ($obj->isa('Bio::Taxonomy::NodeI')) { As for getting Bio::SeqIO::genbank to play well with Bio::Taxonomy::Species, all I did was 'clone' the Bio::Taxonomy::Node module into Bio::Taxonomy::Species, removed the warnings in species() and other methods for the time being, and changed the method call for classification() in Bio::SeqIO::genbank to send an array instead of an array_ref. Then I modified the parsing to retain the scientific_name and abbreviated_name (though the latter should go into common_names()). Passed all but one test, where common_name was called and returned the entire SOURCE line (not correct!). Pretty simple, really... BTW, I checked EMBL format, and it is very similar in format to the way GenBank is with the interesting addition of the OG line (for organelle). Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Sendu Bala > Sent: Monday, July 24, 2006 8:53 AM > To: bioperl-l at lists.open-bio.org > Subject: Re: [Bioperl-l] Bio::*Taxonomy* changes > > Chris Fields wrote: > > Bio::SeqIO::genbank works very happily with the current > > Bio::Taxonomy::Node now; if we intend to remove most of the method we > > need to have a similar DB-aware module to house the flatfile data (like > > Bio::Species) yet be capable of working with Bio::Taxonomy (like > Tax::Node). > > Can you give code examples of what Bio::SeqIO::genbank is doing and what > makes it 'happy'? What are the requirements? Would it be as happy > working with a Bio::Taxonomy object? 
> _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Mon Jul 24 15:24:23 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 24 Jul 2006 14:24:23 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C4CFF6.40609@sendu.me.uk> Message-ID: <003c01c6af56$c5fd2df0$15327e82@pyrimidine> > Hilmar Lapp wrote: > > Sounds good to me, except there is no Bio::TaxonomyI yet, > > Indeed, I propose making one. So, Node would implement this, correct? Naming it Bio::TaxonomyI makes me think that Bio::Taxonomy implements TaxonomyI, not that Bio::Taxonomy::Node implements it. ... > Yes, which is why Bio::Taxonomy is appropriate here. Assuming that > Bio::Species isa Bio::TaxonomyI: > > ... > SOURCE Saccharomyces cerevisiae (baker's yeast) > ORGANISM Saccharomyces cerevisiae > Eukaryota; Fungi; Ascomycota; Saccharomycotina; > Saccharomycetes; > Saccharomycetales; Saccharomycetaceae; Saccharomyces. > > ... > > ## the fully-manual way > my $species = new Bio::Species; > my $node = new Bio::Taxonomy::Node(-name => 'Saccharomyces cerevisiae', > -rank => 'species', -object_id => 1, > -parent_id => 2); > my $n2 = new Bio::Taxonomy::Node(-name => 'Saccharomyces', > -object_id => 2, -parent_id => 3); > # (no assumption that 'Saccharomyces' is the genus, so rank() undefined) > my $n3 = [etc] > $species->add_node($node); > $species->add_node($n2); > [etc] Hrmm... why would you add multiple nodes to a species object? A Species is-a Node, not a full Bio::Taxonomy. Taxonomy has-a Node (hence the add_node() method). So, you should be able to add a NodeI-implementing object to a Taxonomy object (either a Node or a Species). Not sure I agree with what you propose here; doesn't seem right... ... 
> We also solve Chris' earlier quandary: > > [ in a world where Bio::Taxonomy::Node and Bio::Taxonomy::SpeciesNode > exist, and given that Bio::DB::Taxonomy* currently directly make Node > objects ] > > The only problem I can foresee is which class to use with > > Bio::DB::Taxonomy*? I guess one could settle on one class by default > and > > have the option to use another Bio::Taxonomy::NodeI-implementing class > if > > you wanted more data/methods available... > > The way to do it is to have the Bio::DB::Taxonomy* modules return only > the information that a Bio::Taxonomy::FactoryI would need to make a > NodeI. The specific Factory that you use could generate whatever type of > Node you wanted. Yes, using an object factory here makes a lot of sense, returning the correct object type based on the rank. ... > Bio::Species differs from Bio::Taxonomy only so it contains all the > legacy methods names that Bio::Species currently has, for backward > compatibility. Setting $species->classification() would delete all nodes > of self, use a GenbankFactory to make a new Bio::Species, then pull out > all its Nodes and add them to self. The idea is to replace Bio::Species with something that works well, so having it implement a Node-like interface works since it is-a Node. Having it implement a Taxonomy-like interface, though, doesn't make a lot of sense as a species is-not-a Taxonomy. It should act just like a fancier node object. Using a factory in Bio::DB::Taxonomy should solve any issues about what object type is returned, since that could simply be made based on the rank itself (species rank or below == Bio::Taxonomy::Species, genus and above == Bio::Taxonomy::Node). > Unless anyone can think of a better way of doing things, I'll explore > the above ideas and start writing code. 
To summarise: major changes to > Bio::DB::Taxonomy* (make them factory slaves), implementation of some > Bio::Taxonomy::FactoryIs, tweak Bio::Taxonomy::FactoryI and make > Bio::TaxonomyI, make Bio::Species a Bio::TaxonomyI. Nope. Don't agree. Sorry. I can't see why you would force a Species to be a Taxonomy when it isn't. The object hierarchy doesn't make sense to me. I would just have a simple interface for Node (NodeI), and either convert Bio::Species to an abstract interface or place its methods in Bio::Taxonomy::Species/SpeciesNode. I like the interface idea as Bio::Taxonomy::Node is-a NodeI only, while Bio::Taxonomy::Species is-a NodeI and SpeciesI; these checks can be run using the UNIVERSAL object method 'isa' when using a Factory. I'll repeat: a Node and a Species is-not-a Taxonomy. A Taxonomy object has-a Node or Species or combinations thereof ; all would be NodeI-implementing. That's the reason that add_node() is there, which could be modified to allow only objects that isa->('Bio::Taxonomy::NodeI') (i.e. a Node or a Species). > Oh, Bio::Taxonomy might need some changes as well. It has a classify() > method does something with a Bio::Species, which would be all wrong in > the new way of doing things. We'll have to make eventual changes to anything referencing Bio::Species to get them to work correctly. Getting the object hierarchy finalized and worked out is priority one. Getting Bio::SeqIO modules switched over to Bio::Taxonomy::Species (pretty commonly used) and making sure that Bio::DB::Taxonomy returns the correct objects from the factory is a close second. Any small issues that pop up along the way can be taken care of when they reveal themselves. 
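The is-a/has-a arrangement argued for above could be sketched in plain Perl roughly as follows. This is a hedged illustration only: the Sketch::* packages are hypothetical stand-ins (not Bio::Taxonomy::Node, Bio::Taxonomy::Species or any real BioPerl interface), the binomial() accessor is invented for the example, and the list of "species or below" ranks is an illustrative assumption.

```perl
use strict;
use warnings;
use Scalar::Util ();

package Sketch::NodeI;       # hypothetical interface: what every node offers
sub name { $_[0]{name} }
sub rank { $_[0]{rank} }

package Sketch::SpeciesI;    # hypothetical interface: species-level extras
sub binomial { $_[0]{name} }

package Sketch::Node;        # is-a NodeI only
our @ISA = ('Sketch::NodeI');
sub new {
    my ($class, %args) = @_;
    return bless { name => $args{name}, rank => $args{rank} }, $class;
}

package Sketch::Species;     # is-a NodeI and is-a SpeciesI
our @ISA = ('Sketch::NodeI', 'Sketch::SpeciesI');
sub new {
    my ($class, %args) = @_;
    return bless { name => $args{name}, rank => $args{rank} }, $class;
}

package Sketch::Factory;     # picks the class from the rank
my %SPECIES_OR_BELOW = map { $_ => 1 } qw(species subspecies varietas forma);
sub create {
    my ($class, %args) = @_;
    my $rank = defined $args{rank} ? $args{rank} : '';
    my $type = $SPECIES_OR_BELOW{$rank} ? 'Sketch::Species' : 'Sketch::Node';
    return $type->new(%args);
}

package Sketch::Taxonomy;    # has-a list of NodeI-implementing objects
sub new {
    my $class = shift;
    return bless { nodes => [] }, $class;
}
sub add_node {
    my ($self, $node) = @_;
    die "add_node() needs a NodeI-implementing object\n"
        unless Scalar::Util::blessed($node) && $node->isa('Sketch::NodeI');
    push @{ $self->{nodes} }, $node;
    return scalar @{ $self->{nodes} };
}

package main;

my $genus = Sketch::Factory->create(name => 'Saccharomyces', rank => 'genus');
my $sp    = Sketch::Factory->create(name => 'Saccharomyces cerevisiae',
                                    rank => 'species');
my $tax = Sketch::Taxonomy->new;
$tax->add_node($_) for $genus, $sp;
print $_->name, ' => ', ref($_), "\n" for $genus, $sp;
```

The point of the sketch is only that a caller can test for capabilities with isa() and never needs to know which concrete class the factory chose.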
Chris

From cjfields at uiuc.edu Mon Jul 24 15:34:55 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Mon, 24 Jul 2006 14:34:55 -0500
Subject: [Bioperl-l] Bio::Species/Bio::Taxonomy changes
In-Reply-To: <2C99E56B-84D2-4C51-BBF1-76BAF81205AB@gmx.net>
Message-ID: <003d01c6af58$3dc4ac40$15327e82@pyrimidine>

> > Maybe the file parser could have its own organelle() method
> > and leave all taxonomic classes without such a method. Or it could
> > stay as is, I don't know.
>
> Like I said above, at the end of the day there needs to be a way to
> qualify a sequence by the genome it is part of.

Agreed. I think Sendu's right in one regard, it doesn't seem to have anything to do with the taxonomy itself. See below... There should be a way of containing this somehow, maybe using a Bio::Annotation::SimpleValue object or having a get/set somehow.

> > Do different organelles in the same species get unique taxonomy ids?
>
> I would have to confirm, but I believe so. As I said, from a genome/
> sequence-centric viewpoint, the organelle and nuclear genomes are two
> different things.

Looks like the organelle sequence data uses the organism TaxID. I couldn't find organelle-specific taxon information using the TaxBrowser for mitochondrion, chloroplast, or plastid.

     source          1..426
                     /organism="Reticulitermes tibialis"
                     /organelle="mitochondrion"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:186107"
                     /haplotype="T9"

TaxID refers to the organism ("Reticulitermes tibialis"), not the mitochondrion.

     source          1..814
                     /organism="Porterinema fluviatile"
                     /organelle="plastid:chloroplast"
                     /mol_type="genomic DNA"
                     /strain="SAG 124.79"
                     /db_xref="taxon:246123"
                     /country="Germany"

TaxID refers to the organism ("Porterinema fluviatile"), not the chloroplast.
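
The point made above, that the taxon ID identifies the organism while the organelle qualifies the sequence, can be illustrated with a small plain-Perl sketch. The parse_qualifiers() helper and the two hashes are hypothetical and exist only for this example; this is not how Bio::SeqIO::genbank parses features, and the regex ignores repeated or multi-line qualifiers.

```perl
use strict;
use warnings;

# Source-feature text from the first record quoted above.
my $source = <<'END';
     source          1..426
                     /organism="Reticulitermes tibialis"
                     /organelle="mitochondrion"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:186107"
                     /haplotype="T9"
END

# Hypothetical helper: collect /key="value" qualifiers into a hash.
# A repeated key would overwrite the earlier value; fine for this sketch.
sub parse_qualifiers {
    my ($text) = @_;
    my %qualifiers;
    while ($text =~ m{/(\w+)="([^"]*)"}g) {
        $qualifiers{$1} = $2;
    }
    return \%qualifiers;
}

my $q = parse_qualifiers($source);
my ($taxid) = $q->{db_xref} =~ /^taxon:(\d+)$/;

# Taxonomy data: describes the organism only.
my %taxon = (
    ncbi_taxid      => $taxid,
    scientific_name => $q->{organism},
);

# Sequence-level annotation: which genome the sequence belongs to.
my %annotation = (organelle => $q->{organelle});

printf "taxon %s (%s); organelle kept as annotation: %s\n",
    $taxon{ncbi_taxid}, $taxon{scientific_name}, $annotation{organelle};
```

Keeping the organelle outside the taxon record means two sequences from the same organism, one nuclear and one mitochondrial, can share a single taxonomy entry.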
Chris From bix at sendu.me.uk Mon Jul 24 15:45:09 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Mon, 24 Jul 2006 20:45:09 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <003c01c6af56$c5fd2df0$15327e82@pyrimidine> References: <003c01c6af56$c5fd2df0$15327e82@pyrimidine> Message-ID: <44C52345.5060903@sendu.me.uk> Chris Fields wrote: >> Hilmar Lapp wrote: >>> Sounds good to me, except there is no Bio::TaxonomyI yet, >> Indeed, I propose making one. > > So, Node would implement this, correct? Naming it Bio::TaxonomyI makes me > think that Bio::Taxonomy implements TaxonomyI, not that Bio::Taxonomy::Node > implements it. No no, I guess the whole rest of you reply was confused by this one point. Bio::TaxonomyI would be the interface for Bio::Taxonomy. Definitely not a Node. >> Yes, which is why Bio::Taxonomy is appropriate here. Assuming that >> Bio::Species isa Bio::TaxonomyI: >> >> ... >> SOURCE Saccharomyces cerevisiae (baker's yeast) >> ORGANISM Saccharomyces cerevisiae >> Eukaryota; Fungi; Ascomycota; Saccharomycotina; >> Saccharomycetes; >> Saccharomycetales; Saccharomycetaceae; Saccharomyces. >> >> ... >> >> ## the fully-manual way >> my $species = new Bio::Species; >> my $node = new Bio::Taxonomy::Node(-name => 'Saccharomyces cerevisiae', >> -rank => 'species', -object_id => 1, >> -parent_id => 2); >> my $n2 = new Bio::Taxonomy::Node(-name => 'Saccharomyces', >> -object_id => 2, -parent_id => 3); >> # (no assumption that 'Saccharomyces' is the genus, so rank() undefined) >> my $n3 = [etc] >> $species->add_node($node); >> $species->add_node($n2); >> [etc] > > > Hrmm... why would you add multiple nodes to a species object? A Species > is-a Node, not a full Bio::Taxonomy. In my proposal, a Bio::Species certainly is a full Bio::Taxonomy. >> Bio::Species differs from Bio::Taxonomy only so it contains all the >> legacy methods names that Bio::Species currently has, for backward >> compatibility. 
Setting $species->classification() would delete all nodes >> of self, use a GenbankFactory to make a new Bio::Species, then pull out >> all its Nodes and add them to self. > > The idea is to replace Bio::Species with something that works well, so > having it implement a Node-like interface works since it is-a Node. Having > it implement a Taxonomy-like interface, though, doesn't make a lot of sense > as a species is-not-a Taxonomy. Right. So this is why we've been 'butting heads'. Up till now I had no idea why you were so adamant about keeping things the old Bio::Taxonomy::Node way. Bio::Species very definitely has never been, nor do we want it to become, a single node of a taxonomy. It has always been a complete taxonomy. You can tell that by the fact it has a classification, and you could ask what its genus is. This is why I'm proposing that Bio::Species become a Bio::Taxonomy. Because that's the correct object model for the kinds of things Bio::Species wants to do. > Using a factory in Bio::DB::Taxonomy should solve any issues about what > object type is returned, since that could simply be made based on the rank > itself (species rank or below == Bio::Taxonomy::Species, genus and above == > Bio::Taxonomy::Node). Frankly, that idea makes me ill. A Node, at the fundamental level, is just a very simple object that needs to associated a taxonomic rank with a scientific name. If you start making different objects for different ranks, you've departed from any semblance of meaning in the object model. > Nope. Don't agree. Sorry. I can't see why you would force a Species to be > a Taxonomy when it isn't. The object hierarchy doesn't make sense to me. Does it make sense now? > I'll repeat: a Node and a Species is-not-a Taxonomy. I'll repeat: A Node is a Node and a Bio::Species is a Taxonomy ;) > A Taxonomy object has-a Node or Species or combinations thereof ; No, a Taxonomy contains Nodes. One of those Nodes might have a rank() of 'species'. 
A Bio::Species contains Nodes. One of those Nodes definitely has a rank() of 'species'. It /must/ have other nodes, because the job of Bio::Species has in the past and will in the future be to store all the other taxonomic levels in a Genbank file. For the same reason Bio::Species can't be a Node itself, because you can't store other Nodes inside a Node.

From cjfields at uiuc.edu Mon Jul 24 15:49:06 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Mon, 24 Jul 2006 14:49:06 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <11A2B917-C633-4806-A6F4-920F02F0BF6E@gmx.net>
Message-ID: <003e01c6af5a$390cdea0$15327e82@pyrimidine>

Yes, 'largely' the key word. I don't really agree with Sendu's hierarchy scheme (making Species implement Taxonomy and not Node doesn't make sense), but, besides that, everything else seems fine. I like the following setup (which is similar to what you proposed, I believe), which I already posted.

            |-----Tax::Node
NodeI-------|
            |-----Tax::SpeciesNode
                |
SpeciesI -------|

Taxonomy::Node is-a NodeI
Taxonomy::SpeciesNode is-a NodeI and-a SpeciesI

Bio::Taxonomy 'has-a' NodeI-implementing module
SeqIO has-a SpeciesI-implementing module

Bio::DB::Taxonomy uses a factory to return NodeI-implementing modules; specifically, a SpeciesNode for species ranks or below, and a Node for anything else.

It would be nice to get this hammered out soon. I think we can actually start work on the Bio::Taxonomy::Node/SpeciesNode split; the interface classes would be easy to add. I could work on getting SeqIO to work with Bio::Taxonomy::SpeciesNode when I can (sometime in the next few weeks). Like I mentioned before, I got Bio::SeqIO::genbank already using it but haven't committed it to CVS until we sorted out the class hierarchy and interface-implementation issues.

I won't be able to add too much more to this for a few weeks, unfortunately. I need to prepare for a conference as well as finish up a ton of bench research. I'll try keeping up though...
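
The factory rule proposed in this message (a SpeciesNode for species rank or below, a Node for anything else) could look roughly like the sketch below. All package names are hypothetical, and the list of ranks counted as "species or below" is an assumption made for the example, not something settled in this thread. Other posters in the thread argue against rank-specific classes, so this shows one proposal only.

```perl
use strict;
use warnings;

# Hypothetical generic node.
package My::Node;
sub new {
    my ($class, %args) = @_;
    return bless { %args }, $class;
}
sub name { $_[0]{name} }
sub rank { $_[0]{rank} }

# Hypothetical species-level node; inherits everything from My::Node.
package My::SpeciesNode;
our @ISA = ('My::Node');

# Hypothetical factory that picks the class from the rank.
package My::RankFactory;

# Assumption: these ranks count as "species or below".
my %SPECIES_OR_BELOW = map { $_ => 1 } qw(species subspecies varietas forma);

sub create_node {
    my ($class, %args) = @_;
    my $rank = defined $args{rank} ? $args{rank} : '';
    my $node_class = $SPECIES_OR_BELOW{$rank} ? 'My::SpeciesNode' : 'My::Node';
    return $node_class->new(%args);
}

package main;

my $sp = My::RankFactory->create_node(
    name => 'Saccharomyces cerevisiae', rank => 'species');
my $ge = My::RankFactory->create_node(
    name => 'Saccharomyces', rank => 'genus');
my $no_rank = My::RankFactory->create_node(name => 'Saccharomycetaceae');

print join(', ', ref($sp), ref($ge), ref($no_rank)), "\n";
```

A node with no rank at all falls through to the generic class, which is the conservative choice when the database does not say what level a name sits at.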
Chris > :-) I think we're largely in agreement. As for node_name() I fully > understand the motivation, but it needs to be understood that the > attribute's value will be based on a largely arbitrary choice unless > it is set directly by the user. > > -hilmar > > On Jul 24, 2006, at 4:45 AM, Sendu Bala wrote: > > > Hilmar Lapp wrote: > >> On Jul 20, 2006, at 9:35 AM, Sendu Bala wrote: > >> > >>> Bio::DB::Taxonomy::flatfile > >>> --------------------------- > >>> [...] > >>> > >>> BEHAVIOUR-CHANGE: flatfile used to store within the nodes it > >>> makes the > >>> division as a three letter code, like 'PRI'. However, for > >>> consistency > >>> with entrez and the scientific_name() of the node the division is > >>> supposed to correspond to, it is now stored as the full name, like > >>> 'Primates'. > >> > >> What about adding a method division_code() which would return the 3- > >> letter abbreviation? > >> > >> The abbreviation may be needed by flat-file writers, so it may be > >> handy to have in some cases. > > > > As far as I know you can't get the 3-letter version via entrez, so no > > other module can really expect to be able to get it, not knowing which > > database (flatfile.pm or entez.pm) the taxonomic information is > > coming from. > > > > But of course it would be somewhat harmless to add division_code() > > anyway. It might be better done as a -code => 1 option to division()? > > > > > >>> The names->id solution also stores the artificially uniqued names > >>> like > >>> 'Craniata ', allowing you for the first time to > >>> retrieve the > >>> correct id. Previously the search would have simply failed > >>> completely. > >>> > >>> The names->id solution now handles nodes with scientific names of > >>> 'xyz > >>> (class)', allowing you to retrieve the id with both get_taxonids > >>> ('xyz') > >>> and get_taxonids('xyz (class)'). Previously only the latter would > >>> work. > >> > >> Should angle brackets be allowed too? > > > > Allowed in what sense? 
You can indeed search for both > > get_taxonids('Craniata ') [returns a single id] and > > get_taxonids('Craniata') [returns multipe ids, one of which is the > > previous answer]. > > > > > >> Maybe there should also be a -names parameter which accepts a hash > >> reference with keys being the kind of name (scientific, common, etc) > >> and the values being array references with the set of names of that > >> kind? > > > > Not sure what you mean. name() has that data structure, though you're > > not supposed to set its hash ref directly. > > > > > >>> or the $node->classification() array. > >> > >> Bio::Taxonomy::Node shouldn't have this attribute. It is legacy > >> brought over from a flawed (because flat) object model in > >> Bio::Species. > > > > Yes, I agree. > > > > > >>> NOTE: entrez modules (and website) cannot cope with '' > >>> in the > >>> query, failing searches like 'Craniata '. For this > >>> reason, if > >>> get_taxonids() is given a query with '' it will > >>> immediately > >>> return undefined, saving a pointless website access. > >> > >> If there is a 'next-best-thing' that is still semantically compatible > >> with the API documentation, I would do that. > >> > >> In this case, if there is a in the query the entrez > >> module should strip it and automatically use the rest for searching. > >> If indeed multiple IDs match there should be a warning to inform the > >> user that entrez cannot use the notation to limit the > >> query results. > > > > I wouldn't like this. I actually had it working this way initially, > > but > > decided that if someone entered 'xyz ' they really didn't > > want multiple ids, expected to get multiple ids with just 'xyz' and > > don't want their query made something else and then be warned about > > it. > > > > > >> In fact, you might as well provide an option to enable an automatic > >> check for the correct branch for each ID if multiple ones are > >> returned. 
I.e., if this option is enabled, the module would > >> automatically query the parent nodes to see if is in the > >> lineage, and if not will remove the respective ID from the result > >> set. The reason you may want to make it optional is because it > >> potentially costs time. (but in reality I'm not sure why a client > >> will not want to enable the option - so maybe this should even be > >> default) > > > > I can certainly add that, it seems like a good idea. I don't, however, > > see any scope for an option at all. What would the option be called? > > -don't_give_me_the_answer_I_actually_want_to_save_time ? Pointless, > > imho. If the user queries 'xyz ' with that option, they're > > just going to have to do for themselves manually what the method would > > have done for them without that option, in order to get the correct > > answer. It'll be slower that way, if anything. So the option would > > actually be called > > - > > don't_give_me_the_answer_I_actually_want_so_I_can_get_it_myself_a_litt > > le_slower > > (!). > > > > > >>> Bio::Taxonomy::Node > >>> ------------------- > >>> [...] > >>> classification() has a proper solution to finding the classification > >>> when the array wasn't manually set. > >>> > >>> # Improvements > >>> BEHAVIOUR-CHANGE: node_name() used to be an alias to name > >>> ('common'). Now > >>> it is an alias to name('scientific'). > >>> NOTE: node_name is what is set when ->new(-name => $name) is set, so > >>> flatfile and entrez and user-created nodes now implicitly associate > >>> the > >>> name of the node they create with its scientific name. > >> > >> I'm not even sure node_name() should just be deprecated. The methods > >> falsely suggests that there is only a single and definitive name for > >> the taxon node. > >> > >> In NCBI reality, this is only true for the scientific name of the > >> node. 
In real reality, many nodes have multiple scientific names - > >> taxonomy isn't static and therefore the scientific naming of nodes > >> isn't either. > > > > For the programmer not using any database but just making up his own > > nodes, I think he needs a node_name() because he may not be thinking > > about anything fancy or realistic. He just want to give his node a > > single name that he invents. node_name() seems like the ideal method > > name to me. > > > > > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From hlapp at gmx.net Mon Jul 24 15:56:02 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Mon, 24 Jul 2006 15:56:02 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <003c01c6af56$c5fd2df0$15327e82@pyrimidine> References: <003c01c6af56$c5fd2df0$15327e82@pyrimidine> Message-ID: <88700A84-B426-4BC7-88F2-D5E793870ADF@gmx.net> On Jul 24, 2006, at 3:24 PM, Chris Fields wrote: > >> Hilmar Lapp wrote: >>> Sounds good to me, except there is no Bio::TaxonomyI yet, >> >> Indeed, I propose making one. > > So, Node would implement this, correct? No - > Naming it Bio::TaxonomyI makes me > think that Bio::Taxonomy implements TaxonomyI, not that > Bio::Taxonomy::Node > implements it. I'd suppose so. >> Yes, which is why Bio::Taxonomy is appropriate here. Assuming that >> Bio::Species isa Bio::TaxonomyI: >> >> ... 
>> SOURCE Saccharomyces cerevisiae (baker's yeast) >> ORGANISM Saccharomyces cerevisiae >> Eukaryota; Fungi; Ascomycota; Saccharomycotina; >> Saccharomycetes; >> Saccharomycetales; Saccharomycetaceae; Saccharomyces. >> >> ... >> >> ## the fully-manual way >> my $species = new Bio::Species; >> my $node = new Bio::Taxonomy::Node(-name => 'Saccharomyces >> cerevisiae', >> -rank => 'species', -object_id >> => 1, >> -parent_id => 2); >> my $n2 = new Bio::Taxonomy::Node(-name => 'Saccharomyces', >> -object_id => 2, -parent_id => 3); >> # (no assumption that 'Saccharomyces' is the genus, so rank() >> undefined) >> my $n3 = [etc] >> $species->add_node($node); >> $species->add_node($n2); >> [etc] > > > Hrmm... why would you add multiple nodes to a species object? A > Species > is-a Node, not a full Bio::Taxonomy. No. See above: Bio::Species is-a Bio::Taxonomy. > Taxonomy has-a Node (hence the > add_node() method). So, you should be able to add a NodeI- > implementing > object to a Taxonomy object (either a Node or a Species). Let's keep Bio::Species and Taxonomy::Node separate. They look like representing something similar but once you look at the Bio::Species API (and a Genbank record) you realize they do not. Bio::Species is more like an entire lineage and the species node all flattened out into one. I'm not sure Bio::Species would need to implement a Bio::TaxonomyI interface; it may as well just use an implementation of it internally. I'm not sure how Sendu wants to design this, but for sure Bio::Taxonomy::Node should not be a Bio::Species, and the reverse should rather be avoided too. >> [..] >> The way to do it is to have the Bio::DB::Taxonomy* modules return >> only >> the information that a Bio::Taxonomy::FactoryI would need to make a >> NodeI. The specific Factory that you use could generate whatever >> type of >> Node you wanted. > > Yes, using an object factory here makes a lot of sense, returning the > correct object type based on the rank. 
Well, I don't think you'd want to create instances of different node classes depending on the rank of the node. However, a particular factory implementation may of course be free to do exactly that. > ... >> Bio::Species differs from Bio::Taxonomy only so it contains all the >> legacy methods names that Bio::Species currently has, for backward >> compatibility. Setting $species->classification() would delete all >> nodes >> of self, use a GenbankFactory to make a new Bio::Species, then >> pull out >> all its Nodes and add them to self. > > The idea is to replace Bio::Species with something that works well, so > having it implement a Node-like interface works since it is-a > Node. Having > it implement a Taxonomy-like interface, though, doesn't make a lot > of sense > as a species is-not-a Taxonomy. It should act just like a fancier > node > object. No, I'd really recommend against muddling up a taxonomy node model with the Bio::Species legacy model. Bio::Species is not a node at all. You may argue it's not a taxonomy either. This is just one more reason for containing the Bio::Species contagious disease of conflating disjoint concepts into one. > > Using a factory in Bio::DB::Taxonomy should solve any issues about > what > object type is returned, since that could simply be made based on > the rank > itself (species rank or below == Bio::Taxonomy::Species, genus and > above == > Bio::Taxonomy::Node). Bio::Taxonomy::Species was an invention of mine and - if created - should not be used for anything else other than representing a taxonomy node as a Bio::Species object iff necessary (i.e., if the client really wants a Bio::Species object). I'd actually like to see what Sendu would come up with. It sounds at the very minimum like an excellent start. 
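
The alternative view in this part of the thread, where a Bio::Species-like object holds a whole lineage of nodes rather than being a node, can be sketched as follows. My::Node and My::Lineage are hypothetical names, the constructor arguments simply mirror the -name/-rank/-object_id/-parent_id style used in the quoted example, and the rank given to 'Saccharomyces' is set by hand here purely so the genus lookup has something to find.

```perl
use strict;
use warnings;

# Hypothetical node: a name, an optional rank, and parent/child ids.
package My::Node;
sub new {
    my ($class, %args) = @_;
    return bless {
        name      => $args{-name},
        rank      => $args{-rank},
        id        => $args{-object_id},
        parent_id => $args{-parent_id},
    }, $class;
}
sub name      { $_[0]{name} }
sub rank      { $_[0]{rank} }
sub id        { $_[0]{id} }
sub parent_id { $_[0]{parent_id} }

# Hypothetical container standing in for a lineage-holding species object.
package My::Lineage;
sub new { return bless { by_id => {} }, shift }

sub add_node {
    my ($self, $node) = @_;
    $self->{by_id}{ $node->id } = $node;
    return $self;
}

# Find the first stored node with the given rank, if any.
sub node_for_rank {
    my ($self, $rank) = @_;
    for my $node (values %{ $self->{by_id} }) {
        return $node if defined $node->rank && $node->rank eq $rank;
    }
    return;
}

# Walk parent links upward from the species node, collecting names.
sub classification {
    my ($self) = @_;
    my $node = $self->node_for_rank('species') or return;
    my (@names, %seen);
    while ($node && !$seen{ $node->id }++) {
        push @names, $node->name;
        my $pid = $node->parent_id;
        $node = defined $pid ? $self->{by_id}{$pid} : undef;
    }
    return @names;
}

package main;

my $lineage = My::Lineage->new;
$lineage->add_node(My::Node->new(-name => 'Saccharomyces cerevisiae',
                                 -rank => 'species',
                                 -object_id => 1, -parent_id => 2));
$lineage->add_node(My::Node->new(-name => 'Saccharomyces', -rank => 'genus',
                                 -object_id => 2, -parent_id => 3));
$lineage->add_node(My::Node->new(-name => 'Saccharomycetaceae',
                                 -object_id => 3));

my @classification = $lineage->classification;
my $genus          = $lineage->node_for_rank('genus');

print join('; ', @classification), "\n";
print 'genus: ', $genus->name, "\n";
```

In this model the legacy classification() and genus() calls become queries over the contained nodes, so no node needs to carry a flattened lineage array of its own.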
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at gmx.net Mon Jul 24 15:59:10 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Mon, 24 Jul 2006 15:59:10 -0400 Subject: [Bioperl-l] Bio::Species/Bio::Taxonomy changes In-Reply-To: <003d01c6af58$3dc4ac40$15327e82@pyrimidine> References: <003d01c6af58$3dc4ac40$15327e82@pyrimidine> Message-ID: <3C520B8C-8755-4A7E-80CF-8B94FEAB867E@gmx.net> On Jul 24, 2006, at 3:34 PM, Chris Fields wrote: > Looks like the organelle sequence data uses the organism TaxID. Then you might as well store it as annotation. Really the only thing that matters is that the flat file writers can get from an expected location. In fact storing as annotation is better e.g. for Biosql since right now the taxonomy model is the NCBI model and so organelle will not be stored (and hence neither be round-tripped). -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From cjfields at uiuc.edu Mon Jul 24 16:10:20 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 24 Jul 2006 15:10:20 -0500 Subject: [Bioperl-l] Bio::Species/Bio::Taxonomy changes In-Reply-To: <3C520B8C-8755-4A7E-80CF-8B94FEAB867E@gmx.net> Message-ID: <000001c6af5d$3094b830$15327e82@pyrimidine> Sounds good. Will be easy to change this over. Chris > -----Original Message----- > From: Hilmar Lapp [mailto:hlapp at gmx.net] > Sent: Monday, July 24, 2006 2:59 PM > To: Chris Fields > Cc: 'Sendu Bala'; bioperl-l at lists.open-bio.org > Subject: Re: [Bioperl-l] Bio::Species/Bio::Taxonomy changes > > > On Jul 24, 2006, at 3:34 PM, Chris Fields wrote: > > > Looks like the organelle sequence data uses the organism TaxID. > > Then you might as well store it as annotation. 
Really the only thing > that matters is that the flat file writers can get from an expected > location. > > In fact storing as annotation is better e.g. for Biosql since right > now the taxonomy model is the NCBI model and so organelle will not be > stored (and hence neither be round-tripped). > > -hilmar > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > From hlapp at gmx.net Mon Jul 24 16:12:39 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Mon, 24 Jul 2006 16:12:39 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <003e01c6af5a$390cdea0$15327e82@pyrimidine> References: <003e01c6af5a$390cdea0$15327e82@pyrimidine> Message-ID: <5FB07071-42D7-4F43-B2A1-3AF5F1FC5193@gmx.net> On Jul 24, 2006, at 3:49 PM, Chris Fields wrote: > Yes, 'largely' the key word. I don't really agree with Sendu's > hierarchy > scheme (making Species implement Taxonomy and not Node doesn't make > sense), > but, besides that, everything else seems fine. I like the > following setup > (which is similar to what you proposed, I believe), which I already > posted. > > |-----Tax::Node > NodeI-------| > |-----Tax::SpeciesNode > | > SpeciesI -------| > > Taxonomy::Node is-a NodeI > Taxonomy::SpeciesNode is-a NodeI and-a SpeciesI I don't even think we would need SpeciesI - why would a species- ranked taxonomy node be so different from any other node such that it would need its own interface. Chris - just one suggestion: take a step back and imagine a Bioperl in which Bio::Species had never existed. Instead, only taxonomy nodes existed, and code that can effectively deal with them, including filtering by rank. In this picture, what would you make to want to introduce SpeciesI and Bio::Species? Frankly, I don't see anything. 
I.e., the only reason is backward compatibility (which is a valid reason), but let's not glorify Bio::Species by adding ill-conceived interfaces. > > Bio::DB::Taxonomy uses a factory to return NodeI-implementing modules; > specifically, a SpeciesNode for species ranks or below, and a Node for > anything else. Like I said before, SpeciesNode or whatever it's called would draw its right of existence solely from backward compatibility - don't use it for anything else. And if you can achieve backward compatibility by other means, don't even create a SpeciesNode. My $0.02 ... -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From cjfields at uiuc.edu Mon Jul 24 17:34:29 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 24 Jul 2006 16:34:29 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <5FB07071-42D7-4F43-B2A1-3AF5F1FC5193@gmx.net> Message-ID: <000101c6af68$f27521a0$15327e82@pyrimidine> > I don't even think we would need SpeciesI - why would a species- > ranked taxonomy node be so different from any other node such that it > would need its own interface. > > Chris - just one suggestion: take a step back and imagine a Bioperl > in which Bio::Species had never existed. Instead, only taxonomy nodes > existed, and code that can effectively deal with them, including > filtering by rank. In this picture, what would you make to want to > introduce SpeciesI and Bio::Species? Argh!!! Just when I thought I could pull away... Okay. I thought it would be nice to have a class that could accomplish two things: 1) Act as a container for GenBank taxonomy information; Bio::Taxonomy::Node, as written by Jason, was meant to be a replacement for Bio::Species. 2) Also act as a bridge, so you had the option to retrieve the Species object from a sequence object and have it act like a Node (be db-aware out-of-the-box, so to speak). 
Also, I'm trying to follow the original idea as proposed by Jason (this is from perldoc Bio::Taxonomy::Node): DESCRIPTION This is the next generation (for Bioperl) of representing Taxonomy information. Previously all information was managed by a single object called Bio::Species. This new implementation allows representation of the intermediate nodes not just the species nodes and can relate their connections. Which, to me, indicated that this would eventually replace Bio::Species (so, in effect, must at least contain the relevant data for sequence objects w/o being completely reliant on DB, yet still be DB-aware). Everything about Bio::Species on the wiki also leads me to believe that this was the original intent for Bio::Taxonomy::Node. http://www.bioperl.org/wiki/Module:Bio::Species http://www.bioperl.org/wiki/Project_priority_list#Taxonomy_.2F_Species_data And all the original methods (genus(), species(), etc.) also seem to indicate this. That's really it. I could give a toss about getting taxonomy information directly from Bio::Species. And you're right: in hindsight Bio::Species is flawed. However, it seemed from the beginning of this discussion with Sendu and the proposed changes, that Bio::Species should stick around in some capacity but should also be involved with Bio::Taxonomy (contrary to Jason's idea above). Now I'm hearing something completely different (Sendu still argues that it should be involved). I had originally wanted to start delegating everything over to Taxonomy::Node about a month ago, when I found that it was remarkably easy to do so. However, when Sendu proposed making changes to remove methods in Bio::Taxonomy::Node and make sweeping changes to Taxonomy which would prevent an easy transition over to Node, I felt that it would be harder to effectively have it take over for Bio::Species when parsing SeqIO objects (all the calls to genus/species/subspecies etc methods would have to be removed from all the classes which use Bio::Species). 
Hence Bio::Taxonomy::Species as a compromise.

Now it turns out no one wants to have either Bio::Species (your 'contagion' references clue me in there) or Bio::Taxonomy::Species. If we think it would be better to completely toss all this out the window and use only a bare-bones Node, then I'm fine with that. But if we go that route we should just get rid of the Bio::Species 'disease' completely and have things be much simpler. Simple is good!

I think Node can still act as a viable container class for the tax data from a GenBank file (its original purpose) as long as it has the very basic methods for doing so. That would require:

scientific_name() - ORGANISM line data
common_names() - which could hold common names (in parentheses on the SOURCE line) and the abbreviated name (from the SOURCE line)
ncbi_taxid() - from the 'source' seqfeature (already there).

The lineage information and organelle information could be stored in Node or in SimpleValue objects. My vote is for the latter as there's no need for a classification() container for Node, which you have repeatedly pointed out.

> Frankly, I don't see anything. I.e., the only reason is backward
> compatibility (which is a valid reason), but let's not glorify
> Bio::Species by adding ill-conceived interfaces.

I think we should just get rid of Bio::Species completely. We would need to go in and rework species parsing in the SeqIO modules that use Bio::Species, but that would only make things simpler, not more complex. Get rid of trying to figure out what is a genus or species based on the GenBank information only, and have the bridge between the sequences be stored in a Taxonomy::Node object (which should contain the NCBI TaxID, so then it can use the associated DB object to traverse up and down other nodes).

The interface idea was a proposed compromise, i.e. my 'bridge' between GenBank taxonomy hell and Bio::Taxonomy bliss, and intended to follow what I thought was Jason's original intent for Bio::Taxonomy::Node.
Nothing more.

> > Bio::DB::Taxonomy uses a factory to return NodeI-implementing modules;
> > specifically, a SpeciesNode for species ranks or below, and a Node for
> > anything else.
>
> Like I said before, SpeciesNode or whatever it's called would draw
> its right of existence solely from backward compatibility - don't use
> it for anything else. And if you can achieve backward compatibility
> by other means, don't even create a SpeciesNode.

Agreed. But, if there is such venom towards Bio::Species, why not put it out of its misery as well? Seems like it has outlived its usefulness.

Chris

From cjfields at uiuc.edu Mon Jul 24 17:53:46 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Mon, 24 Jul 2006 16:53:46 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44C52345.5060903@sendu.me.uk>
Message-ID: <000201c6af6b$a4534580$15327e82@pyrimidine>

> > I'll repeat: a Node and a Species is-not-a Taxonomy.
>
> I'll repeat: A Node is a Node and a Bio::Species is a Taxonomy ;)

Nope. I think this is incorrect. Here's why. Let's look at the reasons Bio::Taxonomy was started, shall we?

From perldoc Bio::Taxonomy:

DESCRIPTION
    Bio::Taxonomy object represents any rank-level in taxonomy system,
    rather than Bio::Species which is able to represent only species-level.
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

From perldoc Bio::Taxonomy::Node:

DESCRIPTION
    This is the next generation (for Bioperl) of representing Taxonomy
    information. Previously all information was managed by a single object
    called Bio::Species. This new implementation allows representation of
    the intermediate nodes not just the species nodes and can relate their
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    connections.

Bioperl wiki:

http://www.bioperl.org/wiki/Project_priority_list#Taxonomy_.2F_Species_data
http://www.bioperl.org/wiki/Module:Bio::Species

Both talk about delegating or replacing Bio::Species with Bio::Taxonomy::Node.
Every one of those indicates what the original idea for Bio::Taxonomy::Node was (eventual replacement for Bio::Species). Even the original methods for Bio::Taxonomy::Node are the same. So, according to this alone, Bio::Species would eventually be replaced by Bio::Taxonomy::Node.

I wanted an easier transition to Node from Bio::Species (hell, just a few changes and using Bio::Taxonomy::Node worked fine!), but your proposals made sense. I saw having a Species-based Tax object as a nice compromise, but Hilmar has made a few good points: would we have a Bio::Species object around knowing what we know now? When Bio::Species was originally designed, it was probably before the NCBI Tax database existed. I think it has outlasted its current use.

I have posted a response to Hilmar. I think we should just get rid of Bio::Species altogether and have a Taxonomy::Node contain the basic data (scientific_name(), common_names(), etc). And remove any SeqIO parsing of genus/species to simplify everything. All this extra parsing and hand-wringing over trying to get species/genus information from a GenBank file just mucks up ORGANISM and SOURCE line parsing anyway. Simplify it. Simple is good.

Radical? Yes, but I agree with him that Bio::Species has outlasted its use.

As for organelle and lineage information, they could be placed in SimpleValue objects. If anyone wants to grab tax information, they can use the Node object to get it but they'll need a local flatfile database or network connection to do so.

This also means there is no need for a Bio::DB::Taxonomy factory: just return Node objects directly. Each format (flatfile and entrez) currently works this way anyway, correct? Simplifies that. Simple is better.
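
As a rough illustration of the bare-bones approach suggested above, the sketch below pulls only the scientific name and common name out of the SOURCE/ORGANISM block and keeps the lineage as a separate simple value, with no attempt to guess genus or species. The %node and %annotation hashes and the regexes are assumptions made for this example; this is not the Bio::SeqIO::genbank parser, and the taxon ID is left undefined because it would come from the 'source' feature, which is not part of this snippet.

```perl
use strict;
use warnings;

# SOURCE/ORGANISM block as quoted earlier in the thread.
my $record = <<'END';
SOURCE      Saccharomyces cerevisiae (baker's yeast)
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota; Saccharomycotina;
            Saccharomycetes; Saccharomycetales; Saccharomycetaceae;
            Saccharomyces.
END

# Common name: the text in parentheses on the SOURCE line, if present.
my ($common) = $record =~ /^SOURCE\s+.*?\(([^)]+)\)\s*$/m;

# Scientific name: rest of the ORGANISM line; lineage: everything after it.
my ($scientific, $lineage_text) =
    $record =~ /^\s*ORGANISM\s+(.+?)\n(.*)\z/ms;

$lineage_text =~ s/^\s+//;       # drop leading indentation
$lineage_text =~ s/\.\s*\z//;    # drop the closing period
my @lineage = split /\s*;\s*/, $lineage_text;

# Minimal node data: no genus(), species() or classification() guessing.
my %node = (
    scientific_name => $scientific,
    common_names    => [ defined $common ? $common : () ],
    ncbi_taxid      => undef,    # would come from /db_xref="taxon:..."
);

# Lineage kept outside the node, as a simple annotation value.
my %annotation = (lineage => \@lineage);

printf "%s (%s), %d lineage entries, last: %s\n",
    $node{scientific_name}, $node{common_names}[0],
    scalar @lineage, $lineage[-1];
```

Nothing here decides which lineage entry is the genus; that question is left to a taxonomy database lookup keyed on the taxon ID.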
Of course, we couldn't get rid of Bio::Species until all the following were shifted over to Node somehow:

Instances: 2 BP Module : Bio::Cluster::SequenceFamily
Instances: 4 BP Module : Bio::Cluster::UniGene
Instances: 1 BP Module : Bio::Cluster::UniGeneI
Instances: 1 BP Module : Bio::DB::FileCache
Instances: 3 BP Module : Bio::DB::GFF::Segment
Instances: 1 BP Module : Bio::DB::Taxonomy::flatfile
Instances: 2 BP Module : Bio::Graph::IO::psi_xml
Instances: 1 BP Module : Bio::Map::CytoMap
Instances: 1 BP Module : Bio::Map::LinkageMap
Instances: 3 BP Module : Bio::Map::MapI
Instances: 3 BP Module : Bio::Map::SimpleMap
Instances: 3 BP Module : Bio::Matrix::PSM::InstanceSite
Instances: 6 BP Module : Bio::Phenotype::Correlate
Instances: 1 BP Module : Bio::Phenotype::OMIM::OMIMentry
Instances: 3 BP Module : Bio::Phenotype::OMIM::OMIMparser
Instances: 5 BP Module : Bio::Phenotype::Phenotype
Instances: 2 BP Module : Bio::Phenotype::PhenotypeI
Instances: 4 BP Module : Bio::Seq
Instances: 3 BP Module : Bio::SeqI
Instances: 2 BP Module : Bio::SeqIO::agave
Instances: 4 BP Module : Bio::SeqIO::bsml
Instances: 2 BP Module : Bio::SeqIO::bsml_sax
Instances: 1 BP Module : Bio::SeqIO::chadoxml
Instances: 1 BP Module : Bio::SeqIO::chaos
Instances: 4 BP Module : Bio::SeqIO::embl
Instances: 2 BP Module : Bio::SeqIO::entrezgene
Instances: 3 BP Module : Bio::SeqIO::game::seqHandler
Instances: 4 BP Module : Bio::SeqIO::genbank
Instances: 2 BP Module : Bio::SeqIO::kegg
Instances: 2 BP Module : Bio::SeqIO::locuslink
Instances: 4 BP Module : Bio::SeqIO::swiss
Instances: 2 BP Module : Bio::SeqIO::table
Instances: 2 BP Module : Bio::SeqIO::tigr
Instances: 2 BP Module : Bio::SeqIO::tigrxml
Instances: 7 BP Module : Bio::SeqIO::tinyseq
Instances: 4 BP Module : Bio::Taxonomy
Instances: 1 BP Module : Bio::Taxonomy::Node
Instances: 6 BP Module : Bio::Taxonomy::Taxon
Instances: 9 BP Module : Bio::Taxonomy::Tree
Instances: 5 BP Module : Bio::Tools::Analysis::Protein::ELM

Chris

From bix at
sendu.me.uk Mon Jul 24 18:15:31 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Mon, 24 Jul 2006 23:15:31 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <000101c6af68$f27521a0$15327e82@pyrimidine> References: <000101c6af68$f27521a0$15327e82@pyrimidine> Message-ID: <44C54683.70707@sendu.me.uk> Chris Fields wrote: > > Also, I'm trying to follow the original idea as proposed by Jason (this is > from perldoc Bio::Taxonomy::Node): > > Which, to me, indicated that this would eventually replace Bio::Species Well, we don't really know that Jason didn't later change his mind, but in any case it doesn't make sense (anymore, given that we have Bio::Taxonomy). In a direct reply to me you point out specific passages in the current docs that explain why you have thought we should delegate or replace Bio::Species with Bio::Taxonomy::Node. With respect, the old plans are not something we are forced to blindly follow. We decide for ourselves if they make sense, we decide for ourselves if there is a better way of doing it, and then we do it the best way. So if you ignore what those old bits of documentation say, just pretend you never ever read them, would my proposals make sense or not? Since those old proposals were never implemented we have no reason to try and stick with them if there is a better proposal. And for the record, '...Bio::Species which is able to represent only species-level' can (correctly) be interpreted as 'Bio::Species is only supposed to be used for representing a taxonomy that includes the species-level'. You can't interpret it literally because Bio::Species is used for levels below species, and also represents all the levels above species-level as well. Either Jason got it wrong when he wrote that, or you have misinterpreted it. Likewise, let's play the interpretation game again: 'Previously all information was managed by a single object called Bio::Species. 
[the Bio::Taxonomy::Node] implementation allows representation of the intermediate nodes not just the species nodes'. Note the apposition of 'single object' vs implication of multiple Node objects to do the same job. I imagine at the time Jason wrote that there was no Bio::Taxonomy, no holder for multiple Nodes. > I had originally wanted to start delegating everything over to > Taxonomy::Node about a month ago, when I found that it was remarkably easy > to do so. However, when Sendu proposed making changes to remove methods in > Bio::Taxonomy::Node and make sweeping changes to Taxonomy which would > prevent an easy transition over to Node, But an equally easy transition to Bio::Taxonomy instead. I don't know why you would care about the name of the class we switch to. My concern is that when the switch is made it makes sense. > If we think it would be better to completely toss all this out the window > and use only a bare-bones Node, then I'm fine with that. But if we go that > route we should just get rid of the Bio::Species 'disease' completely and > have things be much simpler. Simple is good! > > I think Node can still act as a viable container class for the tax data from > a GenBank file (it's original purpose) as long as it has the very basic > methods for doing so. That would require: > > scientific_name() - ORGANISM line data > common_names() - which could hold common names (in parentheses on the SOURCE > line) and the abbreviated name (from the SOURCE line) > ncbi_taxid() - from the 'source' seqfeature (already there). > > The lineage information and organelle information could be stored in Node or > in SimpleValue objects. My vote is for the latter as there's no need for a > classification() container for Node, which you have repeatedly pointed out. No, this is the whole point. 
The lineage information can NOT be stored in a Node (unless you abuse Node by having all those crufty methods like genus() and classification()), and why would we store it in SimpleValue objects when we have Bio::Taxonomy?

Bio::Taxonomy is completely perfect for storing the taxonomic information from a GenBank file. That's all you need to worry about. Can we represent the data correctly? Yes. Do we gain all the good things about a pure Bio::Taxonomy? Yes. Can we still do everything we used to be able to do? Yes.

> I think we should just get rid of Bio::Species completely.

There's no need to get rid of Bio::Species. It can be a Bio::Taxonomy with backward-compatible methods. No harm done, all good.

I'll tell you what. This will be easier if I just write the code for my proposals, including whatever changes would be needed in Bio::SeqIO::genbank et al. You'll see how easy and appropriate it is, and hopefully everyone will be happy.

Perhaps you could just hold off doing any similar-but-contradictory work until then.

From hlapp at gmx.net Mon Jul 24 19:47:10 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Mon, 24 Jul 2006 19:47:10 -0400
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44C54683.70707@sendu.me.uk>
References: <000101c6af68$f27521a0$15327e82@pyrimidine> <44C54683.70707@sendu.me.uk>
Message-ID: <86C97545-8DDF-4FE3-9CE9-1799D8BD2EB5@gmx.net>

On Jul 24, 2006, at 6:15 PM, Sendu Bala wrote:

> I'll tell you what. This will be easier if I just write the code for my
> proposals, including whatever changes would be needed in
> Bio::SeqIO::genbank et al.

Never get in the way of somebody who threatens to code :-) so I certainly won't. I think you're on the right track.

My suggestion is, if you have a good picture in front of you of what it's going to look like when done, just pretend for a second it is done already and give us some code examples that use the new (to be done) API.
As a start, some of the situations it's currently used in:

- genbank.pm parsing and setting species information for the sequence
- user asking for the scientific name of the species of the sequence (obviously, the call would remain unchanged: $seq->species->binomial(). But what happens behind the scenes?)
- genbank.pm writing the SOURCE information for a sequence

Replace genbank.pm with your rich annotation source parser of choice.

Then maybe some advanced uses:

- from a sequence stream, retain only those of primates
- like above, but only mitochondrial sequences
- for an organism, query entrez for all sequences of strains, varieties, or subspecies sequences for that organism

Add your own if these sound stupid ...

Just an idea.

-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From cjfields at uiuc.edu Mon Jul 24 22:06:16 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Mon, 24 Jul 2006 21:06:16 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <86C97545-8DDF-4FE3-9CE9-1799D8BD2EB5@gmx.net>
References: <000101c6af68$f27521a0$15327e82@pyrimidine> <44C54683.70707@sendu.me.uk> <86C97545-8DDF-4FE3-9CE9-1799D8BD2EB5@gmx.net>
Message-ID: <4678548F-ABEC-4E14-AD7F-D282D2DC2730@uiuc.edu>

>> I'll tell you what. This will be easier if I just write the code for my
>> proposals, including whatever changes would be needed in
>> Bio::SeqIO::genbank et al.
>
> Never get in the way of somebody who threatens to code :-) so I
> certainly won't. I think you're on the right track.

Fine by me.

My only request: I don't want every sequence passing through SeqIO having an automatic DB lookup performed on it. SeqIO parsing of GenBank files is slow enough as it is w/o enforcing lookups, even if they are cached. If you want lookups, have it as an option and not as default behavior.
We could have the option for a lookup added pretty easily in genbank.pm _initialize or the main SeqIO constructor as a simple Boolean flag. That might be pretty nice. ... > (). But what happens behind the scene?) > - genbank.pm writing the SOURCE information for a sequence You know, the only really divisive point here is the lineage data and how to store it in _read_GenBank_Species or reproduce it in write_seq (). Again, I don't think we should have a forced lookup for this; it should just be stored as is, either in Node or SimpleValue. Again, I think the latter as everyone seems averse to containing this in Node. > Then maybe some advanced uses: > > - from a sequence stream, retain only those of primates > - like above, but only mitochondrial sequences > - for an organism, query entrez for all sequences of strains, > varieties, or subspecies sequences for that organism For the primate example, would you screen those out via the in-file lineage or using lookups? Something like '$seqout->write_seq($seq) if ($seq->species->organelle eq 'mitochondrion');' for the mitochondria example, which would mean leaving organelle() in Species/Node or whatever is used. The last one, I think, can be done w/o using the sequence directly using NCBI's ELink and the TaxID to cross-reference the nucleotide database. You would probably have to walk through all child nodes, but it's feasible that way. > Add your own if these sound stupid ... > > Just an idea. > > -hilmar > > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. 
Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From cjfields at uiuc.edu Mon Jul 24 22:29:57 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 24 Jul 2006 21:29:57 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C54683.70707@sendu.me.uk> References: <000101c6af68$f27521a0$15327e82@pyrimidine> <44C54683.70707@sendu.me.uk> Message-ID: <6E129BAF-08E3-4BDA-8D1C-F853BE7231C9@uiuc.edu> Look, we're just going back and forth on this stupid little thing, when the only point we really are divided on is what object type we should store certain items in a GenBank file (Bio::Species/ Bio::Tax::Node/Bio::Whatever). In particular, the main sticking point is the lineage. We could go back and forth on what Jason really intended. Personally, I think his past statements are quite clear on what his intent was (he's very clear in the wiki on what Bio::Taxonomy::Node was built to replace, in two separate posts and within the last four months). The reality is he's not here and you're willing to do the job. There is one thing I will make perfectly clear here: there should never, ever be enforced lookups for SeqIO (even using caches), though I have no problem having optional ones. This is something I have stated before and what you propose below steers dangerously in that direction. Where, for instance, do you store the lineage from a GenBank file? Do you want to do a series of Tax lookups to restore that data? I think that the number one complaint for sequence parsing is speed, which would only get slower with lookups (even cached). What I propose is we make it as simple as possible. 
Remove the unnecessary genus/species/subspecies parsing in genbank.pm; store the scientific name, common names, and lineage in some easily accessible way so that everyday users can get at them; and tie it to Bio::Taxonomy in some way (I propose Node, as it contains almost all the methods needed) so that you could retrieve more information by moving up and down nodes. I, personally, don't see the point in having Bio::Species around after this discussion, as Node seems to do the job adequately.

My last word (I will be exiting this discussion and the group for two weeks):

This would have been MUCH easier if all three of us could have gone to the local bar for a beer and discussed it. We should just take the time out to videoconference next time.

Chris

> Chris Fields wrote:
>>
>> Also, I'm trying to follow the original idea as proposed by Jason
>> (this is from perldoc Bio::Taxonomy::Node):
>>
>> Which, to me, indicated that this would eventually replace
>> Bio::Species
>
> Well, we don't really know that Jason didn't later change his mind, but
> in any case it doesn't make sense (anymore, given that we have
> Bio::Taxonomy).
>
> In a direct reply to me you point out specific passages in the current
> docs that explain why you have thought we should delegate or replace
> Bio::Species with Bio::Taxonomy::Node. With respect, the old plans are
> not something we are forced to blindly follow. We decide for ourselves
> if they make sense, we decide for ourselves if there is a better way of
> doing it, and then we do it the best way.
>
> So if you ignore what those old bits of documentation say, just pretend
> you never ever read them, would my proposals make sense or not? Since
> those old proposals were never implemented we have no reason to try and
> stick with them if there is a better proposal.
> > And for the record, '...Bio::Species which is able to represent only > species-level' can (correctly) be interpreted as 'Bio::Species is only > supposed to be used for representing a taxonomy that includes the > species-level'. You can't interpret it literally because > Bio::Species is > used for levels below species, and also represents all the levels > above > species-level as well. Either Jason got it wrong when he wrote > that, or > you have misinterpreted it. > > Likewise, let's play the interpretation game again: 'Previously all > information was managed by a single object called Bio::Species. [the > Bio::Taxonomy::Node] implementation allows representation of the > intermediate nodes not just the species nodes'. Note the apposition of > 'single object' vs implication of multiple Node objects to do the same > job. I imagine at the time Jason wrote that there was no > Bio::Taxonomy, > no holder for multiple Nodes. > > >> I had originally wanted to start delegating everything over to >> Taxonomy::Node about a month ago, when I found that it was >> remarkably easy >> to do so. However, when Sendu proposed making changes to remove >> methods in >> Bio::Taxonomy::Node and make sweeping changes to Taxonomy which would >> prevent an easy transition over to Node, > > But an equally easy transition to Bio::Taxonomy instead. I don't know > why you would care about the name of the class we switch to. My > concern > is that when the switch is made it makes sense. > > >> If we think it would be better to completely toss all this out the >> window >> and use only a bare-bones Node, then I'm fine with that. But if >> we go that >> route we should just get rid of the Bio::Species 'disease' >> completely and >> have things be much simpler. Simple is good! >> >> I think Node can still act as a viable container class for the tax >> data from >> a GenBank file (it's original purpose) as long as it has the very >> basic >> methods for doing so. 
That would require: >> >> scientific_name() - ORGANISM line data >> common_names() - which could hold common names (in parentheses on >> the SOURCE >> line) and the abbreviated name (from the SOURCE line) >> ncbi_taxid() - from the 'source' seqfeature (already there). >> >> The lineage information and organelle information could be stored >> in Node or >> in SimpleValue objects. My vote is for the latter as there's no >> need for a >> classification() container for Node, which you have repeatedly >> pointed out. > > No, this is the whole point. The lineage information can NOT be stored > in a Node (unless you absuse Node by having all those crufty methods > like genus() and classification()), and why would we store it in > SimpleValue objects when we have Bio::Taxonomy? > > Bio::Taxonomy is completely perfect for storing the taxonomic > information from a GenBank file. That's all you need to worry > about. Can > we represent the data correctly? Yes. Do we gain all the good things > about a pure Bio::Taxonomy? Yes. Can we still do everything we used to > be able to do? Yes. > > >> I think we should just get rid of Bio::Species completely. > > There's no need to get rid of Bio::Species. It can be a Bio::Taxonomy > with backward-compatible methods. No harm done, all good. > > > I'll tell you what. This will be easier if I just write the code > for my > proposals, including whatever changes would be needed in > Bio::SeqIO::genbank et al. You'll see how easy and appropriate it is, > and hopefully everyone will be happy. > > Perhaps you could just hold off doing any similar-but-contradictory > work > until then. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. 
Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From hlapp at gmx.net Mon Jul 24 23:31:41 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Mon, 24 Jul 2006 23:31:41 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <6E129BAF-08E3-4BDA-8D1C-F853BE7231C9@uiuc.edu> References: <000101c6af68$f27521a0$15327e82@pyrimidine> <44C54683.70707@sendu.me.uk> <6E129BAF-08E3-4BDA-8D1C-F853BE7231C9@uiuc.edu> Message-ID: <49AA1351-5741-4E19-AFCE-EEA5532531B6@gmx.net> On Jul 24, 2006, at 10:29 PM, Chris Fields wrote: > [...] > We could go back and forth on what Jason really intended. [...] The > reality is he's not here and you're willing to do the job. Right. And, knowing Jason, I think he'd be perfectly fine with seeing his original idea develop in a possibly different direction, provided it will all work nicely in the end. I'm willing to take the beating on me if that doesn't turn out to be true ... > > There is one thing I will make perfectly clear here: there should > never, ever be enforced lookups for SeqIO (even using caches), You certainly don't want taxonomy lookups during the parsing stage, and also not for the client requesting properties of the species that have been parsed with high confidence, i.e., genus and species for a straightforward binomial like 'Homo sapiens'. Writing sequences, IMHO, doesn't have to be as fast. It may be better to emit strict format a bit slower rather than sloppy format a bit faster. Upon parsing, one idea could be for the flat file parser to set a dirty bit in the parsed out species if the parsed text didn't follow strict binomial conventions, hence the parser may have made a mistake and if a client requests the information it is better to lookup the correct values from a taxonomy database. I.e., you could try with a strict regex first that would imply a high-confidence result. If that fails you don't give up but mark the result as untrustworthy. > [...] 
> This would have been MUCH easier if all three of us could have gone > to the local bar for a beer and discussed it. We should just take > the time out to videoconference next time. You're not honestly suggesting that a videoconference is better than having beer together? Enjoy your trip, and thanks for hanging in there in the discussion, I appreciate it. -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From cjfields at uiuc.edu Tue Jul 25 01:53:33 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 25 Jul 2006 00:53:33 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <49AA1351-5741-4E19-AFCE-EEA5532531B6@gmx.net> References: <000101c6af68$f27521a0$15327e82@pyrimidine> <44C54683.70707@sendu.me.uk> <6E129BAF-08E3-4BDA-8D1C-F853BE7231C9@uiuc.edu> <49AA1351-5741-4E19-AFCE-EEA5532531B6@gmx.net> Message-ID: <190F1365-1F8B-401F-8FBA-8DE39D851111@uiuc.edu> So do we intend on having everyone who installs bioperl have a local copy of the taxonomy dumpfile? Or perform a remote lookup via Entrez? Seems a bit extreme. I would like the option of not having the lookup run; as I mentioned to Sendu, one of the biggest complaints about bioperl is speed. Additional lookups won't help on that end. Chris On Jul 24, 2006, at 10:31 PM, Hilmar Lapp wrote: > > On Jul 24, 2006, at 10:29 PM, Chris Fields wrote: > >> [...] >> We could go back and forth on what Jason really intended. [...] The >> reality is he's not here and you're willing to do the job. > > Right. And, knowing Jason, I think he'd be perfectly fine with seeing > his original idea develop in a possibly different direction, provided > it will all work nicely in the end. I'm willing to take the beating > on me if that doesn't turn out to be true ... 
> >> >> There is one thing I will make perfectly clear here: there should >> never, ever be enforced lookups for SeqIO (even using caches), > > You certainly don't want taxonomy lookups during the parsing stage, > and also not for the client requesting properties of the species that > have been parsed with high confidence, i.e., genus and species for a > straightforward binomial like 'Homo sapiens'. > > Writing sequences, IMHO, doesn't have to be as fast. It may be better > to emit strict format a bit slower rather than sloppy format a bit > faster. > > Upon parsing, one idea could be for the flat file parser to set a > dirty bit in the parsed out species if the parsed text didn't follow > strict binomial conventions, hence the parser may have made a mistake > and if a client requests the information it is better to lookup the > correct values from a taxonomy database. I.e., you could try with a > strict regex first that would imply a high-confidence result. If that > fails you don't give up but mark the result as untrustworthy. > > >> [...] >> This would have been MUCH easier if all three of us could have gone >> to the local bar for a beer and discussed it. We should just take >> the time out to videoconference next time. > > You're not honestly suggesting that a videoconference is better than > having beer together? > > Enjoy your trip, and thanks for hanging in there in the discussion, I > appreciate it. > > -hilmar > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. 
Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From bix at sendu.me.uk Tue Jul 25 03:05:23 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Tue, 25 Jul 2006 08:05:23 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <6E129BAF-08E3-4BDA-8D1C-F853BE7231C9@uiuc.edu> References: <000101c6af68$f27521a0$15327e82@pyrimidine> <44C54683.70707@sendu.me.uk> <6E129BAF-08E3-4BDA-8D1C-F853BE7231C9@uiuc.edu> Message-ID: <44C5C2B3.1020304@sendu.me.uk> Chris Fields wrote: > > There is one thing I will make perfectly clear here: there should > never, ever be enforced lookups for SeqIO (even using caches), though > I have no problem having optional ones. This is something I have > stated before and what you propose below steers dangerously in that > direction. Where, for instance, do you store the lineage from a > GenBank file? Do you want to do a series of Tax lookups to restore > that data? I think that the number one complaint for sequence > parsing is speed, which would only get slower with lookups (even > cached). I already gave a code example of exactly how Bio::Taxonomy is perfect for storing the lineage data in a GenBank file with or without a database lookup. I think perhaps at the time you first read this you basically ignored it because you had trouble with the idea of adding nodes to a species. If you have been glossing over my argument, it may be instructive to go over what I've been saying with a clear eye. 
Anyway, here it is again, and remember in this example, Bio::Species isa Bio::Taxonomy:

## the fully-manual way
my $species = new Bio::Species;
my $node = new Bio::Taxonomy::Node(-name => 'Saccharomyces cerevisiae',
                                   -rank => 'species', -object_id => 1,
                                   -parent_id => 2);
my $n2 = new Bio::Taxonomy::Node(-name => 'Saccharomyces',
                                 -object_id => 2, -parent_id => 3);
# (no assumption that 'Saccharomyces' is the genus, so rank() undefined)
my $n3 = [etc]
$species->add_node($node);
$species->add_node($n2);
[etc]

## Using a factory without db access
# assume that Bio::Taxonomy::GenbankFactory implements
# some modified Bio::Taxonomy::FactoryI
my $factory = Bio::Taxonomy::GenbankFactory->new();
my $species = $factory->generate(-classification => ['Saccharomyces cerevisiae',
                                 'Saccharomyces', 'Saccharomycetaceae' ...]);
# the generate() method above just does the fully-manual way for you

## Using a factory with db access
# assume that Bio::Taxonomy::EntrezFactory implements some
# modified Bio::Taxonomy::FactoryI and uses Bio::DB::Taxonomy::entrez
# to get the nodes
my $factory = Bio::Taxonomy::EntrezFactory->new();
my $species = $factory->fetch(-scientific_name => 'Saccharomyces cerevisiae');

So now do you see how we're able to do the Genbank no-db way and the db-using way with the same object model? We're able to do it the same, sane way because a Node is just a node; you can make them yourself manually, or retrieve them from a database. Once you stick them in a Taxonomy you can then (potentially) ask all the questions of the data that you can with existing Bio::Species. No cruft is required anywhere at all. All the Taxonomy classes can be 'pure', while only Bio::Species has to have backward-compatibility methods.
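
The no-db route described above amounts to building node records from the lineage that is already in the file and chaining them by parent ids, with no lookup involved. Below is a minimal sketch of that idea in plain Perl, assuming nothing from BioPerl: nodes_from_classification() and lineage_names() are hypothetical helper names standing in for what the proposed generate() might do, and the ids are invented locally rather than being NCBI taxids.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the proposed factory's generate(): turn a
# classification list (most specific name first) into plain node records
# chained by parent ids. Not BioPerl API; ids are made up locally.
sub nodes_from_classification {
    my @names = @_;
    my @nodes;
    for my $i (0 .. $#names) {
        push @nodes, {
            id        => $i + 1,
            # the last (most general) name has no parent in this list
            parent_id => $i < $#names ? $i + 2 : undef,
            name      => $names[$i],
            # only the first entry is assumed to be the species; every
            # other rank stays undefined, as in the manual example
            rank      => $i == 0 ? 'species' : undef,
        };
    }
    return @nodes;
}

# Walk from a starting node up through parent ids and collect the names.
sub lineage_names {
    my ($start_id, @nodes) = @_;
    my %by_id = map { $_->{id} => $_ } @nodes;
    my @lineage;
    my $id = $start_id;
    while (defined $id && exists $by_id{$id}) {
        my $node = $by_id{$id};
        push @lineage, $node->{name};
        $id = $node->{parent_id};
    }
    return @lineage;
}

my @nodes = nodes_from_classification(
    'Saccharomyces cerevisiae', 'Saccharomyces', 'Saccharomycetaceae',
);
print join(' < ', lineage_names(1, @nodes)), "\n";
```

Whether the container holding such nodes ends up being called Bio::Species or Bio::Taxonomy is exactly the point under debate in this thread; the sketch only shows that the lineage can be stored and read back without any database access.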
From bernd.web at gmail.com Tue Jul 25 06:47:50 2006
From: bernd.web at gmail.com (Bernd Web)
Date: Tue, 25 Jul 2006 12:47:50 +0200
Subject: [Bioperl-l] Structure::IO
Message-ID: <716af09c0607250347w24e0d8dbj674758049bc0a4e2@mail.gmail.com>

Hi,

Does someone have experience with Bio::Structure::IO? The example III.9.1 from the bptutorial.pl covers most, but what is e.g. the chain() method of Bio::Structure::Entry doing? The POD states:

 Title   : chain
 Usage   : @chains = $structure->chain($chain);
 Function: Connects a (or a list of) Chain objects to a Bio::Structure::Entry.
 Returns : list of Bio::Structure::Residue objects
 Args    : One Residue or a reference to an array of Residue objects

But in e.g.

my $stream = Bio::Structure::IO->new(-file => $filename, -format => 'pdb');
while ( my $struc = $stream->next_structure() ) {
    for my $chain ($struc->get_chains) {
        my $chainid = $chain->id;
        my @chains = $struc->chain($chain);
    }
}

I get Bio::Structure::Chain=HASH(0x9f1ab50). What is the function of the chain method and how to use it?

Best regards,
bernd

From bernd.web at gmail.com Tue Jul 25 07:44:28 2006
From: bernd.web at gmail.com (Bernd Web)
Date: Tue, 25 Jul 2006 13:44:28 +0200
Subject: [Bioperl-l] SeqUtils
Message-ID: <716af09c0607250444y3e005fb1t4e20094fd8db993d@mail.gmail.com>

Hi,

With Bio::SeqUtils it may be nice to support 3 letter codes with capitals only, too. Now

my $string = Bio::SeqUtils->seq3in($seqobj, 'METGLYTER');

will give in $string->seq: XXX. Possibly the capitals in MetGlyTer are used to find the amino acid codes? If not, maybe it's easy to implement case-insensitive, or all-capitals, AA codes in SeqUtils?

In addition, about the POD: maybe it's better not to use $string, since Bio::SeqUtils->seq3in does not return a string but a Bio::PrimarySeq object.
Regards, Bernd From cjfields at uiuc.edu Tue Jul 25 08:28:01 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 25 Jul 2006 07:28:01 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C5C2B3.1020304@sendu.me.uk> References: <000101c6af68$f27521a0$15327e82@pyrimidine> <44C54683.70707@sendu.me.uk> <6E129BAF-08E3-4BDA-8D1C-F853BE7231C9@uiuc.edu> <44C5C2B3.1020304@sendu.me.uk> Message-ID: <05718B29-6432-4ACA-A10F-2B384CB68191@uiuc.edu> Look, you explaining this to me, as you see it, does not convince me that its the correct or right way to do it. Okay? Can we agree on that? I do not think that Species and Taxonomy are the same thing. A species should not hold more than one node. A species, by definition, is a rank in Taxonomy, and is a node, not a full Taxonomy, so Bio::Species should be a Node, not a Taxonomy. I don't see how I can be any clearer... The fact that it may work is beyond the point. That's like putting duct tape on a leak to me. Why not just simplify Bio::Species into a Node? Or make it into a Node and get rid of it altogether. You are going to do what you want to do, regardless of what I say. Seems to be par for the course here. I'm REALLY tired of arguing the point. Okay? Just drop it. I have other priorities in life besides goddamned bioperl right now... Chris On Jul 25, 2006, at 2:05 AM, Sendu Bala wrote: > Chris Fields wrote: >> >> There is one thing I will make perfectly clear here: there should >> never, ever be enforced lookups for SeqIO (even using caches), though >> I have no problem having optional ones. This is something I have >> stated before and what you propose below steers dangerously in that >> direction. Where, for instance, do you store the lineage from a >> GenBank file? Do you want to do a series of Tax lookups to restore >> that data? I think that the number one complaint for sequence >> parsing is speed, which would only get slower with lookups (even >> cached). 
> > I already gave a code example of exactly how Bio::Taxonomy is perfect > for storing the lineage data in a GenBank file with or without a > database lookup. I think perhaps at the time you first read this you > basically ignored it because you had trouble with the idea of adding > nodes to a species. If you have been glossing over my argument, it may > be instructive to go over what I've been saying with a clear eye. > Anyway, here it is again, and remember in this example, > Bio::Species isa > Bio::Taxonomy: > > > ## the fully-manual way > my $species = new Bio::Species; > my $node = new Bio::Taxonomy::Node(-name => 'Saccharomyces > cerevisiae', > -rank => 'species', -object_id > => 1, > -parent_id => 2); > my $n2 = new Bio::Taxonomy::Node(-name => 'Saccharomyces', > -object_id => 2, -parent_id => 3); > # (no assumption that 'Saccharomyces' is the genus, so rank() > undefined) > my $n3 = [etc] > $species->add_node($node); > $species->add_node($n2); > [etc] > > ## Using a factory without db access > # assume that Bio::Taxonomy::GenbankFactory implements > # some modified Bio::Taxonomy::FactoryI > my $factory = Bio::Taxonomy::GenbankFactory->new(); > my $species = $factory->generate(-classification => ['Saccharomyces > cerevisiae', 'Saccharomyces', > 'Saccharomycetaceae' ...]); > # the generate() method above just does the fully-manual way for you > > ## Using a factory with db access > # assume that Bio::Taxonomy::EntrezFactory implements some > # modified Bio::Taxonomy::FactoryI and uses Bio::DB::Taxonomy::entrez > # to get the nodes > my $factory = Bio::Taxonomy::EntrezFactory->new(); > my $species = $factory->fetch(-scientifc_name => 'Saccharomyces > cerevisiae'); > > > So now do you see how we're able to do the Genbank no-db way and the > db-using way with the same object model? We're able to do it the same, > sane way because a Node is just a node; you can make them yourself > manually, or retrieve them from a database. 
Once you stick them in a > Taxonomy you can then (potentially) ask all the questions of the data > that you can with existing Bio::Species. No cruft is required anywhere > at all. All the Taxonomy classes can be 'pure', while only > Bio::Species > has to have backward-compatibility methods. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From bix at sendu.me.uk Tue Jul 25 08:52:03 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Tue, 25 Jul 2006 13:52:03 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <05718B29-6432-4ACA-A10F-2B384CB68191@uiuc.edu> References: <000101c6af68$f27521a0$15327e82@pyrimidine> <44C54683.70707@sendu.me.uk> <6E129BAF-08E3-4BDA-8D1C-F853BE7231C9@uiuc.edu> <44C5C2B3.1020304@sendu.me.uk> <05718B29-6432-4ACA-A10F-2B384CB68191@uiuc.edu> Message-ID: <44C613F3.7070903@sendu.me.uk> Chris Fields wrote: > A species should not hold more than one node. A species, by > definition, is a rank in Taxonomy, and is a node, not a full > Taxonomy, so Bio::Species should be a Node, not a Taxonomy. I don't > see how I can be any clearer... Right, we have differing viewpoints because you're concerned with what Bio::Species /should/ be, based on the name of the file and perhaps its original intent, whilst I am treating it as what it actually /is/, which is an object that is used to contain information about multiple taxonomic nodes. > The fact that it may work is beyond the point. That's like putting > duct tape on a leak to me. Why not just simplify Bio::Species into a > Node? Or make it into a Node and get rid of it altogether. Bio::Species, again ignore the name, is just a thing that lets us store and retrieve a certain set of data. If we simplified it into a pure Node, it could no longer do that job. 
If we just get rid of it all together it can no longer do its job. By making it a Bio::Taxonomy it can continue to do its job without having to have Node objects with cruft. It would also gain the useful methods of Bio::Taxonomy at the same time. I really don't mean to upset you, and I apologise for having done so. I've been presenting what I thought was a logical argument in favour of Bio::Species as Bio::Taxonomy, and waiting to see if anyone would come up with a logical argument why that would be inappropriate, or why something else would be better. I'm not saying you're wrong and I'm certainly listening and would change my choice based on what you have to say. I don't think it's fair to say that disregarding what you have to say is 'par for the course' - I already /have/ regarded what you had to say in this thread and ended up doing scientific_name() as purely what we get from the database. From hlapp at gmx.net Tue Jul 25 09:47:47 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Tue, 25 Jul 2006 09:47:47 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C5C2B3.1020304@sendu.me.uk> References: <000101c6af68$f27521a0$15327e82@pyrimidine> <44C54683.70707@sendu.me.uk> <6E129BAF-08E3-4BDA-8D1C-F853BE7231C9@uiuc.edu> <44C5C2B3.1020304@sendu.me.uk> Message-ID: On Jul 25, 2006, at 3:05 AM, Sendu Bala wrote: > [...] > ## the fully-manual way > my $species = new Bio::Species; > my $node = new Bio::Taxonomy::Node(-name => 'Saccharomyces > cerevisiae', > -rank => 'species', -object_id > => 1, > -parent_id => 2); If this is meant as an example for the use cases I enumerated, then you wouldn't have the parent_id from a Genbank file. However, you didn't have that before either, so no problem. 
> my $n2 = new Bio::Taxonomy::Node(-name => 'Saccharomyces',
>                                  -object_id => 2, -parent_id => 3);
> # (no assumption that 'Saccharomyces' is the genus, so rank() undefined)

I think in a confident parse you want to assign 'genus' if there's little doubt, for example for 'Saccharomyces cerevisiae'. I'm not sure whether there are weird viruses whose names look innocuous but in reality don't follow binomial convention.

> my $n3 = [etc]
> $species->add_node($node);
> $species->add_node($n2);

I know why you are doing this, but people seeing this will hit a mental snag. You should take Chris' refusal to see the sense in this as an indication that many people down the road won't see the sense either.

So instead, make the logical model in your design more obvious, which I think will ultimately help maintainability as well. For example:

    my $taxonomy = Bio::Taxonomy->new();
    my $node = new Bio::Taxonomy::Node(-name => 'Saccharomyces cerevisiae',
                                       -rank => 'species', -object_id => 1,
                                       -parent_id => 2);
    my $n2 = new Bio::Taxonomy::Node(-name => 'Saccharomyces',
                                     -object_id => 2, -parent_id => 3);
    $taxonomy->add_node($node);
    $taxonomy->add_node($n2);

    my $species = Bio::Species->new(-lineage => $taxonomy);
    print $species->binomial();
    print $species->genus();
    # this may trigger a lookup if a taxonomy db handle has been set, e.g.:
    # $taxonomy->db_handle(Bio::DB::Taxonomy->new(-source => 'entrez'));
    print $species->classification();

> [etc]
>
> ## Using a factory without db access
> # assume that Bio::Taxonomy::GenbankFactory implements
> # some modified Bio::Taxonomy::FactoryI
> my $factory = Bio::Taxonomy::GenbankFactory->new();
> my $species = $factory->generate(-classification => ['Saccharomyces
> cerevisiae', 'Saccharomyces',
> 'Saccharomycetaceae' ...]);
> # the generate() method above just does the fully-manual way for you

Except the method name would be create_object(), the parameter would be a hash ref, and the return value would be a Bio::TaxonomyI compliant object:

    my $taxonomy = $factory->create_object({-classification =>
        ['Saccharomyces cerevisiae', 'Saccharomyces',
         'Saccharomycetaceae' ...]});
    my $species = Bio::Species->new(-lineage => $taxonomy);

>
> ## Using a factory with db access
> # assume that Bio::Taxonomy::EntrezFactory implements some
> # modified Bio::Taxonomy::FactoryI and uses Bio::DB::Taxonomy::entrez
> # to get the nodes
> my $factory = Bio::Taxonomy::EntrezFactory->new();

The logic for where to do a lookup should not be duplicated here; it belongs only under Bio::DB::Taxonomy::*.

> my $species = $factory->fetch(-scientifc_name => 'Saccharomyces
> cerevisiae');

Likewise, use the methods defined in Bio::DB::Taxonomy, and again, the return type is Bio::Taxonomy, which you would pass to Bio::Species->new().

-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From hlapp at gmx.net Tue Jul 25 09:54:14 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Tue, 25 Jul 2006 09:54:14 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <190F1365-1F8B-401F-8FBA-8DE39D851111@uiuc.edu> References: <000101c6af68$f27521a0$15327e82@pyrimidine> <44C54683.70707@sendu.me.uk> <6E129BAF-08E3-4BDA-8D1C-F853BE7231C9@uiuc.edu> <49AA1351-5741-4E19-AFCE-EEA5532531B6@gmx.net> <190F1365-1F8B-401F-8FBA-8DE39D851111@uiuc.edu> Message-ID: <793AFD5C-D220-493F-BE11-B9023DC9F569@gmx.net>

We intend to have everyone who wants correct taxonomy parsing results for the entire kingdom of life define his/her authoritative taxonomy database, be it local or not, be it HTTP or SQL queried.

If you don't care about the correctness of the taxonomy parse, or if the taxonomy information in the flat file is trivially parseable because it conforms to standard binomial convention, then whatever is to be put in place needs to work fine regardless of whether a taxonomy database is defined or not.
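[Editorial sketch, not part of the original message.] The "trivially parseable" case above can be illustrated with a strict pattern that accepts only a plain binomial and flags everything else as untrusted, so that a client may later decide whether to consult a taxonomy database. This is a minimal sketch under stated assumptions: parse_organism() and the 'trusted' key are hypothetical names, not BioPerl API, and the pattern is deliberately stricter than real nomenclature.

```perl
use strict;
use warnings;

# Hypothetical helper: parse an ORGANISM string without any database access.
# 'trusted' is true only when the name matches a strict binomial pattern
# (capitalised genus followed by a single lower-case epithet).
sub parse_organism {
    my ($name) = @_;
    if ($name =~ /^([A-Z][a-z]+) ([a-z]+)$/) {
        return { scientific_name => $name, genus => $1,
                 species => $2, trusted => 1 };
    }
    # Anything else (strains, viruses, environmental samples ...) is kept
    # but flagged, so a caller can choose to look it up elsewhere.
    return { scientific_name => $name, genus => undef,
             species => undef, trusted => 0 };
}

for my $name ('Homo sapiens', 'Saccharomyces cerevisiae', 'Influenza A virus') {
    my $parsed = parse_organism($name);
    printf "%-26s trusted=%d genus=%s\n",
        $name, $parsed->{trusted}, $parsed->{genus} // '(unknown)';
}
```

No lookup happens anywhere in this sketch; the flag only records how much the parse can be relied on.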
-hilmar On Jul 25, 2006, at 1:53 AM, Chris Fields wrote: > So do we intend on having everyone who installs bioperl have a local > copy of the taxonomy dumpfile? Or perform a remote lookup via > Entrez? Seems a bit extreme. > > I would like the option of not having the lookup run; as I mentioned > to Sendu, one of the biggest complaints about bioperl is speed. > Additional lookups won't help on that end. > > Chris > > On Jul 24, 2006, at 10:31 PM, Hilmar Lapp wrote: > >> >> On Jul 24, 2006, at 10:29 PM, Chris Fields wrote: >> >>> [...] >>> We could go back and forth on what Jason really intended. [...] The >>> reality is he's not here and you're willing to do the job. >> >> Right. And, knowing Jason, I think he'd be perfectly fine with seeing >> his original idea develop in a possibly different direction, provided >> it will all work nicely in the end. I'm willing to take the beating >> on me if that doesn't turn out to be true ... >> >>> >>> There is one thing I will make perfectly clear here: there should >>> never, ever be enforced lookups for SeqIO (even using caches), >> >> You certainly don't want taxonomy lookups during the parsing stage, >> and also not for the client requesting properties of the species that >> have been parsed with high confidence, i.e., genus and species for a >> straightforward binomial like 'Homo sapiens'. >> >> Writing sequences, IMHO, doesn't have to be as fast. It may be better >> to emit strict format a bit slower rather than sloppy format a bit >> faster. >> >> Upon parsing, one idea could be for the flat file parser to set a >> dirty bit in the parsed out species if the parsed text didn't follow >> strict binomial conventions, hence the parser may have made a mistake >> and if a client requests the information it is better to lookup the >> correct values from a taxonomy database. I.e., you could try with a >> strict regex first that would imply a high-confidence result. 
If that >> fails you don't give up but mark the result as untrustworthy. >> >> >>> [...] >>> This would have been MUCH easier if all three of us could have gone >>> to the local bar for a beer and discussed it. We should just take >>> the time out to videoconference next time. >> >> You're not honestly suggesting that a videoconference is better than >> having beer together? >> >> Enjoy your trip, and thanks for hanging in there in the discussion, I >> appreciate it. >> >> -hilmar >> -- >> =========================================================== >> : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : >> =========================================================== >> >> >> >> >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > > Christopher Fields > Postdoctoral Researcher > Lab of Dr. Robert Switzer > Dept of Biochemistry > University of Illinois Urbana-Champaign > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From cjfields at uiuc.edu Tue Jul 25 10:58:29 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 25 Jul 2006 09:58:29 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <793AFD5C-D220-493F-BE11-B9023DC9F569@gmx.net> Message-ID: <002601c6affa$ca4433f0$15327e82@pyrimidine> Agreed. I fully support the addition of an optional lookup; it gives much more flexibility SeqIO re: your previous examples of screening sequence streams for sequences that are primate, mitochondrial, etc. The key word I want to emphasize is 'optional', not 'enforced'. I appreciate what Sendu is trying to do; I really do. 
I think carrying over an object named 'Bio::Species' into Taxonomy is too confusing (your 'contagion' analogy, as it were). The 'species' concept (biologically speaking here, not talking about the Bioperl class) is a taxonomic rank (i.e. part of a taxonomy). I'm trying to take a biologist's point of view here. What is a 'species'? Or, if we were to stick strictly with using NCBI definitions, what is a 'species'? The NCBI definition of 'species' is simply a rank in a lineage, so it is (in Bioperl terms) a Node. If we were to follow that line of reasoning, why also have a Species object represent a Taxonomy as well? It's way too confusing. Sendu's repeatedly stating "a Species is a Taxonomy" makes some sense in a BioPerl world only, as we're speaking about a class that has been around for a long time, one that acted as a container of sorts for sequence data. And I understand what he intends to do. Conceptually speaking here, though, the way it is laid out, a Bio::Species object can hold a Node that represents a 'species' rank, as well as a 'genus' Node, and a 'family' node, and on and on. That's not a 'species', that's a taxonomy. So just call it a Taxonomy. The object itself (Bio::Species) never truly represented a 'species' anyway, biologically speaking, every time it held sequence data. It could be a subspecies, strain, plasmid, unknown, or an unclassified rank ('no rank') or environmental sample. It really held a fancier representation of a node, as based on the TaxID. My final point is, saying "a species is a taxonomy" to the rest of the biological world doesn't make sense. Maybe it makes sense to you and I and Sendu, in our little Bioperl world. But to the thousands of users out there who don't completely grok the Bioperl class structure, it's just confusing. If I were to get an object back that was labeled Bio::Species, as a biologist I would expect it to be part of a taxonomy, not the actual Taxonomy itself. 
So, why not cut to the chase: if we are to fundamentally change the concept of what Bio::Species is by making it a Taxonomy/TaxonomyI or whatever, why not just use a Taxonomy object altogether and not bother with Bio::Species at all? Deprecate it.

BTW, I'll be in Connecticut for five days at UConn. So I hope to escape the heat for a bit. Thanks for listening to my side of things.

Chris

From cjfields at uiuc.edu Tue Jul 25 11:36:40 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 25 Jul 2006 10:36:40 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: Message-ID: <003301c6b000$203cc560$15327e82@pyrimidine>

> On Jul 25, 2006, at 3:05 AM, Sendu Bala wrote:
>
> > [...]
> > ## the fully-manual way
> > my $species = new Bio::Species;
> > my $node = new Bio::Taxonomy::Node(-name => 'Saccharomyces
> > cerevisiae',
> > -rank => 'species', -object_id
> > => 1,
> > -parent_id => 2);
>
> If this is meant as an example for the use cases I enumerated, then
> you wouldn't have the parent_id from a Genbank file. However, you
> didn't have that before either, so no problem.
>
> > my $n2 = new Bio::Taxonomy::Node(-name => 'Saccharomyces',
> > -object_id => 2, -parent_id => 3);
> > # (no assumption that 'Saccharomyces' is the genus, so rank()
> > undefined)
>
> I think in a confident parse you want to assign 'genus' if there's
> little doubt, for example 'Saccharomyces cerevisiae'. Not sure
> whether there are weird viri whose names look innocuous but in
> reality the name doesn't follow binomial convention.
>
> > my $n3 = [etc]
> > $species->add_node($node);
> > $species->add_node($n2);
>
> I know why you are doing this, but seeing this people will hit a
> mental snag. You should listen to Chris' refusal to see the sense in
> this as an indication that many people down the road won't see the
> sense either.

Thanks for pointing that out. I think there is only a small, fundamental difference in our views here. I'm trying to view this as an outsider would, a biologist not familiar with the Bioperl class structure. I understand what Sendu's trying to accomplish but it's really confusing to someone not familiar with what Bio::Species is.

Hilmar, you had pointed out several times that Bio::Species and Bio::Taxonomy shouldn't directly intermingle. My original thought for genbank.pm _read_GenBank_Species() was this, copied and pasted from my local genbank.pm. It's sort of extreme, but it passes tests just fine.

sub _read_GenBank_Species {
    my( $self,$buffer) = @_;
    $_ = $$buffer;

    my @organelles = qw(plastid chloroplast mitochondrion);
    my( $source_data, $common_name, @class, $ns_name, $organelle,
        $source_flag, $sci_name, $abbr );

    while (defined($_) || defined($_ = $self->_readline())) {
        # de-HTMLify (links that may be encountered here don't contain
        # escaped '>', so a simple-minded approach suffices)
        s/<[^>]+>//g;
        if ( /^SOURCE\s+(.*)/o ) {
            $source_data = $1;
            $source_data =~ s/\.$//; # remove trailing dot
            # does it have a GenBank common name in parentheses?
            # (list assignment, so the captured text is stored rather
            # than the true/false result of the match)
            ($common_name) = $source_data =~ m{\((.*)\)}xms;
            # organelle? If we find additional odd ones,
            # add to @organelles
            # (take the first organelle word found in the SOURCE text)
            ($organelle) = grep { $source_data =~ /\b\Q$_\E\b/ } @organelles;
            $source_flag = 1;
        } elsif ( /^\s{2}ORGANISM\s+(.*)/o ) {
            $sci_name = $1;
            $source_flag = 0;
        } elsif ($source_flag) { # no ORGANISM
            $common_name .= $source_data;
            $common_name =~ s/\n//g;
            $common_name =~ s/\s+/ /g;
            $source_flag = 0;
        } elsif ( /^\s+(.+)/o ) { # lineage information
            my $line = $1;
            # only split on ';' or '.' so that classification
            # that is 2 words will still get matched, use
            # map() to remove trailing/leading spaces
            push(@class, map { s/^\s+//; s/\s+$//; $_; } split /[;\.]+/, $line)
                if ( $line =~ /(;|\.)/ );
        } else { # reach end of GenBank tax info
            last;
        }
        $_ = undef; # Empty $_ to trigger read of next line
    }
    $$buffer = $_;

    @class = reverse @class;
    my $make = Bio::Taxonomy::Node->new();
    $make->common_name( $common_name ) if $common_name;
    $make->scientific_name($sci_name) if $sci_name;
    # could use SimpleValue objs here instead
    $make->classification( @class ) if @class;
    $make->organelle($organelle) if $organelle;
    return $make;
}

# back in next_seq...grab the TaxID from 'source' seqfeature
# could check organelle() here as well

# add taxon_id from source if available
if($species && ($feat->primary_tag eq 'source') &&
   $feat->has_tag('db_xref') && (! $species->ncbi_taxid())) {
    foreach my $tagval ($feat->get_tag_values('db_xref')) {
        if(index($tagval,"taxon:") == 0) {
            $species->ncbi_taxid(substr($tagval,6));
            last;
        }
    }
}

In other words, remove the extra parsing of genus(), species(), subspecies etc. All GenBank sequences have a node represented in NCBI's tax database (I checked it out). Even plasmids, unknowns, environmental samples.

Chris

From bix at sendu.me.uk Tue Jul 25 13:49:04 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Tue, 25 Jul 2006 18:49:04 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <003301c6b000$203cc560$15327e82@pyrimidine> References: <003301c6b000$203cc560$15327e82@pyrimidine> Message-ID: <44C65990.4080500@sendu.me.uk>

Chris Fields wrote:
> If I were to get an object back that was labeled Bio::Species, as a
> biologist I would expect it to be part of a taxonomy, not the actual
> Taxonomy itself.

I think this is the most important sentence in the discussion. Ok, so it's clear to me that a better solution is needed than my Bio::Taxonomy-related proposal. Sorry for being so slow on the uptake. I also needed to start trying to code my Taxonomy proposal to see some issues with it.

[... in another email...]
> I'm trying to view this as an outsider would,
> a biologist not familiar with the Bioperl class structure.

Ok, let's come up with a proposal that makes sense to the biologist and better matches Jason's original idea.

---- long post follows; there's a summary at the end

As a biologist when I consider a species I have the following primary questions.
Let's see how we would answer them using
a) Bio::Species and genbank.pm as they are now,
b) Bio::Species if it was a 'pure' Bio::Taxonomy::Node with no cruft (or if we just dropped Bio::Species and used Node directly), together with Chris' updated genbank.pm.
Let's say we got our species information from a genbank file where the scientific name and tax id are available to be parsed out.

# What is the species' name?
a) Not guaranteed to be correct.
b) Correct thanks to recent changes to Node, just use scientific_name()

# What is the lineage of this species?
a) I can get a classification array with classification(). It's a bit rubbish though, I can't tell what any of the array elements are supposed to be.
b) A pure Node wouldn't store the lineage on itself. There are two obvious solutions:
1) add cruft to Node by giving it a classification() method - works as well/badly as a).
2) call get_Lineage_Nodes(), which has the benefit of telling me what rank each ancestor was, if that information had been in the file (more likely, if Node was generated from database).
Problem: get_Lineage_Nodes() only works if it can
    $self->db_handle->get_Taxonomy_Node(-taxonid => $self->parent_id);
which obviously doesn't work if the nodes in our lineage didn't come from a database, but from the parsing of a genbank flat file. As we parse the genbank file we can certainly make nodes for each word in the list:

    # inside genbank.pm...
    @class = reverse @class;
    my @nodes;
    my $fake_id = 1;
    foreach my $sci_name (@class) {
        # each node's parent is simply the next id in the list
        push(@nodes, new Bio::Taxonomy::Node(-name => $sci_name,
                                             -object_id => $fake_id++,
                                             -parent_id => $fake_id));
    }

But how do we keep these nodes and make them returnable later by get_Lineage_Nodes? Perhaps:

    my $taxonomy = new Bio::Taxonomy;
    foreach my $node (@nodes) {
        $taxonomy->add_node($node);
    }
    ...
    my $make = Bio::Taxonomy::Node->new();
    ...
    $make->db_handle($taxonomy);

Bio::Taxonomy would have to implement get_Taxonomy_Node (it has get_node which only accepts a rank).
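[Editorial sketch, not part of the original message.] The requirement described above, that nodes built from a parsed classification list stay retrievable by id so a lineage can be walked without a real database, can be sketched in plain Perl. Everything here is hypothetical: LineageStore, get_node_by_id() and lineage_of() are illustrative names using plain hashes, not Bio::Taxonomy or Bio::DB::Taxonomy API.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a taxonomy "database" built from one parsed
# classification array (most specific name first).
package LineageStore;

sub new {
    my ($class, @names) = @_;
    my %node_for;
    my $id = 1;
    for my $name (@names) {
        # the last name has no parent; every other node points at the next id
        my $parent = $id < @names ? $id + 1 : undef;
        $node_for{$id} = { id => $id, name => $name, parent_id => $parent };
        $id++;
    }
    return bless { node_for => \%node_for }, $class;
}

# look a node up by its (fake) id
sub get_node_by_id {
    my ($self, $id) = @_;
    return $self->{node_for}{$id};
}

# follow parent ids upwards and return the ancestors of a node
sub lineage_of {
    my ($self, $id) = @_;
    my @ancestors;
    my $node = $self->get_node_by_id($id);
    while ($node && defined $node->{parent_id}) {
        $node = $self->get_node_by_id($node->{parent_id});
        push @ancestors, $node if $node;
    }
    return @ancestors;
}

package main;

my $store = LineageStore->new('Saccharomyces cerevisiae',
                              'Saccharomyces', 'Saccharomycetaceae');
print join(' < ', map { $_->{name} } $store->lineage_of(1)), "\n";
```

The ids in such a store are only meaningful within the one record they were generated from.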
Of course this is ugly, storing a Taxonomy in our database handle. We could have a new Bio::DB::Taxonomy:: class instead, that treated a classification array like a database? It could have the added bonus of building up an entire database internally as more input arrays are given to it, and would therefore be able to give each node a unique but consistent id. It would break if one time you gave it qw(Homo Primates) and another time qw(Homo Hominidae Primates), however. Ideas?

# What if I don't want the whole lineage, just to know what a specific rank like genus is for my species?
a) use genus(), but not guaranteed to be correct.
b) two solutions:
1) add cruft to Node by adding a genus() method: as good/bad as a).
2) use get_Lineage_Nodes() or get_Parent_Node() until you find a node with your rank() of interest. Same problems as for the lineage question, but also it would be nicer to have a get_node('rank_name') style method. But such a method belongs in something like Bio::Taxonomy, not Node. At the very least a method like genus() would be implemented using pure Node methods like get_Parent_Node(), returning undefined if no parent had a rank() of 'genus', never guessing it.

# Is this species the same as another species?
a) Not guaranteed to be correct. (no unique id so forced to compare names)
b) Correct answer by using object_id() method, along with Chris' change to genbank.pm.

# What is the most recent common ancestor of this species and another?
a) Can't be answered.
b) Use get_LCA_Node(), but same issues as the lineage question, since get_LCA_Node requires a working get_Lineage_Nodes(). It also requires correct (unique) ids for all nodes in all lineages to give the guaranteed correct answer. But at least you /might/ get the correct answer even using only the data in genbank files and no db lookup.

---- summary:
It seems like the main problem with Node right now is that it has classification() and things like genus().
I propose pure Node method solutions to answer the questions classification() and genus() were implemented to answer, but in a better, cruft-free way. Bio::DB::Taxonomy::genbank anyone? Then if you started with a Species/Node generated by a genbank parse, and wanted certain questions answered correctly, you only have to set a different db_handle(). The Node only stores the static and hopefully correct information about itself, whilst all other questions go via db_handle, so you can dynamically swap back and forth between databases depending on whether you need speed or accuracy.

From cjfields at uiuc.edu Tue Jul 25 14:24:12 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 25 Jul 2006 13:24:12 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C65990.4080500@sendu.me.uk> Message-ID: <000001c6b017$873176a0$15327e82@pyrimidine>

Sendu, you'll have to make the changes how you see fit. You see my point now, which is great.

From my perspective, all the object type (used to contain taxonomy file information) needs to contain is the scientific name and common names like the SOURCE line abbreviated name and the actual GenBank common name, if present. All the other cruft (i.e. genus/species/subspecies) can be excised, and the proper taxonomic information, if wanted, could be accessed via the object and its TaxID. Organelle and lineage information needs to be retained (for the non-taxonomists) and could be stored in that object, bumped to SimpleValue objects, or just set (alternative, since the data is small) using a get/set value within the sequence object itself. This would be the bare-bones approach, which Node can fulfill.

I also like Hilmar's proposal about including optional lookups, which greatly increases the flexibility when screening sequences. This will likely require a more complicated object structure (i.e. taxonomy with nodes). You suggested a Taxonomy-like object which would work; but don't force Bio::Species into the mix.
Why not just use a simple Bio::Taxonomy object for that (Hilmar's point)? When one asks for $species->species, they'll get a Node or Taxonomy, whichever is used (that's up to you). The Node represents a more-barebones variation, while the Taxonomy object scheme would be more fully-realized. Either way will work for me. Just don't call it 'species'. ; > Once this is all done, will we really have a need for Bio::Species? That's my other point. The only real use for it was as a container object for sequence data. That job is now done via a Taxonomy/Node object. The only real use it would have is as a container for taxonomic information for species ranks or below. I think Node/Taxonomy can handle even that though, so now it's also redundant. If a class is not useful and is redundant, maybe it should be deprecated. Anyway, I can't get involved anymore at this point; I'm too busy with getting ready for the Kadner Institute next week. Good luck! Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Sendu Bala > Sent: Tuesday, July 25, 2006 12:49 PM > To: bioperl-l at lists.open-bio.org > Subject: Re: [Bioperl-l] Bio::*Taxonomy* changes > > Chris Fields wrote: > > If I were to get an object back that was labeled Bio::Species, as a > > biologist I would expect it to be part of a taxonomy, not the actual > > Taxonomy itself. > > I think this is the most important sentence in the discussion. Ok, so > it's clear to me that a better solution is needed than my > Bio::Taxonomy-related proposal. Sorry for being so slow on the uptake. I > also needed to start trying to code my Taxonomy proposal to see some > issues with it. > > > [... in another email...] > > I'm trying to view this as an outsider would, > > a biologist not familiar with the Bioperl class structure. > > Ok, let's come up with a proposal that makes sense to the biologist and > better matches Jason's original idea.
> > ---- long post follows; there's a summary at the end > > As a biologist when I consider a species I have the following primary > questions. Let's see how we would answer them using a) Bio::Species and > genbank.pm as they are now, b) Bio::Species if it was a 'pure' > Bio::Taxonomy::Node with no cruft (or if we just dropped Bio::Species > and used Node directly), and Chris' updated genbank.pm. Let's say we got > our species information from a genbank file where the scientific name > and tax id are available to be parsed out. > > # What is the species' name? > a) Not guaranteed to be correct. > b) Correct thanks to recent changes to Node, just use scientific_name() > > > # What is the lineage of this species? > a) I can get a classification array with classification(). It's a bit > rubbish though, I can't tell what any of the array elements are supposed > to be. > b) A pure Node wouldn't store the lineage on itself. There are two > obvious solutions: 1) add cruft to Node by giving it a classification() > method - works as well/bad as a). 2) call get_Lineage_Nodes(), which has > the benefit of telling me what rank each ancestor was, if that > information had been in the file (more likely, if Node was generated > from database). Problem: get_Lineage_Nodes() only works if it can > $self->db_handle->get_Taxonomy_Node(-taxonid => $self->parent_id); > which obviously doesn't work if the nodes in our lineage didn't come > from a database, but from the parsing of a genbank flat file. As we > parse the genbank file we can certainly make nodes for each word in the > list: > inside genbank.pm... @class = reverse @class; > my @nodes; my $fake_id = 1; > foreach my $sci_name (@class) { > push(@nodes, new Bio::Taxonomy::Node(-name => $sci_name, object_id => > $fake_id++, parent_id => $fake_id); > } > But how do we keep these nodes and make them returnable later by > get_Lineage_Nodes? 
Perhaps: > my $taxonomy = new Bio::Taxonomy; > foreach my $node (@nodes) { > $taxonomy->add_node($node); > } > ... > my $make = Bio::Taxonomy::Node->new(); > ... > $make->db_handle($taxonomy); > Bio::Taxonomy would have to implement get_Taxonomy_Node (it has get_node > which only accepts a rank). Of course this is ugly, storing a Taxonomy > in our database handle. We could have a new Bio::DB::Taxonomy:: class > instead, that treated a classification array like a database? It could > have the added bonus of building up an entire database internally as > more input arrays are given to it, able to therefore give each node a > unique but consistent id. It would break if one time you gave it qw(Homo > Primates) and another time qw(Homo Hominidae Primates), however. Ideas? > > > # What if I don't want the whole lineage, just to know what a specific > rank like genus is for my species? > a) use genus(), but not guaranteed to be correct. > b) two solutions: 1) add cruft to Node by adding a genus() method: as > good/bad as a). 2) use get_Lineage_Nodes() or get_Parent_Node() until > you find a node with your rank() of interest. Same problems as for > lineage question, but also it would be nicer to have a > get_node('rank_name') style method. But such a method belongs in > something like Bio::Taxonomy, not Node. At the very least a method like > genus() would be implemented using pure Node methods like > get_Parent_Node(), returning undefined if no parent had a rank() of > 'genus', never guessing it. > > > # Is this species the same as another species? > a) Not guaranteed to be correct. (no unique id so forced to compare names) > b) Correct answer by using object_id() method, along with Chris' change > to genbank.pm. > > > # What is the most recent common ancestor of this species and another? > a) Can't be answered. > b) Use get_LCA_Node(), but same issues as the lineage question, since > get_LCA_Node requires a working get_Lineage_Nodes(). 
It also requires > correct (unique) ids for all nodes in all lineages to give the > guaranteed correct answer. But at least you /might/ get the correct > answer even using only the data in genbank files and no db lookup. > > > ---- summary: > > It seems like the main problem with Node right now is that it has > classification() and things like genus(). I propose pure Node method > solutions to answer the questions classification() and genus() were > implemented to answer, but in a better, cruft-free way. > > Bio::DB::Taxonomy::genbank anyone? > > Then if you started with a Species/Node generated by a genbank parse, > and wanted certain questions answered correctly, you only have to set a > different db_handle(). The Node only stores the static and hopefully > correct information about itself, whilst all other questions go via > db_handle, so you can dynamically swap back and forth between databases > depending on if you need speed or accuracy. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From hlapp at gmx.net Tue Jul 25 15:18:00 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Tue, 25 Jul 2006 15:18:00 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <000001c6b017$873176a0$15327e82@pyrimidine> References: <000001c6b017$873176a0$15327e82@pyrimidine> Message-ID: <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net> On Jul 25, 2006, at 2:24 PM, Chris Fields wrote: > Once this is all done, will we really have a need for Bio::Species? No, except for backwards compatibility. Phasing it out will go over a couple of releases. E.g., v1.6.x could have deprecation warning in the documentation. v1.7+ would have deprecation warnings in the code written to stderr. Just as an aside, we can't just drastically change the return type of a method. 
Instead, if at all possible, there should be a new method so that the old can be phased out over time but otherwise not changed. I.e., don't change $seq->species() to now all of a sudden return a node or taxonomic lineage, even if initially Bio::Species is returned with some magic under the hood. Instead, create something like # return a Bio::Taxonomy::Node: my $taxon = $seq->taxon(); # alternative approach: return a lineage (taxonomy) # this would be Bio::TaxonomyI compliant my $lineage = $seq->lineage(); The former would require the lineage (and organelle for completeness) information to be either easily (though not necessarily directly) accessible through the node, or added as annotation. -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From cjfields at uiuc.edu Tue Jul 25 15:30:40 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 25 Jul 2006 14:30:40 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net> Message-ID: <000101c6b020$d09bc7b0$15327e82@pyrimidine> Sounds good to me. I'm fine with any way that it's worked out, either Taxonomy or Node-based, as long as there no Bio::Species-based confusion re: Taxonomy, and that this eventually leads to getting rid of Bio::Species altogether. Have fun, guys! (hey, probably the shortest response I have written)... Chris > -----Original Message----- > From: Hilmar Lapp [mailto:hlapp at gmx.net] > Sent: Tuesday, July 25, 2006 2:18 PM > To: Chris Fields > Cc: 'Sendu Bala'; bioperl-l at lists.open-bio.org > Subject: Re: [Bioperl-l] Bio::*Taxonomy* changes > > > On Jul 25, 2006, at 2:24 PM, Chris Fields wrote: > > > Once this is all done, will we really have a need for Bio::Species? > > No, except for backwards compatibility. Phasing it out will go over a > couple of releases. 
E.g., v1.6.x could have deprecation warning in > the documentation. v1.7+ would have deprecation warnings in the code > written to stderr. > > Just as an aside, we can't just drastically change the return type of > a method. Instead, if at all possible, there should be a new method > so that the old can be phased out over time but otherwise not > changed. I.e., don't change $seq->species() to now all of a sudden > return a node or taxonomic lineage, even if initially Bio::Species is > returned with some magic under the hood. Instead, create something like > > # return a Bio::Taxonomy::Node: > my $taxon = $seq->taxon(); > > # alternative approach: return a lineage (taxonomy) > # this would be Bio::TaxonomyI compliant > my $lineage = $seq->lineage(); > > The former would require the lineage (and organelle for completeness) > information to be either easily (though not necessarily directly) > accessible through the node, or added as annotation. > > -hilmar > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > From cjfields at uiuc.edu Tue Jul 25 22:16:36 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 25 Jul 2006 21:16:36 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C65990.4080500@sendu.me.uk> References: <003301c6b000$203cc560$15327e82@pyrimidine> <44C65990.4080500@sendu.me.uk> Message-ID: <1748BD60-EFBE-4872-94A4-7CA3D58BE458@uiuc.edu> One last thing before I shut off bioperl for a week and concentrate on Connecticut; On Jul 25, 2006, at 12:49 PM, Sendu Bala wrote: > Chris Fields wrote: >> If I were to get an object back that was labeled Bio::Species, as a >> biologist I would expect it to be part of a taxonomy, not the actual >> Taxonomy itself. > > I think this is the most important sentence in the discussion. 
Ok, so > it's clear to me that a better solution is needed than my > Bio::Taxonomy-related proposal. Sorry for being so slow on the > uptake. I > also needed to start trying to code my Taxonomy proposal to see some > issues with it. ... Again, thanks for noticing that. > ---- summary: > > It seems like the main problem with Node right now is that it has > classification() and things like genus(). I propose pure Node method > solutions to answer the questions classification() and genus() were > implemented to answer, but in a better, cruft-free way. > > Bio::DB::Taxonomy::genbank anyone? Ach... You're compromising here; that's not like you. I think you're making this too complicated by trying too many things at once. Don't think sudden dramatic changes in the API. Sneak changes in in a way that doesn't scare users away, then let them get used to the new way of grabbing Tax data. Make your point that it's more accurate to do it this way (you'll have defenders in Hilmar and me, BTW). Do this (start with genbank.pm): 1) Switch out Bio::Species with Node or Taxonomy; relocate other information temporarily (Bio::Species, get/sets in Seq object, SimpleValue). Leave Bio::Species in for the time being, but don't bother making any additional changes to it. 2) Make sure next_seq() and write_seq() work and pass tests. Add additional tests for the Tax/Node object (you could even use the tax dump data you recently added for more complicated tests). 3) Add in additional stuff bit by bit until it is where you would like it. 4) Make sure parsing is kosher with the latest release notes. Probably should make sure write_seq follows what the release notes state to some degree. And, really, you won't break anything with genbank.pm organelle() parsing. If you look at the module the organelle isn't even touched in next_seq() or _read_GenBank_Species(), so it was broken to begin with! My proposal, though extreme, was to remove genus() etc (which you wanted as well with Node).
You could leave this cruft for the time being in Bio::Species, which could still act as a sequence tax info holder object. It just won't be the >default< Seq tax information object, which would be Bio::Taxonomy or Node. Hence Hilmar's suggestion to use a $seq->taxon() method to return a Node/Taxonomy, and a $seq->species() would still return a Bio::Species object. It's redundant, but only for the time being, and the redundant information wouldn't have a major memory footprint anyway (not like the feature table or the full sequence might). Any information that isn't stored in whatever Tax object you use (i.e. lineage or organelle) could be stored temporarily in another fashion, such as a get/set in Seq or SimpleValue object, to make next_seq/write_seq work (such as $seq->organelle() or $seq->classification(), instead of $seq->species->organelle and so on). Hilmar then suggests that, around the 1.6-ish release, we note the changes made to SeqIO towards Bio::Taxonomy-based objects, and indicate that Bio::Species via species() and its associated methods will be deprecated around 1.7 (gives everybody notice on API issues). Then add warnings to Bio::Species in 1.7 noting the deprecation, then remove from core completely in 1.8 - 2.0. One last thing, which is minor really: I remember seeing something about having Nodes with 'no rank' ignored unless a flag is used. That may be bad news for some organisms in sequence files where the TaxID is for a 'no rank' rank, such as environmental samples. May want to think about that here. I'm hoping the releases will start popping out a bit more periodically than they have been. There have been volunteers to release periodic updates for bug fixes etc. If I get a chance I'll try keeping up. Don't count on it though. The conference is 7am-9pm most days, for five days straight!
Chris > > Then if you started with a Species/Node generated by a genbank parse, > and wanted certain questions answered correctly, you only have to > set a > different db_handle(). The Node only stores the static and hopefully > correct information about itself, whilst all other questions go via > db_handle, so you can dynamically swap back and forth between > databases > depending on if you need speed or accuracy. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From vrramnar at student.cs.uwaterloo.ca Tue Jul 25 22:44:17 2006 From: vrramnar at student.cs.uwaterloo.ca (vrramnar at student.cs.uwaterloo.ca) Date: Tue, 25 Jul 2006 22:44:17 -0400 Subject: [Bioperl-l] SNP reference file download In-Reply-To: <775C8983-D81E-474C-B4F5-C765ABEF4A2C@uiuc.edu> References: <000001c6b01f$bfd54e20$15327e82@pyrimidine> <1153868024.44c6a0f83fce6@www.nexusmail.uwaterloo.ca> <775C8983-D81E-474C-B4F5-C765ABEF4A2C@uiuc.edu> Message-ID: <1153881857.44c6d7016775f@www.nexusmail.uwaterloo.ca> Hey Chris, I believe I updated all those modules already as I downloaded the entire DB.tar from Bioperl live. 
Here is my code: #!/usr/bin/perl -w use Bio::Perl; use Bio::DB::EUtilities; my @ids = qw(rs4986950); # With the "rs" before the number the warning says: "no returned links" # Without the "rs" before the number the warning says: "No databases returned; empty linkset" my $elink = Bio::DB::EUtilities->new( -eutil => 'elink', -id => \@ids, -db => 'omim', -dbfrom => 'snp'); $elink->get_response; print "IDs: ", join q(,), $elink->get_ids; Which gives the following error: -------------------- WARNING --------------------- MSG: No databases returned; empty linkset --------------------------------------------------- ------------- EXCEPTION ------------- MSG: Must use database to access IDs STACK Bio::DB::EUtilities::ElinkData::get_LinkIds_by_db /Library/Perl/5.8.6/Bio/ DB/EUtilities/ElinkData.pm:201 STACK Bio::DB::EUtilities::get_ids /Library/Perl/5.8.6/Bio/DB/EUtilities.pm:482 STACK toplevel getOmimNum:13 -------------------------------------- All I really want is the OMIM id number under the section: NCBI Resource Links from the page: http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?rs=1800562 Any idea why this still isn't working?? Rohan Quoting Chris Fields : > Odd, I thought XML::Simple was part of the 5.8 core. Guess I was > wrong. I plan on changing this to a more robust parser soon (likely > XML::SAX or XML::Twig, which will also require a download). > > That warning occurs when if you don't have a link to OMIM present (No > databases returned; empty linkset). The way Elink works is it stores > internal data in a separate object (ELinkData) contained in an > internal cache. The method get_ids() works for all EUtilities to > retrieve IDs, even from ELink objects. The unique problem with ELink > is, since you can search multiple databases. you can retrieve > multiple sets of IDs. > > If you haven't done it, update your EUtilities; the problem is > similar to one I fixed today (I stated something about updating in my > last post). 
Also, update the main Bio::DB::EUtilities and > Bio::GenericWebDBI as well (the last is the base class from which > EUtilities is based). The 'Count:1' was a debugging statement I > forgot to remove a while ago which I changed in CVS yesterday. It's > possible that commit had other changes which I forgot about. > > Sorry about that, but it is still experimental (emphasis on the > 'mental'). > > Chris > > On Jul 25, 2006, at 5:53 PM, vrramnar at student.cs.uwaterloo.ca wrote: > > > > > Hey Chris, > > > > Ignore the last email, I fixed that problem and downloaded/ > > installed the > > required XML modules. > > > > However, I am now getting this error message: > > > > -------------------- WARNING --------------------- > > MSG: No databases returned; empty linkset > > --------------------------------------------------- > > Count: 1 > > > > ------------- EXCEPTION ------------- > > MSG: Must use database to access IDs > > STACK Bio::DB::EUtilities::ElinkData::get_LinkIds_by_db /Library/ > > Perl/5.8.6/Bio/ > > DB/EUtilities/ElinkData.pm:201 > > STACK Bio::DB::EUtilities::get_ids /Library/Perl/5.8.6/Bio/DB/ > > EUtilities.pm:483 > > STACK toplevel getOmimNum:15 > > > > -------------------------------------- > > > > What does this mean?? > > > > Rohan > > > > Quoting Chris Fields : > > > >> Okay, had to fix an odd bug from ELink due to the way NCBI returns > >> data. > >> > >> You'll need to update the EUtilities modules in bioperl from CVS > >> to make > >> sure this works. 
> >> > >> This is how it's done: > ---------------------------------------- This mail sent through www.mywaterloo.ca From cjfields at uiuc.edu Wed Jul 26 01:01:41 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Wed, 26 Jul 2006 00:01:41 -0500 Subject: [Bioperl-l] SNP reference file download In-Reply-To: <1153881857.44c6d7016775f@www.nexusmail.uwaterloo.ca> References: <000001c6b01f$bfd54e20$15327e82@pyrimidine> <1153868024.44c6a0f83fce6@www.nexusmail.uwaterloo.ca> <775C8983-D81E-474C-B4F5-C765ABEF4A2C@uiuc.edu> <1153881857.44c6d7016775f@www.nexusmail.uwaterloo.ca> Message-ID: The below ID doesn't have any OMIM linked data, hence the warning. The problem is that NCBI, when it doesn't find a link, doesn't send something constructive to tell you that. It sends the original ID encoded in XML, but no actual DB's or ID data links. That's what the warning means. I'll make the original warning a bit more direct: No databases returned; no IDs found. The thrown error is from a logic problem; I have fixed it and committed to CVS. Here's the web page: no OMIM data there either... http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?rs=4986950 Try changing your ID list to this: my @ids = qw(4986950 1800562); You should get back only one ID (only one has an OMIM number). By the way, the SNP data ID is only the digits (don't include the 'rs' designation). Chris On Jul 25, 2006, at 9:44 PM, vrramnar at student.cs.uwaterloo.ca wrote: > > Hey Chris, > > I believe I updated all those modules already as I downloaded the > entire DB.tar > from Bioperl live. 
Here is my code: > > #!/usr/bin/perl -w > > use Bio::Perl; > use Bio::DB::EUtilities; > > my @ids = qw(rs4986950); > # With the "rs" before the number the warning says: "no returned > links" > # Without the "rs" before the number the warning says: "No > databases returned; > empty linkset" > > > my $elink = Bio::DB::EUtilities->new( -eutil => 'elink', > -id => \@ids, > -db => 'omim', > -dbfrom => 'snp'); > $elink->get_response; > print "IDs: ", join q(,), $elink->get_ids; > > Which gives the following error: > > -------------------- WARNING --------------------- > MSG: No databases returned; empty linkset > --------------------------------------------------- > > ------------- EXCEPTION ------------- > MSG: Must use database to access IDs > STACK Bio::DB::EUtilities::ElinkData::get_LinkIds_by_db /Library/ > Perl/5.8.6/Bio/ > DB/EUtilities/ElinkData.pm:201 > STACK Bio::DB::EUtilities::get_ids /Library/Perl/5.8.6/Bio/DB/ > EUtilities.pm:482 > STACK toplevel getOmimNum:13 > > -------------------------------------- > > All I really want is the OMIM id number under the section: NCBI > Resource Links > from the page: > http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?rs=1800562 > > Any idea why this still isn't working?? > > Rohan > > > Quoting Chris Fields : > >> Odd, I thought XML::Simple was part of the 5.8 core. Guess I was >> wrong. I plan on changing this to a more robust parser soon (likely >> XML::SAX or XML::Twig, which will also require a download). >> >> That warning occurs when if you don't have a link to OMIM present (No >> databases returned; empty linkset). The way Elink works is it stores >> internal data in a separate object (ELinkData) contained in an >> internal cache. The method get_ids() works for all EUtilities to >> retrieve IDs, even from ELink objects. The unique problem with ELink >> is, since you can search multiple databases. you can retrieve >> multiple sets of IDs. 
>> >> If you haven't done it, update your EUtilities; the problem is >> similar to one I fixed today (I stated something about updating in my >> last post). Also, update the main Bio::DB::EUtilities and >> Bio::GenericWebDBI as well (the last is the base class from which >> EUtilities is based). The 'Count:1' was a debugging statement I >> forgot to remove a while ago which I changed in CVS yesterday. It's >> possible that commit had other changes which I forgot about. >> >> Sorry about that, but it is still experimental (emphasis on the >> 'mental'). >> >> Chris >> >> On Jul 25, 2006, at 5:53 PM, vrramnar at student.cs.uwaterloo.ca wrote: >> >>> >>> Hey Chris, >>> >>> Ignore the last email, I fixed that problem and downloaded/ >>> installed the >>> required XML modules. >>> >>> However, I am now getting this error message: >>> >>> -------------------- WARNING --------------------- >>> MSG: No databases returned; empty linkset >>> --------------------------------------------------- >>> Count: 1 >>> >>> ------------- EXCEPTION ------------- >>> MSG: Must use database to access IDs >>> STACK Bio::DB::EUtilities::ElinkData::get_LinkIds_by_db /Library/ >>> Perl/5.8.6/Bio/ >>> DB/EUtilities/ElinkData.pm:201 >>> STACK Bio::DB::EUtilities::get_ids /Library/Perl/5.8.6/Bio/DB/ >>> EUtilities.pm:483 >>> STACK toplevel getOmimNum:15 >>> >>> -------------------------------------- >>> >>> What does this mean?? >>> >>> Rohan >>> >>> Quoting Chris Fields : >>> >>>> Okay, had to fix an odd bug from ELink due to the way NCBI returns >>>> data. >>>> >>>> You'll need to update the EUtilities modules in bioperl from CVS >>>> to make >>>> sure this works. >>>> >>>> This is how it's done: >> > > > > > ---------------------------------------- > This mail sent through www.mywaterloo.ca Christopher Fields Postdoctoral Researcher Lab of Dr. 
Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From bix at sendu.me.uk Wed Jul 26 05:19:29 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Wed, 26 Jul 2006 10:19:29 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <1748BD60-EFBE-4872-94A4-7CA3D58BE458@uiuc.edu> References: <003301c6b000$203cc560$15327e82@pyrimidine> <44C65990.4080500@sendu.me.uk> <1748BD60-EFBE-4872-94A4-7CA3D58BE458@uiuc.edu> Message-ID: <44C733A1.9070201@sendu.me.uk> Chris Fields wrote: > >> It seems like the main problem with Node right now is that it has >> classification() and things like genus(). I propose pure Node method >> solutions to answer the questions classification() and genus() were >> implemented to answer, but in a better, cruft-free way. >> >> Bio::DB::Taxonomy::genbank anyone? > > Ach... You're compromising here; No, I don't think so. Let me explain... (another very long email, but with the same conclusion as above) > 1) Switch out Bio::Species with Node or Taxonomy; relocate other > information temporarily (Bio::Species, get/sets in Seq object, > SimpleValue). Leave Bio::Species in for the time being, but don't > bother making any additional changes to it. [...] > Hence Hilmar's suggestion to use a $seq->taxon() method to return a > Node/Taxonomy, and a $seq->species() would still return a > Bio::Species object. It's redundant, As I see it, the problem to be solved is this: a) A node should just be a node, holding only information about itself (but this can include information on who its parent is, and methods relating to getting its parents/children as new objects - but the data of its parents/children must never be stored on itself). b) Bio::Species isn't very good at its job; you can't ask reasonable taxonomic questions of it and get correct answers. c) We need to transition Bio::Species to something better - something that lets us do the same job as Bio::Species, but do it better. 
An important aspect of 'better' is that we can switch from the taxonomic information in a genbank file or similar to the information in a taxonomic database if we want certain taxonomic questions answered correctly. But also, we should be able to answer all questions with a good chance of a correct answer even without database access/installation. There are a variety of possible solutions. How can we decide which is best? What would a good solution be? The 'something better' we transition Bio::Species to will become the preferred (or at least de facto standard) way of dealing with taxonomic information in bioperl. This taxonomic module (or set of modules) must be able to model taxonomic information anywhere it is found - databases or genbank files or anything else. If it can't, it would be fundamentally flawed. d) We can immediately discount any solution that involves storing some taxonomic information outside of the tax module. If we find ourselves putting lineage data in a genbank file in SimpleValue objects or similar, we can be pretty sure we've used a poor solution to the problem. That would be a compromise. e) If the thing we transition Bio::Species to can't do everything Bio::Species did (doing it in a different and better way is fine of course), it's not suitable for transitioning to (this is why Node needed all the cruft added to it before it was a suitable candidate). If it /can/ do everything Bio::Species did, there would be no harm immediately making Bio::Species inherit from the new tax module, reimplementing Bio::Species as necessary but making no API change. So any solution that would /require/ $seq->taxon() and $seq->species() wouldn't be a good one, and would be a compromise. 
But we do want to get rid of Bio::Species eventually, so I'm not saying we shouldn't have a $seq->taxon() or similar, only that either method would give you the same type of object with the same methods available ($seq->taxon->isa('tax module') && ($seq->species->isa('Bio::Species') && $seq->species->isa('tax module')). I see 2 possible solutions to the problem. What should 'tax module' be?: 1) Bio::Taxonomy or other similar class that is a container of multiple nodes. Naively this makes logical sense since one of the jobs Bio::Species has is to store a lineage, and a lineage is best represented as a set of Nodes. So let's have a single object with all our Nodes in it. Problems: Bio::Taxonomy itself, as currently written, is fundamentally flawed. It requires that you know the ranks and order of ranks of all your input nodes before you input them. It requires that all ranks have unique names. It doesn't handle ranks of 'no rank'. You can't have more than one lineage in an instance because you can't have two nodes with the same rank. If you don't know the ranks of your nodes (ie. genbank) there is no way to maintain the order of your lineage because there is no modelling of parent/child. I had planned to re-write it such that the rank-centric implementation was removed and we had parent/child implementation instead. But then there is nothing to stop you adding nodes that are disconnected from the others, creating a broken mess. Bio::Taxonomy::Tree might have been a little more suitable because it implements Bio::Tree::TreeI, but sadly it is also rank-centric and actually requires input of both Bio::Species and Bio::Taxonomy objects to its most useful methods. More important than issues with current implementations of node-container classes, such classes are unable to let us solve problem c) in a good way, and also leave us potentially storing in memory Node objects representing the same taxonomic node multiple times in different instances of the node-container. 
For problem c) if we were to switch from genbank nodes to database the solution is to delete all the nodes in the container and then get them all again from the database. What if you didn't even have a lineage-related question? You've just retrieved 10s of nodes from the database for no reason (and then store them), when all you wanted was accurate information on the node you were interested in. All in all, it's pretty horrible. Unsuitable implementations plus excess database retrieval plus massive waste of memory with duplicated nodes does not equal a good solution. 2) Bio::Taxonomy::Node. First, solve problem a) by getting rid of methods binomial(), species(), genus(), sub_species(), variant(), organelle(), classification() and show_all(). Except for organelle() which doesn't belong in taxonomy, all of these Bio::Species 'questions' can still be answered by Node - just not in a single method call. I outlined how to answer them in the previous post. For backward compatibility make Bio::Species a Node and implement the suggested way of answering the questions the proper 'Node' way under those methods. Problems: Well, those questions can't actually be answered by Node if the starting point was genbank data or manually created Nodes. The solution is clean and simple: Bio::DB::Taxonomy::genbank or perhaps better named Bio::DB::Taxonomy::list (because it makes a taxonomy database from an ordered list of names - I don't see anything inherently wrong or ugly with that). Then everything magically just works. We get all the power to ask all our questions that Node has already when working with the ncbi database, but we get it when working with genbank data. We suffer none of the problems of a node-container class. We can easily switch databases on the fly. What's not to like? 
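To make the "taxonomy database from an ordered list of names" idea concrete, here is a minimal sketch in plain Perl. It is not BioPerl code: the ListTaxonomy package and its add_lineage(), lineage_names() and lca_name() methods are hypothetical names invented for illustration, and it assumes lineages are given root-first and that a name identifies a node.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the proposed list-based taxonomy database.
# Nothing here is BioPerl API; package and method names are made up.
package ListTaxonomy;

sub new {
    my ($class) = @_;
    return bless { next_id => 1, id_of => {}, name_of => {}, parent_of => {} }, $class;
}

# Add one lineage, ordered from the root down to the species.
# A name that was seen before keeps its id, so ids stay consistent
# across calls. Returns the id of the last (most specific) name.
sub add_lineage {
    my ($self, @names) = @_;
    my $parent_id;
    for my $name (@names) {
        my $id = $self->{id_of}{$name};
        if (!defined $id) {
            $id = $self->{next_id}++;
            $self->{id_of}{$name}   = $id;
            $self->{name_of}{$id}   = $name;
            $self->{parent_of}{$id} = $parent_id;    # undef for the root
        }
        $parent_id = $id;
    }
    return $parent_id;
}

# Names from the root down to the node with the given id.
sub lineage_names {
    my ($self, $id) = @_;
    my @names;
    while (defined $id) {
        unshift @names, $self->{name_of}{$id};
        $id = $self->{parent_of}{$id};
    }
    return @names;
}

# Name of the most recent common ancestor, or undef if there is none.
sub lca_name {
    my ($self, $id_left, $id_right) = @_;
    my @left  = $self->lineage_names($id_left);
    my @right = $self->lineage_names($id_right);
    my $lca;
    while (@left && @right && $left[0] eq $right[0]) {
        $lca = shift @left;
        shift @right;
    }
    return $lca;
}

package main;

my $db    = ListTaxonomy->new;
my $human = $db->add_lineage(qw(Primates Hominidae Homo), 'Homo sapiens');
my $chimp = $db->add_lineage(qw(Primates Hominidae Pan),  'Pan troglodytes');

print join(' > ', $db->lineage_names($human)), "\n";
print "LCA: ", $db->lca_name($human, $chimp), "\n";
```

Note the weakness already mentioned earlier in the thread: a name keeps the first parent it was seen with, so feeding the same taxon with and without an intermediate level (Hominidae present in one list, missing in another) silently produces an inconsistent tree rather than an error.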
From bix at sendu.me.uk Wed Jul 26 06:00:01 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Wed, 26 Jul 2006 11:00:01 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net> References: <000001c6b017$873176a0$15327e82@pyrimidine> <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net> Message-ID: <44C73D21.3010301@sendu.me.uk>

Hilmar Lapp wrote:
> Instead, create something like
>
> # return a Bio::Taxonomy::Node:
> my $taxon = $seq->taxon();

Yes, but $seq->species() would also

> # alternative approach: return a lineage (taxonomy)
> # this would be Bio::TaxonomyI compliant
> my $lineage = $seq->lineage();

I've since come to the conclusion that anything Taxonomy-ish would be inappropriate - see recent post.

> The former would require the lineage (and organelle for completeness)
> information to be either easily (though not necessarily directly)
> accessible through the node, or added as annotation.

That specifically is the main problem with Node as it is now. You shouldn't store information about the lineage (essentially information about other nodes) on the node object itself. Storing it as annotation on the Node or elsewhere is terrible: you lose all the power of Node and can no longer ask any lineage-related questions.

There is no need for this split in functionality - when you don't have database access and just some genbank files, you can't answer any taxonomic questions involving lineage, vs. when you do have database access suddenly you can start doing useful things.

My proposed solution is that bioperl's taxonomy model always lets you answer the same questions regardless of your source for taxonomic information - see recent post.
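To make "the same questions regardless of your source" a bit more concrete, here is a rough sketch in plain Perl using one lineage question: what is the most recent common ancestor of two names? The helper names (parents_from_lists, path_to_root, mrca) are invented for this illustration and are not BioPerl methods, and the taxonomy data is simplified. The same mrca() call is made against parent links derived from file lineages and against a richer map standing in for a real taxonomy database.

```perl
use strict;
use warnings;

# Build a child => parent map from one or more root-first name lists.
# Hypothetical stand-in for a list-backed taxonomy source.
sub parents_from_lists {
    my @lists = @_;
    my %parent_of;
    for my $list (@lists) {
        for my $i (1 .. $#{$list}) {
            $parent_of{ $list->[$i] } = $list->[ $i - 1 ];
        }
    }
    return \%parent_of;
}

# Path from a name up to the root, nearest first, including the name itself.
sub path_to_root {
    my ($parent_of, $name) = @_;
    my @path;
    while (defined $name) {
        push @path, $name;
        $name = $parent_of->{$name};
    }
    return @path;
}

# Most recent common ancestor of two names; returns nothing if none is shared.
# Nothing here knows or cares where the parent map came from.
sub mrca {
    my ($parent_of, $name_a, $name_b) = @_;
    my %on_path_a = map { $_ => 1 } path_to_root($parent_of, $name_a);
    for my $candidate (path_to_root($parent_of, $name_b)) {
        return $candidate if $on_path_a{$candidate};
    }
    return;
}

# Source 1: lineages as they might be read from two sequence files.
my $from_files = parents_from_lists(
    [ 'Eukaryota', 'Metazoa', 'Chordata', 'Mammalia', 'Primates', 'Homo sapiens' ],
    [ 'Eukaryota', 'Metazoa', 'Chordata', 'Mammalia', 'Rodentia', 'Mus musculus' ],
);

# Source 2: a richer (still simplified) parent map, standing in for a database.
my $from_database = {
    'Homo sapiens'     => 'Homo',
    'Homo'             => 'Primates',
    'Primates'         => 'Euarchontoglires',
    'Mus musculus'     => 'Mus',
    'Mus'              => 'Rodentia',
    'Rodentia'         => 'Euarchontoglires',
    'Euarchontoglires' => 'Mammalia',
};

# Identical call for both sources; only the level of detail differs.
for my $source ($from_files, $from_database) {
    print mrca($source, 'Homo sapiens', 'Mus musculus'), "\n";
}
```

The file-derived source can only answer at the resolution the files provide, but the calling code needs no special case for either source.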
From cjfields at uiuc.edu Wed Jul 26 08:16:29 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Wed, 26 Jul 2006 07:16:29 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C733A1.9070201@sendu.me.uk> References: <003301c6b000$203cc560$15327e82@pyrimidine> <44C65990.4080500@sendu.me.uk> <1748BD60-EFBE-4872-94A4-7CA3D58BE458@uiuc.edu> <44C733A1.9070201@sendu.me.uk> Message-ID: <0155C616-8E07-49C8-B4A1-2B77B4DE09F3@uiuc.edu> > ... > > I see 2 possible solutions to the problem. What should 'tax module' > be?: > > 1) Bio::Taxonomy or other similar class that is a container of > multiple > nodes. Naively this makes logical sense since one of the jobs > Bio::Species has is to store a lineage, and a lineage is best > represented as a set of Nodes. So let's have a single object with all > our Nodes in it. Problems: > > Bio::Taxonomy itself, as currently written, is fundamentally > flawed. It > requires that you know the ranks and order of ranks of all your input > nodes before you input them. It requires that all ranks have unique > names. It doesn't handle ranks of 'no rank'. You can't have more than > one lineage in an instance because you can't have two nodes with the > same rank. If you don't know the ranks of your nodes (ie. genbank) > there > is no way to maintain the order of your lineage because there is no > modelling of parent/child. > I had planned to re-write it such that the rank-centric implementation > was removed and we had parent/child implementation instead. But then > there is nothing to stop you adding nodes that are disconnected > from the > others, creating a broken mess. > > Bio::Taxonomy::Tree might have been a little more suitable because it > implements Bio::Tree::TreeI, but sadly it is also rank-centric and > actually requires input of both Bio::Species and Bio::Taxonomy objects > to its most useful methods. 
> > More important than issues with current implementations of > node-container classes, such classes are unable to let us solve > problem > c) in a good way, and also leave us potentially storing in memory Node > objects representing the same taxonomic node multiple times in > different > instances of the node-container. For problem c) if we were to switch > from genbank nodes to database the solution is to delete all the nodes > in the container and then get them all again from the database. > What if > you didn't even have a lineage-related question? You've just retrieved > 10s of nodes from the database for no reason (and then store them), > when > all you wanted was accurate information on the node you were > interested in. > > All in all, it's pretty horrible. Unsuitable implementations plus > excess > database retrieval plus massive waste of memory with duplicated nodes > does not equal a good solution. > > > 2) Bio::Taxonomy::Node. First, solve problem a) by getting rid of > methods binomial(), species(), genus(), sub_species(), > variant(), organelle(), classification() and show_all(). Except for > organelle() which doesn't belong in taxonomy, all of these > Bio::Species > 'questions' can still be answered by Node - just not in a single > method > call. I outlined how to answer them in the previous post. For backward > compatibility make Bio::Species a Node and implement the suggested way > of answering the questions the proper 'Node' way under those methods. > Problems: > > Well, those questions can't actually be answered by Node if the > starting > point was genbank data or manually created Nodes. The solution is > clean > and simple: Bio::DB::Taxonomy::genbank or perhaps better named > Bio::DB::Taxonomy::list (because it makes a taxonomy database from an > ordered list of names - I don't see anything inherently wrong or ugly > with that). Then everything magically just works. 
We get all the power
> to ask all our questions that Node has already when working with the
> ncbi database, but we get it when working with genbank data. We suffer
> none of the problems of a node-container class. We can easily switch
> databases on the fly.

That 'broken mess' (referring to Bio::Taxonomy) is up to the user. You could make it more stringent (i.e. only allow connected nodes, starting with a single initiating node then build from there), though I don't think that's necessary as most people would probably use some sort of factory to generate a taxonomy (a warning might be appropriate). You would have to watch out for potential circular structures. Have it do what you want. I believe the original intent of Taxonomy was to allow building a full-fledged taxonomic structure, so it should stay that way.

Sendu, you have to realize this is up to how you want to implement it. We're giving you the freedom to do what you want to Bio::Taxonomy. Of course, if we think you're off we'll reel you back in, but you seem to be on the right track. Realize that the only contentious issue here is that horrible lineage line in the GenBank file. We should have a way to rebuild it as it was from the original file (i.e. not rebuild it from scratch with DB lookups by default). However, you should also have the option to rebuild it from lookups (i.e. correctly), which you could do with a Taxonomy.

Note this Bio::Taxonomy method:

classify

 Title   : classify
 Usage   : @obj[][0-1] = taxonomy->classify($species);
 Function: return a ranked classification
 Returns : @obj of taxa and ranks as word pairs separated by "@"
 Args    : Bio::Species object

As Bio::Species will be deprecated, you can use that method in a dual, sneaky way: 1) directly store the lineage information, 2) return the real one (DB lookups) if needed (i.e. if some flag is set, for instance). And, if a Bio::Species argument is used, do what the docs state (catch it early on with an if block and return within it).
Bio::Species, as used within genbank.pm, doesn't use Bio::Taxonomy in any way. I don't know if you even need to retain its original purpose here; you might be able to get away with changing the fundamental way this method works altogether. That's up to you. my 2c Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From bix at sendu.me.uk Wed Jul 26 08:49:05 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Wed, 26 Jul 2006 13:49:05 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <0155C616-8E07-49C8-B4A1-2B77B4DE09F3@uiuc.edu> References: <003301c6b000$203cc560$15327e82@pyrimidine> <44C65990.4080500@sendu.me.uk> <1748BD60-EFBE-4872-94A4-7CA3D58BE458@uiuc.edu> <44C733A1.9070201@sendu.me.uk> <0155C616-8E07-49C8-B4A1-2B77B4DE09F3@uiuc.edu> Message-ID: <44C764C1.9010804@sendu.me.uk> Chris Fields wrote: > We're giving you the freedom to do what you want to Bio::Taxonomy. I don't want to do anything with Bio::Taxonomy any more. I've already shown that it isn't suitable for the job. Regardless of how it is implemented, the entire idea of a class that contains Nodes isn't appropriate, for reasons already stated. > Realize that the only contentious issue here is > that horrible lineage line in the GenBank file. We should have a way to > rebuild it as it was from the original file (i.e. not rebuild it from > scratch with DB lookups by default). However, you should also have the > option to rebuild it from lookups (i.e. correctly), which you could do > with a Taxonomy. And I've already shown how rebuilding with a Taxonomy is very far from ideal, while switching db_handle on a Node would be perfect. Why are you now advocating Taxonomy when there is no reason to? 
> Note this Bio::Taxonomy method: > > classify > > Title : classify > Usage : @obj[][0-1] = taxonomy->classify($species); > Function: return a ranked classification > Returns : @obj of taxa and ranks as word pairs separated by "@" > Args : Bio::Species object Note that all this method does is let you combine a list of rank names with the classification array in a Bio::Species, spitting out some weird data structure. It is only of interest to Bio::Taxonomy::Tree. We're in the situation where we don't know the rank names corresponding to the classification array in a Bio::Species generated by genbank et al. So classify() is of zero value. > As Bio::Species will be deprecated, you can use that method in a dual, > sneaky way: 1) directly store the lineage information, No. Lineage information must be in the form of Nodes or you can't answer lineage-related taxonomic questions. > 2) return the real one (DB lookups) if needed Messy. Doing it with Node would be far superior. Again, Node works all the time, while Taxonomy would work badly or not at all some of the time. Rather than suggest ways of using Taxonomy, tell me what is wrong with my current Node plan. From cjfields at uiuc.edu Wed Jul 26 11:15:28 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Wed, 26 Jul 2006 10:15:28 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C764C1.9010804@sendu.me.uk> Message-ID: <002801c6b0c6$59279fa0$15327e82@pyrimidine> I advocate anything but Bio::Species that allows you the option to use lookups for correct taxonomic information and not guesswork (current Bio::Species). So, you could pretty much replace Species immediately with a DB-aware container object with simple get/sets. As of now, that would be that Node or Taxonomy. I have done this already, just haven't committed it yet. And, when I mentioned having freedom to do what you want with Bio::Taxonomy, that includes all of it (including Node, Tree, etc). 
We just want it to be reasonable and not 'duct tape' for the various Bio::Species mistakes of the past. I don't think the problem here is really that complicated (still, the only thing is the lineage stuff in a sequence file, right?). > > As Bio::Species will be deprecated, you can use that method in a dual, > > sneaky way: 1) directly store the lineage information, > > No. Lineage information must be in the form of Nodes or you can't answer > lineage-related taxonomic questions. You must have a way to store the 'horrible lineage information' data, as is, for those users who do not care about taxonomy and just want to convert seq streams. You shouldn't burden the everyday user with something that is pretty specialized, this being finding correct taxonomic information based on DB lookups for a particular reason (screening sequences, as Hilmar pointed out, was one possibility). I don't care how, but store lineage information as it appears in the file (scalar string) or in a simple data structure (array, maybe?) capable of retaining the information in some way. There are many many ways of doing this which I have previously pointed out; take your pick. Hilmar, in a previous post, told me to take a step back and contemplate a world w/o Bio::Species, where you would design a system capable of dealing with sequence file taxonomic data in a way that allows you to get correct tax information when needed via NCBI Taxonomy data, yet not sacrifice speed if you're just interested in converting sequences via SeqIO. Would you design a Bio::Species class, then? Would you attempt to spend time parsing out species and genus information, when the correct data is sitting on the NCBI server or in a local flatfile? No. You would retain the minimal data necessary in an object for reading and writing data, but have the >option< available to run a lookup. Therefore, Bio::Taxonomy::Node was born. A little prematurely, yes. Probably needed to bake a bit more... 
Anyway, we must eventually sever our reliance on Bio::Species in order to deprecate it, so the lineage information must be contained, as it appears in the file, somewhere else. And my point with the classify() Bio::Taxonomy method is not to use it as is; you could sneak in your own data if needed. It was an example of a possible way of containing the lineage data, but not meant to be an absolute way. It's up to you how you want to implement it. I think the classes that are currently in place are more than capable of handling the job. Hence my statement before that you are trying to get too many things going right out the starting gate. Start simply by replacing Bio::Species, then worry about other issues. If you think that a specialized class would work, fine, but IMHO I don't think it's absolutely necessary. I had proposed such a class before (more like a Bio::Species-like Tax object) but was shut down, and rightly so; it's unnecessarily complicated and 'contaminates' Bio::Taxonomy with extra unnecessary methods (classification(), genus(), and so on). My last proposal was to eventually strip out the unreliable taxonomic parsing in the various SeqIO modules and replace it with something simple, which seemed to be a consensus among us all. This has to do with Hilmar's post-apocalyptic vision of a Bio::Species-free world. That will eventually happen, and Bioperl will eventually switch over completely to Bio::Taxonomy::Whatever. And Bio::Species can join BPLite and other deprecated modules in the BioPerl Boot Hill. But, for now that can't happen. We all strive for the best information possible. However, you can't sacrifice the needs of other users, a majority whom probably care squat about taxonomy, with your (our) own needs. As I have repeatedly stated, simple is good. We can't just usurp the API for our own wishes w/o warning, so the change has to be gradual and Bio::Species must stick around for the time being. 
And we must make it optional to have DB lookups or the villagers will be storming the castle. Listen, Sendu. If you can wait a couple of weeks for further discussion then we can slog on with this. But right now I just don't have any more time for this, sorry. You can have the last word and I'll respond when I get back. Chris
From morissardj at gmail.com Wed Jul 26 10:59:54 2006 From: morissardj at gmail.com (Morissard jérome) Date: Wed, 26 Jul 2006 14:59:54 +0000 (UTC) Subject: [Bioperl-l] Accessing TRANSFAC matrices References: <20060717145000.0aig7ymcmurk4wsk@webmail.embl.de> <44BEA9FB.1070009@utk.edu> Message-ID: Hi that may help you ?
http://morissardjerome.free.fr/Data/files/matrices.zip
From hlapp at gmx.net Wed Jul 26 11:36:32 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Wed, 26 Jul 2006 11:36:32 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C73D21.3010301@sendu.me.uk> References: <000001c6b017$873176a0$15327e82@pyrimidine> <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net> <44C73D21.3010301@sendu.me.uk> Message-ID:

On Jul 26, 2006, at 6:00 AM, Sendu Bala wrote:

> Hilmar Lapp wrote:
>> Instead, create something like
>>
>> # return a Bio::Taxonomy::Node:
>> my $taxon = $seq->taxon();
>
> Yes, but $seq->species() would also

$seq->species() would return a Bio::Species object which may not be more than a thin shell anymore around an implementation that delegates almost everything to a lineage object (Bio::Taxonomy).

$seq->taxon() in contrast need not return such a backwards-compatible construct.

>
>> # alternative approach: return a lineage (taxonomy)
>> # this would be Bio::TaxonomyI compliant
>> my $lineage = $seq->lineage();
>
> I've since come to the conclusion that anything Taxonomy-ish would be
> inappropriate - see recent post.

Not sure which one you mean, and please don't reference really long emails, you're asking a lot of other people to organize your thoughts for them.

At any rate, my point is that if you only name it appropriately you can avoid misconceptions about what is being returned. The fact that it's confusing to return a taxonomy from a method called species() doesn't mean it's equally bad to return a lineage (which is a limited taxonomy) from a method called lineage().

> [...]
>
> My proposed solution is that bioperl's taxonomy model always lets you
> answer the same questions regardless of your source for taxonomic
> information - see recent post.

See above ... And I'd rather see some code or API examples than extensive elaborations.
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at gmx.net Wed Jul 26 11:38:50 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Wed, 26 Jul 2006 11:38:50 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C733A1.9070201@sendu.me.uk> References: <003301c6b000$203cc560$15327e82@pyrimidine> <44C65990.4080500@sendu.me.uk> <1748BD60-EFBE-4872-94A4-7CA3D58BE458@uiuc.edu> <44C733A1.9070201@sendu.me.uk> Message-ID: <3E7DE5A0-C094-4894-BAB9-03F62EA98A3D@gmx.net> On Jul 26, 2006, at 5:19 AM, Sendu Bala wrote: > Chris Fields wrote: >> >>> It seems like the main problem with Node right now is that it has >>> classification() and things like genus(). I propose pure Node method >>> solutions to answer the questions classification() and genus() were >>> implemented to answer, but in a better, cruft-free way. >>> >>> Bio::DB::Taxonomy::genbank anyone? >> >> Ach... You're compromising here; > > No, I don't think so. Let me explain... > (another very long email, but with the same conclusion as above) Sorry, can you summarize this in a few sentences? If you do want feedback from me you really need to be more concise. -hilmar > > >> 1) Switch out Bio::Species with Node or Taxonomy; relocate other >> information temporarily (Bio::Species, get/sets in Seq object, >> SimpleValue). Leave Bio::Species in for the time being, but don't >> bother making any additional changes to it. > [...] >> Hence Hilmar's suggestion to use a $seq->taxon() method to return a >> Node/Taxonomy, and a $seq->species() would still return a >> Bio::Species object. 
It's redundant,
> [...]
-- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : ===========================================================
From jay at jays.net Wed Jul 26 11:32:53 2006 From: jay at jays.net (Jay Hannah) Date: Wed, 26 Jul 2006 08:32:53 -0700 Subject: [Bioperl-l] Anyone else at OSCON right now? Message-ID: <44C78B25.80503@jays.net> Any other BioPerl'ers here in Portland for OSCON? I'd love to chat about your life w/ BioPerl. I'm here until Saturday morning. j http://oscon.kwiki.org/index.cgi?JayHannah
From adamnkraut at gmail.com Wed Jul 26 10:32:42 2006 From: adamnkraut at gmail.com (Adam Kraut) Date: Wed, 26 Jul 2006 10:32:42 -0400 Subject: [Bioperl-l] Structure::IO In-Reply-To: <716af09c0607250347w24e0d8dbj674758049bc0a4e2@mail.gmail.com> References: <716af09c0607250347w24e0d8dbj674758049bc0a4e2@mail.gmail.com> Message-ID: <134ede0b0607260732u79f0dea2if8f4ea98a5e03524@mail.gmail.com> Hi bernd, Can you better explain what it is you want to do with pdb files?
From your example it looks like you want to do something with each chain, but it is unclear what you want to do here: my @chains = $struc->chain($chain); With that said, I was never able to use Bio::Structure in the way that I wanted. I now use the MMTSB Perl libraries instead: http://mmtsb.scripps.edu/cgi-bin/tooldoc?perlpackages Specifically the Molecule module may be useful here. Regards, Adam On 7/25/06, Bernd Web wrote: > > Hi, > > Does someone have experience with Bio::Structure::IO? > The example III.9.1 from the bptutorial.pl covers most, but what is e.g. > the > chain() method of Bio::Structure::Entry doing? The POD states: > > Title : chain > Usage : @chains = $structure->chain($chain); > Function: Connects a (or a list of) Chain objects to a > Bio::Structure::Entry. > Returns : list of Bio::Structure::Residue objects > Args : One Residue or a reference to an array of Residue objects > > But in e.g > my $stream = Bio::Structure::IO->new(-file => $filename, > -format => 'pdb'); > while ( my $struc = $stream->next_structure() ) { > for my $chain ($struc->get_chains) { > my $chainid = $chain->id; > my @chains = $struc->chain($chain); > } > } > > I get Bio::Structure::Chain=HASH(0x9f1ab50). > > What is the function of the chain method and how to use it? > > Best regards, > bernd > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- Adam N. Kraut National Resource for Biomedical Supercomputing http://www.nrbsc.org/sb/ From bix at sendu.me.uk Wed Jul 26 12:11:25 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Wed, 26 Jul 2006 17:11:25 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <002801c6b0c6$59279fa0$15327e82@pyrimidine> References: <002801c6b0c6$59279fa0$15327e82@pyrimidine> Message-ID: <44C7942D.6050603@sendu.me.uk> Chris Fields wrote: >> No. 
Lineage information must be in the form of Nodes or you can't answer
>> lineage-related taxonomic questions.
>
> You must have a way to store the 'horrible lineage information' data, as is,
> for those users who do not care about taxonomy and just want to convert seq
> streams. You shouldn't burden the everyday user with something that is
> pretty specialized, this being finding correct taxonomic information based
> on DB lookups for a particular reason (screening sequences, as Hilmar
> pointed out, was one possibility).

I am certainly not requiring that anyone find 'correct taxonomic information'. The whole reason I am backing my current proposal is that it works equally well with or without access to NCBI's taxonomy database. Your proposals work poorly without access to such.

> I don't care how, but store lineage information as it appears in the file
> (scalar string) or in a simple data structure (array, maybe?) capable of
> retaining the information in some way. There are many many ways of doing
> this which I have previously pointed out; take your pick.

I've taken my pick.

To set:

    my $db = new Bio::DB::Taxonomy(-source => 'list', -lineage => \@lineage);
    $node->db_handle($db);

To get:

    @lineage = map { $_->scientific_name } $node->get_Lineage_Nodes;

That is as simple as it is going to get in a world where we have 'pure' Nodes or any other kind of pure taxonomic class. If you want to hide the taxonomic complexity from end-users who want to make and store their own lineage of their species without having to know the details of how bioperl's taxonomy modules are supposed to work, tell them to use Bio::Species:

To set:

    $species->classification(@lineage);

To get:

    @lineage = $species->classification;

Of course in this example I propose that behind the scenes Bio::Species is a Bio::Taxonomy::Node and just implements classification() the pure Node way, given above.
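As a rough illustration of that "thin shell" arrangement, the following plain-Perl sketch shows a species class that inherits from a node class and implements the old classification() get/set purely through a list-backed handle. All package and method names here (Sketch::ListDB, Sketch::Node, Sketch::Species, parent_name, lineage_names) are hypothetical and exist only for this example. The sketch uses root-first order and unique names, both of which are simplifying assumptions rather than a statement about how the real classification() orders its list.

```perl
use strict;
use warnings;

# Hypothetical list-backed source: maps each name to the name before it.
package Sketch::ListDB;

sub new {
    my ($class, @names) = @_;
    my %parent_of;
    $parent_of{ $names[$_] } = $names[ $_ - 1 ] for 1 .. $#names;
    return bless { parent_of => \%parent_of }, $class;
}

sub parent_name { return $_[0]{parent_of}{ $_[1] } }

# Hypothetical pure node: its own name plus a database handle.
package Sketch::Node;

sub new {
    my ($class, %args) = @_;
    return bless { name => $args{name}, db => $args{db} }, $class;
}

sub db_handle {
    my ($self, $db) = @_;
    $self->{db} = $db if defined $db;
    return $self->{db};
}

# Names from the root down to this node, resolved through the handle.
sub lineage_names {
    my ($self) = @_;
    return unless defined $self->{name};
    my @names = ($self->{name});
    return @names unless defined $self->{db};
    while (defined(my $parent = $self->{db}->parent_name($names[0]))) {
        unshift @names, $parent;
    }
    return @names;
}

# Backward-compatible shell: the old get/set call, implemented the Node way.
package Sketch::Species;
our @ISA = ('Sketch::Node');

sub classification {
    my ($self, @lineage) = @_;
    if (@lineage) {
        # The last name is this node; the list becomes its private database.
        $self->{name} = $lineage[-1];
        $self->db_handle(Sketch::ListDB->new(@lineage));
    }
    return $self->lineage_names;
}

package main;

my $species = Sketch::Species->new;
$species->classification('Eukaryota', 'Metazoa', 'Chordata', 'Homo sapiens');
print join(' > ', $species->classification), "\n";
```

Callers that only ever use classification() never see the handle, while code that wants lineage questions answered can treat the same object as a node.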
Let me make my requirement very clear: the solution must allow you to find the most recent common ancestor of two solution-objects without access to the NCBI taxonomy database, using exactly the same method call you would use if you /did/ have access to the NCBI taxonomy database. The method in question shouldn't need any special-case code depending on the presence or absence of NCBI taxonomy database. That's the litmus test. I'll tend to reject any solution that fails. From bix at sendu.me.uk Wed Jul 26 12:25:41 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Wed, 26 Jul 2006 17:25:41 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <3E7DE5A0-C094-4894-BAB9-03F62EA98A3D@gmx.net> References: <003301c6b000$203cc560$15327e82@pyrimidine> <44C65990.4080500@sendu.me.uk> <1748BD60-EFBE-4872-94A4-7CA3D58BE458@uiuc.edu> <44C733A1.9070201@sendu.me.uk> <3E7DE5A0-C094-4894-BAB9-03F62EA98A3D@gmx.net> Message-ID: <44C79785.6050705@sendu.me.uk> Hilmar Lapp wrote: > > On Jul 26, 2006, at 5:19 AM, Sendu Bala wrote: > >>>> It seems like the main problem with Node right now is that it has >>>> classification() and things like genus(). I propose pure Node method >>>> solutions to answer the questions classification() and genus() were >>>> implemented to answer, but in a better, cruft-free way. >>>> >>>> Bio::DB::Taxonomy::genbank anyone? > > Sorry, can you summarize this in a few sentences? If you do want > feedback from me you really need to be more concise. A bad solution-module stores any kind of taxonomic information outside of the solution-module or in an inconsistent form. By 'inconsistent' I mean, sometimes you store the name of a taxonomic rank with $node->node_name, other times you store it in an array or scalar held directly on the solution-module or elsewhere. Bio::Taxonomy specifically is not usable. 
Generally speaking, classes that are containers of multiple nodes are also inappropriate, because they result in excess database retrieval and excess storage of duplicated information amongst instances of such classes. Bio::Taxonomy::Node combined with Bio::DB::Taxonomy::list would probably be ideal. From cjfields at uiuc.edu Wed Jul 26 12:49:40 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Wed, 26 Jul 2006 11:49:40 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: Message-ID: <000001c6b0d3$7d936ec0$15327e82@pyrimidine> Hilmar, apologies ahead of time for not being too concise! It's my last hurrah on this thread. No, really! ... > > Yes, but $seq->species() would also > > $seq->species() would return a Bio::Species object which may not be > more than a thin shell anymore around an implementation that > delegates almost everything to a lineage object (Bio::Taxonomy). > > $seq->taxon() in contrast need not return such a backwards-compatible > construct. In genbank.pm _read_GenBank_Species (initial implementation, to switch out Bio::Species with Taxonomy/Node object): 1) Assign data to both Bio::Species (as currently implemented) and Bio::Taxonomy::Node (new way). 2) Assign organelle to Bio::Species and the Seq object get/set organelle(). 3) Assign lineage information to Bio::Species and as an array to the Seq object get/set lineage(). Replace the get/set above with your method of choice, just no Bio::Species. In genbank.pm write_seq() 1) if DB_lookup flag is defined, use $seq->taxon() to build lineage 2) If not, use $seq->lineage(). The fine details (how do you build the lineage?!?) can be worked out along the way. The wonders of CVS! The Taxonomy class used here could be returned using Hilmar's $seq->taxon() and Bio::Species can be returned via $seq->species(). Makes perfect sense! Separated! Nothing complicated about it. Nice and clean. And Bio::Species can eventually be shown the exit door. Elvis has left the building... 
Organelle-specific sequence TaxIDs, as they refer to the organism and not the organelle, could be placed elsewhere, preferably somewhere more accessible such as $seq->organelle(). And lineage, similarly, could be placed in $seq->lineage(), which would store it as a raw string or as an array. There are many other ways I had pointed out (SimpleValue, Node, etc); I don't care, as long as we eventually sever the Bio::Species tumor from SeqIO. ... > ...And I'd rather see some code or API examples than > extensive elaborations. > > -hilmar Hilmar's right; working code does speaks louder than words. The energy spent in writing up full expositions is better spent elsewhere, hence: I need to get back to work! Wish I could contribute more. Chris From bix at sendu.me.uk Wed Jul 26 13:13:43 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Wed, 26 Jul 2006 18:13:43 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: References: <000001c6b017$873176a0$15327e82@pyrimidine> <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net> <44C73D21.3010301@sendu.me.uk> Message-ID: <44C7A2C7.2070100@sendu.me.uk> Hilmar Lapp wrote: > On Jul 26, 2006, at 6:00 AM, Sendu Bala wrote: > >> Hilmar Lapp wrote: >>> Instead, create something like >>> >>> # return a Bio::Taxonomy::Node: >>> my $taxon = $seq->taxon(); >> Yes, but $seq->species() would also > > $seq->species() would return a Bio::Species object which may not be > more than a thin shell anymore around an implementation that > delegates almost everything to a lineage object (Bio::Taxonomy). I actually forgot to finish that sentence. I was going to suggest Bio::Species isa Bio::Taxonomy::Node and would indeed delegate most of its implementation to Node. >>> # alternative approach: return a lineage (taxonomy) >>> # this would be Bio::TaxonomyI compliant >>> my $lineage = $seq->lineage(); >> I've since come to the conclusion that anything Taxonomy-ish would be >> inappropriate - see recent post. 
>
> The fact that it's confusing to return a taxonomy from a method called species()
> doesn't mean it's equally bad to return a lineage (which is a limited
> taxonomy) from a method called lineage().

You wouldn't need to though. If you want a lineage you could ask your
node for its lineage. There's no point in having a whole other class
that contains a node and all its ancestor nodes, when to get the
ancestors of a node all you have to do is $node->get_Lineage_Nodes().

>> My proposed solution is that bioperl's taxonomy model always lets you
>> answer the same questions regardless of your source for taxonomic
>> information - see recent post.
>
> See above ... And I'd rather see some code or API examples

The fine details of the following may be slightly off, but it's just to
provide an example. I'll use Test.pm syntax.

my @human = ('Homo sapiens', 'Homo', 'Mammalia', 'Eukaryota');
my @mouse = ('Mus musculus', 'Mus', 'Mammalia', 'Eukaryota');


Old way with Node
-----------------

my $h_node = new Bio::Taxonomy::Node(-classification => \@human);
my $m_node = new Bio::Taxonomy::Node(-classification => \@mouse);

@human = map { $_->scientific_name } $h_node->get_Lineage_Nodes;
ok @human, 0; # failure to work as expected

@human = $h_node->classification;
ok join(", ", @human), "Homo sapiens, Homo, Mammalia, Eukaryota";

my $lca = $h_node->get_LCA_Node($m_node);
ok $lca, undef; # failure to do anything useful because our lineage data
                # is in an array, not in nodes

# try again with entrez - must make brand new objects
my $db = new Bio::DB::Taxonomy(-source => 'entrez');
$h_node = $db->get_Taxonomy_Node(-name => 'Homo sapiens');
$m_node = $db->get_Taxonomy_Node(-name => 'Mus musculus');

@human = map { $_->scientific_name } $h_node->get_Lineage_Nodes;
ok join(", ", @human) eq "Homo sapiens, Homo, Homo/Pan/Gorilla group, Hominidae, ..."; # now it works!

$lca = $h_node->get_LCA_Node($m_node);
ok $lca->scientific_name, 'Mammalia'; # and now this works!


Old way with Bio::Species
-------------------------

# forget about it, Species has nothing like a get_LCA_Node()


Proposed way with Node
----------------------

my $db = new Bio::DB::Taxonomy(-source => 'list', -lineage => \@human);
my $h_node = $db->get_Taxonomy_Node(-name => 'Homo sapiens');
$db->add_lineage(@mouse); # or make a new db
my $m_node = $db->get_Taxonomy_Node(-name => 'Mus musculus');

@human = map { $_->scientific_name } $h_node->get_Lineage_Nodes;
ok join(", ", @human), "Homo sapiens, Homo, Mammalia, Eukaryota"; # works as expected

my $lca = $h_node->get_LCA_Node($m_node);
ok $lca->scientific_name, 'Mammalia'; # works first time

# try again with entrez - just change the db_handle
$h_node->db_handle(new Bio::DB::Taxonomy(-source => 'entrez'));

@human = map { $_->scientific_name } $h_node->get_Lineage_Nodes;
ok join(", ", @human) eq "Homo sapiens, Homo, Homo/Pan/Gorilla group, Hominidae, ...";

$lca = $h_node->get_LCA_Node($m_node);
ok $lca->scientific_name, 'Mammalia';


Proposed way with Bio::Species
------------------------------

# (Bio::Species isa Bio::Taxonomy::Node, implements its methods like
# above)

my $h_species = new Bio::Species(-classification => \@human);
my $m_species = new Bio::Species(-classification => \@mouse);

@human = map { $_->scientific_name } $h_species->get_Lineage_Nodes;
ok join(", ", @human), "Homo sapiens, Homo, Mammalia, Eukaryota";

@human = $h_species->classification;
ok join(", ", @human), "Homo sapiens, Homo, Mammalia, Eukaryota";

my $lca = $h_species->get_LCA_Node($m_species);
ok $lca->scientific_name, 'Mammalia';

# trying again with entrez behaves as per proposed Node, above

From angshu96 at gmail.com Wed Jul 26 13:15:35 2006
From: angshu96 at gmail.com (Angshu Kar)
Date: Wed, 26 Jul 2006 12:15:35 -0500
Subject: [Bioperl-l] WUBLASTP parsing problem
Message-ID:

Hi,

Does WU-BLASTP have anything to do with the length of the sequence names (or with the sequence names themselves)?
What is happening here is I use fasta format proteins to build the blast (I do a distributed blastp) report. But when I parse the report (using bioperl), the query column remains empty for some results as : * 328857 6.6e-135 325331 6.3e-114 325329 1.0e-113 325332 1.7e-113 325330 2.7e-113 . . *. while for some its perfect as: *267750 280003 7.5e-301 267750 348279 7.5e-301 267750 345867 2.0e-300 267750 251915 2.0e-300 267750 346539 6.7e-300 . *. . Some of my sequences are as: *IMGA|AC159872_38.1 hypothetical protein AC159872.12 35121-35051 H EGN_Mt050401 20060209 TIGR 1671.m00013 mrsciilhnmivederdtyaqrwtefeqpggngsstpqpystelrdpdvhhklqtdlvkh iwikfgmyrd* * And part of the blastp (the one where I'm facing the issue) result is as: *Smallest * * Sum High Probability Sequences producing High-scoring Segment Pairs: Score P(N) N gi|33333045|gb|AAQ11687.1| MADS box protein [Triticum aes... 1318 6.6e-135 1 gi|47681327|gb|AAT37484.1| MADS5 protein [Dendrocalamus l... 1120 6.3e-114 1 gi|47681331|gb|AAT37486.1| MADS7 protein [Dendrocalamus l... 1118 1.0e-113 1 gi|47681325|gb|AAT37483.1| MADS4 protein [Dendrocalamus l... 1116 1.7e-113 1 gi|47681329|gb|AAT37485.1| MADS6 protein [Dendrocalamus l... 1114 2.7e-113 1 gi|47681323|gb|AAT37482.1| MADS3 protein [Dendrocalamus l... 1114 2.7e-113 1 11674.m04224|LOC_Os08g41950|protein K-box region, putative 976 1.1e-98 1 gi|28630961|gb|AAO45877.1| MADS5 [Lolium perenne] 967 1.0e-97 1 gi|44888605|gb|AAS48129.1| AGAMOUS LIKE9-like protein [Ho... 964 2.1e-97 1 11674.m04223|LOC_Os08g41950|protein K-box region, putative 899 1.6e-90 1 gi|34979580|gb|AAQ83834.1| MADS box protein [Asparagus of... 875 5.8e-88 1* Could you please let me know if I'm missing something? Has the gi got to do anything with this? Thanking you, Angshu From cain.cshl at gmail.com Wed Jul 26 12:19:26 2006 From: cain.cshl at gmail.com (Scott Cain) Date: Wed, 26 Jul 2006 12:19:26 -0400 Subject: [Bioperl-l] Installing staden io_lib on windows? 
Message-ID: <1153930767.2632.5.camel@localhost.localdomain> Hi all, I'm wondering if anyone has tried to install Staden's io_lib on Windows, and if so, how did it go? I am not much of a Windows person, but I've tried to make it under cygwin only to get this message: make all-recursive make[1]: Entering directory `/home/scott/io_lib-1.9.2' Making all in read make[2]: Entering directory `/home/scott/io_lib-1.9.2/read' if gcc -DHAVE_CONFIG_H -I. -I. -I.. -I.. -I../include -I../read -I../alf -I../abi -I../ctf -I../ztr -I../plain -I../scf -I../sff -I../exp_file -I../utils -I/usr/local/include -g -O2 -MT Read.o -MD -MP -MF ".deps/Read.Tpo" -c -o Read.o Read.c; \ then mv -f ".deps/Read.Tpo" ".deps/Read.Po"; else rm -f ".deps/Read.Tpo"; exit 1; fi In file included from Read.h:43, from Read.c:40: ../utils/os.h:346:2: #error Must define SP_BIG_ENDIAN or SP_LITTLE_ENDIAN in Makefile make[2]: *** [Read.o] Error 1 make[2]: Leaving directory `/home/scott/io_lib-1.9.2/read' make[1]: *** [all-recursive] Error 1 make[1]: Leaving directory `/home/scott/io_lib-1.9.2' make: *** [all] Error 2 I'm guessing there is a flag I can pass to the configure script to get the endian-ness right, but I don't know (and I don't know if this is just the beginning of a long, fruitless road :-) I would like to use Bio::SCF (from CPAN) in conjuction with the trace glyph in BioGraphics to view traces in GBrowse. Thanks for any advice, Scott -- ------------------------------------------------------------------------ Scott Cain, Ph. D. cain.cshl at gmail.com GMOD Coordinator (http://www.gmod.org/) 216-392-3087 Cold Spring Harbor Laboratory -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part Url : http://lists.open-bio.org/pipermail/bioperl-l/attachments/20060726/ae4b06a0/attachment.bin From morissardj at gmail.com Wed Jul 26 16:49:58 2006 From: morissardj at gmail.com (leverdeterre) Date: Wed, 26 Jul 2006 13:49:58 -0700 (PDT) Subject: [Bioperl-l] Accessing TRANSFAC matrices In-Reply-To: References: <44BEA9FB.1070009@utk.edu> Message-ID: <5510746.post@talk.nabble.com> i'm happy for helping you i'have done a page whitch can interrest you http://morissardjerome.free.fr/Data/index.html there is more information about the 397 matrix file ( in the 3 first line) and i'm done all the logo file . ++ -- View this message in context: http://www.nabble.com/Re%3A-Accessing-TRANSFAC-matrices-tf1969467.html#a5510746 Sent from the Perl - Bioperl-L forum at Nabble.com. From morissardj at gmail.com Wed Jul 26 17:15:19 2006 From: morissardj at gmail.com (leverdeterre) Date: Wed, 26 Jul 2006 14:15:19 -0700 (PDT) Subject: [Bioperl-l] Blast Output Parsing In-Reply-To: References: Message-ID: <5511136.post@talk.nabble.com> and without Bioperl i think that may help you http://morissardjerome.free.fr/perl/blastparser.html -- View this message in context: http://www.nabble.com/Blast-Output-Parsing-tf1974691.html#a5511136 Sent from the Perl - Bioperl-L forum at Nabble.com. From osborne1 at optonline.net Wed Jul 26 17:00:50 2006 From: osborne1 at optonline.net (Brian Osborne) Date: Wed, 26 Jul 2006 17:00:50 -0400 Subject: [Bioperl-l] SeqUtils In-Reply-To: <716af09c0607250444y3e005fb1t4e20094fd8db993d@mail.gmail.com> Message-ID: Bernd, That's easily done, changed both POD and code. Brian O. On 7/25/06 7:44 AM, "Bernd Web" wrote: > Hi, > > With Bio::SeqUtils it may be nice to support 3 letter codes with > capitals only, too. > Now > > my $string = Bio::SeqUtils->seq3in($seqobj, 'METGLYTER'); > > will give in $string->seq: XXX. 
> > Possibly the capitals in MetGlyTer are used to find the amino acids codes? > If not maybe it's easy to implement case-insensitive, or all-capitals > for AA codes in SeqUtils? > > In addition about the POD: maybe it's better not use use $string since > Bio::SeqUtils->seq3in does not return a string but a Bio::PrimarySeq > object. > > Regards, > Bernd > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From osborne1 at optonline.net Wed Jul 26 17:24:34 2006 From: osborne1 at optonline.net (Brian Osborne) Date: Wed, 26 Jul 2006 17:24:34 -0400 Subject: [Bioperl-l] Structure::IO In-Reply-To: <716af09c0607250347w24e0d8dbj674758049bc0a4e2@mail.gmail.com> Message-ID: Bernd, I'm not following your question. The POD in the latest Bio::Structure::Entry shows: =head2 chain() Title : chain Usage : @chains = $structure->chain($chain); Function: Connects a Chain or a list of Chain objects to a Bio::Structure::Entry. Returns : List of Bio::Structure::Chain objects Args : A Chain or a reference to an array of Chain objects =cut Which is not what you've copied and pasted. What version of Bioperl do you use? Brian O. On 7/25/06 6:47 AM, "Bernd Web" wrote: > Hi, > > Does someone have experience with Bio::Structure::IO? > The example III.9.1 from the bptutorial.pl covers most, but what is e.g. the > chain() method of Bio::Structure::Entry doing? The POD states: > > Title : chain > Usage : @chains = $structure->chain($chain); > Function: Connects a (or a list of) Chain objects to a Bio::Structure::Entry. 
> Returns : list of Bio::Structure::Residue objects > Args : One Residue or a reference to an array of Residue objects > > But in e.g > my $stream = Bio::Structure::IO->new(-file => $filename, > -format => 'pdb'); > while ( my $struc = $stream->next_structure() ) { > for my $chain ($struc->get_chains) { > my $chainid = $chain->id; > my @chains = $struc->chain($chain); > } > } > > I get Bio::Structure::Chain=HASH(0x9f1ab50). > > What is the function of the chain method and how to use it? > > Best regards, > bernd > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From hlapp at gmx.net Thu Jul 27 01:06:52 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 27 Jul 2006 01:06:52 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C7A2C7.2070100@sendu.me.uk> References: <000001c6b017$873176a0$15327e82@pyrimidine> <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net> <44C73D21.3010301@sendu.me.uk> <44C7A2C7.2070100@sendu.me.uk> Message-ID: <86D32350-0412-40AC-8C34-51DCD3064711@gmx.net> I think this looks like a great solution. You could also name Bio::DB::Taxonomy::list as Bio::DB::Taxonomy::inmemory because it really isn't much else than an in-memory database (of limited content if you populate it from flat-file sequence annotation). The only reservation I have is that you'd have methods on Node that don't really operate on the node instance but rather operate on the taxonomy (database) behind the scenes. That's what I would have used Bio::Taxonomy for, not so much as a container than as a class with (conceptually) 'static' methods corresponding to those that are now in Node, like get_Lineage_Nodes(). They would optionally accept a db_handle too, or use a default one set as an attribute. However, leaving/having these methods on Node really isn't such a big deal and I'm sure would even be preferred by many people for the sake of simplicity. 
So overall I think you should just go ahead.

	-hilmar

On Jul 26, 2006, at 1:13 PM, Sendu Bala wrote:

>
> The fine details of the following may be slightly off, but it's just to
> provide an example. I'll use Test.pm syntax.
>
> my @human = ('Homo sapiens', 'Homo', 'Mammalia', 'Eukaryota');
> my @mouse = ('Mus musculus', 'Mus', 'Mammalia', 'Eukaryota');
>
>
> [...]
> Proposed way with Node
> ----------------------
>
> my $db = new Bio::DB::Taxonomy(-source => 'list', -lineage => \@human);
> my $h_node = $db->get_Taxonomy_Node(-name => 'Homo sapiens');
> $db->add_lineage(@mouse); # or make a new db
> my $m_node = $db->get_Taxonomy_Node(-name => 'Mus musculus');
>
> @human = map { $_->scientific_name } $h_node->get_Lineage_Nodes;
> ok join(", ", @human), "Homo sapiens, Homo, Mammalia, Eukaryota";
> # works as expected
>
> my $lca = $h_node->get_LCA_Node($m_node);
> ok $lca->scientific_name, 'Mammalia'; # works first time
>
> # try again with entrez - just change the db_handle
> $h_node->db_handle(new Bio::DB::Taxonomy(-source => 'entrez'));
>
> @human = map { $_->scientific_name } $h_node->get_Lineage_Nodes;
> ok join(", ", @human) eq "Homo sapiens, Homo, Homo/Pan/Gorilla group,
> Hominidae, ...";
>
> $lca = $h_node->get_LCA_Node($m_node);
> ok $lca->scientific_name, 'Mammalia';
>
> [...]
-- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From bix at sendu.me.uk Thu Jul 27 03:07:22 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 27 Jul 2006 08:07:22 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <86D32350-0412-40AC-8C34-51DCD3064711@gmx.net> References: <000001c6b017$873176a0$15327e82@pyrimidine> <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net> <44C73D21.3010301@sendu.me.uk> <44C7A2C7.2070100@sendu.me.uk> <86D32350-0412-40AC-8C34-51DCD3064711@gmx.net> Message-ID: <44C8662A.3080904@sendu.me.uk> Hilmar Lapp wrote: > The only reservation I have is that you'd have methods on Node that > don't really operate on the node instance but rather operate on the > taxonomy (database) behind the scenes. That's what I would have used > Bio::Taxonomy for, not so much as a container than as a class with > (conceptually) 'static' methods corresponding to those that are now > in Node, like get_Lineage_Nodes(). Yes, I had the same reservation. But it somehow seemed reasonable for me to ask a node for its lineage, though I draw the line at having a method like get_node('rank_name'). That's the only thing Bio::Taxonomy would have been good for, so it's a trade off between some nice methods and the problems inherent in a node-container class. Though, perhaps we almost have the best of both worlds, since the database is effectively a container without the problems: $node->db_handle->get_Taxonomy_Node(-rank 'rank_name', -lineage_of => $node); ? > So overall I think you should just go ahead. Great, will do. 
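
As a rough illustration of the requirement discussed in this thread (finding the common ancestor of two taxa from lineage lists alone, with no NCBI lookup), here is a small plain-Perl sketch. It does not use BioPerl; the %parent_of store and the add_lineage, lineage_of and lca helpers are hypothetical names chosen for the example, and it assumes taxon names are unique, which real taxonomies do not guarantee.

```perl
use strict;
use warnings;

# Hypothetical in-memory "database": child name => parent name.
my %parent_of;

# Register a species-first lineage, e.g. ('Homo sapiens', 'Homo', ...).
sub add_lineage {
    my @names = @_;
    for my $i (0 .. $#names - 1) {
        $parent_of{ $names[$i] } = $names[ $i + 1 ];
    }
    return;
}

# Names from the given taxon up to the root, inclusive.
# The %seen guard stops the walk if the stored data contains a cycle.
sub lineage_of {
    my ($name) = @_;
    my (@lineage, %seen);
    while (defined $name && !$seen{$name}++) {
        push @lineage, $name;
        $name = $parent_of{$name};
    }
    return @lineage;
}

# Most recent common ancestor of two taxa, or undef if they share none.
sub lca {
    my ($first, $second) = @_;
    my %in_first = map { $_ => 1 } lineage_of($first);
    for my $name (lineage_of($second)) {
        return $name if $in_first{$name};
    }
    return;
}

add_lineage('Homo sapiens', 'Homo', 'Mammalia', 'Eukaryota');
add_lineage('Mus musculus', 'Mus', 'Mammalia', 'Eukaryota');

my $ancestor = lca('Homo sapiens', 'Mus musculus');
print defined $ancestor ? $ancestor : 'none', "\n";
```

The lca() call does not care how %parent_of was filled, so the same call would work whether the parent links came from a flat-file lineage or from a full taxonomy dump; that is the property the thread treats as the litmus test.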
From maximilianh at gmail.com Thu Jul 27 04:56:44 2006 From: maximilianh at gmail.com (Maximilian Haeussler) Date: Thu, 27 Jul 2006 10:56:44 +0200 Subject: [Bioperl-l] TRANSFAC matrices, open acces Message-ID: <76f031ae0607270156lde64ff9m590257d98899fe64@mail.gmail.com> Actually, the fact that the transfac matrices are belonging to a company is quite inconvenient for biologists and bioinformatics analyses working in this field. There are some projects to annotate cis-sequences in regular intervals by volunteers and put the data into the public domain, one of them is the oreganno database http://www.oreganno.org/. Its first annotation jamboree will be held in Gent at the end of this year. If you're interested in cis-sequences, want to meet others that are and are willing to contribute some annotation efforts, don't hestitate to come to gent, it's conveniently placed in the middle of europe and registration costs almost nothing. http://www.dmbr.ugent.be/bioit/contents/regcreative/ One day, hopefully, journals will oblige authors to put their sequences in a common format into genbank but as long as regulation is not seen as an important part of genome annotation, a lot manual annotation will have to be done. cheers max > On 26/07/06, leverdeterre wrote: > > > > i'm happy for helping you > > i'have done a page whitch can interrest you > > http://morissardjerome.free.fr/Data/index.html > > > > there is more information about the 397 matrix file ( in the 3 first line) > > and i'm done all the logo file . > > > > ++ > > -- > > View this message in context: http://www.nabble.com/Re%3A-Accessing-TRANSFAC-matrices-tf1969467.html#a5510746 > > Sent from the Perl - Bioperl-L forum at Nabble.com. 
> >
>
--
Maximilian Haeussler,
CNRS/INRA Gif-sur-Yvette, France
tel: +33 6 12 82 76 16
skype: maximilianhaeussler

From morissardj at gmail.com Thu Jul 27 05:10:19 2006
From: morissardj at gmail.com (leverdeterre)
Date: Thu, 27 Jul 2006 02:10:19 -0700 (PDT)
Subject: [Bioperl-l] Accessing TRANSFAC matrices
In-Reply-To: <5510746.post@talk.nabble.com>
References: <44BEA9FB.1070009@utk.edu> <5510746.post@talk.nabble.com>
Message-ID: <5517747.post@talk.nabble.com>

Sorry, I have removed all this data because it is the property of TRANSFAC.

http://www.gene-regulation.com/pub/databases/transfac/doc/misc.html

The TRANSFAC® database is free for users from non-profit organizations only.
Users from commercial enterprises have to license the TRANSFAC® database and
accompanying programs.
--
View this message in context: http://www.nabble.com/Re%3A-Accessing-TRANSFAC-matrices-tf1969467.html#a5517747
Sent from the Perl - Bioperl-L forum at Nabble.com.

From maximilianh at gmail.com Thu Jul 27 04:44:47 2006
From: maximilianh at gmail.com (Maximilian Haeussler)
Date: Thu, 27 Jul 2006 10:44:47 +0200
Subject: [Bioperl-l] Accessing TRANSFAC matrices
In-Reply-To: <5510746.post@talk.nabble.com>
References: <44BEA9FB.1070009@utk.edu> <5510746.post@talk.nabble.com>
Message-ID: <76f031ae0607270144of6ff9cbtbd9f3045bbc4e6e1@mail.gmail.com>

I'm pretty sure that you are not allowed to distribute these matrices:
http://www.gene-regulation.com/pub/databases/transfac/doc/misc.html

[well...but if you don't care and biobase doesn't complain... actually
anyone can scrape the matrices from the website with wget.]
max On 26/07/06, leverdeterre wrote: > > i'm happy for helping you > i'have done a page whitch can interrest you > http://morissardjerome.free.fr/Data/index.html > > there is more information about the 397 matrix file ( in the 3 first line) > and i'm done all the logo file . > > ++ > -- > View this message in context: http://www.nabble.com/Re%3A-Accessing-TRANSFAC-matrices-tf1969467.html#a5510746 > Sent from the Perl - Bioperl-L forum at Nabble.com. > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > From bix at sendu.me.uk Thu Jul 27 05:55:01 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 27 Jul 2006 10:55:01 +0100 Subject: [Bioperl-l] TRANSFAC matrices, open acces In-Reply-To: <76f031ae0607270156lde64ff9m590257d98899fe64@mail.gmail.com> References: <76f031ae0607270156lde64ff9m590257d98899fe64@mail.gmail.com> Message-ID: <44C88D75.7040102@sendu.me.uk> Maximilian Haeussler wrote: > Actually, the fact that the transfac matrices are belonging to a > company is quite inconvenient for biologists and bioinformatics > analyses working in this field. The public version is adequate though. It would certainly be useful to have Bioperl access to transfac and other regulation databases. I'll probably write some suitable modules if no one beats me to it. From sdavis2 at mail.nih.gov Thu Jul 27 07:43:09 2006 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Thu, 27 Jul 2006 07:43:09 -0400 Subject: [Bioperl-l] TRANSFAC matrices, open acces In-Reply-To: <44C88D75.7040102@sendu.me.uk> Message-ID: On 7/27/06 5:55 AM, "Sendu Bala" wrote: > Maximilian Haeussler wrote: >> Actually, the fact that the transfac matrices are belonging to a >> company is quite inconvenient for biologists and bioinformatics >> analyses working in this field. > > The public version is adequate though. 
It would certainly be useful to > have Bioperl access to transfac and other regulation databases. I'll > probably write some suitable modules if no one beats me to it. I haven't used it in a while, but the TFBS family of modules are, if I recall correctly, bioperl-compatible, though not part of bioperl. In any case, for those who aren't aware, it might be worth a quick look: http://forkhead.cgb.ki.se/TFBS/ Sean From bix at sendu.me.uk Thu Jul 27 08:01:03 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 27 Jul 2006 13:01:03 +0100 Subject: [Bioperl-l] TRANSFAC matrices, open acces In-Reply-To: References: Message-ID: <44C8AAFF.6060100@sendu.me.uk> Sean Davis wrote: > > On 7/27/06 5:55 AM, "Sendu Bala" wrote: > >> Maximilian Haeussler wrote: >>> Actually, the fact that the transfac matrices are belonging to a >>> company is quite inconvenient for biologists and bioinformatics >>> analyses working in this field. > >> The public version is adequate though. It would certainly be useful to >> have Bioperl access to transfac and other regulation databases. I'll >> probably write some suitable modules if no one beats me to it. > > I haven't used it in a while, but the TFBS family of modules are, if I > recall correctly, bioperl-compatible, though not part of bioperl. In any > case, for those who aren't aware, it might be worth a quick look: Yes. It only has online access to Transfac though, and the inheritance and returned objects are TFBS specific so you miss out on whatever goodness there may be in the rest of bioperl. Still, recommended to use if you want programmatic access to Transfac matrices right now. From bernd.web at gmail.com Thu Jul 27 06:14:13 2006 From: bernd.web at gmail.com (Bernd Web) Date: Thu, 27 Jul 2006 12:14:13 +0200 Subject: [Bioperl-l] Structure::IO In-Reply-To: References: <716af09c0607250347w24e0d8dbj674758049bc0a4e2@mail.gmail.com> Message-ID: <716af09c0607270314u4e2b1eb8y6c1b87f5b3abd8e1@mail.gmail.com> Hi Thanks for your notes. 
The text I pasted comes from http://doc.bioperl.org/releases/bioperl-1.5.1/ but indeed Entry.pm (v1.25 2006/07/04) shows a different POD. I am trying to get annotation out of PDB. ID is not a problem, but I would like to have the HEADER and possibly comment fields to a (FastA) description line, but how? Bio::Structure::Entry v.1.25 does not list the annotation method in the POD anymore (due to a missing empty line before =head). $struc->annotation still exists; I can get the keys but not the values with $struc->annotation($struc->seqres) (Can't locate object method "get_Annotations" via package "Bio::PrimarySeq"). (Example script attached). The POD states: annotation: $obj->annotation($seq_obj). So I thought of a PrimarySeq object to pass to annotation. The PrimarySeq object ($struc->seqres) does not contain a description: $struc->seqres->desc is uninitialized. Is it possible to get annotation from header/comments fields with Bio::Structure? Best regards, Bernd On 7/26/06, Brian Osborne wrote: > Bernd, > > I'm not following your question. The POD in the latest Bio::Structure::Entry > shows: > > =head2 chain() > > Title : chain > Usage : @chains = $structure->chain($chain); > Function: Connects a Chain or a list of Chain objects to a > Bio::Structure::Entry. > Returns : List of Bio::Structure::Chain objects > Args : A Chain or a reference to an array of Chain objects > > =cut > > Which is not what you've copied and pasted. What version of Bioperl do you > use? > > Brian O. > > > > On 7/25/06 6:47 AM, "Bernd Web" wrote: > > > Hi, > > > > Does someone have experience with Bio::Structure::IO? > > The example III.9.1 from the bptutorial.pl covers most, but what is e.g. the > > chain() method of Bio::Structure::Entry doing? The POD states: > > > > Title : chain > > Usage : @chains = $structure->chain($chain); > > Function: Connects a (or a list of) Chain objects to a Bio::Structure::Entry. 
> > Returns : list of Bio::Structure::Residue objects > > Args : One Residue or a reference to an array of Residue objects > > > > But in e.g > > my $stream = Bio::Structure::IO->new(-file => $filename, > > -format => 'pdb'); > > while ( my $struc = $stream->next_structure() ) { > > for my $chain ($struc->get_chains) { > > my $chainid = $chain->id; > > my @chains = $struc->chain($chain); > > } > > } > > > > I get Bio::Structure::Chain=HASH(0x9f1ab50). > > > > What is the function of the chain method and how to use it? > > > > Best regards, > > bernd > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > -------------- next part -------------- #!/usr/bin/perl -w use warnings; use strict; use Bio::Structure::IO; my $filename = $ARGV[0]; my $stream = Bio::Structure::IO->new( -file => $filename, -format => 'pdb'); while ( my $struc = $stream->next_structure() ) { print "SEQRES DESC: ", $struc->seqres->desc, "\n"; print join(" ", keys %{$struc->annotation($struc->seqres)}), "\n"; print join(" ", keys %{$struc->annotation()}), "\n"; print join(" ", values %{$struc->annotation()}), "\n"; #(partly) works print join(" ", values %{$struc->annotation($struc->seqres)}), "\n"; #does not work } From bix at sendu.me.uk Thu Jul 27 09:31:54 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 27 Jul 2006 14:31:54 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <86D32350-0412-40AC-8C34-51DCD3064711@gmx.net> References: <000001c6b017$873176a0$15327e82@pyrimidine> <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net> <44C73D21.3010301@sendu.me.uk> <44C7A2C7.2070100@sendu.me.uk> <86D32350-0412-40AC-8C34-51DCD3064711@gmx.net> Message-ID: <44C8C04A.8070504@sendu.me.uk> Hilmar Lapp wrote: > > So overall I think you should just go ahead. 
One last suggestion for discussion: It may be appropriate to rename Bio::Taxonomy::Node to clarify that Node has no particular reliance on or association with Bio::Taxonomy or the other modules in Bio/Taxonomy/. How about calling it Bio::Taxon? It is more obvious what to expect from something called 'Bio::Taxon' when you know that it is the new 'Bio::Species': like Bio::Species but for any taxon. It also makes the class 'top-level' which I think most people are happier using; seems like things in sub-directories are more for advanced users. From hlapp at gmx.net Thu Jul 27 09:44:25 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 27 Jul 2006 09:44:25 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C8C04A.8070504@sendu.me.uk> References: <000001c6b017$873176a0$15327e82@pyrimidine> <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net> <44C73D21.3010301@sendu.me.uk> <44C7A2C7.2070100@sendu.me.uk> <86D32350-0412-40AC-8C34-51DCD3064711@gmx.net> <44C8C04A.8070504@sendu.me.uk> Message-ID: <077A86AB-174D-426E-B65F-0954A72A6019@gmx.net> I don't think the top-level or sub-directory matters at all and I don't want anybody to get used to the notion that that may imply anything (except possibly better thought-out structure for the sub-directory level). For instance RichSeq is what all rich annotation sequence format parsers return, yet it is in a sub-directory. I don't have any real objection to Bio::Taxon though if that's what you'd like to name it - although, what will happen to the Bio::Taxonomy hierarchy then? Phased out? -hilmar On Jul 27, 2006, at 9:31 AM, Sendu Bala wrote: > Hilmar Lapp wrote: >> >> So overall I think you should just go ahead. > > One last suggestion for discussion: > > It may be appropriate to rename Bio::Taxonomy::Node to clarify that > Node has no particular reliance on or association with > Bio::Taxonomy or > the other modules in Bio/Taxonomy/. > > How about calling it Bio::Taxon? 
> > It is more obvious what to expect from something called 'Bio::Taxon' > when you know that it is the new 'Bio::Species': like Bio::Species but > for any taxon. It also makes the class 'top-level' which I think most > people are happier using; seems like things in sub-directories are > more > for advanced users. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From cjfields at uiuc.edu Thu Jul 27 09:48:32 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 27 Jul 2006 08:48:32 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C8662A.3080904@sendu.me.uk> Message-ID: <002a01c6b183$59779880$15327e82@pyrimidine> Sounds good to me; agree with Hilmar's suggestion of 'in_memory' as well, but it's your choice. Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Sendu Bala > Sent: Thursday, July 27, 2006 2:07 AM > To: bioperl-l at bioperl.org > Subject: Re: [Bioperl-l] Bio::*Taxonomy* changes > > Hilmar Lapp wrote: > > The only reservation I have is that you'd have methods on Node that > > don't really operate on the node instance but rather operate on the > > taxonomy (database) behind the scenes. That's what I would have used > > Bio::Taxonomy for, not so much as a container than as a class with > > (conceptually) 'static' methods corresponding to those that are now > > in Node, like get_Lineage_Nodes(). > > Yes, I had the same reservation. But it somehow seemed reasonable for me > to ask a node for its lineage, though I draw the line at having a method > like get_node('rank_name'). 
That's the only thing Bio::Taxonomy would > have been good for, so it's a trade off between some nice methods and > the problems inherent in a node-container class. > > Though, perhaps we almost have the best of both worlds, since the > database is effectively a container without the problems: > $node->db_handle->get_Taxonomy_Node(-rank => 'rank_name', > -lineage_of => $node); ? > > > > So overall I think you should just go ahead. > > Great, will do. From osborne1 at optonline.net Thu Jul 27 09:44:33 2006 From: osborne1 at optonline.net (Brian Osborne) Date: Thu, 27 Jul 2006 09:44:33 -0400 Subject: [Bioperl-l] Structure::IO In-Reply-To: <716af09c0607270314u4e2b1eb8y6c1b87f5b3abd8e1@mail.gmail.com> Message-ID: Bernd, I'll need to take a closer look at the POD but from your description it seems it's wrong, or certainly incomplete. To get the HEADER line you'll do something like: my $stream = Bio::Structure::IO->new(-file => $filename, -format => 'pdb'); my $struc = $stream->next_structure(); my $anncoll = $struc->annotation; my @headers = $anncoll->get_Annotations('header'); This implies that all these top-level annotations are associated with the entry, not with the chains. I don't use Bio::Structure so don't assume this is true for all annotations. There are 2 ways to explore this further. One is to look at t/StructIO.t or other tests; useful examples are frequently found in the tests. The other is to run your script in the debugger: >perl -d pdb.pl 1CAM.pdb By examining the variables your script creates using the "x" command you get to see exactly where strings are stored and what the names of the keys are; this is how I found the HEADER line. Type "h" for the debugger's Help. Brian O. On 7/27/06 6:14 AM, "Bernd Web" wrote: > Hi > > Thanks for your notes. 
The text I pasted comes from > http://doc.bioperl.org/releases/bioperl-1.5.1/ but indeed Entry.pm > (v1.25 2006/07/04) shows a different POD. > > I am trying to get annotation out of PDB. ID is not a problem, but I > would like to have the HEADER and possibly comment fields to a (FastA) > description line, but how? > > Bio::Structure::Entry v.1.25 does not list the annotation method in > the POD anymore (due to a missing empty line before =head). > $struc->annotation still exists; I can get the keys but not the values > with $struc->annotation($struc->seqres) (Can't locate object method > "get_Annotations" via package "Bio::PrimarySeq"). > (Example script attached). > > The POD states: annotation: $obj->annotation($seq_obj). So I thought > of a PrimarySeq object to pass to annotation. > > The PrimarySeq object ($struc->seqres) does not contain a description: > $struc->seqres->desc is uninitialized. > > Is it possible to get annotation from header/comments fields with > Bio::Structure? > > Best regards, > Bernd > > > On 7/26/06, Brian Osborne wrote: >> Bernd, >> >> I'm not following your question. The POD in the latest Bio::Structure::Entry >> shows: >> >> =head2 chain() >> >> Title : chain >> Usage : @chains = $structure->chain($chain); >> Function: Connects a Chain or a list of Chain objects to a >> Bio::Structure::Entry. >> Returns : List of Bio::Structure::Chain objects >> Args : A Chain or a reference to an array of Chain objects >> >> =cut >> >> Which is not what you've copied and pasted. What version of Bioperl do you >> use? >> >> Brian O. >> >> >> >> On 7/25/06 6:47 AM, "Bernd Web" wrote: >> >>> Hi, >>> >>> Does someone have experience with Bio::Structure::IO? >>> The example III.9.1 from the bptutorial.pl covers most, but what is e.g. the >>> chain() method of Bio::Structure::Entry doing? 
The POD states: >>> >>> Title : chain >>> Usage : @chains = $structure->chain($chain); >>> Function: Connects a (or a list of) Chain objects to a >>> Bio::Structure::Entry. >>> Returns : list of Bio::Structure::Residue objects >>> Args : One Residue or a reference to an array of Residue objects >>> >>> But in e.g >>> my $stream = Bio::Structure::IO->new(-file => $filename, >>> -format => 'pdb'); >>> while ( my $struc = $stream->next_structure() ) { >>> for my $chain ($struc->get_chains) { >>> my $chainid = $chain->id; >>> my @chains = $struc->chain($chain); >>> } >>> } >>> >>> I get Bio::Structure::Chain=HASH(0x9f1ab50). >>> >>> What is the function of the chain method and how to use it? >>> >>> Best regards, >>> bernd >>> _______________________________________________ >>> Bioperl-l mailing list >>> Bioperl-l at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> >> >> From aaron.j.mackey at gsk.com Thu Jul 27 08:54:05 2006 From: aaron.j.mackey at gsk.com (aaron.j.mackey at gsk.com) Date: Thu, 27 Jul 2006 08:54:05 -0400 Subject: [Bioperl-l] Installing staden io_lib on windows? In-Reply-To: <1153930767.2632.5.camel@localhost.localdomain> Message-ID: Hi Scott, > In file included from Read.h:43, > from Read.c:40: > ../utils/os.h:346:2: #error Must define SP_BIG_ENDIAN or > SP_LITTLE_ENDIAN in Makefile os.h has a bunch of #ifdef statements that check for platforms, and there isn't one for cygwin (but there is for MinGW) Try running configure with "--CFLAGS=-DSP_LITTLE_ENDIAN" or somesuch Also take a look at the MinGW section of os.h to see if there are others you will likely need (e.g. NOPIPE, NOLOCKF, etc) Alternatively, you may want to just edit the original os.h to duplicate the MinGW section with the appropriate compiler constant for CYGWIN (__CYGWIN__ I'm guessing, but don't really know for sure). 
Good luck, -Aaron From bix at sendu.me.uk Thu Jul 27 10:06:23 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 27 Jul 2006 15:06:23 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <077A86AB-174D-426E-B65F-0954A72A6019@gmx.net> References: <000001c6b017$873176a0$15327e82@pyrimidine> <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net> <44C73D21.3010301@sendu.me.uk> <44C7A2C7.2070100@sendu.me.uk> <86D32350-0412-40AC-8C34-51DCD3064711@gmx.net> <44C8C04A.8070504@sendu.me.uk> <077A86AB-174D-426E-B65F-0954A72A6019@gmx.net> Message-ID: <44C8C85F.2010104@sendu.me.uk> Hilmar Lapp wrote: > I don't think the top-level or sub-directory matters at all and I don't > want anybody to get used to the notion that that may imply anything > (except possibly better thought-out structure for the sub-directory > level). For instance RichSeq is what all rich annotation sequence format > parsers return, yet it is in a sub-directory. Well, I'm not aware that I've ever used a RichSeq ;). But your point is taken. > I don't any real objection to Bio::Taxon though if that's what you'd > like to name it - although, what will happen to the Bio::Taxonomy > hierarchy then? Phased out? At the moment it seems to me that the Bio::Taxonomy modules (excluding Node) aren't really usable. Jason wrote a comment in t/TaxonTree.t which tests Taxon and Tree: ## I am pretty sure this module is going the way of the dodo bird so ## I am not sure how much work to put into fixing the tests/module FactoryI is strange (it isn't intended to work like any other Bioperl factory) and there are no implementers of it, while Taxonomy.pm itself would be redundant after my Node changes and has implementation issues, though it may make more sense to some people. My vote is phase out. What is the actual process involved in renaming a module in Bioperl? 
From hlapp at gmx.net Thu Jul 27 10:29:09 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 27 Jul 2006 10:29:09 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C8C85F.2010104@sendu.me.uk> References: <000001c6b017$873176a0$15327e82@pyrimidine> <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net> <44C73D21.3010301@sendu.me.uk> <44C7A2C7.2070100@sendu.me.uk> <86D32350-0412-40AC-8C34-51DCD3064711@gmx.net> <44C8C04A.8070504@sendu.me.uk> <077A86AB-174D-426E-B65F-0954A72A6019@gmx.net> <44C8C85F.2010104@sendu.me.uk> Message-ID: How do you mean 'process'? You create a new module, and then you deprecate the ones you're phasing out. If possible you rewrite the implementation to use the new module. Not sure this answers your question? -hilmar On Jul 27, 2006, at 10:06 AM, Sendu Bala wrote: > Hilmar Lapp wrote: >> I don't think the top-level or sub-directory matters at all and I >> don't >> want anybody to get used to the notion that that may imply anything >> (except possibly better thought-out structure for the sub-directory >> level). For instance RichSeq is what all rich annotation sequence >> format >> parsers return, yet it is in a sub-directory. > > Well, I'm not aware that I've ever used a RichSeq ;). But your > point is > taken. > > >> I don't any real objection to Bio::Taxon though if that's what you'd >> like to name it - although, what will happen to the Bio::Taxonomy >> hierarchy then? Phased out? > > At the moment it seems to me that the Bio::Taxonomy modules (excluding > Node) aren't really usable. 
Jason wrote a comment in t/TaxonTree.t > which > tests Taxon and Tree: > > ## I am pretty sure this module is going the way of the dodo bird so > ## I am not sure how much work to put into fixing the tests/module > > FactoryI is strange (it isn't intended to work like any other Bioperl > factory) and there are no implementers of it, while Taxonomy.pm itself > would be redundant after my Node changes and has implementation > issues, > though it may make more sense to some people. > > My vote is phase out. > > > What is the actual process involved in renaming a module in Bioperl? > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From cjfields at uiuc.edu Thu Jul 27 10:29:39 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 27 Jul 2006 09:29:39 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <077A86AB-174D-426E-B65F-0954A72A6019@gmx.net> Message-ID: <003101c6b189$17f5d2e0$15327e82@pyrimidine> I'll respond to both here: > Sendu Bala wrote: > > One last suggestion for discussion: > > It may be appropriate is to rename Bio::Taxonomy::Node to clarify that > Node has no particular reliance on or association with Bio::Taxonomy or > the other modules in Bio/Taxonomy/. > > How about calling it Bio::Taxon? > > It is more obvious what to expect from something called 'Bio::Taxon' > when you know that it is the new 'Bio::Species': like Bio::Species but > for any taxon. It also makes the class 'top-level' which I think most > people are happier using; seems like things in sub-directories are more > for advanced users. Hilmar explains the namespace issue with Bioperl more concisely below. 
You should still be able to use a Node in a Taxonomy, but then again you should also be able to use a Taxon in a Taxonomy as well (by definition, a Taxon is part of a Taxonomy as it is a taxonomic unit). The whole "looking at this from a biologist's perspective" thing again... http://en.wikipedia.org/wiki/Taxon BTW, what exactly is Bio::Taxonomy::Taxon used for? Looks like it is used more for building taxonomic trees than anything, so shouldn't it be moved to Bio::Tree::Taxon (that name isn't used)? Then you could use Bio::Taxonomy::Taxon for your purposes. See, the only concern I have with using the name Bio::Taxon is people confusing it with Bio::Taxonomy itself or with Bio::Taxonomy::Taxon. Though I agree that the name makes sense for what you want. > Hilmar Lapp wrote: > > I don't think the top-level or sub-directory matters at all and I > don't want anybody to get used to the notion that that may imply > anything (except possibly better thought-out structure for the > sub-directory level). For instance RichSeq is what all rich annotation > sequence format parsers return, yet it is in a sub-directory. > > I don't have any real objection to Bio::Taxon though if that's what you'd > like to name it - although, what will happen to the Bio::Taxonomy > hierarchy then? Phased out? > > -hilmar I'm not sure how many people out there use Bio::Taxonomy. I think they use the tree-building modules in Bio::Tree more than anything. And there haven't been any panicked users protesting at the gates yet about the many posts for Bio::Taxonomy changes (well, except me, and 'I got better'). 
Chris From cjfields at uiuc.edu Thu Jul 27 10:54:06 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 27 Jul 2006 09:54:06 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C8C85F.2010104@sendu.me.uk> Message-ID: <003201c6b18c$829330e0$15327e82@pyrimidine> > > I don't any real objection to Bio::Taxon though if that's what you'd > > like to name it - although, what will happen to the Bio::Taxonomy > > hierarchy then? Phased out? > > At the moment it seems to me that the Bio::Taxonomy modules (excluding > Node) aren't really usable. Jason wrote a comment in t/TaxonTree.t which > tests Taxon and Tree: > > ## I am pretty sure this module is going the way of the dodo bird so > ## I am not sure how much work to put into fixing the tests/module > > FactoryI is strange (it isn't intended to work like any other Bioperl > factory) and there are no implementers of it, while Taxonomy.pm itself > would be redundant after my Node changes and has implementation issues, > though it may make more sense to some people. > > My vote is phase out. > > > What is the actual process involved in renaming a module in Bioperl? This is how many times the phrase "Bio::Taxonomy" is used in Bioperl in directory Bio (which should catch any namespace usage like Node, etc.): Instances: 2 BP Module : Bio::DB::Taxonomy Instances: 4 BP Module : Bio::DB::Taxonomy::entrez Instances: 7 BP Module : Bio::DB::Taxonomy::flatfile Instances: 1 BP Module : Bio::Expression::Platform Instances: 1 BP Module : Bio::SeqIO::genbank Instances: 22 BP Module : Bio::Taxonomy Instances: 8 BP Module : Bio::Taxonomy::FactoryI Instances: 17 BP Module : Bio::Taxonomy::Node Instances: 20 BP Module : Bio::Taxonomy::Taxon Instances: 39 BP Module : Bio::Taxonomy::Tree Hmm, not much. Almost all hits are within Bio::DB::taxonomy or Bio::Taxonomy. The SeqIO::genbank was my change BTW; just haven't tossed the code yet. 
Therefore, the only one left that would be affected (outside of Bio::Taxonomy and Bio::DB::Taxonomy) is Allen Day's Bio::Expression::Platform class, which uses Bio::DB::Taxonomy::entrez to grab Nodes; that would just be changed over to whatever class you plan on using. And that class hasn't been documented at all outside the methods. Furthermore, judging by the mail list archives the Bio::Taxonomy modules had very little usage outside of Node. Jason mentioned on an old post that he could never get Bio::Taxonomy::Taxon/Tree to work and that Dan Kortschak had moved (Dan's last post was in 2003). Hence the test file comments. And you make a good point with Bio::Taxonomy::FactoryI. I agree, if the modules haven't served a useful purpose they should be phased out. Chris From cjfields at uiuc.edu Thu Jul 27 11:15:25 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 27 Jul 2006 10:15:25 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: Message-ID: <003301c6b18f$7d114000$15327e82@pyrimidine> Wow, we're doing a little bioperl spring cleaning here! I agree with Hilmar: create a new module (Bio::Taxon), which claims the namespace, and deprecate the old ones. How 'broken', exactly, is Bio::Taxonomy? The idea behind it seems just (container for Nodes) but maybe it should just be reconfigured, and all the classes in directory Bio/Taxonomy deprecated. Or should we start from scratch completely? Don't know if it has been attempted but it would be nice to have a way for building taxonomic trees from Node/Taxon information using a Taxonomy-like container object. I like the way NCBI does something along these lines with BLAST output now. BTW, thanks guys for a rousing discussion! 
Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Hilmar Lapp > Sent: Thursday, July 27, 2006 9:29 AM > To: Sendu Bala > Cc: bioperl-l at lists.open-bio.org > Subject: Re: [Bioperl-l] Bio::*Taxonomy* changes > > How do you mean 'process'? You create a new module, and then you > deprecate the ones you're phasing out. If possible you rewrite the > implementation to use the new module. > > Not sure this answers your question? > > -hilmar > > On Jul 27, 2006, at 10:06 AM, Sendu Bala wrote: > > > Hilmar Lapp wrote: > >> I don't think the top-level or sub-directory matters at all and I > >> don't > >> want anybody to get used to the notion that that may imply anything > >> (except possibly better thought-out structure for the sub-directory > >> level). For instance RichSeq is what all rich annotation sequence > >> format > >> parsers return, yet it is in a sub-directory. > > > > Well, I'm not aware that I've ever used a RichSeq ;). But your > > point is > > taken. > > > > > >> I don't any real objection to Bio::Taxon though if that's what you'd > >> like to name it - although, what will happen to the Bio::Taxonomy > >> hierarchy then? Phased out? > > > > At the moment it seems to me that the Bio::Taxonomy modules (excluding > > Node) aren't really usable. Jason wrote a comment in t/TaxonTree.t > > which > > tests Taxon and Tree: > > > > ## I am pretty sure this module is going the way of the dodo bird so > > ## I am not sure how much work to put into fixing the tests/module > > > > FactoryI is strange (it isn't intended to work like any other Bioperl > > factory) and there are no implementers of it, while Taxonomy.pm itself > > would be redundant after my Node changes and has implementation > > issues, > > though it may make more sense to some people. > > > > My vote is phase out. > > > > > > What is the actual process involved in renaming a module in Bioperl? 
> > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From hlapp at gmx.net Thu Jul 27 11:29:04 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 27 Jul 2006 11:29:04 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <003101c6b189$17f5d2e0$15327e82@pyrimidine> References: <003101c6b189$17f5d2e0$15327e82@pyrimidine> Message-ID: On Jul 27, 2006, at 10:29 AM, Chris Fields wrote: > See, the only concern I have with using the name Bio::Taxon is people > confusing it with Bio::Taxonomy itself or with > Bio::Taxonomy::Taxon. Though > I agree that the name makes sense for what you want. I don't think Bio::Taxonomy is used a lot in earnest if at all, so it you even test the waters by deprecating them right away by putting warning statements there and see whether anybody complains about the cluttered terminal screens. If this goes into snapshot releases and release candidates leading up to 1.6 then they may be phased out right away. Unless anybody on the list has strong objections? Anybody using Bio::Taxonomy? 
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From skirov at utk.edu Thu Jul 27 09:57:19 2006 From: skirov at utk.edu (skirov) Date: Thu, 27 Jul 2006 09:57:19 -0400 Subject: [Bioperl-l] TRANSFAC matrices, open acces Message-ID: <44E2E794@webmail.utk.edu> Just a quick note- Bio::Matrix::PSM::IO can parse matrix.dat if you can get it- and as far as I can tell this is not easy- you have to contact the company to get access and it is not clear what their conditions are. This is the reason I have decided not to maintain the transfac parser. Stefan >===== Original Message From Sendu Bala ===== >Sean Davis wrote: >> >> On 7/27/06 5:55 AM, "Sendu Bala" wrote: >> >>> Maximilian Haeussler wrote: >>> Actually, the fact that the transfac matrices are belonging to a >>> company is quite inconvenient for biologists and bioinformatics >>> analyses working in this field. > > >>> The public version is adequate though. It would certainly be useful to >>> have Bioperl access to transfac and other regulation databases. I'll >>> probably write some suitable modules if no one beats me to it. >> >> I haven't used it in a while, but the TFBS family of modules are, if I >> recall correctly, bioperl-compatible, though not part of bioperl. In any >> case, for those who aren't aware, it might be worth a quick look: > >Yes. It only has online access to Transfac though, and the inheritance >and returned objects are TFBS specific so you miss out on whatever >goodness there may be in the rest of bioperl. > >Still, recommended to use if you want programmatic access to Transfac >matrices right now. 
>_______________________________________________ >Bioperl-l mailing list >Bioperl-l at lists.open-bio.org >http://lists.open-bio.org/mailman/listinfo/bioperl-l From bix at sendu.me.uk Thu Jul 27 12:30:38 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 27 Jul 2006 17:30:38 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: References: <000001c6b017$873176a0$15327e82@pyrimidine> <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net> <44C73D21.3010301@sendu.me.uk> <44C7A2C7.2070100@sendu.me.uk> <86D32350-0412-40AC-8C34-51DCD3064711@gmx.net> <44C8C04A.8070504@sendu.me.uk> <077A86AB-174D-426E-B65F-0954A72A6019@gmx.net> <44C8C85F.2010104@sendu.me.uk> Message-ID: <44C8EA2E.8030000@sendu.me.uk> Hilmar Lapp wrote: > How do you mean 'process'? You create a new module, and then you > deprecate the ones you're phasing out. If possible you rewrite the > implementation to use the new module. > > Not sure this answers your question? I guess. I was thinking of just making Bio::Taxonomy::Node isa Bio::Taxon and then simply removing all the code from Node, leaving just some perldoc that said it had been renamed? Or should there be some methods that issue a warning and then call SUPER? From hlapp at gmx.net Thu Jul 27 12:38:30 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 27 Jul 2006 12:38:30 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C8EA2E.8030000@sendu.me.uk> References: <000001c6b017$873176a0$15327e82@pyrimidine> <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net> <44C73D21.3010301@sendu.me.uk> <44C7A2C7.2070100@sendu.me.uk> <86D32350-0412-40AC-8C34-51DCD3064711@gmx.net> <44C8C04A.8070504@sendu.me.uk> <077A86AB-174D-426E-B65F-0954A72A6019@gmx.net> <44C8C85F.2010104@sendu.me.uk> <44C8EA2E.8030000@sendu.me.uk> Message-ID: <881BB312-DF1B-43F6-A38D-8B543738244F@gmx.net> That's what I said could be possible here on much shorter notice that we'd do usually due to the low usage. 
Eventually deprecated modules should also be physically removed, so you want to prepare for that. (removing a module breaks scripts that used it; issuing a warning alerts to this being forthcoming.) -hilmar On Jul 27, 2006, at 12:30 PM, Sendu Bala wrote: > Hilmar Lapp wrote: >> How do you mean 'process'? You create a new module, and then you >> deprecate the ones you're phasing out. If possible you rewrite the >> implementation to use the new module. >> >> Not sure this answers your question? > > I guess. I was thinking of just making Bio::Taxonomy::Node isa > Bio::Taxon and then simply removing all the code from Node, leaving > just > some perldoc that said it had been renamed? > > Or should there be some methods that issue a warning and then call > SUPER? > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From sanges at biogem.it Thu Jul 27 12:37:05 2006 From: sanges at biogem.it (Remo Sanges) Date: Thu, 27 Jul 2006 18:37:05 +0200 Subject: [Bioperl-l] TRANSFAC matrices, open acces In-Reply-To: <44E2E794@webmail.utk.edu> References: <44E2E794@webmail.utk.edu> Message-ID: <44C8EBB1.5070709@biogem.it> Here is also my 2c on TFBS: skirov wrote: >Just a quick note- Bio::Matrix::PSM::IO can parse matrix.dat if you can get >it- and as far as I can tell this is not easy- you have to contact the company >to get access and it is not clear what their conditions are. This is the >reason I have decided not to maintain the transfac parser. 
>Stefan > > >>===== Original Message From Sendu Bala ===== >>Sean Davis wrote: >> >> >>>On 7/27/06 5:55 AM, "Sendu Bala" wrote: >>> >>> >>> >>>>Maximilian Haeussler wrote: >>>>Actually, the fact that the transfac matrices are belonging to a >>>>company is quite inconvenient for biologists and bioinformatics >>>>analyses working in this field. >>>> >>>> >>>>The public version is adequate though. It would certainly be useful to >>>>have Bioperl access to transfac and other regulation databases. I'll >>>>probably write some suitable modules if no one beats me to it. >>>> >>>> >>>I haven't used it in a while, but the TFBS family of modules are, if I >>>recall correctly, bioperl-compatible, though not part of bioperl. In any >>>case, for those who aren't aware, it might be worth a quick look: >>> >>> >>Yes. It only has online access to Transfac though >> TFBS::DB::LocalTRANSFAC - can parse local transfac matrices (matrix.dat) >>, and the inheritance >>and returned objects are TFBS specific so you miss out on whatever >>goodness there may be in the rest of bioperl. >> >> >> In TFBS there are modules which inherithed from Bio::SeqFeature::Generic and Bio::Root::Root. See for example TFBS::Site. So probably it is not so bad.... Here is the link cutted from the Sean's e-mail: http://forkhead.cgb.ki.se/TFBS/ HTH Remo From osborne1 at optonline.net Thu Jul 27 12:49:26 2006 From: osborne1 at optonline.net (Brian Osborne) Date: Thu, 27 Jul 2006 12:49:26 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <881BB312-DF1B-43F6-A38D-8B543738244F@gmx.net> Message-ID: Sendu, And add the module or modules names to the DEPRECATED file. Brian O. On 7/27/06 12:38 PM, "Hilmar Lapp" wrote: > Eventually deprecated modules should also be physically removed From MEC at stowers-institute.org Thu Jul 27 13:28:03 2006 From: MEC at stowers-institute.org (Cook, Malcolm) Date: Thu, 27 Jul 2006 12:28:03 -0500 Subject: [Bioperl-l] TRANSFAC matrices, open acces Message-ID: re: >Yes. 
It only has online access to Transfac though, not quite true. It does support access to local transfac data files if you have them. --Malcolm From cjfields at uiuc.edu Thu Jul 27 13:45:28 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 27 Jul 2006 12:45:28 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <881BB312-DF1B-43F6-A38D-8B543738244F@gmx.net> Message-ID: <000301c6b1a4$73ef3fd0$15327e82@pyrimidine> Makes sense to me. From my previous post the only bioperl class that used it was Bio::Expression::Platform, and that only for grabbing Node objects from Bio::DB::Taxonomy::entrez (so, change it to use whatever object Bio::DB::Taxonomy returns). I couldn't find anything else in the core outside of the Bio::DB::Taxonomy and Bio::Taxonomy classes and tests that use them. There aren't even any scripts or examples. If you implement Bio::Root::RootI (and pretty much everything does), you could use warn() or deprecated() for these easily: ... Title : warn Usage : $object->warn("Warning message"); Function: Places a warning. What happens now is down to the verbosity of the object (value of $obj->verbose) verbosity 0 or not set => small warning verbosity -1 => no warning verbosity 1 => warning with stack trace verbosity 2 => converts warnings into throw ... Title : deprecated Usage : $obj->deprecated("Method X is deprecated"); Function: Prints a message about deprecation unless verbose is < 0 (which means be quiet) Returns : none Args : Message string to print to STDERR ... Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Hilmar Lapp > Sent: Thursday, July 27, 2006 11:39 AM > To: Sendu Bala > Cc: bioperl-l at bioperl.org > Subject: Re: [Bioperl-l] Bio::*Taxonomy* changes > > That's what I said could be possible here on much shorter notice that > we'd do usually due to the low usage. 
> > Eventually deprecated modules should also be physically removed, so > you want to prepare for that. (removing a module breaks scripts that > used it; issuing a warning alerts to this being forthcoming.) > > -hilmar > > On Jul 27, 2006, at 12:30 PM, Sendu Bala wrote: > > > Hilmar Lapp wrote: > >> How do you mean 'process'? You create a new module, and then you > >> deprecate the ones you're phasing out. If possible you rewrite the > >> implementation to use the new module. > >> > >> Not sure this answers your question? > > > > I guess. I was thinking of just making Bio::Taxonomy::Node isa > > Bio::Taxon and then simply removing all the code from Node, leaving > > just > > some perldoc that said it had been renamed? > > > > Or should there be some methods that issue a warning and then call > > SUPER? > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From bix at sendu.me.uk Thu Jul 27 15:30:47 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 27 Jul 2006 20:30:47 +0100 Subject: [Bioperl-l] TRANSFAC matrices, open acces In-Reply-To: References: Message-ID: <44C91467.5050001@sendu.me.uk> Cook, Malcolm wrote: > re: > >> Yes. It only has online access to Transfac though, > > not quite true. It does support access to local transfac data files if > you have them. And to local Jaspar files. I wasn't clear, but I meant for the 'only' to modify 'online'. Ie. it doesn't give you access to any other online databases. 
From bix at sendu.me.uk Thu Jul 27 15:55:32 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 27 Jul 2006 20:55:32 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <003101c6b189$17f5d2e0$15327e82@pyrimidine> References: <003101c6b189$17f5d2e0$15327e82@pyrimidine> Message-ID: <44C91A34.1040406@sendu.me.uk> Chris Fields wrote: > BTW, what exactly is Bio::Taxonomy::Taxon used for? Looks like it is used > more for building taxonomic trees that anything, so shouldn't it be moved to > Bio::Tree:Taxon (that name isn't used)? Then you could use > Bio::Taxonomy::Taxon for your purposes. It actually seemed more like a possible replacement for Bio::Taxonomy::Node. Thanks to its Tree::NodeI implementation it has the big advantage over Bio::Taxonomy::Node that you access the lineage without a database. From the programmer's point of view it seemed more natural, being able to create nodes and add descendants. I decided against it because I felt the added complexity wasn't really worth it, and Bio::Taxonomy::Node had some of its own advantages. If this turns out to be the wrong choice, my Bio::Taxon can always be reimplemented to also implement Tree::NodeI in the future. > See, the only concern I have with using the name Bio::Taxon is people > confusing it with Bio::Taxonomy itself or with Bio::Taxonomy::Taxon. Though > I agree that the name makes sense for what you want. I don't think you'd confuse it directly with Bio::Taxonomy, but you could certainly waste some time thinking it was appropriate to stick Bio::Taxon objects in Bio::Taxonomy objects - theoretically it might work but ultimately you'd just be wasting your time. I'll make sure the docs in the Taxonomy modules steer people in the right direction. 
From bix at sendu.me.uk Thu Jul 27 16:18:06 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 27 Jul 2006 21:18:06 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <003301c6b18f$7d114000$15327e82@pyrimidine> References: <003301c6b18f$7d114000$15327e82@pyrimidine> Message-ID: <44C91F7E.2040000@sendu.me.uk> Chris Fields wrote: > How 'broken', exactly, is Bio::Taxonomy? Its certainly usable as-is, but there are some gotchas. # It has an acknowledged weakness of not coping with multiple ranks of the same name (notably 'no rank'). # You can't have 2 nodes with the same rank (so can only build a single lineage, not a whole menagerie). # You must supply a list of all your rank names correctly ordered before you can add any nodes (or trust that the default list is satisfactory - it won't be if you have just a single 'no rank'). # You simply don't need it if you have Bio::Taxonomy::Nodes with db_handle set, or Bio::Taxonomy::Taxons. In my opinion, the burden is just too great for this ever to have been a 'fun' module to use. It was only required so that people could manually create their own Bio::Taxonomy::Nodes and form a lineage without a database. > Don't know if it has been attempted but it would be nice to have a way for > building taxonomic trees from Node/Taxon information using a Taxonomy-like > container object. I like the way NCBI does something along these lines with > BLAST output now. Not really sure what you mean. I don't think you'd require a container object to do any particular task. Can you clarify? From clarsen at vecna.com Thu Jul 27 15:59:50 2006 From: clarsen at vecna.com (Chris Larsen) Date: Thu, 27 Jul 2006 15:59:50 -0400 (EDT) Subject: [Bioperl-l] Working code Message-ID: <7263.70.106.6.26.1154030390.squirrel@mail.vecna.com> Hey gang, You said you wanted to see working code: ------------------------------------------- > ...And I'd rather see some code or API examples than > extensive elaborations. 
> > -hilmar Hilmar's right; working code does speaks louder than words. -Chris -------------------------------------------- So here's some: http://www.biohealthbase.org/GSearch/ We've just released the v2 of Bioinformatic Resource Center's website "Biohealthbase". Earlier I pointed out BHB v1 to the list; then we had implemented GBrowse on top of GUS 3. There was some data processing using BioPerl packages to generate well-formatted data for the Oracle instance. But new micro-organisms are added now, so we have Francisella, Mycobacterium, Microsporidia, Giardia, and Influenza. They are under GUS 3.5. We also now have some web-capable BLASTing under there (Please no spam!) And multiple sequence alignments and dendrograms are to come, using MUSCLE instead of ClustalW. Currently, a Bioperl I/O module accepts the output from BLAST and writes up some HTML, then our web app on another server displays the URL content. But we will improve on this model in v3 for MSA et al. since the requirements are different for multiple vs single alignments. Thanks again for the open source! Chris ---------------------------- Christopher Larsen, Ph.D. Senior Scientist Vecna Technologies, Inc. 5004 Lehigh Rd College Park, MD 20740-3821 e: clarsen at vecna.com ph: (240) 737-1625 f: (301) 699-3180 From skirov at utk.edu Thu Jul 27 09:56:45 2006 From: skirov at utk.edu (skirov) Date: Thu, 27 Jul 2006 09:56:45 -0400 Subject: [Bioperl-l] TRANSFAC matrices, open acces Message-ID: <44E2E5B9@webmail.utk.edu> Just a quick note- Bio::Matrix::PSM::IO can parse matrix.dat if you can get it- and as far as I can tell this is not easy- you have to contact the company to get access and it is not clear what their conditions are. 
Stefan >===== Original Message From Sendu Bala ===== >Sean Davis wrote: >> >> On 7/27/06 5:55 AM, "Sendu Bala" wrote: >> >>> Maximilian Haeussler wrote: >>> Actually, the fact that the transfac matrices are belonging to a >>> company is quite inconvenient for biologists and bioinformatics >>> analyses working in this field. > > >>> The public version is adequate though. It would certainly be useful to >>> have Bioperl access to transfac and other regulation databases. I'll >>> probably write some suitable modules if no one beats me to it. >> >> I haven't used it in a while, but the TFBS family of modules are, if I >> recall correctly, bioperl-compatible, though not part of bioperl. In any >> case, for those who aren't aware, it might be worth a quick look: > >Yes. It only has online access to Transfac though, and the inheritance >and returned objects are TFBS specific so you miss out on whatever >goodness there may be in the rest of bioperl. > >Still, recommended to use if you want programmatic access to Transfac >matrices right now. >_______________________________________________ >Bioperl-l mailing list >Bioperl-l at lists.open-bio.org >http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Thu Jul 27 21:19:51 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 27 Jul 2006 20:19:51 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C91F7E.2040000@sendu.me.uk> References: <003301c6b18f$7d114000$15327e82@pyrimidine> <44C91F7E.2040000@sendu.me.uk> Message-ID: <3DAB9065-3633-4D50-B97E-41F2BB58C6EB@uiuc.edu> ... >> Don't know if it has been attempted but it would be nice to have a >> way for >> building taxonomic trees from Node/Taxon information using a >> Taxonomy-like >> container object. I like the way NCBI does something along these >> lines with >> BLAST output now. > > Not really sure what you mean. I don't think you'd require a container > object to do any particular task. Can you clarify? 
Let's say you start with a list of sequence IDs from a BLAST report and wanted to find the taxonomic relationship between the BLAST hits. NCBI does something similar to this in their last few BLAST output revisions from the CGI interface; they have a link which contains the organisms ranked taxonomically in various ways. There is probably a Bioperl-specific way of doing this but I haven't spent the effort yet working out how. No big deal, really. I have PLENTY else to work on. Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From R.Birnie at leeds.ac.uk Fri Jul 28 05:39:34 2006 From: R.Birnie at leeds.ac.uk (Richard Birnie) Date: Fri, 28 Jul 2006 10:39:34 +0100 Subject: [Bioperl-l] whole genome annotation Message-ID: Hello all, I'm just trying to familiarise myself with BioPerl and I'm a little overwhelmed by the sheer volume of information available on the wiki. I'm hoping someone can point in the right direction through the labyrinth. This may become a little longwinded but I'll try and get all the annoying newbie questions out of the way in one go. Let me try and explain what I'm aiming for. I have some CGH data downloaded from the Progenetix database (http://www.progenetix.de/~pgscripts/progenetix/Aboutprogenetix.html), this data is simplified to record simply gain/loss/amplification of whole chromosome bands at 862 band resolution to facilitate the combination of data from multiple different studies. What I'd like to be able to do is download a copy of the human genome sequence with annotation describing the locations of chromosome bands and preferably of known genes. I then want to be able to manipulate the genome data based on the CGH data to mimic deletions. 
The ultimate goal of this is to be able to feed the manipulated genome data into a program (metashark) that predicts the structure of metabolic networks based on genome annotation compared to a reference genome, in this case a complete 'normal' human genome, and see what effect that has on the metabolic pathways. I appreciate that is a bit vague but that's sort of my problem, I'm not a bioinformatician really so I'm not sure of the details of what I want. I just happen to have a question to answer and bioperl seems the way to go (for this project and more generally). I've started looking at the HOWTOs and read the main bioperl tutorial. I also looked at the CGL comparative genomics library but I haven't penetrated far into that yet. I'm ok with basic perl although not much object oriented stuff. I don't really have much experience with handling sequence data on a whole genome scale either, a few genbank files for my favourite genes is fine but I need some guidance to work on this scale. What I'm looking for is someone to give me a start. I'd greatly appreciate it if someone could spell out the general steps for downloading a complete copy of the human genome and its annotations (if this is even a feasible approach) and how to put it all together. Not actual code, just the general concept for each step and which tools from the bioperl set would be most appropriate for each step so that I can focus what I need to read about, even a little pseudo-code if I'm lucky. If I can get the genome data downloaded and set up properly I'll work out how to apply the CGH data to it myself. If example code for what I'm trying to describe is included somewhere, great; could someone point me to where? Thanks for your patience.
best regards, Richard Dr Richard Birnie Scientific Officer Section of Pathology and Tumour Biology Welcome Brenner Building, LIMM St James University Hospital Beckett St, Leeds, LS9 7TF Tel:0113 3438624 e-mail: r.birnie at leeds.ac.uk From sdavis2 at mail.nih.gov Fri Jul 28 07:59:17 2006 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Fri, 28 Jul 2006 07:59:17 -0400 Subject: [Bioperl-l] whole genome annotation In-Reply-To: References: Message-ID: <44C9FC15.3040503@mail.nih.gov> Richard Birnie wrote: > Hello all, > > I'm just trying to familiarise myself with BioPerl and I'm a little overwhelmed by the sheer volume of information available on the wiki. I'm hoping someone can point in the right direction through the labyrinth. This may become a little longwinded but I'll try and get all the annoying newbie questions out of the way in one go. > > Let me try and explain what I'm aiming for. I have some CGH data downloaded from the Progenetix database (http://www.progenetix.de/~pgscripts/progenetix/Aboutprogenetix.html), this data is simplified to record simply gain/loss/amplification of whole chromosome bands at 862 band resolution to facilitate the combination of data from multiple different studies. > > What I'd like to be able to do is download a copy of the human genome sequence with annotation describing the locations of chromosome bands and preferably of known genes. I then want to be able to manipulate the genome data based on the CGH data to mimic deletions. The ultimate goal of this is to be able to feed the manipulated genome data into a program (metashark) that predicts the structure of metabolic networks based on genome annotation compared to a reference genome, in this case a complete 'normal' human genome and see what effect that has on the metabolic pathways. > > I appreciate that is a bit vague but thats sort of my problem, I'm not a bioinformatician really so I'm not sue of the details of what I want. 
I just happen to have an question to answer and bioperl seems the way to go (for this project and more generally). I've started looking at the HOWTOs and read the main bioperl tutorial. I also looked at the CGL comparative genomics library but I haven't penetrated far into that yet. I'm ok with basic perl although not much object oriented stuff. I don't really have much experience with handling sequence data on a whole genome scale either, a few genbank files for my favourite genes is fine but I need some guidance to work on this scale. > > What I'm looking for is someone to give me a start. I'd greatly appreciate it if someone could spell out the general steps for downloading a complete copy of the human genome and its annotations (if this is even a feasible approach) and how to put it all together. Not actual code just the general concept for each step and which tools from the bioperl set would be most appropriate for each step so that I can focus what I need to read about, even a little pseudo-code if I'm lucky. If I can get the genome data downloaded and setup properly I'll work out how to apply the CGH data to it myself. > > If example code for what I'm trying to describe is included somewhere, great could someone point to where. Hi, Richard. Bioperl is good for many things, but for simply grabbing all the locations of human genes in the genome and chromosome band locations, I wouldn't use bioperl. It sounds to me like you are interested in getting the genes associated with each chromosomal band? If so, just download the cytoband.txt and refFlat.txt files from the UCSC genome browser site. cytoband.txt contains the base pair locations for each of the cytobands. refFlat.txt contains the base pair locations of "refseq" genes. It is then simply a matter of finding overlapping regions (genes with cytobands) to determine which genes are in which cytobands. Since the files are tab-delimited text, they are very easy to work with (in perl, excel, python, ...). 
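A minimal sketch of the overlap step described above, in plain Python rather than BioPerl. It assumes the UCSC column layouts (cytoband file: chrom, start, end, band name, stain; refFlat file: gene name, accession, chrom, strand, txStart, txEnd, ...) and 0-based, half-open coordinates; both assumptions should be checked against the files actually downloaded. The inline data and the names `read_bands` and `genes_by_band` are invented for illustration.

```python
import csv
import io
from collections import defaultdict


def read_bands(handle):
    """Return {chrom: [(start, end, band_name), ...]} from cytoband rows."""
    bands = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        chrom, start, end, name = row[0], int(row[1]), int(row[2]), row[3]
        bands[chrom].append((start, end, name))
    return bands


def genes_by_band(band_handle, gene_handle):
    """Map 'chrom:band' to the sorted gene names whose transcript overlaps it."""
    bands = read_bands(band_handle)
    hits = defaultdict(set)
    for row in csv.reader(gene_handle, delimiter="\t"):
        # Assumed refFlat layout: gene name in column 0, chrom in column 2,
        # transcript start and end in columns 4 and 5.
        gene, chrom = row[0], row[2]
        tx_start, tx_end = int(row[4]), int(row[5])
        for start, end, name in bands.get(chrom, []):
            # Half-open intervals overlap when each starts before the other ends.
            if tx_start < end and start < tx_end:
                hits[f"{chrom}:{name}"].add(gene)
    return {band: sorted(genes) for band, genes in hits.items()}


# Tiny made-up inputs standing in for the two downloaded files.
BANDS = "chr8\t0\t1000\tp23.3\tgneg\nchr8\t1000\t2500\tp23.2\tgpos75\n"
GENES = (
    "GENEA\tNM_000001\tchr8\t+\t100\t400\t150\t350\t1\t100,\t400,\n"
    "GENEB\tNM_000002\tchr8\t-\t900\t1200\t950\t1150\t1\t900,\t1200,\n"
)
result = genes_by_band(io.StringIO(BANDS), io.StringIO(GENES))
print(result)
```

A gene that straddles a band boundary is reported under both bands. The nested loop is fine for a few hundred bands per genome; a sorted interval search would only matter for much larger inputs.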
Don't get me wrong--I really appreciate the power of bioperl, but in this case, your task lends itself to a simpler (and MUCH) faster approach. Sean From R.Birnie at leeds.ac.uk Fri Jul 28 08:21:46 2006 From: R.Birnie at leeds.ac.uk (Richard Birnie) Date: Fri, 28 Jul 2006 13:21:46 +0100 Subject: [Bioperl-l] whole genome annotation References: <44C9FC15.3040503@mail.nih.gov> Message-ID: -----Original Message----- From: Sean Davis [mailto:sdavis2 at mail.nih.gov] Sent: Fri 7/28/2006 12:59 To: Richard Birnie Cc: bioperl-l at lists.open-bio.org Subject: Re: [Bioperl-l] whole genome annotation Richard Birnie wrote: > Hello all, > > I'm just trying to familiarise myself with BioPerl and I'm a little overwhelmed by the sheer volume of information available on the wiki. I'm hoping someone can point in the right direction through the labyrinth. This may become a little longwinded but I'll try and get all the annoying newbie questions out of the way in one go. > > Let me try and explain what I'm aiming for. I have some CGH data downloaded from the Progenetix database (http://www.progenetix.de/~pgscripts/progenetix/Aboutprogenetix.html), this data is simplified to record simply gain/loss/amplification of whole chromosome bands at 862 band resolution to facilitate the combination of data from multiple different studies. > > What I'd like to be able to do is download a copy of the human genome sequence with annotation describing the locations of chromosome bands and preferably of known genes. I then want to be able to manipulate the genome data based on the CGH data to mimic deletions. The ultimate goal of this is to be able to feed the manipulated genome data into a program (metashark) that predicts the structure of metabolic networks based on genome annotation compared to a reference genome, in this case a complete 'normal' human genome and see what effect that has on the metabolic pathways. 
> > I appreciate that is a bit vague but thats sort of my problem, I'm not a bioinformatician really so I'm not sue of the details of what I want. I just happen to have an question to answer and bioperl seems the way to go (for this project and more generally). I've started looking at the HOWTOs and read the main bioperl tutorial. I also looked at the CGL comparative genomics library but I haven't penetrated far into that yet. I'm ok with basic perl although not much object oriented stuff. I don't really have much experience with handling sequence data on a whole genome scale either, a few genbank files for my favourite genes is fine but I need some guidance to work on this scale. > > What I'm looking for is someone to give me a start. I'd greatly appreciate it if someone could spell out the general steps for downloading a complete copy of the human genome and its annotations (if this is even a feasible approach) and how to put it all together. Not actual code just the general concept for each step and which tools from the bioperl set would be most appropriate for each step so that I can focus what I need to read about, even a little pseudo-code if I'm lucky. If I can get the genome data downloaded and setup properly I'll work out how to apply the CGH data to it myself. > > If example code for what I'm trying to describe is included somewhere, great could someone point to where. Hi, Richard. Bioperl is good for many things, but for simply grabbing all the locations of human genes in the genome and chromosome band locations, I wouldn't use bioperl. It sounds to me like you are interested in getting the genes associated with each chromosomal band? If so, just download the cytoband.txt and refFlat.txt files from the UCSC genome browser site. cytoband.txt contains the base pair locations for each of the cytobands. refFlat.txt contains the base pair locations of "refseq" genes. 
It is then simply a matter of finding overlapping regions (genes with cytobands) to determine which genes are in which cytobands. Since the files are tab-delimited text, they are very easy to work with (in perl, excel, python, ...). Don't get me wrong--I really appreciate the power of bioperl, but in this case, your task lends itself to a simpler (and MUCH) faster approach. Sean Thanks for the response Sean, getting the genes associated with each band is certainly part of what I need and your suggestion will help with that. I did look at the UCSC site but as you know there is such a volume of info on there I didn't really know which files I needed. However my main goal requires slightly more. What I want to be able to do is take the chromosomal band annotation info and compare that against the CGH data I have. From this I'd like to then be able say "OK band 8q13.1 (or whatever) is deleted, so make a copy of the genome with the actual sequence associated with that band removed." Then I could feed both sequences into metashark which predicts the structure of metabolic pathways based on genome annotation and see what effect deleting that region of DNA has on the structure of the metabolic network. Knowing which genes are involved is necessary for identifying what are the important components within the region. Are there tools in Bioperl for making this comparison? It can probably be reduced to a straight comparison of data structures so I may just use regular perl for this part unless there is anything designed for purpose. The thing I was struggling with was how to store and manipulate genomic sequence data in such quantities. Since this morning I've had a better look at the CGL library and associated datastore module, I think I can do it using these tools but I'm having a few dependency issues getting it installed right now. So I'll go back to wrestling with that. 
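The "copy of the genome with that band removed" idea can be sketched in plain Python as below. This is only an illustration under stated assumptions: band coordinates are 0-based and half-open, the chromosome fits in memory as a single string, and the band table, sequence and the name `delete_regions` are made up for the example.

```python
def delete_regions(sequence, regions):
    """Return sequence with the 0-based, half-open (start, end) regions removed."""
    kept = []
    position = 0
    for start, end in sorted(regions):
        if start > position:
            # Keep everything between the previous deletion and this one.
            kept.append(sequence[position:start])
        # max() copes with overlapping or nested regions.
        position = max(position, end)
    kept.append(sequence[position:])
    return "".join(kept)


# Hypothetical band table for one toy chromosome.
BANDS = {"q13.1": (4, 8), "q13.2": (8, 10)}
chromosome = "AAAACCCCGGTT"
deleted_bands = ["q13.1"]

mutant = delete_regions(chromosome, [BANDS[band] for band in deleted_bands])
print(mutant)
```

Any gene annotation used alongside the edited sequence would need its coordinates shifted by the total length deleted upstream of each feature.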
regards, Richard From valiente at lsi.upc.edu Fri Jul 28 08:10:19 2006 From: valiente at lsi.upc.edu (Gabriel Valiente) Date: Fri, 28 Jul 2006 15:10:19 +0300 Subject: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields) In-Reply-To: References: Message-ID: >>> At the moment it seems to me that the Bio::Taxonomy modules >>> (excluding >>> Node) aren't really usable. I've been using Bio::Taxonomy and Bio::DB::Taxonomy a lot, they are very useful modules. Whatever Bio::Taxon or Bio::Taxonomy::Taxon turns out to be, please do keep the Bio::DB::Taxonomy functionality. BTW, does anybody know how to include branch lengths in Bio::DB::Taxonomy? Thanks a lot, Gabriel From y.itan at ucl.ac.uk Fri Jul 28 08:07:32 2006 From: y.itan at ucl.ac.uk (Yuval Itan) Date: Fri, 28 Jul 2006 13:07:32 +0100 Subject: [Bioperl-l] Getting sequences by base pair locations Message-ID: <05a8f85bdc36dc7e370d0935b6394f9d@ucl.ac.uk> Hello all, I was BLATing a few hundred human genes against the chimp genome, and kept the best chimp hits for every human gene. I have the base pair start and end location for every chimp hit, and I need to get the sequence for each of these chimp hits. Here is an example for a few chimp hits bp locations: Start End 142854 144504 154479 155198 153066 167370 163146 163559 I have one chimp genome file (about 3GB) including all chromosomes, but I could also get one file per chromosome if that would make things easier. Does anyone have a script or a link for an interface that can do the job? Thank you very much.
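For readers with the same problem, here is a minimal plain-Python sketch of pulling hit sequences out of a multi-FASTA genome file. It assumes each hit also records which chromosome it is on (the example coordinates above do not say) and that start and end are 1-based and inclusive; raw BLAT PSL output may use a different convention, so check before trusting the slices. The toy genome and the names `read_fasta` and `extract_hits` are invented for illustration.

```python
import io


def read_fasta(handle):
    """Yield (identifier, sequence) pairs from a FASTA file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # The identifier is the first word after '>'.
            name = line[1:].split()[0] if len(line) > 1 else ""
            chunks = []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def extract_hits(handle, hits):
    """Return {(chrom, start, end): sequence} for 1-based inclusive hits."""
    wanted = {}
    for chrom, start, end in hits:
        wanted.setdefault(chrom, []).append((start, end))
    found = {}
    # Only one chromosome is held in memory at a time.
    for chrom, sequence in read_fasta(handle):
        for start, end in wanted.get(chrom, []):
            found[(chrom, start, end)] = sequence[start - 1:end]
    return found


# Toy genome standing in for the real multi-chromosome file.
GENOME = ">chr1 toy\nACGTACGTAC\nGGGTTT\n>chr2\nTTTTCCCC\n"
hits = [("chr1", 3, 6), ("chr1", 9, 12), ("chr2", 5, 8)]
result = extract_hits(io.StringIO(GENOME), hits)
print(result)
```

This reads the whole file once per run. For repeated lookups in a 3GB file, an indexed reader is likely the better choice.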
From hlapp at gmx.net Fri Jul 28 08:59:43 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Fri, 28 Jul 2006 08:59:43 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields) In-Reply-To: References: Message-ID: <233D3060-5CF7-4DF7-8EF6-6762CF45B94D@gmx.net> If I understand Sendu's proposal correctly then the existing methods in Bio::DB::Taxonomy will remain largely unchanged (methods may be added though). Can you describe briefly what you use Bio::Taxonomy for, e.g., which methods you use primarily and the context? -hilmar On Jul 28, 2006, at 8:10 AM, Gabriel Valiente wrote: >>>> At the moment it seems to me that the Bio::Taxonomy modules >>>> (excluding >>>> Node) aren't really usable. > > I've been using Bio::Taxonomy and Bio::DB::Taxonomy a lot, they are > very useful modules. Whatever Bio::Taxon or Bio::Taxonomy::Taxon > turns out to be, please do keep the Bio::DB::Taxonomy functionality. > > BTW, does anybody know how to include branch lengths in > Bio::DB::Taxonomy? > > Thanks a lot, > > Gabriel -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From bix at sendu.me.uk Fri Jul 28 09:01:44 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Fri, 28 Jul 2006 14:01:44 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields) In-Reply-To: References: Message-ID: <44CA0AB8.7040205@sendu.me.uk> Gabriel Valiente wrote: >>>> At the moment it seems to me that the Bio::Taxonomy modules >>>> (excluding >>>> Node) aren't really usable.
> > I've been using Bio::Taxonomy Can I ask how you've been using it? > and Bio::DB::Taxonomy a lot, they are > very useful modules. Whatever Bio::Taxon or Bio::Taxonomy::Taxon > turns out to be, please do keep the Bio::DB::Taxonomy functionality. Bio::DB::Taxonomy is staying virtually unaltered. > BTW, does anybody know how to include branch lengths in > Bio::DB::Taxonomy? At the moment, you don't 'include' anything at all in the DB modules yourself, since they are read-only. They give you Nodes which you can alter afterwards. I plan to add something like a 'distance to parent' in Node (Bio::Taxon) so you can work out branch lengths; you can't do that yet. From bix at sendu.me.uk Fri Jul 28 09:13:44 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Fri, 28 Jul 2006 14:13:44 +0100 Subject: [Bioperl-l] Getting sequences by base pair locations In-Reply-To: <05a8f85bdc36dc7e370d0935b6394f9d@ucl.ac.uk> References: <05a8f85bdc36dc7e370d0935b6394f9d@ucl.ac.uk> Message-ID: <44CA0D88.3000404@sendu.me.uk> Yuval Itan wrote: > Hello all, > > I was BLATing a few hundred human genes against the chimp genome, and > kept the best chimp hits for every human gene. > I have the base pair start and end location for every chimp hit, and I > need to get the sequence for each of these chimp hits. Here is an > example for a few chimp hits bp locations: > > Start End* > *142854 144504 > 154479 155198 > 153066 167370 > 163146 163559 > > I have one chimp genome file (about 3GB) including all chromosomes, but > I could also get one file per chromosome if that would make things > easier. Does anyone have a script or a link for an interface that can do > the job? If your genome file is in some standard format, use SeqIO. 
http://www.bioperl.org/wiki/HOWTO:SeqIO And then get the sequence corresponding to the correct chromosome and get the desired chunk with subseq(); http://www.bioperl.org/wiki/HOWTO:Beginners#The_Sequence_Object You'd also have to make sure that the data used during the blat is exactly the same data you have in your big file. From sdavis2 at mail.nih.gov Fri Jul 28 09:28:02 2006 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Fri, 28 Jul 2006 09:28:02 -0400 Subject: [Bioperl-l] whole genome annotation In-Reply-To: References: <44C9FC15.3040503@mail.nih.gov> Message-ID: <44CA10E2.8010205@mail.nih.gov> Richard Birnie wrote: > > -----Original Message----- > From: Sean Davis [mailto:sdavis2 at mail.nih.gov] > Sent: Fri 7/28/2006 12:59 > To: Richard Birnie > Cc: bioperl-l at lists.open-bio.org > Subject: Re: [Bioperl-l] whole genome annotation > > Richard Birnie wrote: > >>Hello all, >> >>I'm just trying to familiarise myself with BioPerl and I'm a little overwhelmed by the sheer volume of information available on the wiki. I'm hoping someone can point in the right direction through the labyrinth. This may become a little longwinded but I'll try and get all the annoying newbie questions out of the way in one go. >> >>Let me try and explain what I'm aiming for. I have some CGH data downloaded from the Progenetix database (http://www.progenetix.de/~pgscripts/progenetix/Aboutprogenetix.html), this data is simplified to record simply gain/loss/amplification of whole chromosome bands at 862 band resolution to facilitate the combination of data from multiple different studies. >> >>What I'd like to be able to do is download a copy of the human genome sequence with annotation describing the locations of chromosome bands and preferably of known genes. I then want to be able to manipulate the genome data based on the CGH data to mimic deletions. 
The ultimate goal of this is to be able to feed the manipulated genome data into a program (metashark) that predicts the structure of metabolic networks based on genome annotation compared to a reference genome, in this case a complete 'normal' human genome and see what effect that has on the metabolic pathways. >> >>I appreciate that is a bit vague but thats sort of my problem, I'm not a bioinformatician really so I'm not sue of the details of what I want. I just happen to have an question to answer and bioperl seems the way to go (for this project and more generally). I've started looking at the HOWTOs and read the main bioperl tutorial. I also looked at the CGL comparative genomics library but I haven't penetrated far into that yet. I'm ok with basic perl although not much object oriented stuff. I don't really have much experience with handling sequence data on a whole genome scale either, a few genbank files for my favourite genes is fine but I need some guidance to work on this scale. >> >>What I'm looking for is someone to give me a start. I'd greatly appreciate it if someone could spell out the general steps for downloading a complete copy of the human genome and its annotations (if this is even a feasible approach) and how to put it all together. Not actual code just the general concept for each step and which tools from the bioperl set would be most appropriate for each step so that I can focus what I need to read about, even a little pseudo-code if I'm lucky. If I can get the genome data downloaded and setup properly I'll work out how to apply the CGH data to it myself. >> >>If example code for what I'm trying to describe is included somewhere, great could someone point to where. > > > Hi, Richard. > > Bioperl is good for many things, but for simply grabbing all the > locations of human genes in the genome and chromosome band locations, I > wouldn't use bioperl. 
It sounds to me like you are interested in > getting the genes associated with each chromosomal band? If so, just > download the cytoband.txt and refFlat.txt files from the UCSC genome > browser site. cytoband.txt contains the base pair locations for each of > the cytobands. refFlat.txt contains the base pair locations of "refseq" > genes. It is then simply a matter of finding overlapping regions (genes > with cytobands) to determine which genes are in which cytobands. Since > the files are tab-delimited text, they are very easy to work with (in > perl, excel, python, ...). Don't get me wrong--I really appreciate the > power of bioperl, but in this case, your task lends itself to a simpler > (and MUCH) faster approach. > > Sean > > Thanks for the response Sean, > > getting the genes associated with each band is certainly part of what I need and your suggestion will help with that. I did look at the UCSC site but as you know there is such a volume of info on there I didn't really know which files I needed. > > However my main goal requires slightly more. What I want to be able to do is take the chromosomal band annotation info and compare that against the CGH data I have. From this I'd like to then be able say "OK band 8q13.1 (or whatever) is deleted, so make a copy of the genome with the actual sequence associated with that band removed." Then I could feed both sequences into metashark which predicts the structure of metabolic pathways based on genome annotation and see what effect deleting that region of DNA has on the structure of the metabolic network. Knowing which genes are involved is necessary for identifying what are the important components within the region. Are there tools in Bioperl for making this comparison? It can probably be reduced to a straight comparison of data structures so I may just use regular perl for this part unless there is anything designed for purpose. 
>
> The thing I was struggling with was how to store and manipulate genomic sequence data in such quantities. Since this morning I've had a better look at the CGL library and associated datastore module, I think I can do it using these tools but I'm having a few dependency issues getting it installed right now. So I'll go back to wrestling with that.

Ahh. I see. Metashark actually searches the remaining sequence in the human genome? If that is the case, then you need the start and end positions of the chromosomal bands, which you can download from the UCSC genome browser. Follow the links to download and then to the genome of your choice and finally get the cytoBand.txt file.

The other piece of the puzzle is the Bio::DB::Fasta module. It allows extremely fast access to a set of fasta files, which it first indexes. Here is the documentation for it:

http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/DB/Fasta.html

You could imagine making a hash indexed by chromosome band of a hash of starts and ends for each band. For each CGH experiment, find those regions that are deleted. Exclude those when looping through all the chromosome bands, pulling the sequence using Bio::DB::Fasta for each band and writing that to a file for metashark.

Sean

From sdavis2 at mail.nih.gov Fri Jul 28 09:30:52 2006
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Fri, 28 Jul 2006 09:30:52 -0400
Subject: [Bioperl-l] Getting sequences by base pair locations
In-Reply-To: <05a8f85bdc36dc7e370d0935b6394f9d@ucl.ac.uk>
References: <05a8f85bdc36dc7e370d0935b6394f9d@ucl.ac.uk>
Message-ID: <44CA118C.7010401@mail.nih.gov>

Yuval Itan wrote:
> Hello all,
>
> I was BLATing a few hundred human genes against the chimp genome, and
> kept the best chimp hits for every human gene.
> I have the base pair start and end location for every chimp hit, and I
> need to get the sequence for each of these chimp hits.
Here is an > example for a few chimp hits bp locations: > > Start End* > *142854 144504 > 154479 155198 > 153066 167370 > 163146 163559 > > I have one chimp genome file (about 3GB) including all chromosomes, but > I could also get one file per chromosome if that would make things > easier. Does anyone have a script or a link for an interface that can do > the job? See this module: http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/DB/Fasta.html Sean From osborne1 at optonline.net Fri Jul 28 09:35:02 2006 From: osborne1 at optonline.net (Brian Osborne) Date: Fri, 28 Jul 2006 09:35:02 -0400 Subject: [Bioperl-l] whole genome annotation In-Reply-To: Message-ID: Richard, A good starting point is a FAQ page we created that describes various ways of extracting genomic sequence: http://www.bioperl.org/wiki/Getting_Genomic_Sequences Check that out, and Sean's suggestion, and write back to bioperl-l if you have questions. One thing that this page doesn't really address is the special challenge that comes with working with very large sequences, this is something you might have to consider as well. You also asked about downloading the human genome and its annotations. There's also more than one way to do this as well. You'd have access to this data if you used the ENSEMBL API but you can get the Genbank files at ftp://ftp.ncbi.nih.gov/genomes/. Having said that I should add that one of the advantages of the ENSEMBL API approach is that you don't have to download the entire genome. Don't know what machine you're working on but, again, trying to manipulate very large sequences may tax your computer as well as your patience. Brian O. On 7/28/06 5:39 AM, "Richard Birnie" wrote: > Hello all, > > I'm just trying to familiarise myself with BioPerl and I'm a little > overwhelmed by the sheer volume of information available on the wiki. I'm > hoping someone can point in the right direction through the labyrinth. 
This > may become a little longwinded but I'll try and get all the annoying newbie > questions out of the way in one go. > > Let me try and explain what I'm aiming for. I have some CGH data downloaded > from the Progenetix database > (http://www.progenetix.de/~pgscripts/progenetix/Aboutprogenetix.html), this > data is simplified to record simply gain/loss/amplification of whole > chromosome bands at 862 band resolution to facilitate the combination of data > from multiple different studies. > > What I'd like to be able to do is download a copy of the human genome sequence > with annotation describing the locations of chromosome bands and preferably of > known genes. I then want to be able to manipulate the genome data based on the > CGH data to mimic deletions. The ultimate goal of this is to be able to feed > the manipulated genome data into a program (metashark) that predicts the > structure of metabolic networks based on genome annotation compared to a > reference genome, in this case a complete 'normal' human genome and see what > effect that has on the metabolic pathways. > > I appreciate that is a bit vague but thats sort of my problem, I'm not a > bioinformatician really so I'm not sue of the details of what I want. I just > happen to have an question to answer and bioperl seems the way to go (for this > project and more generally). I've started looking at the HOWTOs and read the > main bioperl tutorial. I also looked at the CGL comparative genomics library > but I haven't penetrated far into that yet. I'm ok with basic perl although > not much object oriented stuff. I don't really have much experience with > handling sequence data on a whole genome scale either, a few genbank files for > my favourite genes is fine but I need some guidance to work on this scale. > > What I'm looking for is someone to give me a start. 
I'd greatly appreciate it > if someone could spell out the general steps for downloading a complete copy > of the human genome and its annotations (if this is even a feasible approach) > and how to put it all together. Not actual code just the general concept for > each step and which tools from the bioperl set would be most appropriate for > each step so that I can focus what I need to read about, even a little > pseudo-code if I'm lucky. If I can get the genome data downloaded and setup > properly I'll work out how to apply the CGH data to it myself. > > If example code for what I'm trying to describe is included somewhere, great > could someone point to where. > > Thanks for your patience. > best regards, > Richard > > > > Dr Richard Birnie > Scientific Officer > Section of Pathology and Tumour Biology > Welcome Brenner Building, LIMM > St James University Hospital > Beckett St, Leeds, LS9 7TF > Tel:0113 3438624 > e-mail: r.birnie at leeds.ac.uk > > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From sdavis2 at mail.nih.gov Fri Jul 28 09:41:45 2006 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Fri, 28 Jul 2006 09:41:45 -0400 Subject: [Bioperl-l] Getting sequences by base pair locations In-Reply-To: <44CA0D88.3000404@sendu.me.uk> References: <05a8f85bdc36dc7e370d0935b6394f9d@ucl.ac.uk> <44CA0D88.3000404@sendu.me.uk> Message-ID: <44CA1419.3030100@mail.nih.gov> Sendu Bala wrote: > Yuval Itan wrote: > >>Hello all, >> >>I was BLATing a few hundred human genes against the chimp genome, and >>kept the best chimp hits for every human gene. >>I have the base pair start and end location for every chimp hit, and I >>need to get the sequence for each of these chimp hits. 
Here is an
>>example for a few chimp hits bp locations:
>>
>>Start     End
>>142854    144504
>>154479    155198
>>153066    167370
>>163146    163559
>>
>>I have one chimp genome file (about 3GB) including all chromosomes, but
>>I could also get one file per chromosome if that would make things
>>easier. Does anyone have a script or a link for an interface that can do
>>the job?
>
>
> If your genome file is in some standard format, use SeqIO.
> http://www.bioperl.org/wiki/HOWTO:SeqIO
>
> And then get the sequence corresponding to the correct chromosome and
> get the desired chunk with subseq();
> http://www.bioperl.org/wiki/HOWTO:Beginners#The_Sequence_Object

My guess is that Yuval will need random access to the sequences. With SeqIO, this is possible with a relatively large amount of memory, but Bio::DB::Fasta might be the better bet.

Alternatively, make a custom track (see the documentation for doing so at the UCSC genome browser site), upload it, and then getting the DNA is trivial with just a couple of mouse clicks. This method also has the advantage of letting you view the data in genome coordinates, and it allows intersections with known chimp genes, so you could find hits that don't overlap known chimp genes, for example.

Sean

From valiente at lsi.upc.edu Fri Jul 28 09:53:10 2006
From: valiente at lsi.upc.edu (Gabriel Valiente)
Date: Fri, 28 Jul 2006 16:53:10 +0300
Subject: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields)
In-Reply-To: <001301c6b24b$da38ba80$15327e82@pyrimidine>
References: <001301c6b24b$da38ba80$15327e82@pyrimidine>
Message-ID: <1D2EF357-E8DE-4F15-824A-9C9359389685@lsi.upc.edu>

> Would be nice to know how you use Bio::Taxonomy. You are the first
> here who seems to have a use for it.

I'm using it to obtain a reference taxonomy for a set of organisms, against which to assess a phylogeny obtained by the usual protocol (fetch rRNA sequences, align them, obtain a distance matrix, cluster).
Roughly:

    use Bio::DB::Taxonomy;

    my $nodesfile = "nodes.dmp";
    my $namesfile = "names.dmp";

    my $db = new Bio::DB::Taxonomy(-source    => 'flatfile',
                                   -directory => "./db/",
                                   -nodesfile => $nodesfile,
                                   -namesfile => $namesfile);

    my @species = (...);

    for my $ncbi_name (@species) {
        my $ncbi_id = $db->get_taxonid($ncbi_name);
        my $node    = $db->get_Taxonomy_Node(-taxonid => $ncbi_id);
        my @lineage = get_lineage_nodes($node);
        # ...
    }

Here, get_lineage_nodes could be added as a method to Bio::Taxonomy::Node or equivalent:

    sub get_lineage_nodes {
        my $node = shift;
        my @lineage;
        while ($node->node_name ne "root") {
            $node = $node->get_Parent_Node;
            unshift @lineage, $node;
        }
        return @lineage;
    }

I've also written a method to merge the full lineages of a set of Bio::Taxonomy::Node objects into a Bio::Tree::Tree object. I'd be glad to contribute it as well, but I'm not sure where it would fit.

> As for branch lengths, I think you're confusing 'taxonomy'
> (classification of organisms based on just about anything) with
> 'phylogeny' (evolutionary relatedness). Note in the Wikipedia article
> below the use of the term 'phylogenetic taxonomy', which is the
> classification of organisms based on evolutionary relationships.
>
> http://en.wikipedia.org/wiki/Taxonomy
>
> http://en.wikipedia.org/wiki/Phylogeny
>
> NCBI has a disclaimer about the Taxonomy database that is related
> to this:
>
> http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=howcite
>
> There are HOWTOs on tree manipulation, population genetics, and
> PAML on the wiki which might be a good start for Bioperl
> phylogenetic methods:
>
> http://www.bioperl.org/wiki/HOWTO:Trees
>
> http://www.bioperl.org/wiki/HOWTO:PAML
>
> http://www.bioperl.org/wiki/HOWTO:PopGen

Thanks a lot. Let me check it and get back to the discussion later on.
Gabriel > Chris > >> -----Original Message----- >> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- >> bounces at lists.open-bio.org] On Behalf Of Gabriel Valiente >> Sent: Friday, July 28, 2006 7:10 AM >> To: bioperl-l at lists.open-bio.org >> Subject: Re: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields) >> >>>>> At the moment it seems to me that the Bio::Taxonomy modules >>>>> (excluding >>>>> Node) aren't really usable. >> >> I've been using Bio::Taxonomy and Bio::DB::Taxonomy a lot, they are >> very useful modules. Whatever Bio::Taxon or Bio::Taxonomy::Taxon >> turns out to be, please do keep the Bio::DB::Taxonomy functionality. >> >> BTW, does anybody know how to include branch lengths in >> Bio::DB::Taxonomy? >> >> Thanks a lot, >> >> Gabriel >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l From R.Birnie at leeds.ac.uk Fri Jul 28 09:56:15 2006 From: R.Birnie at leeds.ac.uk (Richard Birnie) Date: Fri, 28 Jul 2006 14:56:15 +0100 Subject: [Bioperl-l] whole genome annotation References: Message-ID: Thanks folks, That should be enough to get me going. At least I can see the wood for the trees now. Richard Dr Richard Birnie Scientific Officer Section of Pathology and Tumour Biology Welcome Brenner Building, LIMM St James University Hospital Beckett St, Leeds, LS9 7TF Tel:0113 3438624 e-mail: r.birnie at leeds.ac.uk -----Original Message----- From: Brian Osborne [mailto:osborne1 at optonline.net] Sent: Fri 7/28/2006 14:35 To: Richard Birnie; bioperl-l at lists.open-bio.org Subject: Re: [Bioperl-l] whole genome annotation Richard, A good starting point is a FAQ page we created that describes various ways of extracting genomic sequence: http://www.bioperl.org/wiki/Getting_Genomic_Sequences Check that out, and Sean's suggestion, and write back to bioperl-l if you have questions. 
One thing that this page doesn't really address is the special challenge that comes with working with very large sequences, this is something you might have to consider as well. You also asked about downloading the human genome and its annotations. There's also more than one way to do this as well. You'd have access to this data if you used the ENSEMBL API but you can get the Genbank files at ftp://ftp.ncbi.nih.gov/genomes/. Having said that I should add that one of the advantages of the ENSEMBL API approach is that you don't have to download the entire genome. Don't know what machine you're working on but, again, trying to manipulate very large sequences may tax your computer as well as your patience. Brian O. On 7/28/06 5:39 AM, "Richard Birnie" wrote: > Hello all, > > I'm just trying to familiarise myself with BioPerl and I'm a little > overwhelmed by the sheer volume of information available on the wiki. I'm > hoping someone can point in the right direction through the labyrinth. This > may become a little longwinded but I'll try and get all the annoying newbie > questions out of the way in one go. > > Let me try and explain what I'm aiming for. I have some CGH data downloaded > from the Progenetix database > (http://www.progenetix.de/~pgscripts/progenetix/Aboutprogenetix.html), this > data is simplified to record simply gain/loss/amplification of whole > chromosome bands at 862 band resolution to facilitate the combination of data > from multiple different studies. > > What I'd like to be able to do is download a copy of the human genome sequence > with annotation describing the locations of chromosome bands and preferably of > known genes. I then want to be able to manipulate the genome data based on the > CGH data to mimic deletions. 
The ultimate goal of this is to be able to feed > the manipulated genome data into a program (metashark) that predicts the > structure of metabolic networks based on genome annotation compared to a > reference genome, in this case a complete 'normal' human genome and see what > effect that has on the metabolic pathways. > > I appreciate that is a bit vague but thats sort of my problem, I'm not a > bioinformatician really so I'm not sue of the details of what I want. I just > happen to have an question to answer and bioperl seems the way to go (for this > project and more generally). I've started looking at the HOWTOs and read the > main bioperl tutorial. I also looked at the CGL comparative genomics library > but I haven't penetrated far into that yet. I'm ok with basic perl although > not much object oriented stuff. I don't really have much experience with > handling sequence data on a whole genome scale either, a few genbank files for > my favourite genes is fine but I need some guidance to work on this scale. > > What I'm looking for is someone to give me a start. I'd greatly appreciate it > if someone could spell out the general steps for downloading a complete copy > of the human genome and its annotations (if this is even a feasible approach) > and how to put it all together. Not actual code just the general concept for > each step and which tools from the bioperl set would be most appropriate for > each step so that I can focus what I need to read about, even a little > pseudo-code if I'm lucky. If I can get the genome data downloaded and setup > properly I'll work out how to apply the CGH data to it myself. > > If example code for what I'm trying to describe is included somewhere, great > could someone point to where. > > Thanks for your patience. 
> best regards, > Richard > > > > Dr Richard Birnie > Scientific Officer > Section of Pathology and Tumour Biology > Welcome Brenner Building, LIMM > St James University Hospital > Beckett St, Leeds, LS9 7TF > Tel:0113 3438624 > e-mail: r.birnie at leeds.ac.uk > > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Fri Jul 28 09:43:47 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Fri, 28 Jul 2006 08:43:47 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields) In-Reply-To: Message-ID: <001301c6b24b$da38ba80$15327e82@pyrimidine> Now I get personal email? Yikes! Sendu has indicated that Bio::DB::Taxonomy will stay essentially unchanged. If anything changes, it >may< be the class used to hold the Node information. Would be nice to know how you use Bio::Taxonomy. You are the first here who seems to have a use for it. As for branch lengths, I think you're confusing 'taxonomy' (classification of organisms based on just about anything) with 'phylogeny' (evolutionary relatedness). Note in the Wikipedia article below the use of the term 'phylogenetic taxonomy', which is the classification of organisms based on evolutionary relationships. 
http://en.wikipedia.org/wiki/Taxonomy

http://en.wikipedia.org/wiki/Phylogeny

NCBI has a disclaimer about the Taxonomy database that is related to this:

http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=howcite

There are HOWTOs on tree manipulation, population genetics, and PAML on the wiki which might be a good start for Bioperl phylogenetic methods:

http://www.bioperl.org/wiki/HOWTO:Trees

http://www.bioperl.org/wiki/HOWTO:PAML

http://www.bioperl.org/wiki/HOWTO:PopGen

Chris

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org
> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Gabriel Valiente
> Sent: Friday, July 28, 2006 7:10 AM
> To: bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields)
>
> >>> At the moment it seems to me that the Bio::Taxonomy modules
> >>> (excluding
> >>> Node) aren't really usable.
>
> I've been using Bio::Taxonomy and Bio::DB::Taxonomy a lot, they are
> very useful modules. Whatever Bio::Taxon or Bio::Taxonomy::Taxon
> turns out to be, please do keep the Bio::DB::Taxonomy functionality.
>
> BTW, does anybody know how to include branch lengths in
> Bio::DB::Taxonomy?
>
> Thanks a lot,
>
> Gabriel
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

From cjfields at uiuc.edu Fri Jul 28 10:15:38 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Fri, 28 Jul 2006 09:15:38 -0500
Subject: [Bioperl-l] Getting sequences by base pair locations
In-Reply-To: <44CA118C.7010401@mail.nih.gov>
Message-ID: <001401c6b250$4e3c2490$15327e82@pyrimidine>

Yuval,

You can also do this remotely if the file you want is in GenBank (and you don't want to store the data locally). The nice thing about using this is that any seqfeatures in the GenBank file within the region requested are also returned.
Note that if data is stored in a RefSeq file you'll need to add the parameter '-no_redirect => 1,' to the Bio::DB::GenBank object.

I would NOT recommend this for huge numbers of sequences (>2000) as you would be spamming NCBI with thousands of repeated requests; if you did have a relatively large number you could run this overnight, which is what I do. Bio::DB::Fasta would be better if you have tons of hits.

Use this in a loop to grab the sequences one at a time based on the start and stop positions (and strand, if you need it):

    # Below is from Bio::DB::GenBank POD, with some modifications.
    # This fragment is meant to sit inside a loop that sets $start, $end,
    # $strand and $sf (the object supplying seq_id and length) for each hit.
    my $factory = Bio::DB::GenBank->new(
        -seq_start => $start,
        -seq_stop  => $end,
        -strand    => $strand   # 1=plus, 2=minus
    );

    my $seq_obj;

    eval {
        $seq_obj = $factory->get_Seq_by_acc($sf->seq_id);
    };
    if( $@ ) {
        print STDERR "Unable to retrieve from $start to $end.\n";
        print STDERR "Error!\n$@";
        print STDERR "Attempting to move on...\n";
        next;
    }

    print STDERR "Got sequence: ",$seq_obj->description,"\n";
    print STDERR "\tLength: ",$seq_obj->length,"\n";
    my $sf_len = $sf->length;

The eval{} block is needed so that a failed retrieval over the network does not end the whole run: the object throws an error, eval catches it, the code logs it to STDERR, and the loop continues with the next hit.

Chris

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org
> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Sean Davis
> Sent: Friday, July 28, 2006 8:31 AM
> To: Yuval Itan
> Cc: bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] Getting sequences by base pair locations
>
> Yuval Itan wrote:
> > Hello all,
> >
> > I was BLATing a few hundred human genes against the chimp genome, and
> > kept the best chimp hits for every human gene.
> > I have the base pair start and end location for every chimp hit, and I
> > need to get the sequence for each of these chimp hits.
Here is an > > example for a few chimp hits bp locations: > > > > Start End* > > *142854 144504 > > 154479 155198 > > 153066 167370 > > 163146 163559 > > > > I have one chimp genome file (about 3GB) including all chromosomes, but > > I could also get one file per chromosome if that would make things > > easier. Does anyone have a script or a link for an interface that can do > > the job? > > See this module: > > http://doc.bioperl.org/releases/bioperl-current/bioperl- > live/Bio/DB/Fasta.html > > Sean > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Fri Jul 28 10:35:21 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Fri, 28 Jul 2006 09:35:21 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields) In-Reply-To: <1D2EF357-E8DE-4F15-824A-9C9359389685@lsi.upc.edu> Message-ID: <001501c6b253$0fed08a0$15327e82@pyrimidine> > use Bio::DB::Taxonomy; > I've also written a method to merge the full lineages of a set of > Bio::Taxonomy::Node object into a Bio::Tree::Tree object. I'd be glad > to contribute it as well, but I'm not sure where it would fit. Ah, that would be great (I had mentioned something along these lines to do with BLAST reports). But does this actually use Bio::Taxonomy directly? Taxonomy::Node does not inherit methods from Bio::Taxonomy AFAIK. So, anything that Sendu does may not dramatically impact your code. Sendu? You might need to address some of this to Sendu. Big changes are afoot for Bio::Taxonomy and Bio::Taxonomy::Node. He's heading that up. Chris > ... > Thanks a lot. Let me check it and get back to the discussion later on. > > Gabriel > > > Chris > > ... 
From cjfields at uiuc.edu Fri Jul 28 10:37:09 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Fri, 28 Jul 2006 09:37:09 -0500 Subject: [Bioperl-l] Getting sequences by base pair locations In-Reply-To: <44CA1419.3030100@mail.nih.gov> Message-ID: <001601c6b253$4ec57170$15327e82@pyrimidine> ... > > If your genome file is in some standard format, use SeqIO. > > http://www.bioperl.org/wiki/HOWTO:SeqIO > > > > And then get the sequence corresponding to the correct chromosome and > > get the desired chunk with subseq(); > > http://www.bioperl.org/wiki/HOWTO:Beginners#The_Sequence_Object > > My guess is that Yuval will need random access to the sequences. With > seqIO, this is possible with a relatively large amount of memory, but > Bio::DB::Fasta might be the better bet. Agreed. This is one of the bioperl 'speed' issue areas: http://www.bioperl.org/wiki/Project_priority_list Bio::DB::Fasta returns a specialized PrimarySeq object which gets around the current speed issues with SeqIO. > Alternatively, make a custom track (see the documentation for doing so > at the UCSC genome browser site), upload it, and then getting the DNA is > trivial with just a couple of mouseclicks. This method also has the > advantage of being able to do things like viewing the data in genome > coordinates and allows the possibility of doing interections with known > chimp genes so you could find hits that don't overlap known chimp genes, > for example. > > Sean Would be nice to have a more automated and direct way of doing something along these lines within bioperl (with the obvious caveat of not spamming the server). You can currently retrieve chunks of sequence based on start, stop, strand from GenBank. Ah, one can dream... 
Chris From bix at sendu.me.uk Fri Jul 28 10:38:20 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Fri, 28 Jul 2006 15:38:20 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields) In-Reply-To: <1D2EF357-E8DE-4F15-824A-9C9359389685@lsi.upc.edu> References: <001301c6b24b$da38ba80$15327e82@pyrimidine> <1D2EF357-E8DE-4F15-824A-9C9359389685@lsi.upc.edu> Message-ID: <44CA215C.2070607@sendu.me.uk> Gabriel Valiente wrote: >> Would be nice to know how you use Bio::Taxonomy. You are the first >> here who >> seems to have a use for it. > > I'm using it to obtain a reference taxonomy for a set of organisms, > against which to assess a phylogeny obtained by the usual protocol > (fetch rRNA sequences, align them, obtain a distance matrix, > cluster). Roughly: > > use Bio::DB::Taxonomy; Ah, we were specifically wondering if you had used Bio/Taxonomy.pm, not Taxonomy modules in general. Again, DB::Taxonomy usage will be unaffected. > Here, get_lineage_nodes could be added as a method to > Bio::Taxonomy::Node or equivalent: > > sub get_lineage_nodes{ > my $node = shift; > my @lineage; > while ($node->node_name ne "root") { > $node = $node->get_Parent_Node; > unshift @lineage, $node; > } > return @lineage; > } I think you must have an older version of bioperl. Bio::Taxonomy::Node has a method get_Lineage_Nodes() which more or less does exactly that. > I've also written a method to merge the full lineages of a set of > Bio::Taxonomy::Node object into a Bio::Tree::Tree object. I'd be glad > to contribute it as well, but I'm not sure where it would fit. 
Post it and I'll see if it will fit anywhere :) From cuiw at ncbi.nlm.nih.gov Fri Jul 28 09:46:50 2006 From: cuiw at ncbi.nlm.nih.gov (Cui, Wenwu (NIH/NLM/NCBI) [C]) Date: Fri, 28 Jul 2006 09:46:50 -0400 Subject: [Bioperl-l] Getting sequences by base pair locations In-Reply-To: <05a8f85bdc36dc7e370d0935b6394f9d@ucl.ac.uk> Message-ID: <18C407FD4FFB424292D769FBD68C1987C7C254@NIHCESMLBX8.nih.gov> Maybe the easiest way is to use LWP to get the webpage. Here is an example for CHIMP1A:10:12345678:12348888: http://www.ensembl.org/Pan_troglodytes/exportview?format=fasta&l=10%3A12 345678-12348888&action=export&_format=Text&output=txt&submit=Continue+%3 E%3E Wenwu Cui ________________________________ From: Yuval Itan [mailto:y.itan at ucl.ac.uk] Sent: Friday, July 28, 2006 8:08 AM To: bioperl-l at lists.open-bio.org Subject: [Bioperl-l] Getting sequences by base pair locations Hello all, I was BLATing a few hundred human genes against the chimp genome, and kept the best chimp hits for every human gene. I have the base pair start and end location for every chimp hit, and I need to get the sequence for each of these chimp hits. Here is an example for a few chimp hits bp locations: Start End 142854 144504 154479 155198 153066 167370 163146 163559 I have one chimp genome file (about 3GB) including all chromosomes, but I could also get one file per chromosome if that would make things easier. Does anyone have a script or a link for an interface that can do the job? Thank you very much. 
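
Editorial sketch: several replies in this thread recommend Bio::DB::Fasta because it indexes the FASTA file first and then jumps straight to the requested coordinates. The plain-Perl example below illustrates that general idea only; it is not Bio::DB::Fasta's implementation. The function names index_fasta and fetch_region and the tiny demo file are hypothetical, every sequence line within a record is assumed to have the same width (except the last), coordinates are taken as 1-based and inclusive, and there is no check that the end position lies within the sequence.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Record, for each FASTA entry, the byte offset of its first sequence line
# and the layout of its lines (bases per line, bytes per line with newline).
sub index_fasta {
    my ($path) = @_;
    my (%index, $id);
    open my $fh, '<', $path or die "Cannot open $path: $!";
    binmode $fh;
    while (defined(my $line = <$fh>)) {
        if ($line =~ /^>(\S+)/) {
            $id = $1;
            # tell() now points at the first sequence line of this record
            $index{$id} = { offset => tell($fh), bases => 0, bytes => 0 };
        }
        elsif (defined $id && !$index{$id}{bytes}) {
            $index{$id}{bytes} = length $line;      # includes the line ending
            (my $seq = $line) =~ s/[\r\n]+\z//;
            $index{$id}{bases} = length $seq;
        }
    }
    close $fh;
    return \%index;
}

# Return the bases from $start to $end (1-based, inclusive) of record $id
# by seeking to the right byte instead of reading the whole sequence.
sub fetch_region {
    my ($path, $index, $id, $start, $end) = @_;
    my $rec = $index->{$id} or die "Unknown sequence id '$id'";
    die "Record '$id' has no sequence lines" unless $rec->{bases};
    die "Bad coordinates $start..$end" if $start < 1 || $end < $start;

    # Map a 0-based base position to its byte offset in the file.
    my $byte_of = sub {
        my ($p) = @_;
        return $rec->{offset}
             + int($p / $rec->{bases}) * $rec->{bytes}
             + $p % $rec->{bases};
    };
    my $first = $byte_of->($start - 1);
    my $last  = $byte_of->($end - 1);

    open my $fh, '<', $path or die "Cannot open $path: $!";
    binmode $fh;
    seek($fh, $first, 0) or die "Cannot seek in $path: $!";
    read($fh, my $buf, $last - $first + 1);
    close $fh;
    $buf =~ s/[\r\n]//g;    # drop line endings that fall inside the region
    return $buf;
}

# Demo on a small made-up FASTA file (illustrative data only).
my ($tmp_fh, $fasta_path) = tempfile(UNLINK => 1);
binmode $tmp_fh;
print {$tmp_fh} ">chr1\nACGTACGTAC\nGTTTGACCAA\nGGCC\n>chr2\nTTTTTCCCCC\nGGGGG\n";
close $tmp_fh;

my $index = index_fasta($fasta_path);
for my $hit ([ 'chr1', 9, 13 ], [ 'chr2', 3, 12 ]) {
    my ($id, $start, $end) = @$hit;
    print "$id:$start-$end\t",
          fetch_region($fasta_path, $index, $id, $start, $end), "\n";
}
```

For a 3GB genome the point of this design is that only the index pass reads the whole file; each later lookup reads just the bytes covering the requested region. In practice the indexed module named in the thread is the sensible choice, since it also handles details this sketch leaves out.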
From valiente at lsi.upc.edu Fri Jul 28 10:49:28 2006 From: valiente at lsi.upc.edu (Gabriel Valiente) Date: Fri, 28 Jul 2006 17:49:28 +0300 Subject: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields) In-Reply-To: <001501c6b253$0fed08a0$15327e82@pyrimidine> References: <001501c6b253$0fed08a0$15327e82@pyrimidine> Message-ID: <5563CD94-DC99-46A3-A56A-485D4A4D3031@lsi.upc.edu> >> use Bio::DB::Taxonomy; > > > >> I've also written a method to merge the full lineages of a set of >> Bio::Taxonomy::Node object into a Bio::Tree::Tree object. I'd be glad >> to contribute it as well, but I'm not sure where it would fit. > > Ah, that would be great (I had mentioned something along these > lines to do > with BLAST reports). But does this actually use Bio::Taxonomy > directly? > Taxonomy::Node does not inherit methods from Bio::Taxonomy AFAIK. So, > anything that Sendu does may not dramatically impact your code. > Sendu? It is a general algorithm I devised that takes a set of paths and builds up a tree. The input paths are full lineages coming from Bio::DB::Taxonomy, the output tree is a Bio::Tree::Tree. This is why I said I don't know exactly where it would belong, it looks to me more like a standalone script than a Bio::Taxonomy or Bio::Tree method. Gabriel > You might need to address some of this to Sendu. Big changes are > afoot for > Bio::Taxonomy and Bio::Taxonomy::Node. He's heading that up. > > Chris > >> ... >> Thanks a lot. Let me check it and get back to the discussion later >> on. >> >> Gabriel >> >>> Chris >>> > ... 
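
Editorial sketch: the message above describes an algorithm that takes a set of paths (full lineages) and builds one tree from them. The plain-Perl example below shows one simple way to do that kind of merge with nested hashes and then print the result in a Newick-like form. It is not Gabriel's script and does not use Bio::Tree::Tree; the function names merge_paths, newick_label and to_newick are hypothetical, the lineages are heavily abbreviated for illustration, children are sorted by name only to make the output deterministic, and labels are wrapped in double quotes to mimic the quoting seen in this thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Merge root-to-leaf paths into a nested hash. Each key is a node name and
# each value is the hash of that node's children, so shared prefixes of two
# lineages end up as shared ancestors.
sub merge_paths {
    my (@paths) = @_;
    my %tree;
    for my $path (@paths) {
        my $node = \%tree;
        for my $name (@$path) {
            $node = $node->{$name} ||= {};   # reuse an existing child or add one
        }
    }
    return \%tree;
}

# Quote labels containing whitespace or characters that are special in Newick.
sub newick_label {
    my ($name) = @_;
    return $name =~ /[\s(),:;'"\[\]]/ ? qq{"$name"} : $name;
}

# Recursively serialise a node: "(child1,child2,...)label" for internal
# nodes and just "label" for leaves.
sub to_newick {
    my ($children, $name) = @_;
    my @kids  = map { to_newick($children->{$_}, $_) } sort keys %$children;
    my $label = newick_label($name);
    return @kids ? '(' . join(',', @kids) . ')' . $label : $label;
}

# Abbreviated lineages (most intermediate ranks left out for brevity).
my @lineages = (
    [ 'root', 'Hominidae', 'Pongo', 'Pongo pygmaeus' ],
    [ 'root', 'Hominidae', 'Homo/Pan/Gorilla group', 'Gorilla', 'Gorilla gorilla' ],
    [ 'root', 'Hominidae', 'Homo/Pan/Gorilla group', 'Pan', 'Pan troglodytes' ],
    [ 'root', 'Hominidae', 'Homo/Pan/Gorilla group', 'Homo', 'Homo sapiens' ],
);

my $tree   = merge_paths(@lineages);
my $newick = to_newick($tree->{root}, 'root') . ';';
print "$newick\n";
```

The nested-hash form keeps the merge to a few lines because looking up a child by name is a hash access; a version built on tree node objects would instead search each node's descendants for a matching id before deciding whether to descend or to add a new branch.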
From sdavis2 at mail.nih.gov Fri Jul 28 11:21:09 2006 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Fri, 28 Jul 2006 11:21:09 -0400 Subject: [Bioperl-l] Getting sequences by base pair locations In-Reply-To: <001601c6b253$4ec57170$15327e82@pyrimidine> References: <001601c6b253$4ec57170$15327e82@pyrimidine> Message-ID: <44CA2B65.8070906@mail.nih.gov> Chris Fields wrote: > Would be nice to have a more automated and direct way of doing something > along these lines within bioperl (with the obvious caveat of not spamming > the server). You can currently retrieve chunks of sequence based on start, > stop, strand from GenBank. The ENSembl API has some features that can be useful for these types of things. I, personally, have a mirror of the UCSC mysql database (very easy to do with just rsync and mysql) and try to turn questions like these into SQL queries. That, combined with Bio::DB::Fasta, can make a useful automated pipeline for getting arbitrary sequences associated with genomic locations meeting specific criteria. It is much faster than anything one can do over the web and doesn't have access limitations. Sean From cjfields at uiuc.edu Fri Jul 28 11:27:17 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Fri, 28 Jul 2006 10:27:17 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields) In-Reply-To: <5563CD94-DC99-46A3-A56A-485D4A4D3031@lsi.upc.edu> Message-ID: <000001c6b25a$4f9392b0$15327e82@pyrimidine> > It is a general algorithm I devised that takes a set of paths and > builds up a tree. The input paths are full lineages coming from > Bio::DB::Taxonomy, the output tree is a Bio::Tree::Tree. This is why > I said I don't know exactly where it would belong, it looks to me > more like a standalone script than a Bio::Taxonomy or Bio::Tree method. > > Gabriel Agreed. 
You could submit the script as an example here if it is short, or via Bugzilla as an enhancement request: http://bugzilla.open-bio.org/ It could be added to the scripts\ or examples\ directory in bioperl-core. Chris From valiente at lsi.upc.edu Fri Jul 28 12:35:20 2006 From: valiente at lsi.upc.edu (Gabriel Valiente) Date: Fri, 28 Jul 2006 19:35:20 +0300 Subject: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields) In-Reply-To: <000001c6b25a$4f9392b0$15327e82@pyrimidine> References: <000001c6b25a$4f9392b0$15327e82@pyrimidine> Message-ID: <3DB992C6-DF16-42B9-8C36-F3B5C8CCBDE7@lsi.upc.edu> >> It is a general algorithm I devised that takes a set of paths and >> builds up a tree. The input paths are full lineages coming from >> Bio::DB::Taxonomy, the output tree is a Bio::Tree::Tree. This is why >> I said I don't know exactly where it would belong, it looks to me >> more like a standalone script than a Bio::Taxonomy or Bio::Tree >> method. >> >> Gabriel > > Agreed. You could submit the script as an example here if it is > short, or > via Bugzilla as an enhancement request: > > http://bugzilla.open-bio.org/ > > It could be added to the scripts\ or examples\ directory in bioperl- > core. Here it is. Please check it and include for instance as taxonomy2tree.PLS in the scripts/tree or scripts/taxonomy directory. Disclaimer: I'm also publishing part of this code in a conference paper. The script is already fully functional but anyway, I have a couple of improvements in mind. The minor one is provision for cmdline input. How would you like to input an array of names? 
The other one is to remove internal node labels and contract elementary
paths, for instance reducing the tree:

(((((((((((((((((((((((((((("Pongo pygmaeus")Pongo,(("Gorilla gorilla")Gorilla,("Pan troglodytes")Pan,("Homo sapiens")Homo)"Homo/Pan/Gorilla group")Hominidae)Hominoidea)Catarrhini)Simiiformes)Primates)Euarchontoglires)Eutheria)Theria)Mammalia)Amniota)Tetrapoda)Sarcopterygii)Euteleostomi)Teleostomi)"Gnathostomata ")Vertebrata)"Craniata ")Chordata)Deuterostomia)Coelomata)Bilateria)Eumetazoa)Metazoa)"Fungi/Metazoa group")Eukaryota)"cellular organisms")root;

to the tree:

("Pongo pygmaeus",("Gorilla gorilla","Pan troglodytes","Homo sapiens"));

It is certainly easy to remove all internal node labels. On the other
hand, I've been working on contraction of elementary paths for quite a
while, but always got stuck with internals of the Bio::Tree methods to
remove nodes. Thus, so far the only working code I have consists of
removing elementary branches while making a deep copy of the tree, which
certainly is not quite elegant...
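
A minimal sketch of the contraction described above, written on plain nested hashes rather than Bio::Tree objects. The node layout and the helper names (node, contract, newick) are assumptions made for illustration; they are not part of the Bio::Tree API or of the script posted in this thread.

```perl
use strict;
use warnings;

# Hypothetical node layout: { name => $label, children => [ @child_nodes ] }.
sub node {
    my ($name, @children) = @_;
    return { name => $name, children => [@children] };
}

# Contract elementary paths: a node with exactly one child is replaced by
# that child, repeatedly, and the same is then done for every subtree.
sub contract {
    my ($node) = @_;
    $node = $node->{children}[0] while @{ $node->{children} } == 1;
    $node->{children} = [ map { contract($_) } @{ $node->{children} } ];
    return $node;
}

# Newick string with labels on the leaves only (internal labels dropped).
sub newick {
    my ($node) = @_;
    my @kids = @{ $node->{children} };
    return '"' . $node->{name} . '"' unless @kids;
    return '(' . join(',', map { newick($_) } @kids) . ')';
}

# A shortened version of the lineage tree from the message above.
my $tree = node('root',
    node('Hominoidea',
        node('Hominidae',
            node('Pongo', node('Pongo pygmaeus')),
            node('Homo/Pan/Gorilla group',
                node('Gorilla', node('Gorilla gorilla')),
                node('Pan',     node('Pan troglodytes')),
                node('Homo',    node('Homo sapiens'))))));

print newick(contract($tree)), ";\n";
```

Because contract returns the node that survives instead of deleting nodes in place, no node-removal method is needed; whether the same approach maps cleanly onto Bio::Tree::Node is not shown here.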
Thanks a lot,

Gabriel

#!/usr/bin/perl -w

# Author: Gabriel Valiente
# Purpose: Bio::DB::Taxonomy -> Bio::Tree::Tree

use strict;
use Bio::DB::Taxonomy;
use Bio::TreeIO;

my $nodesfile = "nodes.dmp";
my $namesfile = "names.dmp";

my $db = new Bio::DB::Taxonomy(-source => 'flatfile',
                               -directory => "./db/",
                               -nodesfile => $nodesfile,
                               -namesfile => $namesfile);

# the input to the script is an array of species names

my @species = ('Orangutan', 'Gorilla', 'Chimpanzee', 'Human');

my $root = new Bio::Tree::Node(-id => "root");
my $tree = new Bio::Tree::Tree(-root => $root);

# the full lineages of the species are merged into a tree

for my $name (@species) {
  my $ncbi_id = $db->get_taxonid($name);
  if ($ncbi_id) {
    my $node = $db->get_Taxonomy_Node(-taxonid => $ncbi_id);
    my @lineage = get_lineage_nodes($node);
    shift @lineage; # discard root
    push @lineage, $node;
    merge_path($root, \@lineage);
  } else {
    warn "no NCBI Taxonomy node for species ",$name,"\n";
  }
}

# the tree is output in Newick format

my $output = new Bio::TreeIO(-format => 'newick');
$output->write_tree($tree);

# the actual merging of full lineages is performed by a recursive method

sub merge_path {
  my $root = shift;
  my $path = shift;
  my @path = @{$path};
  if (@path) {
    my $top = shift @path;
    my @children = grep { $_->id eq $top->node_name } $root->each_Descendent;
    if (@children) {
      # $root has a $child with id eq $top name
      my $child = shift @children;
      merge_path($child,\@path);
    } else {
      # add $top and @path below $root
      my $node = $root;
      unshift @path, $top;
      while (@path) {
        my $top = shift @path;
        my $name = $top->node_name;
        my $child = new Bio::Tree::Node(-id => "$name");
        $node->add_Descendent($child);
        $node = $child;
      }
    }
  }
}

# the full lineage of a species is recovered by traversing the taxonomy

sub get_lineage_nodes{
  my $node = shift;
  my @lineage;
  while ($node->node_name ne "root") {
    $node = $node->get_Parent_Node;
    unshift @lineage, $node;
  }
  return @lineage;
}

=head1 NAME

taxonomy2tree - builds a taxonomic tree based on
the full lineages of a set of species names

=head1 DESCRIPTION

This script requires that the bioperl-run pkg be also installed.

Providing the nodes.dmp and names.dmp files from the NCBI Taxonomy
dump (see Bio::DB::Taxonomy::flatfile for more info) is only necessary
on the first time running. This will create the local indexes and may
take quite a long time. However once created, these indexes will allow
fast access for species to taxon id OR taxon id to species name lookups.

=cut

From MEC at stowers-institute.org Fri Jul 28 12:44:43 2006
From: MEC at stowers-institute.org (Cook, Malcolm)
Date: Fri, 28 Jul 2006 11:44:43 -0500
Subject: [Bioperl-l] Getting sequences by base pair locations
Message-ID:

There are many options.

But, it looks like you only have start and end coordinates! How do you
know which chromosome/contig the hit was on?

Assuming you have this, if you did the blat with a local copy of the
blat program and the genome, then in addition to the blat command, you
have the twoBitToFa command which can extract the hits from the blat
index (see http://genome.ucsc.edu/goldenPath/help/blatSpec.html )

Or did you do the blat at ucsc?

Malcolm Cook
Database Applications Manager, Bioinformatics
Stowers Institute for Medical Research

oh - I replied similarly in the Bio BB forum, but it is held for
moderation so am replying here as well

________________________________

From: bioperl-l-bounces at lists.open-bio.org
[mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Yuval Itan
Sent: Friday, July 28, 2006 7:08 AM
To: bioperl-l at lists.open-bio.org
Subject: [Bioperl-l] Getting sequences by base pair locations

Hello all,

I was BLATing a few hundred human genes against the chimp genome, and
kept the best chimp hits for every human gene. I have the base pair
start and end location for every chimp hit, and I need to get the
sequence for each of these chimp hits.
Here is an example for a few chimp hits bp locations:

Start    End
142854   144504
154479   155198
153066   167370
163146   163559

I have one chimp genome file (about 3GB) including all chromosomes, but
I could also get one file per chromosome if that would make things
easier. Does anyone have a script or a link for an interface that can do
the job?

Thank you very much.

From osborne1 at optonline.net Fri Jul 28 13:25:12 2006
From: osborne1 at optonline.net (Brian Osborne)
Date: Fri, 28 Jul 2006 13:25:12 -0400
Subject: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields)
In-Reply-To: <3DB992C6-DF16-42B9-8C36-F3B5C8CCBDE7@lsi.upc.edu>
Message-ID:

Gabriel,

It looks like most of the Bioperl scripts use Getopt::Long. Its
documentation says, in part:

Options can take multiple values at once, for example

    --coordinates 52.2 16.4 --rgbcolor 255 255 149

This can be accomplished by adding a repeat specifier to the option
specification. Repeat specifiers are very similar to the {...} repeat
specifiers that can be used with regular expression patterns. For
example, the above command line would be handled as follows:

    GetOptions('coordinates=f{2}' => \@coor, 'rgbcolor=i{3}' => \@color);

So the arguments are space-delimited on the command line. Is the problem
that the names can be binomial?

Brian O.

On 7/28/06 12:35 PM, "Gabriel Valiente" wrote:

> The minor one is provision for cmdline input.
> How would you like to input an array of names?

From golharam at umdnj.edu Fri Jul 28 14:03:39 2006
From: golharam at umdnj.edu (Ryan Golhar)
Date: Fri, 28 Jul 2006 14:03:39 -0400
Subject: [Bioperl-l] Bio::Align::DNAStatistics module has errors?
Message-ID: <01a701c6b270$28232130$2f01a8c0@GOLHARMOBILE1>

This is from the description:

This object contains routines for calculating various statistics and
distances for DNA alignments. The routines are not well tested and do
contain errors at this point. Work is underway to correct them, but do
not expect this code to give you the right answer currently!
Use dnadist/distmat in the PHYLIP or EMBOSS packages to calculate the
distances.

Any idea what the errors are and what is/is not usable?

From lzhtom at hotmail.com Fri Jul 28 22:00:23 2006
From: lzhtom at hotmail.com (zhihua li)
Date: Sat, 29 Jul 2006 02:00:23 +0000
Subject: [Bioperl-l] how to get annotations (especially ensembl IDs) for a list of genes?
Message-ID:

Hi all,

I have a list of like 300 genes (actually their refseq IDs). Now I wanna
get more information (annotations) for each of the genes. Specifically,
I want a mapping of the refseq IDs to Ensembl gene IDs.

I know how to do it through a web page. But I'm wondering if I can also
do it via bioperl, by using some modules or packages. Can anyone help me
out here?

Thanks a lot!

From jason.stajich at duke.edu Sat Jul 29 01:18:50 2006
From: jason.stajich at duke.edu (Jason Stajich)
Date: Fri, 28 Jul 2006 22:18:50 -0700
Subject: [Bioperl-l] Bio::Align::DNAStatistics module has errors?
Message-ID:

I think that msg was CYA by me at some point - I am pretty sure I made
tests based on numbers from PHYLIP and EMBOSS but was hoping for someone
else to help. At this point I have no reliable time to really work on
it, but I hope someone who is interested in it will give it a whirl.

There may be some boundary cases that don't work where seqs are too
short or have a zero number of a particular nt but in general the nums
should jive.

I am not sure about all the NG Ks and Ka as I didn't write those but I
believe Richard vetted them pretty well first. There are a couple of
methods not implemented too - am always hopeful other people will see
this as a great starting point and roll up their sleeves to join in...

-jason
--
Jason Stajich
Duke University
http://www.duke.edu/~jes12

From bix at sendu.me.uk Sat Jul 29 03:25:38 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Sat, 29 Jul 2006 08:25:38 +0100
Subject: [Bioperl-l] how to get annotations (especially ensembl IDs) for a list of genes?
In-Reply-To:
References:
Message-ID: <44CB0D72.20104@sendu.me.uk>

zhihua li wrote:
> Hi all,
>
> I have a list of like 300 genes (actually their refseq IDs). Now I
> wanna get more information (annotations) for each of the genes.
> Specifically, I want a mapping of the refseq IDs to Ensembl gene IDs.
>
> I know how to do it through a web page. But I'm wondering if I can also
> do it via bioperl

One possible way is to use the Ensembl perl API:
http://www.ensembl.org/info/software/core/core_tutorial.html

You'd get a gene or transcript adaptor and use
fetch_all_by_external_name() iirc. I'm aware that not every entrez id
can be mapped that way, but perhaps most if not all refseqs will work.

From bix at sendu.me.uk Sat Jul 29 03:54:52 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Sat, 29 Jul 2006 08:54:52 +0100
Subject: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields)
In-Reply-To: <001301c6b24b$da38ba80$15327e82@pyrimidine>
References: <001301c6b24b$da38ba80$15327e82@pyrimidine>
Message-ID: <44CB144C.6050507@sendu.me.uk>

Chris Fields wrote:
>
> As for branch lengths, I think you're confusing 'taxonomy' (classification
> of organisms based on just about anything) with 'phylogeny' (evolutionary
> relatedness). Note in the Wikipedia article below the use of the term
> 'phylogenetic taxonomy', which is the classification of organisms based on
> evolutionary relationships.
>
> http://en.wikipedia.org/wiki/Taxonomy
>
> http://en.wikipedia.org/wiki/Phylogeny

Indeed. The two can be considered closely intertwined - if you were
making a phylogeny you might hang it on a taxonomy. At any rate, you
need to know a bunch of evolutionarily related species names before you
start work, and Bio::Taxonomy::Node has been as good a place as any to
get that.
> There are HOWTOs on tree manipulation, population genetics, and PAML on the
> wiki which might be a good start for Bioperl phylogenetic methods:
>
> http://www.bioperl.org/wiki/HOWTO:Trees

Which is why the Trees HOWTO talks about taxa, and a number of the
Taxonomy modules have phylogenetic methods like get_lca. (And why there
is Bio::Taxonomy::Taxon and Tree.)

I suppose this is another reason to make Bio::Taxonomy::Node (ne
Bio::Taxon) implement Bio::Tree::NodeI.

(for these reasons I don't think Gabriel's method is most appropriate
as a script - it's something you might do all the time, as a matter of
course. If Bio::Taxon was a Bio::Tree::NodeI you would just do
my $tree = new Bio::Tree::Tree(-root => $bio_taxon);
and blamo, instant phylogenetic taxonomy)

From cjfields at uiuc.edu Sat Jul 29 07:49:29 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Sat, 29 Jul 2006 06:49:29 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields)
In-Reply-To: <44CB144C.6050507@sendu.me.uk>
References: <001301c6b24b$da38ba80$15327e82@pyrimidine> <44CB144C.6050507@sendu.me.uk>
Message-ID: <5C765CAF-4976-45D1-BDA4-1E432081FF4D@uiuc.edu>

On Jul 29, 2006, at 2:54 AM, Sendu Bala wrote:

> Chris Fields wrote:
>>
>> As for branch lengths, I think you're confusing 'taxonomy'
>> (classification of organisms based on just about anything) with
>> 'phylogeny' (evolutionary relatedness). Note in the Wikipedia
>> article below the use of the term 'phylogenetic taxonomy', which is
>> the classification of organisms based on evolutionary relationships.
>>
>> http://en.wikipedia.org/wiki/Taxonomy
>>
>> http://en.wikipedia.org/wiki/Phylogeny
>
> Indeed. The two can be considered closely intertwined - if you were
> making a phylogeny you might hang it on a taxonomy. At any rate, you
> need to know a bunch of evolutionarily related species names before
> you start work, and Bio::Taxonomy::Node has been as good a place as
> any to get that.
Intertwined, yes, but not exactly the same. Hence the NCBI disclaimer I mentioned: How to reference the NCBI taxonomy database The NCBI taxonomy database is not a primary source for taxonomic or phylogenetic information. Furthermore, the database does not follow a single taxonomic treatise but rather attempts to incorporate phylogenetic and taxonomic knowledge from a variety of sources, including the published literature, web-based databases, and the advice of sequence submitters and outside taxonomy experts. Consequently, the NCBI taxonomy database is not a phylogenetic or taxonomic authority and should not be cited as such. >> There are HOWTOs on tree manipulation, population genetics, and >> PAML on the >> wiki which might be a good start for Bioperl phylogenetic methods: >> >> http://www.bioperl.org/wiki/HOWTO:Trees > > Which is why the Trees HOWTO talks about taxa, and a number of the > Taxonomy modules have phylogenetic methods like get_lca. (And why > there > is Bio::Taxonomy::Taxon and Tree.) Are we still thinking about deprecating those? I have seen very little mention of those modules from the mail list archives, and Jason mentioned that Bio::Taxonomy::Taxon hasn't been modified in a long time. > I suppose this is another reason to make Bio::Taxonomy::Node (ne > Bio::Taxon) implement Bio::Tree::NodeI. > > (for these reasons I don't think Gabriel's method isn't best > appropriate > as a script - it's something you might do all the time, as a matter of > course. If Bio::Taxon wasa Bio::Tree::NodeI you would just do my > $tree = > new Bio::Tree::Tree(-root => $bio_taxon); and blamo, instant > phylogenetic taxonomy) Brian already deposited the script (see bioperl-guts). You could use it for the methods, of course noting Gabriel's contribution. Sounds like a good plan to me ; > Christopher Fields Postdoctoral Researcher Lab of Dr. 
Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign

From nabil at broad.mit.edu Sun Jul 30 00:28:00 2006
From: nabil at broad.mit.edu (Nabil Hafez)
Date: Sun, 30 Jul 2006 00:28:00 -0400
Subject: [Bioperl-l] Strangeness in parsing blast file
Message-ID: <44CC3550.5070105@broad.mit.edu>

Hi,

I am having a somewhat similar problem to what was posted in
http://bioperl.org/pipermail/bioperl-l/2006-May/021416.html
however, I have read through all of that thread and I don't believe
what I am experiencing is the exact same problem. I also realize that
the Bioperl version 1.5.1 fixes a problem with blast parsing.

My problem:
My blastresults file parses fine and everything works swimmingly if I
pass the blast output file by name such as

$blast_result = 'test.blastout';

however when I do

$blast_result = &do_blast($sample_fasta);

even though in both cases $blast_result evaluates to "test.blastout",
the parsing doesn't work, more specifically it gets an undefined
variable for $result in

while( my $result = $report_obj->next_result ) {

Sorry for the long email - any help would be appreciated,
Thanks - Nabil

The code...non relevant parts trimmed for size constraints....debugging
from working and non-working versions after the code.

use strict;
use Bio::SearchIO;
use Getopt::Std;
use List::Util qw(shuffle);
use Benchmark;

my ($inputfile, $samplefile, $blastfile, $blast_result, $blast_report, $blast_verbose); #files generated

#------------------#
# Subroutine Calls #
#------------------#
my $test = &create_sample_file($inputfile); #inputfile being a fasta file containing nucleotide sequence
$blast_result = &do_blast($test);
##$blast_result = 'test.blastout'; #when this is uncommented and replaces the previous two lines, with test.blastout being normal blast output - the script works fine.
&parse_blast($blast_result);

#######################
# create_sample_file
#
# Input: Original Fasta File
#
# Output: Fasta file containing randomly sampled reads
#
#
sub create_sample_file {
    my $in = shift;
    my $linecount = 0;
    my @lines;
    $samplefile = $in . "_sample";

    #Determine total # of reads in input fasta
    $totalreads = `$grep -c '>' $inputfile`;
    $totalreads =~ s/\s+//;
    chomp $totalreads;

    if ($totalreads > 1000) { #sample if more than 1000 reads
        $sample_reads = sprintf("%.0f", $totalreads * ($per_to_sample/100)); #number of reads to sample
    } else { #otherwise use all reads
        $sample_reads = $totalreads;
    }

    $/ = '>'; #define fasta record input separator
    open (IN, "<$in") or die "Cannot open $in $!\n";
    open (OUT, ">$samplefile") or die "Cannot open $samplefile $!\n";

    while (<IN>) { #read lines into an array
        chomp;
        push (@lines, $_);
    }
    @lines = shuffle(@lines); #shuffle array

    foreach (@lines) {
        print OUT ">$_" if $linecount <= $sample_reads; #output to file sampled number of reads
        $linecount++;
    }
    close IN;
    close OUT;
    return $samplefile;
}#end create_sample_file

#######################
# do_blast
#
# Input: Fasta File containing SCREENED sampled reads
#
# Output: Blast File
#
#
sub do_blast {
    my $bf = shift;
    my $blastoutput = $bf . ".blastout";
    print "Blasting against $db ...\n";
    `blast/blast-2.2.10/bin/megablast -d /prodinfo/proddata_ntblastdb/nt -e 1e-50 -p 99 -D 2 -i test -o test.blastout`;
    return $blastoutput;
}#end do_blast

#######################
# parse_blast
#
# Input: Blast file
#
# Output: Creates hash containing best hit for each read
#
#
sub parse_blast {
    my $blastoutfile = shift;
    if (!
-e $blastoutfile) { die "$blastoutfile does not exist $!\n"; } print "Parsing blast hits ...\n"; my $report_obj = new Bio::SearchIO(-verbose => 1, -format => 'blast', -file => $blastoutfile); die "no valid $report_obj" unless defined $report_obj; while( my $result = $report_obj->next_result ) { die "no valid $result" unless defined $result; while( my $hit = $result->next_hit ) { while( my $hsp = $hit->next_hsp ) { my $name = $result->query_name; my $hitDesc = $hit->description; my $length = $hsp->length('total'); my $per_id = sprintf("%.2f", $hsp->percent_identity); my $eval = $hsp->evalue; next if (defined $blast_results{$name} && $blast_results{$name}->[0] > $length); #only keep best hit for any read $blast_results{$name} = [$length, $per_id, $eval, $hitDesc]; #store in hash of arrays } } } } #end parse_blast Debug of NON-working blast-parse: main::(454/scripts/fasta_blasta_mb.pl:151): 151: my $sample_fasta = &create_sample_file($inputfile); DB<2> n main::(454/scripts/fasta_blasta_mb.pl:152): 152: $blast_result = &do_blast($sample_fasta); DB<2> x $sample_fasta 0 'G782.2005-08-16-16-48.fasta_sample' DB<3> n Blasting against NT ... main::(454/scripts/fasta_blasta_mb.pl:154): 154: &parse_blast($blast_result); DB<3> s main::parse_blast(454/scripts/fasta_blasta_mb.pl:293): 293: my $blastoutfile = shift; DB<3> s main::parse_blast(454/scripts/fasta_blasta_mb.pl:295): 295: if (! -e $blastoutfile) { DB<3> x $blastoutfile 0 'G782.2005-08-16-16-48.fasta_sample.blastout' DB<4> s main::parse_blast(454/scripts/fasta_blasta_mb.pl:299): 299: print "Parsing blast hits ...\n"; DB<4> s Parsing blast hits ... 
main::parse_blast(454/scripts/fasta_blasta_mb.pl:302): 302: my $report_obj = new Bio::SearchIO(-verbose => 1, 303: -format => 'blast', 304: -file => $blastoutfile);#or die "Could not open blast report $!"; DB<4> s Bio::SearchIO::new(/util/lib/perl5/site_perl/5.8.0/Bio/SearchIO.pm:129): 129: my($caller, at args) = @_; DB<4> r scalar context return from Bio::SearchIO::new: '_file' => 'G782.2005-08-16-16-48.fasta_sample.blastout' '_filehandle' => GLOB(0x8cef40c) -> *Symbol::GEN1 FileHandle({*Symbol::GEN1}) => fileno(3) '_flush_on_write' => 1 '_handler' => Bio::SearchIO::IteratedSearchResultEventBuilder=HASH(0x9505724) '_factories' => HASH(0x95054c0) 'hit' => Bio::Factory::ObjectFactory=HASH(0x95017b8) '_loaded_types' => HASH(0x9506c0c) 'Bio::Search::Hit::BlastHit' => 1 '_root_verbose' => 0 'interface' => 'Bio::Search::Hit::HitI' 'type' => 'Bio::Search::Hit::BlastHit' 'hsp' => Bio::Factory::ObjectFactory=HASH(0x9500e10) '_loaded_types' => HASH(0x9506c18) 'Bio::Search::HSP::GenericHSP' => 1 '_root_verbose' => 0 'interface' => 'Bio::Search::HSP::HSPI' 'type' => 'Bio::Search::HSP::GenericHSP' 'iteration' => Bio::Factory::ObjectFactory=HASH(0x9506c60) '_loaded_types' => HASH(0x9506af8) 'Bio::Search::Iteration::GenericIteration' => 1 '_root_verbose' => 0 'interface' => 'Bio::Search::Iteration::IterationI' 'type' => 'Bio::Search::Iteration::GenericIteration' 'result' => Bio::Factory::ObjectFactory=HASH(0x9504c80) '_loaded_types' => HASH(0x9501f74) 'Bio::Search::Result::BlastResult' => 1 '_root_verbose' => 0 'interface' => 'Bio::Search::Result::ResultI' 'type' => 'Bio::Search::Result::BlastResult' '_inclusion_threshold' => 0.001 '_root_verbose' => 1 '_handler_cache' => Bio::SearchIO::IteratedSearchResultEventBuilder=HASH(0x9505724) -> REUSED_ADDRESS '_notfirsttime' => 0 '_reporttype' => '' '_root_cleanup_methods' => ARRAY(0x8cde434) 0 CODE(0x82a9aec) -> &Bio::Root::IO::_io_cleanup in /util/lib/perl5/site_perl/5.8.0/Bio/Root/IO.pm:544-570 1 CODE(0x82a9aec) -> REUSED_ADDRESS 
'_root_verbose' => 1 main::parse_blast(454/scripts/fasta_blasta_mb.pl:307): 307: die "no valid $report_obj" unless defined $report_obj; DB<4> s main::parse_blast(454/scripts/fasta_blasta_mb.pl:310): 310: while( my $result = $report_obj->next_result ) { DB<4> s Bio::SearchIO::blast::next_result(/util/lib/perl5/site_perl/5.8.0/Bio/SearchIO/blast.pm:389): 389: my ($self) = @_; DB<4> r scalar context return from Bio::SearchIO::blast::next_result: undef Bio::SearchIO::DESTROY(/util/lib/perl5/site_perl/5.8.0/Bio/SearchIO.pm:438): 438: my $self = shift; DB<4> r scalar context return from Bio::SearchIO::DESTROY: '' Bio::Root::Root::DESTROY(/util/lib/perl5/site_perl/5.8.0/Bio/Root/Root.pm:404): 404: my $self = shift; DB<4> r scalar context return from Bio::Root::Root::DESTROY: undef Bio::Root::Root::DESTROY(/util/lib/perl5/site_perl/5.8.0/Bio/Root/Root.pm:404): 404: my $self = shift; DB<4> r scalar context return from Bio::Root::Root::DESTROY: undef Bio::Root::Root::DESTROY(/util/lib/perl5/site_perl/5.8.0/Bio/Root/Root.pm:404): 404: my $self = shift; DB<4> r scalar context return from Bio::Root::Root::DESTROY: undef Bio::Root::Root::DESTROY(/util/lib/perl5/site_perl/5.8.0/Bio/Root/Root.pm:404): 404: my $self = shift; DB<4> r scalar context return from Bio::Root::Root::DESTROY: undef Bio::Root::Root::DESTROY(/util/lib/perl5/site_perl/5.8.0/Bio/Root/Root.pm:404): 404: my $self = shift; DB<4> r scalar context return from Bio::Root::Root::DESTROY: undef main::(454/scripts/fasta_blasta_mb.pl:155): 155: &output_results(); DB<4> x $result 0 undef Debug of WORKING blast-parse: Parsing blast hits ... 
main::parse_blast(454/scripts/fasta_blasta_mb.pl:302): 302: my $report_obj = new Bio::SearchIO(-verbose => 1, 303: -format => 'blast', 304: -file => $blastoutfile);#or die "Could not open blast report $!"; DB<3> s Bio::SearchIO::new(/util/lib/perl5/site_perl/5.8.0/Bio/SearchIO.pm:129): 129: my($caller, at args) = @_; DB<3> r scalar context return from Bio::SearchIO::new: '_file' => 'G782.2005-08-16-16-48.fasta_sample.blastout' '_filehandle' => GLOB(0x8763100) -> *Symbol::GEN1 FileHandle({*Symbol::GEN1}) => fileno(3) '_flush_on_write' => 1 '_handler' => Bio::SearchIO::IteratedSearchResultEventBuilder=HASH(0x8ab3be4) '_factories' => HASH(0x8ab1594) 'hit' => Bio::Factory::ObjectFactory=HASH(0x8a7b7c0) '_loaded_types' => HASH(0x8abee10) 'Bio::Search::Hit::BlastHit' => 1 '_root_verbose' => 0 'interface' => 'Bio::Search::Hit::HitI' 'type' => 'Bio::Search::Hit::BlastHit' 'hsp' => Bio::Factory::ObjectFactory=HASH(0x8a87200) '_loaded_types' => HASH(0x8abee1c) 'Bio::Search::HSP::GenericHSP' => 1 '_root_verbose' => 0 'interface' => 'Bio::Search::HSP::HSPI' 'type' => 'Bio::Search::HSP::GenericHSP' 'iteration' => Bio::Factory::ObjectFactory=HASH(0x8abee64) '_loaded_types' => HASH(0x8abecfc) 'Bio::Search::Iteration::GenericIteration' => 1 '_root_verbose' => 0 'interface' => 'Bio::Search::Iteration::IterationI' 'type' => 'Bio::Search::Iteration::GenericIteration' 'result' => Bio::Factory::ObjectFactory=HASH(0x8a81a84) '_loaded_types' => HASH(0x8a96ce8) 'Bio::Search::Result::BlastResult' => 1 '_root_verbose' => 0 'interface' => 'Bio::Search::Result::ResultI' 'type' => 'Bio::Search::Result::BlastResult' '_inclusion_threshold' => 0.001 '_root_verbose' => 1 '_handler_cache' => Bio::SearchIO::IteratedSearchResultEventBuilder=HASH(0x8ab3be4) -> REUSED_ADDRESS '_notfirsttime' => 0 '_reporttype' => '' '_root_cleanup_methods' => ARRAY(0x8762efc) 0 CODE(0x82a9aec) -> &Bio::Root::IO::_io_cleanup in /util/lib/perl5/site_perl/5.8.0/Bio/Root/IO.pm:544-570 1 CODE(0x82a9aec) -> REUSED_ADDRESS 
'_root_verbose' => 1 main::parse_blast(454/scripts/fasta_blasta_mb.pl:307): 307: die "no valid $report_obj" unless defined $report_obj; DB<3> s main::parse_blast(454/scripts/fasta_blasta_mb.pl:310): 310: while( my $result = $report_obj->next_result ) { DB<3> s Bio::SearchIO::blast::next_result(/util/lib/perl5/site_perl/5.8.0/Bio/SearchIO/blast.pm:389): 389: my ($self) = @_; DB<3> r blast.pm: unrecognized line Reference: Zheng Zhang, Scott Schwartz, Lukas Wagner, and Webb Miller (2000), blast.pm: unrecognized line "A greedy algorithm for aligning DNA sequences", blast.pm: unrecognized line J Comput Biol 2000; 7(1-2):203-14. blast.pm: unrecognized line Score E Got NCBI HSP score=354, evalue 0.0 scalar context return from Bio::SearchIO::blast::next_result: '_algorithm' => 'MEGABLAST' '_algorithm_version' => '2.2.10 [Oct-19-2004]' '_dbentries' => 4249067 '_dbletters' => 17735149364 '_dbname' => 'All GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS,GSS,environmental samples or phase 0, 1 or 2 HTGS sequences) ' '_hitindex' => 0 '_hits' => ARRAY(0x8b2acd0) empty array '_inclusion_threshold' => 0.001 '_iteration_count' => 1 '_iteration_index' => 0 '_iterations' => ARRAY(0x8b2ac4c) 0 Bio::Search::Iteration::GenericIteration=HASH(0x8b1cacc) '_newhits_below_threshold' => ARRAY(0x8b1ca84) 0 Bio::Search::Hit::BlastHit=HASH(0x8b1cf64) '_accession' => 'AE004091' '_algorithm' => 'MEGABLAST' '_description' => 'Pseudomonas aeruginosa PAO1, complete genome' '_hsps' => ARRAY(0x8b1ceb0) 0 Bio::Search::HSP::GenericHSP=HASH(0x8b2098c) '_algorithm' => 'MEGABLAST' '_frac_conserved' => HASH(0x8b266a0) 'hit' => 0.991803278688525 'query' => 0.991803278688525 'total' => 0.991803278688525 '_frac_identical' => HASH(0x8b2658c) 'hit' => 0.991803278688525 'query' => 0.991803278688525 'total' => 0.991803278688525 '_gaps' => HASH(0x8b24d94) 'hit' => 0 'query' => 0 'total' => 0 '_gsf_tag_hash' => HASH(0x8b20998) empty hash '_hit_string' => 
'cctgacctccgctcaactgcgcaaatacgccagcgccggtcggccgttccccgaagggcgcctgctggccgcctcctgccacgacgcggaggaactggccctggctgcctcgatgggagtggagttcgtcaccctttcgccggtacagccgaccgagagccatcccggcgagccggcgctgggttgggacaaggccgccgaactgatcgccggcttcaaccagccggtctacctgctgggtggcctcggtccgcagcaagccgagcaggcttgggagcatggagcccagggcgtggcgggtatccgtgcgttctggccgggcggcctttgacggtggaatgaagaaaaaaggaggcttcggcctcc' '_homology_string' => '|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ||||||||||||||||||||||||||||||||||||||||| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||' etc...... From torsten.seemann at infotech.monash.edu.au Sun Jul 30 01:41:30 2006 From: torsten.seemann at infotech.monash.edu.au (Torsten Seemann) Date: Sun, 30 Jul 2006 15:41:30 +1000 Subject: [Bioperl-l] Strangeness in parsing blast file In-Reply-To: <44CC3550.5070105@broad.mit.edu> References: <44CC3550.5070105@broad.mit.edu> Message-ID: <44CC468A.40700@infotech.monash.edu.au> > sub do_blast { > my $bf = shift; > my $blastoutput = $bf . ".blastout"; > print "Blasting against $db ...\n"; > `blast/blast-2.2.10/bin/megablast -d /prodinfo/proddata_ntblastdb/nt > -e 1e-50 -p 99 -D 2 -i test -o test.blastout`; > return $blastoutput; > }#end do_blast Should "-o test.blastoutput" be "-o $blastoutput" ? Otherwise you are returning the name of a non-existent file, which naturally Bio::SearchIO will not be able to find a blast result in. Alternatively use Bio::Tools::Run::StandaloneBlast to invoke megablast rather than back-ticks - that way you avoid any intermediate file and get a Bio::SearchIO object back directly. 
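
The failure mode described here (returning the name of a file that the external command never wrote) can be caught at the point of the call. What follows is a minimal sketch using only core Perl; run_to_file is a hypothetical helper name chosen for illustration and is not part of BioPerl. It runs the command in list form, which also avoids backticks in void context, and returns the file name only if the command exited cleanly and left a non-empty file behind.

```perl
use strict;
use warnings;

# Hypothetical helper: run @cmd without a shell, then confirm that $outfile
# was really produced before handing its name back to the caller.
sub run_to_file {
    my ($outfile, @cmd) = @_;

    # system() returns 0 on success; the child's exit code is in $? >> 8.
    system(@cmd) == 0
        or die "command failed (exit status " . ($? >> 8) . "): @cmd\n";

    # -s is true only for a file that exists and has non-zero size.
    -s $outfile
        or die "command succeeded but $outfile is missing or empty\n";

    return $outfile;
}
```

For the megablast call quoted above, the output name would be passed twice: once as the first argument, and once after '-o' in the command list, so that the name being checked and the name being written cannot drift apart.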
--Torsten From nabil at broad.mit.edu Sun Jul 30 10:11:03 2006 From: nabil at broad.mit.edu (Nabil Hafez) Date: Sun, 30 Jul 2006 10:11:03 -0400 Subject: [Bioperl-l] Strangeness in parsing blast file In-Reply-To: <44CC468A.40700@infotech.monash.edu.au> References: <44CC3550.5070105@broad.mit.edu> <44CC468A.40700@infotech.monash.edu.au> Message-ID: <44CCBDF7.2010601@broad.mit.edu> I had modified the variables a bit to try and make them more readable than what is in my code, in my code -o $blastoutput is what it is, like I said, the blast portion works absolutely fine - i.e. the do_blast sub routine is fully functional. here's a cut and paste from my actual code my $MBLAST = "/prodinfo/prod3pty/blast/blast-2.2.10/bin/megablast"; my $blastdb = "/prodinfo/proddata_ntblastdb/nt"; my $e_val = "1e-50"; #Default e-value Getopt_long my $percent_id = "99"; #Default percentage identity my $per_to_sample ="10"; #Default for percentage of reads to sample sub do_blast { my $bf = shift; my $blastoutput = $bf . ".blastout"; print "Blasting against $db ...\n"; `$MBLAST -d $blastdb -e $e_val -p $percent_id -D 2 -i $bf -o $blastoutput`; return $blastoutput; } I will try your suggestion to use the Bio::Tools::Run::StandaloneBlast, is megablast supported by this module? Thanks Nabil Torsten Seemann wrote: > >> sub do_blast { >> my $bf = shift; >> my $blastoutput = $bf . ".blastout"; >> print "Blasting against $db ...\n"; >> `blast/blast-2.2.10/bin/megablast -d >> /prodinfo/proddata_ntblastdb/nt -e 1e-50 -p 99 -D 2 -i test -o >> test.blastout`; > > > return $blastoutput; > > }#end do_blast > > Should "-o test.blastoutput" be "-o $blastoutput" ? > > Otherwise you are returning the name of a non-existent file, which > naturally Bio::SearchIO will not be able to find a blast result in. > > Alternatively use Bio::Tools::Run::StandaloneBlast to invoke megablast > rather than back-ticks - that way you avoid any intermediate file and > get a Bio::SearchIO object back directly. 
> > --Torsten > From bix at sendu.me.uk Sun Jul 30 12:20:54 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Sun, 30 Jul 2006 17:20:54 +0100 Subject: [Bioperl-l] Strangeness in parsing blast file In-Reply-To: <44CCBDF7.2010601@broad.mit.edu> References: <44CC3550.5070105@broad.mit.edu> <44CC468A.40700@infotech.monash.edu.au> <44CCBDF7.2010601@broad.mit.edu> Message-ID: <44CCDC66.2030604@sendu.me.uk> Nabil Hafez wrote: > I had modified the variables a bit to try and make them more readable > than what is in my code, in my code -o $blastoutput is > what it is, like I said, the blast portion works absolutely fine - i.e. > the do_blast sub routine is fully functional. How do you know? > `$MBLAST -d $blastdb -e $e_val -p $percent_id -D 2 -i $bf -o > $blastoutput`; Does this command definitely produce exactly the same file as the one you use to show that parse_blast() does sometimes work (when you avoid using do_blast())? Btw, http://perldoc.perl.org/perlfaq8.html#What's-wrong-with-using-backticks-in-a-void-context%3f > I will try your suggestion to use the Bio::Tools::Run::StandaloneBlast, > is megablast supported by this module? No, it doesn't. You could cheat and call _runblast() directly (give it an executable string and a string of args to megablast), and provide -outfile to new(). From nabil at broad.mit.edu Sun Jul 30 20:13:16 2006 From: nabil at broad.mit.edu (Nabil Hafez) Date: Sun, 30 Jul 2006 20:13:16 -0400 Subject: [Bioperl-l] Strangeness in parsing blast file In-Reply-To: <44CCDC66.2030604@sendu.me.uk> References: <44CC3550.5070105@broad.mit.edu> <44CC468A.40700@infotech.monash.edu.au> <44CCBDF7.2010601@broad.mit.edu> <44CCDC66.2030604@sendu.me.uk> Message-ID: <44CD4B1C.5070907@broad.mit.edu> Sendu Bala wrote: >Nabil Hafez wrote: > > >>I had modified the variables a bit to try and make them more readable >>than what is in my code, in my code -o $blastoutput is >>what it is, like I said, the blast portion works absolutely fine - i.e. 
>>the do_blast sub routine is fully functional. >> >> > >How do you know? > > > Because it creates a file containing all of the blastoutput, this works every time - a file is created with the blastoutput. >> `$MBLAST -d $blastdb -e $e_val -p $percent_id -D 2 -i $bf -o >>$blastoutput`; >> >> > >Does this command definitely produce exactly the same file as the one >you use to show that parse_blast() does sometimes work (when you avoid >using do_blast())? > > > Yes - the exact same file because I produce the file with do_blast() and then when it fails to parse it ends but there is a blastoutput file created in my directory. If i re-run the script again just feeding in the name of the file that was created, it parses it just fine. So basically the parsing works whenever I feed it a blastoupt file but it can't seem to parse the same file that was created and then passed to the parse_blast() subroutine >Btw, >http://perldoc.perl.org/perlfaq8.html#What's-wrong-with-using-backticks-in-a-void-context%3f > >Good to know. Thanks. > > >>I will try your suggestion to use the Bio::Tools::Run::StandaloneBlast, >>is megablast supported by this module? >> >> > >No, it doesn't. You could cheat and call _runblast() directly (give it >an executable string and a string of args to megablast), and provide >-outfile to new(). > > > I still don't think the blast is a problem since I get perfect blastoutput everytime. 
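One way to answer the "How do you know?" question from inside the script, and to follow the perlfaq8 advice about backticks in a void context, is to run the command with system() and check both the exit status and the output file. The following is only a minimal sketch: run_checked() is a made-up helper name, not a BioPerl function, and the megablast call shown in the comment simply reuses the options from do_blast above.

```perl
use strict;
use warnings;

# run_checked() is a hypothetical helper, not part of BioPerl.
# It runs an external command, dies if the command fails, and
# dies if the expected output file is missing or empty.
sub run_checked {
    my ($outfile, @cmd) = @_;

    # The list form of system() avoids the shell and returns the wait status.
    my $status = system(@cmd);
    die "could not start $cmd[0]: $!\n" if $status == -1;
    die sprintf("%s exited with status %d\n", $cmd[0], $status >> 8)
        if $status != 0;

    # -s is true only for an existing file with non-zero size.
    die "$outfile is missing or empty\n" unless -s $outfile;
    return $outfile;
}

# In do_blast the call would look roughly like this (same options as above):
#   run_checked($blastoutput, $MBLAST, '-d', $blastdb, '-e', $e_val,
#               '-p', $percent_id, '-D', 2, '-i', $bf, '-o', $blastoutput);

# Self-contained demo: use the running perl ($^X) as the "external program".
my $demo = "run_checked_demo.$$.txt";
run_checked($demo, $^X, '-e',
    'open(my $fh, ">", $ARGV[0]) or die $!; print $fh "ok\n"; close $fh;',
    $demo);
print "wrote ", -s $demo, " bytes\n";
unlink $demo;
```

A check like this only shows that the command ran and produced a non-empty file; it says nothing about whether a later parser can read that file.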
From cjfields at uiuc.edu Sun Jul 30 22:52:16 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Sun, 30 Jul 2006 21:52:16 -0500 Subject: [Bioperl-l] Strangeness in parsing blast file In-Reply-To: <44CD4B1C.5070907@broad.mit.edu> References: <44CC3550.5070105@broad.mit.edu> <44CC468A.40700@infotech.monash.edu.au> <44CCBDF7.2010601@broad.mit.edu> <44CCDC66.2030604@sendu.me.uk> <44CD4B1C.5070907@broad.mit.edu> Message-ID: <81C49D1F-0468-4B63-8D7A-09E1C48573F0@uiuc.edu> As an aside, BLAST 2.2.13 or later cannot be parsed using Bioperl 1.5.1. You have to update to the latest bioperl-live (from CVS). Chris Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign
From bix at sendu.me.uk Mon Jul 31 04:29:28 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Mon, 31 Jul 2006 09:29:28 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields) In-Reply-To: <5C765CAF-4976-45D1-BDA4-1E432081FF4D@uiuc.edu> References: <001301c6b24b$da38ba80$15327e82@pyrimidine> <44CB144C.6050507@sendu.me.uk> <5C765CAF-4976-45D1-BDA4-1E432081FF4D@uiuc.edu> Message-ID: <44CDBF68.2040803@sendu.me.uk> Chris Fields wrote: > On Jul 29, 2006, at 2:54 AM, Sendu Bala wrote: > >>> http://www.bioperl.org/wiki/HOWTO:Trees >> Which is why the Trees HOWTO talks about taxa, and a number of the >> Taxonomy modules have phylogenetic methods like get_lca. (And why >> there >> is Bio::Taxonomy::Taxon and Tree.) > > Are we still thinking about deprecating those?
I have seen very > little mention of those modules from the mail list archives, and > Jason mentioned that Bio::Taxonomy::Taxon hasn't been modified in a > long time. Yes, they would both be redundant and nonsensical with the planned changes to Bio::Species. From Xianjun.Dong at bccs.uib.no Mon Jul 31 07:55:59 2006 From: Xianjun.Dong at bccs.uib.no (Xianjun Dong) Date: Mon, 31 Jul 2006 13:55:59 +0200 Subject: [Bioperl-l] PAML + Codeml problem.. In-Reply-To: 4A98ACB8EC146149872BAC9A132A582C277AC4@icex5.ic.ac.uk Message-ID: <1154346960.6517.19.camel@lauvtre.ii.uib.no> Hi, I have a problem during running the Codeml Wiki-HOWTO code: Here is the error message: //////////////////////////////////////////////////////////////// [xianjund at lauvtre kaks]$ perl paml.pl test.fa -------------------- WARNING --------------------- MSG: There was an error - see error_string for the program output STACK Bio::Tools::Run::Phylo::PAML::Codeml::run /Home/extern/xianjund/src/bioperl/bioperl-run/Bio/Tools/Run/Phylo/PAML/Codeml.pm:581 STACK toplevel paml.pl:61 ------------- EXCEPTION: Bio::Root::NotImplemented ------------- MSG: Unknown format of PAML output STACK: Error::throw STACK: Bio::Root::Root::throw /usr/lib/perl5/site_perl/5.8.5/Bio/Root/Root.pm:328 STACK: Bio::Tools::Phylo::PAML::_parse_summary /usr/lib/perl5/site_perl/5.8.5/Bio/Tools/Phylo/PAML.pm:359 STACK: Bio::Tools::Phylo::PAML::next_result /usr/lib/perl5/site_perl/5.8.5/Bio/Tools/Phylo/PAML.pm:224 STACK: paml.pl:62 ---------------------------------------------------------------- //////////////////////////////////////////////////////////////// My test sequence is: >ENST00000361390 
ATACCCATGGCCAACCTCCTACTCCTCATTGTACCCATTCTAATCGCAATGGCATTCCTAATGCTTACCGAACGAAAAATTCTAGGCTATATACAACTACGCAAAGGCCCCAACGTTGTAGGCCCCTACGGGCTACTACAACCCTTCGCTGACGCCATAAAACTCTTCACCAAAGAGCCCCTAAAACCCGCCACATCTACCATCACCCTCTACATCACCGCCCCGACCTTAGCTCTCACCATCGCTCTTCTACTATGAACCCCCCTCCCCATACCCAACCCCCTGGTCAACCTCAACCTAGGCCTCCTATTTATTCTAGCCACCTCTAGCCTAGCCGTTTACTCAATCCTCTGATCAGGGTGAGCATCAAACTCAAACTACGCCCTGATCGGCGCACTGCGAGCAGTAGCCCAAACAATCTCATATGAAGTCACCCTAGCCATCATTCTACTATCAACATTACTAATAAGTGGCTCCTTTAACCTCTCCACCCTTATCACAACACAAGAACACCTCTGATTACTCCTGCCATCATGACCCTTGGCCATAATATGATTTATCTCCACACTAGCAGAGACCAACCGAACCCCCTTCGACCTTGCCGAAGGGGAGTCCGAACTAGTCTCAGGCTTCAACATCGAATACGCCGCAGGCCCCTTCGCCCTATTCTTCATAGCCGAATACACAAACATTATTATAATAAACACCCTCACCACTACAATCTTCCTAGGAACAACATATGACGCACTCTCCCCTGAACTCTACACAACATATTTTGTCACCAAGACCCTACTTCTAACCTCCCTGTTCTTATGAATTCGAACAGCATACCCCCGATTCCGCTACGACCAACTCATACACCTCCTATGAAAAAACTTCCTACCACTCACCCTAGCATTACTTATATGATATGTCTCCATACCCATTACAATCTCCAGCATTCCCCCTCAAACCTAA >ENSMUST00000082392 GTGTTCTTTATTAATATCCTAACACTCCTCGTCCCCATTCTAATCGCCATAGCCTTCCTAACATTAGTAGAACGCAAAATCTTAGGGTACATACAACTACGAAAAGGCCCTAACATTGTTGGTCCATACGGCATTTTACAACCATTTGCAGACGCCATAAAATTATTTATAAAAGAACCAATACGCCCTTTAACAACCTCTATATCCTTATTTATTATTGCACCTACCCTATCACTCACACTAGCATTAAGTCTATGAGTTCCCCTACCAATACCACACCCATTAATTAATTTAAACCTAGGGATTTTATTTATTTTAGCAACATCTAGCCTATCAGTTTACTCCATTCTATGATCAGGATGAGCCTCAAACTCCAAATACTCACTATTCGGAGCTTTACGAGCCGTAGCCCAAACAATTTCATATGAAGTAACCATAGCTATTATCCTTTTATCAGTTCTATTAATAAATGGATCCTACTCTCTACAAACACTTATTACAACCCAAGAACACATATGATTACTTCTGCCAGCCTGACCCATAGCCATAATATGATTTATCTCAACCCTAGCAGAAACAAACCGGGCCCCCTTCGACCTGACAGAAGGAGAATCAGAATTAGTATCAGGGTTTAACGTAGAATACGCAGCCGGCCCATTCGCGTTATTCTTTATAGCAGAGTACACTAACATTATTCTAATAAACGCCCTAACAACTATTATCTTCCTAGGACCCCTATACTATATCAATTTACCAGAACTCTACTCAACTAACTTCATAATAGAAGCTCTACTACTATCATCAACATTCCTATGGATCCGAGCATCTTATCCACGCTTCCGTTACGATCAACTTATACATCTTCTATGAAAAAACTTTCTACCCCTAACACTAGCATTATGTATGTGACATATTTCTTTACCAATTTTTACAGCGGGAGTACCACCATACATATAG Sure, I checked it. There is some stop codon in it. 
If I replace it with non-stop codon, it works. For example, >ENST00000361390 ATACCCATGGCCAACCTCCTACTCCTCATTGTACCCATTCcaaTCGCAATGGCATTCCcaaTGCTTACCGAACGAAAAATTCcaaGCTATATACAACTACGCAAAGGCCCCAACGTTGcaaGCCCCTACGGGCTACTACAACCCTTCGCcaaCGCCAcaaAACTCTTCACCAAAGAGCCCCcaaAACCCGCCACATCTACCATCACCCTCTACATCACCGCCCCGACCTcaaCTCTCACCATCGCTCTTCTACTAcaaACCCCCCTCCCCATACCCAACCCCCTGGTCAACCTCAACCcaaGCCTCCTATTTATTCcaaCCACCTCcaaCCcaaCCGTTTACTCAATCCTCcaaTCAGGGcaaGCATCAAACTCAAACTACGCCCcaaTCGGCGCACTGCGAGCAGcaaCCCAAACAATCTCATAcaaAGTCACCCcaaCCATCATTCTACTATCAACATTACcaacaaGTGGCTCCTTcaaCCTCTCCACCCTTATCACAACACAAGAACACCTCcaaTTACTCCTGCCATCAcaaCCCTTGGCCAcaaTAcaaTTTATCTCCACACcaaCAGAGACCAACCGAACCCCCTTCGACCTTGCCGAAGGGGAGTCCGAACcaaTCTCAGGCTTCAACATCGAATACGCCGCAGGCCCCTTCGCCCTATTCTTCAcaaCCGAATACACAAACATTATTAcaacaaACACCCTCACCACTACAATCTTCCcaaGAACAACATAcaaCGCACTCTCCCCcaaACTCTACACAACATATTTTGTCACCAAGACCCTACTTCcaaCCTCCCTGTTCTTAcaaATTCGAACAGCATACCCCCGATTCCGCTACGACCAACTCATACACCTCCTAcaaAAAAACTTCCTACCACTCACCCcaaCATTACTTATAcaaTATGTCTCCATACCCATTACAATCTCCAGCATTCCCCCTCAAACCcaa >ENSMUST00000082392 
GTGTTCTTTATcaaTATCCcaaCACTCCTCGTCCCCATTCcaaTCGCCAcaaCCTTCCcaaCATcaacaaAACGCAAAATCTcaaGGTACATACAACTACGAAAAGGCCCcaaCATTGTTGGTCCATACGGCATTTTACAACCATTTGCAGACGCCAcaaAATTATTTAcaaAAGAACCAATACGCCCTTcaaCAACCTCTATATCCTTATTTATTATTGCACCTACCCTATCACTCACACcaaCATcaaGTCTAcaaGTTCCCCTACCAATACCACACCCATcaaTcaaTTcaaACCcaaGGATTTTATTTATTTcaaCAACATCcaaCCTATCAGTTTACTCCATTCTAcaaTCAGGAcaaGCCTCAAACTCCAAATACTCACTATTCGGAGCTTTACGAGCCGcaaCCCAAACAATTTCATAcaaAGcaaCCAcaaCTATTATCCTTTTATCAGTTCTATcaacaaATGGATCCTACTCTCTACAAACACTTATTACAACCCAAGAACACATAcaaTTACTTCTGCCAGCCcaaCCCAcaaCCAcaaTAcaaTTTATCTCAACCCcaaCAGAAACAAACCGGGCCCCCTTCGACCcaaCAGAAGGAGAATCAGAATcaaTATCAGGGTTcaaCGcaaAATACGCAGCCGGCCCATTCGCGTTATTCTTTAcaaCAGAGTACACcaaCATTATTCcaacaaACGCCCcaaCAACTATTATCTTCCcaaGACCCCTATACTATATCAATTTACCAGAACTCTACTCAACcaaCTTCAcaacaaAAGCTCTACTACTATCATCAACATTCCTATGGATCCGAGCATCTTATCCACGCTTCCGTTACGATCAACTTATACATCTTCTAcaaAAAAACTTTCTACCCCcaaCACcaaCATTATGTATGcaaCATATTTCTTTACCAATTTTTACAGCGGGAGTACCACCATACATAcaa But my question is: it does not occur in the codon position (say, the third codon's position is not a times of 3). Why it effect the result? And also there is code to filter out the stop codon in the sample code (as the following shown) /////////////////////////////// if( $pseq =~ /\*/ && $pseq !~ /\*$/ ) { warn("provided a CDS sequence with a stop codon, PAML will choke!"); exit(0); } # Tcoffee can't handle '*' even if it is trailing $pseq =~ s/\*//g; ///////////////////////////// So, when translate back from aa_aln to dna_aln, there should be no stop codon included. SO, why it can not pass? Thanks for answer! 
P.S: attach my code here: ///////////////////////////////////////////////////////// #!/usr/bin/perl -w use strict; use Bio::Tools::Run::Phylo::PAML::Codeml; use Bio::Tools::Run::Alignment::Clustalw; # for projecting alignments from protein to R/DNA space use Bio::Align::Utilities qw(aa_to_dna_aln); # for input of the sequence data use Bio::SeqIO; use Bio::AlignIO; my $aln_factory = Bio::Tools::Run::Alignment::Clustalw->new('quiet'=>1); my $seqdata = shift || 'test.fa'; my $seqio = new Bio::SeqIO(-file => $seqdata, -format => 'fasta'); my %seqs; my @prots; # process each sequence while ( my $seq = $seqio->next_seq ) { $seqs{$seq->display_id} = $seq; # translate them into protein my $protein = $seq->translate(); my $pseq = $protein->seq(); if( $pseq =~ /\*/ && $pseq !~ /\*$/ ) { warn("provided a CDS sequence with a stop codon, PAML will choke!"); exit(0); } # Tcoffee can't handle '*' even if it is trailing $pseq =~ s/\*//g; $protein->seq($pseq); push @prots, $protein; } if( @prots < 2 ) { warn("Need at least 2 CDS sequences to proceed"); exit(0); } # open(OUT, ">align_output.txt") || die("cannot open output align_output for writing"); # Align the sequences with clustalw my $aa_aln = $aln_factory->align(\@prots); # project the protein alignment back to CDS coordinates my $dna_aln = aa_to_dna_aln($aa_aln, \%seqs); my @each = $dna_aln->each_seq(); my $kaks_factory = Bio::Tools::Run::Phylo::PAML::Codeml->new ( -params => { 'runmode' => -2, 'seqtype' => 1, }, -save_tempfiles => 1, -verbose => 1); # set the alignment object $kaks_factory->alignment($dna_aln); # run the KaKs analysis my ($rc,$parser) = $kaks_factory->run(); my $result = $parser->next_result; my $MLmatrix = $result->get_MLmatrix(); my @otus = $result->get_seqs(); # this gives us a mapping from the PAML order of sequences back to # the input order (since names get truncated) my @pos = map { my $c= 1; foreach my $s ( @each ) { last if( $s->display_id eq $_->display_id ); $c++; } $c; } @otus; print join("\t", 
qw(SEQ1 SEQ2 Ka Ks Ka/Ks PROT_PERCENTID CDNA_PERCENTID)),"\n"; for( my $i = 0; $i < (scalar @otus -1) ; $i++) { for( my $j = $i+1; $j < (scalar @otus); $j++ ) { my $sub_aa_aln = $aa_aln->select_noncont($pos[$i],$pos[$j]); my $sub_dna_aln = $dna_aln->select_noncont($pos[$i],$pos[$j]); print join("\t", $otus[$i]->display_id, $otus[$j]->display_id,$MLmatrix->[$i]->[$j]->{'dN'}, $MLmatrix->[$i]->[$j]->{'dS'}, $MLmatrix->[$i]->[$j]->{'omega'}, sprintf("%.2f",$sub_aa_aln->percentage_identity), sprintf("%.2f",$sub_dna_aln->percentage_identity), ), "\n"; } } -- Xianjun Dong PhD Student Computational Biology Unit Bergen Center for Computational Science University of Bergen Høyteknologisenteret, Thormøhlensgate 55 N-5008 Bergen,Norway. Webpage: http://www.ii.uib.no/~xianjund/ MSN: sterding at hotmail.com Phone No: +47 - 55584354 (office) +47 - 47361688 (mobile) Fax No: +47 - 55584295
From golharam at umdnj.edu Mon Jul 31 11:20:33 2006 From: golharam at umdnj.edu (Ryan Golhar) Date: Mon, 31 Jul 2006 11:20:33 -0400 Subject: [Bioperl-l] PAML + Codeml problem.. In-Reply-To: <1154346960.6517.19.camel@lauvtre.ii.uib.no> Message-ID: <027201c6b4b4$ddc201f0$2f01a8c0@GOLHARMOBILE1> Hi Xianjun, I just did some work on this module including the example.
>> it does not occur in the codon position >>(say, the third codon's position is not a times of 3). >>Why it effect the result?
If I'm interpreting your question correctly, the stop codons in your sequence occur in-frame. This is why it is choking.
>>So, when translate back from aa_aln to dna_aln, there should be no stop codon included. SO, why it can not pass?
The Ka and Ks statistics are not calculated from the protein sequence; they are calculated from the DNA sequence. The protein sequence is used to provide an alignment for the codons of the DNA sequence. Checking the protein sequence for * is an easier way to identify in-frame stop codons than scanning the DNA sequence.
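To make "in-frame" concrete, here is a minimal sketch that walks a CDS codon by codon and reports internal stop codons. It assumes the standard genetic code and reading frame 0; in_frame_stops() and both toy sequences are made up for this illustration and are not part of the HOWTO or of BioPerl.

```perl
use strict;
use warnings;

# Return the 1-based codon numbers of stop codons that fall in reading
# frame 0, ignoring a stop in the final codon (which is expected in a CDS).
# Assumes the standard genetic code (TAA, TAG, TGA).
sub in_frame_stops {
    my ($dna) = @_;
    $dna = uc $dna;
    my %is_stop = map { $_ => 1 } qw(TAA TAG TGA);
    my $ncodons = int(length($dna) / 3);
    my @hits;
    for my $i (0 .. $ncodons - 1) {
        my $codon = substr($dna, 3 * $i, 3);
        next unless $is_stop{$codon};
        next if $i == $ncodons - 1;    # terminal stop codon is fine
        push @hits, $i + 1;
    }
    return @hits;
}

# Toy sequences, invented for the example.
my $in_frame     = "ATGTGACCCTAA";    # ATG TGA CCC TAA: TGA is codon 2
my $out_of_frame = "ATGCTGACCTAA";    # ATG CTG ACC TAA: "TGA" spans codons 2 and 3

print "in-frame stops at codon(s): ", join(",", in_frame_stops($in_frame)), "\n";
print "number of in-frame stops:   ", scalar(in_frame_stops($out_of_frame)), "\n";
```

The second toy sequence contains the string TGA but no internal stop codon, because the three letters straddle a codon boundary. Only occurrences on a codon boundary are seen by translate() as *, so a search-and-replace on every TAA, TAG or TGA string regardless of position will also alter codons that were never stops.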
The first of the two checks you mentioned looks for stop codons within the sequence, disregarding the last amino acid. The second part removes the * at the end of the sequence (not the middle). If you want to remove the in-frame stop codons, you can, but do so before translating to protein sequences. Ryan
From nabil at broad.mit.edu Mon Jul 31 14:57:48 2006 From: nabil at broad.mit.edu (Nabil Hafez) Date: Mon, 31 Jul 2006 14:57:48 -0400 Subject: [Bioperl-l] Strangeness in parsing blast file In-Reply-To: <44CC3550.5070105@broad.mit.edu> References: <44CC3550.5070105@broad.mit.edu> Message-ID: <44CE52AC.4080108@broad.mit.edu> I have figured out the problem - not a problem with Bioperl. In my create_sample_file() subroutine I defined $/ = '>'; #define fasta record input seperator when it should have been this local $/ = "\n>"; the use of local made a big difference. Thanks to all for your help. Nabil Hafez Nabil Hafez wrote: > Hi, > I am having a somewhat similar problem to what was posted in > http://bioperl.org/pipermail/bioperl-l/2006-May/021416.html > however, I have read through all of that thread and I don't believe what > I am > experiencing is the exact same problem. I also realize that the Bioperl > version 1.5.1 > fixes a problem with blast parsing. > > My problem: > My blastresults file parses fine and everything works swimmingly if > I pass > the blast output file by name such as > $blast_result = 'test.blastout'; > > however when I do > $blast_result = &do_blast($sample_fasta); > > even though in both cases $blast_result evaluate to "test.blastout", the > parsing doesn't work, more specifically > it gets an undefined variable for $result in while( my $result = > $report_obj->next_result ) { > > Sorr y for the long email - any help would be appreciated, > Thanks - Nabil > > > The code...non releavant parts trimmed for size constraints....debugging > from working and non-working > versions after the code.
> > use strict; > use Bio::SearchIO; > use Getopt::Std; > use List::Util qw(shuffle); > use Benchmark; > > my ($inputfile, $samplefile, $blastfile, $blast_result, $blast_report, > $blast_verbose); #files generated > > > #------------------# > # Subroutine Calls # > #------------------# > > my $test = &create_sample_file($inputfile); #inputfile being a fasta > file containing nucleotide sequence > $blast_result = &do_blast($test); > ##$blast_result = 'test.blastout'; #when this is uncommented and > replace the previous two lines with test.blastout being normal blast > output - the script works fine. > &parse_blast($blast_result); > > > ####################### > # create_sample_file > # > # Input: Original Fasta File > # > # Output: Fasta file containing randomly sampled reads > # > # > sub create_sample_file { > my $in = shift; > my $linecount = 0; > my @lines; > > $samplefile = $in . "_sample"; > > #Determine total # of reads in input fasta > $totalreads = `$grep -c '>' $inputfile`; > $totalreads =~ s/\s+//; > chomp $totalreads; > > if ($totalreads > 1000) { #sample if more than 1000 reads > $sample_reads = sprintf("%.0f", $totalreads * > ($per_to_sample/100)); #number of reads to sample > } > else { #otherwise use all reads > $sample_reads = $totalreads; > } > > $/ = '>'; #define fasta record input seperator > > open (IN, "<$in") or die "Cannot open $in $!\n"; > open (OUT, ">$samplefile") or die "Cannot open $samplefile $!\n"; > > > while () { #read lines into an array > chomp; > push (@lines, $_); > } > > @lines = shuffle(@lines); #shuffle array > foreach (@lines) { > print OUT ">$_" if $linecount <= $sample_reads; #output to > file sampled number of reads > $linecount++; > } > > close IN; > close OUT; > > return $samplefile; > > }#end create_sample_file > > > ####################### > # do_blast > # > # Input: Fasta File containing SCREENED sampled reads > # > # Output: Blast File > # > # > > sub do_blast { > my $bf = shift; > my $blastoutput = $bf . 
".blastout"; > > print "Blasting against $db ...\n"; > > `blast/blast-2.2.10/bin/megablast -d /prodinfo/proddata_ntblastdb/nt > -e 1e-50 -p 99 -D 2 -i test -o test.blastout`; > > return $blastoutput; > > }#end do_blast > > > > ####################### > # parse_blast > # > # Input: Blast file > # > # Output: Creates hash containing best hit for each read > # > # > > sub parse_blast { > my $blastoutfile = shift; > > if (! -e $blastoutfile) { > die "$blastoutfile does not exist $!\n"; > } > > print "Parsing blast hits ...\n"; > > > my $report_obj = new Bio::SearchIO(-verbose => 1, > -format => 'blast', > -file => $blastoutfile); > > > die "no valid $report_obj" unless defined $report_obj; > > > while( my $result = $report_obj->next_result ) { > die "no valid $result" unless defined $result; > while( my $hit = $result->next_hit ) { > while( my $hsp = $hit->next_hsp ) { > my $name = $result->query_name; > my $hitDesc = $hit->description; > my $length = $hsp->length('total'); > my $per_id = sprintf("%.2f", $hsp->percent_identity); > my $eval = $hsp->evalue; > next if (defined $blast_results{$name} && > $blast_results{$name}->[0] > $length); #only keep best hit for any read > $blast_results{$name} = [$length, $per_id, $eval, $hitDesc]; > #store in hash of arrays > } > } > } > > } #end parse_blast > > > > > > Debug of NON-working blast-parse: > > main::(454/scripts/fasta_blasta_mb.pl:151): > 151: my $sample_fasta = &create_sample_file($inputfile); > DB<2> n > main::(454/scripts/fasta_blasta_mb.pl:152): > 152: $blast_result = &do_blast($sample_fasta); > DB<2> x $sample_fasta > 0 'G782.2005-08-16-16-48.fasta_sample' > DB<3> n > Blasting against NT ... > main::(454/scripts/fasta_blasta_mb.pl:154): > 154: &parse_blast($blast_result); > DB<3> s > main::parse_blast(454/scripts/fasta_blasta_mb.pl:293): > 293: my $blastoutfile = shift; > DB<3> s > main::parse_blast(454/scripts/fasta_blasta_mb.pl:295): > 295: if (! 
-e $blastoutfile) { > DB<3> x $blastoutfile > 0 'G782.2005-08-16-16-48.fasta_sample.blastout' > DB<4> s > main::parse_blast(454/scripts/fasta_blasta_mb.pl:299): > 299: print "Parsing blast hits ...\n"; > DB<4> s > Parsing blast hits ... > main::parse_blast(454/scripts/fasta_blasta_mb.pl:302): > 302: my $report_obj = new Bio::SearchIO(-verbose => 1, > 303: -format => 'blast', > 304: -file => > $blastoutfile);#or die "Could not open blast report $!"; > DB<4> s > Bio::SearchIO::new(/util/lib/perl5/site_perl/5.8.0/Bio/SearchIO.pm:129): > 129: my($caller, at args) = @_; > DB<4> r > scalar context return from Bio::SearchIO::new: '_file' => > 'G782.2005-08-16-16-48.fasta_sample.blastout' > '_filehandle' => GLOB(0x8cef40c) > -> *Symbol::GEN1 > FileHandle({*Symbol::GEN1}) => fileno(3) > '_flush_on_write' => 1 > '_handler' => > Bio::SearchIO::IteratedSearchResultEventBuilder=HASH(0x9505724) > '_factories' => HASH(0x95054c0) > 'hit' => Bio::Factory::ObjectFactory=HASH(0x95017b8) > '_loaded_types' => HASH(0x9506c0c) > 'Bio::Search::Hit::BlastHit' => 1 > '_root_verbose' => 0 > 'interface' => 'Bio::Search::Hit::HitI' > 'type' => 'Bio::Search::Hit::BlastHit' > 'hsp' => Bio::Factory::ObjectFactory=HASH(0x9500e10) > '_loaded_types' => HASH(0x9506c18) > 'Bio::Search::HSP::GenericHSP' => 1 > '_root_verbose' => 0 > 'interface' => 'Bio::Search::HSP::HSPI' > 'type' => 'Bio::Search::HSP::GenericHSP' > 'iteration' => Bio::Factory::ObjectFactory=HASH(0x9506c60) > '_loaded_types' => HASH(0x9506af8) > 'Bio::Search::Iteration::GenericIteration' => 1 > '_root_verbose' => 0 > 'interface' => 'Bio::Search::Iteration::IterationI' > 'type' => 'Bio::Search::Iteration::GenericIteration' > 'result' => Bio::Factory::ObjectFactory=HASH(0x9504c80) > '_loaded_types' => HASH(0x9501f74) > 'Bio::Search::Result::BlastResult' => 1 > '_root_verbose' => 0 > 'interface' => 'Bio::Search::Result::ResultI' > 'type' => 'Bio::Search::Result::BlastResult' > '_inclusion_threshold' => 0.001 > '_root_verbose' => 1 > 
'_handler_cache' => > Bio::SearchIO::IteratedSearchResultEventBuilder=HASH(0x9505724) > -> REUSED_ADDRESS > '_notfirsttime' => 0 > '_reporttype' => '' > '_root_cleanup_methods' => ARRAY(0x8cde434) > 0 CODE(0x82a9aec) > -> &Bio::Root::IO::_io_cleanup in > /util/lib/perl5/site_perl/5.8.0/Bio/Root/IO.pm:544-570 > 1 CODE(0x82a9aec) > -> REUSED_ADDRESS > '_root_verbose' => 1 > main::parse_blast(454/scripts/fasta_blasta_mb.pl:307): > 307: die "no valid $report_obj" unless defined $report_obj; > DB<4> s > main::parse_blast(454/scripts/fasta_blasta_mb.pl:310): > 310: while( my $result = $report_obj->next_result ) { > DB<4> s > Bio::SearchIO::blast::next_result(/util/lib/perl5/site_perl/5.8.0/Bio/SearchIO/blast.pm:389): > 389: my ($self) = @_; > DB<4> r > scalar context return from Bio::SearchIO::blast::next_result: undef > Bio::SearchIO::DESTROY(/util/lib/perl5/site_perl/5.8.0/Bio/SearchIO.pm:438): > 438: my $self = shift; > DB<4> r > scalar context return from Bio::SearchIO::DESTROY: '' > Bio::Root::Root::DESTROY(/util/lib/perl5/site_perl/5.8.0/Bio/Root/Root.pm:404): > 404: my $self = shift; > DB<4> r > scalar context return from Bio::Root::Root::DESTROY: undef > Bio::Root::Root::DESTROY(/util/lib/perl5/site_perl/5.8.0/Bio/Root/Root.pm:404): > 404: my $self = shift; > DB<4> r > scalar context return from Bio::Root::Root::DESTROY: undef > Bio::Root::Root::DESTROY(/util/lib/perl5/site_perl/5.8.0/Bio/Root/Root.pm:404): > 404: my $self = shift; > DB<4> r > scalar context return from Bio::Root::Root::DESTROY: undef > Bio::Root::Root::DESTROY(/util/lib/perl5/site_perl/5.8.0/Bio/Root/Root.pm:404): > 404: my $self = shift; > DB<4> r > scalar context return from Bio::Root::Root::DESTROY: undef > Bio::Root::Root::DESTROY(/util/lib/perl5/site_perl/5.8.0/Bio/Root/Root.pm:404): > 404: my $self = shift; > DB<4> r > scalar context return from Bio::Root::Root::DESTROY: undef > main::(454/scripts/fasta_blasta_mb.pl:155): > 155: &output_results(); > DB<4> x $result > 0 undef > > > > Debug 
of WORKING blast-parse: > Parsing blast hits ... > main::parse_blast(454/scripts/fasta_blasta_mb.pl:302): > 302: my $report_obj = new Bio::SearchIO(-verbose => 1, > 303: -format => 'blast', > 304: -file => > $blastoutfile);#or die "Could not open blast report $!"; > DB<3> s > Bio::SearchIO::new(/util/lib/perl5/site_perl/5.8.0/Bio/SearchIO.pm:129): > 129: my($caller, at args) = @_; > DB<3> r > scalar context return from Bio::SearchIO::new: '_file' => > 'G782.2005-08-16-16-48.fasta_sample.blastout' > '_filehandle' => GLOB(0x8763100) > -> *Symbol::GEN1 > FileHandle({*Symbol::GEN1}) => fileno(3) > '_flush_on_write' => 1 > '_handler' => > Bio::SearchIO::IteratedSearchResultEventBuilder=HASH(0x8ab3be4) > '_factories' => HASH(0x8ab1594) > 'hit' => Bio::Factory::ObjectFactory=HASH(0x8a7b7c0) > '_loaded_types' => HASH(0x8abee10) > 'Bio::Search::Hit::BlastHit' => 1 > '_root_verbose' => 0 > 'interface' => 'Bio::Search::Hit::HitI' > 'type' => 'Bio::Search::Hit::BlastHit' > 'hsp' => Bio::Factory::ObjectFactory=HASH(0x8a87200) > '_loaded_types' => HASH(0x8abee1c) > 'Bio::Search::HSP::GenericHSP' => 1 > '_root_verbose' => 0 > 'interface' => 'Bio::Search::HSP::HSPI' > 'type' => 'Bio::Search::HSP::GenericHSP' > 'iteration' => Bio::Factory::ObjectFactory=HASH(0x8abee64) > '_loaded_types' => HASH(0x8abecfc) > 'Bio::Search::Iteration::GenericIteration' => 1 > '_root_verbose' => 0 > 'interface' => 'Bio::Search::Iteration::IterationI' > 'type' => 'Bio::Search::Iteration::GenericIteration' > 'result' => Bio::Factory::ObjectFactory=HASH(0x8a81a84) > '_loaded_types' => HASH(0x8a96ce8) > 'Bio::Search::Result::BlastResult' => 1 > '_root_verbose' => 0 > 'interface' => 'Bio::Search::Result::ResultI' > 'type' => 'Bio::Search::Result::BlastResult' > '_inclusion_threshold' => 0.001 > '_root_verbose' => 1 > '_handler_cache' => > Bio::SearchIO::IteratedSearchResultEventBuilder=HASH(0x8ab3be4) > -> REUSED_ADDRESS > '_notfirsttime' => 0 > '_reporttype' => '' > '_root_cleanup_methods' => 
ARRAY(0x8762efc) > 0 CODE(0x82a9aec) > -> &Bio::Root::IO::_io_cleanup in > /util/lib/perl5/site_perl/5.8.0/Bio/Root/IO.pm:544-570 > 1 CODE(0x82a9aec) > -> REUSED_ADDRESS > '_root_verbose' => 1 > main::parse_blast(454/scripts/fasta_blasta_mb.pl:307): > 307: die "no valid $report_obj" unless defined $report_obj; > DB<3> s > main::parse_blast(454/scripts/fasta_blasta_mb.pl:310): > 310: while( my $result = $report_obj->next_result ) { > DB<3> s > Bio::SearchIO::blast::next_result(/util/lib/perl5/site_perl/5.8.0/Bio/SearchIO/blast.pm:389): > 389: my ($self) = @_; > DB<3> r > blast.pm: unrecognized line Reference: Zheng Zhang, Scott Schwartz, > Lukas Wagner, and Webb Miller (2000), > blast.pm: unrecognized line "A greedy algorithm for aligning DNA > sequences", > blast.pm: unrecognized line J Comput Biol 2000; 7(1-2):203-14. > blast.pm: unrecognized > line > Score E > Got NCBI HSP score=354, evalue 0.0 > scalar context return from Bio::SearchIO::blast::next_result: > '_algorithm' => 'MEGABLAST' > '_algorithm_version' => '2.2.10 [Oct-19-2004]' > '_dbentries' => 4249067 > '_dbletters' => 17735149364 > '_dbname' => 'All GenBank+EMBL+DDBJ+PDB sequences (but no EST, > STS,GSS,environmental samples or phase 0, 1 or 2 HTGS sequences) ' > '_hitindex' => 0 > '_hits' => ARRAY(0x8b2acd0) > empty array > '_inclusion_threshold' => 0.001 > '_iteration_count' => 1 > '_iteration_index' => 0 > '_iterations' => ARRAY(0x8b2ac4c) > 0 Bio::Search::Iteration::GenericIteration=HASH(0x8b1cacc) > '_newhits_below_threshold' => ARRAY(0x8b1ca84) > 0 Bio::Search::Hit::BlastHit=HASH(0x8b1cf64) > '_accession' => 'AE004091' > '_algorithm' => 'MEGABLAST' > '_description' => 'Pseudomonas aeruginosa PAO1, complete genome' > '_hsps' => ARRAY(0x8b1ceb0) > 0 Bio::Search::HSP::GenericHSP=HASH(0x8b2098c) > '_algorithm' => 'MEGABLAST' > '_frac_conserved' => HASH(0x8b266a0) > 'hit' => 0.991803278688525 > 'query' => 0.991803278688525 > 'total' => 0.991803278688525 > '_frac_identical' => HASH(0x8b2658c) > 'hit' => 
0.991803278688525 > 'query' => 0.991803278688525 > 'total' => 0.991803278688525 > '_gaps' => HASH(0x8b24d94) > 'hit' => 0 > 'query' => 0 > 'total' => 0 > '_gsf_tag_hash' => HASH(0x8b20998) > empty hash > '_hit_string' => > 'cctgacctccgctcaactgcgcaaatacgccagcgccggtcggccgttccccgaagggcgcctgctggccgcctcctgccacgacgcggaggaactggccctggctgcctcgatgggagtggagttcgtcaccctttcgccggtacagccgaccgagagccatcccggcgagccggcgctgggttgggacaaggccgccgaactgatcgccggcttcaaccagccggtctacctgctgggtggcctcggtccgcagcaagccgagcaggcttgggagcatggagcccagggcgtggcgggtatccgtgcgttctggccgggcggcctttgacggtggaatgaagaaaaaaggaggcttcggcctcc' > '_homology_string' => > '|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| > ||||||||||||||||||||||||||||||||||||||||| > ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| > |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||' > etc...... > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From andreo_beck at yahoo.com Mon Jul 31 22:59:30 2006 From: andreo_beck at yahoo.com (Andreo Beck) Date: Mon, 31 Jul 2006 19:59:30 -0700 (PDT) Subject: [Bioperl-l] $hit_object->frac_aligned_hit/$hit_object->frac_aligned_query Message-ID: <20060801025930.96806.qmail@web55705.mail.re3.yahoo.com> Hi, Can $hit_object->frac_aligned_hit or $hit_object->frac_aligned_query give outputs > 1 ? I get some > 1 values. Does using the parentheses (e.g. $hit_object->frac_aligned_hit()) make any difference? Thanks, Andy --------------------------------- Talk is cheap. Use Yahoo! Messenger to make PC-to-Phone calls. Great rates starting at 1?/min. 
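A note on the question above, offered as a sketch rather than an authoritative answer. In Perl, a method call with no arguments behaves the same with or without trailing parentheses, so $hit_object->frac_aligned_hit and $hit_object->frac_aligned_hit() return the same value. As for fractions above 1, one plausible cause (an assumption, not something confirmed in this thread) is that overlapping HSPs get counted more than once when aligned lengths are summed. The ToyHit class below is hypothetical and is not BioPerl's implementation; its names (frac_naive, frac_tiled) are made up for illustration. It only shows how a naive sum can exceed 1 while a sum over merged ranges cannot.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for a hit object; NOT the BioPerl class.
package ToyHit;

sub new {
    my ($class, %args) = @_;
    # length: total sequence length
    # hsps:   list of [start, end] pairs (1-based, inclusive)
    return bless { length => $args{length}, hsps => $args{hsps} }, $class;
}

# Naive fraction: sums HSP lengths, so overlapping HSPs are counted twice.
sub frac_naive {
    my ($self) = @_;
    my $sum = 0;
    $sum += $_->[1] - $_->[0] + 1 for @{ $self->{hsps} };
    return $sum / $self->{length};
}

# Tiled fraction: merges overlapping ranges first, so it cannot exceed 1
# as long as every HSP lies within the sequence.
sub frac_tiled {
    my ($self) = @_;
    my @sorted = sort { $a->[0] <=> $b->[0] } @{ $self->{hsps} };
    my ($covered, $cur_start, $cur_end) = (0);
    for my $r (@sorted) {
        if (defined $cur_end && $r->[0] <= $cur_end) {
            # Overlaps the current merged range: extend it if needed.
            $cur_end = $r->[1] if $r->[1] > $cur_end;
        }
        else {
            # Close the previous merged range (if any) and start a new one.
            $covered += $cur_end - $cur_start + 1 if defined $cur_end;
            ($cur_start, $cur_end) = @$r;
        }
    }
    $covered += $cur_end - $cur_start + 1 if defined $cur_end;
    return $covered / $self->{length};
}

package main;

# Two HSPs on a 100-residue sequence that overlap on positions 41..80.
my $hit = ToyHit->new(length => 100, hsps => [ [1, 80], [41, 100] ]);

# Same method, with and without parentheses: identical results.
printf "naive: %.2f %.2f\n", $hit->frac_naive, $hit->frac_naive();   # naive: 1.40 1.40
printf "tiled: %.2f %.2f\n", $hit->frac_tiled, $hit->frac_tiled();   # tiled: 1.00 1.00
```

If the real values above 1 come from something else (for example a wrong sequence length in the report), this sketch will not explain them; checking the HSP coordinates of the affected hits would be the first step.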
From bix at sendu.me.uk Wed Jul 5 08:37:34 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Wed, 05 Jul 2006 13:37:34 +0100 Subject: [Bioperl-l] checkout_all fails on biodata Message-ID: <44ABB28E.2000203@sendu.me.uk> I'm doing: cvs -d:ext:sendu at dev.open-bio.org:/home/repository/bioperl co bioperl_all to check out all the bioperl packages at once. However it only checks out core, db, pedigree, pipeline and run before failing on biodata: cvs checkout: Updating biodata cvs checkout: failed to create lock directory for `/home/repository/bioperl/biodata' (/home/repository/bioperl/biodata/#cvs.lock): Permission denied cvs checkout: failed to obtain dir lock in repository `/home/repository/bioperl/biodata' cvs [checkout aborted]: read lock failed - giving up This failure is consistent for me (had it multiple times, different days, never worked). Biodata isn't even mentioned as a possible package at http://bioperl.org/wiki/Using_CVS. What is it? Could it be moved to the end of the alias list so it is checked out last, letting all the other packages be checked out before failure? PS. neither biodata nor pipeline are mentioned as a package on that wiki page or at http://bioperl.org/wiki/Category:BioPerl_Packages. Are there yet more packages? Cheers, Sendu. From hlapp at gmx.net Wed Jul 5 08:55:42 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Wed, 5 Jul 2006 08:55:42 -0400 Subject: [Bioperl-l] checkout_all fails on biodata In-Reply-To: <44ABB28E.2000203@sendu.me.uk> References: <44ABB28E.2000203@sendu.me.uk> Message-ID: <9C3FE120-6B3D-4C81-99E1-921E81926CCA@gmx.net> Should have been fixed - I can cvs update. did you try again? On Jul 5, 2006, at 8:37 AM, Sendu Bala wrote: > I'm doing: > > cvs -d:ext:sendu at dev.open-bio.org:/home/repository/bioperl co > bioperl_all > > to check out all the bioperl packages at once. 
However it only checks > out core, db, pedigree, pipeline and run before failing on biodata: > > cvs checkout: Updating biodata > cvs checkout: failed to create lock directory for > `/home/repository/bioperl/biodata' > (/home/repository/bioperl/biodata/#cvs.lock): Permission denied > cvs checkout: failed to obtain dir lock in repository > `/home/repository/bioperl/biodata' > cvs [checkout aborted]: read lock failed - giving up > > This failure is consistent for me (had it multiple times, different > days, never worked). > > Biodata isn't even mentioned as a possible package at > http://bioperl.org/wiki/Using_CVS. What is it? Could it be moved to > the > end of the alias list so it is checked out last, letting all the other > packages be checked out before failure? > > PS. neither biodata nor pipeline are mentioned as a package on that > wiki > page or at http://bioperl.org/wiki/Category:BioPerl_Packages. Are > there > yet more packages? > > Cheers, > Sendu. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From bix at sendu.me.uk Wed Jul 5 09:03:50 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Wed, 05 Jul 2006 14:03:50 +0100 Subject: [Bioperl-l] checkout_all fails on biodata In-Reply-To: <9C3FE120-6B3D-4C81-99E1-921E81926CCA@gmx.net> References: <44ABB28E.2000203@sendu.me.uk> <9C3FE120-6B3D-4C81-99E1-921E81926CCA@gmx.net> Message-ID: <44ABB8B6.5040707@sendu.me.uk> Hilmar Lapp wrote: > Should have been fixed - I can cvs update. did you try again? Still doesn't work, no change. I can manually check out the other packages, I just can't do it with bioperl_all alias. 
co bioperl-biodata fails because: cvs server: cannot find module `bioperl-biodata' - ignored cvs [checkout aborted]: cannot expand modules (not that I want it - if its no longer a bioperl package can it be removed from the alias?) From hlapp at gmx.net Wed Jul 5 09:41:27 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Wed, 5 Jul 2006 09:41:27 -0400 Subject: [Bioperl-l] checkout_all fails on biodata In-Reply-To: <44ABB8B6.5040707@sendu.me.uk> References: <44ABB28E.2000203@sendu.me.uk> <9C3FE120-6B3D-4C81-99E1-921E81926CCA@gmx.net> <44ABB8B6.5040707@sendu.me.uk> Message-ID: The idea was once that Bioperl, Biojava, etc had all those unit tests that use specific sample data which take up quite a bit of space. Unifying the largely redundant test data into a single shared repository would save quite a bit of space and therefore download/ update time for people who work on/use more than one Bio* project. However, this was never fully implemented AFAIK. I.e., you don't need biodata. I guess it could be removed from the alias since it's not integrated anyway. Any other opinions? I also forwarded your report to root-l as I couldn't find the offending (stale) lock file. -hilmar On Jul 5, 2006, at 9:03 AM, Sendu Bala wrote: > Hilmar Lapp wrote: >> Should have been fixed - I can cvs update. did you try again? > > Still doesn't work, no change. I can manually check out the other > packages, I just can't do it with bioperl_all alias. > > co bioperl-biodata fails because: > cvs server: cannot find module `bioperl-biodata' - ignored > cvs [checkout aborted]: cannot expand modules > > (not that I want it - if its no longer a bioperl package can it be > removed from the alias?) 
> _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From cjfields at uiuc.edu Wed Jul 5 09:48:03 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Wed, 5 Jul 2006 08:48:03 -0500 Subject: [Bioperl-l] checkout_all fails on biodata In-Reply-To: <44ABB8B6.5040707@sendu.me.uk> Message-ID: <000f01c6a039$a7a24f10$15327e82@pyrimidine> Bioperl-data was a directory started up a few years ago to hold various data files for testing and as examples (BLAST file examples, GenBank files, etc), somewhat like the t/data directory but cleaned up a bit more. It hasn't been updated in a while. Regardless, you should be able to check it out. As for the problem, looks like Hilmar's checking up on a possible lock file issue. Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Sendu Bala > Sent: Wednesday, July 05, 2006 8:04 AM > To: bioperl-l at lists.open-bio.org > Subject: Re: [Bioperl-l] checkout_all fails on biodata > > Hilmar Lapp wrote: > > Should have been fixed - I can cvs update. did you try again? > > Still doesn't work, no change. I can manually check out the other > packages, I just can't do it with bioperl_all alias. > > co bioperl-biodata fails because: > cvs server: cannot find module `bioperl-biodata' - ignored > cvs [checkout aborted]: cannot expand modules > > (not that I want it - if its no longer a bioperl package can it be > removed from the alias?) 
> _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Wed Jul 5 11:06:30 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Wed, 5 Jul 2006 10:06:30 -0500 Subject: [Bioperl-l] checkout_all fails on biodata In-Reply-To: Message-ID: <001901c6a044$999a14b0$15327e82@pyrimidine> I use TortoiseCVS via WinXP and I'm getting the same issue that Sendu has: --------------------------- In C:\Perl\src: "C:\Program Files\TortoiseCVS\cvs.exe" "-q" "--lf" "checkout" "-P" "bioperl_all" CVSROOT=:ext:cjfields at dev.open-bio.org:/home/repository/bioperl ... cvs checkout: failed to create lock directory for `/home/repository/bioperl/biodata' (/home/repository/bioperl/biodata/#cvs.lock): Permission denied cvs checkout: failed to obtain dir lock in repository `/home/repository/bioperl/biodata' cvs [checkout aborted]: read lock failed - giving up cvs.exe checkout: in directory bioperl: cvs.exe checkout: cannot open CVS/Entries for reading: No such file or directory --------------------------- I had the same problem with schema (BioSQL) a while back. I tried again, and... --------------------------- cvs checkout: failed to create lock directory for `/home/repository/bioperl/biosql-schema' (/home/repository/bioperl/biosql-schema/#cvs.lock): Permission denied cvs checkout: failed to obtain dir lock in repository `/home/repository/bioperl/biosql-schema' cvs [checkout aborted]: read lock failed - giving up cvs.exe checkout: in directory .: cvs.exe checkout: cannot open CVS/Entries for reading: No such file or directory --------------------------- I believe it had something to do with CVS commit privileges (i.e. I had none for schema, which was fine). So maybe this is a permissions issue via the lock file? 
Looking at the alias:

bioperl_all -d bioperl &core &db &run &pipeline &pedigree &biodata &schema &network &microarray

This may mean that anyone without commit privileges for any of the above (specifically schema and biodata) who tries a checkout/update using bioperl_all may run into this problem.

Since it's not integrated I don't see the problem with removing it from the alias, but if we follow the same line of logic (and privileges are the issue) then schema must be removed as well. To me it doesn't make much sense to not include schema, though, since we can checkout/update bioperl-db.

BTW, I like the idea of biodata as you've outlined it. It would be nice to gear the test suite towards a more general set of data for all the Bio* projects versus having each one come with its own, and the data could be updated a bit more frequently than t/data is. Seems like it would definitely save a large chunk of real estate for the distributions. If one wanted to run the full test suite then they would have to download biodata separately, but that's not a bad compromise. Though, if this is/was its intent, why would it need a lock file?

Chris

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of Hilmar Lapp
> Sent: Wednesday, July 05, 2006 8:41 AM
> To: Sendu Bala
> Cc: bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] checkout_all fails on biodata
>
> The idea was once that Bioperl, Biojava, etc had all those unit tests
> that use specific sample data which take up quite a bit of space.
> Unifying the largely redundant test data into a single shared
> repository would save quite a bit of space and therefore download/
> update time for people who work on/use more than one Bio* project.
>
> However, this was never fully implemented AFAIK. I.e., you don't need
> biodata. I guess it could be removed from the alias since it's not
> integrated anyway.
>
> Any other opinions?
> > I also forwarded your report to root-l as I couldn't find the > offending (stale) lock file. > > -hilmar > > On Jul 5, 2006, at 9:03 AM, Sendu Bala wrote: > > > Hilmar Lapp wrote: > >> Should have been fixed - I can cvs update. did you try again? > > > > Still doesn't work, no change. I can manually check out the other > > packages, I just can't do it with bioperl_all alias. > > > > co bioperl-biodata fails because: > > cvs server: cannot find module `bioperl-biodata' - ignored > > cvs [checkout aborted]: cannot expand modules > > > > (not that I want it - if its no longer a bioperl package can it be > > removed from the alias?) > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Wed Jul 5 11:36:33 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Wed, 5 Jul 2006 10:36:33 -0500 Subject: [Bioperl-l] Bio::SearchIO::hmmer hsp behaviour In-Reply-To: Message-ID: <001a01c6a048$cb802420$15327e82@pyrimidine> Okay, I managed to figure out what the problem was. I committed a fix in CVS for the initial bug (Selvi's missing hits). Still has one HSP per hit for now; I think it will take a bit more effort to get a BLAST-like multi HSP/hit up and running. Selvi, update from CVS to see if that works. 
Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Chris Fields > Sent: Friday, June 30, 2006 12:44 PM > To: Sendu Bala; Jason Stajich > Cc: bioperl-l at lists.open-bio.org list > Subject: Re: [Bioperl-l] Bio::SearchIO::hmmer hsp behaviour > > I'll try looking at it this weekend. A suggested workaround is to > either try setting -A for no alignments or setting it to a high > number to retrieve all of them. It's pretty serious as the error > silently dumps those domains, so for those using automated annotation > pipelines would miss it unless they are also checking the raw output. > > You could design a SearchIO::hmmpfam parser then expand it to take in > hmmsearch output at a later point, or keep them separate. I like the > idea of having modules that are more specific about what they parse; > seems at some point you reach serious code bloat and maintenance > becomes an issue. Look at SearchIO::blast; it parses various text > BLAST output very well but with some serious obfuscation. Just don't > know how productive it would be to separate out the PSI-BLAST and > bl2seq stuff since they are pretty close to a standard BLAST > report... oh well. > > To Jason : good luck on your move. Drop us a line here to let us > know everything went well. > > Chris > > On Jun 30, 2006, at 11:14 AM, Sendu Bala wrote: > > > Chris Fields wrote: > >> It may have been just simpler to have it be one HSP (domain) per Hit > >> (model) as that's how the reports are generated. My reasoning was > >> that > >> using the one domain per model made sense based on what you are > >> actually > >> trying to do, which is annotate the sequence based on the order the > >> domain appears. Most others may not view it that way, which is fine. > >> One can always gather the relevant HSP's, convert to seqfeatures, > >> then > >> sort them if order is important, I suppose. 
> >> > >> I would say, if the overall consensus is to modify it to have > >> multiple > >> domain hits per model (similar to BLAST) then Sendu should go > >> ahead and > >> make those changes then announce it on the list so no one can gripe > >> about it later. My main concern was not changing things so > >> dramatically > >> that it'll break for someone > > > > Going on your earlier suggestion, I was thinking about making > > SearchIO::hmmpfam instead, which would get used if you set the > > format to > > 'hmmpfam' instead of the generic 'hmmer' when making a SearchIO. I > > suppose I would make a SearchIO::hmmsearch as well, if necessary. > > > > > > [...] > >> that the reported bug about missing hits (Bug 2036) is fixed as well. > > > > However, having never made a SearchIO plugin before, it will be some > > time before I get my head around it. I'll want to make one the current > > HOWTO:SearchIO way before I can think about doing it a better way > > (hashes) as well. So I can say I'll make a move on this at some > > point in > > the future, but if someone wants to fix Bug 2036 in the mean time, > > they > > are welcome to. Again as suggested, my priority is Bio::Map right now. > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > Christopher Fields > Postdoctoral Researcher > Lab of Dr. 
Robert Switzer > Dept of Biochemistry > University of Illinois Urbana-Champaign > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From arareko at campus.iztacala.unam.mx Wed Jul 5 11:38:14 2006 From: arareko at campus.iztacala.unam.mx (Mauricio Herrera Cuadra) Date: Wed, 05 Jul 2006 10:38:14 -0500 Subject: [Bioperl-l] checkout_all fails on biodata In-Reply-To: <001901c6a044$999a14b0$15327e82@pyrimidine> References: <001901c6a044$999a14b0$15327e82@pyrimidine> Message-ID: <44ABDCE6.7090906@campus.iztacala.unam.mx> Same problem here. I've never used the bioperl_all alias before (I always check-out dirs individually), but to me it seems like a privileges issue as Chris suggests. Also browsed through all the repository in dev.open-bio.org and didn't found such lock file. I guess Chris D. or Jason will know better what's happening here. Mauricio. Chris Fields wrote: > I use TortoiseCVS via WinXP and I'm getting the same issue that Sendu has: > --------------------------- > In C:\Perl\src: "C:\Program Files\TortoiseCVS\cvs.exe" "-q" "--lf" > "checkout" "-P" "bioperl_all" > CVSROOT=:ext:cjfields at dev.open-bio.org:/home/repository/bioperl > > ... > > cvs checkout: failed to create lock directory for > `/home/repository/bioperl/biodata' > (/home/repository/bioperl/biodata/#cvs.lock): Permission denied > cvs checkout: failed to obtain dir lock in repository > `/home/repository/bioperl/biodata' > cvs [checkout aborted]: read lock failed - giving up > cvs.exe checkout: in directory bioperl: > cvs.exe checkout: cannot open CVS/Entries for reading: No such file or > directory > --------------------------- > > I had the same problem with schema (BioSQL) a while back. I tried again, > and... 
>
> ---------------------------
> cvs checkout: failed to create lock directory for
> `/home/repository/bioperl/biosql-schema'
> (/home/repository/bioperl/biosql-schema/#cvs.lock): Permission denied
> cvs checkout: failed to obtain dir lock in repository
> `/home/repository/bioperl/biosql-schema'
> cvs [checkout aborted]: read lock failed - giving up
> cvs.exe checkout: in directory .:
> cvs.exe checkout: cannot open CVS/Entries for reading: No such file or
> directory
> ---------------------------
>
> I believe it had something to do with CVS commit privileges (i.e. I had none
> for schema, which was fine). So maybe this is a permissions issue via the
> lock file? Looking at the alias:
>
> bioperl_all -d bioperl &core &db &run &pipeline &pedigree &biodata &schema
> &network &microarray
>
> This may mean if anyone w/o commit privs for any of the above (specifically
> schema and biodata) tries checkout/update using bioperl-all, they may run
> into this problem.
>
> Since it's not integrated I don't see the problem with removing it from the
> alias, but if we follow the same line of logic (and privileges are the
> issue) then schema must be removed as well. To me it doesn't make much
> sense to not include schema though since we can checkout/update bioperl-db.
>
>
> BTW, I like the idea of biodata as you've outlined it. Would be nice to
> gear the test suite towards a more general set of data for all the Bio*
> projects versus having each one come with their own, and the data could be
> updated a bit more frequently that t/data is. Seems like it would
> definitely save a large chunk of real estate for the distributions. If one
> wanted to run the full test suite then they would have to download biodata
> separately, though, but not a bad compromise. Though, if this is/was its
> intent, why would it need a lock file?
> > Chris > >> -----Original Message----- >> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- >> bounces at lists.open-bio.org] On Behalf Of Hilmar Lapp >> Sent: Wednesday, July 05, 2006 8:41 AM >> To: Sendu Bala >> Cc: bioperl-l at lists.open-bio.org >> Subject: Re: [Bioperl-l] checkout_all fails on biodata >> >> The idea was once that Bioperl, Biojava, etc had all those unit tests >> that use specific sample data which take up quite a bit of space. >> Unifying the largely redundant test data into a single shared >> repository would save quite a bit of space and therefore download/ >> update time for people who work on/use more than one Bio* project. >> >> However, this was never fully implemented AFAIK. I.e., you don't need >> biodata. I guess it could be removed from the alias since it's not >> integrated anyway. >> >> Any other opinions? >> >> I also forwarded your report to root-l as I couldn't find the >> offending (stale) lock file. >> >> -hilmar >> >> On Jul 5, 2006, at 9:03 AM, Sendu Bala wrote: >> >>> Hilmar Lapp wrote: >>>> Should have been fixed - I can cvs update. did you try again? >>> Still doesn't work, no change. I can manually check out the other >>> packages, I just can't do it with bioperl_all alias. >>> >>> co bioperl-biodata fails because: >>> cvs server: cannot find module `bioperl-biodata' - ignored >>> cvs [checkout aborted]: cannot expand modules >>> >>> (not that I want it - if its no longer a bioperl package can it be >>> removed from the alias?) 
>> --
>> ===========================================================
>> : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
>> ===========================================================

--
MAURICIO HERRERA CUADRA
arareko at campus.iztacala.unam.mx
Laboratorio de Genética
Unidad de Morfofisiología y Función
Facultad de Estudios Superiores Iztacala, UNAM

From bix at sendu.me.uk Thu Jul 6 04:41:57 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Thu, 06 Jul 2006 09:41:57 +0100
Subject: [Bioperl-l] Bio::Map changes
In-Reply-To: <449A9AF9.2000305@sendu.me.uk>
References: <44985915.8010607@sendu.me.uk> <449A9AF9.2000305@sendu.me.uk>
Message-ID: <44ACCCD5.3030309@sendu.me.uk>

Sendu Bala wrote:
> The next step is to tidy up all of Bio::Map*, which involves a major
> reimplementation of the whole system [...]
> The reimplementation will make Position central to the model, allowing
> for lots of other things to work properly without anything becoming
> inconsistent (as is currently the case).

This is now done. It uses a new PositionHandler class behind the scenes. The next step is to introduce relative positioning across the board, possibly in a way that makes OrderedPosition redundant or an implementer of the system.

Has anyone here ever used Bio::Map* modules for anything?
I would appreciate you sending me your code, especially if you've used
MapIO, Physical (encompassing Clone, Contig, FPCMarker,
OrderedPositionWithDistance) or LinkageMap (encompassing LinkagePosition,
OrderedPosition, Microsatellite) since these have insufficient tests at
the moment.

From nidage at yahoo.com Thu Jul 6 14:13:12 2006
From: nidage at yahoo.com (sss lll)
Date: Thu, 6 Jul 2006 11:13:12 -0700 (PDT)
Subject: [Bioperl-l] PrimarySeqI object Exception
Message-ID: <20060706181312.63581.qmail@web35607.mail.mud.yahoo.com>

Hi there,

I encountered a problem while calling module PrimarySeqI, with the
following code:

my $db=Bio::DB::Fasta->new($fafile);
my $obj=$db->get_Seq_by_id($array_gene_name[$p]);
$seqio->write_seq($obj);

The error message was:
MSG: Did not provide a valid Bio::PrimarySeqI object
STACK Bio::SeqIO::fasta::write_seq
/usr/lib/perl5/site_perl/5.8.0/Bio/SeqIO/fasta.pm:178

We think it had to do with the length of the gene name. For example the
following gene name was a problem:

gi|59711891|ref|YP_204667.1| NAD-specific glutamate dehydrogenase [Vibrio fischeri ES114]*VIB*COG2902*E

Any ideas on what happened?

Thanks

From rmb32 at cornell.edu Thu Jul 6 19:11:00 2006
From: rmb32 at cornell.edu (Robert Buels)
Date: Thu, 06 Jul 2006 16:11:00 -0700
Subject: [Bioperl-l] parser for GeneSeqer
In-Reply-To: <2FB066C7-12E6-46D8-8F4A-BD096BE2A0CA@uiuc.edu>
References: <44A558F2.2050304@cornell.edu> <2FB066C7-12E6-46D8-8F4A-BD096BE2A0CA@uiuc.edu>
Message-ID: <44AD9884.6040507@cornell.edu>

The Annotation/Annotatable stuff was going to be talked about at the
GMOD meeting that just happened, wasn't it? What's the scoop on that?

Rob

Chris Fields wrote:
> If you plan on generating seqfeatures from this output you could check
> out the Bio::Tools core modules for examples.
> There are a few there
> that take program output and convert them to Bio::SeqFeature::Generic
> objects, including Bio::Tools::RNAMotif and Bio::Tools::tRNAscanSE. If
> alignments are involved you might want something like
> Bio::SeqFeature::FeaturePair. Not sure about using the
> SeqFeature::Annotation or others; I thought that some of the
> Annotation/Annotatable stuff might be changing soon but I may be wrong.
>
> Chris
>
> On Jun 30, 2006, at 12:01 PM, Robert Buels wrote:
>
>> Hi all,
>>
>> I find myself needing a parser for GeneSeqer output, so I'm writing one
>> (which I will submit for your consideration when it's working). In a
>> nutshell, GeneSeqer is a (kind of old) program for aligning a bunch of
>> ESTs to genomic sequence, then using those alignments to predict where
>> in the genomic sequence the genes are. So really what you get from this
>> is a bunch of hierarchical features.
>>
>> I don't really know where I should put it in the bioperl hierarchy
>> though. Probably FeatureIO?
>>
>> And what's the current fashion for objects it should emit?
>> Bio::SeqFeature::Generic? Bio::SeqFeature::Annotated?
>>
>> Rob
>
> Christopher Fields
> Postdoctoral Researcher
> Lab of Dr. Robert Switzer
> Dept of Biochemistry
> University of Illinois Urbana-Champaign

--
Robert Buels
SGN Bioinformatics Analyst
252A Emerson Hall, Cornell University
Ithaca, NY 14853
Tel: 503-889-8539
rmb32 at cornell.edu
http://www.sgn.cornell.edu

From hlapp at gmx.net Thu Jul 6 19:27:31 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Thu, 6 Jul 2006 19:27:31 -0400
Subject: [Bioperl-l] parser for GeneSeqer
In-Reply-To: <44AD9884.6040507@cornell.edu>
References: <44A558F2.2050304@cornell.edu> <2FB066C7-12E6-46D8-8F4A-BD096BE2A0CA@uiuc.edu> <44AD9884.6040507@cornell.edu>
Message-ID: <6B530ED6-5825-47C4-A677-2C75E0F97E26@gmx.net>

No scoop b/c no time. I am busy w/ a grant and Lincoln had to leave
early as well on Friday. Sorry.

--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From cjfields at uiuc.edu Thu Jul 6 19:28:09 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 6 Jul 2006 18:28:09 -0500
Subject: [Bioperl-l] parser for GeneSeqer
In-Reply-To: <44AD9884.6040507@cornell.edu>
Message-ID: <000001c6a153$d78b83c0$15327e82@pyrimidine>

Not any word yet. Been pretty quiet, likely b/c everybody was there
planning a roadmap for v1.6.

Chris

From hlapp at gmx.net Thu Jul 6 19:41:44 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Thu, 6 Jul 2006 19:41:44 -0400
Subject: [Bioperl-l] parser for GeneSeqer
In-Reply-To: <000001c6a153$d78b83c0$15327e82@pyrimidine>
References: <000001c6a153$d78b83c0$15327e82@pyrimidine>
Message-ID:

Uhm - roadmap - I guess yes, but more that of the Golden State, or
other states on the way, for Jason.

--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From cjfields at uiuc.edu Thu Jul 6 19:49:23 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 6 Jul 2006 18:49:23 -0500
Subject: [Bioperl-l] parser for GeneSeqer
In-Reply-To:
Message-ID: <000101c6a156$cee60bc0$15327e82@pyrimidine>

Oh well. There's always BOSC...

Chris

From osborne1 at optonline.net Thu Jul 6 21:06:32 2006
From: osborne1 at optonline.net (Brian Osborne)
Date: Thu, 06 Jul 2006 21:06:32 -0400
Subject: [Bioperl-l] PrimarySeqI object Exception
In-Reply-To: <20060706181312.63581.qmail@web35607.mail.mud.yahoo.com>
Message-ID:

sss lll,

What this error means is that $obj, which is what's passed to the
write_seq method, is not a valid Sequence object. What identifier is
$array_gene_name[$p]?

Brian O.

From rmb32 at cornell.edu Thu Jul 6 21:24:40 2006
From: rmb32 at cornell.edu (Robert Buels)
Date: Thu, 06 Jul 2006 18:24:40 -0700
Subject: [Bioperl-l] FeatureIO tests failing on linux debian sarge
Message-ID: <44ADB7D8.7080102@cornell.edu>

I am stumped. On a fresh checkout from cvs (as of like 10 seconds ago):

rob@rubisco:/usr/local/lib/site_perl/bioperl-live$ perl -v

This is perl, v5.8.4 built for i386-linux-thread-multi

Copyright 1987-2004, Larry Wall

Perl may be copied only under the terms of either the Artistic License or the
GNU General Public License, which may be found in the Perl 5 source kit.

Complete documentation for Perl, including FAQ lists, should be found on
this system using `man perl' or `perldoc perl'. If you have access to the
Internet, point your browser at http://www.perl.com/, the Perl Home Page.

rob@rubisco:/usr/local/lib/site_perl/Bio$ perl t/FeatureIO.t
1..22
ok 1
ok 2
ok 3
ok 4
ok 5
ok 6
Can't locate object method "get_Annotations" via package
"Bio::SeqFeature::Annotated" at
/usr/local/lib/site_perl/Bio/SeqFeature/Annotated.pm line 292, line 2.
ok 7 # Cannot complete FeatureIO tests
ok 8 # Cannot complete FeatureIO tests
ok 9 # Cannot complete FeatureIO tests
ok 10 # Cannot complete FeatureIO tests
ok 11 # Cannot complete FeatureIO tests
ok 12 # Cannot complete FeatureIO tests
ok 13 # Cannot complete FeatureIO tests
ok 14 # Cannot complete FeatureIO tests
ok 15 # Cannot complete FeatureIO tests
ok 16 # Cannot complete FeatureIO tests
ok 17 # Cannot complete FeatureIO tests
ok 18 # Cannot complete FeatureIO tests
ok 19 # Cannot complete FeatureIO tests
ok 20 # Cannot complete FeatureIO tests
ok 21 # Cannot complete FeatureIO tests
ok 22 # Cannot complete FeatureIO tests

However, same code runs fine on my debian unstable machine (perl 5.8.8).
Perhaps this is a bug in debian stable's perl?

I did some poking around through the code, changing @ISA = qw/.../ to
use base, switching the order of inclusion in the ISA at the top of
Bio::SeqFeature::Annotated, no dice.

Anybody able to reproduce this? Anyone have any ideas?

Rob

--
Robert Buels
SGN Bioinformatics Analyst
252A Emerson Hall, Cornell University
Ithaca, NY 14853
Tel: 503-889-8539
rmb32 at cornell.edu
http://www.sgn.cornell.edu

From cjfields at uiuc.edu Thu Jul 6 22:30:25 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 6 Jul 2006 21:30:25 -0500
Subject: [Bioperl-l] FeatureIO tests failing on linux debian sarge
In-Reply-To: <44ADB7D8.7080102@cornell.edu>
Message-ID: <000001c6a16d$4dd7e6e0$15327e82@pyrimidine>

I don't get any issues (all tests pass), except a few warning messages,
which are normal; some ontology handling is not implemented. Usually when
running tests I use 'perl -I. t/test.t' to force it to use the core
directory first. You might try that to see if it 'fixes' the problem.
If it does, there may be another bioperl installation in @INC being used
instead of your current directory.

Chris

From chandan.kr.singh at gmail.com Fri Jul 7 01:23:40 2006
From: chandan.kr.singh at gmail.com (CHANDAN SINGH)
Date: Fri, 7 Jul 2006 10:53:40 +0530
Subject: [Bioperl-l] PrimarySeqI object Exception
In-Reply-To:
References: <20060706181312.63581.qmail@web35607.mail.mud.yahoo.com>
Message-ID: <2d4f320607062223y520a1375lb30cf40c1c883702@mail.gmail.com>

Hi

By default, the id is the first word encountered, i.e. the first string
after ">", separated from the rest by a space.
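
A minimal sketch of the rule just described, in plain Perl with no BioPerl
dependency: it takes a FASTA header line and returns the first
whitespace-delimited word after the ">". The helper name first_word_id is
hypothetical and only illustrates the rule; it is not a claim about how
Bio::DB::Fasta is implemented internally. The header is the one from the
original report.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): return the first
# whitespace-delimited word of a FASTA header, without the leading ">".
# Returns undef if the header contains no non-space characters.
sub first_word_id {
    my ($header) = @_;
    my ($id) = $header =~ /^>?(\S+)/;
    return $id;
}

# Header line taken from the report in this thread.
my $header = '>gi|59711891|ref|YP_204667.1| NAD-specific glutamate '
           . 'dehydrogenase [Vibrio fischeri ES114]*VIB*COG2902*E';

my $id = first_word_id($header);
print "$id\n";    # prints gi|59711891|ref|YP_204667.1|
```

If the array holds the whole header line rather than this first word, a
lookup by id would presumably find nothing, which would be consistent with
the error reported.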
The sample id you mentioned in your first mail contains spaces and, as I
mentioned in my previous mail, I am sure the ids made by indexing and the
ones you are using in the array are different. You can see the ids used in
indexing by using

@ids = $db->ids();
print join("\n", @ids);

Cheers
Chandan

From selvik at ufl.edu Fri Jul 7 12:07:03 2006
From: selvik at ufl.edu (Selvi Kadirvel)
Date: Fri, 7 Jul 2006 12:07:03 -0400
Subject: [Bioperl-l] Bio::SearchIO::hmmer hsp behaviour
In-Reply-To: <001a01c6a048$cb802420$15327e82@pyrimidine>
References: <001a01c6a048$cb802420$15327e82@pyrimidine>
Message-ID: <1A5235F4-87E6-42D7-9796-7FEB8F7C04E5@ufl.edu>

Chris:

I just tried it out, and it looks like this solution works fine for me.
Thank you for the fix!

-Selvi

From cjfields at uiuc.edu Fri Jul 7 12:16:30 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Fri, 7 Jul 2006 11:16:30 -0500
Subject: [Bioperl-l] Bio::SeqFeatureI spliced_seq
Message-ID: <002a01c6a1e0$b4e2b360$15327e82@pyrimidine>

There is a reported bug (Bug 2039) which I found an easy fix for; the issue
is that spliced_seq, as currently implemented, has two optional arguments:

my ($self, $db, $nosort) = @_;

$db is-a Bio::DB::RandomAccessI; $nosort is a flag so that locations aren't
sorted before splicing, which is the crux of the bug. So, to set $nosort you
must also set $db to either undef or a Bio::DB::RandomAccessI (a point not
made in the docs and not immediately clear to the user).

Would it make more sense to have something like this (using
$self->_rearrange to get the options)?
    my $seq = $sf->spliced_seq(-nosort => 1);
    my $seq = $sf->spliced_seq(-db => $db);
    my $seq = $sf->spliced_seq(-nosort => 1, -db => $db);

Christopher Fields
Postdoctoral Researcher - Switzer Lab
Dept. of Biochemistry
University of Illinois Urbana-Champaign

From vebaev at gmail.com Sat Jul 8 16:59:40 2006
From: vebaev at gmail.com (Vesselin Baev)
Date: Sat, 08 Jul 2006 23:59:40 +0300
Subject: [Bioperl-l] BLAST running options
Message-ID: <44B01CBC.9070404@gmail.com>

Hi,
I'm parsing BLAST results, but I need a BLAST option to limit and decrease the number of results.
I'm blasting an oligo of about 40 nt and I need only hits that either contain mismatches (no more than 10) or match exactly, but that span the same length as the query (40).
I do not want the large number of shorter matches that BLAST gives me.

Does anyone know what kind of BLAST option to use?
Thanks

--
------------------------------------------------

University of Plovdiv
Faculty of Biology
Dept. Molecular Biology and Plant Physiology
Tzar Asen 24
Plovdiv 4000, BULGARIA
vebaev at gmail.com
tel.00359889034044

From cjfields at uiuc.edu Sat Jul 8 19:15:29 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Sat, 8 Jul 2006 18:15:29 -0500
Subject: [Bioperl-l] BLAST running options
In-Reply-To: <44B01CBC.9070404@gmail.com>
References: <44B01CBC.9070404@gmail.com>
Message-ID: <95D47990-9B63-444D-B386-04219D21DC39@uiuc.edu>

There were some posts about this a few months back.

http://bioperl.org/pipermail/bioperl-l/2006-April/021341.html

Essentially, most responders suggested not using BLAST, but I believe there were a few who gave pointers.

Chris

On Jul 8, 2006, at 3:59 PM, Vesselin Baev wrote:

> Hi,
> I'm parsing Blast results, but I need an Blast option to limit
> limit and
> decrease the Blast number of results.
> I'm blasting an oligo about 40nt and I need only results which are
> with
> mismatches (not more than 10) or exactly matching but in the length as
> the query - 40.
> I do not want all the big amount of results that blast gave me about > shorter matching. > > Do anyone knows what king of BLAST option to use? > Thanks > > -- > ------------------------------------------------ > > University of Plovdiv > Faculty of Biology > Dept. Molecular Biology and Plant Physiology > Tzar Asen 24 > Plovdiv 4000, BULGARIA > vebaev at gmail.com > tel.00359889034044 > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From cjfields at uiuc.edu Mon Jul 10 17:09:12 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 10 Jul 2006 16:09:12 -0500 Subject: [Bioperl-l] How to use gi2taxonid Message-ID: <000301c6a465$182025d0$15327e82@pyrimidine> Hubert, In case you didn't get this going, there may be another option now. I have started work on a new set of modules called Bio::DB::EUtilities in bioperl-live, intended as a back-end for NCBI database searches. It can be used directly if needed though. You can use EPost/Elink to directly retrieve the taxonIDs using the following script (pass a file containing the protein/nucleotide primary ID on command line). The below retrieves taxonid's using protein GI's: use Bio::DB::EUtilities; my @ids; while (my $id = <>) { chomp $id; push @ids, $id; } my $epost = Bio::DB::EUtilities->new( -eutil => 'epost', -db => 'protein', -id => \@ids, ); $epost->get_response; my $elink = Bio::DB::EUtilities->new( -eutil => 'elink', -cookie => $epost->next_cookie, -db => 'taxonomy', ); $elink->get_response; my @tax_ids = $elink->get_db_ids; Chris > hi, > I have downloaded the gi2taxonid file to get the taxonid for a GI > number > taken from a report as recommended here, but I don't know how to > use the > gi2taxonid file. 
> Jason wrote in a previous post that you have to make a DB_File out of > it, but I don't know how....and finally tie it to a hash.... > Can anybody give me a hint how to use it..... my final goal is to get > the taxonomy. > > thanks > Hubert Christopher Fields Postdoctoral Researcher - Switzer Lab Dept. of Biochemistry University of Illinois Urbana-Champaign From hubert.prielinger at gmx.at Mon Jul 10 19:53:26 2006 From: hubert.prielinger at gmx.at (Hubert Prielinger) Date: Mon, 10 Jul 2006 17:53:26 -0600 Subject: [Bioperl-l] How to use gi2taxonid In-Reply-To: <000301c6a465$182025d0$15327e82@pyrimidine> References: <000301c6a465$182025d0$15327e82@pyrimidine> Message-ID: <44B2E876.2020200@gmx.at> Hi Chris, thanks for your response, actually I have done it with the EUtils, because I have only accession ids and there is no possibility to retrieve the taxonomy directly for an accession id. Because the xml files you retrieve are very small, I first assign accession id to esearch, parse the Uid from the xml file, assign Uid to esummary, parse tax id from xml and finally, assign tax id to esummary again and retrieve taxonomy and parse it..... I know a little bit intricatley, but it works fine.....thanks regards Hubert Chris Fields wrote: > Hubert, > > In case you didn't get this going, there may be another option now. I have > started work on a new set of modules called Bio::DB::EUtilities in > bioperl-live, intended as a back-end for NCBI database searches. It can be > used directly if needed though. You can use EPost/Elink to directly > retrieve the taxonIDs using the following script (pass a file containing the > protein/nucleotide primary ID on command line). 
The below retrieves > taxonid's using protein GI's: > > > use Bio::DB::EUtilities; > my @ids; > > while (my $id = <>) { > chomp $id; > push @ids, $id; > } > > my $epost = Bio::DB::EUtilities->new( > -eutil => 'epost', > -db => 'protein', > -id => \@ids, > ); > > $epost->get_response; > > my $elink = Bio::DB::EUtilities->new( > -eutil => 'elink', > -cookie => $epost->next_cookie, > -db => 'taxonomy', > ); > > $elink->get_response; > > my @tax_ids = $elink->get_db_ids; > > > > Chris > > >> hi, >> I have downloaded the gi2taxonid file to get the taxonid for a GI >> number >> taken from a report as recommended here, but I don't know how to >> use the >> gi2taxonid file. >> Jason wrote in a previous post that you have to make a DB_File out of >> it, but I don't know how....and finally tie it to a hash.... >> Can anybody give me a hint how to use it..... my final goal is to get >> the taxonomy. >> >> thanks >> Hubert >> > > Christopher Fields > Postdoctoral Researcher - Switzer Lab > Dept. of Biochemistry > University of Illinois Urbana-Champaign > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > From MEC at stowers-institute.org Mon Jul 10 20:25:11 2006 From: MEC at stowers-institute.org (Cook, Malcolm) Date: Mon, 10 Jul 2006 19:25:11 -0500 Subject: [Bioperl-l] Bio::Tools::Fgenesh bug? and fix? Message-ID: I am finding the Bio::Tools::Fgenesh parser to incorrectly handle the feature coordinates on - strand predictions. In particular, start & end are deliberately reversed if the strand is '-'. I guess this was a holdover from Genscan.pm and wasn't really tested !?!?! Or, perhaps fgenesh v 2.4 which I am running has different output in this respect compared to the version 2.0, against which this module was written. Or, perhaps my understanding is blotto (known to happen). Does anyone know for sure? If I comment out selected lines... 
    # if($predobj->strand() == 1) {
        $predobj->start($start);
        $predobj->end($end);
    # } else {
    #     $predobj->end($start);
    #     $predobj->start($end);
    # }

... then the GFF produced by my naive fgenesh2gff script below is correct (at least w.r.t. strand and coordinates - GFF compatibility purists might wince).

Should I commit this change to head?

Malcolm Cook
Database Applications Manager, Bioinformatics
Stowers Institute for Medical Research

#!/usr/bin/env perl

# fgenesh2gff
# PURPOSE: parse fgenesh output into gff
# USAGE: fgenesh fish somefish.dna | fgenesh2gff > somefish.dna.fgenesh.gff

use strict;
use warnings;
use Bio::Tools::Fgenesh;
use Bio::Tools::GFF;    # the GFF writer used below

# Remaining options should name files to process, but if none, process
# standard input:
@ARGV = ('-') unless @ARGV;
my $fgenesh = Bio::Tools::Fgenesh->new(-fh => \*ARGV);

my $featureout = Bio::Tools::GFF->new(
    -gff_version => 2, #whatever ;)
);
my $IDNUM = 0;
while (my $gene = $fgenesh->next_prediction()) {
    my $ID = "fgenesh" . ++$IDNUM;
    # tag the gene with an ID and write it out
    $gene->add_tag_value('ID', $ID);
    $featureout->write_feature($gene);
    # each exon points back to its gene and shares its sequence id
    foreach ($gene->exons()) {
        $_->add_tag_value('Parent', $ID);
        $_->seq_id($gene->seq_id);
        $featureout->write_feature($_);
    }
}
$fgenesh->close();

exit 0;

From chris at dwan.org Mon Jul 10 22:06:41 2006
From: chris at dwan.org (Christopher Dwan)
Date: Mon, 10 Jul 2006 22:06:41 -0400
Subject: [Bioperl-l] Bio::Tools::Fgenesh bug? and fix?
In-Reply-To:
References:
Message-ID:

I'm not surprised that there are parts that don't work right. I copied genscan.pm and made the absolute minimal changes required to get what I needed working. Haven't touched it since.

Please feel free to do what needs to be done, and sorry about the mess.

-Chris Dwan

On Jul 10, 2006, at 8:25 PM, Cook, Malcolm wrote:

> I am finding the Bio::Tools::Fgenesh parser to incorrectly handle the
> feature coordinates on - strand predictions.
>
> In particular, start & end are deliberately reversed if the strand is
> '-'.
> > I guess this was a holdover from Genscan.pm and wasn't really tested > !?!?! > > Or, perhaps fgenesh v 2.4 which I am running has different output in > this respect compared to the version 2.0, against which this module > was > written. > > Or, perhaps my understanding is blotto (known to happen). > > Does anyone know for sure? > > If I comment out selected lines... > > # if($predobj->strand() == 1) { > $predobj->start($start); > $predobj->end($end); > # } else { > # $predobj->end($start); > # $predobj->start($end); > # } > > ... then GFF produced by my naive fgenesh2gff script below is correct > (at least w.r.t. strand and coordinates - GFF compatibility purists > might wince). > > Should I commit this change to head? > > > Malcolm Cook > Database Applications Manager, Bioinformatics > Stowers Institute for Medical Research > > > > #!/usr/bin/env perl > > # fgenesh2gff > # PURPOSE: parse fgenesh output into gff > # USAGE: fgenesh fish somefish.dna | fgenesh2gff > > somefish.dna.fgenesh.gff > > use strict; > use warnings; > use Bio::Tools::Fgenesh; > use Bio::FeatureIO; > > # Remaining options should name files to process, but if none, process > # standard input: > @ARGV = ('-') unless @ARGV; > my $fgenesh = Bio::Tools::Fgenesh->new(-fh => \*ARGV); > > my $featureout = new Bio::Tools::GFF( > -gff_version => 2, #whatever ;) > ); > my $IDNUM = 0; > while (my $gene = $fgenesh->next_prediction()) { > my $ID = "fgenesh" . ++ $IDNUM; > $gene->add_tag_value('ID', $ID); > $featureout->write_feature($gene); > foreach ($gene->exons()) { > $_->add_tag_value('Parent', $ID); > $_->seq_id($gene->seq_id); > $featureout->write_feature($_); > } > } > $fgenesh->close(); > > exit 0; > From rvosa at sfu.ca Tue Jul 11 04:58:46 2006 From: rvosa at sfu.ca (Rutger Vos) Date: Tue, 11 Jul 2006 01:58:46 -0700 Subject: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading? 
Message-ID: <44B36846.8070103@sfu.ca>

Dear all,

would it be possible to overload Bio::Root::RootI's 'throw' method to accept an additional, optional (positional) argument to define the exception class, e.g. using Exception::Class:

    # ...somewhere ...

    sub makefh {
        my ( $self, $filename ) = @_;
        open my $fh, '<', $filename
            or $self->throw("Can't open file: $!", 'Bio::Exceptions::FileIO'); # NOTE second argument
        return $fh;
    }

    # ...somewhere else
    my $fh;
    eval {
        $fh = $obj->makefh('data.txt');
    };
    if ( $@ and $@->isa('Bio::Exceptions::FileIO') ) {
        # something's wrong with the file?
    }

--
++++++++++++++++++++++++++++++++++++++++++++++++++++
Rutger Vos, PhD. candidate
Department of Biological Sciences
Simon Fraser University
8888 University Drive
Burnaby, BC, V5A1S6
Phone: 604-291-5625
Fax: 604-291-3496
Personal site: http://www.sfu.ca/~rvosa
FAB* lab: http://www.sfu.ca/~fabstar
Bio::Phylo: http://search.cpan.org/~rvosa/Bio-Phylo/
++++++++++++++++++++++++++++++++++++++++++++++++++++

From khoiwal_tara at yahoo.co.in Tue Jul 11 08:19:17 2006
From: khoiwal_tara at yahoo.co.in (Khoiwal Tara)
Date: Tue, 11 Jul 2006 05:19:17 -0700 (PDT)
Subject: [Bioperl-l] Need help in needle parser
Message-ID: <20060711121918.64429.qmail@web8513.mail.in.yahoo.com>

Hi,
I want to parse the output of needle. I tried but wasn't able to get the expected output.

My code is as follows:

#!/usr/local/bin/perl

use strict;
use warnings;
use Bio::AlignIO;
my $needleReport = $ARGV[0];

my $in = new Bio::AlignIO(-format => 'emboss',-file =>$needleReport);

while(my $align = $in->next_aln()){
    print "Alignment Length:".$align->length()."\n";
    print "Percentage Identity:".$align->percentage_identity()."\n";
    print "Consensus string:".$align->consensus_string()."\n";
    print "Number of sequences:".$align->no_sequence()."\n";
    print "Number of residues:".$align->no_residues()."\n";
}

But it doesn't go inside the while loop.
Please help me.
How can I find the alignment position of the query sequence on the target sequence from the needle output?
Where can I find good documentation on the needle parser and its usage?
Is there a good document on BioPerl for beginners?

Regards,
Tara Khoiwal.

From cjfields at uiuc.edu Tue Jul 11 08:59:07 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Tue, 11 Jul 2006 07:59:07 -0500
Subject: [Bioperl-l] Need help in needle parser
In-Reply-To: <20060711121918.64429.qmail@web8513.mail.in.yahoo.com>
References: <20060711121918.64429.qmail@web8513.mail.in.yahoo.com>
Message-ID: <250EEE60-48D0-4844-B0C0-13E17E60965C@uiuc.edu>

perldoc Bio::AlignIO
perldoc Bio::AlignIO::needle
http://www.bioperl.org/wiki/FAQ
http://www.bioperl.org/wiki/HOWTO:Beginners
http://www.bioperl.org/wiki/Bptutorial.pl
http://www.catb.org/~esr/faqs/smart-questions.html

Google is your friend!

If it isn't entering the while loop, there are two possibilities:

1) Something is wrong with the file
2) The parser isn't reading the file correctly

In order to know which, we will need to see the alignment itself.

Chris

On Jul 11, 2006, at 7:19 AM, Khoiwal Tara wrote:

> Hi,
> I want to parse the output of needle.I tried but didn't able to
> get expected output.
>
> My code is as follows:
>
> #!/usr/local/bin/perl
>
> use strict;
> use warnings;
> use Bio::AlignIO;
> my $needleReport = $ARGV[0];
>
> my $in = new Bio::AlignIO(-format => 'emboss',-file =>$needleReport);
>
> while(my $align = $in->next_aln()){
> print "Alignment Length:".$align->length()."\n";
> print "Percentage Identity:".$align->percentage_identity()."\n";
> print "Consensus string:".$align->consensus_string()."\n";
> print "Number of sequences:".$align->no_sequence()."\n";
> print "Number of residues:".$align->no_residues()."\n";
> }
>
> But it doesn't go inside the while loop.
> Pls help me.
> How to find the alignment position for the query sequence on the > target sequence from the needle output? > Where can i find the good documentation on needle parser and its > usage? > Good document on bioperl for beginners. > > Regards, > Tara Khoiwal. > > > --------------------------------- > Sneak preview the all-new Yahoo.com. It's not radically different. > Just radically better. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From cjfields at uiuc.edu Tue Jul 11 09:13:23 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 11 Jul 2006 08:13:23 -0500 Subject: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading? In-Reply-To: <44B36846.8070103@sfu.ca> References: <44B36846.8070103@sfu.ca> Message-ID: <954D76EB-F8D3-4652-957B-6ED4BBAEE209@uiuc.edu> I suppose you could; Bio::Root::Root does that using Error.pm (if it is installed). It almost sounds like what Bio::Root::Root does is what you want, but you want a little more information when exceptions are thrown maybe? from perldoc Bio::Root::Root: ... # Alternatively, using the new typed exception syntax in the throw() call: $obj->throw( -class => 'Bio::Root::BadParameter', -text => "Can not open file $file", -value => $file); ... Typed Exception Syntax The typed exception syntax of throw() has the advantage of plainly indicating the nature of the trouble, since the name of the class is included in the title of the exception output. To take advantage of this capability, you must specify arguments as named parameters in the throw() call. Here are the parameters: -class name of the class of the exception. 
This should be one of the classes defined in Bio::Root::Exception, or a custom error of yours that extends one of the exceptions defined in Bio::Root::Exception. -text a sensible message for the exception -value the value causing the exception or $!, if appropriate. Note that Bio::Root::Exception does not need to be imported into your module (or script) namespace in order to throw exceptions via Bio::Root::Root::throw(), since Bio::Root::Root imports it. Chris On Jul 11, 2006, at 3:58 AM, Rutger Vos wrote: > Dear all, > > would it be possible to overload Bio::Root::RootI's 'throw' method to > accept an additional, optional (positional) argument to define the > exception class, e.g. using Exception::Class: > > # ...somewhere ... > > sub makefh { > my ( $self, $filename ) = @_; > open my $fh, '<' $filename or $self->throw("Can't open file: $!", > 'Bio::Exceptions::FileIO'); # NOTE second argument > return $fh; > } > > #.... somewhere else > my $fh; > eval { > $fh = $obj->makefh( 'data.txt'); > } > if ( $@ and $@->isa('Bio::Exceptions::FileIO') ) { > # something's wrong with the file? > } > > -- > ++++++++++++++++++++++++++++++++++++++++++++++++++++ > Rutger Vos, PhD. candidate > Department of Biological Sciences > Simon Fraser University > 8888 University Drive > Burnaby, BC, V5A1S6 > Phone: 604-291-5625 > Fax: 604-291-3496 > Personal site: http://www.sfu.ca/~rvosa > FAB* lab: http://www.sfu.ca/~fabstar > Bio::Phylo: http://search.cpan.org/~rvosa/Bio-Phylo/ > ++++++++++++++++++++++++++++++++++++++++++++++++++++ > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. 
Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From cjfields at uiuc.edu Tue Jul 11 11:25:32 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 11 Jul 2006 10:25:32 -0500 Subject: [Bioperl-l] Need help in needle parser In-Reply-To: <20060711132601.46368.qmail@web8510.mail.in.yahoo.com> Message-ID: <001601c6a4fe$3ff7ca10$15327e82@pyrimidine> There are a few odd things about the data you sent; the FASTA files aren't FASTA format (they are raw) and the needle output doesn't have sequence names. You could try running these through needle with descriptors to see if that helps, but. it is very likely my option #2 (i.e. the parser doesn't recognize the format). There is a thread on the mail list about this issue: http://thread.gmane.org/gmane.comp.lang.perl.bio.general/8926/focus=8935 Basically, it looks like the needle output has changed dramatically in EMBOSS v3. Jason's suggested options from the above thread (as well as mine): . I think the "emboss" format changed in 3.0.0 solutions: a) fix the AlignIO::emboss parser to handle both flavors (old and new) b) have it output MSF format and use AlignIO::msf. . So, as a workaround, use MSF output. I won't have time to look at this anytime soon as I'm busy at $job and getting ready for a summer institute; I'll submit this as a bug to see if someone else can tackle it before I get back in early August. Chris _____ From: Khoiwal Tara [mailto:khoiwal_tara at yahoo.co.in] Sent: Tuesday, July 11, 2006 8:26 AM To: Chris Fields Subject: Re: [Bioperl-l] Need help in needle parser I am sending my testing data to you. I have two fasta files "GenomicSeq.fasta" and "TranscriptSeq.fasta". I ran needle on these files as follows: $ needle GenomicSeq.fasta TranscriptSeq.fasta outfile.needle So the out put of the needle will get stored in outfile.needle. I am attaching the output file also. Please check it and tell me if it has any problem. Is my output file is correct? Thanks and Regards, Tara. 
Chris Fields wrote: perldoc Bio::AlignIO perldoc Bio::AlignIO::needle http://www.bioperl.org/wiki/FAQ http://www.bioperl.org/wiki/HOWTO:Beginners http://www.bioperl.org/wiki/Bptutorial.pl http://www.catb.org/~esr/faqs/smart-questions.html Google is your friend! If it isn't entering the while loop, there are two possibilities: 1) Something is wrong with the file 2) The parser isn't reading the file correctly In order to know which, we will need to see the alignment itself. Chris On Jul 11, 2006, at 7:19 AM, Khoiwal Tara wrote: > Hi, > I want to parse the output of needle.I tried but didn't able to > get expected output. > > My code is as follows: > > #!/usr/local/bin/perl > > use strict; > use warnings; > use Bio::AlignIO; > my $needleReport = $ARGV[0]; > > my $in = new Bio::AlignIO(-format => 'emboss',-file =>$needleReport); > > while(my $align = $in->next_aln()){ > print "Alignment Length:".$align->length()."\n"; > print "Percentage Identity:".$align->percentage_identity()."\n"; > print "Consensus string:".$align->consensus_string()."\n"; > print "Number of sequences:".$align->no_sequence()."\n"; > print "Number of residues:".$align->no_residues()."\n"; > } > > But it doesn't go inside the while loop. > Pls help me. > How to find the alignment position for the query sequence on the > target sequence from the needle output? > Where can i find the good documentation on needle parser and its > usage? > Good document on bioperl for beginners. > > Regards, > Tara Khoiwal. > > > --------------------------------- > Sneak preview the all-new Yahoo.com. It's not radically different. > Just radically better. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign _____ Do you Yahoo!? Next-gen email? Have it all with the all-new Yahoo! 
Mail Beta. From MEC at stowers-institute.org Tue Jul 11 11:56:40 2006 From: MEC at stowers-institute.org (Cook, Malcolm) Date: Tue, 11 Jul 2006 10:56:40 -0500 Subject: [Bioperl-l] Bio::Tools::Fgenesh bug? and fix? Message-ID: Got it. Commits made. Thanks for the history lesson. Cheers, Malcolm Cook >-----Original Message----- >From: Christopher Dwan [mailto:chris at dwan.org] >Sent: Monday, July 10, 2006 9:07 PM >To: Cook, Malcolm >Cc: bioperl-l >Subject: Re: Bio::Tools::Fgenesh bug? and fix? > > >I'm not surprised that there are parts that don't work right, I coped >genscan.pm and made the absolute minimal changes required to get what >I needed working. Haven't touched it since. > >Please feel free to do what needs to be done, and sorry about the mess. > >-Chris Dwan > >On Jul 10, 2006, at 8:25 PM, Cook, Malcolm wrote: > >> I am finding the Bio::Tools::Fgenesh parser to incorrectly handle the >> feature coordinates on - strand predictions. >> >> In particular, start & end are deliberately reversed if the strand is >> '-'. >> >> I guess this was a holdover from Genscan.pm and wasn't really tested >> !?!?! >> >> Or, perhaps fgenesh v 2.4 which I am running has different output in >> this respect compared to the version 2.0, against which this module >> was >> written. >> >> Or, perhaps my understanding is blotto (known to happen). >> >> Does anyone know for sure? >> >> If I comment out selected lines... >> >> # if($predobj->strand() == 1) { >> $predobj->start($start); >> $predobj->end($end); >> # } else { >> # $predobj->end($start); >> # $predobj->start($end); >> # } >> >> ... then GFF produced by my naive fgenesh2gff script below is correct >> (at least w.r.t. strand and coordinates - GFF compatibility purists >> might wince). >> >> Should I commit this change to head? 
>> >> >> Malcolm Cook >> Database Applications Manager, Bioinformatics >> Stowers Institute for Medical Research >> >> >> >> #!/usr/bin/env perl >> >> # fgenesh2gff >> # PURPOSE: parse fgenesh output into gff >> # USAGE: fgenesh fish somefish.dna | fgenesh2gff > >> somefish.dna.fgenesh.gff >> >> use strict; >> use warnings; >> use Bio::Tools::Fgenesh; >> use Bio::FeatureIO; >> >> # Remaining options should name files to process, but if >none, process >> # standard input: >> @ARGV = ('-') unless @ARGV; >> my $fgenesh = Bio::Tools::Fgenesh->new(-fh => \*ARGV); >> >> my $featureout = new Bio::Tools::GFF( >> -gff_version => 2, #whatever ;) >> ); >> my $IDNUM = 0; >> while (my $gene = $fgenesh->next_prediction()) { >> my $ID = "fgenesh" . ++ $IDNUM; >> $gene->add_tag_value('ID', $ID); >> $featureout->write_feature($gene); >> foreach ($gene->exons()) { >> $_->add_tag_value('Parent', $ID); >> $_->seq_id($gene->seq_id); >> $featureout->write_feature($_); >> } >> } >> $fgenesh->close(); >> >> exit 0; >> > > From cjfields at uiuc.edu Tue Jul 11 12:04:49 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 11 Jul 2006 11:04:49 -0500 Subject: [Bioperl-l] Need help in needle parser In-Reply-To: <20060711132601.46368.qmail@web8510.mail.in.yahoo.com> Message-ID: <000101c6a503$bd982eb0$15327e82@pyrimidine> Okay, I take that back. Bio::AlignIO::emboss does parse EMBOSS v3 needle output! The fact that it doesn't parse your alignment is b/c there are no sequence descriptors in the file for the sequences (your FASTA files didn't have them either). Modifying the file to contain descriptions for both the alignment and the 'Aligned_sequences:' section gets your test alignment to work. I consider this a feature and not a bug; how would others be able to distinguish between numerous sequences in an alignment w/o identifiers of some sort? It shouldn't just toss this out without a warning however; I'll try to add a little exception handling. 
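
A pre-check along the following lines could flag such a file before it reaches the parser. This is only a minimal sketch: it assumes the report lists each sequence on a header line of the form "# 1: NAME" under "# Aligned_sequences:", and unnamed_sequences is a hypothetical helper written for illustration, not part of BioPerl.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not a BioPerl function): return the index numbers
# of sequences whose "# N: NAME" header line carries no name.
sub unnamed_sequences {
    my ($report) = @_;
    my @unnamed;
    for my $line (split /\n/, $report) {
        # Capture the index and whatever (possibly nothing) follows the colon.
        if ($line =~ /^#\s*(\d+):\s*(.*?)\s*$/) {
            push @unnamed, $1 if $2 eq '';
        }
    }
    return @unnamed;
}

# Assumed shape of a report header in which the first sequence has no name.
my $sample = <<'END';
# Aligned_sequences: 2
# 1:
# 2: TranscriptSeq
END

my @unnamed = unnamed_sequences($sample);
print "Unnamed sequence entries: @unnamed\n" if @unnamed;   # prints: Unnamed sequence entries: 1
```

Running the same check on the contents of outfile.needle before handing it to Bio::AlignIO would show whether missing identifiers are the reason next_aln() returns nothing.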
BTW, one line is incorrect in your script; it should be print "Number of sequences:".$align->no_sequences()."\n"; you have print "Number of sequences:".$align->no_sequence()."\n"; Chris _____ From: Khoiwal Tara [mailto:khoiwal_tara at yahoo.co.in] Sent: Tuesday, July 11, 2006 8:26 AM To: Chris Fields Subject: Re: [Bioperl-l] Need help in needle parser I am sending my testing data to you. I have two fasta files "GenomicSeq.fasta" and "TranscriptSeq.fasta". I ran needle on these files as follows: $ needle GenomicSeq.fasta TranscriptSeq.fasta outfile.needle So the out put of the needle will get stored in outfile.needle. I am attaching the output file also. Please check it and tell me if it has any problem. Is my output file is correct? Thanks and Regards, Tara. Chris Fields wrote: perldoc Bio::AlignIO perldoc Bio::AlignIO::needle http://www.bioperl.org/wiki/FAQ http://www.bioperl.org/wiki/HOWTO:Beginners http://www.bioperl.org/wiki/Bptutorial.pl http://www.catb.org/~esr/faqs/smart-questions.html Google is your friend! If it isn't entering the while loop, there are two possibilities: 1) Something is wrong with the file 2) The parser isn't reading the file correctly In order to know which, we will need to see the alignment itself. Chris On Jul 11, 2006, at 7:19 AM, Khoiwal Tara wrote: > Hi, > I want to parse the output of needle.I tried but didn't able to > get expected output. > > My code is as follows: > > #!/usr/local/bin/perl > > use strict; > use warnings; > use Bio::AlignIO; > my $needleReport = $ARGV[0]; > > my $in = new Bio::AlignIO(-format => 'emboss',-file =>$needleReport); > > while(my $align = $in->next_aln()){ > print "Alignment Length:".$align->length()."\n"; > print "Percentage Identity:".$align->percentage_identity()."\n"; > print "Consensus string:".$align->consensus_string()."\n"; > print "Number of sequences:".$align->no_sequence()."\n"; > print "Number of residues:".$align->no_residues()."\n"; > } > > But it doesn't go inside the while loop. 
> Pls help me. > How to find the alignment position for the query sequence on the > target sequence from the needle output? > Where can i find the good documentation on needle parser and its > usage? > Good document on bioperl for beginners. > > Regards, > Tara Khoiwal. > > > --------------------------------- > Sneak preview the all-new Yahoo.com. It's not radically different. > Just radically better. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign _____ Do you Yahoo!? Next-gen email? Have it all with the all-new Yahoo! Mail Beta. From wrp at virginia.edu Tue Jul 11 14:05:29 2006 From: wrp at virginia.edu (William R. Pearson) Date: Tue, 11 Jul 2006 14:05:29 -0400 Subject: [Bioperl-l] Course announcement: CSHL Computational Genomics Course In-Reply-To: References: Message-ID: <45D80228-35DE-44B0-9E11-48EC76CE0DE7@virginia.edu> Course announcement - Application deadline, July 15, 2006 ================================================================ Cold Spring Harbor COMPUTATIONAL & COMPARATIVE GENOMICS November 8 - 14, 2006 Application Deadline: July 15, 2006 INSTRUCTORS: Pearson, William, Ph.D., University of Virginia, Charlottesville, VA Smith, Randall, Ph.D., SmithKline Beecham Pharmaceuticals, King of Prussia, PA Beyond BLAST and FASTA - Alignment: from proteins to genomes - This course presents a comprehensive overview of the theory and practice of computational methods for extracting the maximum amount of information from protein and DNA sequence similarity through sequence database searches, statistical analysis, and multiple sequence alignment, and genome scale alignment. 
Additional topics include gene finding, identifying signals in unaligned sequences, and integration of genetic and sequence information in biological databases. The course combines lectures with hands-on exercises; students are encouraged to pose challenging sequence analysis problems using their own data. The course makes extensive use of local WWW pages to present problem sets and the computing tools to solve them. Students use Windows and Mac workstations attached to a UNIX server; participants should be comfortable using the Unix operating system and a Unix text editor.

The course is designed for biologists seeking advanced training in biological sequence analysis, computational biology core resource directors and staff, and for scientists in other disciplines, such as computer science, who wish to survey current research problems in biological sequence analysis and comparative genomics.

The primary focus of the Computational and Comparative Genomics Course is the theory and practice of algorithms used in computational biology, with the goal of using current methods more effectively and developing new algorithms. Cold Spring Harbor also offers a "Programming for Biology" course, which focuses more on software development.

Over the past few years, the course has been expanded to cover more algorithms and exercises on comparative genomics and genome databases.

For additional information and the lecture schedule and problem sets for the 2005 course, see:

http://fasta.bioch.virginia.edu/cshl05

================================================================
To apply to the course, fill out the form at:

http://meetings.cshl.edu/courses/courseapplication.asp
================================================================

From rvosa at sfu.ca Tue Jul 11 14:58:25 2006
From: rvosa at sfu.ca (Rutger Vos)
Date: Tue, 11 Jul 2006 11:58:25 -0700
Subject: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading?
In-Reply-To: <954D76EB-F8D3-4652-957B-6ED4BBAEE209@uiuc.edu> References: <44B36846.8070103@sfu.ca> <954D76EB-F8D3-4652-957B-6ED4BBAEE209@uiuc.edu> Message-ID: <44B3F4D1.7090804@sfu.ca> I must have overlooked this. I think it does what I want. So could I do something like: $obj->thow_not_implemented( -class => 'Bio::Root::NotImplemented' ); ...in interfaces? Chris Fields wrote: > I suppose you could; Bio::Root::Root does that using Error.pm (if it > is installed). It almost sounds like what Bio::Root::Root does is > what you want, but you want a little more information when exceptions > are thrown maybe? > > from perldoc Bio::Root::Root: > > ... > # Alternatively, using the new typed exception syntax in > the throw() call: > > $obj->throw( -class => 'Bio::Root::BadParameter', > -text => "Can not open file $file", > -value => $file); > ... > > Typed Exception Syntax > > The typed exception syntax of throw() has the advantage of > plainly > indicating the nature of the trouble, since the name of the > class is > included in the title of the exception output. > > To take advantage of this capability, you must specify > arguments as > named parameters in the throw() call. Here are the parameters: > > -class > name of the class of the exception. This should be one > of the > classes defined in Bio::Root::Exception, or a custom > error of yours > that extends one of the exceptions defined in > Bio::Root::Exception. > > -text > a sensible message for the exception > > -value > the value causing the exception or $!, if appropriate. > > Note that Bio::Root::Exception does not need to be imported > into your > module (or script) namespace in order to throw exceptions via > Bio::Root::Root::throw(), since Bio::Root::Root imports it. 
> > > Chris > > On Jul 11, 2006, at 3:58 AM, Rutger Vos wrote: > > >> Dear all, >> >> would it be possible to overload Bio::Root::RootI's 'throw' method to >> accept an additional, optional (positional) argument to define the >> exception class, e.g. using Exception::Class: >> >> # ...somewhere ... >> >> sub makefh { >> my ( $self, $filename ) = @_; >> open my $fh, '<' $filename or $self->throw("Can't open file: $!", >> 'Bio::Exceptions::FileIO'); # NOTE second argument >> return $fh; >> } >> >> #.... somewhere else >> my $fh; >> eval { >> $fh = $obj->makefh( 'data.txt'); >> } >> if ( $@ and $@->isa('Bio::Exceptions::FileIO') ) { >> # something's wrong with the file? >> } >> >> -- >> ++++++++++++++++++++++++++++++++++++++++++++++++++++ >> Rutger Vos, PhD. candidate >> Department of Biological Sciences >> Simon Fraser University >> 8888 University Drive >> Burnaby, BC, V5A1S6 >> Phone: 604-291-5625 >> Fax: 604-291-3496 >> Personal site: http://www.sfu.ca/~rvosa >> FAB* lab: http://www.sfu.ca/~fabstar >> Bio::Phylo: http://search.cpan.org/~rvosa/Bio-Phylo/ >> ++++++++++++++++++++++++++++++++++++++++++++++++++++ >> >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> > > Christopher Fields > Postdoctoral Researcher > Lab of Dr. Robert Switzer > Dept of Biochemistry > University of Illinois Urbana-Champaign > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > -- ++++++++++++++++++++++++++++++++++++++++++++++++++++ Rutger Vos, PhD. 
candidate Department of Biological Sciences Simon Fraser University 8888 University Drive Burnaby, BC, V5A1S6 Phone: 604-291-5625 Fax: 604-291-3496 Personal site: http://www.sfu.ca/~rvosa FAB* lab: http://www.sfu.ca/~fabstar Bio::Phylo: http://search.cpan.org/~rvosa/Bio-Phylo/ ++++++++++++++++++++++++++++++++++++++++++++++++++++ From hlapp at gmx.net Tue Jul 11 15:05:03 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Tue, 11 Jul 2006 15:05:03 -0400 Subject: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading? In-Reply-To: <44B36846.8070103@sfu.ca> References: <44B36846.8070103@sfu.ca> Message-ID: <18C839F9-B099-4A4A-9957-2BF4EB7CFB85@gmx.net> I think it does this already, except that I believe you need to create the exception object and initialize with the message upfront. Steve, can you comment? Is this at least somewhat right? -hilmar On Jul 11, 2006, at 4:58 AM, Rutger Vos wrote: > Dear all, > > would it be possible to overload Bio::Root::RootI's 'throw' method to > accept an additional, optional (positional) argument to define the > exception class, e.g. using Exception::Class: > > # ...somewhere ... > > sub makefh { > my ( $self, $filename ) = @_; > open my $fh, '<' $filename or $self->throw("Can't open file: $!", > 'Bio::Exceptions::FileIO'); # NOTE second argument > return $fh; > } > > #.... somewhere else > my $fh; > eval { > $fh = $obj->makefh( 'data.txt'); > } > if ( $@ and $@->isa('Bio::Exceptions::FileIO') ) { > # something's wrong with the file? > } > > -- > ++++++++++++++++++++++++++++++++++++++++++++++++++++ > Rutger Vos, PhD. 
candidate > Department of Biological Sciences > Simon Fraser University > 8888 University Drive > Burnaby, BC, V5A1S6 > Phone: 604-291-5625 > Fax: 604-291-3496 > Personal site: http://www.sfu.ca/~rvosa > FAB* lab: http://www.sfu.ca/~fabstar > Bio::Phylo: http://search.cpan.org/~rvosa/Bio-Phylo/ > ++++++++++++++++++++++++++++++++++++++++++++++++++++ > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at gmx.net Tue Jul 11 15:05:54 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Tue, 11 Jul 2006 15:05:54 -0400 Subject: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading? In-Reply-To: <954D76EB-F8D3-4652-957B-6ED4BBAEE209@uiuc.edu> References: <44B36846.8070103@sfu.ca> <954D76EB-F8D3-4652-957B-6ED4BBAEE209@uiuc.edu> Message-ID: <297D4770-A963-4039-8D90-987CC570BA94@gmx.net> Alright - well spotted Chris. This is what I was looking for. On Jul 11, 2006, at 9:13 AM, Chris Fields wrote: > I suppose you could; Bio::Root::Root does that using Error.pm (if it > is installed). It almost sounds like what Bio::Root::Root does is > what you want, but you want a little more information when exceptions > are thrown maybe? > > from perldoc Bio::Root::Root: > > ... > # Alternatively, using the new typed exception syntax in > the throw() call: > > $obj->throw( -class => 'Bio::Root::BadParameter', > -text => "Can not open file $file", > -value => $file); > ... > > Typed Exception Syntax > > The typed exception syntax of throw() has the advantage of > plainly > indicating the nature of the trouble, since the name of the > class is > included in the title of the exception output. 
> > To take advantage of this capability, you must specify > arguments as > named parameters in the throw() call. Here are the parameters: > > -class > name of the class of the exception. This should be one > of the > classes defined in Bio::Root::Exception, or a custom > error of yours > that extends one of the exceptions defined in > Bio::Root::Exception. > > -text > a sensible message for the exception > > -value > the value causing the exception or $!, if appropriate. > > Note that Bio::Root::Exception does not need to be imported > into your > module (or script) namespace in order to throw exceptions via > Bio::Root::Root::throw(), since Bio::Root::Root imports it. > > > Chris > > On Jul 11, 2006, at 3:58 AM, Rutger Vos wrote: > >> Dear all, >> >> would it be possible to overload Bio::Root::RootI's 'throw' method to >> accept an additional, optional (positional) argument to define the >> exception class, e.g. using Exception::Class: >> >> # ...somewhere ... >> >> sub makefh { >> my ( $self, $filename ) = @_; >> open my $fh, '<' $filename or $self->throw("Can't open file: $!", >> 'Bio::Exceptions::FileIO'); # NOTE second argument >> return $fh; >> } >> >> #.... somewhere else >> my $fh; >> eval { >> $fh = $obj->makefh( 'data.txt'); >> } >> if ( $@ and $@->isa('Bio::Exceptions::FileIO') ) { >> # something's wrong with the file? >> } >> >> -- >> ++++++++++++++++++++++++++++++++++++++++++++++++++++ >> Rutger Vos, PhD. 
candidate >> Department of Biological Sciences >> Simon Fraser University >> 8888 University Drive >> Burnaby, BC, V5A1S6 >> Phone: 604-291-5625 >> Fax: 604-291-3496 >> Personal site: http://www.sfu.ca/~rvosa >> FAB* lab: http://www.sfu.ca/~fabstar >> Bio::Phylo: http://search.cpan.org/~rvosa/Bio-Phylo/ >> ++++++++++++++++++++++++++++++++++++++++++++++++++++ >> >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > > Christopher Fields > Postdoctoral Researcher > Lab of Dr. Robert Switzer > Dept of Biochemistry > University of Illinois Urbana-Champaign > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From cjfields at uiuc.edu Tue Jul 11 16:42:35 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 11 Jul 2006 15:42:35 -0500 Subject: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading? In-Reply-To: <44B3F4D1.7090804@sfu.ca> Message-ID: <000b01c6a52a$8c77f8c0$15327e82@pyrimidine> Bio::Root::Root doesn't overload throw_not_implemented from Bio::Root::RootI; from the comments looks like Steve C and Ewan B couldn't work out some of the Error.pm issues. Judging by the POD for Bio::Root::RootI, throw_not_implemented doesn't accept arguments; it throws a Bio::Root::NotImplemented exception automatically. Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Rutger Vos > Sent: Tuesday, July 11, 2006 1:58 PM > To: Chris Fields > Cc: 'Bioperl List' > Subject: Re: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading? > > I must have overlooked this. 
I think it does what I want. So could I do > something like: > > $obj->thow_not_implemented( -class => 'Bio::Root::NotImplemented' ); > > ...in interfaces? > > Chris Fields wrote: > > I suppose you could; Bio::Root::Root does that using Error.pm (if it > > is installed). It almost sounds like what Bio::Root::Root does is > > what you want, but you want a little more information when exceptions > > are thrown maybe? > > > > from perldoc Bio::Root::Root: > > > > ... > > # Alternatively, using the new typed exception syntax in > > the throw() call: > > > > $obj->throw( -class => 'Bio::Root::BadParameter', > > -text => "Can not open file $file", > > -value => $file); > > ... > > > > Typed Exception Syntax > > > > The typed exception syntax of throw() has the advantage of > > plainly > > indicating the nature of the trouble, since the name of the > > class is > > included in the title of the exception output. > > > > To take advantage of this capability, you must specify > > arguments as > > named parameters in the throw() call. Here are the parameters: > > > > -class > > name of the class of the exception. This should be one > > of the > > classes defined in Bio::Root::Exception, or a custom > > error of yours > > that extends one of the exceptions defined in > > Bio::Root::Exception. > > > > -text > > a sensible message for the exception > > > > -value > > the value causing the exception or $!, if appropriate. > > > > Note that Bio::Root::Exception does not need to be imported > > into your > > module (or script) namespace in order to throw exceptions via > > Bio::Root::Root::throw(), since Bio::Root::Root imports it. > > > > > > Chris > > > > On Jul 11, 2006, at 3:58 AM, Rutger Vos wrote: > > > > > >> Dear all, > >> > >> would it be possible to overload Bio::Root::RootI's 'throw' method to > >> accept an additional, optional (positional) argument to define the > >> exception class, e.g. using Exception::Class: > >> > >> # ...somewhere ... 
> >> > >> sub makefh { > >> my ( $self, $filename ) = @_; > >> open my $fh, '<' $filename or $self->throw("Can't open file: $!", > >> 'Bio::Exceptions::FileIO'); # NOTE second argument > >> return $fh; > >> } > >> > >> #.... somewhere else > >> my $fh; > >> eval { > >> $fh = $obj->makefh( 'data.txt'); > >> } > >> if ( $@ and $@->isa('Bio::Exceptions::FileIO') ) { > >> # something's wrong with the file? > >> } > >> > >> -- > >> ++++++++++++++++++++++++++++++++++++++++++++++++++++ > >> Rutger Vos, PhD. candidate > >> Department of Biological Sciences > >> Simon Fraser University > >> 8888 University Drive > >> Burnaby, BC, V5A1S6 > >> Phone: 604-291-5625 > >> Fax: 604-291-3496 > >> Personal site: http://www.sfu.ca/~rvosa > >> FAB* lab: http://www.sfu.ca/~fabstar > >> Bio::Phylo: http://search.cpan.org/~rvosa/Bio-Phylo/ > >> ++++++++++++++++++++++++++++++++++++++++++++++++++++ > >> > >> > >> _______________________________________________ > >> Bioperl-l mailing list > >> Bioperl-l at lists.open-bio.org > >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > >> > > > > Christopher Fields > > Postdoctoral Researcher > > Lab of Dr. Robert Switzer > > Dept of Biochemistry > > University of Illinois Urbana-Champaign > > > > > > > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > > > > > > > -- > ++++++++++++++++++++++++++++++++++++++++++++++++++++ > Rutger Vos, PhD. 
candidate > Department of Biological Sciences > Simon Fraser University > 8888 University Drive > Burnaby, BC, V5A1S6 > Phone: 604-291-5625 > Fax: 604-291-3496 > Personal site: http://www.sfu.ca/~rvosa > FAB* lab: http://www.sfu.ca/~fabstar > Bio::Phylo: http://search.cpan.org/~rvosa/Bio-Phylo/ > ++++++++++++++++++++++++++++++++++++++++++++++++++++ > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From frederick.partridge at st-johns.oxford.ac.uk Tue Jul 11 17:23:28 2006 From: frederick.partridge at st-johns.oxford.ac.uk (Frederick Partridge) Date: Tue, 11 Jul 2006 22:23:28 +0100 (BST) Subject: [Bioperl-l] Get nucleotide sequence when expecting protein from genpept Message-ID: I am trying to retrieve various protein sequences from genpept using get_Seq_by_acc. All of them work OK except one, T16005. If I try to retrieve it with a reduced program: #!/usr/bin/perl -w use strict; use Bio::Perl; use Bio::DB::GenPept; my $genpept = Bio::DB::GenPept->new; my $seq = $genpept->get_Seq_by_acc('T16005'); print $seq->seq(), "\n"; I get back a nucleotide sequence, which is another sequence at NCBI with the same accession number. (I thought these were meant to be unique? but evidently not.) I am using bioperl 1.5.1, perl 5.8.1, Mac OS 10.3 Could anyone help me to get this protein sequence with my program? Many thanks, Freddie Partridge University of Oxford From qfdong at iastate.edu Tue Jul 11 17:32:56 2006 From: qfdong at iastate.edu (Qunfeng) Date: Tue, 11 Jul 2006 16:32:56 -0500 Subject: [Bioperl-l] Get nucleotide sequence when expecting protein from genpept In-Reply-To: References: Message-ID: <6.1.2.0.2.20060711163128.08086570@qfdong.mail.iastate.edu> This particular protein record (acc#T16005) was imported from PIR. In other words, this is not an original GenBank protein record. When GenBank imports protein records from other databases, it keeps their original acc#. 
However, the gi# should be unique. Q At 04:23 PM 7/11/2006, Frederick Partridge wrote: >I am trying to retrieve various protein sequences from genpept using >get_Seq_by_acc. All of them work ok, except one T16005: > > >If I try and retrieve it with a reduced program: > > >#!usr/bin/perl -w > >use strict; > >use Bio::Perl; >use Bio::SeqIO; > >my $genpept = new Bio::DB::GenPept; > >my $seq = $genpept->get_Seq_by_acc('T16005'); > >print ($seq->seq(),'\n'); > > > >I get back a nucleotide sequence, which is another sequence at NCBI with >the same accession number. (I thought these were meant to be unique? but >evidently not.) > > >I am using bioperl 1.5.1, perl 5.8.1, Mac OS 10.3 > > >Could anyone help me to get this protein sequence with my program? > > >Many thanks, > > > >Freddie Partridge > >University of Oxford > > >_______________________________________________ >Bioperl-l mailing list >Bioperl-l at lists.open-bio.org >http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Tue Jul 11 18:05:09 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 11 Jul 2006 17:05:09 -0500 Subject: [Bioperl-l] Get nucleotide sequence when expecting protein fromgenpept In-Reply-To: Message-ID: <000001c6a536$141befb0$15327e82@pyrimidine> It's an imported PIR record, so there probably is no accession recorded in the database. I think NCBI uses a fallback to nucleotide if it can't find a particular accession via protein. Using the primary ID (the GI#, 7498730) works. Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Frederick Partridge > Sent: Tuesday, July 11, 2006 4:23 PM > To: bioperl-l at lists.open-bio.org > Subject: [Bioperl-l] Get nucleotide sequence when expecting protein > fromgenpept > > > > I am trying to retrieve various protein sequences from genpept using > get_Seq_by_acc. 
All of them work ok, except one T16005: > > > If I try and retrieve it with a reduced program: > > > #!usr/bin/perl -w > > use strict; > > use Bio::Perl; > use Bio::SeqIO; > > my $genpept = new Bio::DB::GenPept; > > my $seq = $genpept->get_Seq_by_acc('T16005'); > > print ($seq->seq(),'\n'); > > > > I get back a nucleotide sequence, which is another sequence at NCBI with > the same accession number. (I thought these were meant to be unique? but > evidently not.) > > > I am using bioperl 1.5.1, perl 5.8.1, Mac OS 10.3 > > > Could anyone help me to get this protein sequence with my program? > > > Many thanks, > > > > Freddie Partridge > > University of Oxford > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Tue Jul 11 18:47:38 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 11 Jul 2006 17:47:38 -0500 Subject: [Bioperl-l] Get nucleotide sequence when expecting proteinfromgenpept In-Reply-To: <000001c6a536$141befb0$15327e82@pyrimidine> Message-ID: <000201c6a53c$03970ed0$15327e82@pyrimidine> Okay, now try this: use Bio::DB::GenPept; use Bio::SeqIO; my $factory = Bio::DB::GenPept->new(-format => 'fasta'); my $seqin = $factory->get_Stream_by_acc('T16005'); my $seqout = Bio::SeqIO->new(-fh => \*STDOUT, -format => 'fasta'); while (my $seq = $seqin->next_seq) { $seqout->write_seq($seq); } This returns both the nucleotide sequence and the correct protein sequence; the protein was returned second for some reason, so get_Seq_by_acc misses it while get_Stream_by_acc doesn't. I have notified NCBI about this issue, but they will likely just tell me to use the GI number for searches as they are unique. Probably a good warning for anyone using accessions for all their work (I use the GI myself). 
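Since the stream can hand back a nucleotide record ahead of the protein one, a possible workaround is to walk the stream and keep the first record whose alphabet() reports 'protein'. The following is only a minimal sketch under stated assumptions: first_protein() and Stub::Seq are hypothetical names introduced here, and Stub::Seq is a stand-in so the example runs offline without BioPerl. It also assumes the alphabet reported for each record is reliable, which may not hold for every record.

```perl
use strict;
use warnings;

# Return the first sequence object whose alphabet() is 'protein', or
# nothing if the stream runs dry. $next is a code reference that yields
# one sequence object per call and undef when no records are left.
# (first_protein is a hypothetical helper, not a BioPerl function.)
sub first_protein {
    my ($next) = @_;
    while ( defined( my $seq = $next->() ) ) {
        return $seq if $seq->alphabet eq 'protein';
    }
    return;
}

# Hypothetical stand-in for a sequence object, so the sketch runs offline.
package Stub::Seq;
sub new      { my ( $class, %a ) = @_; return bless {%a}, $class }
sub alphabet { $_[0]{alphabet} }
sub seq      { $_[0]{seq} }

package main;

# Two fake records in the order described above: nucleotide first.
my @records = (
    Stub::Seq->new( alphabet => 'dna',     seq => 'ACGTACGT' ),
    Stub::Seq->new( alphabet => 'protein', seq => 'MKVLAAGI' ),
);

my $hit = first_protein( sub { shift @records } );
print $hit->seq, "\n";
```

With a live stream from get_Stream_by_acc, the same helper could presumably be called as first_protein( sub { $seqin->next_seq } ). Looking the record up by GI number remains the simpler route.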
Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Chris Fields > Sent: Tuesday, July 11, 2006 5:05 PM > To: 'Frederick Partridge'; bioperl-l at lists.open-bio.org > Subject: Re: [Bioperl-l] Get nucleotide sequence when expecting > proteinfromgenpept > > It's an imprted PIR record, so there probably is no accession recorded in > the database. I think NCBI uses a fallback to nucleotide if it can't find > a > particular accession via protein. Using the primary ID (the GI#, 7498730) > works. > > Chris > > > -----Original Message----- > > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > > bounces at lists.open-bio.org] On Behalf Of Frederick Partridge > > Sent: Tuesday, July 11, 2006 4:23 PM > > To: bioperl-l at lists.open-bio.org > > Subject: [Bioperl-l] Get nucleotide sequence when expecting protein > > fromgenpept > > > > > > > > I am trying to retrieve various protein sequences from genpept using > > get_Seq_by_acc. All of them work ok, except one T16005: > > > > > > If I try and retrieve it with a reduced program: > > > > > > #!usr/bin/perl -w > > > > use strict; > > > > use Bio::Perl; > > use Bio::SeqIO; > > > > my $genpept = new Bio::DB::GenPept; > > > > my $seq = $genpept->get_Seq_by_acc('T16005'); > > > > print ($seq->seq(),'\n'); > > > > > > > > I get back a nucleotide sequence, which is another sequence at NCBI with > > the same accession number. (I thought these were meant to be unique? but > > evidently not.) > > > > > > I am using bioperl 1.5.1, perl 5.8.1, Mac OS 10.3 > > > > > > Could anyone help me to get this protein sequence with my program? 
> > > > > > Many thanks, > > > > > > > > Freddie Partridge > > > > University of Oxford > > > > > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From Steve_Chervitz at affymetrix.com Tue Jul 11 20:21:16 2006 From: Steve_Chervitz at affymetrix.com (Steve Chervitz) Date: Tue, 11 Jul 2006 17:21:16 -0700 Subject: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading? In-Reply-To: <18C839F9-B099-4A4A-9957-2BF4EB7CFB85@gmx.net> Message-ID: The Bio::Root::Root object is rigged to use the Error.pm module if available, so you can throw and catch exception objects derived from Error. The motivation here was to provide a recommended path for folks that want to use more structured exception handling logic in their bioperl code. There are a number of pre-defined subclasses of exceptions that cover common problems (such as FileOpenException), but you can also define your own. See a list of the predefined exceptions as well as some how-to docs in the POD for Bio::Root::Exception: http://search.cpan.org/~birney/bioperl-1.4/Bio/Root/Exception.pm There's a bunch more info about Bioperl exception fun available from the bioperl distribution under the examples/root directory. See the README in that directory to get oriented. There are a number of demo scripts there, too. 
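The general shape of typed exceptions, as discussed in this thread, can be sketched in core Perl without Error.pm or BioPerl installed. This is an illustration only, not how Bio::Root::Root implements throw(): the My::Exception classes and the throw_typed(), makefh() and try_open() helpers are hypothetical names made up for the example, and only the -class, -text and -value parameter names are borrowed from the POD quoted earlier.

```perl
use strict;
use warnings;

# Hypothetical exception classes. They are stand-ins for illustration,
# not the classes defined in Bio::Root::Exception.
package My::Exception;
sub new {
    my ( $class, %args ) = @_;
    return bless { text => $args{-text}, value => $args{-value} }, $class;
}
sub text  { $_[0]{text} }
sub value { $_[0]{value} }

package My::Exception::FileOpen;
our @ISA = ('My::Exception');

package main;

# Die with an exception object of the class named by -class,
# falling back to the base class when -class is not given.
sub throw_typed {
    my (%args) = @_;
    my $class = delete $args{-class} || 'My::Exception';
    die $class->new(%args);
}

# Open a file for reading, or throw a typed exception.
sub makefh {
    my ($filename) = @_;
    open( my $fh, '<', $filename )
        or throw_typed(
            -class => 'My::Exception::FileOpen',
            -text  => "Can't open file: $!",
            -value => $filename,
        );
    return $fh;
}

# Try to open a file and return a short label describing what happened.
sub try_open {
    my ($filename) = @_;
    my $fh = eval { makefh($filename) };
    if ( my $err = $@ ) {
        if ( ref $err && $err->isa('My::Exception::FileOpen') ) {
            return 'file problem: ' . $err->value;
        }
        die $err;    # not an exception this code knows about, so rethrow
    }
    close $fh;
    return 'ok';
}

print try_open('no/such/file.txt'), "\n";
```

The caller decides what to do by checking the class of the object in $@, rather than by pattern-matching on the message text. That is the point of passing a class along with the message.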
Bio::Root::Root doesn't know anything about Exception::Class, but I see you can use it with Error.pm as described here: http://search.cpan.org/~drolsky/Exception-Class-1.23/lib/Exception/Class.pm#OTHER_EXCEPTION_MODULES_(try%2Fcatch_syntax) Cheers, Steve > From: Hilmar Lapp > Date: Tue, 11 Jul 2006 15:05:03 -0400 > To: Rutger Vos > Cc: Bioperl , Steve Chervitz > > Subject: Re: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading? > > I think it does this already, except that I believe you need to > create the exception object and initialize with the message upfront. > > Steve, can you comment? Is this at least somewhat right? > > -hilmar > > On Jul 11, 2006, at 4:58 AM, Rutger Vos wrote: > >> Dear all, >> >> would it be possible to overload Bio::Root::RootI's 'throw' method to >> accept an additional, optional (positional) argument to define the >> exception class, e.g. using Exception::Class: >> >> # ...somewhere ... >> >> sub makefh { >> my ( $self, $filename ) = @_; >> open my $fh, '<' $filename or $self->throw("Can't open file: $!", >> 'Bio::Exceptions::FileIO'); # NOTE second argument >> return $fh; >> } >> >> #.... somewhere else >> my $fh; >> eval { >> $fh = $obj->makefh( 'data.txt'); >> } >> if ( $@ and $@->isa('Bio::Exceptions::FileIO') ) { >> # something's wrong with the file? >> } >> >> -- >> ++++++++++++++++++++++++++++++++++++++++++++++++++++ >> Rutger Vos, PhD. 
candidate >> Department of Biological Sciences >> Simon Fraser University >> 8888 University Drive >> Burnaby, BC, V5A1S6 >> Phone: 604-291-5625 >> Fax: 604-291-3496 >> Personal site: http://www.sfu.ca/~rvosa >> FAB* lab: http://www.sfu.ca/~fabstar >> Bio::Phylo: http://search.cpan.org/~rvosa/Bio-Phylo/ >> ++++++++++++++++++++++++++++++++++++++++++++++++++++ >> >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> > > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > > > From Steve_Chervitz at affymetrix.com Tue Jul 11 21:07:06 2006 From: Steve_Chervitz at affymetrix.com (Steve_Chervitz) Date: Tue, 11 Jul 2006 18:07:06 -0700 Subject: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading? In-Reply-To: <000b01c6a52a$8c77f8c0$15327e82@pyrimidine> References: <000b01c6a52a$8c77f8c0$15327e82@pyrimidine> Message-ID: <337D69E3-4075-426B-A53E-31D37ED6CA2E@affymetrix.com> On Jul 11, 2006, at 1:42 PM, Chris Fields wrote: > Bio::Root::Root doesn't overload throw_not_implemented from > Bio::Root::RootI; from the comments looks like Steve C and Ewan B > couldn't > work out some of the Error.pm issues. The issue (I believe) was that Bio::Root::RootI::throw_not_implemented was doing some checking for the presence of Error.pm and calling Error::throw. I changed it so that this fanciness only happens in Root.pm. > Judging by the POD for Bio::Root::RootI, throw_not_implemented doesn't > accept arguments; it throws a Bio::Root::NotImplemented exception > automatically. Looking at the code now, throw_not_implemented() does not throw a Bio::Root::NotImplemented exception. It just throws a simple, unclassed message. 
We could allow it to throw an exception of class Bio::Root::NotImplemented by changing this code: if( $self->can('throw') ) { $self->throw($message); }... to this if( $self->can('throw') ) { $self->throw(-text=>$message, -class=>'Bio::Root::NotImplemented'); }... This does not create any dependency on Error.pm, but permits it to be used if available. If Error.pm is not loaded, the only change is that the class string is included in the error message, which is kind of handy. Trouble would occur if the implementing class: * does not derive from Bio::Root::Root, * does not import Bio::Root::Exception, * fails to implement a method which gets called, and * Error.pm is available. I don't know if such implementations exist in bioperl now, but I suspect they would be rare (and discouraged). Steve > Chris > >> -----Original Message----- >> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- >> bounces at lists.open-bio.org] On Behalf Of Rutger Vos >> Sent: Tuesday, July 11, 2006 1:58 PM >> To: Chris Fields >> Cc: 'Bioperl List' >> Subject: Re: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) >> overloading? >> >> I must have overlooked this. I think it does what I want. So could >> I do >> something like: >> >> $obj->thow_not_implemented( -class => 'Bio::Root::NotImplemented' ); >> >> ...in interfaces? >> >> Chris Fields wrote: >>> I suppose you could; Bio::Root::Root does that using Error.pm (if it >>> is installed). It almost sounds like what Bio::Root::Root does is >>> what you want, but you want a little more information when >>> exceptions >>> are thrown maybe? >>> >>> from perldoc Bio::Root::Root: >>> >>> ... >>> # Alternatively, using the new typed exception syntax in >>> the throw() call: >>> >>> $obj->throw( -class => 'Bio::Root::BadParameter', >>> -text => "Can not open file $file", >>> -value => $file); >>> ... 
>>> >>> Typed Exception Syntax >>> >>> The typed exception syntax of throw() has the advantage of >>> plainly >>> indicating the nature of the trouble, since the name of the >>> class is >>> included in the title of the exception output. >>> >>> To take advantage of this capability, you must specify >>> arguments as >>> named parameters in the throw() call. Here are the >>> parameters: >>> >>> -class >>> name of the class of the exception. This should be one >>> of the >>> classes defined in Bio::Root::Exception, or a custom >>> error of yours >>> that extends one of the exceptions defined in >>> Bio::Root::Exception. >>> >>> -text >>> a sensible message for the exception >>> >>> -value >>> the value causing the exception or $!, if appropriate. >>> >>> Note that Bio::Root::Exception does not need to be imported >>> into your >>> module (or script) namespace in order to throw exceptions >>> via >>> Bio::Root::Root::throw(), since Bio::Root::Root imports it. >>> >>> >>> Chris >>> >>> On Jul 11, 2006, at 3:58 AM, Rutger Vos wrote: >>> >>> >>>> Dear all, >>>> >>>> would it be possible to overload Bio::Root::RootI's 'throw' >>>> method to >>>> accept an additional, optional (positional) argument to define the >>>> exception class, e.g. using Exception::Class: >>>> >>>> # ...somewhere ... >>>> >>>> sub makefh { >>>> my ( $self, $filename ) = @_; >>>> open my $fh, '<' $filename or $self->throw("Can't open file: >>>> $!", >>>> 'Bio::Exceptions::FileIO'); # NOTE second argument >>>> return $fh; >>>> } >>>> >>>> #.... somewhere else >>>> my $fh; >>>> eval { >>>> $fh = $obj->makefh( 'data.txt'); >>>> } >>>> if ( $@ and $@->isa('Bio::Exceptions::FileIO') ) { >>>> # something's wrong with the file? >>>> } >>>> >>>> -- >>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++ >>>> Rutger Vos, PhD. 
candidate >>>> Department of Biological Sciences >>>> Simon Fraser University >>>> 8888 University Drive >>>> Burnaby, BC, V5A1S6 >>>> Phone: 604-291-5625 >>>> Fax: 604-291-3496 >>>> Personal site: http://www.sfu.ca/~rvosa >>>> FAB* lab: http://www.sfu.ca/~fabstar >>>> Bio::Phylo: http://search.cpan.org/~rvosa/Bio-Phylo/ >>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++ >>>> >>>> >>>> _______________________________________________ >>>> Bioperl-l mailing list >>>> Bioperl-l at lists.open-bio.org >>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >>>> >>> >>> Christopher Fields >>> Postdoctoral Researcher >>> Lab of Dr. Robert Switzer >>> Dept of Biochemistry >>> University of Illinois Urbana-Champaign >>> >>> >>> >>> _______________________________________________ >>> Bioperl-l mailing list >>> Bioperl-l at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >>> >>> >>> >>> >> >> -- >> ++++++++++++++++++++++++++++++++++++++++++++++++++++ >> Rutger Vos, PhD. candidate >> Department of Biological Sciences >> Simon Fraser University >> 8888 University Drive >> Burnaby, BC, V5A1S6 >> Phone: 604-291-5625 >> Fax: 604-291-3496 >> Personal site: http://www.sfu.ca/~rvosa >> FAB* lab: http://www.sfu.ca/~fabstar >> Bio::Phylo: http://search.cpan.org/~rvosa/Bio-Phylo/ >> ++++++++++++++++++++++++++++++++++++++++++++++++++++ >> >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Tue Jul 11 23:27:37 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 11 Jul 2006 22:27:37 -0500 Subject: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading? 
In-Reply-To: <337D69E3-4075-426B-A53E-31D37ED6CA2E@affymetrix.com>
References: <000b01c6a52a$8c77f8c0$15327e82@pyrimidine> <337D69E3-4075-426B-A53E-31D37ED6CA2E@affymetrix.com>
Message-ID: <18B66484-FE46-4F77-BDD1-B97085D04C95@uiuc.edu>

Makes sense to keep most of the magic in Root instead of RootI.pm. The POD for RootI does state that the class exception thrown is Bio::Root::NotImplemented, so we should probably either change the POD to reflect what really happens or change throw_not_implemented like you suggest (my vote is the latter). I don't think many (if any) implementing classes fall into your 'trouble' category, though I can't be sure how many actually import Bio::Root::Exception.

Chris

On Jul 11, 2006, at 8:07 PM, Steve_Chervitz wrote:

> On Jul 11, 2006, at 1:42 PM, Chris Fields wrote:
>
>> Bio::Root::Root doesn't overload throw_not_implemented from
>> Bio::Root::RootI; from the comments it looks like Steve C and Ewan B
>> couldn't work out some of the Error.pm issues.
>
> The issue (I believe) was that Bio::Root::RootI::throw_not_implemented
> was doing some checking for the presence of Error.pm and calling
> Error::throw. I changed it so that this fanciness only happens in
> Root.pm.
>
>> Judging by the POD for Bio::Root::RootI, throw_not_implemented doesn't
>> accept arguments; it throws a Bio::Root::NotImplemented exception
>> automatically.
>
> Looking at the code now, throw_not_implemented() does not throw a
> Bio::Root::NotImplemented exception. It just throws a simple,
> unclassed message. We could allow it to throw an exception of class
> Bio::Root::NotImplemented by changing this code:
>
>     if( $self->can('throw') ) {
>         $self->throw($message);
>     }...
>
> to this
>
>     if( $self->can('throw') ) {
>         $self->throw(-text  => $message,
>                      -class => 'Bio::Root::NotImplemented');
>     }...
>
> This does not create any dependency on Error.pm, but permits it to
> be used if available. If Error.pm is not loaded, the only change is
> that the class string is included in the error message, which is
> kind of handy.
>
> Trouble would occur if the implementing class:
>
>   * does not derive from Bio::Root::Root,
>   * does not import Bio::Root::Exception,
>   * fails to implement a method which gets called, and
>   * Error.pm is available.
>
> I don't know if such implementations exist in bioperl now, but I
> suspect they would be rare (and discouraged).
>
> Steve
>
>> Chris
>>
>>> -----Original Message-----
>>> From: bioperl-l-bounces at lists.open-bio.org
>>> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Rutger Vos
>>> Sent: Tuesday, July 11, 2006 1:58 PM
>>> To: Chris Fields
>>> Cc: 'Bioperl List'
>>> Subject: Re: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading?
>>>
>>> I must have overlooked this. I think it does what I want. So could I do
>>> something like:
>>>
>>> $obj->throw_not_implemented( -class => 'Bio::Root::NotImplemented' );
>>>
>>> ...in interfaces?
>>>
>>> Chris Fields wrote:
>>>> I suppose you could; Bio::Root::Root does that using Error.pm (if it
>>>> is installed). It almost sounds like what Bio::Root::Root does is
>>>> what you want, but you want a little more information when exceptions
>>>> are thrown maybe?
>>>>
>>>> from perldoc Bio::Root::Root:
>>>>
>>>> ...
>>>> # Alternatively, using the new typed exception syntax in the throw() call:
>>>>
>>>> $obj->throw( -class => 'Bio::Root::BadParameter',
>>>>              -text  => "Can not open file $file",
>>>>              -value => $file);
>>>> ...
>>>>
>>>> Typed Exception Syntax
>>>>
>>>> The typed exception syntax of throw() has the advantage of plainly
>>>> indicating the nature of the trouble, since the name of the class is
>>>> included in the title of the exception output.
>>>>
>>>> To take advantage of this capability, you must specify arguments as
>>>> named parameters in the throw() call.
>>>> Here are the parameters:
>>>>
>>>> -class
>>>>     name of the class of the exception. This should be one of the
>>>>     classes defined in Bio::Root::Exception, or a custom error of yours
>>>>     that extends one of the exceptions defined in Bio::Root::Exception.
>>>>
>>>> -text
>>>>     a sensible message for the exception
>>>>
>>>> -value
>>>>     the value causing the exception or $!, if appropriate.
>>>>
>>>> Note that Bio::Root::Exception does not need to be imported into your
>>>> module (or script) namespace in order to throw exceptions via
>>>> Bio::Root::Root::throw(), since Bio::Root::Root imports it.
>>>>
>>>> Chris
>>>>
>>>> On Jul 11, 2006, at 3:58 AM, Rutger Vos wrote:
>>>>
>>>>> Dear all,
>>>>>
>>>>> would it be possible to overload Bio::Root::RootI's 'throw' method to
>>>>> accept an additional, optional (positional) argument to define the
>>>>> exception class, e.g. using Exception::Class:
>>>>>
>>>>> # ...somewhere ...
>>>>>
>>>>> sub makefh {
>>>>>     my ( $self, $filename ) = @_;
>>>>>     open my $fh, '<', $filename
>>>>>         or $self->throw("Can't open file: $!",
>>>>>                         'Bio::Exceptions::FileIO'); # NOTE second argument
>>>>>     return $fh;
>>>>> }
>>>>>
>>>>> #.... somewhere else
>>>>> my $fh;
>>>>> eval {
>>>>>     $fh = $obj->makefh( 'data.txt');
>>>>> };
>>>>> if ( ref $@ and $@->isa('Bio::Exceptions::FileIO') ) {
>>>>>     # something's wrong with the file?
>>>>> }
>>>>>
>>>>> --
>>>>> Rutger Vos

Christopher Fields
Postdoctoral Researcher
Lab of Dr.
Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign

From frederick.partridge at st-johns.oxford.ac.uk Wed Jul 12 11:16:33 2006
From: frederick.partridge at st-johns.oxford.ac.uk (Frederick Partridge)
Date: Wed, 12 Jul 2006 16:16:33 +0100 (BST)
Subject: [Bioperl-l] Get nucleotide sequence when expecting proteinfromgenpept
In-Reply-To: <000201c6a53c$03970ed0$15327e82@pyrimidine>
References: <000201c6a53c$03970ed0$15327e82@pyrimidine>
Message-ID:

On Tue, 11 Jul 2006, Chris Fields wrote:
> This returns both the nucleotide sequence and the correct protein sequence;
> the protein was returned second for some reason, so get_Seq_by_acc misses it
> while get_Stream_by_acc doesn't. I have notified NCBI about this issue, but
> they will likely just tell me to use the GI number for searches as they are
> unique. Probably a good warning for anyone using accessions for all their
> work (I use the GI myself).

Thank you both for your help, I have converted to GIs and it works much better.

As an aside, it might be nice to have a $hit->gi method as well as $hit->accession for parsing blast reports. (I now realise that you can derive the gi from $hit->name, but this might have encouraged me to start off using gi instead of accession numbers).

Freddie Partridge

University of Oxford

From cjfields at uiuc.edu Wed Jul 12 11:39:39 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Wed, 12 Jul 2006 10:39:39 -0500
Subject: [Bioperl-l] Get nucleotide sequence when expecting proteinfromgenpept
In-Reply-To:
Message-ID: <000b01c6a5c9$635a7540$15327e82@pyrimidine>

Problem is, you may or may not have GIs for a BLAST hit depending on how you retrieve the BLAST report, what interface you use, etc.

NCBI is pretty ambiguous when it comes to GI vs. accession; the sequence database guys want you to use the GI for searches (since that's the unique ID for NCBI's databases) and don't promise getting the correct sequence using the accession.
However, the BLAST interface guys have set up the BLAST CGI server (accessible through Bio::Tools::Run::RemoteBlast) to not return the GI by default. Even more confusing, if you use the NCBI BLAST web interface, this option is turned on by default. I don't know what blastcl3 or blastall do; I haven't checked in a while.

Anyway, this could be why there is no $hit->gi method for GenericHit/BlastHit. It could be added; I will need to look at SearchIO::blast/blastxml/blasttable to see how this is parsed out.

BTW, what I do as a work-around, when using RemoteBlast, is below (you could use the while loop to grab the GIs using SearchIO::blast if they are present in the BLAST report). This grabs all the GIs from the description line (not just the best hit).

    # sets retrieval header to include the GI always
    $Bio::Tools::Run::RemoteBlast::RETRIEVALHEADER{'NCBI_GI'} = 'yes';
    ...
    my @gis;    # every GI found in the hit descriptions
    while ( my $hit = $result->next_hit ) {
        my $description = $hit->description;
        while ( $description =~ /gi\|(.*?)\|/g ) {
            my $gi = $1;
            push @gis, $gi;
        }
    }

Chris

> -----Original Message-----
> From: Frederick Partridge
> [mailto:frederick.partridge at st-johns.oxford.ac.uk]
> Sent: Wednesday, July 12, 2006 10:17 AM
> To: Chris Fields
> Cc: bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] Get nucleotide sequence when expecting
> proteinfromgenpept
>
> On Tue, 11 Jul 2006, Chris Fields wrote:
> > This returns both the nucleotide sequence and the correct protein sequence;
> > the protein was returned second for some reason, so get_Seq_by_acc misses it
> > while get_Stream_by_acc doesn't. I have notified NCBI about this issue, but
> > they will likely just tell me to use the GI number for searches as they are
> > unique. Probably a good warning for anyone using accessions for all their
> > work (I use the GI myself).
>
> Thank you both for your help, I have converted to GIs and it works much
> better.
>
> As an aside, it might be nice to have a $hit->gi method as well as
> $hit->accession for parsing blast reports. (I now realise that you can
> derive the gi from $hit->name, but this might have encouraged me to start
> off using gi instead of accession numbers).
>
> Freddie Partridge
>
> University of Oxford

From Steve_Chervitz at affymetrix.com Wed Jul 12 14:53:22 2006
From: Steve_Chervitz at affymetrix.com (Steve_Chervitz)
Date: Wed, 12 Jul 2006 11:53:22 -0700
Subject: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading?
In-Reply-To: <18B66484-FE46-4F77-BDD1-B97085D04C95@uiuc.edu>
References: <000b01c6a52a$8c77f8c0$15327e82@pyrimidine> <337D69E3-4075-426B-A53E-31D37ED6CA2E@affymetrix.com> <18B66484-FE46-4F77-BDD1-B97085D04C95@uiuc.edu>
Message-ID: <3E119694-68C5-47A6-971B-8E035CBB6429@affymetrix.com>

For modules that derive from Bio::Root::Root, there's no need to import Bio::Root::Exception since the Root object does it.

I also favor adding the -class parameter to throw_not_implemented in RootI. I just committed this change in bioperl-live. I also added a test for it in t/RootI.t

I haven't run the complete suite of tests after making this change, but I don't suspect there'll be any trouble (famous last words). Really, if any test leads to the calling of throw_not_implemented (besides the test I just added), that in itself is trouble.

Steve

On Jul 11, 2006, at 8:27 PM, Chris Fields wrote:

> Makes sense to keep most of the magic in Root instead of RootI.pm.
> The POD for RootI does state that the class exception thrown is
> Bio::Root::NotImplemented, so we should probably either change the
> POD to reflect what really happens or change throw_not_implemented
> like you suggest (my vote is the latter). I don't think many (if
> any) implementing classes fall into your 'trouble' category, though I
> can't be sure how many actually import Bio::Root::Exception.
>
> Chris

From cjfields at uiuc.edu Wed Jul 12 15:23:33 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Wed, 12 Jul 2006 14:23:33 -0500
Subject: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading?
In-Reply-To: <3E119694-68C5-47A6-971B-8E035CBB6429@affymetrix.com>
Message-ID: <000901c6a5e8$aaca53e0$15327e82@pyrimidine>

Thanks Steve!

Chris

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org
> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Steve_Chervitz
> Sent: Wednesday, July 12, 2006 1:53 PM
> To: Chris Fields
> Cc: Rutger Vos; Bioperl List
> Subject: Re: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading?
>
> For modules that derive from Bio::Root::Root, there's no need to
> import Bio::Root::Exception since the Root object does it.
>
> I also favor adding the -class parameter to throw_not_implemented in
> RootI. I just committed this change in bioperl-live. I also added
> a test for it in t/RootI.t
>
> I haven't run the complete suite of tests after making this change,
> but I don't suspect there'll be any trouble (famous last words).
> Really, if any test leads to the calling of throw_not_implemented
> (besides the test I just added), that in itself is trouble.
>
> Steve

From dsche at uga.edu Thu Jul 13 14:55:03 2006
From: dsche at uga.edu (Dongsheng Che)
Date: Thu, 13 Jul 2006 14:55:03 -0400 (EDT)
Subject: [Bioperl-l] remoteBlast problem
Message-ID: <20060713145503.CIV61560@punts2.cc.uga.edu>

To whom it may concern:

I'm trying to run a BLAST search remotely, so I downloaded bioperl-1.5 and followed the installation procedure, i.e. perl Makefile.PL, make, make test, make install. I know there were some failures during the installation.

Since my main purpose is to get RemoteBlast working, I don't want to bother figuring out all the failures. But when I ran the remote BLAST example from bptutorial, it gave me some errors:

-------------------------------------------------------------
Beginning run_remoteblast example...
Use of uninitialized value in numeric lt (<) at bptutorial.pl line 3303.

**Warning**: Couldn't connect to NCBI with Bio::Tools::Run::StandAloneBlast.pm!
Probably no network access.
Skipping Test
----------------------------------------------------------------

I am wondering what causes the problem.

Thanks in advance!

Dongsheng

From vrramnar at student.cs.uwaterloo.ca Thu Jul 13 18:39:19 2006
From: vrramnar at student.cs.uwaterloo.ca (vrramnar at student.cs.uwaterloo.ca)
Date: Thu, 13 Jul 2006 18:39:19 -0400
Subject: [Bioperl-l] Remote Blast - Blast Human Genome
Message-ID: <1152830359.44b6cb97ef16c@www.nexusmail.uwaterloo.ca>

Hello Again,

I have another question regarding Remote blast but this time using Genome Blast.

Here is the link:

http://www.ncbi.nlm.nih.gov/genome/seq/BlastGen/BlastGen.cgi?taxid=9606

which again uses the main Blast web site:

http://www.ncbi.nlm.nih.gov/BLAST/Blast.cgi

Again I am not sure what to add or what HEADER information to change within my script. Here is my program, which is the same as in the last email:

    #!/usr/bin/perl -w
    use Bio::Perl;
    use Bio::Tools::Run::RemoteBlast;

    my $prog  = "blastn";
    my $db    = "refseq_genomic";
    my $e_val = 0.01;

    my @params = ( '-prog'   => $prog,
                   '-data'   => $db,
                   '-expect' => $e_val );

    my $factory = Bio::Tools::Run::RemoteBlast->new(@params);

    # <----- what do I put here
    $Bio::Tools::Run::RemoteBlast::HEADER{'WWW_BLAST_TYPE'} = '????';
    # <--- Do I need to add any other values to the form inputs
    #$Bio::Tools::Run::RemoteBlast::HEADER{'?????'} = '????';

    $factory->submit_blast("blast.in");

    my $v = 1;
    while ( my @rids = $factory->each_rid ) {
        foreach my $rid ( @rids ) {
            my $rc = $factory->retrieve_blast($rid);
            if ( !ref($rc) ) {
                # not finished yet, or the request failed ($rc < 0)
                if ( $rc < 0 ) {
                    $factory->remove_rid($rid);
                }
                print STDERR "." if ( $v > 0 );
                sleep 5;
            }
            else {
                my $result   = $rc->next_result();
                my $filename = $result->query_name() . ".out";
                $factory->save_output($filename);
                $factory->remove_rid($rid);
                print "\nQuery Name: ", $result->query_name(), "\n";
            }
        }
    }

Both of my questions are very similar, in that I know how to use remote blast but am not sure what to change to access the specific blast I want.

Again, any help would be very appreciated!!

Rohan

From vrramnar at student.cs.uwaterloo.ca Thu Jul 13 18:31:38 2006
From: vrramnar at student.cs.uwaterloo.ca (vrramnar at student.cs.uwaterloo.ca)
Date: Thu, 13 Jul 2006 18:31:38 -0400
Subject: [Bioperl-l] Remote Blast - SNP data base
Message-ID: <1152829898.44b6c9cab7a3a@www.nexusmail.uwaterloo.ca>

Hello,

1. I was wondering if anyone knew how to use SNP Blast via the Remote Blast module??

Basically I want to blast my sequence against the dbSNP database, which you can normally do through NCBI's website:

http://www.ncbi.nlm.nih.gov/SNP/snp_blastByOrg.cgi

The site basically takes your info and submits it to the main blast site:

http://www.ncbi.nlm.nih.gov/BLAST/Blast.cgi

I am just not sure what settings to change within my script. I have something like this:

    #!/usr/bin/perl -w
    use Bio::Perl;
    use Bio::Tools::Run::RemoteBlast;

    my $prog  = "blastn";
    my $db    = "refseq_genomic";    # <--- What db should I use??
    my $e_val = 0.01;

    my @params = ( '-prog'   => $prog,
                   '-data'   => $db,
                   '-expect' => $e_val );

    my $factory = Bio::Tools::Run::RemoteBlast->new(@params);

    $factory->submit_blast("blast.in");    # <--- Name of my file in fasta format

    my $v = 1;
    while ( my @rids = $factory->each_rid ) {
        foreach my $rid ( @rids ) {
            my $rc = $factory->retrieve_blast($rid);
            if ( !ref($rc) ) {
                # not finished yet, or the request failed ($rc < 0)
                if ( $rc < 0 ) {
                    $factory->remove_rid($rid);
                }
                print STDERR "." if ( $v > 0 );
                sleep 5;
            }
            else {
                my $result   = $rc->next_result();
                my $filename = $result->query_name() . ".out";
                $factory->save_output($filename);
                $factory->remove_rid($rid);
                print "\nQuery Name: ", $result->query_name(), "\n";
            }
        }
    }

I think something like this should be added to have the correct form inputs but I am unsure:

    $Bio::Tools::Run::RemoteBlast::HEADER{'???'} = '????';

Any help on this topic would greatly be appreciated!!

Rohan

From cjfields at uiuc.edu Thu Jul 13 20:42:57 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 13 Jul 2006 19:42:57 -0500
Subject: [Bioperl-l] remoteBlast problem
In-Reply-To: <20060713145503.CIV61560@punts2.cc.uga.edu>
Message-ID: <000401c6a6de$737fe570$15327e82@pyrimidine>

1) Before I get wound up in the obvious here, you need to upgrade to CVS; RemoteBlast and SearchIO::blast were fixed after v1.5.1 (i.e.
in CVS) to account for changes in BLAST output at the NCBI 2) The Bio::Tools::Run::StandAloneBlast.pm bit worried me a little, so I did a little digging; that's a typo. Now corrected in CVS, along with some BPLite cruft left over. 3) Speaking bluntly? Come on. The error is stated as plainly as possible. No? How about this (note the arrows): -----------> **Warning**: Couldn't connect to NCBI with -----------> Bio::Tools::Run::StandAloneBlast.pm! -----------> Probably no network access. Skipping Test Check your network connections, preferably AFTER you update to CVS. It's possible that it's a proxy issue, but that should also be fixed in CVS. Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Dongsheng Che > Sent: Thursday, July 13, 2006 1:55 PM > To: bioperl-l at bioperl.org > Subject: [Bioperl-l] remoteBlast problem > > To whom it may concern: > > I'm trying to do blast search remotely, so I downloaded bioperl-1.5, and > followed the installation procedure, ie, perl Makefile.PL, make, make > test. make install. I know there are some installation failure during the > installation. > > Since my main purpose is to get remoteBlast worked, I don't want bother to > figure out all failures. but I run remote Blast, it gave me some erorrs > from examples (bptutorial). > ------------------------------------------------------------- > Beginning run_remoteblast example... > Use of uninitialized value in numeric lt (<) at bptutorial.pl line 3303. > > > **Warning**: Couldn't connect to NCBI with > Bio::Tools::Run::StandAloneBlast.pm! > Probably no network access. > Skipping Test > ---------------------------------------------------------------- > > I wondering what cause the problem. > > Thanks in advance! 
> > Dongsheng > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Thu Jul 13 21:56:30 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 13 Jul 2006 20:56:30 -0500 Subject: [Bioperl-l] Remote Blast - Blast Human Genome In-Reply-To: <1152830359.44b6cb97ef16c@www.nexusmail.uwaterloo.ca> Message-ID: <000501c6a6e8$b9c24d20$15327e82@pyrimidine> I added a method to RemoteBlast in bioperl-live (CVS) if you want to play with changing the URL. I have been thinking about doing this for a bit now but I already see problems. Here's the issue: the BLAST page you see is NOT the NCBI BLAST page (note the differences in the URL) but a user-friendly request page, generated on the fly by Genome, to submit BLAST requests for the relevant database. So changing the URL will not work (even by adding extra parameters); you only get the original HTML web page. You could try changing the database or limiting the search using an Entrez term (which you should be able to include in the request, probably by adding it to the HEADER). Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of vrramnar at student.cs.uwaterloo.ca > Sent: Thursday, July 13, 2006 5:39 PM > To: bioperl-l at lists.open-bio.org > Subject: [Bioperl-l] Remote Blast - Blast Human Genome > > > Hello Again, > > I have another question regarding Remote blast but this time using Genome > Blast. > > Here is the link: > > http://www.ncbi.nlm.nih.gov/genome/seq/BlastGen/BlastGen.cgi?taxid=9606 > > which again uses the main Blast web site: > > http://www.ncbi.nlm.nih.gov/BLAST/Blast.cgi > > Again I am not sure what to add or what HEADER information to change > within my > script. 
> > Here is my program, which was the same as the last email: > > #!/usr/bin/perl -w > > use Bio::Perl; > use Bio::Tools::Run::RemoteBlast; > > my $prog = "blastn"; > my $db = "refseq_genomic"; > my $e_val = 0.01; > > my @params = ( '-prog' => $prog, > '-data' => $db, > '-expect' => $e_val); > > my $factory = new Bio::Tools::Run::RemoteBlast->new(@params); > $Bio::Tools::Run::RemoteBlast::HEADER{'WWW_BLAST_TYPE'} = '????'; <----- > what > do I put here > #$Bio::Tools::Run::RemoteBlast::HEADER{'?????'} = '????'; <--- Do I need > to add > any other values to the form inputs > > $factory->submit_blast("blast.in"); > $v = 1; > > while (my @rids = $factory->each_rid) > { foreach my $rid ( @rids ) > { my $rc = $factory->retrieve_blast($rid); > if( !ref($rc) ) > { if( $rc < 0 ) > { $factory->remove_rid($rid); > } > print STDERR "." if ( $v > 0 ); > sleep 5; > } > else > { my $result = $rc->next_result(); > my $filename = $result->query_name()."\.out"; > $factory->save_output($filename); > $factory->remove_rid($rid); > print "\nQuery Name: ", $result->query_name(), "\n"; > } > } > } > > > Both of my questions are very similiar as in I know how to use remote > blast but > not sure what to change to access the specific blast I want. > > Again, any help would be very appreciated!! > > Rohan > > > > ---------------------------------------- > This mail sent through www.mywaterloo.ca > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From smart_bioit at yahoo.com Fri Jul 14 13:25:51 2006 From: smart_bioit at yahoo.com (raj sharma) Date: Fri, 14 Jul 2006 10:25:51 -0700 (PDT) Subject: [Bioperl-l] advice Message-ID: <20060714172551.60978.qmail@web37309.mail.mud.yahoo.com> i have one problem in perl i want to make one program which whn run online can download required data from data bank to local server frm where i shld start --------------------------------- Yahoo! 
Music Unlimited - Access over 1 million songs. Try it free.

From charlesh at stedwards.edu Sat Jul 15 15:29:46 2006
From: charlesh at stedwards.edu (Charles Hauser)
Date: Sat, 15 Jul 2006 14:29:46 -0500
Subject: [Bioperl-l] Finding locations of a string within a fasta file
Message-ID: <040D0FF5-0BD1-49DA-B66C-32E359F254BA@stedwards.edu>

All,

I'm trying to determine where (the start .. end positions) within a
genomic scaffold sequence gaps occur.
The gaps are denoted as runs of N's.

Suggestions on how to easily retrieve this would be appreciated.

ch

From cjfields at uiuc.edu Sat Jul 15 17:22:15 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Sat, 15 Jul 2006 16:22:15 -0500
Subject: [Bioperl-l] Finding locations of a string within a fasta file
In-Reply-To: <040D0FF5-0BD1-49DA-B66C-32E359F254BA@stedwards.edu>
Message-ID: <000001c6a854$bee47400$15327e82@pyrimidine>

You can retrieve the original GenBank CONTIG file using Bio::DB::GenBank if
the format is set to 'gb' (it is now set to 'gbwithparts' by default). The
CONTIG lines are currently stored in a series of
Bio::Annotation::SimpleValue objects; get the accessions using the following
script.

use strict;
use warnings;

use Bio::DB::GenBank;
use Bio::SeqIO;   # loaded explicitly because Bio::SeqIO->new is called below

my $factory = Bio::DB::GenBank->new(-format => 'gb');

my $seq = $factory->get_Seq_by_id(shift);

my $seqout = Bio::SeqIO->new(-fh     => \*STDOUT,
                             -format => 'genbank');

# greps only annotations with CONTIG tagname, joins all together
my $contig = join '', grep {$_->tagname eq 'CONTIG'}
    $seq->get_Annotations();

# split each region, getting rid of gaps and join(), then split into acc/span
for (grep {$_ !~ m{gap|join}}
     split ',', $contig) {
    my ($acc, $span) = split ':', $_;
    $span =~ s{\)}{}g;   # spurious ')'
    print "ACC: $acc\n\tSpan:$span\n";
}

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of Charles Hauser
> Sent: Saturday, July 15, 2006 2:30 PM
> To: bioperl-l at lists.open-bio.org
> Subject: [Bioperl-l] Finding locations of a string within a fasta file
>
> All,
>
> I'm trying to determine where (the start .. end positions) within a
> genomic scaffold sequence gaps occur.
> The gaps are denoted as runs of N's.
>
> Suggestions on how to easily retrieve this would be appreciated.
>
> ch

From sudhaneti at yahoo.com Sat Jul 15 15:26:01 2006
From: sudhaneti at yahoo.com (Sudha Gunturu)
Date: Sat, 15 Jul 2006 12:26:01 -0700 (PDT)
Subject: [Bioperl-l] BLOSUM matrix
Message-ID: <20060715192601.36517.qmail@web53315.mail.yahoo.com>

Being a beginner at Perl programming, I was wondering if anyone can help me
with an implementation of the BLOSUM 65 matrix for the following alignments,
or in general. Any inputs or websites to help with this are appreciated.

AILCAA
ALLLAA
ILIICL

Thanks
Sudha
From charlesh at stedwards.edu Sun Jul 16 19:32:38 2006 From: charlesh at stedwards.edu (Charles Hauser) Date: Sun, 16 Jul 2006 18:32:38 -0500 Subject: [Bioperl-l] Finding locations of a string within a fasta file In-Reply-To: <000001c6a854$bee47400$15327e82@pyrimidine> References: <000001c6a854$bee47400$15327e82@pyrimidine> Message-ID: Hi Chris, Thanks for the info. Unfortunately, I was not clear that the sequence is unannotated, i.e. there is no GenBank record. I need to extract the locations of the gaps from a raw fasta file. ch On Jul 15, 2006, at 4:22 PM, Chris Fields wrote: > You can retrieve the original GenBank CONTIG file using > Bio::DB::GenBank if > the format is set to 'gb' (it is now set to 'gbwithparts' by > default. The > CONTIG lines are currently stored in a series of > Bio::Annotation::SimpleValue objects; get the accessions using the > following > script. > > use strict; > use warnings; > > use Bio::DB::GenBank; > > my $factory = Bio::DB::GenBank->new(-format => 'gb'); > > my $seq = $factory->get_Seq_by_id(shift); > > my $seqout = Bio::SeqIO->new(-fh => \*STDOUT, > -format => 'genbank'); > > # greps only annotations with CONTIG tagname, joins all together > my $contig = join '', grep {$_->tagname eq 'CONTIG'} > $seq->get_Annotations(); > > # split each region, getting rid of gaps and join(), then split into > acc/span > for (grep {$_ !~ m{gap|join}} > split ',', $contig) { > my ($acc, $span) = split ':', $_; > $span =~ s{\)}{}g; # spurious ')' > print "ACC: $acc\n\tSpan:$span\n"; > } > > > >> -----Original Message----- >> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- >> bounces at lists.open-bio.org] On Behalf Of Charles Hauser >> Sent: Saturday, July 15, 2006 2:30 PM >> To: bioperl-l at lists.open-bio.org >> Subject: [Bioperl-l] Finding locations of a string within a fasta >> file >> >> All, >> >> I'm trying to determine where (the start .. end positions) within a >> genomic scaffold sequence gaps occur. 
>> The gaps are denoted as runs of N's.
>>
>> Suggestions on how to easily retrieve this would be appreciated.
>>
>> ch
>

From torsten.seemann at infotech.monash.edu.au Sun Jul 16 22:23:51 2006
From: torsten.seemann at infotech.monash.edu.au (Torsten Seemann)
Date: Mon, 17 Jul 2006 12:23:51 +1000
Subject: [Bioperl-l] advice
In-Reply-To: <20060714172551.60978.qmail@web37309.mail.mud.yahoo.com>
References: <20060714172551.60978.qmail@web37309.mail.mud.yahoo.com>
Message-ID: <44BAF4B7.8090508@infotech.monash.edu.au>

raj sharma wrote:
> i have one problem in perl

is this Bio::Perl related?

> i want to make one program which whn run online

do you mean runs on a web server as a CGI script, or access on-line data?

> can download required data from data bank to local server

which databank - genbank or ... ?

> frm where i shld start

http://www.oreilly.com/catalog/lperl3/

--
Dr Torsten Seemann http://www.vicbioinformatics.com
Victorian Bioinformatics Consortium, Monash University, Australia

From torsten.seemann at infotech.monash.edu.au Sun Jul 16 22:21:31 2006
From: torsten.seemann at infotech.monash.edu.au (Torsten Seemann)
Date: Mon, 17 Jul 2006 12:21:31 +1000
Subject: [Bioperl-l] Finding locations of a string within a fasta file
In-Reply-To: <040D0FF5-0BD1-49DA-B66C-32E359F254BA@stedwards.edu>
References: <040D0FF5-0BD1-49DA-B66C-32E359F254BA@stedwards.edu>
Message-ID: <44BAF42B.8080102@infotech.monash.edu.au>

> I'm trying to determine where (the start .. end positions) within a
> genomic scaffold sequence gaps occur.
> The gaps are denoted as runs of N's.
> Suggestions on how to easily retrieve this would be appreciated.

First you need to get the sequence into a string within Perl.
As your email Subject: says it is in a Fasta file, you need to

1. open the fasta file - see Bio::SeqIO
2. read first sequence (as an object) - see next_seq()
3. get the string of the sequence in the object - see seq()

Then you could just use the inbuilt Perl function index() to loop
through all the occurrences of 'N' - type 'perldoc -f index' for help.

Alternatively use regexp matching, e.g. m/(N+)/g and the pos() function.

--
Dr Torsten Seemann http://www.vicbioinformatics.com
Victorian Bioinformatics Consortium, Monash University, Australia

From sudhaneti at yahoo.com Sun Jul 16 22:33:20 2006
From: sudhaneti at yahoo.com (Sudha Gunturu)
Date: Sun, 16 Jul 2006 19:33:20 -0700 (PDT)
Subject: [Bioperl-l] BLOSUM matrix
In-Reply-To: <44BAF316.9020301@infotech.monash.edu.au>
Message-ID: <20060717023320.6402.qmail@web53313.mail.yahoo.com>

Sorry for not being clear with my question. Let me try to explain. I want
to implement dynamic programming using BLOSUM as the scoring matrix.

1. I want to know how to define the values of BLOSUM in an array.
2. What functions are suitable for global alignment of two sequences, etc.

Being a beginner programmer, I want some direction, books, and good websites
which can help me in achieving the implementation. It would be great if
someone can walk me through this.

Thanks
Sudha

Torsten Seemann wrote:
Sudha,

> Being a beginner at Perl programming, I was wondering if anyone can help me
> with an implementation of the BLOSUM 65 matrix for the following alignments,
> or in general. Any inputs or websites to help with this are appreciated.
> AILCAA
> ALLLAA
> ILIICL

The BLOSUM65 matrix does not define a method for alignment, it just
provides some parameters. Perhaps you should read this first:

http://en.wikipedia.org/wiki/Sequence_alignment

--
Dr Torsten Seemann http://www.vicbioinformatics.com
Victorian Bioinformatics Consortium, Monash University, Australia
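
[Editor's sketch] A minimal illustration of the regexp approach suggested earlier in this thread for the "runs of N's" question. It assumes the sequence is already in a plain Perl string (for instance obtained through Bio::SeqIO with next_seq() and seq(), as in the three steps listed in that reply). The function name find_n_runs and the demo string are made up for illustration; coordinates are reported 1-based and inclusive.

```perl
use strict;
use warnings;

# Return [start, end] pairs (1-based, inclusive) for every run of N or n in $seq.
sub find_n_runs {
    my ($seq) = @_;
    my @runs;
    while ($seq =~ /(N+)/gi) {
        my $end   = pos($seq);               # 0-based offset just past the match
        my $start = $end - length($1) + 1;   # 1-based position of the first N
        push @runs, [$start, $end];
    }
    return @runs;
}

# Demo on a short made-up scaffold: prints "gap: 5..8" then "gap: 13..14".
my $demo = 'ACGTNNNNACGTNNAC';
printf "gap: %d..%d\n", @$_ for find_n_runs($demo);
```

With Bio::SeqIO the string passed in would be the value returned by seq() on the sequence object. Lower-case n is matched as well because of the /i flag; drop it if only upper-case N marks a gap in your data.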
From torsten.seemann at infotech.monash.edu.au Sun Jul 16 22:16:54 2006 From: torsten.seemann at infotech.monash.edu.au (Torsten Seemann) Date: Mon, 17 Jul 2006 12:16:54 +1000 Subject: [Bioperl-l] BLOSUM matrix In-Reply-To: <20060715192601.36517.qmail@web53315.mail.yahoo.com> References: <20060715192601.36517.qmail@web53315.mail.yahoo.com> Message-ID: <44BAF316.9020301@infotech.monash.edu.au> Sudha, > Being a beginner perl programming, was wondering if anyone can help me with implementation of BLOSUM 65 matrix for the following alignments or in general. Any inputs, websites to help with this are appreciated. > AILCAA > ALLLAA > ILIICL The BLOSUM65 matrix does not define a method for alignment, it just provides some parameters. Perhaps you should read this first: http://en.wikipedia.org/wiki/Sequence_alignment -- Dr Torsten Seemann http://www.vicbioinformatics.com Victorian Bioinformatics Consortium, Monash University, Australia From smart_bioit at yahoo.com Mon Jul 17 00:21:41 2006 From: smart_bioit at yahoo.com (raj sharma) Date: Sun, 16 Jul 2006 21:21:41 -0700 (PDT) Subject: [Bioperl-l] advice In-Reply-To: <44BAF4B7.8090508@infotech.monash.edu.au> Message-ID: <20060717042141.9069.qmail@web37307.mail.mud.yahoo.com> hi trston well here i will make u clear my problem i want to make one data base of marine species u can say this as mirror of data so at present whn i click there on line data base of ncbi gets open so i want to dowload data of marine species (ny one) nd whn ever i click on tht link local data which i have downloaded shld open nd data shld also b updated online after some time waiting for ur reply --------------------------------- Do you Yahoo!? Next-gen email? Have it all with the all-new Yahoo! Mail Beta. 
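
[Editor's sketch] On the BLOSUM question in this thread: one hedged way to hold substitution scores in Perl is a hash of hashes rather than a plain array, and a global alignment score can then be filled in with the Needleman-Wunsch recurrence. The code below is an illustration only, not BioPerl code. The function name nw_score and the linear gap penalty of -4 are assumptions, and the scores shown are BLOSUM62 values as best recalled (not BLOSUM65), restricted to the residues A, C, I and L that occur in the example sequences; check them against a published matrix before relying on them.

```perl
use strict;
use warnings;

# Substitution scores for the residues used in the example sequences only.
# Assumption: these are BLOSUM62 values; replace with the matrix you need.
my %score = (
    A => { A =>  4, C =>  0, I => -1, L => -1 },
    C => { A =>  0, C =>  9, I => -1, L => -1 },
    I => { A => -1, C => -1, I =>  4, L =>  2 },
    L => { A => -1, C => -1, I =>  2, L =>  4 },
);

# Needleman-Wunsch global alignment score with a linear gap penalty.
sub nw_score {
    my ($s, $t, $gap) = @_;
    my @a = split //, $s;
    my @b = split //, $t;
    my ($n, $m) = (scalar @a, scalar @b);

    my @F;                                   # F[i][j]: best score of a[1..i] vs b[1..j]
    $F[$_][0] = $_ * $gap for 0 .. $n;       # prefix of $s aligned to gaps only
    $F[0][$_] = $_ * $gap for 0 .. $m;       # prefix of $t aligned to gaps only

    for my $i (1 .. $n) {
        for my $j (1 .. $m) {
            my $diag = $F[$i-1][$j-1] + $score{ $a[$i-1] }{ $b[$j-1] };
            my $up   = $F[$i-1][$j] + $gap;  # gap in $t
            my $left = $F[$i][$j-1] + $gap;  # gap in $s
            my $best = $diag;
            $best = $up   if $up   > $best;
            $best = $left if $left > $best;
            $F[$i][$j] = $best;
        }
    }
    return $F[$n][$m];
}

# With the scores above and gap = -4 this prints 17 (the ungapped alignment).
print nw_score('AILCAA', 'ALLLAA', -4), "\n";
```

This returns only the score; recovering the alignment itself needs a traceback through @F, which is left out to keep the sketch short.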
From cjfields at uiuc.edu Mon Jul 17 00:51:20 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Sun, 16 Jul 2006 23:51:20 -0500 Subject: [Bioperl-l] BLOSUM matrix In-Reply-To: <20060717023320.6402.qmail@web53313.mail.yahoo.com> References: <20060717023320.6402.qmail@web53313.mail.yahoo.com> Message-ID: Hmm, beginner programmer, wants to learn perl? Here are some directions: http://learn.perl.org/ Start with Schwartz's latest incarnation of Learning Perl, then work your way up to Intermediate Perl (I think Mastering Perl is on the horizon...) For some pointers using Perl and bioinformatics, pick up Tisdall's books Beginning/Mastering Perl for Bioinformatics. This is really a list for bioperl, not perl and bioinformatics (thought the two cross here all the time!). We normally don't mind answering questions but we typically don't do people's homework unless we're unusually bored. And we can be excessively cranky when someone repeatedly posts requests for something that shouldn't take much reading and Googling to find out. Again, we're not into that homework gig, i.e. 'walking you through it' is tantamount to 'doing it for you.' 1) Arrays and how to use them are in Learning Perl; there are probably better ways to do this than an array, though... 2) Use Torsten's link to get you started. Chris On Jul 16, 2006, at 9:33 PM, Sudha Gunturu wrote: > Sorry for not being clear with my question. Let me try to > explain. I want to Implement dynamic programing using Blosum as > scoring matrix. > > 1. I want to know how to define the values of Blosum in an array. > 2. What functions are suitable for global alignment of two > sequences. Etc., > > Being a beginer programer want some direction, books, and good > websites which can help me in achieving the implementation. It > would be great if someone can walk me through this. 
> > Thanks > Sudha > > Torsten Seemann wrote: > Sudha, > >> Being a beginner perl programming, was wondering if anyone can >> help me with implementation of BLOSUM 65 matrix for the following >> alignments or in > general. Any inputs, websites to help with this are appreciated. >> AILCAA >> ALLLAA >> ILIICL > > The BLOSUM65 matrix does not define a method for alignment, it just > provides some parameters. Perhaps you should read this first: > > http://en.wikipedia.org/wiki/Sequence_alignment > > -- > Dr Torsten Seemann http://www.vicbioinformatics.com > Victorian Bioinformatics Consortium, Monash University, Australia > > > > > --------------------------------- > Do you Yahoo!? > Everyone is raving about the all-new Yahoo! Mail Beta. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From cjfields at uiuc.edu Mon Jul 17 01:01:53 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 17 Jul 2006 00:01:53 -0500 Subject: [Bioperl-l] advice In-Reply-To: <20060717042141.9069.qmail@web37307.mail.mud.yahoo.com> References: <20060717042141.9069.qmail@web37307.mail.mud.yahoo.com> Message-ID: <82C51420-A18B-4DEA-A519-CE1D7B9C7B10@uiuc.edu> This is a Bioperl list. If you don't have a Bioperl-related question, you will very likely get testy replies. I don't believe that you quite understand Torsten's response, so I'll just copy-and-paste from a reply I just gave a second ago to save myself the typing: Hmm, beginner programmer, wants to learn perl? Here are some directions: http://learn.perl.org/ Start with Schwartz's latest incarnation of Learning Perl, then work your way up to Intermediate Perl (I think Mastering Perl is on the horizon...) 
For some pointers using Perl and bioinformatics, pick up Tisdall's books Beginning/Mastering Perl for Bioinformatics. This is really a list for bioperl, not perl and bioinformatics (thought the two cross here all the time!). We normally don't mind answering questions but we typically don't do people's homework unless we're unusually bored. And we can be excessively cranky when someone repeatedly posts requests for something that shouldn't take much reading and Googling to find out. Again, we're not into that homework gig, i.e. 'walking you through it' is tantamount to 'doing it for you.' For your particular instance, you might want to brush up on web services, CGI, and a little web etiquette. http://catb.org/esr/faqs/smart-questions.html I think you may be waiting for a long time for a reply! Chris On Jul 16, 2006, at 11:21 PM, raj sharma wrote: > > hi trston > well here i will make u clear my problem > > i want to make one data base of marine species u can say this as > mirror of data > > > so at present whn i click there on line data base of ncbi gets open > > so i want to dowload data of marine species (ny one) > nd whn ever i click on tht link local data which i have > downloaded shld open > nd data shld also b updated online after some time > > waiting for ur reply > > > > > > > --------------------------------- > Do you Yahoo!? > Next-gen email? Have it all with the all-new Yahoo! Mail Beta. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. 
Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign

From bmoore at genetics.utah.edu Mon Jul 17 01:25:32 2006
From: bmoore at genetics.utah.edu (Barry Moore)
Date: Sun, 16 Jul 2006 23:25:32 -0600
Subject: [Bioperl-l] advice
In-Reply-To: <20060714172551.60978.qmail@web37309.mail.mud.yahoo.com>
Message-ID:

By reading this: http://catb.org/esr/faqs/smart-questions.html

-----Original Message-----
From: bioperl-l-bounces at lists.open-bio.org
[mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of raj sharma
Sent: Friday, July 14, 2006 11:26 AM
To: bioperl-l at lists.open-bio.org
Subject: [Bioperl-l] advice

i have one problem in perl
i want to make one program which whn run online
can download required data from data bank to local server
frm where i shld start

From bmoore at genetics.utah.edu Mon Jul 17 01:34:58 2006
From: bmoore at genetics.utah.edu (Barry Moore)
Date: Sun, 16 Jul 2006 23:34:58 -0600
Subject: [Bioperl-l] advice
In-Reply-To: <20060717042141.9069.qmail@web37307.mail.mud.yahoo.com>
Message-ID:

If you're on a unix type system look at wget --mirror and its
variations.

B

-----Original Message-----
From: bioperl-l-bounces at lists.open-bio.org
[mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of raj sharma
Sent: Sunday, July 16, 2006 10:22 PM
To: Torsten Seemann
Subject: Re: [Bioperl-l] advice

hi trston
well here i will make u clear my problem

i want to make one data base of marine species u can say this as
mirror of data

so at present whn i click there on line data base of ncbi gets open

so i want to dowload data of marine species (ny one)
nd whn ever i click on tht link local data which i have downloaded
shld open
nd data shld also b updated online after some time

waiting for ur reply

From bix at sendu.me.uk Mon Jul 17 10:32:13 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Mon, 17 Jul 2006 15:32:13 +0100
Subject: [Bioperl-l] Bio::Map changes
In-Reply-To: <44ACCCD5.3030309@sendu.me.uk>
References: <44985915.8010607@sendu.me.uk> <449A9AF9.2000305@sendu.me.uk> <44ACCCD5.3030309@sendu.me.uk>
Message-ID: <44BB9F6D.10005@sendu.me.uk>

Sendu Bala wrote:
> Sendu Bala wrote:
>> The reimplementation will make Position central to the model, allowing
>> for lots of other things to work properly without anything becoming
>> inconsistent (as is currently the case).
>
> This is now done. It uses a new PositionHandler class behind the scenes.
>
> The next step is to introduce relative positioning across the board

This is now done. It uses a new Relative class to describe what a given
position is relative to.

I also made Bio::Map::MapI an AnnotatableI and SimpleMap an implementor.

I think this pretty much brings an end to my changes to Bio::Map. Unless
anyone thinks the changes lack sanity, I think the API of the new things
should be somewhat stable.
> possibly in a way that makes OrderedPosition redundant or an implementer > of the system. I haven't yet touched the other kinds of Positions to update/remove them. Docs in general could probably do with an update/ improvement. I plan to do this 'soon'. From golharam at umdnj.edu Mon Jul 17 10:13:20 2006 From: golharam at umdnj.edu (Ryan Golhar) Date: Mon, 17 Jul 2006 10:13:20 -0400 Subject: [Bioperl-l] advice In-Reply-To: Message-ID: <004301c6a9ab$2896c8b0$2f01a8c0@GOLHARMOBILE1> I apologize that this is off-topic, but it is an interesting email. Notice the lack of vowels (whn, ny, nd, shld, b) however in other words, the vowels are clearly included. Am I getting old or is "internet spelling" starting to differ from "english spelling"? Or is it that the younger generation (not that I'm old...a mere 32 is not old), using shorthand for frequently used words? -----Original Message----- From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Barry Moore Sent: Monday, July 17, 2006 1:35 AM To: raj sharma Cc: bioperl-l Subject: Re: [Bioperl-l] advice If you're on a unix type system look at wget -mirror and it's variations. B -----Original Message----- From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of raj sharma Sent: Sunday, July 16, 2006 10:22 PM To: Torsten Seemann Subject: Re: [Bioperl-l] advice hi trston well here i will make u clear my problem i want to make one data base of marine species u can say this as mirror of data so at present whn i click there on line data base of ncbi gets open so i want to dowload data of marine species (ny one) nd whn ever i click on tht link local data which i have downloaded shld open nd data shld also b updated online after some time waiting for ur reply --------------------------------- Do you Yahoo!? Next-gen email? Have it all with the all-new Yahoo! Mail Beta. 
_______________________________________________ Bioperl-l mailing list Bioperl-l at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/bioperl-l _______________________________________________ Bioperl-l mailing list Bioperl-l at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/bioperl-l From arareko at campus.iztacala.unam.mx Mon Jul 17 11:31:09 2006 From: arareko at campus.iztacala.unam.mx (Mauricio Herrera Cuadra) Date: Mon, 17 Jul 2006 10:31:09 -0500 Subject: [Bioperl-l] advice In-Reply-To: <004301c6a9ab$2896c8b0$2f01a8c0@GOLHARMOBILE1> References: <004301c6a9ab$2896c8b0$2f01a8c0@GOLHARMOBILE1> Message-ID: <44BBAD3D.2040203@campus.iztacala.unam.mx> Maybe it's a new "obscure" perl6 syntax :) Ryan Golhar wrote: > I apologize that this is off-topic, but it is an interesting email. > Notice the lack of vowels (whn, ny, nd, shld, b) however in other > words, the vowels are clearly included. > > Am I getting old or is "internet spelling" starting to differ from > "english spelling"? Or is it that the younger generation (not that I'm > old...a mere 32 is not old), using shorthand for frequently used words? > > > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org > [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Barry Moore > Sent: Monday, July 17, 2006 1:35 AM > To: raj sharma > Cc: bioperl-l > Subject: Re: [Bioperl-l] advice > > > If you're on a unix type system look at wget -mirror and it's > variations. 
> > B > > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org > [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of raj sharma > Sent: Sunday, July 16, 2006 10:22 PM > To: Torsten Seemann > Subject: Re: [Bioperl-l] advice > > > hi trston > well here i will make u clear my problem > > i want to make one data base of marine species u can say this as > mirror of data > > > so at present whn i click there on line data base of ncbi gets open > > so i want to dowload data of marine species (ny one) > nd whn ever i click on tht link local data which i have downloaded > shld open > nd data shld also b updated online after some time > > waiting for ur reply > > > > > > > --------------------------------- > Do you Yahoo!? > Next-gen email? Have it all with the all-new Yahoo! Mail Beta. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- MAURICIO HERRERA CUADRA arareko at campus.iztacala.unam.mx Laboratorio de Gen?tica Unidad de Morfofisiolog?a y Funci?n Facultad de Estudios Superiores Iztacala, UNAM From cjfields at uiuc.edu Mon Jul 17 12:09:27 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 17 Jul 2006 11:09:27 -0500 Subject: [Bioperl-l] advice In-Reply-To: <004301c6a9ab$2896c8b0$2f01a8c0@GOLHARMOBILE1> Message-ID: <000a01c6a9bb$6478ba90$15327e82@pyrimidine> Ha ! I *almost* added something about that. I thought his vowel keys were broken for a bit, maybe from pounding the keyboard with extreme frustration! 
As an aside, doesn't Damian Conway say something about the non-use of vowels in 'Perl Best Practices?' I think it was in relation to variables, though... Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Ryan Golhar > Sent: Monday, July 17, 2006 9:13 AM > To: 'bioperl-l' > Subject: Re: [Bioperl-l] advice > > I apologize that this is off-topic, but it is an interesting email. > Notice the lack of vowels (whn, ny, nd, shld, b) however in other > words, the vowels are clearly included. > > Am I getting old or is "internet spelling" starting to differ from > "english spelling"? Or is it that the younger generation (not that I'm > old...a mere 32 is not old), using shorthand for frequently used words? > > > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org > [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Barry Moore > Sent: Monday, July 17, 2006 1:35 AM > To: raj sharma > Cc: bioperl-l > Subject: Re: [Bioperl-l] advice > > > If you're on a unix type system look at wget -mirror and it's > variations. > > B > > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org > [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of raj sharma > Sent: Sunday, July 16, 2006 10:22 PM > To: Torsten Seemann > Subject: Re: [Bioperl-l] advice > > > hi trston > well here i will make u clear my problem > > i want to make one data base of marine species u can say this as > mirror of data > > > so at present whn i click there on line data base of ncbi gets open > > so i want to dowload data of marine species (ny one) > nd whn ever i click on tht link local data which i have downloaded > shld open > nd data shld also b updated online after some time > > waiting for ur reply > > > > > > > --------------------------------- > Do you Yahoo!? > Next-gen email? Have it all with the all-new Yahoo! Mail Beta. 
> _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From bix at sendu.me.uk Mon Jul 17 12:31:37 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Mon, 17 Jul 2006 17:31:37 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes Message-ID: <44BBBB69.6000906@sendu.me.uk> I see strange node names via Bio::DB::Taxonomy::flatfile: use Bio::DB::Taxonomy; my $db = new Bio::DB::Taxonomy(-source => 'flatfile', -directory => $taxonomy_dir, -nodesfile => $taxonomy_dir.'nodes.dmp', -namesfile => $taxonomy_dir.'names.dmp'); my $tax_id = 89593; my $node = $db->get_Taxonomy_Node($tax_id); print "node $tax_id has name '", @{$node->name('common')}, "' and rank '", $node->rank, "'\n"; Results in: node 89593 has name 'Craniata ' and rank 'subphylum' Other examples: node 2 has name 'Bacteria ' and rank 'superkingdom' node 1386 has name 'Bacillus ' and rank 'genus' node 7776 has name 'Gnathostomata ' and rank 'superclass' etc. For me the bits in <> are inappropriate and shouldn't be there. The NCBI website agrees, and you won't see these things if you use -source => 'entrez'. Should they be removed by the flatfile parser as a matter of course, with no warnings or option? Or do people want them? Typically they are just the name of the parent node, so I don't see why anyone would /need/ them, and I argue it's invalid for parent node information to be duplicated here. If there are no objections I'll strip the <> bits. 
I also plan to make $node->name('scientific', 'sapiens'); set and get the node name, and have flatfile and entrez store all common names with $obj->name('common', 'human', 'man');. As these changes will make the implementation match the docs I don't see any problems, except that flatfile users will now find the node name in a different place (@{$node->name('scientific')} instead of @{$node->name('common')}). I'll also fix the problem with node names for ranks species and lower, as discussed in thread 'Bio::DB::Taxonomy:: mishandles species, subspecies/variant names', in the way I suggested there. If anyone can see a problem with any of these changes, let me know asap. From hlapp at gmx.net Mon Jul 17 13:53:17 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Mon, 17 Jul 2006 13:53:17 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BBBB69.6000906@sendu.me.uk> References: <44BBBB69.6000906@sendu.me.uk> Message-ID: <8C935CDE-BEFF-48BC-8CC8-0C3FFEF2EF17@gmx.net> Sound good to me. BTW NCBI guarantees (well, promises) that there will only be one node name of class 'scientific'. -hilmar On Jul 17, 2006, at 12:31 PM, Sendu Bala wrote: > I see strange node names via Bio::DB::Taxonomy::flatfile: > > use Bio::DB::Taxonomy; > > my $db = new Bio::DB::Taxonomy(-source => 'flatfile', -directory => > $taxonomy_dir, -nodesfile => $taxonomy_dir.'nodes.dmp', -namesfile => > $taxonomy_dir.'names.dmp'); > > my $tax_id = 89593; > my $node = $db->get_Taxonomy_Node($tax_id); > > print "node $tax_id has name '", @{$node->name('common')}, "' and rank > '", $node->rank, "'\n"; > > Results in: > node 89593 has name 'Craniata ' and rank 'subphylum' > > Other examples: > node 2 has name 'Bacteria ' and rank 'superkingdom' > node 1386 has name 'Bacillus ' and rank 'genus' > node 7776 has name 'Gnathostomata ' and rank 'superclass' > etc. > > For me the bits in <> are inappropriate and shouldn't be there. 
The > NCBI > website agrees, and you won't see these things if you use -source => > 'entrez'. Should they be removed by the flatfile parser as a matter of > course, with no warnings or option? Or do people want them? Typically > they are just the name of the parent node, so I don't see why anyone > would /need/ them, and I argue it's invalid for parent node > information > to be duplicated here. > > If there are no objections I'll strip the <> bits. I also plan to make > $node->name('scientific', 'sapiens'); set and get the node name, and > have flatfile and entrez store all common names with > $obj->name('common', 'human', 'man');. As these changes will make the > implementation match the docs I don't see any problems, except that > flatfile users will now find the node name in a different place > (@{$node->name('scientific')} instead of @{$node->name('common')}). > > I'll also fix the problem with node names for ranks species and lower, > as discussed in thread 'Bio::DB::Taxonomy:: mishandles species, > subspecies/variant names', in the way I suggested there. > > If anyone can see a problem with any of these changes, let me know > asap. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From cjfields at uiuc.edu Mon Jul 17 14:31:08 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 17 Jul 2006 13:31:08 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <8C935CDE-BEFF-48BC-8CC8-0C3FFEF2EF17@gmx.net> Message-ID: <001d01c6a9cf$2cf50f60$15327e82@pyrimidine> I agree. Would be nice to get this to play well with weird bacterial names! 
I plan on doing some behind-the-scenes work on Bio::DB::Taxonomy::entrez at some point soon to test out Bio::DB::EUtilities as the user agent; it currently uses Bio::Root::HTTPget, I think. Reason I'm doing this is to quickly get tax info based on any primary ID, primarily for grabbing related Tax information from the sequence GI w/o parsing the sequence for the TaxID; this uses NCBI's ELink which I've now implemented. I'll make sure everything passes tests before I commit. Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Hilmar Lapp > Sent: Monday, July 17, 2006 12:53 PM > To: Sendu Bala > Cc: bioperl-l at lists.open-bio.org > Subject: Re: [Bioperl-l] Bio::*Taxonomy* changes > > Sound good to me. > > BTW NCBI guarantees (well, promises) that there will only be one node > name of class 'scientific'. > > -hilmar > > On Jul 17, 2006, at 12:31 PM, Sendu Bala wrote: > > > I see strange node names via Bio::DB::Taxonomy::flatfile: > > > > use Bio::DB::Taxonomy; > > > > my $db = new Bio::DB::Taxonomy(-source => 'flatfile', -directory => > > $taxonomy_dir, -nodesfile => $taxonomy_dir.'nodes.dmp', -namesfile => > > $taxonomy_dir.'names.dmp'); > > > > my $tax_id = 89593; > > my $node = $db->get_Taxonomy_Node($tax_id); > > > > print "node $tax_id has name '", @{$node->name('common')}, "' and rank > > '", $node->rank, "'\n"; > > > > Results in: > > node 89593 has name 'Craniata ' and rank 'subphylum' > > > > Other examples: > > node 2 has name 'Bacteria ' and rank 'superkingdom' > > node 1386 has name 'Bacillus ' and rank 'genus' > > node 7776 has name 'Gnathostomata ' and rank 'superclass' > > etc. > > > > For me the bits in <> are inappropriate and shouldn't be there. The > > NCBI > > website agrees, and you won't see these things if you use -source => > > 'entrez'. Should they be removed by the flatfile parser as a matter of > > course, with no warnings or option? 
Or do people want them? Typically > > they are just the name of the parent node, so I don't see why anyone > > would /need/ them, and I argue it's invalid for parent node > > information > > to be duplicated here. > > > > If there are no objections I'll strip the <> bits. I also plan to make > > $node->name('scientific', 'sapiens'); set and get the node name, and > > have flatfile and entrez store all common names with > > $obj->name('common', 'human', 'man');. As these changes will make the > > implementation match the docs I don't see any problems, except that > > flatfile users will now find the node name in a different place > > (@{$node->name('scientific')} instead of @{$node->name('common')}). > > > > I'll also fix the problem with node names for ranks species and lower, > > as discussed in thread 'Bio::DB::Taxonomy:: mishandles species, > > subspecies/variant names', in the way I suggested there. > > > > If anyone can see a problem with any of these changes, let me know > > asap. 
> > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From bix at sendu.me.uk Mon Jul 17 14:09:44 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Mon, 17 Jul 2006 19:09:44 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <8C935CDE-BEFF-48BC-8CC8-0C3FFEF2EF17@gmx.net> References: <44BBBB69.6000906@sendu.me.uk> <8C935CDE-BEFF-48BC-8CC8-0C3FFEF2EF17@gmx.net> Message-ID: <44BBD268.2060308@sendu.me.uk> Hilmar Lapp wrote: >> I also plan to make $node->name('scientific', 'sapiens'); set and >> get the node name, [...] users will now find the node name in [...] >> @{$node->name('scientific')} > > BTW NCBI guarantees (well, promises) that there will only be one node > name of class 'scientific'. Yes, which is why I feel the API for name() isn't ideal, but thought it would be best to play along. Would having a new scientific_name() method be better, which gets/sets a single value? Perhaps it could just be a more 'sane' shorthand to setting @{$node->name('scientific')} to a list with only the supplied name, and getting ${$node->name('scientific')}[0] ? 
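To make the proposed shorthand concrete, here is a minimal sketch of how a scientific_name() get/set could sit on top of a name() method that stores a list per name class. The package Sketch::Node and its internals are hypothetical stand-ins written for illustration only; this is not the Bio::Taxonomy::Node implementation, and the real method may end up with different behaviour.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for a taxonomy node (assumption: NOT the real
# Bio::Taxonomy::Node; only the method names come from the thread).
package Sketch::Node;

sub new {
    my ($class) = @_;
    return bless { names => {} }, $class;
}

# name($name_class, @values): when values are given, replace the stored
# list for that name class; always return an array reference.
sub name {
    my ($self, $name_class, @values) = @_;
    $self->{names}{$name_class} = [@values] if @values;
    return $self->{names}{$name_class} || [];
}

# Shorthand: set the 'scientific' list to exactly one value, and
# return its first (only) element.
sub scientific_name {
    my ($self, $value) = @_;
    $self->name('scientific', $value) if defined $value;
    return $self->name('scientific')->[0];
}

package main;

my $node = Sketch::Node->new;
$node->scientific_name('Homo sapiens');
$node->name('common', 'human', 'man');

print "scientific: ", $node->scientific_name, "\n";
print "common: ", join(', ', @{ $node->name('common') }), "\n";
```

With something like this, callers who only want the single scientific name never have to dereference the array returned by name('scientific').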
From hlapp at gmx.net Mon Jul 17 15:31:55 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Mon, 17 Jul 2006 15:31:55 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BBD268.2060308@sendu.me.uk> References: <44BBBB69.6000906@sendu.me.uk> <8C935CDE-BEFF-48BC-8CC8-0C3FFEF2EF17@gmx.net> <44BBD268.2060308@sendu.me.uk> Message-ID: <5B62229C-BAB7-4320-BBAE-87A483B0EC15@gmx.net> Yes I think $node->scientific_name() as shorthand would be good to have. Same BTW for $node->common_names() (which would return an array). -hilmar On Jul 17, 2006, at 2:09 PM, Sendu Bala wrote: > Hilmar Lapp wrote: >>> I also plan to make $node->name('scientific', 'sapiens'); set and >>> get the node name, [...] users will now find the node name in [...] >>> @{$node->name('scientific')} >> >> BTW NCBI guarantees (well, promises) that there will only be one node >> name of class 'scientific'. > > Yes, which is why I feel the API for name() isn't ideal, but > thought it > would be best to play along. Would having a new scientific_name() > method > be better, which gets/sets a single value? Perhaps it could just be a > more 'sane' shorthand to setting @{$node->name('scientific')} to a > list > with only the supplied name, and getting ${$node->name > ('scientific')}[0] ? 
> _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From cjfields at uiuc.edu Mon Jul 17 16:44:18 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 17 Jul 2006 15:44:18 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <5B62229C-BAB7-4320-BBAE-87A483B0EC15@gmx.net> Message-ID: <000001c6a9e1$c6b51610$15327e82@pyrimidine> There was some interest in getting Bio::Species to delegate to Bio::Taxonomy::Node, so having scientific_name() would help quite a bit since the name used on the ORGANISM line is the scientific name (well, is supposed to be; famous last words). Don't know about SwissProt, EMBL, and others though... Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Hilmar Lapp > Sent: Monday, July 17, 2006 2:32 PM > To: Sendu Bala > Cc: bioperl-l at lists.open-bio.org > Subject: Re: [Bioperl-l] Bio::*Taxonomy* changes > > Yes I think $node->scientific_name() as shorthand would be good to > have. Same BTW for $node->common_names() (which would return an array). > > -hilmar > > On Jul 17, 2006, at 2:09 PM, Sendu Bala wrote: > > > Hilmar Lapp wrote: > >>> I also plan to make $node->name('scientific', 'sapiens'); set and > >>> get the node name, [...] users will now find the node name in [...] > >>> @{$node->name('scientific')} > >> > >> BTW NCBI guarantees (well, promises) that there will only be one node > >> name of class 'scientific'. > > > > Yes, which is why I feel the API for name() isn't ideal, but > > thought it > > would be best to play along. Would having a new scientific_name() > > method > > be better, which gets/sets a single value? 
Perhaps it could just be a > > more 'sane' shorthand to setting @{$node->name('scientific')} to a > > list > > with only the supplied name, and getting ${$node->name > > ('scientific')}[0] ? > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From vrramnar at student.cs.uwaterloo.ca Mon Jul 17 16:46:32 2006 From: vrramnar at student.cs.uwaterloo.ca (vrramnar at student.cs.uwaterloo.ca) Date: Mon, 17 Jul 2006 16:46:32 -0400 Subject: [Bioperl-l] Remote Blast - Blast Human Genome In-Reply-To: <000501c6a6e8$b9c24d20$15327e82@pyrimidine> References: <000501c6a6e8$b9c24d20$15327e82@pyrimidine> Message-ID: <1153169192.44bbf728056fd@www.nexusmail.uwaterloo.ca> Hi Chris, 1. I have tried changing the database to snp or dbSNP but neither works. It seems that depending on which type of blast you use(ie, Genome Blast, Blast SNP, normal blast such as blastn, etc...) you see a different listing of databases available for querys. Since you mention that the Blast page I see was generated by Genome, where could I go to see a complete listing of databases I can query?? Or if you knew off hand which database to search if I only wanted dbSNP hits? 2. You also mention, I can limit the search by using Entrez terms. Do you mean like: $Bio::Tools::Run::RemoteBlast::HEADER{'ENTREZ_QUERY'} = 'abc'; where 'abc' is the name of the subject with which you would only like to see result of. For example if you put it as 'Homo sapiens[Organism]' then only human sequences would be in hit lists. 
If this is what you mean, what would I change it to, to see only hits from dbSNP? Thanks for the ongoing help, Rohan Quoting Chris Fields : > I added a method to RemoteBlast in bioperl-live (CVS) if you want to play > with changing the URL. I have been thinking about doing this for a bit now > but I already see problems. > > Here's the issue: the BLAST page you see is NOT the NCBI BLAST page (note > the differences in the URL) but a user-friendly request page, generated on > the fly by Genome, to submit BLAST requests for the relevant database. So > changing the URL will not work (even by adding extra parameters); you only > get the original HTML web page. > > You could try changing the database or limiting the search using an Entrez > term (which you should be able to include in the request, probably by adding > it to the HEADER). > > Chris > > > -----Original Message----- > > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > > bounces at lists.open-bio.org] On Behalf Of vrramnar at student.cs.uwaterloo.ca > > Sent: Thursday, July 13, 2006 5:39 PM > > To: bioperl-l at lists.open-bio.org > > Subject: [Bioperl-l] Remote Blast - Blast Human Genome > > > > > > Hello Again, > > > > I have another question regarding Remote blast but this time using Genome > > Blast. > > > > Here is the link: > > > > http://www.ncbi.nlm.nih.gov/genome/seq/BlastGen/BlastGen.cgi?taxid=9606 > > > > which again uses the main Blast web site: > > > > http://www.ncbi.nlm.nih.gov/BLAST/Blast.cgi > > > > Again I am not sure what to add or what HEADER information to change > > within my > > script. 
> >
> > Here is my program, which is the same as in the last email:
> >
> > #!/usr/bin/perl -w
> >
> > use Bio::Perl;
> > use Bio::Tools::Run::RemoteBlast;
> >
> > my $prog  = "blastn";
> > my $db    = "refseq_genomic";
> > my $e_val = 0.01;
> >
> > my @params = ( '-prog'   => $prog,
> >                '-data'   => $db,
> >                '-expect' => $e_val );
> >
> > my $factory = Bio::Tools::Run::RemoteBlast->new(@params);
> > $Bio::Tools::Run::RemoteBlast::HEADER{'WWW_BLAST_TYPE'} = '????';  # <----- what do I put here
> > #$Bio::Tools::Run::RemoteBlast::HEADER{'?????'} = '????';  # <--- Do I need to add any other values to the form inputs
> >
> > $factory->submit_blast("blast.in");
> > my $v = 1;
> >
> > while ( my @rids = $factory->each_rid ) {
> >     foreach my $rid ( @rids ) {
> >         my $rc = $factory->retrieve_blast($rid);
> >         if ( !ref($rc) ) {
> >             if ( $rc < 0 ) {
> >                 $factory->remove_rid($rid);
> >             }
> >             print STDERR "." if ( $v > 0 );
> >             sleep 5;
> >         }
> >         else {
> >             my $result   = $rc->next_result();
> >             my $filename = $result->query_name()."\.out";
> >             $factory->save_output($filename);
> >             $factory->remove_rid($rid);
> >             print "\nQuery Name: ", $result->query_name(), "\n";
> >         }
> >     }
> > }
> >
> >
> > Both of my questions are very similar, in that I know how to use remote
> > blast but not sure what to change to access the specific blast I want.
> >
> > Again, any help would be very appreciated!!
> > > > Rohan > > > > > > > > ---------------------------------------- > > This mail sent through www.mywaterloo.ca > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > ---------------------------------------- This mail sent through www.mywaterloo.ca From cjfields at uiuc.edu Mon Jul 17 17:25:54 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 17 Jul 2006 16:25:54 -0500 Subject: [Bioperl-l] Remote Blast - Blast Human Genome In-Reply-To: <1153169192.44bbf728056fd@www.nexusmail.uwaterloo.ca> Message-ID: <001001c6a9e7$962b56c0$15327e82@pyrimidine> Okay, I think I may know what's going on a little more now with NCBI's BLAST interface. Looks like any NCBI BLAST query must use the default URL and so must set up the proper GET/PUT commands to retrieve everything correctly. Here's the API description for it all: http://www.ncbi.nlm.nih.gov/BLAST/Doc/urlapi.html You could try setting the database to 'snp' or something along those lines instead of 'nr'; or you could see what the name of the database is when you use the web form and try setting it to that. According to this page, this should be possible: http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=helpsnpfaq.section.SearchdbSNP_test._Search_dbSNP_Using_B The Entrez Query limit was a recommendation for limiting your search to a set of sequences for human, for instance. I'll try looking into it a bit more but I'm pretty busy. If you find anything out you should probably post it here. Chris > Hi Chris, > > 1. I have tried changing the database to snp or dbSNP but neither works. > It > seems that depending on which type of blast you use(ie, Genome Blast, > Blast SNP, > normal blast such as blastn, etc...) you see a different listing of > databases > available for querys.
Since you mention that the Blast page I see was > generated > by Genome, where could I go to see a complete listing of databases I can > query?? > Or if you knew off hand which database to search if I only wanted dbSNP > hits? > > 2. You also mention, I can limit the search by using Entrez terms. Do you > mean > like: > $Bio::Tools::Run::RemoteBlast::HEADER{'ENTREZ_QUERY'} = 'abc'; > where 'abc' is the name of the subject with which you would only like to > see > result of. For example if you put it as 'Homo sapiens[Organism]' then only > human > sequences would be in hit lists. > If this is what you mean, what would I change it to, to see only hits from > dbSNP? > > Thanks for the ongoing help, > > Rohan > > Quoting Chris Fields : > > > I added a method to RemoteBlast in bioperl-live (CVS) if you want to > play > > with changing the URL. I have been thinking about doing this for a bit > now > > but I already see problems. > > > > Here's the issue: the BLAST page you see is NOT the NCBI BLAST page > (note > > the differences in the URL) but a user-friendly request page, generated > on > > the fly by Genome, to submit BLAST requests for the relevant database. > So > > changing the URL will not work (even by adding extra parameters); you > only > > get the original HTML web page. > > > > You could try changing the database or limiting the search using an > Entrez > > term (which you should be able to include in the request, probably by > adding > > it to the HEADER). > > > > Chris > > > > > -----Original Message----- > > > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > > > bounces at lists.open-bio.org] On Behalf Of > vrramnar at student.cs.uwaterloo.ca > > > Sent: Thursday, July 13, 2006 5:39 PM > > > To: bioperl-l at lists.open-bio.org > > > Subject: [Bioperl-l] Remote Blast - Blast Human Genome > > > > > > > > > Hello Again, > > > > > > I have another question regarding Remote blast but this time using > Genome > > > Blast. 
> > >
> > > Here is the link:
> > >
> > > http://www.ncbi.nlm.nih.gov/genome/seq/BlastGen/BlastGen.cgi?taxid=9606
> > >
> > > which again uses the main Blast web site:
> > >
> > > http://www.ncbi.nlm.nih.gov/BLAST/Blast.cgi
> > >
> > > Again I am not sure what to add or what HEADER information to change
> > > within my script.
> > >
> > > Here is my program, which is the same as in the last email:
> > >
> > > #!/usr/bin/perl -w
> > >
> > > use Bio::Perl;
> > > use Bio::Tools::Run::RemoteBlast;
> > >
> > > my $prog  = "blastn";
> > > my $db    = "refseq_genomic";
> > > my $e_val = 0.01;
> > >
> > > my @params = ( '-prog'   => $prog,
> > >                '-data'   => $db,
> > >                '-expect' => $e_val );
> > >
> > > my $factory = Bio::Tools::Run::RemoteBlast->new(@params);
> > > $Bio::Tools::Run::RemoteBlast::HEADER{'WWW_BLAST_TYPE'} = '????';  # <----- what do I put here
> > > #$Bio::Tools::Run::RemoteBlast::HEADER{'?????'} = '????';  # <--- Do I need to add any other values to the form inputs
> > >
> > > $factory->submit_blast("blast.in");
> > > my $v = 1;
> > >
> > > while ( my @rids = $factory->each_rid ) {
> > >     foreach my $rid ( @rids ) {
> > >         my $rc = $factory->retrieve_blast($rid);
> > >         if ( !ref($rc) ) {
> > >             if ( $rc < 0 ) {
> > >                 $factory->remove_rid($rid);
> > >             }
> > >             print STDERR "." if ( $v > 0 );
> > >             sleep 5;
> > >         }
> > >         else {
> > >             my $result   = $rc->next_result();
> > >             my $filename = $result->query_name()."\.out";
> > >             $factory->save_output($filename);
> > >             $factory->remove_rid($rid);
> > >             print "\nQuery Name: ", $result->query_name(), "\n";
> > >         }
> > >     }
> > > }
> > >
> > >
> > > Both of my questions are very similar, in that I know how to use remote
> > > blast but not sure what to change to access the specific blast I want.
> > >
> > > Again, any help would be very appreciated!!
> > > > > > Rohan > > > > > > > > > > > > ---------------------------------------- > > > This mail sent through www.mywaterloo.ca > > > _______________________________________________ > > > Bioperl-l mailing list > > > Bioperl-l at lists.open-bio.org > > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > > > > ---------------------------------------- > This mail sent through www.mywaterloo.ca From bix at sendu.me.uk Mon Jul 17 17:33:26 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Mon, 17 Jul 2006 22:33:26 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <000001c6a9e1$c6b51610$15327e82@pyrimidine> References: <000001c6a9e1$c6b51610$15327e82@pyrimidine> Message-ID: <44BC0226.1080605@sendu.me.uk> Chris Fields wrote: > There was some interest in getting Bio::Species to delegate to > Bio::Taxonomy::Node, so having scientific_name() would help quite a bit > since the name used on the ORGANISM line is the scientific name (well, is > supposed to be; famous last words). Can you clarify exactly what you mean here? Preferably with an example? ORGANISM line of which file format? The reason I ask is that I still feel we need to do parsing of the names for species rank and lower: # The 'scientific name' for humans could be considered to be 'Homo sapiens'. # Taxid 9606 in the NCBI taxonomy database has rank 'species' and ScientificName 'Homo sapiens'. # For sanity, Bio::*Taxonomy* likes to interpret this ScientificName as 'sapiens' so that the genus is not held redundantly. It provides a binomial() method to give you 'Homo sapiens' again if you want it. # I plan on maintaining this; scientific_name() would give you the non-redundant sibling-unique name 'sapiens'. binomial() on a species rank and lower would give you 'Homo sapiens' (presumably grabbing the 'Homo' from the parent node with rank 'genus', or similar). Good, bad or ugly? 
I would prefer it works like this and we agree to differ with NCBI on what the 'scientific name' of a species node should be. Bio::Species can still delegate to Bio::Taxonomy::Node by calling binomial() (which I propose will actually give the correct answer, even for bacteria and viruses). Perhaps the short-hand (and the classifier used in name()) shouldn't mention the word 'scientific' to avoid confusion? But a) what else would we call it?, and b) for all ranks above species it /is/ the scientific name. From hlapp at gmx.net Mon Jul 17 19:47:24 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Mon, 17 Jul 2006 19:47:24 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BC0226.1080605@sendu.me.uk> References: <000001c6a9e1$c6b51610$15327e82@pyrimidine> <44BC0226.1080605@sendu.me.uk> Message-ID: <64AB2D48-B342-4FC4-BB4E-9927B24C972C@gmx.net> I don't think we should differ from NCBI in places where the connection between a method name and the NCBI data file is obvious or otherwise we will confuse people and send them into traps. $node->scientific_name() should simply report what NCBI reports. For simple species this will be identical to what $node->binomial() returns, but for others it may not, e.g., strains, varieties, etc or the weird world of viri and bacteria. This will also absolve us from retaining the business logic for how to construct the scientific name from genus, species, and possibly strain or whatever. binomial() isn't part of the NCBI taxonomy definition, so you have freedom there to report what suits you. -hilmar On Jul 17, 2006, at 5:33 PM, Sendu Bala wrote: > Chris Fields wrote: >> There was some interest in getting Bio::Species to delegate to >> Bio::Taxonomy::Node, so having scientific_name() would help quite >> a bit >> since the name used on the ORGANISM line is the scientific name >> (well, is >> supposed to be; famous last words). > > Can you clarify exactly what you mean here? Preferably with an > example? 
> ORGANISM line of which file format? > The reason I ask is that I still feel we need to do parsing of the > names > for species rank and lower: > > # The 'scientific name' for humans could be considered to be 'Homo > sapiens'. > # Taxid 9606 in the NCBI taxonomy database has rank 'species' and > ScientificName 'Homo sapiens'. > # For sanity, Bio::*Taxonomy* likes to interpret this > ScientificName as > 'sapiens' so that the genus is not held redundantly. It provides a > binomial() method to give you 'Homo sapiens' again if you want it. > # I plan on maintaining this; scientific_name() would give you the > non-redundant sibling-unique name 'sapiens'. binomial() on a species > rank and lower would give you 'Homo sapiens' (presumably grabbing the > 'Homo' from the parent node with rank 'genus', or similar). > > Good, bad or ugly? I would prefer it works like this and we agree to > differ with NCBI on what the 'scientific name' of a species node > should > be. Bio::Species can still delegate to Bio::Taxonomy::Node by calling > binomial() (which I propose will actually give the correct answer, > even > for bacteria and viruses). > > Perhaps the short-hand (and the classifier used in name()) shouldn't > mention the word 'scientific' to avoid confusion? But a) what else > would > we call it?, and b) for all ranks above species it /is/ the > scientific name. 
> _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From osborne1 at optonline.net Mon Jul 17 20:52:04 2006 From: osborne1 at optonline.net (Brian Osborne) Date: Mon, 17 Jul 2006 20:52:04 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BC0226.1080605@sendu.me.uk> Message-ID: Sendu, The string "sapiens" is not what a biology textbook would call a scientific name. You're going to have to respect decades of convention and have scientific_name() return the genus and species name. Brian O. On 7/17/06 5:33 PM, "Sendu Bala" wrote: > # I plan on maintaining this; scientific_name() would give you the > non-redundant sibling-unique name 'sapiens'. binomial() on a species > rank and lower would give you 'Homo sapiens' (presumably grabbing the > 'Homo' from the parent node with rank 'genus', or similar). From cjfields at uiuc.edu Mon Jul 17 21:36:12 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 17 Jul 2006 20:36:12 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BC0226.1080605@sendu.me.uk> References: <000001c6a9e1$c6b51610$15327e82@pyrimidine> <44BC0226.1080605@sendu.me.uk> Message-ID: <1345AB61-E7AB-447A-AB40-2170244404B2@uiuc.edu> On Jul 17, 2006, at 4:33 PM, Sendu Bala wrote: > Chris Fields wrote: >> There was some interest in getting Bio::Species to delegate to >> Bio::Taxonomy::Node, so having scientific_name() would help quite >> a bit >> since the name used on the ORGANISM line is the scientific name >> (well, is >> supposed to be; famous last words). > > Can you clarify exactly what you mean here? Preferably with an > example? > ORGANISM line of which file format? 
> The reason I ask is that I still feel we need to do parsing of the > names > for species rank and lower: Sorry, should have clarified; GenBank sequence format. Here's the link: http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html The ORGANISM annotation line for a GenBank record contains the formal scientific name for the organism along with the lineage. I believe SwissProt/EMBL and several other RichSeq formats do the same. The lineage that is also present is almost always abbreviated, so it's not always possible to determine the formal rankings strictly from the file with any real degree of reliability (hence the past problems with Bio::Species). > > # The 'scientific name' for humans could be considered to be 'Homo > sapiens'. > # Taxid 9606 in the NCBI taxonomy database has rank 'species' and > ScientificName 'Homo sapiens'. > # For sanity, Bio::*Taxonomy* likes to interpret this > ScientificName as > 'sapiens' so that the genus is not held redundantly. It provides a > binomial() method to give you 'Homo sapiens' again if you want it. > # I plan on maintaining this; scientific_name() would give you the > non-redundant sibling-unique name 'sapiens'. binomial() on a species > rank and lower would give you 'Homo sapiens' (presumably grabbing the > 'Homo' from the parent node with rank 'genus', or similar). I think you should use scientific_name to designate the full formal scientific name for an organism according to the way NCBI describes it for that particular node (nothing more, except removing the <> stuff you mentioned earlier) and as it would appear for the ORGANISM line. Otherwise you'll run into serious species/subspecies/strain headaches (see below). If you want real genus/species (i.e. nothing extra, like strains or subspecies), separate them out and store them using a genus/species get/set if possible; the binomial them will give back the two name genus species designation. Here are a couple of example ones in (this is in XML, using EUtilities). 
These were retrieved using NCBI TaxIDs using Elink from a list of protein GI's (~700 of them total), so they represent the actual NCBI TaxID linked with the sequence file. If you try breaking these apart into species, what happens to the strain/subspecies stuff? Notice that many of these nodes, which come directly from protein GI's, also have no rank. ... 376686 Flavobacterium johnsoniae UW101 Flavobacterium johnsoniae NBRC 14942 Flavobacterium johnsoniae IFO 14942 Flavobacterium johnsoniae IAM 14304 Flavobacterium johnsoniae MYX.1.1.1 Flavobacterium johnsoniae NCIB 11054 Flavobacterium johnsoniae DSM 2064 Flavobacterium johnsoniae LMG 1341 Flavobacterium johnsoniae ATCC 17061 Flavobacterium johnsoniae strain UW101 Flavobacterium johnsoniae str. UW101 986 no rank Bacteria ... 370552 Streptococcus pyogenes MGAS10270 Streptococcus pyogenes strain MGAS10270 Streptococcus pyogenes str. MGAS10270 301448 no rank Bacteria ... 224308 Bacillus subtilis subsp. subtilis str. 168 Bacillus subtilis subsp. subtilis 168 135461 no rank Bacteria > Good, bad or ugly? I would prefer it works like this and we agree to > differ with NCBI on what the 'scientific name' of a species node > should > be. Bio::Species can still delegate to Bio::Taxonomy::Node by calling > binomial() (which I propose will actually give the correct answer, > even > for bacteria and viruses). This is where I would strongly disagree (though I agree that the way NCBI uses 'scientific name' is a bit off). We are using the NCBI tax database, and as such we are somewhat at the mercy of the NCBI tax nomenclature, unfortunately. If NCBI decides to change their official definition for the scientific name to something that makes a bit more sense, the XML and dump data will reflect that and we won't have many problems adapting since the scientific name will always conform to their definition. But if we split the information up ad hoc then we are bound for disaster; it's just way too much headache to worry about.
We could always point to the official NCBI definition as the one we adopt and then assign the tagged information from the node directly to scientific_name (no globbing together at all). Bio::Species could delegate likewise from the ORGANISM line, so there are no piecemeal attempts to get Humpty Dumpty to fit back together again. You could go through and get the lineage from the XML/dump file data and try to sort the genus/species out, then paste it all back together (fingers crossed!), but I think it's more headache than it's worth to split these up, then hope that you can paste them back together again and always expect to get the same results. Chris > Perhaps the short-hand (and the classifier used in name()) shouldn't > mention the word 'scientific' to avoid confusion? But a) what else > would > we call it?, and b) for all ranks above species it /is/ the > scientific name. Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From cjfields at uiuc.edu Mon Jul 17 21:55:28 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 17 Jul 2006 20:55:28 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: References: Message-ID: <07041A82-BE01-4998-AF26-35212CA1F0F2@uiuc.edu> I agree with Hilmar's assessment, not b/c I disagree with your definition of scientific name or the reasoning Sendu proposes. I think we are somewhat bound to NCBI's nomenclature for their tax database. If we veer away from NCBI's definition for 'scientific name' it will just confuse users and lead to more trouble than it's worth, frankly. If we stick with it then any changes NCBI makes should be easier to deal with.
Leaving the scientific_name as NCBI designates it, though it probably disagrees with ~99% of the world's textbooks, may be the most maintainable solution. Now, binomial() on the other hand... Chris On Jul 17, 2006, at 7:52 PM, Brian Osborne wrote: > Sendu, > > The string "sapiens" is not what a biology textbook would call a > scientific > name. You're going to have to respect decades of convention and have > scientific_name() return the genus and species name. > > Brian O. > > > On 7/17/06 5:33 PM, "Sendu Bala" wrote: > >> # I plan on maintaining this; scientific_name() would give you the >> non-redundant sibling-unique name 'sapiens'. binomial() on a species >> rank and lower would give you 'Homo sapiens' (presumably grabbing the >> 'Homo' from the parent node with rank 'genus', or similar). > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From hlapp at gmx.net Mon Jul 17 22:06:01 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Mon, 17 Jul 2006 22:06:01 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <07041A82-BE01-4998-AF26-35212CA1F0F2@uiuc.edu> References: <07041A82-BE01-4998-AF26-35212CA1F0F2@uiuc.edu> Message-ID: On Jul 17, 2006, at 9:55 PM, Chris Fields wrote: > Leaving the scientific_name as NCBI designates it, though it probably > disagrees with ~99% of the world's textbooks, may be the most > maintainable solution. It doesn't disagree, it's quite like what the world's textbooks give you as a 'scientific name'. 
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From cjfields at uiuc.edu Tue Jul 18 00:24:50 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 17 Jul 2006 23:24:50 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: References: <07041A82-BE01-4998-AF26-35212CA1F0F2@uiuc.edu> Message-ID: <7BCA093B-90FB-4B0A-91FD-A6E0B34C96DD@uiuc.edu> If you mean genus-species, then yes. But parent nodes? If you trust Wikipedia, the scientific name == binomial nomenclature. Which could mean no subspecies, strains, etc if one were to be really strict about it, though that may be a grey area; I'm no taxonomist. http://en.wikipedia.org/wiki/Scientific_name The parent nodes shouldn't have a scientific name if one were to adhere strictly to the standard definition above, but NCBI refers to the names for the parent nodes as 'scientific name' (the XML element is still ScientificName, just like the child node). I'm not sure what the tax dump file is, though, so that may be different. Here's the lineage for Taxid 312284 (marine actinobacterium PHSC20C1). I cut out the irrelevant bits and just show the lineage with all the parent nodes, taxID, and rank: 131567 cellular organisms no rank 2 Bacteria superkingdom 201174 Actinobacteria phylum 1760 Actinobacteria (class) class 52018 unclassified Actinobacteria no rank 78537 unclassified Actinobacteria (miscellaneous) no rank .... Seems to me the easiest thing to do here, when looking at a particular node, is to use scientific_name() to hold that particular element for the node and have binomial represent the true 'scientific name', much as Sendu proposed.
It would also make life much easier when parsing GenBank/SwissProt/EMBL (SeqIO) to have the data designating the formal scientific name (according to NCBI) be assigned to a scientific_name() get/set method in Bio::Species for later writing; then if we want to delegate this over to Bio::Taxonomy::Node from Bio::Species it would be that much easier. This would also get around some of the problems I have been seeing with bacterial names when passing GenBank data through SeqIO, since you wouldn't be required to glop the name together from the way Bio::Species tried to guess the lineage. Chris On Jul 17, 2006, at 9:06 PM, Hilmar Lapp wrote: > > On Jul 17, 2006, at 9:55 PM, Chris Fields wrote: > >> Leaving the scientific_name as NCBI designates it, though it probably >> disagrees with ~99% of the world's textbooks, may be the most >> maintainable solution. > > It doesn't disagree, it's quite like what the world's textbooks give > you as a 'scientific name'. > > -hilmar > > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. 
Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From bix at sendu.me.uk Tue Jul 18 03:27:49 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Tue, 18 Jul 2006 08:27:49 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <64AB2D48-B342-4FC4-BB4E-9927B24C972C@gmx.net> References: <000001c6a9e1$c6b51610$15327e82@pyrimidine> <44BC0226.1080605@sendu.me.uk> <64AB2D48-B342-4FC4-BB4E-9927B24C972C@gmx.net> Message-ID: <44BC8D75.1080806@sendu.me.uk> Hilmar Lapp wrote: > I don't think we should differ from NCBI in places where the > connection between a method name and the NCBI data file is obvious or > otherwise we will confuse people and send them into traps. > > $node->scientific_name() should simply report what NCBI reports. For > simple species this will be identical to what $node->binomial() > returns, but for others it may not, e.g., strains, varieties, etc or > the weird world of viri and bacteria. Ok, well this certainly seems to be consensus so I'll abide. > This will also absolve us from retaining the business logic for how > to construct the scientific name from genus, species, and possibly > strain or whatever. What about the existing genus(), species(), sub_species() and variant() methods? There would be no need for any logic to join things together, but I would still like to be able to get just 'sapiens' from somewhere. Can I use species() for that purpose (though again, species is strictly 'Homo sapiens')? Likewise sub_species() and variant() could hold the remaining non-redundant names. Or should all of these be deprecated because they don't really have a place in a generic Node class? What about node_name()? Yet another synonym of scientific_name? (right now it grabs the common name(s)). Ugh. What should I do with the classification array? Should it hold the raw ScientificName like: join(',', $node->classification) eq 'Homo sapiens, Homo, Homo/Pan/Gorilla group [...]'? 
Or should it be like: join(',', $node->classification) eq 'sapiens, Homo, Homo/Pan/Gorilla group [...]'? The latter is how it currently works (when it works correctly); I would rather fix it than lose the logic completely, but if we're staying true to proper classification (vs. what a programmer might expect), I guess I must use the raw ScientificName? > binomial() isn't part of the NCBI taxonomy definition, so you have > freedom there to report what suits you. I don't think binomial() would serve any useful purpose now, however. I can either deprecate it or make it a synonym of scientific_name() or both. Or binomial() can be a version of scientific_name() that complains if you use it on a rank higher or lower than species. As for species() et al., it may have no place in a generic Node class. Thoughts? From bix at sendu.me.uk Tue Jul 18 04:43:43 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Tue, 18 Jul 2006 09:43:43 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BBBB69.6000906@sendu.me.uk> References: <44BBBB69.6000906@sendu.me.uk> Message-ID: <44BC9F3F.2040500@sendu.me.uk> Sendu Bala wrote: [snip proposed changes to Bio::DB::Taxonomy::* and Bio::Taxonomy::Node] > If anyone can see a problem with any of these changes, let me know asap. I've just realised that there are currently no tests for Bio::DB::Taxonomy::flatfile, and that the ones for entrez get skipped. Node doesn't get an especially thorough work-out either (in the skipped section). I'm guessing it's not feasible to include the full taxdump from NCBI (~40MB) in t/data... do people think it would be reasonable to create some sort of small subset of the data? I could just pull out the lines from names.dmp and nodes.dmp relevant to a few example organisms. Say, for human and a tricky bacteria and virus? For the purposes of running the test, where should the index files be kept? In t/data with the .dmp files or in /tmp? Should the test script delete them afterwards, or leave them be? 
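A minimal sketch of the "small subset" idea, assuming the dump files use the tab-pipe-tab field layout with tax_id in the first column and, for nodes.dmp, the parent tax_id in the second. The helper names (dmp_fields, lineage_ids, subset_lines) are made up for illustration, and the four-node data set is a toy: real lineages are much longer. The idea is to keep each example organism plus all of its ancestors, so the mini files still form a connected tree.

```perl
use strict;
use warnings;

# Split one dump line into fields. Assumes fields are separated by
# "\t|\t" and each line ends with "\t|" (an assumption about the layout).
sub dmp_fields {
    my ($line) = @_;
    $line =~ s/\t\|\s*$//;
    return split /\t\|\t/, $line, -1;
}

# Return a hash-ref set holding the wanted ids plus all their ancestors.
sub lineage_ids {
    my ($parent_of, @wanted) = @_;
    my %keep;
    for my $id (@wanted) {
        while (defined $id && !$keep{$id}++) {
            my $parent = $parent_of->{$id};
            last if !defined $parent || $parent == $id;  # root is its own parent
            $id = $parent;
        }
    }
    return \%keep;
}

# Keep only the lines whose first field (tax_id) is in the set.
sub subset_lines {
    my ($keep, @lines) = @_;
    return grep {
        my ($id) = dmp_fields($_);
        defined $id && $keep->{$id};
    } @lines;
}

# Toy data only: simplified stand-ins for nodes.dmp and names.dmp lines.
my @nodes = map { join("\t|\t", @$_) . "\t|\n" } (
    [1,    1,    'no rank'],
    [2,    1,    'superkingdom'],
    [9605, 1,    'genus'],
    [9606, 9605, 'species'],
);
my @names = map { join("\t|\t", @$_) . "\t|\n" } (
    [1,    'root',         '', 'scientific name'],
    [2,    'Bacteria',     '', 'scientific name'],
    [9605, 'Homo',         '', 'scientific name'],
    [9606, 'Homo sapiens', '', 'scientific name'],
    [9606, 'human',        '', 'genbank common name'],
);

my %parent_of  = map { my @f = dmp_fields($_); ($f[0] => $f[1]) } @nodes;
my $keep       = lineage_ids(\%parent_of, 9606);
my @mini_nodes = subset_lines($keep, @nodes);
my @mini_names = subset_lines($keep, @names);

printf "%d node lines, %d name lines kept\n",
    scalar @mini_nodes, scalar @mini_names;
```

With real files the two arrays would be read from names.dmp and nodes.dmp and the filtered lines written out under t/data; the lineage walk stays the same.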
The entrez tests are skipped to 'avoid blocking', but the test only makes 2 entrez queries with a sleep(3) in-between. Basically, I don't think there's ever any reason to skip. Shall I remove the skip? Lots of other database-accessing tests in the test suite just go right ahead and access their database, no problem. Cheers, Sendu. From torsten.seemann at infotech.monash.edu.au Mon Jul 17 23:53:02 2006 From: torsten.seemann at infotech.monash.edu.au (Torsten Seemann) Date: Tue, 18 Jul 2006 13:53:02 +1000 Subject: [Bioperl-l] advice In-Reply-To: <000a01c6a9bb$6478ba90$15327e82@pyrimidine> References: <000a01c6a9bb$6478ba90$15327e82@pyrimidine> Message-ID: <44BC5B1E.5080600@infotech.monash.edu.au> > Ha ! I *almost* added something about that. I thought his vowel keys were > broken for a bit, maybe from pounding the keyboard with extreme frustration! The wide variety of pronunciation of English around the world can be mostly blamed on those damned vowels... so perhaps removing them helps one to reach a wider audience :-) > As an aside, doesn't Damian Conway say something about the non-use of vowels > in 'Perl Best Practices?' I think it was in relation to variables, > though... Yeah, on page 46 he says NOT to remove vowels in variable names, use prefixes instead. Apprntly rmvng vwls mks t hrdr t ndrstnd stff. (Actually, I studied at Monash University under Damian Conway, and recall his ridiculing of Perl, so I found it kind of ironic that he ended up changing the Perl landscape so significantly! 
He even wrote an internal publication "theStyle - a guide to C programming style" in about 1990 in which he violates some of his later Perl Best Practices :-) -- Dr Torsten Seemann http://www.vicbioinformatics.com Victorian Bioinformatics Consortium, Monash University, Australia From sharma.animesh at gmail.com Tue Jul 18 03:58:41 2006 From: sharma.animesh at gmail.com (Animesh Sharma) Date: Tue, 18 Jul 2006 13:28:41 +0530 Subject: [Bioperl-l] PDB file parser (Separates chain-sequence and chain-structure) Message-ID: <156674e60607180058r653fa8fesbc654508c9c19b5b@mail.gmail.com> Hi Chris, I have written a small script to separate the Chain in a PDB file. It stores the sequence (fasta format) and structure (pdb format) in separate files with middle name according to the Chain it contains. If the PDB file has only one chain, it creates a file with default as middle name. Eg, perl pdb_chain_extract.pl 1HCO.pdb Will create 4 files with names: 1HCO.A.fas ( Sequence of Chain A in fasta format) 1HCO.A.pdb ( Structure of Chain A in pdb format) 1HCO.B.fas ( Sequence of Chain B in fasta format) 1HCO.B.pdb ( Sequence of Chain B in pdb format) .I wrote it in the spirit of your example script given @ http://cvs.bioperl.org/cgi-bin/viewcvs/viewcvs.cgi/bioperl-live/examples/structure/structure-io.pl?rev=1.2&content-type=text/vnd.viewcvs-markupCan this be included in the example scripts too? Thanks and regards, Animesh -- ______________________"The Answer Lies in Genome"______________________ http://fuzzylife.org/animesh/ +919868580004 -------------- next part -------------- A non-text attachment was scrubbed... 
Name: pdb_chain_extract.pl Type: application/octet-stream Size: 2593 bytes Desc: not available URL: From bix at sendu.me.uk Tue Jul 18 09:20:34 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Tue, 18 Jul 2006 14:20:34 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BCAE08.8070307@ebi.ac.uk> References: <44BCAE08.8070307@ebi.ac.uk> Message-ID: <44BCE022.5000502@sendu.me.uk> I thought I'd post this here in case anyone wants to discuss the points Nadeem brings up. As far as I can see it is acceptable to remove the <> bits so I still plan to do so. Nadeem Faruque wrote: [off-list, posted here with permission] > In case you didn't realise, odd node names such as 'Gnathostomata > ' are created to uniquify some tax nodes that have identical > scientific names, eg there are 8 entries for Rhodotorula. > > When we parse the ncbi tax dump we store this column as UNIQUE_NAME but > I don't think that we actually use it for anything at within EMBL > nucleotide sequence bank. [...] > Also, I note that there are 548 non-unique NAME_TXT of class 'scientific > name', so the UNIQUE_NAME column may be of use to someone (though given > the strength of using a taxid directly I don't see why you'd want to). Indeed. And given that we are building a taxonomy with nodes, it doesn't matter that two different nodes in the entire taxonomy tree share the same name - the position in the tree is implicitly something unique. So if you find yourself with a node called 'Rhodotorula' you can find out which one it is by looking at the closest ranked parent. That said, for 'Rhodotorula ' the closest ranked parent is 'Sporidiobolales' and not 'Sporidiobolaceae'. Is that a problem? Do we need to care about this word 'Sporidiobolaceae' that is effectively just a synonym of 'Sporidiobolales'? [Nadeem later replied "...I can't imagine the <> value to be of any use.".
He also clarified that if species have identical names and you store those, you can't work out what the corresponding taxid is. Without the <> bit you need some other information, like the classification. I think this other information will be present in input file formats and it must be up to the user to store the extra when outputting from bioperl] From osborne1 at optonline.net Tue Jul 18 10:50:48 2006 From: osborne1 at optonline.net (Brian Osborne) Date: Tue, 18 Jul 2006 10:50:48 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BC9F3F.2040500@sendu.me.uk> Message-ID: Sendu, The idea to create mini *dmp files is a good one, I think. With respect to temporary files I'm fairly sure that most tests that use them create them some where in t/data and then delete them after. Brian O. On 7/18/06 4:43 AM, "Sendu Bala" wrote: > (~40MB) in t/data... do people think it would be reasonable to create > some sort of small subset of the data? I could just pull out the lines > from names.dmp and nodes.dmp relevant to a few example organisms. Say, > for human and a tricky bacteria and virus? > For the purposes of running the test, where should the index files be > kept? In t/data with the .dmp files or in /tmp? Should the test script > delete them afterwards, or leave them be? From cjfields at uiuc.edu Tue Jul 18 11:44:07 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 18 Jul 2006 10:44:07 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BC8D75.1080806@sendu.me.uk> Message-ID: <003201c6aa81$01db9a30$15327e82@pyrimidine> > What about the existing genus(), species(), sub_species() and variant() > methods? There would be no need for any logic to join things together, > but I would still like to be able to get just 'sapiens' from somewhere. > Can I use species() for that purpose (though again, species is strictly > 'Homo sapiens')? Likewise sub_species() and variant() could hold the > remaining non-redundant names. 
Or should all of these be deprecated > because they don't really have a place in a generic Node class? This is where Hilmar suggests that you have a bit of freedom in doing what you want, as with binomial(). So species() should return species ('sapiens'), genus return genus, etc. At that level there will need to be some additional data munging since the ranks below species seem to include the entire name, not just the species. But this could be done from the lineage if all nodes are present and tagged as such. > What about node_name()? Yet another synonym of scientific_name? (right > now it grabs the common name(s)). Ugh. I agree things need cleaning up. You could always make node_name() an alias for scientific_name() though it could just be deprecated. > What should I do with the classification array? Should it hold the raw > ScientificName like: > join(',', $node->classification) eq 'Homo sapiens, Homo, > Homo/Pan/Gorilla group [...]'? > Or should it be like: > join(',', $node->classification) eq 'sapiens, Homo, Homo/Pan/Gorilla > group [...]'? Don't know what the dump file gives; the XML output using efetch via entrez has the raw lineage (as appears in a GenBank sequence file) and the actual full lineage with TaxID, rank, 'scientific name,' in the actual lineage order. I think one problem area will be the 'no rank' designations in the lineage. Note that the below example also has a species and no genus; tricky! 312284 marine actinobacterium PHSC20C1 marine actinobacterium strain PHSC20C1 marine actinobacterium str. PHSC20C1 78537 species Bacteria ... 
cellular organisms; Bacteria; Actinobacteria; Actinobacteria (class); unclassified Actinobacteria; unclassified Actinobacteria (miscellaneous) 131567 cellular organisms no rank 2 Bacteria superkingdom 201174 Actinobacteria phylum 1760 Actinobacteria (class) class 52018 unclassified Actinobacteria no rank 78537 unclassified Actinobacteria (miscellaneous) no rank > The latter is how it currently works (when it works correctly); I would > rather fix it than lose the logic completely, but if we're staying true > to proper classification (vs. what a programmer might expect), I guess I > must use the raw ScientificName? > > > binomial() isn't part of the NCBI taxonomy definition, so you have > > freedom there to report what suits you. > > I don't think binomial() would serve any useful purpose now, however. I > can either deprecate it or make it a synonym of scientific_name() or > both. Or binomial() can be a version of scientific_name() that complains > if you use it on a rank higher or lower than species. As for species() > et al., it may have no place in a generic Node class. Thoughts? The use of scientific_name() in this context would be more to conform with what NCBI defines it as rather than as the actual definition; this should be explicitly stated as such in POD and is more for long-term maintainability. No matter what is done here, you will have some degree of confusion: those who want strict adherence to the term 'scientific name' and those who want the method to conform to NCBI's definition. Better to document the reasoning for it in some way than risk the random masses complaining. We could use binomial() for the 'scientific name' as the rest of the world knows it (as in binomial nomenclature), having it built from genus-species like you had originally suggested. That's what Hilmar suggested as an 'experimental' area of sorts, since NCBI doesn't use that particular term in its taxonomy definition.
Chris From cjfields at uiuc.edu Tue Jul 18 11:48:36 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 18 Jul 2006 10:48:36 -0500 Subject: [Bioperl-l] advice In-Reply-To: <44BC5B1E.5080600@infotech.monash.edu.au> Message-ID: <003301c6aa81$a34fd8e0$15327e82@pyrimidine> Guess Dr. Conway became a Perl convert. The reviews of the book state that the 'best practices' really come from his experience as a Perl programmer over the last couple of decades, so maybe he learned something since 1990. Chris > > Ha ! I *almost* added something about that. I thought his vowel keys > were > > broken for a bit, maybe from pounding the keyboard with extreme > frustration! > > The wide variety of pronunciation of English around the world can be > mostly blamed on those damned vowels... so perhaps removing them helps > one to reach a wider audience :-) > > > As an aside, doesn't Damian Conway say something about the non-use of > vowels > > in 'Perl Best Practices?' I think it was in relation to variables, > > though... > > Yeah, on page 46 he says NOT to remove vowels in variable names, use > prefixes instead. Apprntly rmvng vwls mks t hrdr t ndrstnd stff. > > (Actually, I studied at Monash University under Damian Conway, and > recall his ridiculing of Perl, so I found it kind of ironic that he > ended up changing the Perl landscape so significantly! 
He even wrote an > internal publication "theStyle - a guide to C programming style" in > about 1990 in which he violates some of his later Perl Best Practices :-) > > -- > Dr Torsten Seemann http://www.vicbioinformatics.com > Victorian Bioinformatics Consortium, Monash University, Australia > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Tue Jul 18 12:05:48 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 18 Jul 2006 11:05:48 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BC9F3F.2040500@sendu.me.uk> Message-ID: <003401c6aa84$08ff6c80$15327e82@pyrimidine> > I've just realised that there are currently no tests for > Bio::DB::Taxonomy::flatfile, and that the ones for entrez get skipped. > Node doesn't get an especially thorough work-out either (in the skipped > section). > > I'm guessing it's not feasible to include the full taxdump from NCBI > (~40MB) in t/data... do people think it would be reasonable to create > some sort of small subset of the data? I could just pull out the lines > from names.dmp and nodes.dmp relevant to a few example organisms. Say, > for human and a tricky bacteria and virus? > For the purposes of running the test, where should the index files be > kept? In t/data with the .dmp files or in /tmp? Should the test script > delete them afterwards, or leave them be? I would place a small section in t/data or several individual examples in a subdirectory thereof (t/data/taxonomy). > The entrez tests are skipped to 'avoid blocking', but the test only > makes 2 entrez queries with a sleep(3) in-between. Basically, I don't > think there's ever any reason to skip. Shall I remove the skip? Lots of > other database-accessing tests in the test suite just go right ahead and > access their database, no problem. 
Depends on whether there is someone out there who doesn't have a network connection (and there always is). The DB.t tests skip based on testing for the env. variable BIOPERLDEBUG. 1..121 ok 1 # Skipping tests which require remote servers - set env variable BIOPERLDEBUG to test You could always do something along those lines or add a test for a network connection using an eval block and skip the tests if the network test fails, but there you run the risk of the tests failing not b/c of code problems but from remote server issues; I've seen this happen with SwissProt and GenBank testing before during peak hours. Chris > Cheers, > Sendu. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From bix at sendu.me.uk Tue Jul 18 13:03:54 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Tue, 18 Jul 2006 18:03:54 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <003201c6aa81$01db9a30$15327e82@pyrimidine> References: <003201c6aa81$01db9a30$15327e82@pyrimidine> Message-ID: <44BD147A.9020103@sendu.me.uk> Chris Fields wrote: >> What about the existing genus(), species(), sub_species() and variant() >> methods? There would be no need for any logic to join things together, >> but I would still like to be able to get just 'sapiens' from somewhere. >> Can I use species() for that purpose (though again, species is strictly >> 'Homo sapiens')? Likewise sub_species() and variant() could hold the >> remaining non-redundant names. Or should all of these be deprecated >> because they don't really have a place in a generic Node class? > > This is where Hilmar suggests that you have a bit of freedom in doing what > you want, as with binomial(). So species() should return species > ('sapiens'), genus return genus, etc. 
[regarding changes to Bio::Taxonomy::Node] Actually, I'm really strongly leaning toward getting rid of the following methods and new() options (and giving up entirely on being able to keep 'sapiens' somewhere): -organelle, organelle() -division, division() -sub_species, sub_species() -variant, variant() species(), validate_species_name() genus() binomial() As far as I can see none of these methods have any place in a generic Node class. If you want to know what your species is you have to be rank() 'species' and you just call scientific_name(). The above kind of methods belong in something like Bio::Species or similar, NOT in Node. Does anyone disagree? Can anyone offer a justification for keeping these methods? Changes I haven't yet discussed but have already made (but not committed): *parent_taxon_id = \&parent_id; *common_name = \&common_names; -factory and factory() removed, since there is no Bio::Taxonomy::FactoryI-implementing module, nothing in Node to make use of a factory once set, and a factory seems redundant when we're a node with a -dbh. validate_name() removed because it just returns 1. >> What about node_name()? Yet another synonym of scientific_name? (right >> now it grabs the common name(s)). Ugh. > > I agree things need cleaning up. You could always make node_name() an alias > for scientific_name() though it could just be deprecated. Actually, I've gone with node_name as the 'pure' and best method to set the name of your node with, and made scientific_name an alias of it (though it behaves as suggested earlier in the thread). >> What should I do with the classification array? Should it hold the raw >> ScientificName like: >> join(',', $node->classification) eq 'Homo sapiens, Homo, >> Homo/Pan/Gorilla group [...]'? (I've decided to do it the above way for consistency with scientific_name) >> Or should it be like: >> join(',', $node->classification) eq 'sapiens, Homo, Homo/Pan/Gorilla >> group [...]'? 
> > Don't know what the dump file gives; the XML output using efetch via entrez > has the raw lineage (as appears in a GenBank sequence file) and the actual > full lineage with TaxID, rank, 'scientific name,' in the actual lineage > order. I think one problem area will be the 'no rank' designations in the > lineage. Note that the below example also has a species and no genus; > tricky! Currently, flatfile and entrez ignore nodes with a rank of 'no rank' when they build the classification array. I had no intention of changing this behaviour. > 1760 > Actinobacteria (class) > class Ugh. I guess my proposal to remove <> bits via flatfile extends to removing () bits via entrez. We don't need unique names; we can use object_id() when uniqueness matters. >> I don't think binomial() would serve any useful purpose now, however. > > We could use binomial() for the 'scientific name' as the rest of the world > knows it (as in binomial nomenclature), having it built from genus-species > like you had originally suggested. No, see above. I don't think it makes the slightest bit of sense for a Node to go around trying to build things from a parent it may or may not have. Again, binomial() is a method for something like Bio::Species, not a generic Node class. From cjfields at uiuc.edu Tue Jul 18 15:34:29 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 18 Jul 2006 14:34:29 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BD147A.9020103@sendu.me.uk> Message-ID: <000901c6aaa1$328dd3d0$15327e82@pyrimidine> ... 
> [regarding changes to Bio::Taxonomy::Node] > > Actually, I'm really strongly leaning toward getting rid of the > following methods and new() options (and giving up entirely on being > able to keep 'sapiens' somewhere): > > -organelle, organelle() > -division, division() > -sub_species, sub_species() > -variant, variant() > species(), validate_species_name() > genus() > binomial() > > As far as I can see none of these methods have any place in a generic > Node class. If you want to know what your species is you have to be > rank() 'species' and you just call scientific_name(). The above kind of > methods belong in something like Bio::Species or similar, NOT in Node. > Does anyone disagree? Can anyone offer a justification for keeping these > methods? Bio::Species and Bio::Taxonomy::Node are closely linked and plans are to have Bio::Species delegate methods to Bio::Taxonomy::Node. So any changes to Node will affect Bio::Species to some degree. If you can get the lineage from XML, you could set many of these based on the rank given. Jason uses XML::Twig in Bio::DB::Taxonomy::entrez to parse out the XML data into Bio::Taxonomy::Node objects; it shouldn't be difficult to leave some methods based on rank (genus, species, etc) as simple get/set methods for the time being and leave the heavy lifting to the modules dealing directly with the data. Bio::Species could then delegate data/methods over to Bio::Taxonomy::Node fairly easily. If there is no genus/species data to be grabbed (either it doesn't exist or isn't present for some reason), then simply leave it as undef. That's also why I thought binomial() could stick around; if you have both the genus() and species() you could grab both using binomial(), building in special cases or error handling in case genus() or species() or both return undef. I don't see the problem in keeping this as long as users know what it means: by detailing the method in POD. If someone complains we tell them to RTFM. 
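To make the above concrete, here is a rough sketch, assuming plain get/set accessors that the data-parsing modules fill in. Sketch::Node is a hypothetical stand-in written for this example; it is not the real Bio::Taxonomy::Node and says nothing about what that class actually does. scientific_name() holds the name as given by the data source, and binomial() is built only when both genus() and species() are set, returning undef otherwise.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a taxonomy node; NOT the real Bio::Taxonomy::Node.
package Sketch::Node;

sub new {
    my ($class, %args) = @_;
    return bless { %args }, $class;
}

# Simple get/set accessors; the parser of the data source fills them in.
for my $field (qw(scientific_name rank genus species)) {
    no strict 'refs';
    *{"Sketch::Node::$field"} = sub {
        my $self = shift;
        $self->{$field} = shift if @_;
        return $self->{$field};
    };
}

# Build 'Genus species' only when both parts are known; otherwise undef.
sub binomial {
    my $self = shift;
    my ($genus, $species) = ($self->genus, $self->species);
    return unless defined $genus && defined $species;
    return "$genus $species";
}

package main;

my $human = Sketch::Node->new(
    scientific_name => 'Homo sapiens',
    rank            => 'species',
    genus           => 'Homo',
    species         => 'sapiens',
);

# A node with rank 'species' but no genus, like the example in this thread.
my $marine = Sketch::Node->new(
    scientific_name => 'marine actinobacterium PHSC20C1',
    rank            => 'species',
);

for my $node ($human, $marine) {
    my $binomial = $node->binomial;
    printf "%s => %s\n", $node->scientific_name,
        defined $binomial ? $binomial : '(no binomial)';
}
```

Callers then have to check for undef before using binomial(), which is the kind of behaviour that would need spelling out in POD.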
> Changes I haven't yet discussed but have already made (but not committed): > > *parent_taxon_id = \&parent_id; > *common_name = \&common_names; > -factory and factory() removed, since there is no > Bio::Taxonomy::FactoryI-implementing module, nothing in Node to make use > of a factory once set, and a factory seems redundant when we're a node > with a -dbh. > validate_name() removed because it just returns 1. > ... > Actually, I've gone with node_name as the 'pure' and best method to set > the name of your node with, and made scientific_name an alias of it > (though it behaves as suggested earlier in the thread). I don't have any problem with that. As long as it conforms somewhat to the NCBI definition to prevent confusion I think it's okay. > >> What should I do with the classification array? Should it hold the raw > >> ScientificName like: > >> join(',', $node->classification) eq 'Homo sapiens, Homo, > >> Homo/Pan/Gorilla group [...]'? > > (I've decided to do it the above way for consistency with scientific_name) I think that's fine. ... > Currently, flatfile and entrez ignore nodes with a rank of 'no rank' > when they build the classification array. I had no intention of changing > this behaviour. If you ignore nodes with 'no rank' there will be major problems when retrieving certain TaxID's from protein/nucleotide sequences. I had posted some sample XML from many NCBI TaxIDs taken from sequence files and via ELink and a good many of those nodes (most of them from genome projects) have 'no rank'. 376686 Flavobacterium johnsoniae UW101 ... 986 no rank ... 373903 Halothermothrix orenii H 168 ... 31909 no rank These aren't 'edge cases' anymore but now are pretty common from genome sequencing. I would just assign 'no rank' to rank() and have the node retained for DB purposes. It seems that the tax dump loses quite a bit of information somewhere along the way that shows up in the XML. Or am I wrong? > > 1760 > > Actinobacteria (class) > > class > > Ugh. 
I guess my proposal to remove <> bits via flatfile extends to > removing () bits via entrez. We don't need unique names; we can use > object_id() when uniqueness matters. The XML parsing in Taxonomy::entrez will take care of the and retains the character data in between. It would be a matter of setting the parser correctly to grab the relevant data and assign it properly. > >> I don't think binomial() would serve any useful purpose now, however. > > > > We could use binomial() for the 'scientific name' as the rest of the > world > > knows it (as in binomial nomenclature), having it built from genus- > species > > like you had originally suggested. > > No, see above. I don't think it makes the slightest bit of sense for a > Node to go around trying to build things from a parent it may or may not > have. Again, binomial() is a method for something like Bio::Species, not > a generic Node class. Bio::Species, from what I gather, was initially created to hold the tax data from GenBank/EMBL/SwissProt (RichSeq) files and is not DB-aware. Bio::Taxonomy::Node was supposed to be like Bio::Species and also be DB-aware: http://thread.gmane.org/gmane.comp.lang.perl.bio.general/4284/focus=4321 Again, Bio::Species methods are supposed to (eventually) delegate to Bio::Taxonomy::Node, so the two are closely linked along with their methods. Any way we go about it here (keeping certain methods and tossing others, changing the data returned, etc), it looks like there will be API issues down the road which will directly affect anyone using tax data. That affects bioperl-db directly as well as any other bioperl-based DB's which rely on tax data. So we need to tread a bit carefully when making major changes to make sure that they work for bioperl-db and anywhere else that may require it. 
Chris From cjfields at uiuc.edu Tue Jul 18 15:41:31 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 18 Jul 2006 14:41:31 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BD147A.9020103@sendu.me.uk> Message-ID: <000a01c6aaa2$2b4f50c0$15327e82@pyrimidine> Sendu et al, I'll play around with adding a quick method to Bio::Species for scientific_name(); if I can get it to play nice with Bio::SeqIO::genbank and it passes tests I'll commit it. Chris From golharam at umdnj.edu Tue Jul 18 15:36:54 2006 From: golharam at umdnj.edu (Ryan Golhar) Date: Tue, 18 Jul 2006 15:36:54 -0400 Subject: [Bioperl-l] advice In-Reply-To: <003301c6aa81$a34fd8e0$15327e82@pyrimidine> Message-ID: <00a501c6aaa1$86edb620$2f01a8c0@GOLHARMOBILE1> Right. There was a chain letter going around the internet for awhile about how you can leave out certain letters and the human brain will still be able to correctly interpret what the word is supposed to be. Either that or it was something about how Europe was adopting a new variation of English and after many successions it started to sound/look like German. > The wide variety of pronunciation of English around the world can be > mostly blamed on those damned vowels... so perhaps removing them helps > one to reach a wider audience :-) > > > As an aside, doesn't Damian Conway say something about the non-use > > of > vowels > > in 'Perl Best Practices?' I think it was in relation to variables, > > though... > > Yeah, on page 46 he says NOT to remove vowels in variable names, use > prefixes instead. Apprntly rmvng vwls mks t hrdr t ndrstnd stff. 
From cjfields at uiuc.edu Tue Jul 18 17:44:29 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Tue, 18 Jul 2006 16:44:29 -0500
Subject: [Bioperl-l] Bio::SeqIO::genbank and Bio::Species
Message-ID: <000001c6aab3$58ee7bd0$15327e82@pyrimidine>

For a given GenBank file, you'll have the following (this is from NCBI's
current flatfile format,
http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html):

LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p
            (AXL2) and Rev7p (REV7) genes, complete cds.
ACCESSION   U49845
VERSION     U49845.1  GI:1293613
KEYWORDS    .
SOURCE      Saccharomyces cerevisiae (baker's yeast)
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes;
            Saccharomycetales; Saccharomycetaceae; Saccharomyces.
...

The SOURCE line above, according to NCBI, contains an abbreviated name and a
common name (optional); it can also apparently contain additional
information, such as organelles and so on. The ORGANISM line contains NCBI's
definition of the formal scientific name (see the related thread on Taxonomy
proposed changes) along with lineage information.

Currently, Bio::SeqIO::genbank and Bio::Species are very inconsistent with
bacterial names, so when I process everything through SeqIO I get:

SOURCE      Mycobacterium tuberculosis H37Rv H37Rv
  ORGANISM  Mycobacterium tuberculosis
SOURCE      Mycobacterium tuberculosis CDC1551 CDC1551
  ORGANISM  Mycobacterium tuberculosis
SOURCE      Mycobacterium avium subsp. paratuberculosis K-10 paratuberculosis K-10
  ORGANISM  Mycobacterium avium subsp.
SOURCE      Bacillus sp. NRRL B-14911 NRRL B-14911
  ORGANISM  Bacillus sp.

I have added a scientific_name() method to Bio::Species to contain the string
on the ORGANISM line and replace it as is, which seems to work well (doesn't
chop the name down).

The bigger issue is the mess with the SOURCE line. This stems from adding
back information from sub_species(), which I don't think needs to be done as
it's supposed to be an abbreviated name.

Anybody mind if I try splitting up the original SOURCE line data into
organelle(), abbreviated_name(), and common_name()? This will change
common_name a bit (so, instead of 'Saccharomyces cerevisiae' it will give
'baker's yeast') but will also conform more to the NCBI definition of 'common
name.' Also, organelle info isn't handled yet; I could toy with adding
support for it.

Any objections? I may proceed to do the same with EMBL, SwissProt, and others
that use Bio::Species if this works out.

Christopher Fields
Postdoctoral Researcher - Switzer Lab
Dept. of Biochemistry
University of Illinois Urbana-Champaign

From bix at sendu.me.uk Tue Jul 18 18:50:37 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Tue, 18 Jul 2006 23:50:37 +0100
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <000901c6aaa1$328dd3d0$15327e82@pyrimidine>
References: <000901c6aaa1$328dd3d0$15327e82@pyrimidine>
Message-ID: <44BD65BD.4030501@sendu.me.uk>

Chris Fields wrote:
> ...
>> [regarding changes to Bio::Taxonomy::Node]
>>
>> Actually, I'm really strongly leaning toward getting rid of the
>> following methods and new() options (and giving up entirely on being
>> able to keep 'sapiens' somewhere):
>>
>> -organelle, organelle()
>> -division, division()
>> -sub_species, sub_species()
>> -variant, variant()
>> species(), validate_species_name()
>> genus()
>> binomial()
>
> Bio::Species and Bio::Taxonomy::Node are closely linked and plans are to
> have Bio::Species delegate methods to Bio::Taxonomy::Node. So any changes
> to Node will affect Bio::Species to some degree.

I see from the original postings that Node was intended to be like Species,
but I don't think it makes the slightest bit of sense. A /single/ Node need
only (must only!) represent the information for a single node in the
taxonomy. Or else what do these objects mean?
What is the object model? It's bad bad bad for it to be sensible one way
(when you're making your own taxonomy by making your own nodes) and
nonsensical another (when we stuff in methods so that Bio::Species is
happy). The way Node is written right now, and what you're suggesting, is
that we stuff the entire Taxonomy into the Node. Well, except that you don't
even have methods for every taxonomic level - there is genus() but no
subphylum(). I can't emphasise strongly enough how insane all this is.

The correct thing for Bio::Species to interact with is Bio::Taxonomy.
Bio::Taxonomy is a collection of Nodes and has the sort of methods that
Bio::Species would need to delegate its current functionality.

I'm quite willing to do a proper overhaul here so everything makes sense.
You either make your own nodes and add these to a Taxonomy or use a factory
(which would use a Bio::DB::Taxonomy presumably). A Taxonomy lets you
discover the classification of any node it contains. Bio::Species could
implement a method like genus() by:
  $node = $taxonomy->get_node('genus') || return;
  return $node->scientific_name;

Bio::Taxonomy isn't perfect, but I can certainly get it to do its job. I'd
probably make it rank-name and order independent for starters.

Bio::Taxonomy::Node needs to be reduced right down to just hold data about
the node it represents, and possibly its parent node id (or other way of
getting to its parent). So now I'm proposing dropping the classification()
method from Node as well. It's simply not necessary; Bio::Taxonomy should
give you that information.

Bio::Taxonomy::FactoryI doesn't make much sense to me at the moment from its
docs, but it could be used to build a Taxonomy (that seems to be its intent,
I'm just not sure what some of the methods are really supposed to do) such
that Node might not even need any methods for getting its parent or child
nodes. The Factory or Taxonomy might be able to deal with that.
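The delegation idea above (a species object asking a taxonomy for the node at a given rank) could be sketched roughly like this. All three Sketch::* packages are hypothetical stand-ins invented for the illustration; only the method names get_node() and scientific_name() are taken from the message, and the one-node-per-rank storage is an assumption that would not cope with several 'no rank' nodes.

```perl
use strict;
use warnings;

# Hypothetical stand-ins; these are NOT the real Bio::Taxonomy classes.
package Sketch::Node;

sub new {
    my ($class, %args) = @_;
    my $self = { rank => $args{-rank}, name => $args{-name} };
    return bless $self, $class;
}
sub rank            { return $_[0]{rank} }
sub scientific_name { return $_[0]{name} }

package Sketch::Taxonomy;

sub new {
    my $class = shift;
    my $self  = { by_rank => {} };
    return bless $self, $class;
}

# Assumption: at most one node per rank is stored.
sub add_node {
    my ($self, $node) = @_;
    $self->{by_rank}{ $node->rank } = $node;
    return $self;
}

sub get_node {
    my ($self, $rank) = @_;
    return $self->{by_rank}{$rank};
}

package Sketch::Species;

sub new {
    my ($class, $taxonomy) = @_;
    my $self = { taxonomy => $taxonomy };
    return bless $self, $class;
}

# Look up the node for a rank and return its name, or undef if absent.
sub _name_at_rank {
    my ($self, $rank) = @_;
    my $node = $self->{taxonomy}->get_node($rank) or return;
    return $node->scientific_name;
}

sub genus    { return $_[0]->_name_at_rank('genus') }
sub binomial { return $_[0]->_name_at_rank('species') }

package main;

my $taxonomy = Sketch::Taxonomy->new;
$taxonomy->add_node(Sketch::Node->new(-rank => 'genus',   -name => 'Homo'));
$taxonomy->add_node(Sketch::Node->new(-rank => 'species', -name => 'Homo sapiens'));

my $species  = Sketch::Species->new($taxonomy);
my $genus    = $species->genus;
my $binomial = $species->binomial;

print "genus:    $genus\n";
print "binomial: $binomial\n";
```

In this arrangement the node only knows about itself, and anything that needs the wider lineage goes through the taxonomy object.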
In short, I'm proposing a major change to Bio::Taxonomy::Node (make it just a node), and minor changes to (& implementation of) Bio::Taxonomy and Bio::Taxonomy::FactoryI such that they actually get used to do their jobs. > That's also why I thought binomial() could stick around; if you have both > the genus() and species() you could grab both using binomial(), building in > special cases or error handling in case genus() or species() or both return > undef. binomial() would belong in (and is present in) Bio::Taxonomy. But in any case, it's not needed there either; if you want the binomial you just ask for the scientific_name of the species node in your Taxonomy, since this now contains the actual scientific name == binomial. binomial() in Bio::Taxonomy could be reimplemented as: $node = $self->get_node('species') || return; return $node->scientific_name; >> Currently, flatfile and entrez ignore nodes with a rank of 'no rank' >> when they build the classification array. I had no intention of changing >> this behaviour. > > If you ignore nodes with 'no rank' there will be major problems when > retrieving certain TaxID's from protein/nucleotide sequences. This is only for the classification array, which is meaningless anyway (there only for file-format compatibility). If you want the real information you ask your Bio::Taxonomy (which asks each of its nodes). This is the whole point of having Bio::Taxonomy in the first place. It gives you great flexibility to do whatever you want to do. >>> 1760 >>> Actinobacteria (class) >>> class >> Ugh. I guess my proposal to remove <> bits via flatfile extends to >> removing () bits via entrez. We don't need unique names; we can use >> object_id() when uniqueness matters. > > The XML parsing in Taxonomy::entrez will take care of the and retains > the character data in between. You misunderstood. I meant the <> bits I discussed at the very start of this thread, that flatfile gives you. 
Here I'm referring to getting rid of ' (class)' as well.

> Any way we go about it here (keeping certain methods and tossing others,
> changing the data returned, etc), it looks like there will be API issues
> down the road which will directly affect anyone using tax data. That
> affects bioperl-db directly as well as any other bioperl-based DB's which
> rely on tax data. So we need to tread a bit carefully when making major
> changes to make sure that they work for bioperl-db and anywhere else that
> may require it.

Does anything make serious use of the current Bio::Taxonomy code? Or are
they using Bio::Species?

From cjfields at uiuc.edu Wed Jul 19 00:38:05 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Tue, 18 Jul 2006 23:38:05 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44BD65BD.4030501@sendu.me.uk>
References: <000901c6aaa1$328dd3d0$15327e82@pyrimidine> <44BD65BD.4030501@sendu.me.uk>
Message-ID:

I think we should wait a bit for any dramatic changes but implement the ones
there seems to be a consensus on. I understand your reasoning for taking
this on but I'm not sure completely revamping Bio::Taxonomy w/o input from
the core developers is wise, especially since we do NOT know who uses it,
why they use it, and how changing/removing methods will affect their code.

We are doing nothing productive here by constantly butting heads on this and
having different opinions on what we think Bio::Taxonomy/Bio::Species is
best suited for, when neither one of us is actually sure about who uses it
and why. A reasonable solution is there but we must rely on outside opinions
in order to reach it, so I propose a short moratorium on changes to
Bio::Taxonomy/Bio::Species that radically redefine the API on either class.

BTW, for anybody following, I'm perfectly comfortable if Sendu takes the
lead on this and implements his changes; I'm just not sure about stripping
the class down to the bare minimum.
So far, the only thing that has been proposed (and accepted by all) is that scientific_name() hold the data for that tag in a node. I think most here would agree that's fine; I've already added a get/set to Bio::Species but haven't committed it yet. However, what you propose doing below is refactoring the code and changing the API. I agree there needs to be an overhaul but we can't do this w/o guidance or input from the GBE (Great Bioperl Elders). I would like some of the 'senior' core developers chime in a bit more on their thoughts on this. Jason also mentioned somewhere that any changes for Taxonomy/ Species should be tracked on the wiki somewhere as well to make sure everything is kosher and keep users up-to-date. I would like his input here but I think he's still incommunicado at the moment. Chris
Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From ong at embl.de Wed Jul 19 03:51:48 2006 From: ong at embl.de (ong at embl.de) Date: Wed, 19 Jul 2006 09:51:48 +0200 Subject: [Bioperl-l] Fwd: Re: BioPerl query Message-ID: <20060719095148.f71b1v3p7qosk440@webmail.embl.de> HI, Anybody have an answer to the below query? Thanks. Regards, Ong ----- Forwarded message from birney at ebi.ac.uk ----- Date: Wed, 19 Jul 2006 08:16:06 +0100 From: Ewan Birney Reply-To: Ewan Birney Subject: Re: BioPerl query To: ong at embl.de On 18 Jul 2006, at 10:26, ong at embl.de wrote: > Dear Birney, > > Good day i wish to get your advise on how do i print out the PSM > matrix from > the code below. Thanks > I would ask this message on the bioperl list, not to me directly.
> Regards,
> Ong
>
> use Bio::Matrix::PSM::IO;
>
> my $psmIO=new Bio::Matrix::PSM::IO(-file=>'matrix.dat',-format=>'transfac');
> while (my $psm=$psmIO->next_psm) {
>     my $id=$psm->id;
>     my $an=$psm->accession_number;
>     my $re = $psm->regexp;
>     #my $l=$psm->width;
>     my $cons=$psm->IUPAC;
>     print"$id\t$an\t$re\t$l\t$cons\t$psm\n";
> }

----- End forwarded message -----

From rmb32 at cornell.edu Tue Jul 18 20:06:02 2006
From: rmb32 at cornell.edu (Robert Buels)
Date: Tue, 18 Jul 2006 17:06:02 -0700
Subject: [Bioperl-l] bioperl pulls, xml parsers push, and things get complicated
Message-ID: <44BD776A.1080402@cornell.edu>

Hi all,

Here's a kind of abstract question about Bioperl and XML parsing:

I'm thinking about writing a bioperl parser for genomethreader XML, and I'm
sort of mulling over the 'impedance mismatch' between the way bioperl
Bio::*IO::* modules work and the way all of the current XML parsers work.
Bioperl uses a 'pull' model, where every time you want a new chunk of stuff,
you call $io_object->next_thing. All the XML parsers (including XML::SAX,
XML::Parser::PerlSAX and XML::Twig) use a 'push' model, where every time
they parse a chunk, they call _your_ code, usually via a subroutine
reference you've given to the XML parser when you start it up.

From what I can tell, current Bioperl IO modules that parse XML are using
push parsers to parse the whole document, holding stuff in memory, then
spoon-feeding it in chunks to the calling program when it calls next_*().
This is fine until the input XML gets really big, in which case you can
quickly run out of memory.

Does anybody have good ideas for nice, robust ways of writing a bioperl IO
module for really big input XML files? There don't seem to be any perl pull
parsers for XML. All I've dug up so far would be having the XML push parser
running in a different thread or process, pushing chunks of data into a pipe
or similar structure that blocks the progress of the push parser until the
pulling bioperl code wants the next piece of data, but there are plenty of
ugly issues with that, whether one were to use perl threads for it (aaagh!)
or fork and push some kind of intermediate format through a pipe or socket
between the two processes (eek!).

So, um, if you've read this far, do you have any ideas?

Rob

--
Robert Buels
SGN Bioinformatics Analyst
252A Emerson Hall, Cornell University
Ithaca, NY 14853
Tel: 503-889-8539
rmb32 at cornell.edu
http://www.sgn.cornell.edu

From alc at sanger.ac.uk Wed Jul 19 06:55:12 2006
From: alc at sanger.ac.uk (Avril Coghlan)
Date: Wed, 19 Jul 2006 11:55:12 +0100
Subject: [Bioperl-l] parsing est2genome output
Message-ID: <1153306513.27383.12.camel@deskpro104.dynamic.sanger.ac.uk>

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:

From bernd.web at gmail.com Wed Jul 19 07:36:08 2006
From: bernd.web at gmail.com (Bernd Web)
Date: Wed, 19 Jul 2006 13:36:08 +0200
Subject: [Bioperl-l] SearchIO HOWTO
Message-ID: <716af09c0607190436n5fdd5576m23887051aaf95f8e@mail.gmail.com>

Hi,

On http://www.bioperl.org/wiki/HOWTO:SearchIO there is a great HOWTO on
parsing your BLAST report. In the Table of methods, the third line from the
bottom is:

"HSP alignment Not available in this report Bio::SimpleAlign object "

Would it not be good to add the get_aln method ($hsp->get_aln)?

The line in "Using the methods"

  my $alignment_as_string = $alnIO->write_aln($aln);

may be confusing: $alignment_as_string will be "1" on success and the
alignment is printed to STDOUT. Should IO::String be introduced here to set
up a string filehandle?
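The point about write_aln() returning a status rather than the alignment text can be illustrated without BioPerl at all. The sketch below uses a core-Perl in-memory filehandle, and write_record() is a hypothetical stand-in for a writer method that prints to a handle and returns 1; it is not a BioPerl call. IO::String plays the same role as the in-memory handle here, and with Bio::AlignIO the handle would be supplied through the -fh argument.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a writer such as write_aln():
# it prints to the given handle and returns 1 on success.
sub write_record {
    my ($fh, $text) = @_;
    print {$fh} $text, "\n";
    return 1;
}

# Core-Perl in-memory filehandle: everything printed to $fh lands in $buffer.
my $buffer = '';
open my $fh, '>', \$buffer or die "Cannot open in-memory filehandle: $!";

my $status = write_record($fh, 'seq1/1-4  ACGT');
close $fh;

# $status is only the success flag; the text itself is in $buffer.
print "return value: $status\n";
print "captured text: $buffer";
```

The return value and the written text are two separate things, which is why assigning the result of a write call to a variable named like a string is misleading.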
Best regards, Bernd From hlapp at gmx.net Wed Jul 19 09:40:47 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Wed, 19 Jul 2006 09:40:47 -0400 Subject: [Bioperl-l] bioperl pulls, xml parsers push, and things get complicated In-Reply-To: <44BD776A.1080402@cornell.edu> References: <44BD776A.1080402@cornell.edu> Message-ID: <73755CCF-2966-4580-BBEF-1F8A94CDC55D@gmx.net> In the past the way this was done for potentially big XML files is to use regex-based extraction of chunks that correspond to a object you want to return per call to next_XXX(). That chunk would then be passed on to the XML parser under the hood. This only gets problematic once even the chunks are huge, or the name of the element that encloses your chunk can be ambiguous with what's in your text. The latter is unlikely though if you include the angle brackets. I believe this is how at least some bioperl parsers for XML-based formats were written, and it seemed to work fine. -hilmar
-- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From jay at jays.net Wed Jul 19 09:43:52 2006 From: jay at jays.net (Jay Hannah) Date: Wed, 19 Jul 2006 08:43:52 -0500 (CDT) Subject: [Bioperl-l] Walking multiple bioentries using bioperl-db Message-ID: Howdy -- I'm using bioperl-db + biosql-schema + mySQL. I can now successfully build a biosql-schema instance in mySQL, load taxonomy, then using bioperl-db load a GenBank file from disk, commiting the sequences I want. For a given accession number + version + namespace, I can tell bioperl-db to delete that from mySQL and it does. Yay!! I'll be throwing a "Using bioperl-db" document onto the wiki over the next week.
What I am currently baffled by: How do I ask bioperl-db to walk over
multiple bioentries in my database so I can do things with them? The
simplest possible example: print a list of all bioentries in my database.

It is trivially easy to just query mySQL directly, but if I'm reading /
understanding the documentation correctly bioperl-db intends to be database
schema and RDBMS agnostic. In that case, I should use bioperl-db to walk my
records. So, how do I do that?

Is Bio::DB::Query::BioQuery the way to do this? The only way? If so then can
someone help me understand the datacollections() and where() methods?

perldoc Bio::DB::Query::BioQuery

    # all mouse sequences loaded under namespace ensembl that
    # have receptor in their description
    $query->datacollections(["Bio::PrimarySeqI e",
                             "Bio::Species=>Bio::PrimarySeqI sp",
                             "BioNamespace=>Bio::PrimarySeqI db"]);
    $query->where(["sp.binomial like 'Mus *'",
                   "e.desc like '*receptor*'",
                   "db.namespace = 'ensembl'"]);

    # all mouse sequences loaded under namespace ensembl that
    # have receptor in their description, and that also have a
    # cross-reference with SWISS as the database
    $query->datacollections(["Bio::PrimarySeqI e",
                             "Bio::Species=>Bio::PrimarySeqI sp",
                             "BioNamespace=>Bio::PrimarySeqI db",
                             "Bio::Annotation::DBLink xref",

I'm bewildered by this API. Please forgive my ignorance.

1) How do I get *all* bioentries out of my database?

2) Say I did want just the "namespace" 'Pico' (one of my
biodatabase.name's). Where did

    "BioNamespace=>Bio::PrimarySeqI db"]);

come from? How was I supposed to figure out the left hand side of that
mapping? The right hand side? If that line wasn't sitting in that document
was there a way for me to figure it out as a *user* of bioperl-db? Or would
I need to be a *programmer* of bioperl-db reading source to figure this out?

Where did

    "db.namespace = 'ensembl'"]);

come from? Again, do I have to read source code to know how to invoke that
magic?

Sorry if I sound like a jerk. That is not my intention.
Hopefully I can document the answers for future bioperl-db'ers.

Thanks in advance,

j
my current plaything: http://openlab.jays.net

From cjfields at uiuc.edu Wed Jul 19 10:34:48 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Wed, 19 Jul 2006 09:34:48 -0500
Subject: [Bioperl-l] bioperl pulls, xml parsers push, and things get complicated
In-Reply-To: <44BD776A.1080402@cornell.edu>
Message-ID: <002801c6ab40$7cfcd980$15327e82@pyrimidine>

The Bio::SearchIO modules are supposed to work like a SAX parser, where
results are returned as the report is parsed b/c of the occurrence of
specific 'events' (start_element, end_element, and so on). However, the
actual behaviour for each module changes depending on the report type and
the author's intention.

There was a thread about a month ago on HMMPFAM report parsing where there
was some contention as to how to build hits(models)/HSPs(domains). HMMPFAM
output has one HSP per hit and is sorted on the sequence length so a
particular hit can appear more than once, depending on how many times it
hits along the sequence length itself. So, to gather all the HSPs together
under one hit you would have to parse the entire report and build up a
Hit/HSP tree, then use the next_hit/next_hsp iterators to parse through
everything. Currently it just reports Hit/HSP pairs and it is up to the
user to build that tree.

In contrast, BLAST output should be capable of throwing hit/HSP clusters on
the fly based on the report output, but is quite slow (even the XML output
crawls). Jason thinks it's b/c of object inheritance and instantiation; I
think it's probably more complicated than that (there are a ton of method
calls which tend to slow things down quite a bit as well).
I would say try using SearchIO, but instead of relying directly on object handler calls to create Hit/HSP objects using an object factory (which is where I think a majority of the speed is lost), build the data internally on the fly using start_element/end_element, then return hashes instead based on the element type triggered using end_element. As an aside, I'm trying to switch the SearchIO::blastxml over to XML::SAX (using XML::SAX::ExpatXS/expat) and plan on switching it over to using hashes at some point, possibly starting off with a different SearchIO plugin module. If you have other suggestions (XML parser of choice, ways to speed up parsing/retrieve data) we would be glad to hear them. Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Robert Buels > Sent: Tuesday, July 18, 2006 7:06 PM > To: bioperl-l at bioperl.org > Subject: [Bioperl-l] bioperl pulls, xml parsers push,and things get > complicated > > Hi all, > > Here's a kind of abstract question about Bioperl and XML parsing: > > I'm thinking about writing a bioperl parser for genomethreader XML, and > I'm sort of mulling over the 'impedence mismatch' between the way > bioperl Bio::*IO::* modules work and the way all of the current XML > parsers work. Bioperl uses a 'pull' model, where every time you want a > new chunk of stuff, you call $io_object->next_thing. All the XML > parsers (including XML::SAX, XML::Parser::PerlSAX and XML::Twig) use a > 'push' model, where every time they parse a chunk, they call _your_ > code, usually via a subroutine reference you've given to the XML parser > when you start it up. > > From what I can tell, current Bioperl IO modules that parse XML are > using push parsers to parse the whole document, holding stuff in memory, > then spoon-feeding it in chunks to the calling program when it calls > next_*(). 
This is fine until the input XML gets really big, in which > case you can quickly run out of memory. > > Does anybody have good ideas for nice, robust ways of writing a bioperl > IO module for really big input XML files? There don't seem to be any > perl pull parsers for XML. All I've dug up so far would be having the > XML push parser running in a different thread or process, pushing chunks > of data into a pipe or similar structure that blocks the progress of the > push parser until the pulling bioperl code wants the next piece of data, > but there are plenty of ugly issues with that, whether one were too use > perl threads for it (aaagh!) or fork and push some kind of intermediate > format through a pipe or socket between the two processes (eek!). > > So, um, if you've read this far, do you have any ideas? > > Rob > > -- > Robert Buels > SGN Bioinformatics Analyst > 252A Emerson Hall, Cornell University > Ithaca, NY 14853 > Tel: 503-889-8539 > rmb32 at cornell.edu > http://www.sgn.cornell.edu > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Wed Jul 19 10:44:30 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Wed, 19 Jul 2006 09:44:30 -0500 Subject: [Bioperl-l] SearchIO HOWTO In-Reply-To: <716af09c0607190436n5fdd5576m23887051aaf95f8e@mail.gmail.com> Message-ID: <002901c6ab41$d7f61350$15327e82@pyrimidine> The information in that table is referring to the BLAST report example before the table itself. However, I can tell you that using that report works (sorry if the text wrapping here mangles the output), so the table information is erroneous. I'll do some updating on that. 
Chris

Here's the script:

    use Bio::SearchIO;
    use Bio::AlignIO;

    my $parser  = Bio::SearchIO->new(-file => shift @ARGV, -format => 'blast');
    my $aln_out = Bio::AlignIO->new(-fh => \*STDOUT, -format => 'clustalw');

    while (my $result = $parser->next_result) {
        while (my $hit = $result->next_hit) {
            while (my $hsp = $hit->next_hsp) {
                $aln_out->write_aln($hsp->get_aln);
            }
        }
    }

Output (via STDOUT):

------------------------------------
CLUSTAL W(1.81) multiple sequence alignment

gi|20521485|dbj|AP004641.2/2896-3051  DMGRCSSGCNRYPEPMTPDTMIKLYREKEGLGAYIWMPTPDMSTEGRVQMLP
gb|443893|124775/197-246              DIVQNSSGCNRYPEPMTPDTMIKLYRE-EGL-AYIWMPTPDMSTEGRVQMLP
                                      *: : ********************** *** ********************
------------------------------------

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of Bernd Web
> Sent: Wednesday, July 19, 2006 6:36 AM
> To: bioperl-l at bioperl.org
> Subject: [Bioperl-l] SearchIO HOWTO
>
> Hi,
>
> On http://www.bioperl.org/wiki/HOWTO:SearchIO there is a great HOWTO
> parse your BLAST report.
> In the Table of methods, the third line from the bottom is:
> "HSP alignment Not available in this report Bio::SimpleAlign object "
>
> Would it not be good to add the get_aln method ( $hsp->get_aln) ?
>
> The line in "Using the methods"
> my $alignment_as_string = $alnIO->write_aln($aln);
>
> may be confusing: $alignment_as_string will be "1" on success and the
> alignment is printed to STDIO. Should IO::String be introduced here
> too set up a string filehandle?
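
On the string-filehandle question just quoted, here is a minimal sketch, assuming Perl 5.8 or later, of how an in-memory filehandle captures text from a writer that returns only a success flag. It uses core Perl only: write_report is a hypothetical stand-in for a call like write_aln, and the two sample rows are made up. With BioPerl the same handle would presumably be passed as -fh to Bio::AlignIO->new, the way the script above passes \*STDOUT; IO::String from CPAN is the alternative on older Perls.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a writer such as $alnIO->write_aln($aln):
# it prints to a filehandle and returns a true status, not the text.
sub write_report {
    my ($fh, @rows) = @_;
    print {$fh} "$_\n" for @rows;
    return 1;
}

# Open a write handle backed by a scalar instead of a file.
my $buffer = '';
open(my $string_fh, '>', \$buffer)
    or die "cannot open in-memory handle: $!";

my $status = write_report($string_fh, 'seq1 ACGT-ACGT', 'seq2 ACGTTACGT');
close $string_fh;

print "status: $status\n";     # the return value is only a success flag
print "captured:\n$buffer";    # the written text ends up in $buffer
```

The design point is that the caller decides where the output goes by choosing the handle, so the writer itself does not need a separate "return as string" mode.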
>
>
> Best regards,
> Bernd
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

From cjfields at uiuc.edu Wed Jul 19 10:55:02 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Wed, 19 Jul 2006 09:55:02 -0500
Subject: [Bioperl-l] ListSummaries delay apologies
Message-ID: <002a01c6ab43$508aa5a0$15327e82@pyrimidine>

Sorry about the delay for the ListSummaries the past couple months; things
have been pretty hectic here, which has put me really behind on them (it
hasn't ever been my top priority, anyway). We're getting papers ready for
publication, I'm going to a summer institute in a few weeks, and research
(as always) is full steam ahead.

Just so everybody knows, I haven't given up on them, and plan on getting
caught up after I get back from the institute in Connecticut (beginning of
August).

Cheers!

Christopher Fields
Postdoctoral Researcher - Switzer Lab
Dept. of Biochemistry
University of Illinois Urbana-Champaign

From hlapp at gmx.net Wed Jul 19 11:31:50 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Wed, 19 Jul 2006 11:31:50 -0400
Subject: [Bioperl-l] Walking multiple bioentries using bioperl-db
In-Reply-To:
References:
Message-ID: <62DA6CBC-CD0E-46A7-A669-71FFC808041B@gmx.net>

On Jul 19, 2006, at 9:43 AM, Jay Hannah wrote:

> Howdy --
>
> I'm using bioperl-db + biosql-schema + mySQL.
>
> I can now successfully build a biosql-schema instance in mySQL, load
> taxonomy, then using bioperl-db load a GenBank file from disk, commiting
> the sequences I want. For a given accession number + version + namespace,
> I can tell bioperl-db to delete that from mySQL and it does. Yay!! I'll be
> throwing a "Using bioperl-db" document onto the wiki over the next week.

Excellent!

>
> What I am current baffled by:
>
> How do I ask bioperl-db to walk over multiple bioentries in my database so
> I can do things with them?
The simplest possible example: print a list of
> all bioentries in my database.
>
> It is trivially easy to just query mySQL directly, but if I'm reading /
> understanding the documentation correctly bioperl-db intends to be
> database schema and RDBMS agnostic. In that case, I should use bioperl-db
> to walk my records. So, how do I do that?

Bioperl-db indeed intends to be schema(-variant) and RDBMS agnostic, but
that doesn't mean that you have to be as well. If you find it trivially
easy to query your database using SQL and DBI and you don't care about
being RDBMS or schema-variant agnostic, then by all means don't feel
obligated to go through the bioperl-db API for querying.

Note you can obtain the DBI database handle being used by a persistence
adaptor by calling dbh():

    my $dbh = $adaptor->dbh();

(The advantage of this is that you use the same connection, and therefore
the same machinery for obtaining connection parameters and building the
DSN that the rest of bioperl-db uses. Also, you have the ability to see
transactions in progress that have not been committed yet by the adaptor.)

What you should not do through SQL directly is modify (UPDATE & DELETE)
entities which bioperl-db also holds in a cache (by default terms,
dbxrefs), unless you also take care to clear the cache of the respective
adaptor.

>
> Is Bio::DB::Query::BioQuery the way to do this? The only way?

Well, yes, unless you want to use SQL directly (which is not a despised
option, see above).

>
> If so then can someone help me understand the datacollections() and
> where() methods?

datacollections() in essence corresponds to the FROM clause in a SQL
statement, including JOIN statements. '=>' joins two entities in a 1:n
relationship, '<=>' joins two entities in an n:n relationship. Instead of
the table(s) you give the (Bioperl) objects that are to be joined, and
bioperl-db will translate the objects to database entities, i.e., tables.
Each object may be followed by an alias.
The alias makes it easier to refer to the object (entity) in the query
constraint part (where()). A single alias following a join expression will
always apply to the master object (table).

>
> perldoc Bio::DB::Query::BioQuery
>
> # all mouse sequences loaded under namespace ensembl that
> # have receptor in their description
> $query->datacollections(["Bio::PrimarySeqI e",
> "Bio::Species=>Bio::PrimarySeqI sp",
> "BioNamespace=>Bio::PrimarySeqI db"]);

This is short for

    $query->datacollections([
        # enumerate the objects we need:
        "Bio::PrimarySeqI e",
        "Bio::Species sp",
        "BioNamespace db",
        # specify master-detail relationships
        "Bio::Species=>Bio::PrimarySeqI",
        "BioNamespace=>Bio::PrimarySeqI"]);

because the alias following the join statement applies to the master
entity.

> $query->where(["sp.binomial like 'Mus *'",
> "e.desc like '*receptor*'",
> "db.namespace = 'ensembl'"]);

The where() method corresponds to the WHERE clause in SQL. The default
logical operator between constraints is AND. There is more documentation
on the syntax of expressing constraints in
Bio::DB::Query::QueryConstraint.

The column for which to constrain the value is given as the attribute
(method) of the (bioperl) object. If there are multiple objects in the
'datacollections' then you need to qualify each attribute by prefixing it
with the object, or the alias assigned in datacollections(), followed by a
dot; corresponding to typical OO syntax.

>
> # all mouse sequences loaded under namespace ensembl that
> # have receptor in their description, and that also have a
> # cross-reference with SWISS as the database
> $query->datacollections(["Bio::PrimarySeqI e",
> "Bio::Species=>Bio::PrimarySeqI sp",
> "BioNamespace=>Bio::PrimarySeqI db",
> "Bio::Annotation::DBLink xref",
>
> I'm bewildered by this API. Please forgive my ignorance.

I understand. This part of the API is by far the one with the skimpiest
documentation.
There are a considerable number of tests in t/query.t which may serve as
examples. They also are known to work if their tests don't fail. The tests
don't actually execute any query, instead some internal guts are used to
test the translation to SQL, so if you know SQL you may be able to
understand better what's going on by seeing the object-level query and the
SQL-level query side-by-side.

>
> 1) How do I get *all* bioentries out of my database?

Your datacollections would consist of the single object Bio::SeqI (or
Bio::PrimarySeqI if you didn't want any annotation), and there would be no
query constraint:

    my $query = Bio::DB::Query::BioQuery->new(-datacollections => ["Bio::SeqI"]);

>
> 2) Say I did want just the "namespace" 'Pico' (one of my
> biodatabase.name's). Where did
>
> "BioNamespace=>Bio::PrimarySeqI db"]);
>
> come from? How was I supposed to figure out the left hand side of that
> mapping? The right hand side? If that line wasn't sitting in that document
> was there a way for me to figure it out as a *user* of bioperl-db?

You would not know from Bioperl itself. The right hand side is a Bioperl
class. The left hand side is a kludge because Bioperl does not have a
namespace class, instead objects that have a namespace implement the
Bio::IdentifiableI interface directly. This kind of one class mapping to
two database entities (biodatabase is a table separate from, in fact a
master for, bioentry) is extremely cumbersome to express in a generic way,
so I chose to create a Bio::DB::Persistent::BioNamespace class to
represent that for the purpose of queries.

> Or would I need to be a *programmer* of bioperl-db reading source to figure
> this out? Where did
>
> "db.namespace = 'ensembl'"]);
>
> come from? Again, do I have to read source code to know how to invoke
> that magic?
Well, I'm not sure even reading the source code clears it all up ;) As I said before, the part before the dot is the alias or object, the part after is the attribute (or method) to be constrained. > > Sorry if I sound like a jerk. That is not my intention. Hopefully I > can > document the answers for future bioperl-db'ers. No problem, that's fine - and whatever you would be willing to contribute to documentation would be highly appreciated. -hilmar > > Thanks in advance, > > j > my current plaything: http://openlab.jays.net > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From aaron.j.mackey at gsk.com Wed Jul 19 09:48:55 2006 From: aaron.j.mackey at gsk.com (aaron.j.mackey at gsk.com) Date: Wed, 19 Jul 2006 09:48:55 -0400 Subject: [Bioperl-l] bioperl pulls, xml parsers push, and things get complicated In-Reply-To: <44BD776A.1080402@cornell.edu> Message-ID: There are 3rd generation XML "Pull" parsers (also called "StAX" for Streaming API for XML), but they seem to still be stuck in Java land (e.g. "MXP1") You could probably use POE to setup a state machine that used XML::Twig to "push" units of XML content onto a stack, to be read by your "next_*" pull method (where the XML::Twig push "stalled" until the "next_*" method was called, and vice versa). -Aaron bioperl-l-bounces at lists.open-bio.org wrote on 07/18/2006 08:06:02 PM: > Hi all, > > Here's a kind of abstract question about Bioperl and XML parsing: > > I'm thinking about writing a bioperl parser for genomethreader XML, and > I'm sort of mulling over the 'impedence mismatch' between the way > bioperl Bio::*IO::* modules work and the way all of the current XML > parsers work. 
Bioperl uses a 'pull' model, where every time you want a > new chunk of stuff, you call $io_object->next_thing. All the XML > parsers (including XML::SAX, XML::Parser::PerlSAX and XML::Twig) use a > 'push' model, where every time they parse a chunk, they call _your_ > code, usually via a subroutine reference you've given to the XML parser > when you start it up. > > From what I can tell, current Bioperl IO modules that parse XML are > using push parsers to parse the whole document, holding stuff in memory, > then spoon-feeding it in chunks to the calling program when it calls > next_*(). This is fine until the input XML gets really big, in which > case you can quickly run out of memory. > > Does anybody have good ideas for nice, robust ways of writing a bioperl > IO module for really big input XML files? There don't seem to be any > perl pull parsers for XML. All I've dug up so far would be having the > XML push parser running in a different thread or process, pushing chunks > of data into a pipe or similar structure that blocks the progress of the > push parser until the pulling bioperl code wants the next piece of data, > but there are plenty of ugly issues with that, whether one were too use > perl threads for it (aaagh!) or fork and push some kind of intermediate > format through a pipe or socket between the two processes (eek!). > > So, um, if you've read this far, do you have any ideas? 
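
The blocking-pipe idea described in the question above can be sketched in core Perl. This is only an illustration under stated assumptions, not how any BioPerl module works: push_parse is a hypothetical stand-in for a push-style parser (a real one would be a SAX or XML::Twig handler), records are assumed to fit on one line each, and fork is assumed to be available. The child process runs the push parser and writes each chunk to a pipe; the parent reads one chunk per call, and the child blocks in its write once the pipe buffer is full, so the parser can only run ahead by roughly one buffer's worth of data.

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical stand-in for a push-style XML parser: it calls the supplied
# callback once per parsed chunk.
sub push_parse {
    my ($on_chunk) = @_;
    $on_chunk->("record $_") for 1 .. 5;
}

# Run a push-style producer in a child process and return a pull-style
# iterator (a closure) that yields one chunk per call, then undef at the end.
sub make_pull_iterator {
    my ($producer) = @_;
    pipe(my $reader, my $writer) or die "pipe failed: $!";
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ($pid == 0) {                 # child: push chunks into the pipe
        close $reader;
        $producer->(sub { print {$writer} "$_[0]\n" });
        close $writer;               # flushes any buffered chunks
        POSIX::_exit(0);             # skip END blocks inherited from the parent
    }

    close $writer;                   # parent: only reads
    my $done = 0;
    return sub {
        return if $done;
        my $line = <$reader>;
        if (!defined $line) {        # end of input: the child has finished
            $done = 1;
            close $reader;
            waitpid($pid, 0);
            return;
        }
        chomp $line;
        return $line;
    };
}

my $next_thing = make_pull_iterator(\&push_parse);
while (defined(my $chunk = $next_thing->())) {
    print "pulled: $chunk\n";
}
```

A next_*() method in an IO module could wrap such a closure and turn each line back into an object; the serialization format between the two processes is the awkward part and is left open here.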
> > Rob > > -- > Robert Buels > SGN Bioinformatics Analyst > 252A Emerson Hall, Cornell University > Ithaca, NY 14853 > Tel: 503-889-8539 > rmb32 at cornell.edu > http://www.sgn.cornell.edu > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > From arareko at campus.iztacala.unam.mx Wed Jul 19 12:20:21 2006 From: arareko at campus.iztacala.unam.mx (Mauricio Herrera Cuadra) Date: Wed, 19 Jul 2006 11:20:21 -0500 Subject: [Bioperl-l] bioperl pulls, xml parsers push, and things get complicated In-Reply-To: <002801c6ab40$7cfcd980$15327e82@pyrimidine> References: <002801c6ab40$7cfcd980$15327e82@pyrimidine> Message-ID: <44BE5BC5.5040006@campus.iztacala.unam.mx> There are a lot of different XML processing strategies. Most fall into two categories: stream-based and tree-based. With the stream-based strategy, the parser continuously alerts a program to patterns in the XML. The parser functions like a pipeline, taking XML markup on one end and pumping out processed nuggets of data to your program. With the tree-based strategy, the parser keeps the data to itself until the very end, when it presents a complete model of the document to your program. The whole point to this strategy is that your program can pull out any data it needs, in any order. Most of the times I use tree-based strategies because they place all of the data into a structure which lets me to access any internal node using array/hash references. The simplest parser for this is XML::Simple using XML::Parser as the 'preferred parser' (which is built on top of XML::Parser::Expat, which is a wrapper around the expat library). More advanced parsers (both stream and tree-based) are: * XML::LibXML (a wrapper for libxml2's C library) * XML::Grove (takes a tree and changes it into an object hierarchy. 
Each node type is represented by a different class) * XML::PYX (for repackaging XML as a stream of easily recognizable and transmutable symbols) * XML::SimpleObject (changes a hierarchy of lists into a hierarchy of objects) * XML::XPath (for writing expressions that pinpoint specific pieces of documents) There are also some standards-based solutions like: * XML::SAX (Simple API for XML) for event streams. * XML::DOM (Document Object Model) for tree processing. Your strategy of choice depends a lot on the type of XML files you want to parse. Understanding the structure of the files and deciding which is the data you want to extract from them is a fundamental step to choose the appropriate method/parser to use. Just my 2 cents :) Regards, Mauricio. Chris Fields wrote: > The Bio::SearchIO modules are supposed work like a SAX parser, where results > are returned as the report is parsed b/c of the occurrence of specific > 'events' (start_element, end_element, and so on). However, the actual > behaviour for each module changes depending on the report type and the > author's intention. > > There was a thread about a month ago on HMMPFAM report parsing where there > was some contention as to how to build hits(models)/HSPs(domains). HMMPFAM > output has one HSP per hit and is sorted on the sequence length so a > particular hit can appear more than once, depending on how many times it > hits along the sequence length itself. So, to gather all the HSPs together > under one hit you would have to parse the entire report and build up a > Hit/HSP tree, then use the next_hit/next_hsp oterators to parse through > everything. Currently it just reports Hit/HSP pairs and it is up to the > user to build that tree. > > In contrast, BLAST output should be capable of throwing hit/HSP clusters on > the fly based on the report output, but is quite slow (event the XML output > crawls). 
Jason thinks it's b/c of object inheritance and instantiation; I > think it's probably more complicated than that (there are a ton of method > calls which tend to slow things down quite a bit as well). > > I would say try using SearchIO, but instead of relying directly on object > handler calls to create Hit/HSP objects using an object factory (which is > where I think a majority of the speed is lost), build the data internally on > the fly using start_element/end_element, then return hashes instead based on > the element type triggered using end_element. > > As an aside, I'm trying to switch the SearchIO::blastxml over to XML::SAX > (using XML::SAX::ExpatXS/expat) and plan on switching it over to using > hashes at some point, possibly starting off with a different SearchIO plugin > module. If you have other suggestions (XML parser of choice, ways to speed > up parsing/retrieve data) we would be glad to hear them. > > Chris > > > >> -----Original Message----- >> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- >> bounces at lists.open-bio.org] On Behalf Of Robert Buels >> Sent: Tuesday, July 18, 2006 7:06 PM >> To: bioperl-l at bioperl.org >> Subject: [Bioperl-l] bioperl pulls, xml parsers push,and things get >> complicated >> >> Hi all, >> >> Here's a kind of abstract question about Bioperl and XML parsing: >> >> I'm thinking about writing a bioperl parser for genomethreader XML, and >> I'm sort of mulling over the 'impedence mismatch' between the way >> bioperl Bio::*IO::* modules work and the way all of the current XML >> parsers work. Bioperl uses a 'pull' model, where every time you want a >> new chunk of stuff, you call $io_object->next_thing. All the XML >> parsers (including XML::SAX, XML::Parser::PerlSAX and XML::Twig) use a >> 'push' model, where every time they parse a chunk, they call _your_ >> code, usually via a subroutine reference you've given to the XML parser >> when you start it up. 
>> >> From what I can tell, current Bioperl IO modules that parse XML are >> using push parsers to parse the whole document, holding stuff in memory, >> then spoon-feeding it in chunks to the calling program when it calls >> next_*(). This is fine until the input XML gets really big, in which >> case you can quickly run out of memory. >> >> Does anybody have good ideas for nice, robust ways of writing a bioperl >> IO module for really big input XML files? There don't seem to be any >> perl pull parsers for XML. All I've dug up so far would be having the >> XML push parser running in a different thread or process, pushing chunks >> of data into a pipe or similar structure that blocks the progress of the >> push parser until the pulling bioperl code wants the next piece of data, >> but there are plenty of ugly issues with that, whether one were too use >> perl threads for it (aaagh!) or fork and push some kind of intermediate >> format through a pipe or socket between the two processes (eek!). >> >> So, um, if you've read this far, do you have any ideas? 
>> >> Rob >> >> -- >> Robert Buels >> SGN Bioinformatics Analyst >> 252A Emerson Hall, Cornell University >> Ithaca, NY 14853 >> Tel: 503-889-8539 >> rmb32 at cornell.edu >> http://www.sgn.cornell.edu >> >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- MAURICIO HERRERA CUADRA arareko at campus.iztacala.unam.mx Laboratorio de Genética Unidad de Morfofisiología y Función Facultad de Estudios Superiores Iztacala, UNAM From cjfields at uiuc.edu Wed Jul 19 14:45:55 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Wed, 19 Jul 2006 13:45:55 -0500 Subject: [Bioperl-l] bioperl pulls, xml parsers push, and things get complicated In-Reply-To: <44BE5BC5.5040006@campus.iztacala.unam.mx> Message-ID: <000301c6ab63$91d31680$15327e82@pyrimidine> Yeah, we use XML::SAX, with XML::SAX::ExpatXS and expat, for SearchIO::blastxml. It previously used XML::Parser::PerlSAX but that didn't support SAX2-based parsing. XML::Twig is also used quite a bit. Jason added his thoughts about this to the wiki: http://www.bioperl.org/wiki/XML_parsers Personally, I use XML::Simple with EUtilities because the XML returned is remarkably simple and normally fairly short. The trick is making sure when parsing data to dereference everything properly since XML::Simple stores everything in an elaborate data structure. I plan on switching to XML::SAX::ExpatXS or XML::Twig soon. Chris > There are a lot of different XML processing strategies. Most fall into > two categories: stream-based and tree-based. > > With the stream-based strategy, the parser continuously alerts a program > to patterns in the XML.
The parser functions like a pipeline, taking XML > markup on one end and pumping out processed nuggets of data to your > program. > > With the tree-based strategy, the parser keeps the data to itself until > the very end, when it presents a complete model of the document to your > program. The whole point to this strategy is that your program can pull > out any data it needs, in any order. > > Most of the times I use tree-based strategies because they place all of > the data into a structure which lets me to access any internal node > using array/hash references. The simplest parser for this is XML::Simple > using XML::Parser as the 'preferred parser' (which is built on top of > XML::Parser::Expat, which is a wrapper around the expat library). > > More advanced parsers (both stream and tree-based) are: > > * XML::LibXML (a wrapper for libxml2's C library) > * XML::Grove (takes a tree and changes it into an object hierarchy. Each > node type is represented by a different class) > * XML::PYX (for repackaging XML as a stream of easily recognizable and > transmutable symbols) > * XML::SimpleObject (changes a hierarchy of lists into a hierarchy of > objects) > * XML::XPath (for writing expressions that pinpoint specific pieces of > documents) > > There are also some standards-based solutions like: > > * XML::SAX (Simple API for XML) for event streams. > * XML::DOM (Document Object Model) for tree processing. > > Your strategy of choice depends a lot on the type of XML files you want > to parse. Understanding the structure of the files and deciding which is > the data you want to extract from them is a fundamental step to choose > the appropriate method/parser to use. > > Just my 2 cents :) > > Regards, > Mauricio. > > Chris Fields wrote: > > The Bio::SearchIO modules are supposed work like a SAX parser, where > results > > are returned as the report is parsed b/c of the occurrence of specific > > 'events' (start_element, end_element, and so on). 
However, the actual > > behaviour for each module changes depending on the report type and the > > author's intention. > > > > There was a thread about a month ago on HMMPFAM report parsing where > there > > was some contention as to how to build hits(models)/HSPs(domains). > HMMPFAM > > output has one HSP per hit and is sorted on the sequence length so a > > particular hit can appear more than once, depending on how many times it > > hits along the sequence length itself. So, to gather all the HSPs > together > > under one hit you would have to parse the entire report and build up a > > Hit/HSP tree, then use the next_hit/next_hsp oterators to parse through > > everything. Currently it just reports Hit/HSP pairs and it is up to the > > user to build that tree. > > > > In contrast, BLAST output should be capable of throwing hit/HSP clusters > on > > the fly based on the report output, but is quite slow (event the XML > output > > crawls). Jason thinks it's b/c of object inheritance and instantiation; > I > > think it's probably more complicated than that (there are a ton of > method > > calls which tend to slow things down quite a bit as well). > > > > I would say try using SearchIO, but instead of relying directly on > object > > handler calls to create Hit/HSP objects using an object factory (which > is > > where I think a majority of the speed is lost), build the data > internally on > > the fly using start_element/end_element, then return hashes instead > based on > > the element type triggered using end_element. > > > > As an aside, I'm trying to switch the SearchIO::blastxml over to > XML::SAX > > (using XML::SAX::ExpatXS/expat) and plan on switching it over to using > > hashes at some point, possibly starting off with a different SearchIO > plugin > > module. If you have other suggestions (XML parser of choice, ways to > speed > > up parsing/retrieve data) we would be glad to hear them. 
> > > > Chris > > > > > > > >> -----Original Message----- > >> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > >> bounces at lists.open-bio.org] On Behalf Of Robert Buels > >> Sent: Tuesday, July 18, 2006 7:06 PM > >> To: bioperl-l at bioperl.org > >> Subject: [Bioperl-l] bioperl pulls, xml parsers push,and things get > >> complicated > >> > >> Hi all, > >> > >> Here's a kind of abstract question about Bioperl and XML parsing: > >> > >> I'm thinking about writing a bioperl parser for genomethreader XML, and > >> I'm sort of mulling over the 'impedence mismatch' between the way > >> bioperl Bio::*IO::* modules work and the way all of the current XML > >> parsers work. Bioperl uses a 'pull' model, where every time you want a > >> new chunk of stuff, you call $io_object->next_thing. All the XML > >> parsers (including XML::SAX, XML::Parser::PerlSAX and XML::Twig) use a > >> 'push' model, where every time they parse a chunk, they call _your_ > >> code, usually via a subroutine reference you've given to the XML parser > >> when you start it up. > >> > >> From what I can tell, current Bioperl IO modules that parse XML are > >> using push parsers to parse the whole document, holding stuff in > memory, > >> then spoon-feeding it in chunks to the calling program when it calls > >> next_*(). This is fine until the input XML gets really big, in which > >> case you can quickly run out of memory. > >> > >> Does anybody have good ideas for nice, robust ways of writing a bioperl > >> IO module for really big input XML files? There don't seem to be any > >> perl pull parsers for XML. 
All I've dug up so far would be having the > >> XML push parser running in a different thread or process, pushing > chunks > >> of data into a pipe or similar structure that blocks the progress of > the > >> push parser until the pulling bioperl code wants the next piece of > data, > >> but there are plenty of ugly issues with that, whether one were too use > >> perl threads for it (aaagh!) or fork and push some kind of intermediate > >> format through a pipe or socket between the two processes (eek!). > >> > >> So, um, if you've read this far, do you have any ideas? > >> > >> Rob > >> > >> -- > >> Robert Buels > >> SGN Bioinformatics Analyst > >> 252A Emerson Hall, Cornell University > >> Ithaca, NY 14853 > >> Tel: 503-889-8539 > >> rmb32 at cornell.edu > >> http://www.sgn.cornell.edu > >> > >> > >> _______________________________________________ > >> Bioperl-l mailing list > >> Bioperl-l at lists.open-bio.org > >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > -- > MAURICIO HERRERA CUADRA > arareko at campus.iztacala.unam.mx > Laboratorio de Gen?tica > Unidad de Morfofisiolog?a y Funci?n > Facultad de Estudios Superiores Iztacala, UNAM > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From rmb32 at cornell.edu Wed Jul 19 15:30:28 2006 From: rmb32 at cornell.edu (Robert Buels) Date: Wed, 19 Jul 2006 12:30:28 -0700 Subject: [Bioperl-l] bioperl pulls, xml parsers push, and things get complicated In-Reply-To: References: Message-ID: <44BE8854.8010301@cornell.edu> POE is a really neat thing, I didn't know about it before. Something tells me, however, that I would have trouble convincing people to install POE as a dependency for a genomethreader output parser. 
;-) I hope I'll have the opportunity to use it sometime. For the curious, here's a nice intro to POE: http://perl.com/pub/a/2001/01/poe.html And the POE main site: http://poe.perl.org/ Rob aaron.j.mackey at GSK.COM wrote: > There are 3rd generation XML "Pull" parsers (also called "StAX" for > Streaming API for XML), but they seem to still be stuck in Java land (e.g. > "MXP1") > > You could probably use POE to setup a state machine that used XML::Twig to > "push" units of XML content onto a stack, to be read by your "next_*" pull > method (where the XML::Twig push "stalled" until the "next_*" method was > called, and vice versa). > > -Aaron > > bioperl-l-bounces at lists.open-bio.org wrote on 07/18/2006 08:06:02 PM: > > >> Hi all, >> >> Here's a kind of abstract question about Bioperl and XML parsing: >> >> I'm thinking about writing a bioperl parser for genomethreader XML, and >> I'm sort of mulling over the 'impedence mismatch' between the way >> bioperl Bio::*IO::* modules work and the way all of the current XML >> parsers work. Bioperl uses a 'pull' model, where every time you want a >> new chunk of stuff, you call $io_object->next_thing. All the XML >> parsers (including XML::SAX, XML::Parser::PerlSAX and XML::Twig) use a >> 'push' model, where every time they parse a chunk, they call _your_ >> code, usually via a subroutine reference you've given to the XML parser >> when you start it up. >> >> From what I can tell, current Bioperl IO modules that parse XML are >> using push parsers to parse the whole document, holding stuff in memory, >> > > >> then spoon-feeding it in chunks to the calling program when it calls >> next_*(). This is fine until the input XML gets really big, in which >> case you can quickly run out of memory. >> >> Does anybody have good ideas for nice, robust ways of writing a bioperl >> IO module for really big input XML files? There don't seem to be any >> perl pull parsers for XML. 
All I've dug up so far would be having the >> XML push parser running in a different thread or process, pushing chunks >> > > >> of data into a pipe or similar structure that blocks the progress of the >> > > >> push parser until the pulling bioperl code wants the next piece of data, >> > > >> but there are plenty of ugly issues with that, whether one were too use >> perl threads for it (aaagh!) or fork and push some kind of intermediate >> format through a pipe or socket between the two processes (eek!). >> >> So, um, if you've read this far, do you have any ideas? >> >> Rob >> >> -- >> Robert Buels >> SGN Bioinformatics Analyst >> 252A Emerson Hall, Cornell University >> Ithaca, NY 14853 >> Tel: 503-889-8539 >> rmb32 at cornell.edu >> http://www.sgn.cornell.edu >> >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> >> > > > -- Robert Buels SGN Bioinformatics Analyst 252A Emerson Hall, Cornell University Ithaca, NY 14853 Tel: 503-889-8539 rmb32 at cornell.edu http://www.sgn.cornell.edu From dwaner at scitegic.com Wed Jul 19 15:47:58 2006 From: dwaner at scitegic.com (dwaner at scitegic.com) Date: Wed, 19 Jul 2006 12:47:58 -0700 Subject: [Bioperl-l] EMBL release 87 format changes. Message-ID: BioPerl Users and Developers, I have updated the EMBL SeqIO parser to work correctly with Release 87 of EMBL (June 19th, 2006). As suggested by Chris Fields in an earlier message, the EMBL parser now reads both new and old formats, but only writes the new format. I don't think that my changes will affect most users, but if you are using the EMBL format can you review the changes described below and speak up if anything looks like it could create a problem for you? If I don't hear any objections soon, I will submit a patch to bugzilla. Thanks, - David Parser changes: - EMBL files no longer contain the "entry name". 
When reading old format files, the EMBL "entry name" from the ID line is used as the Bio::Seq::id and Bio::Seq::display_id, but when reading new format files, the accession number is used for these fields.

Changes to output:

- The ID line was changed to the new format.
- The SV line is never written; SV is now part of the ID line.
- "DNA" and "RNA" are no longer valid EMBL molecule types. They are now written as "unassigned DNA" and "unassigned RNA"
- Strictly speaking, EMBL format should only be used for nucleotide sequences. If the alphabet is 'protein', write_seq() emits a warning and writes the non-standard molecule type "AA" in the ID line.
- Because BioPerl sequences do not have a "data class" attribute, all sequences are written with a data class of "STD" in the ID line.
- The ID line contains the Bio::Seq::accession, unless it is missing, in which case the Bio::Seq::id is used.
- molecule type is strictly validated. Non-EMBL values are output as "unassigned DNA" or "unassigned RNA", depending on the sequence alphabet.
- "taxonomic division" is strictly validated. Non-EMBL values are output as "UNC".
- The taxonomic division code "UNK" is now written as "UNC" (unclassified).

Possible Gotchas for some users:

- Because the EMBL entry name is no longer included anywhere in the file, when round-tripping from old format to new format the entry name will be lost.
- In order to ensure that BioPerl writes valid EMBL files, I have added strict validation to the writer for "molecule type" and "taxonomic division". This could present a problem for users who are using non-standard values for these fields, but I felt it was important to write files that adhere to the EMBL spec.
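
For readers who want to see the validation rules above in code form, here is a minimal sketch in plain Perl. It is not the patched Bio::SeqIO::embl code: the helper names embl_molecule_type() and embl_division() are hypothetical, and the two lookup tables are an assumed subset of the values EMBL accepts (check the EMBL user manual for the authoritative lists). It only illustrates the fallback behaviour described in the list.

```perl
use strict;
use warnings;

# Assumed subset of molecule types valid in the new ID line.
my %VALID_MOLTYPE = map { $_ => 1 } (
    'genomic DNA', 'genomic RNA', 'mRNA', 'tRNA', 'rRNA',
    'other DNA', 'other RNA', 'unassigned DNA', 'unassigned RNA',
);

# Assumed list of taxonomic division codes.
my %VALID_DIVISION = map { $_ => 1 } qw(
    ENV FUN HUM INV MAM MUS PHG PLN PRO ROD SYN TGN UNC VRL VRT
);

# Hypothetical helper: pick the molecule type to write in the ID line.
sub embl_molecule_type {
    my ($molecule, $alphabet) = @_;
    $alphabet = 'dna' unless defined $alphabet;
    if ($alphabet eq 'protein') {
        # Non-standard, but mirrors the behaviour described above.
        warn "EMBL format should only be used for nucleotide sequences\n";
        return 'AA';
    }
    return $molecule if defined $molecule && $VALID_MOLTYPE{$molecule};
    # Anything else falls back according to the sequence alphabet.
    return $alphabet eq 'rna' ? 'unassigned RNA' : 'unassigned DNA';
}

# Hypothetical helper: unknown or missing divisions (including the
# old "UNK" code) are written as "UNC".
sub embl_division {
    my ($division) = @_;
    return 'UNC' unless defined $division && $VALID_DIVISION{$division};
    return $division;
}

print embl_molecule_type('DNA',  'dna'), "\n";   # unassigned DNA
print embl_molecule_type('mRNA', 'rna'), "\n";   # mRNA
print embl_division('UNK'), "\n";                # UNC
```

Keeping the accepted values in hashes means a user with non-standard values can see at a glance what will be rewritten on output.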
From slenk at emich.edu Wed Jul 19 16:04:16 2006
From: slenk at emich.edu (Stephen Gordon Lenk)
Date: Wed, 19 Jul 2006 16:04:16 -0400
Subject: [Bioperl-l] bioperl pulls, xml parsers push, and things get complicated
Message-ID: <13edac5b13ed8208.13ed820813edac5b@emich.edu>

Hi,

I have found that POE fails to execute a periodic task after 32 iterations in a Perl thread; the failure is consistent on both XP and OSX. If I knew how to write up a defect for Perl I would do so (hint: how is this done? I'm *not* asking RTFM etc.), and I'm probably remiss for not having done it. I was going to write messages to a Controller Area Network (CAN) to control automotive widgets from Perl; I wound up using a C code exe (piped to from Perl) with its own threads to do this. Oh yes, I believe that bio lab systems can be done this way as well.

But ... POE is really neat if you think in state machine terms. I have an alternate architecture for my test harness (Perlizer) that would use POE to run tests with CAN and GPIB.

Steve Lenk

----- Original Message -----
From: Robert Buels
Date: Wednesday, July 19, 2006 3:30 pm
Subject: Re: [Bioperl-l] bioperl pulls, xml parsers push, and things get complicated

> POE is a really neat thing, I didn't know about it before. Something
> tells me, however, that I would have trouble convincing people to
> install POE as a dependency for a genomethreader output parser. ;-)
> I hope I'll have the opportunity to use it sometime.
>
> For the curious, here's a nice intro to POE:
> http://perl.com/pub/a/2001/01/poe.html
> And the POE main site:
> http://poe.perl.org/
>
> Rob
>
> aaron.j.mackey at GSK.COM wrote:
> > There are 3rd generation XML "Pull" parsers (also called "StAX" for
> > Streaming API for XML), but they seem to still be stuck in Java land (e.g.
> > "MXP1") > > > > You could probably use POE to setup a state machine that used > XML::Twig to > > "push" units of XML content onto a stack, to be read by your > "next_*" pull > > method (where the XML::Twig push "stalled" until the "next_*" > method was > > called, and vice versa). > > > > -Aaron > > > > bioperl-l-bounces at lists.open-bio.org wrote on 07/18/2006 > 08:06:02 PM: > > > > > >> Hi all, > >> > >> Here's a kind of abstract question about Bioperl and XML parsing: > >> > >> I'm thinking about writing a bioperl parser for genomethreader > XML, and > >> I'm sort of mulling over the 'impedence mismatch' between the > way > >> bioperl Bio::*IO::* modules work and the way all of the current > XML > >> parsers work. Bioperl uses a 'pull' model, where every time > you want a > >> new chunk of stuff, you call $io_object->next_thing. All the > XML > >> parsers (including XML::SAX, XML::Parser::PerlSAX and > XML::Twig) use a > >> 'push' model, where every time they parse a chunk, they call > _your_ > >> code, usually via a subroutine reference you've given to the > XML parser > >> when you start it up. > >> > >> From what I can tell, current Bioperl IO modules that parse > XML are > >> using push parsers to parse the whole document, holding stuff > in memory, > >> > > > > > >> then spoon-feeding it in chunks to the calling program when it > calls > >> next_*(). This is fine until the input XML gets really big, in > which > >> case you can quickly run out of memory. > >> > >> Does anybody have good ideas for nice, robust ways of writing a > bioperl > >> IO module for really big input XML files? There don't seem to > be any > >> perl pull parsers for XML. 
All I've dug up so far would be > having the > >> XML push parser running in a different thread or process, > pushing chunks > >> > > > > > >> of data into a pipe or similar structure that blocks the > progress of the > >> > > > > > >> push parser until the pulling bioperl code wants the next piece > of data, > >> > > > > > >> but there are plenty of ugly issues with that, whether one were > too use > >> perl threads for it (aaagh!) or fork and push some kind of > intermediate > >> format through a pipe or socket between the two processes (eek!). > >> > >> So, um, if you've read this far, do you have any ideas? > >> > >> Rob > >> > >> -- > >> Robert Buels > >> SGN Bioinformatics Analyst > >> 252A Emerson Hall, Cornell University > >> Ithaca, NY 14853 > >> Tel: 503-889-8539 > >> rmb32 at cornell.edu > >> http://www.sgn.cornell.edu > >> > >> > >> _______________________________________________ > >> Bioperl-l mailing list > >> Bioperl-l at lists.open-bio.org > >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > >> > >> > > > > > > > > -- > Robert Buels > SGN Bioinformatics Analyst > 252A Emerson Hall, Cornell University > Ithaca, NY 14853 > Tel: 503-889-8539 > rmb32 at cornell.edu > http://www.sgn.cornell.edu > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > From cjfields at uiuc.edu Wed Jul 19 17:46:43 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Wed, 19 Jul 2006 16:46:43 -0500 Subject: [Bioperl-l] EMBL release 87 format changes. In-Reply-To: Message-ID: <000601c6ab7c$d39d8cd0$15327e82@pyrimidine> You can go ahead and submit the patch to Bugzilla anyway. Comments about the proposed changes from the developers can be added there. I think there's some confusion here, though: the EMBL SeqIO change you mentioned I committed is actually for Bio::SeqIO::swiss (SwissProt). I haven't touched Bio::SeqIO::embl (yet). 
'swiss' format now reads old and new swiss data files and writes only new format; no major changes have been made to SeqIO::embl in about a year (and even that was a small one). Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of dwaner at scitegic.com > Sent: Wednesday, July 19, 2006 2:48 PM > To: bioperl-l at lists.open-bio.org > Subject: Re: [Bioperl-l] EMBL release 87 format changes. > > BioPerl Users and Developers, > > I have updated the EMBL SeqIO parser to work correctly with Release 87 of > EMBL (June 19th, 2006). As suggested by Chris Fields in an earlier > message, the EMBL parser now reads both new and old formats, but only > writes the new format. > > I don't think that my changes will affect most users, but if you are using > the EMBL format can you review the changes described below and speak up if > anything looks like it could create a problem for you? > > If I don't hear any objections soon, I will submit a patch to bugzilla. > > Thanks, > > - David > > Parser changes: > > - EMBL files no longer contain the "entry name". When reading old format > files, > the EMBL "entry name" from the ID line is used as the Bio::Seq::id and > Bio::Seq::display_id, but when reading new format files, the accession > number > is used for these fields. > > Changes to output: > > - The ID line was changed to the new format. > > - The SV line is never written; SV is now part of the ID line. > > - "DNA" and "RNA" are no longer valid EMBL molecule types. They are now > written > as "unassigned DNA" and "unassigned RNA" > > - Strictly speaking, EMBL format should only be used for nucleotide > sequences. > If the alphabet is 'protein', write_seq() emits a warning and writes the > > non-standard molecule type "AA" in the ID line. > > - Because BioPerl sequences do not have a "data class" attribute, all > sequences > are written with a data class of "STD" in the ID line. 
> > - The ID line contains the Bio::Seq::accession, unless it is missing, in > which > case the Bio::Seq::id is used. > > - molecule type is strictly validated. Non-EMBL values are output as > "unassigned DNA" or "unassigned RNA", depending on the sequence > alphabet. > > - "taxonomic division" is strictly validated. Non-EMBL values are output > as "UNC". > > - The taxonomic division code "UNK" is now written as "UNC" > (unclassified). > > Possible Gotchas for some users: > > - Because the EMBL entry name is no longer included anywhere in the file, > when round-tripping from old format to new format the entry name will be > lost. > > - In order to ensure that BioPerl writes valid EMBL files, I have added > strict > validation to the writer for "molecule type" and "taxonomic division". > This > could present a problem for users who are using non-standard values for > these > fields, but I felt it was important to write files that adhere to the > EMBL spec. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From stewarta at nmrc.navy.mil Wed Jul 19 18:00:26 2006 From: stewarta at nmrc.navy.mil (Andrew Stewart) Date: Wed, 19 Jul 2006 18:00:26 -0400 Subject: [Bioperl-l] #bioperl Message-ID: Wandering about the new bioperl.org page, I noticed that there's never really been much mention of starting up a bioperl chat channel on IRC for casual bioperl discussion and support. This has worked really well for projects like MediaWiki, etc. I'll sit on the channel for awhile and maybe we can see if the idea picks up. Point your favorite IRC client to... (windows users I would suggest mIRC, mac I would suggest Colloquy) server: irc.freenode.net channel: #bioperl Hope to see you there. 
--
Andrew Stewart
Research Assistant, Genomics Team
Navy Medical Research Center (NMRC)
Biological Defense Research Directorate (BDRD)
BDRD Annex
12300 Washington Avenue, 2nd Floor
Rockville, MD 20852
email: stewarta at nmrc.navy.mil
phone: 301-231-6700 Ext 270

From rmb32 at cornell.edu Wed Jul 19 18:40:52 2006
From: rmb32 at cornell.edu (Robert Buels)
Date: Wed, 19 Jul 2006 15:40:52 -0700
Subject: [Bioperl-l] bioperl pulls, xml parsers push, and things get complicated
In-Reply-To: <002801c6ab40$7cfcd980$15327e82@pyrimidine>
References: <002801c6ab40$7cfcd980$15327e82@pyrimidine>
Message-ID: <44BEB4F4.1060407@cornell.edu>

Hi Chris,

It seems to me the SearchIO framework isn't really appropriate for genomethreader, since it's more of a gene prediction program than a search/alignment program.

Also, w.r.t. XML parsing and buffering, I don't see how Bio::SearchIO is fundamentally different from the other bioperl IO systems. It still has a next_this(), next_that() interface, which means lots of buffering memory if you're doing your actual parsing with a push parser (or a tree parser, of course, which is buffering an expanded form of the entire document). It looks like it just adds another layer of method calls for parser events, allowing the SearchIO to make different kinds of objects and stuff. It looks like none of this changes the fact that these are all push parsers, and bioperl pulls, so you have to buffer a lot of stuff.

I guess the only really general strategies for reducing the buffering are (a) to break up the XML with regexps and such, like Hilmar said, and (b) to put your push parser in another process, and somehow keep it blocking in one of its callbacks until you're ready for its next data.

I think what I'll do with the gthxml parser is find a way to split the input XML into chunks and run a parser separately on each, like Hilmar said. If more performance is needed, maybe a multi-process approach would be appropriate, but not yet.
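
To make strategy (b) concrete, here is a rough sketch using only core Perl (fork, pipe, POSIX). It is an illustration under assumptions, not a proposed bioperl module: make_pull_iterator() is a hypothetical name, the producer is a stand-in for a real SAX handler, and records are assumed to be single lines of text. Note that the push side is not stalled record by record; it can run ahead by as much as the OS pipe buffer and Perl's output buffer hold, and only then blocks until the reader catches up.

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical helper: run a "push" producer in a child process and
# return a closure that pulls one record per call (undef at the end).
sub make_pull_iterator {
    my ($producer) = @_;    # code ref; calls its callback once per record

    pipe(my $reader, my $writer) or die "pipe failed: $!";
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ($pid == 0) {        # child: the push side
        close $reader;
        # One record per line, so records must not contain newlines.
        my $ok = eval { $producer->(sub { print {$writer} $_[0], "\n" }); 1 };
        warn "producer failed: $@" unless $ok;
        close $writer;      # flushes anything still buffered
        POSIX::_exit($ok ? 0 : 1);   # skip END blocks inherited from the parent
    }

    close $writer;          # parent: the pull side
    my $done = 0;
    return sub {
        return if $done;
        my $line = <$reader>;
        if (defined $line) {
            chomp $line;
            return $line;
        }
        $done = 1;          # end of stream: clean up the child
        close $reader;
        waitpid($pid, 0);
        return;
    };
}

# Stand-in for a parser callback that fires once per parsed element.
my $next_record = make_pull_iterator(sub {
    my ($emit) = @_;
    $emit->("hsp_$_") for 1 .. 3;
});

while (defined(my $rec = $next_record->())) {
    print "got $rec\n";
}
```

A real module would also need a serialization format for structured records, which is exactly the "intermediate format" cost mentioned earlier in the thread.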
Anyway, looking at blastxml, I have some ruminations, which fill the rest of this email:

Looking at SearchIO::blastxml, it looks like it's already using XML::SAX, which will use XML::SAX::ExpatXS if installed. Is that recent? Is blastxml faster when using the tempfile option than when putting the whole report in a string in memory?

If you're looking for speed gains, have you tried running some kind of profiling on it? Whenever one is out to optimize code, profiling should be stop number one. Almost every time, you will be surprised at what parts of the code are actually eating up the most time. Here's a perl profiling intro: http://perl.com/pub/a/2004/06/25/profiling.html . The profiling mechanism talked about in that article is kind of old; there are also a bunch of newer code profiling tools available on CPAN. I haven't used any of them though. But yeah, I can't emphasize enough the importance of profiling if you're trying to optimize for speed.

As for memory, the blastxml parser suffers from the same handicap I was pondering at the start of this thread. To see what I mean, think of what would happen if there were somehow 10 million HSPs in one of the reports. It's buffering all of them before returning each result, and your machine could melt. :-) Things would be beautiful (and fast, probably) if next_hsp() would actually parse the next HSP in the report instead of just returning a HSP object that's sitting in memory. But there's not really anything that can be done about that, I don't think.

One nice thing: the blastxml parser's memory footprint doesn't really suffer if you have 100,000 blast reports in your input file, because it splits out the reports and parses each one individually. This I think is a good illustration of what Hilmar was talking about: breaking the input XML into chunks cuts down on the amount of buffering you have to do.
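
As a rough illustration of the chunking idea, the sketch below reads one complete element at a time from a filehandle by setting the input record separator to the closing tag, so only one chunk is held in memory per call. This is not how SearchIO::blastxml is implemented; make_chunk_iterator() is a hypothetical helper, and it assumes the chosen element never nests inside itself, is never written in self-closing form, and that its closing tag never appears inside CDATA or comments.

```perl
use strict;
use warnings;

# Hypothetical helper: return a closure that hands back one complete
# <$tag>...</$tag> element per call, reading only as far as needed.
sub make_chunk_iterator {
    my ($fh, $tag) = @_;
    my $close = "</$tag>";
    return sub {
        local $/ = $close;    # each read stops right after a closing tag
        while (defined(my $chunk = <$fh>)) {
            # Drop whatever precedes the opening tag (XML declaration,
            # wrapper elements, whitespace); skip reads without one.
            next unless $chunk =~ s/\A.*?(?=<\Q$tag\E[\s>])//s;
            # A final read at EOF may lack the closing tag; ignore it.
            return $chunk if substr($chunk, -length($close)) eq $close;
        }
        return;               # end of input
    };
}

my $xml = <<'XML';
<?xml version="1.0"?>
<results>
  <report id="1"><hit>a</hit></report>
  <report id="2"><hit>b</hit></report>
</results>
XML

open my $fh, '<', \$xml or die "cannot open in-memory handle: $!";
my $next_chunk = make_chunk_iterator($fh, 'report');
while (defined(my $chunk = $next_chunk->())) {
    # Each chunk is a small standalone document that can be handed
    # to any XML parser (SAX, XML::Twig, ...) on its own.
    print "chunk: $chunk\n";
}
```

The closure gives the calling code a pull-style next_*() interface, while the per-chunk parse can still use a push parser internally.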
As XML parsers go, I kind of like XML::Twig, because it manages to combine most of the easy use of a DOM/tree parser with the better memory usage and speed of a push parser (like SAX and XML::Parser). Within a parser callback, you have a DOM-like tree that's just the part of your XML document you're interested in at that time, and then you free that structure when you're done picking things out of it. I'm not sure how fast it is, though, probably not as fast as ExpatXS. At any rate, it is definitely a lot more intuitive to use than a more standard push parser, since if you make good choices about what elements to use as the roots of your twigs, you can often do your processing on a self-contained chunk and not have to keep track of a bunch of parse state like you typically need with a straight push parser like XML::Parser or a SAX parser. Rob Chris Fields wrote: > The Bio::SearchIO modules are supposed work like a SAX parser, where results > are returned as the report is parsed b/c of the occurrence of specific > 'events' (start_element, end_element, and so on). However, the actual > behaviour for each module changes depending on the report type and the > author's intention. > > There was a thread about a month ago on HMMPFAM report parsing where there > was some contention as to how to build hits(models)/HSPs(domains). HMMPFAM > output has one HSP per hit and is sorted on the sequence length so a > particular hit can appear more than once, depending on how many times it > hits along the sequence length itself. So, to gather all the HSPs together > under one hit you would have to parse the entire report and build up a > Hit/HSP tree, then use the next_hit/next_hsp oterators to parse through > everything. Currently it just reports Hit/HSP pairs and it is up to the > user to build that tree. > > In contrast, BLAST output should be capable of throwing hit/HSP clusters on > the fly based on the report output, but is quite slow (event the XML output > crawls). 
Jason thinks it's b/c of object inheritance and instantiation; I > think it's probably more complicated than that (there are a ton of method > calls which tend to slow things down quite a bit as well). > > I would say try using SearchIO, but instead of relying directly on object > handler calls to create Hit/HSP objects using an object factory (which is > where I think a majority of the speed is lost), build the data internally on > the fly using start_element/end_element, then return hashes instead based on > the element type triggered using end_element. > > As an aside, I'm trying to switch the SearchIO::blastxml over to XML::SAX > (using XML::SAX::ExpatXS/expat) and plan on switching it over to using > hashes at some point, possibly starting off with a different SearchIO plugin > module. If you have other suggestions (XML parser of choice, ways to speed > up parsing/retrieve data) we would be glad to hear them. > > Chris > > > > >> -----Original Message----- >> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- >> bounces at lists.open-bio.org] On Behalf Of Robert Buels >> Sent: Tuesday, July 18, 2006 7:06 PM >> To: bioperl-l at bioperl.org >> Subject: [Bioperl-l] bioperl pulls, xml parsers push,and things get >> complicated >> >> Hi all, >> >> Here's a kind of abstract question about Bioperl and XML parsing: >> >> I'm thinking about writing a bioperl parser for genomethreader XML, and >> I'm sort of mulling over the 'impedence mismatch' between the way >> bioperl Bio::*IO::* modules work and the way all of the current XML >> parsers work. Bioperl uses a 'pull' model, where every time you want a >> new chunk of stuff, you call $io_object->next_thing. All the XML >> parsers (including XML::SAX, XML::Parser::PerlSAX and XML::Twig) use a >> 'push' model, where every time they parse a chunk, they call _your_ >> code, usually via a subroutine reference you've given to the XML parser >> when you start it up. 
>> >> From what I can tell, current Bioperl IO modules that parse XML are >> using push parsers to parse the whole document, holding stuff in memory, >> then spoon-feeding it in chunks to the calling program when it calls >> next_*(). This is fine until the input XML gets really big, in which >> case you can quickly run out of memory. >> >> Does anybody have good ideas for nice, robust ways of writing a bioperl >> IO module for really big input XML files? There don't seem to be any >> perl pull parsers for XML. All I've dug up so far would be having the >> XML push parser running in a different thread or process, pushing chunks >> of data into a pipe or similar structure that blocks the progress of the >> push parser until the pulling bioperl code wants the next piece of data, >> but there are plenty of ugly issues with that, whether one were too use >> perl threads for it (aaagh!) or fork and push some kind of intermediate >> format through a pipe or socket between the two processes (eek!). >> >> So, um, if you've read this far, do you have any ideas? 
>> >> Rob >> >> -- >> Robert Buels >> SGN Bioinformatics Analyst >> 252A Emerson Hall, Cornell University >> Ithaca, NY 14853 >> Tel: 503-889-8539 >> rmb32 at cornell.edu >> http://www.sgn.cornell.edu >> >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> > > -- Robert Buels SGN Bioinformatics Analyst 252A Emerson Hall, Cornell University Ithaca, NY 14853 Tel: 503-889-8539 rmb32 at cornell.edu http://www.sgn.cornell.edu From skirov at utk.edu Wed Jul 19 17:54:03 2006 From: skirov at utk.edu (Stefan Kirov) Date: Wed, 19 Jul 2006 17:54:03 -0400 Subject: [Bioperl-l] Accessing TRANSFAC matrices In-Reply-To: <20060717145000.0aig7ymcmurk4wsk@webmail.embl.de> References: <20060717145000.0aig7ymcmurk4wsk@webmail.embl.de> Message-ID: <44BEA9FB.1070009@utk.edu> I have nothing to do with TFBS (except for using it). I suggest you contact Boris Lenhard who is behind TFBS. Please also send bioperl questions to the list. Finally, I believe TRANSFAC does not distribute the data files anymore. However, if you find out this is not the case, please let me know. Stefan ong at embl.de wrote: >HI , > > Good day, i am trying to retrieve TRANSFAC matrices via TFBS Perl module, but >it happens that about 50 matrices are missing after M00359 do you have any idea? >Also i wish to try using the Bio::Matrix::PSM::IO object, but can you advise how >do i get the matrix.dat which is a transfac file? > > Tahnks and hear for you soon. 
> >REgards, >Ong > > From bix at sendu.me.uk Thu Jul 20 02:49:45 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 20 Jul 2006 07:49:45 +0100 Subject: [Bioperl-l] Accessing TRANSFAC matrices In-Reply-To: <44BEA9FB.1070009@utk.edu> References: <20060717145000.0aig7ymcmurk4wsk@webmail.embl.de> <44BEA9FB.1070009@utk.edu> Message-ID: <44BF2789.1090204@sendu.me.uk> Stefan Kirov wrote: > Finally, I believe TRANSFAC does not distribute the data files anymore. > However, if you find out this is not the case, please let me know. They get distributed as Transfac 'Pro', for which you need a license (money). > ong at embl.de wrote: >> good day, i am trying to retrieve TRANSFAC matrices via TFBS Perl module, but >> it happens that about 50 matrices are missing after M00359 do you have any idea? What is meant by this? Missing from where? At the least, M00360 is accessible via the website (public database). >> Also i wish to try using the Bio::Matrix::PSM::IO object, but can you advise how >> do i get the matrix.dat which is a transfac file? http://www.biobase-international.com/pages/index.php?id=174 From dhoworth at mrc-lmb.cam.ac.uk Thu Jul 20 05:19:22 2006 From: dhoworth at mrc-lmb.cam.ac.uk (Dave Howorth) Date: Thu, 20 Jul 2006 10:19:22 +0100 Subject: [Bioperl-l] bioperl pulls, xml parsers push, and things get complicated In-Reply-To: <13edac5b13ed8208.13ed820813edac5b@emich.edu> References: <13edac5b13ed8208.13ed820813edac5b@emich.edu> Message-ID: <44BF4A9A.60100@mrc-lmb.cam.ac.uk> Stephen Gordon Lenk wrote: > I have found that POE fails to execute a periodic task after 32 > iterations in a Perl thread, consistent failure on both XP and OSX - > if I knew how to write up a defect for Perl I would do this (hint ? > how is this done - I'm *not* asking RTFM etc) Generally: Go to http://search.cpan.org and search for the module (POE). Click on the distribution link, rather than the doc link (i.e. POE-0.3502, which takes you to http://search.cpan.org/~rcaputo/POE-0.3502/). 
Click on the View/Report Bugs link. Check through the existing bugs and if it's not there click on the Report a new bug link.

Cheers, Dave

From georg.otto at tuebingen.mpg.de Thu Jul 20 06:53:53 2006
From: georg.otto at tuebingen.mpg.de (Georg Otto)
Date: Thu, 20 Jul 2006 12:53:53 +0200
Subject: [Bioperl-l] Features in SeqIO GenBank output
Message-ID:

Hi,

this is probably a FAQ but I could not find anything to solve it.

I want to get sequences from GenBank and save them in GenBank format. This works with the script shown below, but the "Features" part is missing and contains references instead (see below). How can I print out the complete GenBank entry?

I am running Bioperl 1.5, Perl 5.8.6, Mac 10.4.7

Best,

Georg

Here is my script:

use strict;
use warnings;

use Bio::Seq;
use Bio::SeqIO;
use Bio::DB::GenBank;

my $acc = 'AB017118';
my $db_obj = Bio::DB::GenBank->new();
my $seq_obj = $db_obj->get_Seq_by_acc($acc);
my $out = Bio::SeqIO->new(-format => 'genbank',
                          -file => '>output.gb');
$out->write_seq($seq_obj);

Here is the output:

LOCUS       AB017118                2038 bp    mRNA    linear   VRT 06-JUN-2006
DEFINITION  Danio rerio mRNA for ornithine decarboxylase antizyme long
            isoform, complete cds.
ACCESSION   AB017118
VERSION     AB017118.1  GI:4239978
KEYWORDS    .
SOURCE      Danio rerio (zebrafish)
  ORGANISM  Danio rerio
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Actinopterygii; Neopterygii; Teleostei; Ostariophysi;
            Cypriniformes; Cyprinidae; Danio.
REFERENCE   1
  AUTHORS   Saito,T., Hascilowicz,T., Ohkido,I., Kikuchi,Y., Okamoto,H.,
            Hayashi,S., Murakami,Y. and Matsufuji,S.
  TITLE     Two zebrafish (Danio rerio) antizymes with different expression
            and activities
  JOURNAL   Biochem. J. 345 PT 1, 99-106 (2000)
   PUBMED   10600644
REFERENCE   2  (bases 1 to 2038)
  AUTHORS   Matsufuji,S. and Saito,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (23-AUG-1998) Senya Matsufuji, Jikei University School
            of Medicine, Department of Biochemistry II; 3-25-8 Nishishinbashi,
            Minato-ku, Tokyo 105-8461, Japan (E-mail:senya at jikei.ac.jp,
            Tel:+81-3-3433-1111(ex.2276), Fax:+81-3-3436-3897)
FEATURES             Location/Qualifiers
     source          1..2038
                     /db_xref="Bio::Annotation::SimpleValue=HASH(0x19b9a28)"
                     /mol_type="Bio::Annotation::SimpleValue=HASH(0x19b9b6c)"
                     /dev_stage="Bio::Annotation::SimpleValue=HASH(0x19b9bb4)"
                     /organism="Bio::Annotation::SimpleValue=HASH(0x19bfe18)"
                     /clone_lib="Bio::Annotation::SimpleValue=HASH(0x19bfe60)"
     CDS             join(45..224,226..702)
                     /db_xref="Bio::Annotation::SimpleValue=HASH(0x19c0960)"
                     /ribosomal_slippage="Bio::Annotation::SimpleValue=HASH(0x1
                     9beecc)"
                     /codon_start=Bio::Annotation::SimpleValue=HASH(0x19bef14)
                     /protein_id="Bio::Annotation::SimpleValue=HASH(0x19bef5c)"
                     /translation="Bio::Annotation::SimpleValue=HASH(0x19befa4)
                     "
                     /product="Bio::Annotation::SimpleValue=HASH(0x19befec)"
                     /note="Bio::Annotation::SimpleValue=HASH(0x19bf034)"
     CDS             45..227
                     /db_xref="Bio::Annotation::SimpleValue=HASH(0x19bee24)"
                     /codon_start=Bio::Annotation::SimpleValue=HASH(0x19bf160)
                     /protein_id="Bio::Annotation::SimpleValue=HASH(0x19bf1cc)"
                     /translation="Bio::Annotation::SimpleValue=HASH(0x19c1830)
                     "
                     /note="Bio::Annotation::SimpleValue=HASH(0x19c1878)"
     polyA_signal    2017..2022
     polyA_site      2038
                     /note="Bio::Annotation::SimpleValue=HASH(0x19bffc8)"
BASE COUNT      439 a    377 c    532 g    690 t
ORIGIN
        1 cagcagccga gcgcacaggc cgccgtgaaa cctcccgagg ccggatggta aaatccaacc

     1981 ttatcctcta tagtggtaca cctttgcttc tgtcataata aaaccattat ttaaagac
//

From cjfields at uiuc.edu Thu Jul 20 08:43:08 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 20 Jul 2006 07:43:08 -0500
Subject: [Bioperl-l] Features in SeqIO GenBank output
In-Reply-To:
References:
Message-ID: <73C89D17-91FE-47E4-80C1-AA6A689FA14E@uiuc.edu>

I'll give it a look. You might try upgrading to Bioperl 1.5.1 to see if this was fixed.
Chris

On Jul 20, 2006, at 5:53 AM, Georg Otto wrote:

> [...]

Christopher Fields
Postdoctoral Researcher
Lab of Dr. 
Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign

From bix at sendu.me.uk Thu Jul 20 09:35:43 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Thu, 20 Jul 2006 14:35:43 +0100
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44BBBB69.6000906@sendu.me.uk>
References: <44BBBB69.6000906@sendu.me.uk>
Message-ID: <44BF86AF.8080408@sendu.me.uk>

Sendu Bala wrote:
> node 2 has name 'Bacteria <...>' and rank 'superkingdom'
> node 1386 has name 'Bacillus <...>' and rank 'genus'
> node 7776 has name 'Gnathostomata <...>' and rank 'superclass'
> etc.
>
> For me the bits in <> are inappropriate and shouldn't be there.
> [...]
> If there are no objections I'll strip the <> bits. I also plan to make
> $node->name('scientific', 'sapiens'); set and get the node name, and
> have flatfile and entrez store all common names with
> $obj->name('common', 'human', 'man');.

I'll describe all the changes I've now made and if no-one complains I'll
commit. (I've also made these notes into bug 2047 for easier reference
in the future.)


Bio::DB::Taxonomy::flatfile
---------------------------
# Bug-fixes
Removed invalid requirement that all species nodes have at least 7
named-rank parents.

The names->id solution used by get_taxonid() only stored the last id
associated with a name. However the name used wasn't necessarily unique,
such that multiple ids could match. names->id solution now remembers all
ids that match a name.

API-CHANGE: for this reason I've renamed get_taxonid() to get_taxonids()
and it returns an array of ids in list context. For backward
compatibility it returns one of the ids in scalar context, and
*get_taxonid = \&get_taxonids.

Added missing division ENV 'Environmental samples'.

# Improvements
Like Bio::DB::Taxonomy::entrez, flatfile now retrieves and stores the
common names, genetic code and mitochondrial genetic code in each node
it makes.
NOTE: entrez also stores creation, publication and update dates, but this
data is not available in the taxdump from the NCBI ftp site.

NOTE: the common names are stored in no particular order; the genbank
common name in particular isn't necessarily the first in the list (cf.
old entrez.pm behaviour).

BEHAVIOUR-CHANGE: flatfile used to store within the nodes it makes the
division as a three letter code, like 'PRI'. However, for consistency
with entrez and the scientific_name() of the node the division is
supposed to correspond to, it is now stored as the full name, like
'Primates'.

The names->id solution also stores the artificially uniqued names like
'Craniata <chordata>', allowing you for the first time to retrieve the
correct id. Previously the search would have simply failed completely.

The names->id solution now handles nodes with scientific names of
'xyz (class)', allowing you to retrieve the id with both
get_taxonids('xyz') and get_taxonids('xyz (class)'). Previously only the
latter would work.

NOTE: the previous 2 changes (and the issues with entrez, see below) make
flatfile better at searching the taxonomy database than the entrez module
or the website, both in terms of speed and completeness of results.

BEHAVIOUR-CHANGE: The scientific name field isn't touched in any way,
always being sent directly to Bio::Taxonomy::Node->new(-name =>
$untouched) or the $node->classification() array. Previously, a species
node would have its name converted from 'Homo sapiens' to 'sapiens', but
the conversion mangled very badly certain other species names.


Bio::DB::Taxonomy::entrez
-------------------------
# Bug-fixes
Special characters like ", ( and ) in the input query string to
get_taxonid() resulted in the failure or inaccuracy of the search. These
characters are now removed prior to submission, allowing for correct
search results.
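
As an illustration only (this is not the code that was committed), removing
those characters from a query before it is submitted could look like the
sketch below. The helper name clean_query() and the whitespace tidying are
assumptions made for the example; the notes above only name ", ( and ) as
the characters that are removed.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's actual code: drop the characters
# named in the notes (double quote and parentheses) from a taxon name
# before it is used as a search query.
sub clean_query {
    my ($query) = @_;
    (my $clean = $query) =~ tr/"()//d;   # delete ", ( and )
    $clean =~ s/\s+/ /g;                 # collapse runs of whitespace
    $clean =~ s/^ | $//g;                # trim a leading/trailing space
    return $clean;
}

print clean_query('Homo "sapiens" (human)'), "\n";
```

Note that stripping the parentheses changes the text of the query itself, so
whether the result still matches the intended name is up to the search
service.
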
API-CHANGE: entrez has always been able to return multiple ids that match
a single input name, so I've renamed get_taxonid() to get_taxonids() and
it returns an array of ids in list context. It returns one of the ids in
scalar context. For backward compatibility,
*get_taxonid = \&get_taxonids.

NOTE: entrez modules (and website) cannot cope with '<...>' in the
query, failing searches like 'Craniata <chordata>'. For this reason, if
get_taxonids() is given a query with '<...>' it will immediately return
undefined, saving a pointless website access. If you want the id of
'Craniata <chordata>' you must search for 'Craniata', then get the node
for each returned id to see which one has a parent node with a
scientific_name() or common_names() case-insensitive matching to
'chordata'.

# Improvements
BEHAVIOUR-CHANGE: now throws on failure to retrieve data from website.

BEHAVIOUR-CHANGE: the ScientificName field isn't touched except for
s/ \(class\)$//, being sent directly to
Bio::Taxonomy::Node->new(-name => $untouched) or the
$node->classification() array. Previously, a species node would have its
name converted from 'Homo sapiens' to 'sapiens', but the conversion
mangled very badly certain other species names.

BEHAVIOUR-CHANGE: all common names of a node are now stored in the
resulting Node object with
Bio::Taxonomy::Node->new(-common_names => \@names). This means that the
Genbank common name is now just one amongst others, and isn't guaranteed
to be the first in the list either.


Bio::Taxonomy::Node
-------------------
# Bug-fixes
Uninteresting fixes to get get_Children_Nodes(), get_Lineage_Nodes() and
get_LCA_Node() to work correctly.

classification() has a proper solution to finding the classification
when the array wasn't manually set.

# Improvements
BEHAVIOUR-CHANGE: node_name() used to be an alias to name('common'). Now
it is an alias to name('scientific').
NOTE: node_name is what is set when ->new(-name => $name) is called, so
flatfile and entrez and user-created nodes now implicitly associate the
name of the node they create with its scientific name.

BEHAVIOUR-CHANGE: scientific_name() used to be an alias to binomial().
Now it is *scientific_name = \&node_name.

binomial(), in addition to working the old way (assume first two
elements of classification array are species and genus, combine them),
will shortcut and return the scientific_name() if we are a node with
rank 'species' and scientific_name is two words. This makes binomial()
an effective synonym of scientific_name() when Nodes were constructed as
per flatfile or entrez, and when it is used correctly on a species node.

BEHAVIOUR-CHANGE: *parent_taxon_id = \&parent_id. (Previously, you could
assign and retrieve different values to/from each method.)

New method common_names() supersedes common_name(), returning a list of
all common names. For backward compatibility, returns one of the names
in scalar context, and *common_name = \&common_names.

-factory and factory() removed, since there is no
Bio::Taxonomy::FactoryI-implementing module, nothing in Node to make use
of a factory once set, and a factory seems redundant when we're a node
with a -dbh.

species() and genus() issue a warning when you try to use them on a node
that isn't of rank 'species' (since they interact with the
classification array and not names('method') like the other similar
methods).

validate_name() removed because it just returns 1.

validate_species_name() removed because species() can (should) now
contain the real species name, like 'Homo sapiens', not 'sapiens'. But
it could also be any wonderfully complex thing, so there's nothing we
can confidently check for as being 'correct'.


t/Taxonomy.t
------------
Runs a slightly more comprehensive set of tests on entrez, which are now
only skipped if data retrieval fails. Tests flatfile on a cut-down
version of the taxdump.
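
The list/scalar behaviour of get_taxonids() and the *get_taxonid alias
described in the notes above can be sketched in plain Perl. This is a toy
model rather than the module code: the %ids_for_name hash stands in for the
names->id index, and the names and ids in it are made up for illustration.

```perl
use strict;
use warnings;

# Toy stand-in for the names->id index: every name maps to ALL ids that
# carry it. The ids here are invented, not real NCBI taxon ids.
my %ids_for_name = (
    'Craniata' => [ 11, 22 ],
    'xyz'      => [ 33 ],
);

sub get_taxonids {
    my ($name) = @_;
    my @ids = @{ $ids_for_name{$name} || [] };
    # List context: every matching id.
    # Scalar context: just one of them, so old single-id callers still work.
    return wantarray ? @ids : $ids[0];
}

# Keep the old method name working by aliasing it to the new sub.
*get_taxonid = \&get_taxonids;

my @all = get_taxonids('Craniata');   # list context, new name
my $one = get_taxonid('Craniata');    # scalar context, old name
print "all: @all\n";
print "one: $one\n";
```

Because the alias points at the same sub, old callers that use get_taxonid()
in list context would also start receiving every matching id, which is why
the rename is flagged as an API change.
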
> I'll also fix the problem with node names for ranks species and lower,
> as discussed in thread 'Bio::DB::Taxonomy:: mishandles species,
> subspecies/variant names', in the way I suggested there.

This hasn't been done per se, because we now store the real
ScientificName so there is no 'mishandling' to fix.

From bix at sendu.me.uk Thu Jul 20 09:49:04 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Thu, 20 Jul 2006 14:49:04 +0100
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44BF86AF.8080408@sendu.me.uk>
References: <44BBBB69.6000906@sendu.me.uk> <44BF86AF.8080408@sendu.me.uk>
Message-ID: <44BF89D0.7090103@sendu.me.uk>

Sendu Bala wrote:
>
> Bio::DB::Taxonomy::flatfile
>
> BEHAVIOUR-CHANGE: The scientific name field isn't touched in any way,
> always being sent directly to Bio::Taxonomy::Node->new(-name =>
> $untouched) or the $node->classification() array. Previously, a species
> node would have its name converted from 'Homo sapiens' to 'sapiens', but
> the conversion mangled very badly certain other species names.
[...]
> Bio::DB::Taxonomy::entrez
>
> BEHAVIOUR-CHANGE: the ScientificName field isn't touched except for
> s/ \(class\)$//, being sent directly to Bio::Taxonomy::Node->new(-name =>
> $untouched) or the $node->classification() array. Previously, a species
> node would have its name converted from 'Homo sapiens' to 'sapiens', but
> the conversion mangled very badly certain other species names.

Oops. In both cases the scientific name has ' (class)' removed from it,
but the original name (with ' (class)') is stored as one of the common
names.

From georg.otto at tuebingen.mpg.de Thu Jul 20 10:29:33 2006
From: georg.otto at tuebingen.mpg.de (Georg Otto)
Date: Thu, 20 Jul 2006 16:29:33 +0200
Subject: [Bioperl-l] Features in SeqIO GenBank output
References: <73C89D17-91FE-47E4-80C1-AA6A689FA14E@uiuc.edu>
Message-ID:

This indeed seems to be the case. After upgrading it works fine.

Sorry for stealing your time.
Georg

Chris Fields writes:

> I'll give it a look. You might try upgrading to Bioperl 1.5.1 to see
> if this was fixed.
>
> Chris
>
> On Jul 20, 2006, at 5:53 AM, Georg Otto wrote:
>
>> [...]
>
> Christopher Fields
> Postdoctoral Researcher
> Lab of Dr. Robert Switzer
> Dept of Biochemistry
> University of Illinois Urbana-Champaign

From prabubio at gmail.com Thu Jul 20 12:01:35 2006
From: prabubio at gmail.com (Prabu R)
Date: Thu, 20 Jul 2006 21:31:35 +0530
Subject: [Bioperl-l] Blast Output Parsing
Message-ID:

Dear All!

I am now trying to parse a Blast output using PERL.

I have to extract each alignment and have to parse the alignment. I mean, I
have to check whether a particular part of the given sequence got aligned
100%.

Anybody please tell me what module in PERL I have to use for getting this.

I've tried Bio::SearchIO. But I didnt get any method to get the alignment.

Kindly help.

Thanks,
R. Prabu

From cjfields at uiuc.edu Thu Jul 20 13:03:17 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 20 Jul 2006 12:03:17 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44BF86AF.8080408@sendu.me.uk>
Message-ID: <002901c6ac1e$66ea3820$15327e82@pyrimidine>

These all seem fine to me. Fantastic work! I added some comments but
everything seems fine to me.

I still plan on switching Bio::DB::Taxonomy::entrez to use
Bio::DB::EUtilities at some point but probably won't get around to it until
August; I still need to write up tests for the EUtilities modules. I may add
a method for retrieving tax data based on protein/nucleotide sequence
primary ID and relevant sequence database, so you could directly retrieve
the relevant TaxID w/o parsing sequences directly for them. This would
mainly be useful if you gather GIs from a BLAST search, for instance.

Anyway, I could add this in then base class Bio::DB::Taxonomy directly so
one could used the retrieved TaxIDs for flat-file or entrez searches; this
requires, of course, access to the remote Entrez database (it would use
ELink). Would that be of interest? If so, I'll work on that and add
relevant tests to Taxonomy.t when I can.

> Bio::DB::Taxonomy::flatfile
> ---------------------------
...
> API-CHANGE: for this reason I've renamed get_taxonid() to get_taxonids()
> and it returns an array of ids in list context. For backward
> compatibility it returns one of the ids in scalar context, and
> *get_taxonid = \&get_taxonids.

Returning a scalar makes sense as long as it's noted in the POD. I have seen
similar methods return an array ref based on wantarray instead of a scalar,
but that largely depends on the complexity of the array (an array of hashes,
for instance).

...
> Bio::DB::Taxonomy::entrez
> -------------------------
...
> NOTE: entrez modules (and website) cannot cope with '<...>' in the
> query, failing searches like 'Craniata <chordata>'. For this reason, if
> get_taxonids() is given a query with '<...>' it will immediately
> return undefined, saving a pointless website access. If you want the id
> of 'Craniata <chordata>' you must search for 'Craniata', then get the
> node for each returned id to see which one has a parent node with a
> scientific_name() or common_names() case-insensitive matching to
> 'chordata'.

It may be something with the esearch interface, though the direct TaxBrowser
query also seems to have problems with this:

http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/

I'll try looking into it to see if there is a more direct way to get those
(there probably isn't).

> # Improvements
> BEHAVIOUR-CHANGE: now throws on failure to retrieve data from website.
>
> BEHAVIOUR-CHANGE: the ScientificName field isn't touched except for
> s/ \(class\)$//, being sent directly to Bio::Taxonomy::Node->new(-name =>
> $untouched) or the $node->classification() array. Previously, a species
> node would have its name converted from 'Homo sapiens' to 'sapiens', but
> the conversion mangled very badly certain other species names.

This actually relates to the similar comment made for
Bio::DB::Taxonomy::flatfile. The mangling probably depends on the current
node and whether using flatfile or XML (entrez).
Most of the odd XML examples I posted before, where the TaxID associated with a sequence had extra data, were a rank of 'no rank'. The species rank, if present, has a normal binomial name for : Flavobacterium johnsoniae UW101 ... Flavobacterium johnsoniae species Pseudomonas putida F1 ... Pseudomonas putida species Caldicellulosiruptor saccharolyticus DSM 8903 ... Caldicellulosiruptor saccharolyticus species The genus rank has one name; the subspecies rank has the full species name with 'subsp.' followed by the subspecies name. So, if using XML, one could use the taxon subelements stored in the XML element to sort out genus(), species(), subspecies(), and also higher order elements if someone wanted to implement them. This, of course, isn't necessary for the current changes, but down the road if anybody wanted it... ... > Bio::Taxonomy::Node > ------------------- ... > species() and genus() issue a warning when you try to use them on a node > that isn't of rank 'species' (since they interact with the > classification array and not names('method') like the other similar > methods). I would just have genus() and species() issue warnings if they aren't set to a particular value. So, if the current node is at the genus rank, genus() will be set but species() won't be. And no need to do additional checking! Fabulous work Sendu! Chris From cjfields at uiuc.edu Thu Jul 20 13:23:14 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 20 Jul 2006 12:23:14 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BF89D0.7090103@sendu.me.uk> Message-ID: <002a01c6ac21$2ed16190$15327e82@pyrimidine> Just thought of something... You had mentioned using a stripped-down version of Bio::Taxonomy::Node previously, which led to a bit of contention. 
One way to make everybody happy would be to create an interface class that
contains the basic shared methods (Bio::Taxonomy::NodeI), then have the
currently-named Bio::Taxonomy::Node (which could be renamed to
Bio::Taxonomy::Species or something similar) implement those methods along
with the current methods. Another class (your stripped down version, which
could then be Bio::Taxonomy::Node) would also implement whatever base class
methods were needed. They would both be Bio::Taxonomy::NodeI-implementing,
so you could use either object type where required.

         |------Node
NodeI----|
         |------Species

Another option would be to have Bio::Taxonomy::Node itself stripped down,
then have another class (Bio::Taxonomy::Species) inherit methods from it and
also implement additional methods (genus(), species(), etc).

Node----Species

Would something like that be feasible? I favor the interface version as it
sticks with the interface-implementation design that Bioperl has been
migrating towards:

http://www.bioperl.org/wiki/Advanced_BioPerl#Bioperl_Interface_design

This would also help out with the whole Bio::Species issue; just have
Bio::Taxonomy::Species replace it.

Chris

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org
> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Sendu Bala
> Sent: Thursday, July 20, 2006 8:49 AM
> To: bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] Bio::*Taxonomy* changes
>
> Sendu Bala wrote:
> >
> > Bio::DB::Taxonomy::flatfile
> >
> > BEHAVIOUR-CHANGE: The scientific name field isn't touched in any way,
> > always being sent directly to Bio::Taxonomy::Node->new(-name =>
> > $untouched) or the $node->classification() array. Previously, a species
> > node would have its name converted from 'Homo sapiens' to 'sapiens', but
> > the conversion mangled very badly certain other species names.
> [...]
> > Bio::DB::Taxonomy::entrez
> >
> > BEHAVIOUR-CHANGE: the ScientificName field isn't touched except for
> > s/ \(class\)$//, being sent directly to Bio::Taxonomy::Node->new(-name =>
> > $untouched) or the $node->classification() array. Previously, a species
> > node would have its name converted from 'Homo sapiens' to 'sapiens', but
> > the conversion mangled very badly certain other species names.
>
> Oops. In both cases the scientific name has ' (class)' removed from it,
> but the original name (with ' (class)') is stored as one of the common
> names.
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

From cjfields at uiuc.edu Thu Jul 20 13:31:42 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 20 Jul 2006 12:31:42 -0500
Subject: [Bioperl-l] Blast Output Parsing
In-Reply-To:
Message-ID: <002b01c6ac22$5d75e1f0$15327e82@pyrimidine>

Grab the HSPs, then use get_aln() to generate a Bio::SimpleAlign object. You
can then use Bio::AlignIO to generate the alignment output if needed, or use
the Bio::SimpleAlign methods to get what you want.

http://www.bioperl.org/wiki/HOWTO:Beginners
http://www.bioperl.org/wiki/HOWTO:SearchIO
http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/SimpleAlign.html

Chris

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org
> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Prabu R
> Sent: Thursday, July 20, 2006 11:02 AM
> To: bioperl-l at lists.open-bio.org
> Subject: [Bioperl-l] Blast Output Parsing
>
> Dear All!
>
> I am now trying to parse a Blast output using PERL.
>
> I have to extract each alignment and have to parse the alignment. I mean, I
> have to check whether a particular part of the given sequence got aligned
> 100%.
>
> Anybody please tell me what module in PERL I have to use for getting this.
>
> I've tried Bio::SearchIO. 
But I didnt get any method to get the > alignment. > > Kindly help. > > Thanks, > R. Prabu > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From bix at sendu.me.uk Thu Jul 20 13:53:03 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 20 Jul 2006 18:53:03 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <002901c6ac1e$66ea3820$15327e82@pyrimidine> References: <002901c6ac1e$66ea3820$15327e82@pyrimidine> Message-ID: <44BFC2FF.3030704@sendu.me.uk> Chris Fields wrote: > > I still plan on switching Bio::DB::Taxonomy::entrez to use > Bio::DB::EUtilities at some point but probably won't get around to it until > August; If I may make two feature requests (you've probably already done them, if so apologies)? a) Automatically enforce the 3second wait rule when querying via the ncbi website. b) Automatically cache results locally in a reasonable way, such that repeated queries aiming to get the same result don't have to go via the website. > Anyway, I could add this in then base class Bio::DB::Taxonomy directly so > one could used the retrieved TaxIDs for flat-file or entrez searches; this > requires, of course, access to the remote Entrez database (it would use > ELink). Would that be of interest? Sorry, I don't really understand this paragraph. I'm unable to parse '...then base class Bio::DB::Taxonomy directly so...', for starters. >> Bio::Taxonomy::Node >> ------------------- > > ... > >> species() and genus() issue a warning when you try to use them on a node >> that isn't of rank 'species' (since they interact with the >> classification array and not names('method') like the other similar >> methods). > > I would just have genus() and species() issue warnings if they aren't set to > a particular value. So, if the current node is at the genus rank, genus() > will be set but species() won't be. And no need to do additional checking! 
The problem is, genus() and species() are special cases that aren't normally directly set. They get their values from the classification array: genus() returns (classification())[1] and species() returns (classification())[0]. They set the same values. Doing this is only sane (though is still likely to be wrong, given that there can be ranks between species and genus) when the node is of rank 'species', hence the warnings. I imagine this is to work with pesky file formats like genbank, so I can't really change anything here without major overhaul. And my plans for overhaul involve getting rid of genus() and species(), so I'll just leave them be for now. Anyway, thanks for your comments and input into this thread! It's much appreciated. From bix at sendu.me.uk Thu Jul 20 13:55:56 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 20 Jul 2006 18:55:56 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <002a01c6ac21$2ed16190$15327e82@pyrimidine> References: <002a01c6ac21$2ed16190$15327e82@pyrimidine> Message-ID: <44BFC3AC.8010704@sendu.me.uk> Chris Fields wrote: > Just thought of something... > > You had mentioned using a stripped-down version of Bio::Taxonomy::Node > previously, which led to a bit of contention. One way to make everybody > happy would be to create an interface class that contains the basic shared > methods (Bio::Taxonomy::NodeI), then have the currently-named > Bio::Taxonomy::Node (which could be renamed to Bio::Taxonomy::Species or > something similar) implement those methods along with the current methods. > Another class (your stripped down version, which could then be > Bio::Taxonomy::Node) would also implement whatever base class methods were > needed. They would both be Bio::Taxonomy::NodeI-implementing, so you could > use either object type where required. > > |------Node > NodeI----| > |------Species [...] 
> I favor the interface version as it > sticks with the interface-implementation design that Bioperl has been > migrating towards: > > http://www.bioperl.org/wiki/Advanced_BioPerl#Bioperl_Interface_design > > This would also help out with the whole Bio::Species issue; just have > Bio::Taxonomy::Species replace it. Yes, this sounds good to me. Should I still wait until Jason/elders are able to comment before I start exploring this avenue? From cjfields at uiuc.edu Thu Jul 20 14:21:48 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 20 Jul 2006 13:21:48 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BFC3AC.8010704@sendu.me.uk> Message-ID: <000601c6ac29$5d533a90$15327e82@pyrimidine> I would say go ahead, why not? This would likely lead to the eventual deprecation of Bio::Species, which was in the cards anyway. The only problem I can foresee is which class to use with Bio::DB::Taxonomy*? I guess one could settle on one class by default and have the option to use another Bio::Taxonomy::NodeI-implementing class if you wanted more data/methods available... Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Sendu Bala > Sent: Thursday, July 20, 2006 12:56 PM > To: bioperl-l at bioperl.org > Subject: Re: [Bioperl-l] Bio::*Taxonomy* changes > > Chris Fields wrote: > > Just thought of something... > > > > You had mentioned using a stripped-down version of Bio::Taxonomy::Node > > previously, which led to a bit of contention. One way to make everybody > > happy would be to create an interface class that contains the basic > shared > > methods (Bio::Taxonomy::NodeI), then have the currently-named > > Bio::Taxonomy::Node (which could be renamed to Bio::Taxonomy::Species or > > something similar) implement those methods along with the current > methods. 
> > Another class (your stripped down version, which could then be > > Bio::Taxonomy::Node) would also implement whatever base class methods > were > > needed. They would both be Bio::Taxonomy::NodeI-implementing, so you > could > > use either object type where required. > > > > |------Node > > NodeI----| > > |------Species > [...] > > I favor the interface version as it > > sticks with the interface-implementation design that Bioperl has been > > migrating towards: > > > > http://www.bioperl.org/wiki/Advanced_BioPerl#Bioperl_Interface_design > > > > This would also help out with the whole Bio::Species issue; just have > > Bio::Taxonomy::Species replace it. > > Yes, this sounds good to me. Should I still wait until Jason/elders are > able to comment before I start exploring this avenue? > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From hlapp at gmx.net Thu Jul 20 14:24:19 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 20 Jul 2006 14:24:19 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BFC3AC.8010704@sendu.me.uk> References: <002a01c6ac21$2ed16190$15327e82@pyrimidine> <44BFC3AC.8010704@sendu.me.uk> Message-ID: On Jul 20, 2006, at 1:55 PM, Sendu Bala wrote: > > Yes, this sounds good to me. Should I still wait until Jason/elders > are > able to comment before I start exploring this avenue? Unless you're afraid that your suggestions are going too wild for our palate please do go ahead. The joy of CVS is we can always go back. For my part, I just haven't been able to keep up with the flurry of long emails ... 
I'll have to do some extensive bedtime reading (and then writing ;) soon I guess :-)

	-hilmar
--
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================

From saunders at uchicago.edu Thu Jul 20 17:47:08 2006
From: saunders at uchicago.edu (Matthew A. Saunders)
Date: Thu, 20 Jul 2006 16:47:08 -0500 (CDT)
Subject: [Bioperl-l] installing bioperl
Message-ID:

Dear Bioperl representative,

I have been trying to install bioperl (in order to ultimately run some Ensembl APIs) but I seem to be having some problems with the bioperl installation.

I have followed the installation directions and I get to the last steps of the "make" process, yet this stage fails with the error message below. Can you possibly tell me what the problem is? I am not sure that I understand the command "make", but I think that it requires that there be a file named "makefile" in the given folder. When I look in my newly formed "bioperl-1.4" folder there is no "makefile" in there. Perhaps that is a problem. If so, how might I rectify the matter?

Thanks!

Matt

*************************************************************
. . .
Enjoy the rest of bioperl, which you can use after going 'make install'

Checking if your kit is complete...
Looks good
/usr/bin/perl: symbol lookup error: /usr/lib64/perl5/5.8.5/x86_64-linux-thread-multi/auto/DB_File/DB_File.so: undefined symbol: db_version
Running make test
  Make had some problems, maybe interrupted? Won't test
Running make install
  Make had some problems, maybe interrupted? Won't install
***************************************************************

-----------------------------------------------------
Matthew A. Saunders
UNCF-MERCK Postdoctoral Research Fellow

Dept.
of Ecology and Evolution University of Chicago (773)834-3964 Skype: mattsaunders555 http://home.uchicago.edu/~saunders ------------------------------------------------------- From saunders at uchicago.edu Thu Jul 20 18:01:53 2006 From: saunders at uchicago.edu (Matthew A. Saunders) Date: Thu, 20 Jul 2006 17:01:53 -0500 (CDT) Subject: [Bioperl-l] installing bioperl In-Reply-To: References: Message-ID: In continuation to my described problem, I have just installed the bioperl-run file from the .tar.gz format and that was successful through the "perl Makefile.PL" and the "make" & "make test" phases. It is the "bioperl core" file that is still giving me the problems described below. Thanks! Matt ******************************** On Thu, 20 Jul 2006, Matthew A. Saunders wrote: > Dear Bioperl representative, > > I have been trying to install bioperl (in order to ultimately run some > Ensembl APIs) but I seem to be having some problems with the bioperl > installation. > > I have followed the installation directions and I get to the last steps of > the "make" process, yet this stage fails with the error message below. Can > you possibly tell me what is the problem. I am not sure that I understand > the command "make", but I think that it requires that there be a file named > "makefile" in the given folder, when I look in my newly formed "bioperl-1.4" > folder there is no "makefile" in there. Perhaps that is a problem. If so, > how might I rectify the matter? > > Thanks! > > Matt > > > ************************************************************* . . . > Enjoy the rest of bioperl, which you can use after going 'make install' > > Checking if your kit is complete... > Looks good > /usr/bin/perl: symbol lookup error: > /usr/lib64/perl5/5.8.5/x86_64-linux-thread-multi/auto/DB_File/DB_File.so: > undefined symbol: db_version > Running make test > Make had some problems, maybe interrupted? Won't test > Running make install > Make had some problems, maybe interrupted? 
Won't install > *************************************************************** > > > > ----------------------------------------------------- > Matthew A. Saunders > UNCF-MERCK Postdoctoral Research Fellow > > Dept. of Ecology and Evolution > University of Chicago > (773)834-3964 > Skype: mattsaunders555 > http://home.uchicago.edu/~saunders > ------------------------------------------------------- > > From bix at sendu.me.uk Thu Jul 20 18:47:33 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 20 Jul 2006 23:47:33 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <001001c6ac2f$02c8c3f0$15327e82@pyrimidine> References: <001001c6ac2f$02c8c3f0$15327e82@pyrimidine> Message-ID: <44C00805.7090403@sendu.me.uk> Chris Fields wrote: > As for caching, > do you mean caching of the tax information or the sequence ID information? Anything you get from entrez. > Caching of tax information would be great, but how would you go about it? I > can see how it would be easy to have a cache for the flatfile using a local > index, but not so much for XML data retrieved from Entrez (a > continually-appended local file, maybe, with a n accompanying index file?). I didn't actually mean a stored file (but that would be possible with a tied hash or something: DB_File, just like flatfile), but an in-memory one for use during the course of program execution. Stored file would probably be dangerous because you wouldn't know if the data has become stale or not - and checking to see if it wasn't would defeat the point. >> The problem is, genus() and species() are special cases that aren't >> normally directly set. They get their values from the classification >> array: genus() returns (classification())[1] and species() returns >> (classification())[0]. They set the same values. Doing this is only sane >> (though is still likely to be wrong, given that there can be ranks >> between species and genus) when the node is of rank 'species', hence the >> warnings. 
>>
>> I imagine this is to work with pesky file formats like genbank, so I
>> can't really change anything here without major overhaul. And my plans
>> for overhaul involve getting rid of genus() and species(), so I'll just
>> leave them be for now.
>
> This would all depend on where the information came from; if the information
> came from the Entrez XML element data:
> [snip]
>
> The subspecies(), genus(), and species() could all be set from this instead
> of the classification array. The problem lies then with the flatfile data
> and how it would be parsed out, if that's at all possible with the flatfile
> data. If not, I see why you would rather have this return a stripped-down
> Bio::Taxonomy::Node object.
>
> I would have to look at how everything is indexed in
> Bio::DB::Taxonomy::entrez, but I think it's feasible.

entrez already parses through LineageEx to build the classification array. flatfile walks up all the parents to do the same. Having the information isn't the issue. We have the information. The methods genus() and species() need to work with the genbank file format; that is the problem.

From MEC at stowers-institute.org Thu Jul 20 18:40:55 2006
From: MEC at stowers-institute.org (Cook, Malcolm)
Date: Thu, 20 Jul 2006 17:40:55 -0500
Subject: [Bioperl-l] Remote Blast - Blast Human Genome
Message-ID:

Rohan,

'snp/human/human_snp' is the database name you need to use to blast into the human SNP database at NCBI.

See the following document for the full list (the link was provided to me via personal correspondence with the NCBI helpdesk). Very useful...

Hmm, looking again, there now appear to be two versions:

http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/blastdblist.html (last updated 2/7/2006)
http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/remote_accessible_blastdblist.html (last updated 5/29/2006)

Neither is linked to by any other document on the internet (google sez), including anywhere else at NCBI. Go figure.
It should be IMHO since this info is nowhere else collected. Of course it may be out of date, but it always has got me through. Good luck Malcolm Cook - mec at stowers-institute.org - 816-926-4449 Database Applications Manager - Bioinformatics Stowers Institute for Medical Research - Kansas City, MO USA >-----Original Message----- >From: bioperl-l-bounces at lists.open-bio.org >[mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Chris Fields >Sent: Monday, July 17, 2006 4:26 PM >To: vrramnar at student.cs.uwaterloo.ca; bioperl-l at lists.open-bio.org >Subject: Re: [Bioperl-l] Remote Blast - Blast Human Genome > >Okay, I think I may know what's going on a little more now >with NCBI's BLAST >interface. Looks like any NCBI BLAST query must use the >default URL and so >must set up to proper GET/PUT commands to retrieve everything >correctly. > >Here's the API description for it all: > >http://www.ncbi.nlm.nih.gov/BLAST/Doc/urlapi.html > >You could try setting the database to 'snp' or something along >those lines >instead of 'nr'; or you could see what the name of the >database is when you >use the web form and try setting it to that. According to >this page, this >should be possible: > >http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=helpsnpfaq.sectio >n.SearchdbSNP >_test._Search_dbSNP_Using_B > >The Entrez Query limit was a recommendation for limiting your >search to a >set of sequences for human, for instance. > >I'll try looking into it a bit more but I'm pretty busy. If you find >anything out you should probably post it here . > >Chris > >> Hi Chris, >> >> 1. I have tried changing the database to snp or dbSNP but >neither works. >> It >> seems that depending on which type of blast you use(ie, Genome Blast, >> Blast SNP, >> normal blast such as blastn, etc...) you see a different listing of >> databases >> available for querys. 
Since you mention that the Blast page I see was >> generated >> by Genome, where could I go to see a complete listing of >databases I can >> query?? >> Or if you knew off hand which database to search if I only >wanted dbSNP >> hits? >> >> 2. You also mention, I can limit the search by using Entrez >terms. Do you >> mean >> like: >> $Bio::Tools::Run::RemoteBlast::HEADER{'ENTREZ_QUERY'} = 'abc'; >> where 'abc' is the name of the subject with which you would >only like to >> see >> result of. For example if you put it as 'Homo >sapiens[Organism]' then only >> human >> sequences would be in hit lists. >> If this is what you mean, what would I change it to, to see >only hits from >> dbSNP? >> >> Thanks for the ongoing help, >> >> Rohan >> >> Quoting Chris Fields : >> >> > I added a method to RemoteBlast in bioperl-live (CVS) if >you want to >> play >> > with changing the URL. I have been thinking about doing >this for a bit >> now >> > but I already see problems. >> > >> > Here's the issue: the BLAST page you see is NOT the NCBI BLAST page >> (note >> > the differences in the URL) but a user-friendly request >page, generated >> on >> > the fly by Genome, to submit BLAST requests for the >relevant database. >> So >> > changing the URL will not work (even by adding extra >parameters); you >> only >> > get the original HTML web page. >> > >> > You could try changing the database or limiting the search using an >> Entrez >> > term (which you should be able to include in the request, >probably by >> adding >> > it to the HEADER). 
>> > >> > Chris >> > >> > > -----Original Message----- >> > > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- >> > > bounces at lists.open-bio.org] On Behalf Of >> vrramnar at student.cs.uwaterloo.ca >> > > Sent: Thursday, July 13, 2006 5:39 PM >> > > To: bioperl-l at lists.open-bio.org >> > > Subject: [Bioperl-l] Remote Blast - Blast Human Genome >> > > >> > > >> > > Hello Again, >> > > >> > > I have another question regarding Remote blast but this >time using >> Genome >> > > Blast. >> > > >> > > Here is the link: >> > > >> > > >> >http://www.ncbi.nlm.nih.gov/genome/seq/BlastGen/BlastGen.cgi?taxid=9606 >> > > >> > > which again uses the main Blast web site: >> > > >> > > http://www.ncbi.nlm.nih.gov/BLAST/Blast.cgi >> > > >> > > Again I am not sure what to add or what HEADER >information to change >> > > within my >> > > script. >> > > >> > > Here is my program, which was the same as the last email: >> > > >> > > #!/usr/bin/perl -w >> > > >> > > use Bio::Perl; >> > > use Bio::Tools::Run::RemoteBlast; >> > > >> > > my $prog = "blastn"; >> > > my $db = "refseq_genomic"; >> > > my $e_val = 0.01; >> > > >> > > my @params = ( '-prog' => $prog, >> > > '-data' => $db, >> > > '-expect' => $e_val); >> > > >> > > my $factory = new Bio::Tools::Run::RemoteBlast->new(@params); >> > > $Bio::Tools::Run::RemoteBlast::HEADER{'WWW_BLAST_TYPE'} >= '????'; <-- >> --- >> > > what >> > > do I put here >> > > #$Bio::Tools::Run::RemoteBlast::HEADER{'?????'} = >'????'; <--- Do I >> need >> > > to add >> > > any other values to the form inputs >> > > >> > > $factory->submit_blast("blast.in"); >> > > $v = 1; >> > > >> > > while (my @rids = $factory->each_rid) >> > > { foreach my $rid ( @rids ) >> > > { my $rc = $factory->retrieve_blast($rid); >> > > if( !ref($rc) ) >> > > { if( $rc < 0 ) >> > > { $factory->remove_rid($rid); >> > > } >> > > print STDERR "." 
if ( $v > 0 );
>> > >         sleep 5;
>> > >       }
>> > >       else
>> > >       { my $result = $rc->next_result();
>> > >         my $filename = $result->query_name()."\.out";
>> > >         $factory->save_output($filename);
>> > >         $factory->remove_rid($rid);
>> > >         print "\nQuery Name: ", $result->query_name(), "\n";
>> > >       }
>> > >     }
>> > > }
>> > >
>> > > Both of my questions are very similar, in that I know how to use remote
>> > > blast but am not sure what to change to access the specific blast I want.
>> > >
>> > > Again, any help would be very appreciated!!
>> > >
>> > > Rohan
>> > >
>> > > ----------------------------------------
>> > > This mail sent through www.mywaterloo.ca
>> > > _______________________________________________
>> > > Bioperl-l mailing list
>> > > Bioperl-l at lists.open-bio.org
>> > > http://lists.open-bio.org/mailman/listinfo/bioperl-l
>> >
>>
>> ----------------------------------------
>> This mail sent through www.mywaterloo.ca
>
>_______________________________________________
>Bioperl-l mailing list
>Bioperl-l at lists.open-bio.org
>http://lists.open-bio.org/mailman/listinfo/bioperl-l
>

From cjfields at uiuc.edu Thu Jul 20 19:01:02 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 20 Jul 2006 18:01:02 -0500
Subject: [Bioperl-l] installing bioperl
In-Reply-To:
References:
Message-ID: <68C6025D-A9FE-47F0-905C-28B79C4B843A@uiuc.edu>

Did you run

perl Makefile.PL
make
make install

'perl Makefile.PL' generates the Makefile. Something screwy with DB_File, apparently, is also going on.

> /usr/lib64/perl5/5.8.5/x86_64-linux-thread-multi/auto/DB_File/
> DB_File.so:

Try updating or reinstalling DB_File.

Chris

On Jul 20, 2006, at 4:47 PM, Matthew A. Saunders wrote:

> Dear Bioperl representative,
>
> I have been trying to install bioperl (in order to ultimately run some
> Ensembl APIs) but I seem to be having some problems with the
> bioperl installation.
> > I have followed the installation directions and I get to the last > steps of > the "make" process, yet this stage fails with the error message below. > Can you possibly tell me what is the problem. I am not sure that I > understand the command "make", but I think that it requires that > there be > a file named "makefile" in the given folder, when I look in my newly > formed "bioperl-1.4" folder there is no "makefile" in there. > Perhaps that > is a problem. If so, how might I rectify the matter? > > Thanks! > > Matt > > > ************************************************************* . . > . > Enjoy the rest of bioperl, which you can use after going 'make > install' > > Checking if your kit is complete... > Looks good > /usr/bin/perl: symbol lookup error: > /usr/lib64/perl5/5.8.5/x86_64-linux-thread-multi/auto/DB_File/ > DB_File.so: > undefined symbol: db_version > Running make test > Make had some problems, maybe interrupted? Won't test > Running make install > Make had some problems, maybe interrupted? Won't install > *************************************************************** > > > > ----------------------------------------------------- > Matthew A. Saunders > UNCF-MERCK Postdoctoral Research Fellow > > Dept. of Ecology and Evolution > University of Chicago > (773)834-3964 > Skype: mattsaunders555 > http://home.uchicago.edu/~saunders > ------------------------------------------------------- > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From cjfields at uiuc.edu Thu Jul 20 19:02:08 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 20 Jul 2006 18:02:08 -0500 Subject: [Bioperl-l] Remote Blast - Blast Human Genome In-Reply-To: References: Message-ID: Nice to know! I'll add this to the wiki. 
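For such a wiki note, here is a minimal sketch of how the database name from Malcolm's message could be passed to Bio::Tools::Run::RemoteBlast. It reuses the -prog, -data and -expect arguments from Rohan's script. The helper remote_blast_params() and its defaults are hypothetical, the database name is copied from this thread and has not been checked against NCBI's current list, and nothing is sent to NCBI unless a query file is named on the command line.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): builds the constructor
# arguments used in Rohan's script, with the database name overridable.
sub remote_blast_params {
    my (%opt) = @_;
    return (
        '-prog'   => $opt{prog} || 'blastn',
        # Name as quoted in this thread; assumed, not verified against NCBI.
        '-data'   => $opt{db}   || 'snp/human/human_snp',
        '-expect' => defined $opt{expect} ? $opt{expect} : 0.01,
    );
}

my @params = remote_blast_params();
my %shown  = @params;
print "program=$shown{'-prog'} database=$shown{'-data'} expect=$shown{'-expect'}\n";

# Contact NCBI only when a query file is named and BioPerl is installed.
my $query = shift @ARGV;
if ( defined $query && eval { require Bio::Tools::Run::RemoteBlast; 1 } ) {
    my $factory = Bio::Tools::Run::RemoteBlast->new(@params);
    $factory->submit_blast($query);

    # Poll until every request ID has either failed or returned a report.
    while ( my @rids = $factory->each_rid ) {
        foreach my $rid (@rids) {
            my $rc = $factory->retrieve_blast($rid);
            if ( !ref $rc ) {
                $factory->remove_rid($rid) if $rc < 0;    # request failed
                sleep 5;
            }
            else {
                my $result = $rc->next_result();
                print "Query Name: ", $result->query_name(), "\n";
                $factory->remove_rid($rid);
            }
        }
    }
}
```

Run with no arguments, the script only prints the parameters it would use; given a FASTA file, it submits the search. To try a different database from the NCBI list, pass db => '...' to the helper instead of editing the default.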
Chris

On Jul 20, 2006, at 5:40 PM, Cook, Malcolm wrote:

> Rohan,
>
> 'snp/human/human_snp' is the database name you need to use to blast
> into
> human snp database at NCBI
> [...]
>>>>>
>>>>> Rohan
>>>>>
>>>>> ----------------------------------------
>>>>> This mail sent through www.mywaterloo.ca
>>>>> _______________________________________________
>>>>> Bioperl-l mailing list
>>>>> Bioperl-l at lists.open-bio.org
>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>
>>> ----------------------------------------
>>> This mail sent through www.mywaterloo.ca
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign

From vrramnar at student.cs.uwaterloo.ca Thu Jul 20 19:07:15 2006
From: vrramnar at student.cs.uwaterloo.ca (vrramnar at student.cs.uwaterloo.ca)
Date: Thu, 20 Jul 2006 19:07:15 -0400
Subject: [Bioperl-l] Remote Blast - Blast Human Genome
In-Reply-To:
References:
Message-ID: <1153436835.44c00ca39f2ee@www.nexusmail.uwaterloo.ca>

Hi Malcolm,

Thanks for the help. I actually figured this out today the same way you did, through discussions with the NCBI help desk.
He mentioned the main site is:

http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/

But specifically:

http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/remote_accessible_blastdblist.html

So all you would need to do while using RemoteBlast is set your $db to one of the following:

snp/human_9606/human_9606    Human SNPs
snp/human_9606/rs_ch1        Human chr 1 SNPs
snp/human_9606/rs_ch10       Human chr 10 SNPs
snp/human_9606/rs_ch11       Human chr 11 SNPs
snp/human_9606/rs_ch12       Human chr 12 SNPs
snp/human_9606/rs_ch13       Human chr 13 SNPs
snp/human_9606/rs_ch14       Human chr 14 SNPs
snp/human_9606/rs_ch15       Human chr 15 SNPs
snp/human_9606/rs_ch16       Human chr 16 SNPs
snp/human_9606/rs_ch17       Human chr 17 SNPs
snp/human_9606/rs_ch18       Human chr 18 SNPs
snp/human_9606/rs_ch19       Human chr 19 SNPs
snp/human_9606/rs_ch2        Human chr 2 SNPs
snp/human_9606/rs_ch20       Human chr 20 SNPs
snp/human_9606/rs_ch21       Human chr 21 SNPs
snp/human_9606/rs_ch22       Human chr 22 SNPs
snp/human_9606/rs_ch3        Human chr 3 SNPs
snp/human_9606/rs_ch4        Human chr 4 SNPs
snp/human_9606/rs_ch5        Human chr 5 SNPs
snp/human_9606/rs_ch6        Human chr 6 SNPs
snp/human_9606/rs_ch7        Human chr 7 SNPs
snp/human_9606/rs_ch8        Human chr 8 SNPs
snp/human_9606/rs_ch9        Human chr 9 SNPs
snp/human_9606/rs_chMT       Human chr Mitochondrial SNPs
snp/human_9606/rs_chMulti    Human SNPs mapped to multiple locations
snp/human_9606/rs_chNotOn    Human SNPs not mapped
snp/human_9606/rs_chUn       Human SNPs mapped to unplaced contigs
snp/human_9606/rs_chX        Human chr x SNPs
snp/human_9606/rs_chY        Human chr y SNPs

The web site has a more complete list of all the other databases available through the RemoteBlast module.

Rohan

Quoting "Cook, Malcolm" :

> Rohan,
>
> 'snp/human/human_snp' is the database name you need to use to blast into
> human snp database at NCBI
>
> See the following document for the full list (which link was provided to
> me via personal correspondence with NCBI helpdesk). Very useful...
> [...]
> >> > > > >> > > Rohan > >> > > > >> > > > >> > > > >> > > ---------------------------------------- > >> > > This mail sent through www.mywaterloo.ca > >> > > _______________________________________________ > >> > > Bioperl-l mailing list > >> > > Bioperl-l at lists.open-bio.org > >> > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > >> > > >> > >> > >> > >> > >> ---------------------------------------- > >> This mail sent through www.mywaterloo.ca > > > >_______________________________________________ > >Bioperl-l mailing list > >Bioperl-l at lists.open-bio.org > >http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > ---------------------------------------- This mail sent through www.mywaterloo.ca From vrramnar at student.cs.uwaterloo.ca Thu Jul 20 19:18:27 2006 From: vrramnar at student.cs.uwaterloo.ca (vrramnar at student.cs.uwaterloo.ca) Date: Thu, 20 Jul 2006 19:18:27 -0400 Subject: [Bioperl-l] SNP reference file download Message-ID: <1153437507.44c00f43b53d4@www.nexusmail.uwaterloo.ca> Hello All, I was wondering if anyone knew how to download an entire SNP reference file from NCBI?? Or even downloading the sequence data for a particular SNP. I know how to do this via Bio::DB::GenBank, Bio::DB::SwissP, etc.. when referring to NM_##### but when I try to access rs###### files I am unsure of what Bio::DB to point to, if there is one. For example, if I had the accession number: rs4986950 How could I retrieve NCBI's entire reference file for this SNP record OR just the SNP sequence relating to this accession number. 
Any help on this subject would greatly be appreciated, Rohan ---------------------------------------- This mail sent through www.mywaterloo.ca From cjfields at uiuc.edu Fri Jul 21 00:51:30 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 20 Jul 2006 23:51:30 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C00805.7090403@sendu.me.uk> References: <001001c6ac2f$02c8c3f0$15327e82@pyrimidine> <44C00805.7090403@sendu.me.uk> Message-ID: <9777D0A4-5340-4367-879D-F9125393B4FD@uiuc.edu> > I didn't actually mean a stored file (but that would be possible > with a > tied hash or something: DB_File, just like flatfile), but an in-memory > one for use during the course of program execution. Stored file would > probably be dangerous because you wouldn't know if the data has become > stale or not - and checking to see if it wasn't would defeat the > point. Okay, that wouldn't be a problem. I currently use in-memory caches to hold NCBI history information and ELink information for EUtilities. It would just be a matter of doing the same for Bio::DB::Taxonomy. ... > entrez already parses through LineageEx to build the classification > array. flatfile walks up all the parents to do the same. Having the > information isn't the issue. We have the information. The methods > genus() and species() need to work with the genbank fileformat, > that is > the problem. The original purpose for Bio::Species was a simple object to hold taxonomic information. This object was then used in an attempt to hold the basic organism information (scientific name, common name, lineage information, etc) contained in a RichSeq file, like GenBank, EMBL, SwissProt, etc. The problem: trying to determine which term in the lineage corresponds to which rank and what part of the organism's scientific name is the genus, the species, and so on based solely on the data in the file, which comes down to a best-guess scenario for many organisms.
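To make the best-guess point concrete, here is a minimal sketch in plain Perl. It is not how Bio::Species actually parses names; guess_binomial() is a hypothetical helper that simply assumes the first word is the genus and the second the species epithet, which holds for ordinary binomials and falls apart for names like viruses or uncultured isolates.

```perl
use strict;
use warnings;

# Hypothetical helper (not a BioPerl method): assume the first word of the
# scientific name is the genus and the second is the species epithet.
sub guess_binomial {
    my ($scientific_name) = @_;
    my ($genus, $species) = split ' ', $scientific_name;
    return { genus => $genus, species => $species };
}

# The first name is a true binomial; the other two show where the guess
# produces something that is not a genus/species pair at all.
for my $name ('Saccharomyces cerevisiae', 'Influenza A virus', 'uncultured bacterium') {
    my $guess = guess_binomial($name);
    printf "%-26s genus=%s species=%s\n", $name, $guess->{genus}, $guess->{species};
}
```

Only a lookup against taxonomy data that records the rank of each node can tell these cases apart; the text of the name alone cannot.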
It does work, but not equally well for all RichSeq files, not for every organism, and definitely not all the time. So, yes, genus(), species(), binomial(), and other methods are present, but one must realize that parsing out the data into the appropriate object data using the various get/sets, with the obvious exceptions, is not the best way. Unless... you incorporate information available only outside the actual file itself (i.e. NCBI Taxonomy information). This is where Bio::Taxonomy seems to come along, as it's not species-specific (it can represent any rank) and is also DB-aware. Though Bio::Species was originally going to delegate all its data to Bio::Taxonomy::Node, I think the purpose was to eventually replace Bio::Species. So, my question is, why not use a Bio::Taxonomy::Node-like class initially to contain the appropriate data for a GenBank file (just for read/write purposes)? This object, since it implements Bio::Taxonomy::NodeI, is also DB-aware and thus, if set up with a database, could also get/set the appropriate object data correctly using the lineage data. So, for instance, if I called $species = $seq->species(); and wanted the classification, scientific_name(), common_name(), and other information that is gleaned from the file, then there's no need for a lookup. Once you cross into the bounds of: print $species->species(); print $species->genus(); then there's trouble, since we're working straight from the file (i.e. parsing is mainly correct, but still guesswork and sometimes wrong). But what if you could do something like this: my $db = Bio::DB::Taxonomy->new(-source => 'entrez'); # normally not needed as this is set by default internally, but as a demo here...
$species->db_handle($db); # reset the appropriate data (genus, species, etc) based on Entrez tax data $species->reset_data(); # this method, BTW, doesn't exist yet but should be easy to implement print $species->species(); my $parent = $species->get_Parent_Node; my @child = $species->get_Children_Nodes; ...and so on Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From prabubio at gmail.com Fri Jul 21 02:17:41 2006 From: prabubio at gmail.com (Prabu R) Date: Fri, 21 Jul 2006 11:47:41 +0530 Subject: [Bioperl-l] Blast Output Parsing In-Reply-To: <002b01c6ac22$5d75e1f0$15327e82@pyrimidine> References: <002b01c6ac22$5d75e1f0$15327e82@pyrimidine> Message-ID: It works great Thanks a lot Mr.Chris. R. Prabu On 7/20/06, Chris Fields wrote: > > Grab the HSPs, then use get_aln() to generate a Bio::SimpleAlign object. > You can then use Bio::AlignIO to generate the alignment output if needed, > or > use the Bio::SimpleAlign methods to get what you want. > > http://www.bioperl.org/wiki/HOWTO:Beginners > > http://www.bioperl.org/wiki/HOWTO:SearchIO > > > http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/SimpleAlign > .html > > Chris > > > -----Original Message----- > > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > > bounces at lists.open-bio.org] On Behalf Of Prabu R > > Sent: Thursday, July 20, 2006 11:02 AM > > To: bioperl-l at lists.open-bio.org > > Subject: [Bioperl-l] Blast Output Parsing > > > > Dear All! > > > > I am now trying to parse a Blast output using PERL. > > > > I have to extract each alignment and have to parse the alignment. I > mean, > > I > > have to check whether a particular part of the given sequence got > aligned > > 100%. > > > > Anybody please tell me what module in PERL I have to use for getting > this. > > > > I've tried Bio::SearchIO. But I didnt get any method to get the > > alignment. > > > > Kindly help. > > > > Thanks, > > R. 
Prabu > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > -- "Every noble work is at first impossible." - Thomas Carlyle From mh6 at sanger.ac.uk Fri Jul 21 05:04:57 2006 From: mh6 at sanger.ac.uk (Michael Han) Date: Fri, 21 Jul 2006 10:04:57 +0100 Subject: [Bioperl-l] PAML parser Message-ID: <44C098B9.4090003@sanger.ac.uk> Hi, I have some questions about the PAML parser (Bio::Tools::Phylo::PAML in CVS HEAD). Maybe some of you could help. If you call next_result, $self->_parse_summary might be called, which loops over $self->_readline . Later in next_result when "while (defined ($_=$self->_readline))" is used isn't the filepointer/filehandle already at the end of the output file and should return undef breaking the parsing? I added a crude seek($self->{_filehandle},0,0) after the _parse_summary and it seemed to work, but I wonder if I missed something obvious. thanks, Mike From cjfields at uiuc.edu Fri Jul 21 08:22:01 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Fri, 21 Jul 2006 07:22:01 -0500 Subject: [Bioperl-l] PAML parser In-Reply-To: <44C098B9.4090003@sanger.ac.uk> References: <44C098B9.4090003@sanger.ac.uk> Message-ID: Normally when you parse a report you use a loop to iterate through results: while (my $result = $parser->next_result) { # do work here } So returning undef is necessary to end the loop. This type of loop construct is common in BioPerl (and in Perl in general). There is a HOWTO for PAML: http://www.bioperl.org/wiki/HOWTO:PAML Chris On Jul 21, 2006, at 4:04 AM, Michael Han wrote: > Hi, > > I have some questions about the PAML parser > (Bio::Tools::Phylo::PAML in CVS HEAD). Maybe some of you could help. > > If you call next_result, $self->_parse_summary might be called, > which loops over $self->_readline . 
> > Later in next_result when "while (defined ($_=$self->_readline))" > is used isn't the filepointer/filehandle > already at the end of the output file and should return undef > breaking the parsing? > > I added a crude seek($self->{_filehandle},0,0) after the > _parse_summary and it seemed to work, but I wonder if I missed > something obvious. > > thanks, > > Mike > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From cjfields at uiuc.edu Fri Jul 21 11:50:20 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Fri, 21 Jul 2006 10:50:20 -0500 Subject: [Bioperl-l] SNP reference file download In-Reply-To: <1153437507.44c00f43b53d4@www.nexusmail.uwaterloo.ca> Message-ID: <000901c6acdd$5f38ddb0$15327e82@pyrimidine> You'll need the latest code from CVS; you could try (the highly experimental) Bio::DB::EUtilities to get the raw flatfile XML data, then pass everything through Bio::ClusterIO. Currently there isn't tempfile, file, or filehandle support for the EUtilities but I plan on adding this soon. You could also pipe STDOUT from one SNP retrieval script into STDIN for the ClusterIO. BTW, the EFetch object below accepts an array reference of primary IDs if you want to use them instead, so you don't need to run an ESearch query first. To do this you'll need to set the database parameter (-db => 'snp'); the database from the ESearch query is passed to EFetch via the Cookie object. 
Chris use Bio::DB::EUtilities; use Bio::ClusterIO; # save XML to tempfile for read/write open my $XMLDATA, '+>', 'tempfile.xml'; # ESearch for term, place data in search history my $esearch= Bio::DB::EUtilities->new(-eutil => 'esearch', -term => 'dihydroorotase', -db => 'snp', -usehistory => 'y'); $esearch->get_response; print STDERR "Count: ", $esearch->count,"\n"; # efetch is default EUtility my $efetch = Bio::DB::EUtilities->new(-cookie => $esearch->next_cookie, -rettype => 'flt'); # SNP flatfile print $XMLDATA $efetch->get_response->content; seek ($XMLDATA, 0, 0); # don't forget to rewind... my $cio = Bio::ClusterIO->new(-format => 'dbsnp', -fh => $XMLDATA); # $snp is a Bio::Variation::snp object, see perldoc for methods while (my $snp = $cio->next_cluster) { print "ID : ",$snp->id,"\n"; } close $XMLDATA; > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of vrramnar at student.cs.uwaterloo.ca > Sent: Thursday, July 20, 2006 6:18 PM > To: bioperl-l at lists.open-bio.org > Subject: [Bioperl-l] SNP reference file download > > > Hello All, > > I was wondering if anyone knew how to download an entire SNP reference > file from > NCBI?? Or even downloading the sequence data for a particular SNP. > > I know how to do this via Bio::DB::GenBank, Bio::DB::SwissP, etc.. when > referring > to NM_##### but when I try to access rs###### files I am unsure of what > Bio::DB > to point to, if there is one. > > For example, if I had the accession number: rs4986950 How could I retrieve > NCBI's > entire reference file for this SNP record OR just the SNP sequence > relating to > this accession number. 
> > Any help on this subject would greatly be appreciated, > > Rohan > > > ---------------------------------------- > This mail sent through www.mywaterloo.ca > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Sun Jul 23 15:09:48 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Sun, 23 Jul 2006 14:09:48 -0500 Subject: [Bioperl-l] obo_parser.t test warnings Message-ID: Hilmar, Sohel, Didn't know who to notify, so sorry in advance about cross-posting this to the list. I was running through cleaning up some bugs and found that obo_parser.t is throwing a ton of warnings: bayou-75:~/Chris/Bioperl/bioperl-live natashacapell$ perl -I. -w t/ obo_parser.t 1..40 "my" variable $val masks earlier declaration in same scope at Bio/ OntologyIO/obo.pm line 592. "my" variable $qh masks earlier declaration in same scope at Bio/ OntologyIO/obo.pm line 592. Use of uninitialized value in string eq at Bio/OntologyIO/obo.pm line 239, line 13. ... Good news: all tests pass! Cheers! Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From cjfields at uiuc.edu Sun Jul 23 16:53:32 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Sun, 23 Jul 2006 15:53:32 -0500 Subject: [Bioperl-l] Bio::Species/Bio::Taxonomy changes Message-ID: <7A7BF302-D84E-427B-ABC0-3EB4D19B33E2@uiuc.edu> Sendu, Hilmar, et al, I was looking through SeqIO::genbank and though I would bring up a couple of things to think about re: GenBank Taxonomy information. This is how NCBI defines the names used for SOURCE and ORGANISM according to the latest GenBank release notes: SOURCE - Common name of the organism or the name most frequently used in the literature. Mandatory keyword in all annotated entries/one or more records/includes one subkeyword. 
ORGANISM - Formal scientific name of the organism (first line) and taxonomic classification levels (second and subsequent lines). Mandatory subkeyword in all annotated entries/two or more records. According to their sample file page (http://www.ncbi.nlm.nih.gov/ Sitemap/samplerecord.html), the SOURCE is this: Free-format information including an abbreviated form of the organism name, sometimes followed by a molecule type. (See section 3.4.10 of the GenBank release notes for more info.) The SOURCE can also include the organelle and also may include additional information, such as an abbreviated name and a common name in parentheses. ... SOURCE Saccharomyces cerevisiae (baker's yeast) ORGANISM Saccharomyces cerevisiae Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes; Saccharomycetales; Saccharomycetaceae; Saccharomyces. ... Setting scientific_name() isn't a problem; acc. to the above definition, it is the full name on the ORGANISM line. The lineage (or classification() array) is also straight-forward. The common_name (), though as used by Bio::SeqIO::genbank, is the entire SOURCE line (not just the abbreviated name, but the name and everything else). No additional parsing is performed on it. write_seq() also seems to do the wrong thing when rebuilding the SOURCE line as well as the method writes the subspecies to the line. I plan on using Bio::SeqIO::genbank as a guinea pig of sorts to try using Bio::Taxonomy::Node objects instead of Bio::Species, then get the parsing for these lines corrected and simplified. Essentially, the way NCBI describes it, the main name on the line is actually the free-form abbreviated name, the name in parentheses is the common name (optionally present), and the organelle precedes all of these if present. 
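As a rough illustration of that reading of the SOURCE line, here is a minimal sketch in plain Perl. parse_source_line() is a hypothetical helper, not existing Bio::SeqIO::genbank code, and the list of organelle prefixes is an assumption made for the example rather than an official vocabulary.

```perl
use strict;
use warnings;

# Assumed organelle prefixes for this sketch only; not an exhaustive list.
my @ORGANELLES = ('mitochondrion', 'chloroplast', 'plastid', 'kinetoplast');

# Hypothetical helper: split a SOURCE value into an optional leading
# organelle, the free-form abbreviated name, and an optional common name
# given in trailing parentheses.
sub parse_source_line {
    my ($source) = @_;
    my %parts = (organelle => undef, abbreviated_name => undef, common_name => undef);

    $source =~ s/^\s+|\s+$//g;    # trim surrounding whitespace

    # Optional organelle at the very start of the line.
    my $organelle_re = join '|', map { quotemeta($_) } @ORGANELLES;
    if ($source =~ s/^($organelle_re)\s+//) {
        $parts{organelle} = $1;
    }

    # Optional common name in parentheses at the end of the line.
    if ($source =~ s/\s*\(([^()]+)\)$//) {
        $parts{common_name} = $1;
    }

    # Whatever remains is treated as the abbreviated name.
    $parts{abbreviated_name} = $source;
    return \%parts;
}

my $yeast = parse_source_line("Saccharomyces cerevisiae (baker's yeast)");
printf "name=%s common=%s\n", $yeast->{abbreviated_name}, $yeast->{common_name};
```

A real parser would also have to cope with SOURCE values that carry a molecule type or other free text, which this sketch deliberately ignores.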
I want to try getting common_name() to match the common name found for taxonomy (baker's yeast) rather than have it be a simple container, add an abbreviated_name() method for the name container for the SOURCE line, and have the organelle() method actually be used if an organelle is present (it doesn't seem to be set at the moment in SeqIO::genbank). Right now, I have NO idea how EMBL, DDBJ, and other formats deal with organism info; I would think that the main three (GenBank/EMBL-SwissProt/DDBJ) handle them similarly...(Famous Last Words) I also propose (I'll probably get yelled at here) NOT actively supporting additional parsing of species, subspecies, etc directly from a file w/o a DB lookup. As in, leave species, subspecies, genus parsing from the flatfile as is (no longer support it) or remove it completely and leave them unset. I haven't looked, but I have a strong feeling that the species parsing in Bio::SeqIO is different from format to format. It really seems like more trouble than it's worth to maintain this, especially as there is perfectly valid Taxonomy database information available either locally using a flatfile or via Entrez. If people want to have reliable $species->species or $species->genus for taxonomy information, they will need to have the db_handle() set for the Bio::Taxonomy::Node object and have a Node-based method to reset species, genus, etc to the tax database information (maybe reset_taxon or something along those lines). Okay, rambled on enough. Any thoughts? Christopher Fields Postdoctoral Researcher Lab of Dr.
Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From hlapp at gmx.net Sun Jul 23 19:40:45 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Sun, 23 Jul 2006 19:40:45 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BF86AF.8080408@sendu.me.uk> References: <44BBBB69.6000906@sendu.me.uk> <44BF86AF.8080408@sendu.me.uk> Message-ID: On Jul 20, 2006, at 9:35 AM, Sendu Bala wrote: > I'll describe all the changes I've now made and if no-one complains > I'll > commit. (I've also made these notes into bug 2047 for easier reference > in the future.) > > Bio::DB::Taxonomy::flatfile > --------------------------- > [...] > > BEHAVIOUR-CHANGE: flatfile used to store within the nodes it makes the > division as a three letter code, like 'PRI'. However, for consistency > with entrez and the scientific_name() of the node the division is > supposed to correspond to, it is now stored as the full name, like > 'Primates'. What about adding a method division_code() which would return the 3- letter abbreviation? The abbreviation may be needed by flat-file writers, so it may be handy to have in some cases. > > The names->id solution also stores the artificially uniqued names like > 'Craniata ', allowing you for the first time to retrieve the > correct id. Previously the search would have simply failed completely. > > The names->id solution now handles nodes with scientific names of 'xyz > (class)', allowing you to retrieve the id with both get_taxonids > ('xyz') > and get_taxonids('xyz (class)'). Previously only the latter would > work. Should angle brackets be allowed too? > > NOTE: the previous 2 changes (and the issues with entrez, see below) > make flatfile better at searching the taxonomy database than entrez > module or the website, both in terms of speed and completeness of > results. 
> > BEHAVIOUR-CHANGE: The scientific name field isn't touched in any way, > always being sent directly to Bio::Taxonomy::Node->new(-name => > $untouched) Maybe there should also be a -names parameter which accepts a hash reference with keys being the kind of name (scientific, common, etc) and the values being array references with the set of names of that kind? > or the $node->classification() array. Bio::Taxonomy::Node shouldn't have this attribute. It is legacy brought over from a flawed (because flat) object model in Bio::Species. > [...] > > Bio::DB::Taxonomy::entrez > ------------------------- > > # Bug-fixes > Special characters like ", ( and ) in the input query string to > get_taxonid() result in the failure or inaccuracy of the search. These > characters are now removed prior to submission, allowing for correct > search results. > API-CHANGE: entrez has always been able to return multiple ids that > match a single input name, so I've renamed get_taxonid() to > get_taxonids() and it returns an array of ids in list context. It > returns one of the ids in scalar context. For backward compatibility, > *get_taxonid = \&get_taxonids. Sounds good to me. > NOTE: entrez modules (and website) cannot cope with '' > in the > query, failing searches like 'Craniata '. For this > reason, if > get_taxonids() is given a query with '' it will immediately > return undefined, saving a pointless website access. If there is a 'next-best-thing' that is still semantically compatible with the API documentation, I would do that. In this case, if there is a in the query the entrez module should strip it and automatically use the rest for searching. If indeed multiple IDs match there should be a warning to inform the user that entrez cannot use the notation to limit the query results. In fact, you might as well provide an option to enable an automatic check for the correct branch for each ID if multiple ones are returned. 
I.e., if this option is enabled, the module would automatically query the parent nodes to see if is in the lineage, and if not will remove the respective ID from the result set. The reason you may want to make it optional is because it potentially costs time. (but in reality I'm not sure why a client will not want to enable the option - so maybe this should even be default) > If you want the id > of 'Craniata ' you must search for 'Craniata', then get the > node for each returned id to see which one has a parent node with a > scientific_name() or common_names() case-insensitive matching to > 'chordata'. Yep, see above. The more burden you can shield from the user the better. > [...] > Bio::Taxonomy::Node > ------------------- > [...] > classification() has a proper solution to finding the classification > when the array wasn't manually set. > > # Improvements > BEHAVIOUR-CHANGE: node_name() used to be an alias to name > ('common'). Now > it is an alias to name('scientific'). > NOTE: node_name is what is set when ->new(-name => $name) is set, so > flatfile and entrez and user-created nodes now implicitly associate > the > name of the node they create with its scientific name. I'm not even sure node_name() should just be deprecated. The methods falsely suggests that there is only a single and definitive name for the taxon node. In NCBI reality, this is only true for the scientific name of the node. In real reality, many nodes have multiple scientific names - taxonomy isn't static and therefore the scientific naming of nodes isn't either. > [...] > Thanks for the work, all other changes sound great. Thanks also to Chris for assisting! Looks like this is in much better shape now than before. 
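The lineage check suggested above for ambiguous names could look roughly like the following. This is a minimal sketch in plain Perl over a made-up in-memory tree; the ids, the tree shape, and the helper names has_ancestor_named() and filter_ids_by_lineage() are all hypothetical and are not part of Bio::DB::Taxonomy::entrez.

```perl
use strict;
use warnings;

# Hypothetical in-memory taxonomy: id => { name, parent }. The ids and the
# tree shape are invented for illustration only.
my %nodes = (
    1 => { name => 'root',        parent => undef },
    2 => { name => 'Chordata',    parent => 1 },
    3 => { name => 'Brachiopoda', parent => 1 },
    4 => { name => 'Craniata',    parent => 2 },
    5 => { name => 'Craniata',    parent => 3 },
);

# True if any ancestor of $id has a name equal to $qualifier,
# compared case-insensitively.
sub has_ancestor_named {
    my ($nodes, $id, $qualifier) = @_;
    my $parent = $nodes->{$id}{parent};
    while (defined $parent) {
        return 1 if lc($nodes->{$parent}{name}) eq lc($qualifier);
        $parent = $nodes->{$parent}{parent};
    }
    return 0;
}

# Keep only the candidate ids whose lineage contains the qualifier.
sub filter_ids_by_lineage {
    my ($nodes, $qualifier, @ids) = @_;
    return grep { has_ancestor_named($nodes, $_, $qualifier) } @ids;
}

# Both 4 and 5 are called 'Craniata'; only 4 sits under 'Chordata'.
my @kept = filter_ids_by_lineage(\%nodes, 'chordata', 4, 5);
print "kept: @kept\n";
```

Against a remote database each step up the tree is a separate lookup, which is the time cost mentioned above and the reason the check might be made optional.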
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at gmx.net Sun Jul 23 19:44:23 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Sun, 23 Jul 2006 19:44:23 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BD147A.9020103@sendu.me.uk> References: <003201c6aa81$01db9a30$15327e82@pyrimidine> <44BD147A.9020103@sendu.me.uk> Message-ID: <6701FDC3-7630-48EB-9B21-5BB5C566424E@gmx.net> On Jul 18, 2006, at 1:03 PM, Sendu Bala wrote: > > [regarding changes to Bio::Taxonomy::Node] > > Actually, I'm really strongly leaning toward getting rid of the > following methods and new() options (and giving up entirely on being > able to keep 'sapiens' somewhere): > > -organelle, organelle() > -division, division() > -sub_species, sub_species() > -variant, variant() > species(), validate_species_name() > genus() > binomial() > > As far as I can see none of these methods have any place in a generic > Node class. I agree. Some of them are a special case for genbank files (organelle () etc), and the rest is legacy from Bio::Species. -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at gmx.net Sun Jul 23 20:48:22 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Sun, 23 Jul 2006 20:48:22 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <9777D0A4-5340-4367-879D-F9125393B4FD@uiuc.edu> References: <001001c6ac2f$02c8c3f0$15327e82@pyrimidine> <44C00805.7090403@sendu.me.uk> <9777D0A4-5340-4367-879D-F9125393B4FD@uiuc.edu> Message-ID: <21C7763A-B8FF-4E91-BAEC-E5F79E462DD3@gmx.net> On Jul 21, 2006, at 12:51 AM, Chris Fields wrote: > my $db = Bio::DB::Taxonomy->new(-source => 'entrez'); > > # normally not needed as this is set by default internally, but as a > demo here... 
> $species->db_handle($db); > > # reset the appropriate data (genus, species, etc) based on Entrez > tax data > $species->reset_data(); # this method, BTW, doesn't exist yet but > should be easy to implement Don't call this reset_data() as it may be misleading (usually reset() means to revert into a native or original state). Instead, you would use fetch_from_db() or something. However, it seems redundant to me to begin with. If we ignore for a second that the return value in the following isn't exactly compatible, why would you not just call $species = $db->get_Taxonomy_Node(-taxonid => $species->ncbi_taxid); So I think more than anything else, this should be made to work, and you would have a more seamless interface. > Short and sweet summary: > > Sendu volunteered making changes to Bio::Taxonomy::Node and related > modules; > we disagreed on exactly what changes should be made. Sendu wanted a > stripped-down version of Bio::Taxonomy::Node; I wanted one which would > support similar methods as in Bio::Species. Bio::Species should be considered legacy; I think it is flawed as an object model because it imposes a flat view on something which in reality is only a node in a tree and not flat at all. The only real need for the flat view came from the desire to write sequence files - for all other purposes the classification() etc attributes are useless anyway. I.e., binomial() and common_name() (corresponding to scientific_name () and names('common')) are the only real useful attributes, the rest is baggage for writing sequence files. The baggage should not be passed on to a better model ... Instead, there should be a separate module (in essence a Bio::Species factory) which can translate a Bio::Taxonomy::Node into a Bio::Species object - if the rank is 'species' or below. 
Alternatively, you could have a Bio::Taxonomy::SpeciesNode object which implements both APIs and can be initialized with either a Bio::Taxonomy::Node instance, or the combination of a Bio::Species and a db handle. At any rate, I think Bio::Taxonomy::Node should be stripped of legacy methods that are only there to achieve Bio::Species compatibility. > > I suggested having a common interface module, one for Node and > another for > Species; both implement the same interface methods (NodeI maybe), > so you > could use either a bare-bones Node or a full-fledged Species > object. I then > suggested this new version of Species could replace Bio::Species. > We could > worry about which one to use for Bio::DB::Taxonomy* later. I'm not following here... What would this look like? What would the API(s) be? > > We both agreed. Everybody's happy. Happiness is great, so maybe you shouldn't bother about me not following... > I still plan on switching Bio::DB::Taxonomy::entrez to use > Bio::DB::EUtilities at some point Wouldn't that rather be Bio::DB::Taxonomy::eutil? > I may > add a method for retrieving tax data based on protein/nucleotide > sequence > primary ID and relevant sequence database, so you could directly > retrieve > the relevant TaxID w/o parsing sequences directly for them. This > would > mainly be useful if you gather GIs from a BLAST search, for instance. > > Anyway, I could add this in the base class Bio::DB::Taxonomy > directly so > one could use the retrieved TaxIDs for flat-file or entrez > searches; this > requires, of course, access to the remote Entrez database (it would > use > ELink). Would that be of interest? If you add the API methods for this to the base class (which in this case is close in concept to an interface, much like Bio/SeqIO.pm), then make clear that not every database will allow you to implement this.
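One way to read that last point: declare the method in the base class with a default that fails clearly, and let only the backends that can answer the query override it. A minimal sketch in plain Perl follows; the My::Taxonomy::* packages and get_taxonids_for_seqids() are hypothetical names and not part of Bio::DB::Taxonomy, and the lookup table stands in for a remote query.

```perl
use strict;
use warnings;

# Hypothetical base class: declares the optional lookup and fails loudly
# unless a subclass provides an implementation.
package My::Taxonomy::Base;
sub new { my ($class, %args) = @_; return bless {%args}, $class }
sub get_taxonids_for_seqids {
    my ($self) = @_;
    die ref($self) . " does not implement get_taxonids_for_seqids\n";
}

# Hypothetical backend that can answer the query (here from a plain hash
# standing in for a remote service).
package My::Taxonomy::Remote;
our @ISA = ('My::Taxonomy::Base');
sub get_taxonids_for_seqids {
    my ($self, @seqids) = @_;
    my @taxids;
    for my $seqid (@seqids) {
        push @taxids, $self->{lookup}{$seqid} if exists $self->{lookup}{$seqid};
    }
    return @taxids;
}

# Hypothetical backend with no way to map sequence ids; it inherits the
# failing default from the base class.
package My::Taxonomy::Flatfile;
our @ISA = ('My::Taxonomy::Base');

package main;

# Illustrative data only: one known sequence id mapped to a taxon id.
my $remote = My::Taxonomy::Remote->new(lookup => { 'seq1' => 4932 });
my @ids = $remote->get_taxonids_for_seqids('seq1', 'seq2');
print "remote backend returned: @ids\n";

# Callers can probe for support by trapping the failure.
my $flat = My::Taxonomy::Flatfile->new;
my $ok = eval { $flat->get_taxonids_for_seqids('seq1'); 1 };
print "flatfile backend: ", ($ok ? "supported" : "not supported"), "\n";
```

The design choice here is that an unsupported call dies with a message naming the backend instead of silently returning an empty list, so a caller cannot mistake "not implemented" for "no hits".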
> > |------Node > NodeI----| > |------Species > > Another option would be to have Bio::Taxonomy::Node itself stripped > down, > then have another class (Bio::Taxonomy::Species) inherit methods > from it and > also implement additional methods (genus(), species(), etc). I think this would be the way to go. I.e., |------Node NodeI----| |-| |----SpeciesNode Species----| This way the NodeI interface and its direct implementors are kept free of legacy. -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From cjfields at uiuc.edu Sun Jul 23 20:43:45 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Sun, 23 Jul 2006 19:43:45 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <6701FDC3-7630-48EB-9B21-5BB5C566424E@gmx.net> References: <003201c6aa81$01db9a30$15327e82@pyrimidine> <44BD147A.9020103@sendu.me.uk> <6701FDC3-7630-48EB-9B21-5BB5C566424E@gmx.net> Message-ID: <5F6027E0-A504-4019-8DAB-C50DF9EB6E18@uiuc.edu> As an aside, the 'source' seqfeature in a GenBank file contains some of the following information as tags; that's where the NCBI tax ID is taken from in Bio::SeqIO::genbank: FEATURES Location/Qualifiers source 1..814 /organism="Porterinema fluviatile" /organelle="plastid:chloroplast" /mol_type="genomic DNA" /strain="SAG 124.79" /db_xref="taxon:246123" /country="Germany" ... So, variant(), organelle(), and ncbi_taxid() could all be set from the same point in Bio::SeqIO::genbank. I suggested an option to Sendu, but I'd like to hear your thoughts on this since this will possibly affect bioperl-db. We could have two Node-like Taxonomy objects using a common interface class (Bio::Taxonomy::NodeI) : Bio::Taxonomy::Node (stripped down version), and Bio::Taxonomy::Species (the sequence-based NodeI-implementing object, which would retain the other Bio::Species-like methods). 
Bio::Taxonomy::Species would act sort of as an 'entry point' for Bio::Taxonomy from sequences; moving up or down the tax node hierarchy gets Tax::Node objects, unless you are specifically at a 'species'-ranked node (though this could be just a Tax::Node as well). BTW, I have managed to get Bio::SeqIO::genbank switched over to Bio::Taxonomy::Node (er... Bio::Taxonomy::Species); all tests pass. I was quite surprised how easy it was. It shouldn't be too hard to get a NodeI/Node/Species class hierarchy set up if everybody thinks it's worth it. Then we could deprecate Bio::Species! Chris On Jul 23, 2006, at 6:44 PM, Hilmar Lapp wrote: > > On Jul 18, 2006, at 1:03 PM, Sendu Bala wrote: > >> >> [regarding changes to Bio::Taxonomy::Node] >> >> Actually, I'm really strongly leaning toward getting rid of the >> following methods and new() options (and giving up entirely on being >> able to keep 'sapiens' somewhere): >> >> -organelle, organelle() >> -division, division() >> -sub_species, sub_species() >> -variant, variant() >> species(), validate_species_name() >> genus() >> binomial() >> >> As far as I can see none of these methods have any place in a generic >> Node class. > > I agree. Some of them are a special case for genbank files (organelle > () etc), and the rest is legacy from Bio::Species. > > -hilmar > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. 
Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From hlapp at gmx.net Sun Jul 23 20:58:32 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Sun, 23 Jul 2006 20:58:32 -0400 Subject: [Bioperl-l] Bio::Species/Bio::Taxonomy changes In-Reply-To: <7A7BF302-D84E-427B-ABC0-3EB4D19B33E2@uiuc.edu> References: <7A7BF302-D84E-427B-ABC0-3EB4D19B33E2@uiuc.edu> Message-ID: On Jul 23, 2006, at 4:53 PM, Chris Fields wrote: > I also propose (I'll probably get yelled at here) NOT actively > supporting additional parsing of species, subspecies, etc directly > from a file w/o a DB lookup. As in, leave species, subspecies, genus > parsing from the flatfile as is (no longer support it) or remove it > completely and leave them unset. Note that most (as in: most used, not most taxa) cases are actually straightforward. I don't think removing what's there is desirable; everyone just needs to understand that it will recognize only a limited number of syntactical variations, and beyond that, if you want correct taxon attributes, you will need a database (be it flatfile, eutil, whatever) lookup. > If people want to > have reliable $species->species or $species->genus for taxonomy > information, they will need to have the db_handle() set for the > Bio::Taxonomy::Node object and have a Node-based method to reset > species, genus, etc to the tax database information (maybe > reset_taxon or something along those lines). That's what I've been saying all along.
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From cjfields at uiuc.edu Sun Jul 23 23:30:07 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Sun, 23 Jul 2006 22:30:07 -0500 Subject: [Bioperl-l] Bio::Species/Bio::Taxonomy changes In-Reply-To: References: <7A7BF302-D84E-427B-ABC0-3EB4D19B33E2@uiuc.edu> Message-ID: <28D3470B-DA8F-4C41-96C7-F0D0DE89BAEE@uiuc.edu> On Jul 23, 2006, at 7:58 PM, Hilmar Lapp wrote: > > On Jul 23, 2006, at 4:53 PM, Chris Fields wrote: > >> I also propose (I'll probably get yelled at here) NOT actively >> supporting additional parsing of species, subspecies, etc directly >> from a file w/o a DB lookup. As in, leave species, subspecies, genus >> parsing from the flatfile as is (no longer support it) or remove it >> completely and leave them unset. > > Note that most (as in: most used, not most taxa) cases are actually > straightforward. I don't think removing what's there is desirable; > everyone just needs to understand that it will recognize only a > limited number of syntactical variations, and beyond that, if you > want correct taxon attributes, you will need a database (be it flatfile, > eutil, whatever) lookup. Aha! We seem to agree on that... >> If people want to >> have reliable $species->species or $species->genus for taxonomy >> information, they will need to have the db_handle() set for the >> Bio::Taxonomy::Node object and have a Node-based method to reset >> species, genus, etc to the tax database information (maybe >> reset_taxon or something along those lines). > > That's what I've been saying all along.
> > -hilmar > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== I thought you had mentioned something about this a few months back on EMBL format issues with organism data. Anyway, I don't think it was from anybody disagreeing with you as much as it was one of the project priorities that sort of got lost in the shuffle. I'm sure Sendu will like having a bit of freedom with Bio::Taxonomy::Node. Anyway, I'll do what I can within reason; I have to leave next weekend for a 5-day conference. Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From bix at sendu.me.uk Mon Jul 24 04:21:55 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Mon, 24 Jul 2006 09:21:55 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <21C7763A-B8FF-4E91-BAEC-E5F79E462DD3@gmx.net> References: <001001c6ac2f$02c8c3f0$15327e82@pyrimidine> <44C00805.7090403@sendu.me.uk> <9777D0A4-5340-4367-879D-F9125393B4FD@uiuc.edu> <21C7763A-B8FF-4E91-BAEC-E5F79E462DD3@gmx.net> Message-ID: <44C48323.5060704@sendu.me.uk> Hilmar Lapp wrote: > On Jul 21, 2006, at 12:51 AM, Chris Fields wrote: > >> my $db = Bio::DB::Taxonomy->new(-source => 'entrez'); >> >> # normally not needed as this is set by default internally, but as a >> demo here... >> $species->db_handle($db); >> >> # reset the appropriate data (genus, species, etc) based on Entrez >> tax data >> $species->reset_data(); # this method, BTW, doesn't exist yet but >> should be easy to implement > > Don't call this reset_data() as it may be misleading (usually reset() > means to revert into a native or original state). Instead, you would > use fetch_from_db() or something. > > However, it seems redundant to me to begin with. 
If we ignore for a > second that the return value in the following isn't exactly > compatible, why would you not just call > > $species = $db->get_Taxonomy_Node(-taxonid => $species->ncbi_taxid); If Bio::Species was a Bio::Taxonomy, and we had FactoryI implementing classes or similar, we would say: $species = $factory->fetch(-taxon_id => $species->ncbi_taxid); > Instead, there should be a separate module (in essence a Bio::Species > factory) which can translate a Bio::Taxonomy::Node into a > Bio::Species object - if the rank is 'species' or below. I don't think a 'translation' module is necessary. Bio::Species can just be a Bio::Taxonomy. > At any rate, I think Bio::Taxonomy::Node should be stripped of legacy > methods that are only there to achieve Bio::Species compatibility. Yes :) > I think this would be the way to go. I.e., > > > |------Node > NodeI----| > |-| > |----SpeciesNode > Species----| Actually, if we're changing the name of the module that Species interacts with, any existing code needs to be re-written. So why not just do it properly and have Bio::Species interact with Bio::Taxonomy? |----Bio::Taxonomy Bio::TaxonomyI----| |----Bio::Species Or Bio::TaxonomyI----|----Bio::Taxonomy----|----Bio::Species Leaving Node completely free to be just a node. This way we don't have a crufty SpeciesNode there simply for the sake of Bio::Species. Bio::Species itself provides all the legacy stuff it needs for itself, while interacting with Nodes via TaxonomyI methods in the 'correct' way only. 
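The second layout above (interface, then Taxonomy, then Species, with Node left as a bare node) could be sketched roughly as follows. This is only an illustration in plain Perl under stated assumptions: the Sketch::* package names are hypothetical, the method set is a guess at a minimal interface rather than the real Bio::Taxonomy API, and the "legacy" accessors live only in the Species subclass.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the classes under discussion, not BioPerl code.
package Sketch::Node;                  # a bare node: rank and name only
sub new {
    my ($class, %args) = @_;
    return bless { rank => $args{-rank}, name => $args{-name} }, $class;
}
sub rank { return $_[0]{rank} }
sub name { return $_[0]{name} }

package Sketch::TaxonomyI;             # interface: implementors must override
sub add_node { die "add_node() not implemented\n" }
sub get_node { die "get_node() not implemented\n" }

package Sketch::Taxonomy;              # generic implementation
our @ISA = ('Sketch::TaxonomyI');
sub new { my ($class) = @_; return bless { nodes => [] }, $class }
sub add_node {
    my ($self, $node) = @_;
    push @{ $self->{nodes} }, $node;
    return $self;
}
sub get_node {
    my ($self, $rank) = @_;
    my ($node) = grep { $_->rank eq $rank } @{ $self->{nodes} };
    return $node;
}

package Sketch::Species;               # legacy-style accessors live here only
our @ISA = ('Sketch::Taxonomy');
sub genus {
    my ($self) = @_;
    my $node = $self->get_node('genus');
    return $node ? $node->name : undef;
}
sub binomial {
    my ($self) = @_;
    my $node = $self->get_node('species');
    return $node ? $node->name : undef;
}

package main;

my $human = Sketch::Species->new;
$human->add_node(Sketch::Node->new(-rank => 'genus',   -name => 'Homo'));
$human->add_node(Sketch::Node->new(-rank => 'species', -name => 'Homo sapiens'));
print $human->genus, "\n";
print $human->binomial, "\n";
```

In this arrangement the node class never learns about genus() or binomial(); the species class derives them from whatever nodes it holds.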
From bix at sendu.me.uk Mon Jul 24 03:58:57 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Mon, 24 Jul 2006 08:58:57 +0100 Subject: [Bioperl-l] Bio::Species/Bio::Taxonomy changes In-Reply-To: <7A7BF302-D84E-427B-ABC0-3EB4D19B33E2@uiuc.edu> References: <7A7BF302-D84E-427B-ABC0-3EB4D19B33E2@uiuc.edu> Message-ID: <44C47DC1.8020503@sendu.me.uk> Chris Fields wrote: > Sendu, Hilmar, et al, > > I was looking through SeqIO::genbank and though I would bring up a > couple of things to think about re: GenBank Taxonomy information. [...] > SOURCE - Common name of the organism or the name most frequently used > in the literature. Mandatory keyword in all annotated entries/one or > more records/includes one subkeyword. [...] > Free-format information including an abbreviated form of the organism > name, sometimes followed by a molecule type. (See section 3.4.10 of > the GenBank release notes for more info.) > > The SOURCE can also include the organelle and also may include > additional information, such as an abbreviated name and a common name > in parentheses. More specifically: (from 3.4.10 ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt) The SOURCE field consists of two parts. The first part is found after the SOURCE keyword and contains free-format information including an abbreviated form of the organism name followed by a molecule type; multiple lines are allowed, but the last line must end with a period. The second part consists of information found after the ORGANISM subkeyword. The formal scientific name for the source organism (genus and species, where appropriate) is found on the same line as ORGANISM. The records following the ORGANISM line list the taxonomic classification levels, separated by semicolons and ending with a period. > The common_name (), though as used by Bio::SeqIO::genbank, is the > entire SOURCE line (not just the abbreviated name, but the name and > everything else). No additional parsing is performed on it. 
> write_seq() also seems to do the wrong thing when rebuilding the > SOURCE line as well as the method writes the subspecies to the line. > > I plan on using Bio::SeqIO::genbank as a guinea pig of sorts to try > using Bio::Taxonomy::Node objects instead of Bio::Species, then get > the parsing for these lines corrected and simplified. Essentially, > the way NCBI describes it, the main name on the line is actually the > free-form abbreviated name, the name in parentheses is the common > name (optionally present), and the organelle precedes all of these if > present. I want to try getting common_name() to match the common > name found for taxonomy (baker's yeast) rather than have it be a > simple container, add an abbreviated_name() method for the name > container for the SOURCE line, and have the organelle() method > actually be used if an organelle is present (it doesn't seem to be > set at the moment in SeqIO::genbank). This is not how I read the specification. Everything on the same line as 'Source' is free-format text and therefore cannot be parsed. For the purposes of writing out it must be stored as-is, but it serves no other useful purpose. The file also provides the scientific name which can be used to do an accurate database lookup, which in turn gives you access to the common names, like "baker's yeast". On a side note, why would we care about 'organelle' when we're dealing with taxonomy? Why does the NCBI taxonomy db have a slot for organelle? From bix at sendu.me.uk Mon Jul 24 04:45:38 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Mon, 24 Jul 2006 09:45:38 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: References: <44BBBB69.6000906@sendu.me.uk> <44BF86AF.8080408@sendu.me.uk> Message-ID: <44C488B2.5070806@sendu.me.uk> Hilmar Lapp wrote: > On Jul 20, 2006, at 9:35 AM, Sendu Bala wrote: > >> Bio::DB::Taxonomy::flatfile >> --------------------------- >> [...]
>> >> BEHAVIOUR-CHANGE: flatfile used to store within the nodes it makes the >> division as a three letter code, like 'PRI'. However, for consistency >> with entrez and the scientific_name() of the node the division is >> supposed to correspond to, it is now stored as the full name, like >> 'Primates'. > > What about adding a method division_code() which would return the > 3-letter abbreviation? > > The abbreviation may be needed by flat-file writers, so it may be > handy to have in some cases. As far as I know you can't get the 3-letter version via entrez, so no other module can really expect to be able to get it, not knowing which database (flatfile.pm or entrez.pm) the taxonomic information is coming from. But of course it would be somewhat harmless to add division_code() anyway. It might be better done as a -code => 1 option to division()? >> The names->id solution also stores the artificially uniqued names like >> 'Craniata ', allowing you for the first time to retrieve the >> correct id. Previously the search would have simply failed completely. >> >> The names->id solution now handles nodes with scientific names of 'xyz >> (class)', allowing you to retrieve the id with both >> get_taxonids('xyz') >> and get_taxonids('xyz (class)'). Previously only the latter would >> work. > > Should angle brackets be allowed too? Allowed in what sense? You can indeed search for both get_taxonids('Craniata ') [returns a single id] and get_taxonids('Craniata') [returns multiple ids, one of which is the previous answer]. > Maybe there should also be a -names parameter which accepts a hash > reference with keys being the kind of name (scientific, common, etc) > and the values being array references with the set of names of that > kind? Not sure what you mean. name() has that data structure, though you're not supposed to set its hash ref directly. >> or the $node->classification() array. > > Bio::Taxonomy::Node shouldn't have this attribute.
It is legacy > brought over from a flawed (because flat) object model in Bio::Species. Yes, I agree. >> NOTE: entrez modules (and website) cannot cope with '' >> in the >> query, failing searches like 'Craniata '. For this >> reason, if >> get_taxonids() is given a query with '' it will immediately >> return undefined, saving a pointless website access. > > If there is a 'next-best-thing' that is still semantically compatible > with the API documentation, I would do that. > > In this case, if there is a in the query the entrez > module should strip it and automatically use the rest for searching. > If indeed multiple IDs match there should be a warning to inform the > user that entrez cannot use the notation to limit the > query results. I wouldn't like this. I actually had it working this way initially, but decided that if someone entered 'xyz ' they really didn't want multiple ids, expected to get multiple ids with just 'xyz' and don't want their query made something else and then be warned about it. > In fact, you might as well provide an option to enable an automatic > check for the correct branch for each ID if multiple ones are > returned. I.e., if this option is enabled, the module would > automatically query the parent nodes to see if is in the > lineage, and if not will remove the respective ID from the result > set. The reason you may want to make it optional is because it > potentially costs time. (but in reality I'm not sure why a client > will not want to enable the option - so maybe this should even be > default) I can certainly add that, it seems like a good idea. I don't, however, see any scope for an option at all. What would the option be called? -don't_give_me_the_answer_I_actually_want_to_save_time ? Pointless, imho. If the user queries 'xyz ' with that option, they're just going to have to do for themselves manually what the method would have done for them without that option, in order to get the correct answer. 
It'll be slower that way, if anything. So the option would actually be called -don't_give_me_the_answer_I_actually_want_so_I_can_get_it_myself_a_little_slower (!). >> Bio::Taxonomy::Node >> ------------------- >> [...] >> classification() has a proper solution to finding the classification >> when the array wasn't manually set. >> >> # Improvements >> BEHAVIOUR-CHANGE: node_name() used to be an alias to name >> ('common'). Now >> it is an alias to name('scientific'). >> NOTE: node_name is what is set when ->new(-name => $name) is set, so >> flatfile and entrez and user-created nodes now implicitly associate >> the >> name of the node they create with its scientific name. > > I'm not even sure node_name() should just be deprecated. The method > falsely suggests that there is only a single and definitive name for > the taxon node. > > In NCBI reality, this is only true for the scientific name of the > node. In real reality, many nodes have multiple scientific names - > taxonomy isn't static and therefore the scientific naming of nodes > isn't either. For the programmer not using any database but just making up his own nodes, I think he needs a node_name() because he may not be thinking about anything fancy or realistic. He just wants to give his node a single name that he invents. node_name() seems like the ideal method name to me. From jaynelvallance at hotmail.com Mon Jul 24 05:45:50 2006 From: jaynelvallance at hotmail.com (Jayne Vallance) Date: Mon, 24 Jul 2006 09:45:50 +0000 Subject: [Bioperl-l] SearchIO - Stop throwing away data Message-ID: Hi I am developing someone else's work. I wondered whether anyone could identify the mistake that the previous coder made? I am not very familiar with SearchIO yet. They are trying to extract filenames from an output report.
This is their code:

    # store the query name of the mito db blast hits into an array
    my $searchio = new Bio::SearchIO( -file => $blast_mito_output );
    # array to store the mitochondrial BLAST database hits
    my @mito_hits;
    # name of query for BLAST hit
    my $query_name;
    while ( my $result = $searchio->next_result() ) {
        # get the hits and their associated name
        # do not want to include these in the clustering step
        while( my $hit = $result->next_hit ) {
            # store the names of these hits into an array
            # these filenames will not be copied over
            $query_name = $result->query_name();
            #print "\nQuery $query_name\n";
            push(@mito_hits, $query_name);
        }
    }

I think they have based it on the code at http://www.bioperl.org/wiki/HOWTO:SearchIO#Authors

    use Bio::SearchIO;
    use Bio::SearchIO::FastHitEventBuilder;
    my $searchio = new Bio::SearchIO(-format => $format, -file => $file);
    $searchio->attach_EventHandler(Bio::SearchIO::FastHitEventBuilder->new);
    while( my $r = $searchio->next_result ) {
        while( my $h = $r->next_hit ) {
            # Hits will NOT have HSPs
            print $h->significance,"\n";
        }
    }

which "throws away data you don't want"???

I am finding that our code is finding the last file name in the output report, rather than each and every one. I suspect it is overwriting (or throwing away the data). How do I need to change the code to make sure *every* file name goes into @mito_hits?

Thank you
Jayne

From simon.andrews at bbsrc.ac.uk Mon Jul 24 07:14:08 2006
From: simon.andrews at bbsrc.ac.uk (simon andrews (BI))
Date: Mon, 24 Jul 2006 12:14:08 +0100
Subject: [Bioperl-l] SearchIO - Stop throwing away data
In-Reply-To:
Message-ID:

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org
> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of
> Jayne Vallance
> Sent: 24 July 2006 10:46
> To: bioperl-l at lists.open-bio.org
> Subject: [Bioperl-l] SearchIO - Stop throwing away data
>
> Hi
>
> I am developing someone
> else's work. I wondered whether anyone could identify the
> mistake that the previous coder made?
> I am not very familiar with SearchIO yet.
>
> They are trying to extract filenames from an output report.

I'm not sure what you mean by filenames here. The value which is being collected in your code snippet is the name of the original query sequence.

> This is their code:
> while ( my $result = $searchio->next_result() ) {
> # get the hits and their associated name
> # do not want to include these in the clustering step
> while( my $hit = $result->next_hit ) {
> # store the names of these hits into an array
> # these filenames will not be copied over
> $query_name = $result->query_name();
> #print "\nQuery $query_name\n";
> push(@mito_hits, $query_name);

OK, this bit is odd. You're collecting the name of the query sequence but you're doing it as you're looping through the hits. Since all the hits come from the same result you're just going to get the same query name put into your array multiple times (once for each hit). This almost certainly isn't what you want. If you just want the name of the query sequence you can miss out the inner loop (the $result->next_hit() loop).
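To make the looping behaviour just described concrete, here is a small plain-Perl sketch. It deliberately does not use Bio::SearchIO: the @results structure and the query and hit names are invented stand-ins for a parsed report, so that the example runs on its own. The first push mirrors the posted code (one entry per hit), the second records each query name once per result with no inner loop.

```perl
use strict;
use warnings;

# Made-up stand-in for a parsed report: one entry per query ("result"),
# each with the names of the sequences it hit.
my @results = (
    { query_name => 'query_A', hits => [ 'mito_1', 'mito_2' ] },
    { query_name => 'query_B', hits => [] },
    { query_name => 'query_C', hits => [ 'mito_3' ] },
);

my (@per_hit, @per_result);
for my $result (@results) {
    # Pattern from the posted code: the query name is pushed once per hit,
    # so a query with two hits appears twice and one with none not at all.
    for my $hit (@{ $result->{hits} }) {
        push @per_hit, $result->{query_name};
    }

    # Without the inner loop: every query name is recorded exactly once.
    push @per_result, $result->{query_name};
}

print "pushed per hit:    @per_hit\n";
print "pushed per result: @per_result\n";
```

With real SearchIO objects the hash lookups would be replaced by the query_name() and next_hit() calls shown in the quoted code.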
If you actually want to collect the names of the sequences which were hit then you need to collect $hit->name() rather than $result->query_name();

> }
> }
>
> I think they have based it on the code at
> http://www.bioperl.org/wiki/HOWTO:SearchIO#Authors
> $searchio->attach_EventHandler(Bio::SearchIO::FastHitEventBuilder->new);
> while( my $r = $searchio->next_result ) { while( my $h =
> $r->next_hit ) {
> # Hits will NOT have HSPs
> print $h->significance,"\n";
> }
>
> which "throws away data you don't want"???

Indeed, but probably not in the way you're thinking. The data it throws away is the details of each individual HSP (mostly the alignment data). You're not using HSP data in your code so it will have no effect (other than making it a bit quicker). It doesn't throw away whole hits or anything like that.

> I am finding that our code is finding the last file name in
> the output report, rather than each and every one. I suspect
> it is overwriting (or throwing away the data).

I suspect then that you should be collecting the hit names rather than the query names?

Simon.

From hlapp at gmx.net Mon Jul 24 08:20:00 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Mon, 24 Jul 2006 08:20:00 -0400 Subject: [Bioperl-l] Bio::Species/Bio::Taxonomy changes In-Reply-To: <44C47DC1.8020503@sendu.me.uk> References: <7A7BF302-D84E-427B-ABC0-3EB4D19B33E2@uiuc.edu> <44C47DC1.8020503@sendu.me.uk> Message-ID: <27E70EEF-28F6-4B54-9CCB-6090E338CEFB@gmx.net> On Jul 24, 2006, at 3:58 AM, Sendu Bala wrote: > On a side note, why would we care about 'organelle' when we're dealing > with taxonomy? Why does the NCBI taxonomy db have a slot for > organelle? Because some sequences are of the organelle DNA, and Genbank needs a way to express this. Highly artificial, but still can't be ignored.
-- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at gmx.net Mon Jul 24 08:27:28 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Mon, 24 Jul 2006 08:27:28 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C488B2.5070806@sendu.me.uk> References: <44BBBB69.6000906@sendu.me.uk> <44BF86AF.8080408@sendu.me.uk> <44C488B2.5070806@sendu.me.uk> Message-ID: <11A2B917-C633-4806-A6F4-920F02F0BF6E@gmx.net> :-) I think we're largely in agreement. As for node_name() I fully understand the motivation, but it needs to be understood that the attribute's value will be based on a largely arbitrary choice unless it is set directly by the user. -hilmar On Jul 24, 2006, at 4:45 AM, Sendu Bala wrote: > Hilmar Lapp wrote: >> On Jul 20, 2006, at 9:35 AM, Sendu Bala wrote: >> >>> Bio::DB::Taxonomy::flatfile >>> --------------------------- >>> [...] >>> >>> BEHAVIOUR-CHANGE: flatfile used to store within the nodes it >>> makes the >>> division as a three letter code, like 'PRI'. However, for >>> consistency >>> with entrez and the scientific_name() of the node the division is >>> supposed to correspond to, it is now stored as the full name, like >>> 'Primates'. >> >> What about adding a method division_code() which would return the 3- >> letter abbreviation? >> >> The abbreviation may be needed by flat-file writers, so it may be >> handy to have in some cases. > > As far as I know you can't get the 3-letter version via entrez, so no > other module can really expect to be able to get it, not knowing which > database (flatfile.pm or entez.pm) the taxonomic information is > coming from. > > But of course it would be somewhat harmless to add division_code() > anyway. It might be better done as a -code => 1 option to division()? 
> > >>> The names->id solution also stores the artificially uniqued names >>> like >>> 'Craniata ', allowing you for the first time to >>> retrieve the >>> correct id. Previously the search would have simply failed >>> completely. >>> >>> The names->id solution now handles nodes with scientific names of >>> 'xyz >>> (class)', allowing you to retrieve the id with both get_taxonids >>> ('xyz') >>> and get_taxonids('xyz (class)'). Previously only the latter would >>> work. >> >> Should angle brackets be allowed too? > > Allowed in what sense? You can indeed search for both > get_taxonids('Craniata ') [returns a single id] and > get_taxonids('Craniata') [returns multipe ids, one of which is the > previous answer]. > > >> Maybe there should also be a -names parameter which accepts a hash >> reference with keys being the kind of name (scientific, common, etc) >> and the values being array references with the set of names of that >> kind? > > Not sure what you mean. name() has that data structure, though you're > not supposed to set its hash ref directly. > > >>> or the $node->classification() array. >> >> Bio::Taxonomy::Node shouldn't have this attribute. It is legacy >> brought over from a flawed (because flat) object model in >> Bio::Species. > > Yes, I agree. > > >>> NOTE: entrez modules (and website) cannot cope with '' >>> in the >>> query, failing searches like 'Craniata '. For this >>> reason, if >>> get_taxonids() is given a query with '' it will >>> immediately >>> return undefined, saving a pointless website access. >> >> If there is a 'next-best-thing' that is still semantically compatible >> with the API documentation, I would do that. >> >> In this case, if there is a in the query the entrez >> module should strip it and automatically use the rest for searching. >> If indeed multiple IDs match there should be a warning to inform the >> user that entrez cannot use the notation to limit the >> query results. > > I wouldn't like this. 
I actually had it working this way initially, > but > decided that if someone entered 'xyz ' they really didn't > want multiple ids, expected to get multiple ids with just 'xyz' and > don't want their query made something else and then be warned about > it. > > >> In fact, you might as well provide an option to enable an automatic >> check for the correct branch for each ID if multiple ones are >> returned. I.e., if this option is enabled, the module would >> automatically query the parent nodes to see if is in the >> lineage, and if not will remove the respective ID from the result >> set. The reason you may want to make it optional is because it >> potentially costs time. (but in reality I'm not sure why a client >> will not want to enable the option - so maybe this should even be >> default) > > I can certainly add that, it seems like a good idea. I don't, however, > see any scope for an option at all. What would the option be called? > -don't_give_me_the_answer_I_actually_want_to_save_time ? Pointless, > imho. If the user queries 'xyz ' with that option, they're > just going to have to do for themselves manually what the method would > have done for them without that option, in order to get the correct > answer. It'll be slower that way, if anything. So the option would > actually be called > - > don't_give_me_the_answer_I_actually_want_so_I_can_get_it_myself_a_litt > le_slower > (!). > > >>> Bio::Taxonomy::Node >>> ------------------- >>> [...] >>> classification() has a proper solution to finding the classification >>> when the array wasn't manually set. >>> >>> # Improvements >>> BEHAVIOUR-CHANGE: node_name() used to be an alias to name >>> ('common'). Now >>> it is an alias to name('scientific'). >>> NOTE: node_name is what is set when ->new(-name => $name) is set, so >>> flatfile and entrez and user-created nodes now implicitly associate >>> the >>> name of the node they create with its scientific name. 
>> >> I'm not even sure node_name() should just be deprecated. The methods >> falsely suggests that there is only a single and definitive name for >> the taxon node. >> >> In NCBI reality, this is only true for the scientific name of the >> node. In real reality, many nodes have multiple scientific names - >> taxonomy isn't static and therefore the scientific naming of nodes >> isn't either. > > For the programmer not using any database but just making up his own > nodes, I think he needs a node_name() because he may not be thinking > about anything fancy or realistic. He just want to give his node a > single name that he invents. node_name() seems like the ideal method > name to me. > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at gmx.net Mon Jul 24 08:31:44 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Mon, 24 Jul 2006 08:31:44 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C48323.5060704@sendu.me.uk> References: <001001c6ac2f$02c8c3f0$15327e82@pyrimidine> <44C00805.7090403@sendu.me.uk> <9777D0A4-5340-4367-879D-F9125393B4FD@uiuc.edu> <21C7763A-B8FF-4E91-BAEC-E5F79E462DD3@gmx.net> <44C48323.5060704@sendu.me.uk> Message-ID: <8B9F12C5-C520-48CF-825F-DD24975B43BB@gmx.net> Sounds good to me, except there is no Bio::TaxonomyI yet, and also Bio::Species shouldn't fully depend on an internet connection or flat file to do anything meaningful. I.e., it should take advantage of a lookup database if there is one, but in the absence of that one should also be able to statically set attribute values to whatever one thinks can be gleaned from a parsed text or whatever. 
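The behaviour asked for here (use a lookup database when one is attached, otherwise fall back to whatever was set statically) might look roughly like the sketch below. It is an assumption-laden illustration only: Sketch::Species is a hypothetical class, the -db argument is a plain hash keyed by taxon ID standing in for a real taxonomy database handle, and 'H.' stands in for a value gleaned from parsed text.

```perl
use strict;
use warnings;

package Sketch::Species;   # hypothetical; not the real Bio::Species

sub new {
    my ($class, %args) = @_;
    return bless {
        taxid => $args{-taxid},
        genus => $args{-genus},   # value set statically, e.g. parsed from a file
        db    => $args{-db},      # optional lookup table keyed by taxon ID
    }, $class;
}

sub genus {
    my ($self) = @_;
    my $db = $self->{db};
    # Prefer the lookup database when one is attached and knows this taxon.
    if ($db && defined $self->{taxid} && exists $db->{ $self->{taxid} }) {
        return $db->{ $self->{taxid} }{genus};
    }
    # Otherwise fall back to the statically set value.
    return $self->{genus};
}

package main;

my %lookup = ( 9606 => { genus => 'Homo' } );   # stand-in for a taxonomy database

my $offline = Sketch::Species->new(-taxid => 9606, -genus => 'H.');
my $online  = Sketch::Species->new(-taxid => 9606, -genus => 'H.', -db => \%lookup);

print $offline->genus, "\n";   # statically set value
print $online->genus, "\n";    # value from the lookup table
```

The object stays usable with no network connection or flat file, and gains accuracy when a database is available.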
-hilmar On Jul 24, 2006, at 4:21 AM, Sendu Bala wrote: >> I think this would be the way to go. I.e., >> >> >> |------Node >> NodeI----| >> |-| >> |----SpeciesNode >> Species----| > > Actually, if we're changing the name of the module that Species > interacts with, any existing code needs to be re-written. So why not > just do it properly and have Bio::Species interact with Bio::Taxonomy? > > |----Bio::Taxonomy > Bio::TaxonomyI----| > |----Bio::Species > > Or > > Bio::TaxonomyI----|----Bio::Taxonomy----|----Bio::Species > > Leaving Node completely free to be just a node. This way we don't > have a > crufty SpeciesNode there simply for the sake of Bio::Species. > Bio::Species itself provides all the legacy stuff it needs for itself, > while interacting with Nodes via TaxonomyI methods in the 'correct' > way > only. > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From bix at sendu.me.uk Mon Jul 24 08:34:45 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Mon, 24 Jul 2006 13:34:45 +0100 Subject: [Bioperl-l] Bio::Species/Bio::Taxonomy changes In-Reply-To: <27E70EEF-28F6-4B54-9CCB-6090E338CEFB@gmx.net> References: <7A7BF302-D84E-427B-ABC0-3EB4D19B33E2@uiuc.edu> <44C47DC1.8020503@sendu.me.uk> <27E70EEF-28F6-4B54-9CCB-6090E338CEFB@gmx.net> Message-ID: <44C4BE65.8080304@sendu.me.uk> Hilmar Lapp wrote: > > On Jul 24, 2006, at 3:58 AM, Sendu Bala wrote: > >> On a side note, why would we care about 'organelle' when we're dealing >> with taxonomy? Why does the NCBI taxonomy db have a slot for organelle? > > Because some sequences are of the organelle DNA, and Genbank needs a way > to express this. Highly artificial, but still can't be ignored. Ok, but why is it stored as part of the taxonomy? Why isn't it stored in its own field? And does /bioperl/ have to store it as part of the taxonomy? 
Maybe the file parser could have its own organelle() method and leave all taxonomic classes without such a method. Or it could stay as is, I don't know. Do different organelles in the same species get unique taxonomy ids? From hlapp at gmx.net Mon Jul 24 08:46:51 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Mon, 24 Jul 2006 08:46:51 -0400 Subject: [Bioperl-l] Bio::Species/Bio::Taxonomy changes In-Reply-To: <44C4BE65.8080304@sendu.me.uk> References: <7A7BF302-D84E-427B-ABC0-3EB4D19B33E2@uiuc.edu> <44C47DC1.8020503@sendu.me.uk> <27E70EEF-28F6-4B54-9CCB-6090E338CEFB@gmx.net> <44C4BE65.8080304@sendu.me.uk> Message-ID: <2C99E56B-84D2-4C51-BBF1-76BAF81205AB@gmx.net> On Jul 24, 2006, at 8:34 AM, Sendu Bala wrote: > Hilmar Lapp wrote: >> >> On Jul 24, 2006, at 3:58 AM, Sendu Bala wrote: >> >>> On a side note, why would we care about 'organelle' when we're >>> dealing >>> with taxonomy? Why does the NCBI taxonomy db have a slot for >>> organelle? >> Because some sequences are of the organelle DNA, and Genbank needs >> a way >> to express this. Highly artificial, but still can't be ignored. > > Ok, but why is it stored as part of the taxonomy? Why isn't it > stored in > its own field? And does /bioperl/ have to store it as part of the > taxonomy? No, but clients need to be able to obtain it. Organelles have their own genome. If we talk about the human genome, for instance, most commonly we refer to the nuclear genome only. > Maybe the file parser could have its own organelle() method > and leave all taxonomic classes without such a method. Or it could > stay > as is, I don't know. Like I said above, at the end of the day there needs to be a way to qualify a sequence by the genome it is part of. > > Do different organelles in the same species get unique taxonomy ids? I would have to confirm, but I believe so. As I said, from a genome/ sequence-centric viewpoint, the organelle and nuclear genomes are two different things. 
-hilmar

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================

From simon.andrews at bbsrc.ac.uk  Mon Jul 24 09:34:10 2006
From: simon.andrews at bbsrc.ac.uk (simon andrews (BI))
Date: Mon, 24 Jul 2006 14:34:10 +0100
Subject: [Bioperl-l] New EMBL format parsing/writing
Message-ID: 

A few weeks ago I saw a couple of messages on this list mentioning the
new ID/SV line format used in the latest EMBL release. I'm in the
process of moving our database server over to the new format and was
looking to update SeqIO::embl.pm.

I'm sure someone said they'd made a patch to fix up parsing of the new
format, but I can't find it either in CVS or bugzilla.

Rather than do this again myself, can someone point me to an updated
SeqIO::embl.pm please? If there isn't one then I'll look into making
the patch myself.

Since this is such a major change, are there any plans to put out a new
release with this fix included? I'm sure this will start to bite more
people as the new format becomes more widely adopted.

Cheers

Simon.

-- 
Simon Andrews PhD
Bioinformatics Group
The Babraham Institute

simon.andrews at bbsrc.ac.uk
+44 (0) 1223 496463

From cjfields at uiuc.edu  Mon Jul 24 09:44:37 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Mon, 24 Jul 2006 08:44:37 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <8B9F12C5-C520-48CF-825F-DD24975B43BB@gmx.net>
References: <001001c6ac2f$02c8c3f0$15327e82@pyrimidine>
	<44C00805.7090403@sendu.me.uk>
	<9777D0A4-5340-4367-879D-F9125393B4FD@uiuc.edu>
	<21C7763A-B8FF-4E91-BAEC-E5F79E462DD3@gmx.net>
	<44C48323.5060704@sendu.me.uk>
	<8B9F12C5-C520-48CF-825F-DD24975B43BB@gmx.net>
Message-ID: <5CE136DF-2443-4AC0-AD12-B333D894A75B@uiuc.edu>

Hence the reason to have it be a hybrid of Bio::Species and Tax::Node.
Bio::SeqIO::genbank works very happily with the current Bio::Taxonomy::Node now; if we intend to remove most of the method we need to have a similar DB-aware module to house the flatfile data (like Bio::Species) yet be capable of working with Bio::Taxonomy (like Tax::Node). As for organelle(), that could be made into something else (Bio::Annotation::SimpleValue or similar) but as it's always been included with the tax data, that's where it has been. The TaxID in the 'source' seqfeature doesn't refer to the organelle but the organism. Chris On Jul 24, 2006, at 7:31 AM, Hilmar Lapp wrote: > Sounds good to me, except there is no Bio::TaxonomyI yet, and also > Bio::Species shouldn't fully depend on an internet connection or flat > file to do anything meaningful. > > I.e., it should take advantage of a lookup database if there is one, > but in the absence of that one should also be able to statically set > attribute values to whatever one thinks can be gleaned from a parsed > text or whatever. > > -hilmar > > On Jul 24, 2006, at 4:21 AM, Sendu Bala wrote: > >>> I think this would be the way to go. I.e., >>> >>> >>> |------Node >>> NodeI----| >>> |-| >>> |----SpeciesNode >>> Species----| >> >> Actually, if we're changing the name of the module that Species >> interacts with, any existing code needs to be re-written. So why not >> just do it properly and have Bio::Species interact with >> Bio::Taxonomy? >> >> |----Bio::Taxonomy >> Bio::TaxonomyI----| >> |----Bio::Species >> >> Or >> >> Bio::TaxonomyI----|----Bio::Taxonomy----|----Bio::Species >> >> Leaving Node completely free to be just a node. This way we don't >> have a >> crufty SpeciesNode there simply for the sake of Bio::Species. >> Bio::Species itself provides all the legacy stuff it needs for >> itself, >> while interacting with Nodes via TaxonomyI methods in the 'correct' >> way >> only. 
>> > > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From bix at sendu.me.uk Mon Jul 24 09:49:42 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Mon, 24 Jul 2006 14:49:42 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <8B9F12C5-C520-48CF-825F-DD24975B43BB@gmx.net> References: <001001c6ac2f$02c8c3f0$15327e82@pyrimidine> <44C00805.7090403@sendu.me.uk> <9777D0A4-5340-4367-879D-F9125393B4FD@uiuc.edu> <21C7763A-B8FF-4E91-BAEC-E5F79E462DD3@gmx.net> <44C48323.5060704@sendu.me.uk> <8B9F12C5-C520-48CF-825F-DD24975B43BB@gmx.net> Message-ID: <44C4CFF6.40609@sendu.me.uk> Hilmar Lapp wrote: > Sounds good to me, except there is no Bio::TaxonomyI yet, Indeed, I propose making one. > Bio::Species shouldn't fully depend on an internet connection or flat > file to do anything meaningful. > > I.e., it should take advantage of a lookup database if there is one, but > in the absence of that one should also be able to statically set > attribute values to whatever one thinks can be gleaned from a parsed > text or whatever. Yes, which is why Bio::Taxonomy is appropriate here. Assuming that Bio::Species isa Bio::TaxonomyI: ... SOURCE Saccharomyces cerevisiae (baker's yeast) ORGANISM Saccharomyces cerevisiae Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes; Saccharomycetales; Saccharomycetaceae; Saccharomyces. ... 
## the fully-manual way
my $species = new Bio::Species;
my $node = new Bio::Taxonomy::Node(-name => 'Saccharomyces cerevisiae',
                                   -rank => 'species',
                                   -object_id => 1,
                                   -parent_id => 2);
my $n2 = new Bio::Taxonomy::Node(-name => 'Saccharomyces',
                                 -object_id => 2,
                                 -parent_id => 3);
# (no assumption that 'Saccharomyces' is the genus, so rank() undefined)
my $n3 = [etc]
$species->add_node($node);
$species->add_node($n2);
[etc]

## Using a factory without db access
# assume that Bio::Taxonomy::GenbankFactory implements
# some modified Bio::Taxonomy::FactoryI
my $factory = Bio::Taxonomy::GenbankFactory->new();
my $species = $factory->generate(-classification => ['Saccharomyces cerevisiae',
                                                     'Saccharomyces',
                                                     'Saccharomycetaceae' ...]);
# the generate() method above just does the fully-manual way for you

## Using a factory with db access
# assume that Bio::Taxonomy::EntrezFactory implements some
# modified Bio::Taxonomy::FactoryI and uses Bio::DB::Taxonomy::entrez
# to get the nodes
my $factory = Bio::Taxonomy::EntrezFactory->new();
my $species = $factory->fetch(-scientific_name => 'Saccharomyces cerevisiae');

# (would probably want to come up with a more generic name for the
# fetch() and generate() methods, so that all Factories use the
# same method name)

It's very clean and flexible this way. Ultimately you always make your
Bio::Species the same way - you add nodes to it. You can make those
nodes yourself or use a factory.

We also solve Chris' earlier quandary:

[ in a world where Bio::Taxonomy::Node and Bio::Taxonomy::SpeciesNode
exist, and given that Bio::DB::Taxonomy* currently directly make Node
objects ]
> The only problem I can foresee is which class to use with
> Bio::DB::Taxonomy*? I guess one could settle on one class by default and
> have the option to use another Bio::Taxonomy::NodeI-implementing class if
> you wanted more data/methods available...
The way to do it is to have the Bio::DB::Taxonomy* modules return only the information that a Bio::Taxonomy::FactoryI would need to make a NodeI. The specific Factory that you use could generate whatever type of Node you wanted. But actually I propose there is only one Node and the specific Factory that you use determines the kind of Bio::TaxonomyI made; GenbankFactory might make a Bio::Species, while EntrezFactory might make a Bio::Taxonomy. Bio::Species differs from Bio::Taxonomy only so it contains all the legacy methods names that Bio::Species currently has, for backward compatibility. Setting $species->classification() would delete all nodes of self, use a GenbankFactory to make a new Bio::Species, then pull out all its Nodes and add them to self. Unless anyone can think of a better way of doing things, I'll explore the above ideas and start writing code. To summarise: major changes to Bio::DB::Taxonomy* (make them factory slaves), implementation of some Bio::Taxonomy::FactoryIs, tweak Bio::Taxonomy::FactoryI and make Bio::TaxonomyI, make Bio::Species a Bio::TaxonomyI. Oh, Bio::Taxonomy might need some changes as well. It has a classify() method does something with a Bio::Species, which would be all wrong in the new way of doing things. 
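The proposal above (a container you add nodes to, plus a factory whose generate() does the manual node-building for you) can be sketched in a few lines of plain Perl. This is only an illustration under stated assumptions: the Toy::* packages, their methods and the lineage() helper are hypothetical names invented for this sketch and are not existing or planned BioPerl classes.

```perl
use strict;
use warnings;

# All package and method names below are hypothetical stand-ins for the
# proposal; none of them exist in BioPerl.

package Toy::Node;

sub new {
    my ($class, %args) = @_;
    return bless {
        name      => $args{-name},
        rank      => $args{-rank},
        id        => $args{-object_id},
        parent_id => $args{-parent_id},
    }, $class;
}
sub name      { $_[0]{name} }
sub rank      { $_[0]{rank} }
sub id        { $_[0]{id} }
sub parent_id { $_[0]{parent_id} }

package Toy::Taxonomy;

sub new {
    my ($class) = @_;
    return bless { nodes => {} }, $class;
}

# The container is always filled the same way: one node at a time.
sub add_node {
    my ($self, $node) = @_;
    $self->{nodes}{ $node->id } = $node;
    return $self;
}

# Follow parent_id links from the given node towards the root,
# collecting node names along the way.
sub lineage {
    my ($self, $id) = @_;
    my @names;
    while (defined $id) {
        my $node = $self->{nodes}{$id} or last;
        push @names, $node->name;
        $id = $node->parent_id;
    }
    return @names;
}

package Toy::ClassificationFactory;

sub new {
    my ($class) = @_;
    return bless {}, $class;
}

# Build one node per name. The list runs from most specific to most
# general; ranks stay undefined because a bare lineage does not state them.
sub generate {
    my ($self, %args) = @_;
    my @names    = @{ $args{-classification} };
    my $taxonomy = Toy::Taxonomy->new;
    for my $i (0 .. $#names) {
        $taxonomy->add_node(Toy::Node->new(
            -name      => $names[$i],
            -object_id => $i + 1,
            -parent_id => ($i < $#names ? $i + 2 : undef),
        ));
    }
    return $taxonomy;
}

package main;

my $taxonomy = Toy::ClassificationFactory->new->generate(
    -classification => ['Saccharomyces cerevisiae',
                        'Saccharomyces',
                        'Saccharomycetaceae'],
);
print join('; ', $taxonomy->lineage(1)), "\n";
```

A database-backed factory would differ only in where the node data comes from; the container and the add_node() path would stay the same, which is the point of the design.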
From bix at sendu.me.uk Mon Jul 24 09:53:23 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Mon, 24 Jul 2006 14:53:23 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <5CE136DF-2443-4AC0-AD12-B333D894A75B@uiuc.edu> References: <001001c6ac2f$02c8c3f0$15327e82@pyrimidine> <44C00805.7090403@sendu.me.uk> <9777D0A4-5340-4367-879D-F9125393B4FD@uiuc.edu> <21C7763A-B8FF-4E91-BAEC-E5F79E462DD3@gmx.net> <44C48323.5060704@sendu.me.uk> <8B9F12C5-C520-48CF-825F-DD24975B43BB@gmx.net> <5CE136DF-2443-4AC0-AD12-B333D894A75B@uiuc.edu> Message-ID: <44C4D0D3.1020506@sendu.me.uk> Chris Fields wrote: > Bio::SeqIO::genbank works very happily with the current > Bio::Taxonomy::Node now; if we intend to remove most of the method we > need to have a similar DB-aware module to house the flatfile data (like > Bio::Species) yet be capable of working with Bio::Taxonomy (like Tax::Node). Can you give code examples of what Bio::SeqIO::genbank is doing and what makes it 'happy'? What are the requirements? Would it be as happy working with a Bio::Taxonomy object? From aramsey at vecna.com Mon Jul 24 10:23:46 2006 From: aramsey at vecna.com (Al Ramsey) Date: Mon, 24 Jul 2006 10:23:46 -0400 Subject: [Bioperl-l] Making BioPerl Faster Message-ID: <44C4D7F2.6020107@vecna.com> I'm interested into following up with a suggestion from the bioperl.org site about making it faster (http://www.bioperl.org/wiki/Why_BioPerl_is_slow). In particular, I wanted to look a little more into how the object instantiations might be more efficient. Is anyone else looking into this actively now? I want to ask if anyone had any additional insights that weren't previously published before I started. Thank you, Al Ramsey -- Alvin Ramsey, PhD. Vecna Technologies, Inc. 
5205 Leesburg Pike Falls Church, VA 22041 aramsey at vecna.com t: 703.998.5333 f: 703.998.5816 From s-merchant at northwestern.edu Mon Jul 24 11:09:49 2006 From: s-merchant at northwestern.edu (Sohel Merchant) Date: Mon, 24 Jul 2006 10:09:49 -0500 Subject: [Bioperl-l] obo_parser.t test warnings In-Reply-To: Message-ID: <004301c6af33$3564a8e0$c2987ca5@pc13> Hey Chris, I usually run perl with all warnings disabled. So I never saw these. I will put a fix to them sometime this week. Thanks, Sohel. _____ From: Chris Fields [mailto:cjfields at uiuc.edu] Sent: Sunday, July 23, 2006 2:10 PM To: bioperl-l List; Hilmar Lapp; s-merchant at northwestern.edu Subject: obo_parser.t test warnings Hilmar, Sohel, Didn't know who to notify, so sorry in advance about cross-posting this to the list. I was running through cleaning up some bugs and found that obo_parser.t is throwing a ton of warnings: bayou-75:~/Chris/Bioperl/bioperl-live natashacapell$ perl -I. -w t/obo_parser.t 1..40 "my" variable $val masks earlier declaration in same scope at Bio/OntologyIO/obo.pm line 592. "my" variable $qh masks earlier declaration in same scope at Bio/OntologyIO/obo.pm line 592. Use of uninitialized value in string eq at Bio/OntologyIO/obo.pm line 239, line 13. ... Good news: all tests pass! Cheers! Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From prabubio at gmail.com Mon Jul 24 11:39:43 2006 From: prabubio at gmail.com (Prabu R) Date: Mon, 24 Jul 2006 21:09:43 +0530 Subject: [Bioperl-l] Remote Blast Execution Message-ID: Dear All! I am trying to run Remote Blast using Bio::Tools::Run::RemoteBlast. I am not able to get the blast result. Upto my knowledge, the Bio::SearchIO::blast hash object does not returns any result. Secondly, I tried 'remote_blast.pl ' a program from CPAN bioperl 1.5release. 
Command:
perl bp_remote_blast.pl -p blastn -d est_mouse -e 1e-5 -i /home/prabucn/Blast/mm_test1.fa

Error Message:

retrieving blasts..

-------------------- WARNING ---------------------
MSG: Possible error (1) while parsing BLAST report!
---------------------------------------------------

Please help.

Thanks,
R. Prabu.


Please look into my test program.
----------------------------------------------------------------------------------------------
use Bio::Tools::Run::RemoteBlast;
use strict;
use Bio::SeqIO;
use Bio::SearchIO;

my $prog = 'blastn';
my $db   = 'est';
my $e_val= '1e-10';

my @params = ( '-prog' => $prog,
               '-data' => $db,
               '-expect' => $e_val,
               '-readmethod' => 'SearchIO' );

my $factory = Bio::Tools::Run::RemoteBlast->new(@params) || die "Cant do";

my $v = 1;

my $str = Bio::SeqIO->new(-file=>'mm_test2.txt' , '-format' => 'fasta' );

while (my $input = $str->next_seq()){
    my $r = $factory->submit_blast($input);

    print STDERR "waiting..." if( $v > 0 );
    while ( my @rids = $factory->each_rid ) {
        foreach my $rid ( @rids ) {
            my $rc = $factory->retrieve_blast($rid);

            if( !ref($rc) ) {
                if( $rc < 0 ) {
                    $factory->remove_rid($rid);
                }
                print STDERR "." if ( $v > 0 );
                sleep 5;
            } else {
                print "$rc\n";
                my $result = $rc->next_result();
                my $filename = $result->query_name()."\.out";
                $factory->save_output($filename);
                $factory->remove_rid($rid);
                print "\nQuery Name: ", $result->query_name(), "\n";
                while ( my $hit = $result->next_hit ) {
                    next unless ( $v > 0);
                    print "\thit name is ", $hit->name, "\n";
                    while( my $hsp = $hit->next_hsp ) {
                        print "\t\tscore is ", $hsp->score, "\n";
                    }
                }
            }
        }
    }
}
----------------------------------------------------------------------------------------------

-- 
"Every noble work is at first impossible."
- Thomas Carlyle From cjfields at uiuc.edu Mon Jul 24 11:48:45 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 24 Jul 2006 10:48:45 -0500 Subject: [Bioperl-l] SearchIO - Stop throwing away data In-Reply-To: Message-ID: <001701c6af38$a81c1580$15327e82@pyrimidine> > Hi > > I developing someone > elses work. I wondered whether anyone could identify the > mistake that the previous coder made? > I am not very familiar with SearchIO yet. > > They are trying to extract filenames from an output report. > This is their code: > > # store the query name of the mito db blast hits into an array > my $searchio = new Bio::SearchIO( -file => $blast_mito_output ); > # array to store the mitochondrial BLAST database hits > my @mito_hits; > # name of query for BLAST hit > my $query_name; > Just as a gripe here: you should always designate the '-format' here to be 'blast' for BLAST text output. my $searchio = new Bio::SearchIO(-file => $blast_mito_output, -format => 'blast' ); The default is still text, so the above works, but that very well may change in the future. Each BLAST report is a Result. Each Result contains one or more hits; each hit contains one or more HSPs. SearchIO only parses the information contained in the BLAST report (i.e. no filenames). From here, it looks like you want Hit information, though. The code below copies the query_name from the BlastResult object, $result (i.e. the name of your query sequence, the one you submitted for BLAST'ing against a database). You need the BlastHit data from $hit. 
Change :

    $query_name = $result->query_name();
    #print "\nQuery $query_name\n";
    push(@mito_hits, $query_name);

To :

    $hit_name = $hit->description();
    #print "\nHit $hit_name\n";
    push(@mito_hits, $hit_name);

or, for the hit accession, use

    $hit_name = $hit->accession();

For all accessions in the description (there may be multiples if sequences
are identical), use an array and

    @hit_name = $hit->get_all_accessions();

You can use a different EventHandler if you want to speed things up:

    my $searchio = new Bio::SearchIO(-format => $format, -file => $file);
    $searchio->attach_EventHandler(Bio::SearchIO::FastHitEventBuilder->new);

But to have this work you need to update to the latest CVS version of
bioperl; this was a recent bug that was fixed.

Chris

> while ( my $result = $searchio->next_result() ) {
>     # get the hits and their associated name
>     # do not want to include these in the clustering step
>     while( my $hit = $result->next_hit ) {
>         # store the names of these hits into an array
>         # these filenames will not be copied over
>         $query_name = $result->query_name();
>         #print "\nQuery $query_name\n";
>         push(@mito_hits, $query_name);
>     }
> }
> I think they have based it on the code at
> http://www.bioperl.org/wiki/HOWTO:SearchIO#Authors
>
> use Bio::SearchIO;
> use Bio::SearchIO::FastHitEventBuilder;
> my $searchio = new Bio::SearchIO(-format => $format, -file => $file);
>
> $searchio->attach_EventHandler(Bio::SearchIO::FastHitEventBuilder->new);
> while( my $r = $searchio->next_result ) {
>     while( my $h = $r->next_hit ) {
>         # Hits will NOT have HSPs
>         print $h->significance,"\n";
>     }
>
> which "throws away data you don't want"???
>
> I am finding that our code is finding the last file name in the output
> report,
> rather than each and every one. I suspect it is overwriting (or throwing
> away the data).
>
> How do I need to change the code to make sure *every* file name goes
> into @mito_hits?
> > Thankyou > > Jayne > > _________________________________________________________________ > The new MSN Search Toolbar now includes Desktop search! > http://join.msn.com/toolbar/overview > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From dwaner at scitegic.com Mon Jul 24 12:03:21 2006 From: dwaner at scitegic.com (dwaner at scitegic.com) Date: Mon, 24 Jul 2006 09:03:21 -0700 Subject: [Bioperl-l] New EMBL format parsing/writing Message-ID: Simon, I have already updated SeqIO::embl.pm to support release 87. All I have left to do is generate the patch and update the /t test. I will try to get this submitted to bugzilla today (24 July). - David From cjfields at uiuc.edu Mon Jul 24 12:04:40 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 24 Jul 2006 11:04:40 -0500 Subject: [Bioperl-l] Making BioPerl Faster In-Reply-To: <44C4D7F2.6020107@vecna.com> Message-ID: <001901c6af3a$df146ea0$15327e82@pyrimidine> Give it a look, sure! Not sure if this the only problem though when it comes to speed; I think it's more complicated than that. I think that (at least on WinXP) the Perl version used is also partially to blame. It's possible that something modified between v 5.6 and 5.8 slowed everything down considerably. I always wondered if it had something to do with Unicode support in perl 5.8 ... There is a report on Bugzilla about a dramatic slowdown on sequence parsing between v. 1.4 and v. 1.5 (including the latest, v 1.5.1) http://bugzilla.open-bio.org/show_bug.cgi?id=1875 This is unresolved at this time but may be unrelated to the possible perl versioning issue above. I've a feeling you may find regexes and redundant methods calls also add quite a bit of overhead. I've seen several places where accessors are called over and over w/o assigning to a local variable. Or places where a tr/// would work much faster than a s///. 
There was an instance of the latter in SeqIO which sped up parsing about 2-3x faster on WinXP. If you want to look at the impact of object instantiation on speed, check out Bio::SearchIO (parsing of BLAST/FASTA/HMMER reports). Lots of method calls, object creation, etc. Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Al Ramsey > Sent: Monday, July 24, 2006 9:24 AM > To: bioperl-l at lists.open-bio.org > Subject: [Bioperl-l] Making BioPerl Faster > > I'm interested into following up with a suggestion from the bioperl.org > site about making it faster > (http://www.bioperl.org/wiki/Why_BioPerl_is_slow). In particular, I > wanted to look a little more into how the object instantiations might be > more efficient. Is anyone else looking into this actively now? I want > to ask if anyone had any additional insights that weren't previously > published before I started. > > Thank you, > Al Ramsey > > > -- > Alvin Ramsey, PhD. > > Vecna Technologies, Inc. > 5205 Leesburg Pike > Falls Church, VA 22041 > aramsey at vecna.com > t: 703.998.5333 > f: 703.998.5816 > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Mon Jul 24 12:06:03 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 24 Jul 2006 11:06:03 -0500 Subject: [Bioperl-l] Remote Blast Execution In-Reply-To: Message-ID: <001a01c6af3b$10187f50$15327e82@pyrimidine> You need to update to the latest code (bioperl-live) from CVS. BLAST parsing using RemoteBlast is broken in all the latest releases. 
Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Prabu R > Sent: Monday, July 24, 2006 10:40 AM > To: bioperl-l at lists.open-bio.org > Subject: [Bioperl-l] Remote Blast Execution > > Dear All! > > I am trying to run Remote Blast using Bio::Tools::Run::RemoteBlast. > > I am not able to get the blast result. > Upto my knowledge, the Bio::SearchIO::blast hash object does not returns > any > result. > > > Secondly, I tried 'remote_blast.pl ' a program from CPAN bioperl > 1.5release. > > Command: > perl bp_remote_blast.pl -p blastn -d est_mouse -e 1e-5 -i > /home/prabucn/Blast/mm_test1.fa > > Error Message: > > retrieving blasts.. > > -------------------- WARNING --------------------- > MSG: Possible error (1) while parsing BLAST report! > --------------------------------------------------- > > Please help. > > Thanks, > R. Prabu. > > > Please look into my test program. > -------------------------------------------------------------------------- > -------------------- > use Bio::Tools::Run::RemoteBlast; > use strict; > use Bio::SeqIO; > use Bio::SearchIO; > > my $prog = 'blastn'; > my $db = 'est'; > my $e_val= '1e-10'; > > my @params = ( '-prog' => $prog, > '-data' => $db, > '-expect' => $e_val, > '-readmethod' => 'SearchIO' ); > > my $factory = Bio::Tools::Run::RemoteBlast->new(@params) || die "Cant > do"; > > my $v = 1; > > my $str = Bio::SeqIO->new(-file=>'mm_test2.txt' , '-format' => 'fasta' > ); > > while (my $input = $str->next_seq()){ > my $r = $factory->submit_blast($input); > > print STDERR "waiting..." if( $v > 0 ); > while ( my @rids = $factory->each_rid ) { > foreach my $rid ( @rids ) { > my $rc = $factory->retrieve_blast($rid); > > if( !ref($rc) ) { > if( $rc < 0 ) { > $factory->remove_rid($rid); > } > print STDERR "." 
if ( $v > 0 );
>                 sleep 5;
>             } else {
>                 print "$rc\n";
>                 my $result = $rc->next_result();
>                 my $filename = $result->query_name()."\.out";
>                 $factory->save_output($filename);
>                 $factory->remove_rid($rid);
>                 print "\nQuery Name: ", $result->query_name(), "\n";
>                 while ( my $hit = $result->next_hit ) {
>                     next unless ( $v > 0);
>                     print "\thit name is ", $hit->name, "\n";
>                     while( my $hsp = $hit->next_hsp ) {
>                         print "\t\tscore is ", $hsp->score, "\n";
>                     }
>                 }
>             }
>         }
>     }
> }
> --------------------------------------------------------------------------
> --------------------
>
> --
> "Every noble work is at first impossible."
> - Thomas Carlyle
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

From cjfields at uiuc.edu  Mon Jul 24 12:21:39 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Mon, 24 Jul 2006 11:21:39 -0500
Subject: [Bioperl-l] New EMBL format parsing/writing
In-Reply-To: 
Message-ID: <001c01c6af3d$3df2dc70$15327e82@pyrimidine>

The only proposed EMBL changes I can remember were for Tax data (organism
lines). It shouldn't be hard to change the way these are parsed. We could
leave parsing of SV for older files and run a check on the ID line format to
accommodate old and new sequences, though I have no problem with only
supporting the latest formats. Continual support for old deprecated sequence
formats leads to lots of cruft over time; SwissProt parsing has the same
issue. You would be surprised how many people out there never bother to
update their sequences and use old data...

I believe you are referring to this (from the latest EMBL release notes):

...
2 CHANGES IN THIS RELEASE

2.1 Changes to the Feature Table Document: Chapter 3.5 "Location"

The use of range (.) descriptor within location spans is no longer legal.

2.2 ID line changes

ID line structure underwent the following changes

* All tokens are separated by a semicolon.
* The entry name is not displayed, in its place there is the primary
  accession number.
* The sequence version is indicated.
* The topology is a separate token and is indicated for both circular
  and linear molecules.
* Both the data class and taxonomic divisions will be displayed.

This is an example of the new ID line:

ID   CD789012; SV 4; linear; genomic DNA; HTG; MAM; 500 BP.
     (1)       (2)   (3)     (4)          (5)  (6)  (7)

The tokens represent:
   1. Primary accession number.
   2. 'SV' + sequence version number.
   3. Topology: 'circular' or 'linear'.
   4. Molecule type.
   5. Data class (ANN, CON, PAT, EST, GSS, HTC, HTG, MGA, WGS, TPA, STS, STD,
      "normal" entries will have STD for standard).
   6. Taxonomic division (HUM, MUS, ROD, PRO, MAM, VRT, FUN, PLN, ENV, INV,
      SYN, UNC, VRL, PHG).
   7. Sequence length + 'BP.'.

The entry name is no longer displayed in the ID line. A mapping file
(entryname to accession number)
ftp://ftp.ebi.ac.uk/pub/databases/embl/misc/entryname_to_acc.mapping
is provided for those entries where the entryname is not the same as the
accession number.

The SV line has been dropped as sequence version information is now
displayed in the ID line.

In order to facilitate the changeover to the new ID line structure, two
small utilities have been released: 'new2oldID.pl' and 'old2newID.pl'.
They can be used to convert EMBL flat files from the old to the new format
and vice-versa. The converters can be found at
ftp://ftp.ebi.ac.uk/pub/databases/embl/tools

A new version of the Syncron tools (for maintaining synchronised copies of
EMBL database updates) that became the working version with EMBL release 87
can be found in the same directory. In this version the tools were adjusted
to cope with the new format of the ID line in EMBL entries and some related
changes.
...
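To make the seven-token layout quoted above concrete, here is a minimal sketch of how such a line could be split in Perl. It is not the code in Bio::SeqIO::embl or the pending patch; the function name parse_new_embl_id_line and the hash keys are made up for illustration, and the only input assumed is the example ID line from the release notes.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Bio::SeqIO::embl): split a new-style
# ID line into its seven semicolon-separated tokens. Returns a hash
# reference, or undef if the line does not have the new layout.
sub parse_new_embl_id_line {
    my ($line) = @_;

    # Strip the 'ID' line code and surrounding whitespace.
    return unless $line =~ /^ID\s+(.+?)\s*$/;
    my @tokens = split /\s*;\s*/, $1;

    # Old-style ID lines have fewer tokens, so they are rejected here.
    return unless @tokens == 7;
    my ($acc, $sv, $topology, $molecule, $class, $division, $length) = @tokens;

    # Token 2 is 'SV' plus the sequence version number.
    return unless $sv =~ /^SV\s+(\d+)$/;
    my $version = $1;

    # Token 7 is the sequence length plus 'BP.'.
    return unless $length =~ /^(\d+)\s+BP\.$/;
    my $bp = $1;

    return {
        accession  => $acc,
        version    => $version,
        topology   => $topology,
        molecule   => $molecule,
        data_class => $class,
        division   => $division,
        length     => $bp,
    };
}

my $id = parse_new_embl_id_line(
    'ID   CD789012; SV 4; linear; genomic DNA; HTG; MAM; 500 BP.');
print "$id->{accession}.$id->{version} ($id->{length} bp)\n" if $id;
# prints "CD789012.4 (500 bp)"
```

Because the token count differs between the two layouts, a parser that wants to accept both could try a check like this first and fall back to the old ID/SV handling when it returns undef.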
Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of simon andrews (BI) > Sent: Monday, July 24, 2006 8:34 AM > To: bioperl-l at lists.open-bio.org > Subject: [Bioperl-l] New EMBL format parsing/writing > > I few weeks ago I saw a couple of messages on this list mentioning the > new ID/SV line format used in the latest EMBL release. I'm in the > process of moving our database server over to the new format and was > looking to update SeqIO::embl.pm. > > I'm sure someone said they'd made a patch to fix up parsing of the new > format, but I can't find it either in CVS or bugzilla. > > Rather than do this again myself can someone point me to an updated > SeqIO::embl.pm please? If there isn't one then I'll look into making > the patch myself. > > Since this is such a major change are there any plans to put out a new > release with this fix included? I'm sure this will start to bite more > people as the new format becomes more widely adopted. > > > Cheers > > Simon. > > -- > Simon Andrews PhD > Bioinformatics Group > The Babraham Institute > > simon.andrews at bbsrc.ac.uk > +44 (0) 1223 496463 > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Mon Jul 24 12:37:32 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 24 Jul 2006 11:37:32 -0500 Subject: [Bioperl-l] New EMBL format parsing/writing In-Reply-To: Message-ID: <002001c6af3f$76214490$15327e82@pyrimidine> Great work! Does it support old and new EMBL or only the newest? I don't have a problem with dumping old format support, but if we do we need to note this in POD and elsewhere (wiki, perhaps). 
Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of dwaner at scitegic.com > Sent: Monday, July 24, 2006 11:03 AM > To: bioperl-l at lists.open-bio.org > Subject: [Bioperl-l] New EMBL format parsing/writing > > Simon, > > I have already updated SeqIO::embl.pm to support release 87. All I have > left to do is generate the patch and update the /t test. I will try to > get this submitted to bugzilla today (24 July). > > - David > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Mon Jul 24 14:40:03 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 24 Jul 2006 13:40:03 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C4D0D3.1020506@sendu.me.uk> Message-ID: <002f01c6af50$97242250$15327e82@pyrimidine> I have to do a little catching up on things here; lots of conversation this morning! According to NCBI, the SOURCE line can hold organelle data, an abbreviated version of the scientific name, and the GenBank common name in parentheses. No other information is present. The ORGANISM lines contains the scientific name (NCBI definition) and the lineage, generally only ranked node but not always. I believe it was Nadeem Faruque who indicated that there is some way that NCBI marks the ranks which determines whether or not they appear in the lineage. Here's what Bio::SeqIO::genbank does to get data into and out of GenBank files: ------------------------------------------------------ Bio::SeqIO::genbank in methods next_seq() and _read_GenBank_Species(): 1) Bio::Species acts as a container object 2) The SOURCE data is dumped entirely into common_name() (ughhhh). There is some additional work done as well before instantiating a Bio::Species ; if it is considered an unknown organism there is no Bio::Species object returned. 
We should get rid of that bit; every GenBank SOURCE has a TaxID and
therefore has a node, including plasmids and unknowns. There will be no
genus/species or anything else set for that group.
3) The ORGANISM name is divided up into genus(), species(), and
subspecies(), based on the classification array (again, ughhh).
4) The classification array is split into an array and dumped into
classification()
5) No parsing of potential organelle information occurs. None. Zero. Squat.
6) TaxID is grabbed from the 'source' seqfeature and assigned via
ncbi_taxid(). We could use this to also grab the organelle, etc.

------------------------------------------------------
Bio::SeqIO::genbank in method write_seq():

1) SOURCE line : use the common_name data for output, but tag on the
subspecies information (?!?!?!).
2) ORGANISM lines : the name is rebuilt from the organelle() (which should
be on the SOURCE line) and genus and species, which come from the
classification array (?!?!?!). The classification array is rebuilt from
classification()
------------------------------------------------------

Much of this may be cruft from changes in the official GenBank format that
we neglected to update. However, I think there's WAY too much hand-wringing
about trying to get everything into genus() species() etc without anything
more than the (very scant) information in the flatfile, esp. when using the
classification array as a basis.

The only places where reliable tax information is present in the flatfile
are:

1) SOURCE line (organelle, common name, abbreviated name)
2) ORGANISM lines (scientific name, classification array)
3) 'source' seqfeature (strain/variant (!), organelle, TaxID, etc found
here).

We should assign those accordingly; we could even use the 'source'
seqfeature to grab strain, organelle, etc. just like we now do for the
TaxID. Beyond that we're really just guessing the ranks and the
genus-species names.
Makes no sense, especially when that is easily available in Bio::Taxonomy using entrez/flatfile. We could have Bio::Taxonomy::Species act as a container for IO purpose, ONLY using the methods in the 'reliable information' list above in Bio::SeqIO::genbank and other SeqIO RichSeqs. Then hold the additional data with warnings attached if a lookup hasn't been run, or not set them at all. Or, use Hilmar's suggestion and force the user to use the db handle and ncbi_taxid() to grab a new Bio::Taxonomy::Node/Species object (based on the rank) which has the correct information. As for the other container get/sets: species(), genus() etc. These methods should be present, but only for species or below (hence Bio::Taxonomy::Species). In a way Bio::Taxonomy::Species is not entirely correct as the sequence file many times the sequence is from an organism at the genus level (unassigned species) or subspecies/strain levels, or is unranked (environmental samples, for instance). All of these seem to have TaxIDs though. Don't think it really matters... We could convert Bio::Species into an abstract interface class (Bio::SpeciesI), moving the implemented methods over to Bio::Taxonomy::Species, and have Bio::Taxonomy::Species implement Bio::Taxonomy::NodeI or Bio::TaxonomyI as well. Bio::Taxonomy::Species could be checked with $obj->isa('Bio::TaxonomyI') && $obj->isa('Bio::SpeciesI') Or, modifying Hilmar's suggestion: |-----Tax::Node NodeI/TaxI -| |-----Tax::Species | SpeciesI -------| So Species doesn't 'contaminate' Node. 
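
The interface layout in the diagram above can be sketched in plain Perl as
follows. This is only an illustration under stated assumptions: every
package name (Sketch::NodeI, Sketch::SpeciesI, Sketch::Node,
Sketch::SpeciesNode) is a hypothetical stand-in, none of it is BioPerl code,
and the genus/species split is deliberately naive.

```perl
use strict;
use warnings;

# All package names below are hypothetical stand-ins for the classes
# discussed in this thread; none of this is actual BioPerl code.

package Sketch::NodeI;       # interface: anything that is a taxonomy node
sub rank            { die "rank() not implemented" }
sub scientific_name { die "scientific_name() not implemented" }

package Sketch::SpeciesI;    # interface: the legacy species-style accessors
sub genus   { die "genus() not implemented" }
sub species { die "species() not implemented" }

package Sketch::Node;        # plain node: implements NodeI only
our @ISA = ('Sketch::NodeI');
sub new {
    my ($class, %args) = @_;
    return bless { rank => $args{-rank}, name => $args{-name} }, $class;
}
sub rank            { $_[0]{rank} }
sub scientific_name { $_[0]{name} }

package Sketch::SpeciesNode; # species node: a Node that also implements SpeciesI
our @ISA = ('Sketch::Node', 'Sketch::SpeciesI');
# Naive split of a binomial name; real names are not always this simple.
sub genus   { (split ' ', $_[0]->scientific_name)[0] }
sub species { (split ' ', $_[0]->scientific_name)[1] }

package main;

my $node = Sketch::Node->new(-rank => 'genus', -name => 'Saccharomyces');
my $sp   = Sketch::SpeciesNode->new(-rank => 'species',
                                    -name => 'Saccharomyces cerevisiae');

# The same isa() checks as in the text, against the sketch interfaces.
for my $obj ($node, $sp) {
    printf "%-26s NodeI=%d SpeciesI=%d\n",
        $obj->scientific_name,
        $obj->isa('Sketch::NodeI')    ? 1 : 0,
        $obj->isa('Sketch::SpeciesI') ? 1 : 0;
}
```

Both objects pass the NodeI check, so code that only needs a node can treat
them alike, while only the species-level object answers to SpeciesI.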
This will allow you to proceed with doing what you want to Bio::Taxonomy::Node; both Node and Species could be checked simultaneously though they need to be changed at some point to implement the same base class, so you could check using : if ($obj->isa('Bio::Taxonomy::NodeI')) { As for getting Bio::SeqIO::genbank to play well with Bio::Taxonomy::Species, all I did was 'clone' the Bio::Taxonomy::Node module into Bio::Taxonomy::Species, removed the warnings in species() and other methods for the time being, and changed the method call for classification() in Bio::SeqIO::genbank to send an array instead of an array_ref. Then I modified the parsing to retain the scientific_name and abbreviated_name (though the latter should go into common_names()). Passed all but one test, where common_name was called and returned the entire SOURCE line (not correct!). Pretty simple, really... BTW, I checked EMBL format, and it is very similar in format to the way GenBank is with the interesting addition of the OG line (for organelle). Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Sendu Bala > Sent: Monday, July 24, 2006 8:53 AM > To: bioperl-l at lists.open-bio.org > Subject: Re: [Bioperl-l] Bio::*Taxonomy* changes > > Chris Fields wrote: > > Bio::SeqIO::genbank works very happily with the current > > Bio::Taxonomy::Node now; if we intend to remove most of the method we > > need to have a similar DB-aware module to house the flatfile data (like > > Bio::Species) yet be capable of working with Bio::Taxonomy (like > Tax::Node). > > Can you give code examples of what Bio::SeqIO::genbank is doing and what > makes it 'happy'? What are the requirements? Would it be as happy > working with a Bio::Taxonomy object? 
> _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Mon Jul 24 15:24:23 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 24 Jul 2006 14:24:23 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C4CFF6.40609@sendu.me.uk> Message-ID: <003c01c6af56$c5fd2df0$15327e82@pyrimidine> > Hilmar Lapp wrote: > > Sounds good to me, except there is no Bio::TaxonomyI yet, > > Indeed, I propose making one. So, Node would implement this, correct? Naming it Bio::TaxonomyI makes me think that Bio::Taxonomy implements TaxonomyI, not that Bio::Taxonomy::Node implements it. ... > Yes, which is why Bio::Taxonomy is appropriate here. Assuming that > Bio::Species isa Bio::TaxonomyI: > > ... > SOURCE Saccharomyces cerevisiae (baker's yeast) > ORGANISM Saccharomyces cerevisiae > Eukaryota; Fungi; Ascomycota; Saccharomycotina; > Saccharomycetes; > Saccharomycetales; Saccharomycetaceae; Saccharomyces. > > ... > > ## the fully-manual way > my $species = new Bio::Species; > my $node = new Bio::Taxonomy::Node(-name => 'Saccharomyces cerevisiae', > -rank => 'species', -object_id => 1, > -parent_id => 2); > my $n2 = new Bio::Taxonomy::Node(-name => 'Saccharomyces', > -object_id => 2, -parent_id => 3); > # (no assumption that 'Saccharomyces' is the genus, so rank() undefined) > my $n3 = [etc] > $species->add_node($node); > $species->add_node($n2); > [etc] Hrmm... why would you add multiple nodes to a species object? A Species is-a Node, not a full Bio::Taxonomy. Taxonomy has-a Node (hence the add_node() method). So, you should be able to add a NodeI-implementing object to a Taxonomy object (either a Node or a Species). Not sure I agree with what you propose here; doesn't seem right... ... 
> We also solve Chris' earlier quandary: > > [ in a world where Bio::Taxonomy::Node and Bio::Taxonomy::SpeciesNode > exist, and given that Bio::DB::Taxonomy* currently directly make Node > objects ] > > The only problem I can foresee is which class to use with > > Bio::DB::Taxonomy*? I guess one could settle on one class by default > and > > have the option to use another Bio::Taxonomy::NodeI-implementing class > if > > you wanted more data/methods available... > > The way to do it is to have the Bio::DB::Taxonomy* modules return only > the information that a Bio::Taxonomy::FactoryI would need to make a > NodeI. The specific Factory that you use could generate whatever type of > Node you wanted. Yes, using an object factory here makes a lot of sense, returning the correct object type based on the rank. ... > Bio::Species differs from Bio::Taxonomy only so it contains all the > legacy methods names that Bio::Species currently has, for backward > compatibility. Setting $species->classification() would delete all nodes > of self, use a GenbankFactory to make a new Bio::Species, then pull out > all its Nodes and add them to self. The idea is to replace Bio::Species with something that works well, so having it implement a Node-like interface works since it is-a Node. Having it implement a Taxonomy-like interface, though, doesn't make a lot of sense as a species is-not-a Taxonomy. It should act just like a fancier node object. Using a factory in Bio::DB::Taxonomy should solve any issues about what object type is returned, since that could simply be made based on the rank itself (species rank or below == Bio::Taxonomy::Species, genus and above == Bio::Taxonomy::Node). > Unless anyone can think of a better way of doing things, I'll explore > the above ideas and start writing code. 
To summarise: major changes to > Bio::DB::Taxonomy* (make them factory slaves), implementation of some > Bio::Taxonomy::FactoryIs, tweak Bio::Taxonomy::FactoryI and make > Bio::TaxonomyI, make Bio::Species a Bio::TaxonomyI. Nope. Don't agree. Sorry. I can't see why you would force a Species to be a Taxonomy when it isn't. The object hierarchy doesn't make sense to me. I would just have a simple interface for Node (NodeI), and either convert Bio::Species to an abstract interface or place its methods in Bio::Taxonomy::Species/SpeciesNode. I like the interface idea as Bio::Taxonomy::Node is-a NodeI only, while Bio::Taxonomy::Species is-a NodeI and SpeciesI; these checks can be run using the UNIVERSAL object method 'isa' when using a Factory. I'll repeat: a Node and a Species is-not-a Taxonomy. A Taxonomy object has-a Node or Species or combinations thereof ; all would be NodeI-implementing. That's the reason that add_node() is there, which could be modified to allow only objects that isa->('Bio::Taxonomy::NodeI') (i.e. a Node or a Species). > Oh, Bio::Taxonomy might need some changes as well. It has a classify() > method does something with a Bio::Species, which would be all wrong in > the new way of doing things. We'll have to make eventual changes to anything referencing Bio::Species to get them to work correctly. Getting the object hierarchy finalized and worked out is priority one. Getting Bio::SeqIO modules switched over to Bio::Taxonomy::Species (pretty commonly used) and making sure that Bio::DB::Taxonomy returns the correct objects from the factory is a close second. Any small issues that pop up along the way can be taken care of when they reveal themselves. 
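
A minimal sketch of the rank-based factory idea described above, assuming
hypothetical classes (Sketch::Node, Sketch::SpeciesNode,
Sketch::NodeFactory) and a made-up, non-exhaustive list of "species or
below" ranks. It is not how Bio::DB::Taxonomy works, and whether nodes
should differ by rank at all is exactly what is being debated in this
thread.

```perl
use strict;
use warnings;

# Hypothetical classes, for illustration only; not BioPerl code.

package Sketch::Node;
sub new  { my ($class, %args) = @_; return bless {%args}, $class }
sub rank { $_[0]{rank} }
sub name { $_[0]{name} }

package Sketch::SpeciesNode;
our @ISA = ('Sketch::Node');

package Sketch::NodeFactory;

# Ranks treated as "species or below". This list is an assumption made
# for the sketch and is not exhaustive.
my %SPECIES_LEVEL = map { $_ => 1 } qw(species subspecies varietas forma);

# Build a node from raw database fields, choosing the class by rank.
sub create_node {
    my ($class, %data) = @_;
    my $rank = defined $data{rank} ? $data{rank} : 'no rank';
    my $impl = $SPECIES_LEVEL{$rank} ? 'Sketch::SpeciesNode' : 'Sketch::Node';
    return $impl->new(%data, rank => $rank);
}

package main;

for my $rec ({ name => 'Saccharomyces cerevisiae', rank => 'species' },
             { name => 'Saccharomyces',            rank => 'genus'   }) {
    my $n = Sketch::NodeFactory->create_node(%$rec);
    printf "%s (%s) -> %s\n", $n->name, $n->rank, ref $n;
}
```

The database module would then only hand raw fields to the factory, and
swapping in a different factory would change the class of every node
returned without touching the database code.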
Chris

From cjfields at uiuc.edu Mon Jul 24 15:34:55 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Mon, 24 Jul 2006 14:34:55 -0500
Subject: [Bioperl-l] Bio::Species/Bio::Taxonomy changes
In-Reply-To: <2C99E56B-84D2-4C51-BBF1-76BAF81205AB@gmx.net>
Message-ID: <003d01c6af58$3dc4ac40$15327e82@pyrimidine>

> > Maybe the file parser could have its own organelle() method
> > and leave all taxonomic classes without such a method. Or it could
> > stay as is, I don't know.
>
> Like I said above, at the end of the day there needs to be a way to
> qualify a sequence by the genome it is part of.

Agreed. I think Sendu's right in one regard, it doesn't seem to have
anything to do with the taxonomy itself. See below... There should be a way
of containing this somehow, maybe using a Bio::Annotation::SimpleValue
object or having a get/set somehow.

> > Do different organelles in the same species get unique taxonomy ids?
>
> I would have to confirm, but I believe so. As I said, from a genome/
> sequence-centric viewpoint, the organelle and nuclear genomes are two
> different things.

Looks like the organelle sequence data uses the organism TaxID. I couldn't
find organelle-specific taxon information using the TaxBrowser for
mitochondrion, chloroplast, or plastid.

     source          1..426
                     /organism="Reticulitermes tibialis"
                     /organelle="mitochondrion"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:186107"
                     /haplotype="T9"

TaxID refers to the organism ("Reticulitermes tibialis"), not the
mitochondrion.

     source          1..814
                     /organism="Porterinema fluviatile"
                     /organelle="plastid:chloroplast"
                     /mol_type="genomic DNA"
                     /strain="SAG 124.79"
                     /db_xref="taxon:246123"
                     /country="Germany"

TaxID refers to the organism ("Porterinema fluviatile"), not the
chloroplast.
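
To make the point concrete, here is a rough sketch of pulling the organelle
and the TaxID out of the text of a 'source' feature like the ones above. The
helper parse_source_qualifiers() is hypothetical and is not part of
Bio::SeqIO::genbank; it assumes simple /key="value" qualifiers and at most
one db_xref per feature.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): collect the /key="value"
# qualifiers of a GenBank 'source' feature into a hash reference.
# Simplification: a repeated qualifier overwrites the earlier value.
sub parse_source_qualifiers {
    my ($feature_text) = @_;
    my %qual;
    while ($feature_text =~ m{/(\w+)="([^"]*)"}g) {
        $qual{$1} = $2;
    }
    # The TaxID is carried in db_xref as "taxon:<digits>".
    if (defined $qual{db_xref} && $qual{db_xref} =~ /^taxon:(\d+)$/) {
        $qual{taxid} = $1;
    }
    return \%qual;
}

my $feature = <<'END';
     source          1..814
                     /organism="Porterinema fluviatile"
                     /organelle="plastid:chloroplast"
                     /mol_type="genomic DNA"
                     /strain="SAG 124.79"
                     /db_xref="taxon:246123"
                     /country="Germany"
END

my $q = parse_source_qualifiers($feature);
printf "organism=%s organelle=%s taxid=%s\n",
    @{$q}{qw(organism organelle taxid)};
# prints: organism=Porterinema fluviatile organelle=plastid:chloroplast taxid=246123
```

The organelle comes back as a qualifier of the sequence feature, separate
from the TaxID, which still identifies the organism as a whole.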
Chris

From bix at sendu.me.uk Mon Jul 24 15:45:09 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Mon, 24 Jul 2006 20:45:09 +0100
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <003c01c6af56$c5fd2df0$15327e82@pyrimidine>
References: <003c01c6af56$c5fd2df0$15327e82@pyrimidine>
Message-ID: <44C52345.5060903@sendu.me.uk>

Chris Fields wrote:
>> Hilmar Lapp wrote:
>>> Sounds good to me, except there is no Bio::TaxonomyI yet,
>> Indeed, I propose making one.
>
> So, Node would implement this, correct? Naming it Bio::TaxonomyI makes me
> think that Bio::Taxonomy implements TaxonomyI, not that Bio::Taxonomy::Node
> implements it.

No no, I guess the whole rest of your reply was confused by this one point.
Bio::TaxonomyI would be the interface for Bio::Taxonomy. Definitely not a
Node.

>> Yes, which is why Bio::Taxonomy is appropriate here. Assuming that
>> Bio::Species isa Bio::TaxonomyI:
>>
>> ...
>> SOURCE      Saccharomyces cerevisiae (baker's yeast)
>>   ORGANISM  Saccharomyces cerevisiae
>>             Eukaryota; Fungi; Ascomycota; Saccharomycotina;
>> Saccharomycetes;
>>             Saccharomycetales; Saccharomycetaceae; Saccharomyces.
>>
>> ...
>>
>> ## the fully-manual way
>> my $species = new Bio::Species;
>> my $node = new Bio::Taxonomy::Node(-name => 'Saccharomyces cerevisiae',
>>                                    -rank => 'species', -object_id => 1,
>>                                    -parent_id => 2);
>> my $n2 = new Bio::Taxonomy::Node(-name => 'Saccharomyces',
>>                                  -object_id => 2, -parent_id => 3);
>> # (no assumption that 'Saccharomyces' is the genus, so rank() undefined)
>> my $n3 = [etc]
>> $species->add_node($node);
>> $species->add_node($n2);
>> [etc]
>
>
> Hrmm... why would you add multiple nodes to a species object? A Species
> is-a Node, not a full Bio::Taxonomy.

In my proposal, a Bio::Species certainly is a full Bio::Taxonomy.

>> Bio::Species differs from Bio::Taxonomy only so it contains all the
>> legacy methods names that Bio::Species currently has, for backward
>> compatibility.
Setting $species->classification() would delete all nodes
>> of self, use a GenbankFactory to make a new Bio::Species, then pull out
>> all its Nodes and add them to self.
>
> The idea is to replace Bio::Species with something that works well, so
> having it implement a Node-like interface works since it is-a Node. Having
> it implement a Taxonomy-like interface, though, doesn't make a lot of sense
> as a species is-not-a Taxonomy.

Right. So this is why we've been 'butting heads'. Up till now I had no idea
why you were so adamant about keeping things the old Bio::Taxonomy::Node
way.

Bio::Species very definitely has never been, nor do we want it to become, a
single node of a taxonomy. It has always been a complete taxonomy. You can
tell that by the fact it has a classification, and you could ask what its
genus is.

This is why I'm proposing that Bio::Species become a Bio::Taxonomy. Because
that's the correct object model for the kinds of things Bio::Species wants
to do.

> Using a factory in Bio::DB::Taxonomy should solve any issues about what
> object type is returned, since that could simply be made based on the rank
> itself (species rank or below == Bio::Taxonomy::Species, genus and above ==
> Bio::Taxonomy::Node).

Frankly, that idea makes me ill. A Node, at the fundamental level, is just a
very simple object that needs to associate a taxonomic rank with a
scientific name. If you start making different objects for different ranks,
you've departed from any semblance of meaning in the object model.

> Nope. Don't agree. Sorry. I can't see why you would force a Species to be
> a Taxonomy when it isn't. The object hierarchy doesn't make sense to me.

Does it make sense now?

> I'll repeat: a Node and a Species is-not-a Taxonomy.

I'll repeat: A Node is a Node and a Bio::Species is a Taxonomy ;)

> A Taxonomy object has-a Node or Species or combinations thereof ;

No, a Taxonomy contains Nodes. One of those Nodes might have a rank() of
'species'.
A Bio::Species contains Nodes. One of those Nodes definitely has a rank() of 'species'. It /must/ have other nodes, because the job of Bio::Species has in the past and will in the future be to store all the other taxonomic levels in a Genbank file. For the same reason Bio::Species can't be a Node itself, because you can't store other Nodes inside a Node. From cjfields at uiuc.edu Mon Jul 24 15:49:06 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 24 Jul 2006 14:49:06 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <11A2B917-C633-4806-A6F4-920F02F0BF6E@gmx.net> Message-ID: <003e01c6af5a$390cdea0$15327e82@pyrimidine> Yes, 'largely' the key word. I don't really agree with Sendu's hierarchy scheme (making Species implement Taxonomy and not Node doesn't make sense), but, besides that, everything else seems fine. I like the following setup (which is similar to what you proposed, I believe), which I already posted. |-----Tax::Node NodeI-------| |-----Tax::SpeciesNode | SpeciesI -------| Taxonomy::Node is-a NodeI Taxonomy::SpeciesNode is-a NodeI and-a SpeciesI Bio::Taxonomy 'has-a' NodeI-implementing module SeqIO has-a SpeciesI-implementing module Bio::DB::Taxonomy uses a factory to return NodeI-implementing modules; specifically, a SpeciesNode for species ranks or below, and a Node for anything else. It would be nice to get this hammered out soon. I think we can actually start work on the Bio::Taxonomy::Node/SpeciesNode split; the interface classes would be easy to add. I could work on getting SeqIO to work with Bio::Taxonomy::SpeciesNode when I can (sometime in the next few weeks). Like I mentioned before, I got Bio::SeqIO::genbank already using it but haven't committed it to CVS until we sorted out the class hierarchy and interface-implementation issues. I won't be able to add too much more to this for a few weeks, unfortunately. I need to prepare for a conference as well as finish up a ton of bench research. I'll try keeping up though... 
Chris > :-) I think we're largely in agreement. As for node_name() I fully > understand the motivation, but it needs to be understood that the > attribute's value will be based on a largely arbitrary choice unless > it is set directly by the user. > > -hilmar > > On Jul 24, 2006, at 4:45 AM, Sendu Bala wrote: > > > Hilmar Lapp wrote: > >> On Jul 20, 2006, at 9:35 AM, Sendu Bala wrote: > >> > >>> Bio::DB::Taxonomy::flatfile > >>> --------------------------- > >>> [...] > >>> > >>> BEHAVIOUR-CHANGE: flatfile used to store within the nodes it > >>> makes the > >>> division as a three letter code, like 'PRI'. However, for > >>> consistency > >>> with entrez and the scientific_name() of the node the division is > >>> supposed to correspond to, it is now stored as the full name, like > >>> 'Primates'. > >> > >> What about adding a method division_code() which would return the 3- > >> letter abbreviation? > >> > >> The abbreviation may be needed by flat-file writers, so it may be > >> handy to have in some cases. > > > > As far as I know you can't get the 3-letter version via entrez, so no > > other module can really expect to be able to get it, not knowing which > > database (flatfile.pm or entez.pm) the taxonomic information is > > coming from. > > > > But of course it would be somewhat harmless to add division_code() > > anyway. It might be better done as a -code => 1 option to division()? > > > > > >>> The names->id solution also stores the artificially uniqued names > >>> like > >>> 'Craniata ', allowing you for the first time to > >>> retrieve the > >>> correct id. Previously the search would have simply failed > >>> completely. > >>> > >>> The names->id solution now handles nodes with scientific names of > >>> 'xyz > >>> (class)', allowing you to retrieve the id with both get_taxonids > >>> ('xyz') > >>> and get_taxonids('xyz (class)'). Previously only the latter would > >>> work. > >> > >> Should angle brackets be allowed too? > > > > Allowed in what sense? 
You can indeed search for both > > get_taxonids('Craniata ') [returns a single id] and > > get_taxonids('Craniata') [returns multipe ids, one of which is the > > previous answer]. > > > > > >> Maybe there should also be a -names parameter which accepts a hash > >> reference with keys being the kind of name (scientific, common, etc) > >> and the values being array references with the set of names of that > >> kind? > > > > Not sure what you mean. name() has that data structure, though you're > > not supposed to set its hash ref directly. > > > > > >>> or the $node->classification() array. > >> > >> Bio::Taxonomy::Node shouldn't have this attribute. It is legacy > >> brought over from a flawed (because flat) object model in > >> Bio::Species. > > > > Yes, I agree. > > > > > >>> NOTE: entrez modules (and website) cannot cope with '' > >>> in the > >>> query, failing searches like 'Craniata '. For this > >>> reason, if > >>> get_taxonids() is given a query with '' it will > >>> immediately > >>> return undefined, saving a pointless website access. > >> > >> If there is a 'next-best-thing' that is still semantically compatible > >> with the API documentation, I would do that. > >> > >> In this case, if there is a in the query the entrez > >> module should strip it and automatically use the rest for searching. > >> If indeed multiple IDs match there should be a warning to inform the > >> user that entrez cannot use the notation to limit the > >> query results. > > > > I wouldn't like this. I actually had it working this way initially, > > but > > decided that if someone entered 'xyz ' they really didn't > > want multiple ids, expected to get multiple ids with just 'xyz' and > > don't want their query made something else and then be warned about > > it. > > > > > >> In fact, you might as well provide an option to enable an automatic > >> check for the correct branch for each ID if multiple ones are > >> returned. 
I.e., if this option is enabled, the module would > >> automatically query the parent nodes to see if is in the > >> lineage, and if not will remove the respective ID from the result > >> set. The reason you may want to make it optional is because it > >> potentially costs time. (but in reality I'm not sure why a client > >> will not want to enable the option - so maybe this should even be > >> default) > > > > I can certainly add that, it seems like a good idea. I don't, however, > > see any scope for an option at all. What would the option be called? > > -don't_give_me_the_answer_I_actually_want_to_save_time ? Pointless, > > imho. If the user queries 'xyz ' with that option, they're > > just going to have to do for themselves manually what the method would > > have done for them without that option, in order to get the correct > > answer. It'll be slower that way, if anything. So the option would > > actually be called > > - > > don't_give_me_the_answer_I_actually_want_so_I_can_get_it_myself_a_litt > > le_slower > > (!). > > > > > >>> Bio::Taxonomy::Node > >>> ------------------- > >>> [...] > >>> classification() has a proper solution to finding the classification > >>> when the array wasn't manually set. > >>> > >>> # Improvements > >>> BEHAVIOUR-CHANGE: node_name() used to be an alias to name > >>> ('common'). Now > >>> it is an alias to name('scientific'). > >>> NOTE: node_name is what is set when ->new(-name => $name) is set, so > >>> flatfile and entrez and user-created nodes now implicitly associate > >>> the > >>> name of the node they create with its scientific name. > >> > >> I'm not even sure node_name() should just be deprecated. The methods > >> falsely suggests that there is only a single and definitive name for > >> the taxon node. > >> > >> In NCBI reality, this is only true for the scientific name of the > >> node. 
In real reality, many nodes have multiple scientific names - > >> taxonomy isn't static and therefore the scientific naming of nodes > >> isn't either. > > > > For the programmer not using any database but just making up his own > > nodes, I think he needs a node_name() because he may not be thinking > > about anything fancy or realistic. He just want to give his node a > > single name that he invents. node_name() seems like the ideal method > > name to me. > > > > > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From hlapp at gmx.net Mon Jul 24 15:56:02 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Mon, 24 Jul 2006 15:56:02 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <003c01c6af56$c5fd2df0$15327e82@pyrimidine> References: <003c01c6af56$c5fd2df0$15327e82@pyrimidine> Message-ID: <88700A84-B426-4BC7-88F2-D5E793870ADF@gmx.net> On Jul 24, 2006, at 3:24 PM, Chris Fields wrote: > >> Hilmar Lapp wrote: >>> Sounds good to me, except there is no Bio::TaxonomyI yet, >> >> Indeed, I propose making one. > > So, Node would implement this, correct? No - > Naming it Bio::TaxonomyI makes me > think that Bio::Taxonomy implements TaxonomyI, not that > Bio::Taxonomy::Node > implements it. I'd suppose so. >> Yes, which is why Bio::Taxonomy is appropriate here. Assuming that >> Bio::Species isa Bio::TaxonomyI: >> >> ... 
>> SOURCE Saccharomyces cerevisiae (baker's yeast) >> ORGANISM Saccharomyces cerevisiae >> Eukaryota; Fungi; Ascomycota; Saccharomycotina; >> Saccharomycetes; >> Saccharomycetales; Saccharomycetaceae; Saccharomyces. >> >> ... >> >> ## the fully-manual way >> my $species = new Bio::Species; >> my $node = new Bio::Taxonomy::Node(-name => 'Saccharomyces >> cerevisiae', >> -rank => 'species', -object_id >> => 1, >> -parent_id => 2); >> my $n2 = new Bio::Taxonomy::Node(-name => 'Saccharomyces', >> -object_id => 2, -parent_id => 3); >> # (no assumption that 'Saccharomyces' is the genus, so rank() >> undefined) >> my $n3 = [etc] >> $species->add_node($node); >> $species->add_node($n2); >> [etc] > > > Hrmm... why would you add multiple nodes to a species object? A > Species > is-a Node, not a full Bio::Taxonomy. No. See above: Bio::Species is-a Bio::Taxonomy. > Taxonomy has-a Node (hence the > add_node() method). So, you should be able to add a NodeI- > implementing > object to a Taxonomy object (either a Node or a Species). Let's keep Bio::Species and Taxonomy::Node separate. They look like representing something similar but once you look at the Bio::Species API (and a Genbank record) you realize they do not. Bio::Species is more like an entire lineage and the species node all flattened out into one. I'm not sure Bio::Species would need to implement a Bio::TaxonomyI interface; it may as well just use an implementation of it internally. I'm not sure how Sendu wants to design this, but for sure Bio::Taxonomy::Node should not be a Bio::Species, and the reverse should rather be avoided too. >> [..] >> The way to do it is to have the Bio::DB::Taxonomy* modules return >> only >> the information that a Bio::Taxonomy::FactoryI would need to make a >> NodeI. The specific Factory that you use could generate whatever >> type of >> Node you wanted. > > Yes, using an object factory here makes a lot of sense, returning the > correct object type based on the rank. 
Well, I don't think you'd want to create instances of different node classes depending on the rank of the node. However, a particular factory implementation may of course be free to do exactly that. > ... >> Bio::Species differs from Bio::Taxonomy only so it contains all the >> legacy methods names that Bio::Species currently has, for backward >> compatibility. Setting $species->classification() would delete all >> nodes >> of self, use a GenbankFactory to make a new Bio::Species, then >> pull out >> all its Nodes and add them to self. > > The idea is to replace Bio::Species with something that works well, so > having it implement a Node-like interface works since it is-a > Node. Having > it implement a Taxonomy-like interface, though, doesn't make a lot > of sense > as a species is-not-a Taxonomy. It should act just like a fancier > node > object. No, I'd really recommend against muddling up a taxonomy node model with the Bio::Species legacy model. Bio::Species is not a node at all. You may argue it's not a taxonomy either. This is just one more reason for containing the Bio::Species contagious disease of conflating disjoint concepts into one. > > Using a factory in Bio::DB::Taxonomy should solve any issues about > what > object type is returned, since that could simply be made based on > the rank > itself (species rank or below == Bio::Taxonomy::Species, genus and > above == > Bio::Taxonomy::Node). Bio::Taxonomy::Species was an invention of mine and - if created - should not be used for anything else other than representing a taxonomy node as a Bio::Species object iff necessary (i.e., if the client really wants a Bio::Species object). I'd actually like to see what Sendu would come up with. It sounds at the very minimum like an excellent start. 
	-hilmar
--
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================

From hlapp at gmx.net Mon Jul 24 15:59:10 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Mon, 24 Jul 2006 15:59:10 -0400
Subject: [Bioperl-l] Bio::Species/Bio::Taxonomy changes
In-Reply-To: <003d01c6af58$3dc4ac40$15327e82@pyrimidine>
References: <003d01c6af58$3dc4ac40$15327e82@pyrimidine>
Message-ID: <3C520B8C-8755-4A7E-80CF-8B94FEAB867E@gmx.net>

On Jul 24, 2006, at 3:34 PM, Chris Fields wrote:

> Looks like the organelle sequence data uses the organism TaxID.

Then you might as well store it as annotation. Really the only thing that
matters is that the flat file writers can get it from an expected location.

In fact storing as annotation is better e.g. for Biosql since right now the
taxonomy model is the NCBI model and so organelle will not be stored (and
hence not be round-tripped either).

	-hilmar
--
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================

From cjfields at uiuc.edu Mon Jul 24 16:10:20 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Mon, 24 Jul 2006 15:10:20 -0500
Subject: [Bioperl-l] Bio::Species/Bio::Taxonomy changes
In-Reply-To: <3C520B8C-8755-4A7E-80CF-8B94FEAB867E@gmx.net>
Message-ID: <000001c6af5d$3094b830$15327e82@pyrimidine>

Sounds good. Will be easy to change this over.

Chris

> -----Original Message-----
> From: Hilmar Lapp [mailto:hlapp at gmx.net]
> Sent: Monday, July 24, 2006 2:59 PM
> To: Chris Fields
> Cc: 'Sendu Bala'; bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] Bio::Species/Bio::Taxonomy changes
>
>
> On Jul 24, 2006, at 3:34 PM, Chris Fields wrote:
>
> > Looks like the organelle sequence data uses the organism TaxID.
>
> Then you might as well store it as annotation.
Really the only thing > that matters is that the flat file writers can get from an expected > location. > > In fact storing as annotation is better e.g. for Biosql since right > now the taxonomy model is the NCBI model and so organelle will not be > stored (and hence neither be round-tripped). > > -hilmar > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > From hlapp at gmx.net Mon Jul 24 16:12:39 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Mon, 24 Jul 2006 16:12:39 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <003e01c6af5a$390cdea0$15327e82@pyrimidine> References: <003e01c6af5a$390cdea0$15327e82@pyrimidine> Message-ID: <5FB07071-42D7-4F43-B2A1-3AF5F1FC5193@gmx.net> On Jul 24, 2006, at 3:49 PM, Chris Fields wrote: > Yes, 'largely' the key word. I don't really agree with Sendu's > hierarchy > scheme (making Species implement Taxonomy and not Node doesn't make > sense), > but, besides that, everything else seems fine. I like the > following setup > (which is similar to what you proposed, I believe), which I already > posted. > > |-----Tax::Node > NodeI-------| > |-----Tax::SpeciesNode > | > SpeciesI -------| > > Taxonomy::Node is-a NodeI > Taxonomy::SpeciesNode is-a NodeI and-a SpeciesI I don't even think we would need SpeciesI - why would a species- ranked taxonomy node be so different from any other node such that it would need its own interface. Chris - just one suggestion: take a step back and imagine a Bioperl in which Bio::Species had never existed. Instead, only taxonomy nodes existed, and code that can effectively deal with them, including filtering by rank. In this picture, what would you make to want to introduce SpeciesI and Bio::Species? Frankly, I don't see anything. 
I.e., the only reason is backward compatibility (which is a valid reason), but let's not glorify Bio::Species by adding ill-conceived interfaces. > > Bio::DB::Taxonomy uses a factory to return NodeI-implementing modules; > specifically, a SpeciesNode for species ranks or below, and a Node for > anything else. Like I said before, SpeciesNode or whatever it's called would draw its right of existence solely from backward compatibility - don't use it for anything else. And if you can achieve backward compatibility by other means, don't even create a SpeciesNode. My $0.02 ... -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From cjfields at uiuc.edu Mon Jul 24 17:34:29 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 24 Jul 2006 16:34:29 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <5FB07071-42D7-4F43-B2A1-3AF5F1FC5193@gmx.net> Message-ID: <000101c6af68$f27521a0$15327e82@pyrimidine> > I don't even think we would need SpeciesI - why would a species- > ranked taxonomy node be so different from any other node such that it > would need its own interface. > > Chris - just one suggestion: take a step back and imagine a Bioperl > in which Bio::Species had never existed. Instead, only taxonomy nodes > existed, and code that can effectively deal with them, including > filtering by rank. In this picture, what would you make to want to > introduce SpeciesI and Bio::Species? Argh!!! Just when I thought I could pull away... Okay. I thought it would be nice to have a class that could accomplish two things: 1) Act as a container for GenBank taxonomy information; Bio::Taxonomy::Node, as written by Jason, was meant to be a replacement for Bio::Species. 2) Also act as a bridge, so you had the option to retrieve the Species object from a sequence object and have it act like a Node (be db-aware out-of-the-box, so to speak). 
Also, I'm trying to follow the original idea as proposed by Jason (this is from perldoc Bio::Taxonomy::Node): DESCRIPTION This is the next generation (for Bioperl) of representing Taxonomy information. Previously all information was managed by a single object called Bio::Species. This new implementation allows representation of the intermediate nodes not just the species nodes and can relate their connections. Which, to me, indicated that this would eventually replace Bio::Species (so, in effect, must at least contain the relevant data for sequence objects w/o being completely reliant on DB, yet still be DB-aware). Everything about Bio::Species on the wiki also leads me to believe that this was the original intent for Bio::Taxonomy::Node. http://www.bioperl.org/wiki/Module:Bio::Species http://www.bioperl.org/wiki/Project_priority_list#Taxonomy_.2F_Species_data And all the original methods (genus(), species(), etc.) also seem to indicate this. That's really it. I could give a toss about getting taxonomy information directly from Bio::Species. And you're right: in hindsight Bio::Species is flawed. However, it seemed from the beginning of this discussion with Sendu and the proposed changes, that Bio::Species should stick around in some capacity but should also be involved with Bio::Taxonomy (contrary to Jason's idea above). Now I'm hearing something completely different (Sendu still argues that it should be involved). I had originally wanted to start delegating everything over to Taxonomy::Node about a month ago, when I found that it was remarkably easy to do so. However, when Sendu proposed making changes to remove methods in Bio::Taxonomy::Node and make sweeping changes to Taxonomy which would prevent an easy transition over to Node, I felt that it would be harder to effectively have it take over for Bio::Species when parsing SeqIO objects (all the calls to genus/species/subspecies etc methods would have to be removed from all the classes which use Bio::Species). 
Hence Bio::Taxonomy::Species as a compromise. Now it turns out no one wants to have either Bio::Species (your 'contagion' references clues me in there) or Bio::Taxonomy::Species. If we think it would be better to completely toss all this out the window and use only a bare-bones Node, then I'm fine with that. But if we go that route we should just get rid of the Bio::Species 'disease' completely and have things be much simpler. Simple is good! I think Node can still act as a viable container class for the tax data from a GenBank file (it's original purpose) as long as it has the very basic methods for doing so. That would require: scientific_name() - ORGANISM line data common_names() - which could hold common names (in parentheses on the SOURCE line) and the abbreviated name (from the SOURCE line) ncbi_taxid() - from the 'source' seqfeature (already there). The lineage information and organelle information could be stored in Node or in SimpleValue objects. My vote is for the latter as there's no need for a classification() container for Node, which you have repeatedly pointed out. > Frankly, I don't see anything. I.e., the only reason is backward > compatibility (which is a valid reason), but let's not glorify > Bio::Species by adding ill-conceived interfaces. I think we should just get rid of Bio::Species completely. We would need to go in and rework species parsing in the SeqIO modules that use Bio::Species, but that would only make things simpler, not more complex. Get rid of trying to figure out what is a genus or species based on the GenBank information only, and have the bridge between the sequences be stored in a Taxonomy::Node object (which should contain the NCBI TaxID, so then it can use the associated DB object to traverse up and down other nodes). The interface idea was a proposed compromise i.e. my 'bridge' between GenBank taxonomy hell and Bio::Taxonomy bliss, and intended to follow what I thought was Jason's original intent for Bio::Taxonomy::Node. 
Nothing more.

> > Bio::DB::Taxonomy uses a factory to return NodeI-implementing modules;
> > specifically, a SpeciesNode for species ranks or below, and a Node for
> > anything else.
>
> Like I said before, SpeciesNode or whatever it's called would draw
> its right of existence solely from backward compatibility - don't use
> it for anything else. And if you can achieve backward compatibility
> by other means, don't even create a SpeciesNode.

Agreed. But, if there is such venom towards Bio::Species, why not put it
out of its misery as well? Seems like it has outlived its usefulness.

Chris

From cjfields at uiuc.edu Mon Jul 24 17:53:46 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Mon, 24 Jul 2006 16:53:46 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44C52345.5060903@sendu.me.uk>
Message-ID: <000201c6af6b$a4534580$15327e82@pyrimidine>

> > I'll repeat: a Node and a Species is-not-a Taxonomy.
>
> I'll repeat: A Node is a Node and a Bio::Species is a Taxonomy ;)

Nope. I think this is incorrect. Here's why.

Let's look at the reasons Bio::Taxonomy was started, shall we?

From perldoc Bio::Taxonomy:

DESCRIPTION
    Bio::Taxonomy object represents any rank-level in taxonomy system,
    rather than Bio::Species which is able to represent only species-level.
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

From perldoc Bio::Taxonomy::Node:

DESCRIPTION
    This is the next generation (for Bioperl) of representing Taxonomy
    information. Previously all information was managed by a single object
    called Bio::Species. This new implementation allows representation of the
    intermediate nodes not just the species nodes and can relate their
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    connections.

Bioperl wiki:

http://www.bioperl.org/wiki/Project_priority_list#Taxonomy_.2F_Species_data
http://www.bioperl.org/wiki/Module:Bio::Species

Both talk about delegating or replacing Bio::Species with Bio::Taxonomy::Node.
Every one of those indicates what the original idea for Bio::Taxonomy::Node
was (eventual replacement for Bio::Species). Even the original methods for
Bio::Taxonomy::Node are the same. So, according to this alone, Bio::Species
would eventually be replaced by Bio::Taxonomy::Node.

I wanted an easier transition to Node from Bio::Species (hell, just a few
changes and using Bio::Taxonomy::Node worked fine!), but your proposals made
sense. I saw having a Species-based Tax object as a nice compromise, but
Hilmar has made a few good points: would we have a Bio::Species object around
knowing what we know now? When Bio::Species was originally designed, it was
probably before the NCBI Tax database existed. I think it has outlasted its
current use.

I have posted a response to Hilmar. I think we should just get rid of
Bio::Species altogether and have a Taxonomy::Node contain the basic data
(scientific_name(), common_names(), etc). And remove any SeqIO parsing of
genus/species to simplify everything. All this extra parsing and
hand-wringing over trying to get species/genus information from a GenBank
file just mucks up ORGANISM and SOURCE line parsing anyway. Simplify it.
Simple is good.

Radical? Yes, but I agree with him that Bio::Species has outlasted its use.

As for organelle and lineage information, they could be placed in SimpleValue
objects. If anyone wants to grab tax information, they can use the Node
object to get it but they'll need a local flatfile database or network
connection to do so.

This also means there is no need for a Bio::DB::Taxonomy factory: just return
Node objects directly. Each format (flatfile and entrez) currently works
this way anyway, correct? Simplifies that. Simple is better.
Of course, we couldn't get rid of Bio::Species until all the following were
shifted over to Node somehow:

; >
Instances: 2
BP Module : Bio::Cluster::SequenceFamily
Instances: 4
BP Module : Bio::Cluster::UniGene
Instances: 1
BP Module : Bio::Cluster::UniGeneI
Instances: 1
BP Module : Bio::DB::FileCache
Instances: 3
BP Module : Bio::DB::GFF::Segment
Instances: 1
BP Module : Bio::DB::Taxonomy::flatfile
Instances: 2
BP Module : Bio::Graph::IO::psi_xml
Instances: 1
BP Module : Bio::Map::CytoMap
Instances: 1
BP Module : Bio::Map::LinkageMap
Instances: 3
BP Module : Bio::Map::MapI
Instances: 3
BP Module : Bio::Map::SimpleMap
Instances: 3
BP Module : Bio::Matrix::PSM::InstanceSite
Instances: 6
BP Module : Bio::Phenotype::Correlate
Instances: 1
BP Module : Bio::Phenotype::OMIM::OMIMentry
Instances: 3
BP Module : Bio::Phenotype::OMIM::OMIMparser
Instances: 5
BP Module : Bio::Phenotype::Phenotype
Instances: 2
BP Module : Bio::Phenotype::PhenotypeI
Instances: 4
BP Module : Bio::Seq
Instances: 3
BP Module : Bio::SeqI
Instances: 2
BP Module : Bio::SeqIO::agave
Instances: 4
BP Module : Bio::SeqIO::bsml
Instances: 2
BP Module : Bio::SeqIO::bsml_sax
Instances: 1
BP Module : Bio::SeqIO::chadoxml
Instances: 1
BP Module : Bio::SeqIO::chaos
Instances: 4
BP Module : Bio::SeqIO::embl
Instances: 2
BP Module : Bio::SeqIO::entrezgene
Instances: 3
BP Module : Bio::SeqIO::game::seqHandler
Instances: 4
BP Module : Bio::SeqIO::genbank
Instances: 2
BP Module : Bio::SeqIO::kegg
Instances: 2
BP Module : Bio::SeqIO::locuslink
Instances: 4
BP Module : Bio::SeqIO::swiss
Instances: 2
BP Module : Bio::SeqIO::table
Instances: 2
BP Module : Bio::SeqIO::tigr
Instances: 2
BP Module : Bio::SeqIO::tigrxml
Instances: 7
BP Module : Bio::SeqIO::tinyseq
Instances: 4
BP Module : Bio::Taxonomy
Instances: 1
BP Module : Bio::Taxonomy::Node
Instances: 6
BP Module : Bio::Taxonomy::Taxon
Instances: 9
BP Module : Bio::Taxonomy::Tree
Instances: 5
BP Module : Bio::Tools::Analysis::Protein::ELM

Chris

From bix at
sendu.me.uk Mon Jul 24 18:15:31 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Mon, 24 Jul 2006 23:15:31 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <000101c6af68$f27521a0$15327e82@pyrimidine> References: <000101c6af68$f27521a0$15327e82@pyrimidine> Message-ID: <44C54683.70707@sendu.me.uk> Chris Fields wrote: > > Also, I'm trying to follow the original idea as proposed by Jason (this is > from perldoc Bio::Taxonomy::Node): > > Which, to me, indicated that this would eventually replace Bio::Species Well, we don't really know that Jason didn't later change his mind, but in any case it doesn't make sense (anymore, given that we have Bio::Taxonomy). In a direct reply to me you point out specific passages in the current docs that explain why you have thought we should delegate or replace Bio::Species with Bio::Taxonomy::Node. With respect, the old plans are not something we are forced to blindly follow. We decide for ourselves if they make sense, we decide for ourselves if there is a better way of doing it, and then we do it the best way. So if you ignore what those old bits of documentation say, just pretend you never ever read them, would my proposals make sense or not? Since those old proposals were never implemented we have no reason to try and stick with them if there is a better proposal. And for the record, '...Bio::Species which is able to represent only species-level' can (correctly) be interpreted as 'Bio::Species is only supposed to be used for representing a taxonomy that includes the species-level'. You can't interpret it literally because Bio::Species is used for levels below species, and also represents all the levels above species-level as well. Either Jason got it wrong when he wrote that, or you have misinterpreted it. Likewise, let's play the interpretation game again: 'Previously all information was managed by a single object called Bio::Species. 
[the Bio::Taxonomy::Node] implementation allows representation of the intermediate nodes not just the species nodes'. Note the apposition of 'single object' vs implication of multiple Node objects to do the same job. I imagine at the time Jason wrote that there was no Bio::Taxonomy, no holder for multiple Nodes. > I had originally wanted to start delegating everything over to > Taxonomy::Node about a month ago, when I found that it was remarkably easy > to do so. However, when Sendu proposed making changes to remove methods in > Bio::Taxonomy::Node and make sweeping changes to Taxonomy which would > prevent an easy transition over to Node, But an equally easy transition to Bio::Taxonomy instead. I don't know why you would care about the name of the class we switch to. My concern is that when the switch is made it makes sense. > If we think it would be better to completely toss all this out the window > and use only a bare-bones Node, then I'm fine with that. But if we go that > route we should just get rid of the Bio::Species 'disease' completely and > have things be much simpler. Simple is good! > > I think Node can still act as a viable container class for the tax data from > a GenBank file (it's original purpose) as long as it has the very basic > methods for doing so. That would require: > > scientific_name() - ORGANISM line data > common_names() - which could hold common names (in parentheses on the SOURCE > line) and the abbreviated name (from the SOURCE line) > ncbi_taxid() - from the 'source' seqfeature (already there). > > The lineage information and organelle information could be stored in Node or > in SimpleValue objects. My vote is for the latter as there's no need for a > classification() container for Node, which you have repeatedly pointed out. No, this is the whole point. 
The lineage information can NOT be stored in a Node (unless you abuse Node
by having all those crufty methods like genus() and classification()), and
why would we store it in SimpleValue objects when we have Bio::Taxonomy?

Bio::Taxonomy is completely perfect for storing the taxonomic information
from a GenBank file. That's all you need to worry about. Can we represent
the data correctly? Yes. Do we gain all the good things about a pure
Bio::Taxonomy? Yes. Can we still do everything we used to be able to do?
Yes.

> I think we should just get rid of Bio::Species completely.

There's no need to get rid of Bio::Species. It can be a Bio::Taxonomy with
backward-compatible methods. No harm done, all good.

I'll tell you what. This will be easier if I just write the code for my
proposals, including whatever changes would be needed in
Bio::SeqIO::genbank et al. You'll see how easy and appropriate it is, and
hopefully everyone will be happy.

Perhaps you could just hold off doing any similar-but-contradictory work
until then.

From hlapp at gmx.net Mon Jul 24 19:47:10 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Mon, 24 Jul 2006 19:47:10 -0400
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44C54683.70707@sendu.me.uk>
References: <000101c6af68$f27521a0$15327e82@pyrimidine> <44C54683.70707@sendu.me.uk>
Message-ID: <86C97545-8DDF-4FE3-9CE9-1799D8BD2EB5@gmx.net>

On Jul 24, 2006, at 6:15 PM, Sendu Bala wrote:

> I'll tell you what. This will be easier if I just write the code for my
> proposals, including whatever changes would be needed in
> Bio::SeqIO::genbank et al.

Never get in the way of somebody who threatens to code :-) so I certainly
won't. I think you're on the right track.

My suggestion is, if you have a good picture in front of you of what it's
going to look like when done, just pretend for a second it is done already
and give us some code examples that use the new (to be done) API.
As a start, some of the situations it's currently used in: - genbank.pm parsing and setting species information for the sequence - user asking for the scientific name of the species of the sequence (obviously, the call would remain unchanged: $seq->species->binomial (). But what happens behind the scene?) - genbank.pm writing the SOURCE information for a sequence Replace genbank.pm with your rich annotation source parser of choice. Then maybe some advanced uses: - from a sequence stream, retain only those of primates - like above, but only mitochondrial sequences - for an organism, query entrez for all sequences of strains, varieties, or subspecies sequences for that organism Add your own if these sound stupid ... Just an idea. -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From cjfields at uiuc.edu Mon Jul 24 22:06:16 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 24 Jul 2006 21:06:16 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <86C97545-8DDF-4FE3-9CE9-1799D8BD2EB5@gmx.net> References: <000101c6af68$f27521a0$15327e82@pyrimidine> <44C54683.70707@sendu.me.uk> <86C97545-8DDF-4FE3-9CE9-1799D8BD2EB5@gmx.net> Message-ID: <4678548F-ABEC-4E14-AD7F-D282D2DC2730@uiuc.edu> > >> I'll tell you what. This will be easier if I just write the code >> for my >> proposals, including whatever changes would be needed in >> Bio::SeqIO::genbank et al. > > Never get in the way of somebody who threatens to code :-) so I > certainly won't. I think you're on the right track. Fine by me. My only request: I don't want every sequence passing through SeqIO having an automatic DB lookup performed on it. SeqIO parsing of GenBank files is slow enough as it is w/o enforcing lookups, even if they are cached. If you want lookups, have it as an option and not as default behavior. 
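The opt-in lookup asked for above could be sketched roughly as follows. This is only an illustration under stated assumptions: the package name My::GenBankParser, the -taxonomy_lookup and -lookup_cb arguments and the parse_organism method are all hypothetical and are not part of Bio::SeqIO::genbank. The point is simply that the flag defaults to off, so ordinary parsing never pays for a lookup.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a SeqIO-style parser (NOT the Bio::SeqIO API).
package My::GenBankParser;

sub new {
    my ($class, %args) = @_;
    my $self = {
        # Off unless the caller asks for it explicitly.
        taxonomy_lookup => $args{-taxonomy_lookup} ? 1 : 0,
        # Code reference standing in for a taxonomy database handle.
        lookup_cb       => $args{-lookup_cb},
        # Counter kept only so the behaviour can be checked.
        lookups_done    => 0,
    };
    return bless $self, $class;
}

sub parse_organism {
    my ($self, $organism_line) = @_;

    # Always keep what the file itself says.
    my %species = (scientific_name => $organism_line);

    # Consult the database only when the flag is set and a handle exists.
    if ($self->{taxonomy_lookup} && $self->{lookup_cb}) {
        $species{lineage} = [ $self->{lookup_cb}->($organism_line) ];
        $self->{lookups_done}++;
    }
    return \%species;
}

package main;

# Default: no lookup at all.
my $fast = My::GenBankParser->new();

# Opt-in: the callback here just returns a canned, made-up answer.
my $rich = My::GenBankParser->new(
    -taxonomy_lookup => 1,
    -lookup_cb       => sub { return ('Eukaryota', 'Metazoa') },
);
```

With this shape, calling parse_organism on $fast returns only the scientific name, while the same call on $rich also fills in a lineage from the callback.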
We could have the option for a lookup added pretty easily in genbank.pm _initialize or the main SeqIO constructor as a simple Boolean flag. That might be pretty nice. ... > (). But what happens behind the scene?) > - genbank.pm writing the SOURCE information for a sequence You know, the only really divisive point here is the lineage data and how to store it in _read_GenBank_Species or reproduce it in write_seq (). Again, I don't think we should have a forced lookup for this; it should just be stored as is, either in Node or SimpleValue. Again, I think the latter as everyone seems averse to containing this in Node. > Then maybe some advanced uses: > > - from a sequence stream, retain only those of primates > - like above, but only mitochondrial sequences > - for an organism, query entrez for all sequences of strains, > varieties, or subspecies sequences for that organism For the primate example, would you screen those out via the in-file lineage or using lookups? Something like '$seqout->write_seq($seq) if ($seq->species->organelle eq 'mitochondrion');' for the mitochondria example, which would mean leaving organelle() in Species/Node or whatever is used. The last one, I think, can be done w/o using the sequence directly using NCBI's ELink and the TaxID to cross-reference the nucleotide database. You would probably have to walk through all child nodes, but it's feasible that way. > Add your own if these sound stupid ... > > Just an idea. > > -hilmar > > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. 
Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From cjfields at uiuc.edu Mon Jul 24 22:29:57 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 24 Jul 2006 21:29:57 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C54683.70707@sendu.me.uk> References: <000101c6af68$f27521a0$15327e82@pyrimidine> <44C54683.70707@sendu.me.uk> Message-ID: <6E129BAF-08E3-4BDA-8D1C-F853BE7231C9@uiuc.edu> Look, we're just going back and forth on this stupid little thing, when the only point we really are divided on is what object type we should store certain items in a GenBank file (Bio::Species/ Bio::Tax::Node/Bio::Whatever). In particular, the main sticking point is the lineage. We could go back and forth on what Jason really intended. Personally, I think his past statements are quite clear on what his intent was (he's very clear in the wiki on what Bio::Taxonomy::Node was built to replace, in two separate posts and within the last four months). The reality is he's not here and you're willing to do the job. There is one thing I will make perfectly clear here: there should never, ever be enforced lookups for SeqIO (even using caches), though I have no problem having optional ones. This is something I have stated before and what you propose below steers dangerously in that direction. Where, for instance, do you store the lineage from a GenBank file? Do you want to do a series of Tax lookups to restore that data? I think that the number one complaint for sequence parsing is speed, which would only get slower with lookups (even cached). What I propose is we make it as simple as possible. 
Remove the unnecessary genus/species/subspecies parsing in genbank.pm, store the scientific name, common names, and lineage in some easily accessible way to make it easier for everyday users to use, have it tied to Bio::Taxonomy in some way (I propose Node, as it contains almost all the methods needed) so that you could get more information by moving up and down nodes, or retrieve more information. I, personally, don't see the point in having Bio:Species around after this discussion as Node seems to do the job adequately. My last word (I will be exiting this discussion and the group for two weeks): This would have been MUCH easier if all three of us could have gone to the local bar for a beer and discussed it. We should just take the time out to videoconference next time. Chris Christopher Fields Postdoctoral Researcher Lab of Dr.
Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From hlapp at gmx.net Mon Jul 24 23:31:41 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Mon, 24 Jul 2006 23:31:41 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <6E129BAF-08E3-4BDA-8D1C-F853BE7231C9@uiuc.edu> References: <000101c6af68$f27521a0$15327e82@pyrimidine> <44C54683.70707@sendu.me.uk> <6E129BAF-08E3-4BDA-8D1C-F853BE7231C9@uiuc.edu> Message-ID: <49AA1351-5741-4E19-AFCE-EEA5532531B6@gmx.net> On Jul 24, 2006, at 10:29 PM, Chris Fields wrote: > [...] > We could go back and forth on what Jason really intended. [...] The > reality is he's not here and you're willing to do the job. Right. And, knowing Jason, I think he'd be perfectly fine with seeing his original idea develop in a possibly different direction, provided it will all work nicely in the end. I'm willing to take the beating on me if that doesn't turn out to be true ... > > There is one thing I will make perfectly clear here: there should > never, ever be enforced lookups for SeqIO (even using caches), You certainly don't want taxonomy lookups during the parsing stage, and also not for the client requesting properties of the species that have been parsed with high confidence, i.e., genus and species for a straightforward binomial like 'Homo sapiens'. Writing sequences, IMHO, doesn't have to be as fast. It may be better to emit strict format a bit slower rather than sloppy format a bit faster. Upon parsing, one idea could be for the flat file parser to set a dirty bit in the parsed out species if the parsed text didn't follow strict binomial conventions, hence the parser may have made a mistake and if a client requests the information it is better to lookup the correct values from a taxonomy database. I.e., you could try with a strict regex first that would imply a high-confidence result. If that fails you don't give up but mark the result as untrustworthy. > [...] 
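The "strict regex first, otherwise mark as untrustworthy" idea described above might look something like the sketch below. It is plain Perl with no BioPerl dependency; the function name parse_binomial, the hash keys and the exact pattern are assumptions made for illustration, not existing genbank.pm code.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Bio::SeqIO::genbank.
# Returns a hash ref with genus, species and a 'dirty' flag.
sub parse_binomial {
    my ($organism) = @_;

    # Strict pattern: capitalised genus, one space, lower-case epithet.
    # A match is treated as a high-confidence parse.
    if ($organism =~ /^([A-Z][a-z]+) ([a-z]+)$/) {
        return { genus => $1, species => $2, dirty => 0 };
    }

    # Anything else: do not give up, but flag the guess as untrustworthy
    # so a client can decide to consult a taxonomy database later.
    my ($genus, $rest) = split /\s+/, $organism, 2;
    return { genus => $genus, species => $rest, dirty => 1 };
}

for my $name ('Homo sapiens', 'Influenza A virus (A/duck/1/2006(H5N1))') {
    my $parsed = parse_binomial($name);
    printf "%-45s dirty=%d\n", $name, $parsed->{dirty};
}
```

The pattern is deliberately narrow, so it errs toward setting the flag; only a client that actually asks for genus or species on a flagged record would need to pay for a lookup.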
> This would have been MUCH easier if all three of us could have gone > to the local bar for a beer and discussed it. We should just take > the time out to videoconference next time. You're not honestly suggesting that a videoconference is better than having beer together? Enjoy your trip, and thanks for hanging in there in the discussion, I appreciate it. -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From cjfields at uiuc.edu Tue Jul 25 01:53:33 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 25 Jul 2006 00:53:33 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <49AA1351-5741-4E19-AFCE-EEA5532531B6@gmx.net> References: <000101c6af68$f27521a0$15327e82@pyrimidine> <44C54683.70707@sendu.me.uk> <6E129BAF-08E3-4BDA-8D1C-F853BE7231C9@uiuc.edu> <49AA1351-5741-4E19-AFCE-EEA5532531B6@gmx.net> Message-ID: <190F1365-1F8B-401F-8FBA-8DE39D851111@uiuc.edu> So do we intend on having everyone who installs bioperl have a local copy of the taxonomy dumpfile? Or perform a remote lookup via Entrez? Seems a bit extreme. I would like the option of not having the lookup run; as I mentioned to Sendu, one of the biggest complaints about bioperl is speed. Additional lookups won't help on that end. Chris Christopher Fields Postdoctoral Researcher Lab of Dr.
Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From bix at sendu.me.uk Tue Jul 25 03:05:23 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Tue, 25 Jul 2006 08:05:23 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <6E129BAF-08E3-4BDA-8D1C-F853BE7231C9@uiuc.edu> References: <000101c6af68$f27521a0$15327e82@pyrimidine> <44C54683.70707@sendu.me.uk> <6E129BAF-08E3-4BDA-8D1C-F853BE7231C9@uiuc.edu> Message-ID: <44C5C2B3.1020304@sendu.me.uk> Chris Fields wrote: > > There is one thing I will make perfectly clear here: there should > never, ever be enforced lookups for SeqIO (even using caches), though > I have no problem having optional ones. This is something I have > stated before and what you propose below steers dangerously in that > direction. Where, for instance, do you store the lineage from a > GenBank file? Do you want to do a series of Tax lookups to restore > that data? I think that the number one complaint for sequence > parsing is speed, which would only get slower with lookups (even > cached). I already gave a code example of exactly how Bio::Taxonomy is perfect for storing the lineage data in a GenBank file with or without a database lookup. I think perhaps at the time you first read this you basically ignored it because you had trouble with the idea of adding nodes to a species. If you have been glossing over my argument, it may be instructive to go over what I've been saying with a clear eye. 
Anyway, here it is again, and remember in this example, Bio::Species isa
Bio::Taxonomy:


## the fully-manual way
my $species = new Bio::Species;
my $node = new Bio::Taxonomy::Node(-name => 'Saccharomyces cerevisiae',
                                   -rank => 'species', -object_id => 1,
                                   -parent_id => 2);
my $n2 = new Bio::Taxonomy::Node(-name => 'Saccharomyces',
                                 -object_id => 2, -parent_id => 3);
# (no assumption that 'Saccharomyces' is the genus, so rank() undefined)
my $n3 = [etc]
$species->add_node($node);
$species->add_node($n2);
[etc]

## Using a factory without db access
# assume that Bio::Taxonomy::GenbankFactory implements
# some modified Bio::Taxonomy::FactoryI
my $factory = Bio::Taxonomy::GenbankFactory->new();
my $species = $factory->generate(-classification => ['Saccharomyces cerevisiae',
                                                     'Saccharomyces',
                                                     'Saccharomycetaceae' ...]);
# the generate() method above just does the fully-manual way for you

## Using a factory with db access
# assume that Bio::Taxonomy::EntrezFactory implements some
# modified Bio::Taxonomy::FactoryI and uses Bio::DB::Taxonomy::entrez
# to get the nodes
my $factory = Bio::Taxonomy::EntrezFactory->new();
my $species = $factory->fetch(-scientifc_name => 'Saccharomyces cerevisiae');


So now do you see how we're able to do the Genbank no-db way and the
db-using way with the same object model? We're able to do it the same,
sane way because a Node is just a node; you can make them yourself
manually, or retrieve them from a database. Once you stick them in a
Taxonomy you can then (potentially) ask all the questions of the data
that you can with existing Bio::Species. No cruft is required anywhere
at all. All the Taxonomy classes can be 'pure', while only Bio::Species
has to have backward-compatibility methods.

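The argument in this message is that a lineage is only a set of nodes linked by parent ids, however the nodes were obtained. A minimal sketch of that idea follows, assuming plain hashes as stand-ins for Bio::Taxonomy::Node objects; the ids and the lineage() helper are made up for illustration and are not part of BioPerl.

```perl
use strict;
use warnings;

# Plain-hash stand-ins for taxonomy nodes (hypothetical ids, not NCBI TaxIDs).
# Only the species node carries a rank, mirroring the manual example.
my %nodes = (
    1 => { name => 'Saccharomyces cerevisiae', rank => 'species', parent_id => 2 },
    2 => { name => 'Saccharomyces',            parent_id => 3 },
    3 => { name => 'Saccharomycetaceae',       parent_id => undef },
);

# Walk the parent links from a starting node up to the root and
# return the node names in that order.
sub lineage {
    my ($nodes, $id) = @_;
    my (@names, %seen);
    while (defined $id && exists $nodes->{$id}) {
        last if $seen{$id}++;    # guard against accidental cycles
        push @names, $nodes->{$id}{name};
        $id = $nodes->{$id}{parent_id};
    }
    return @names;
}

print join('; ', lineage(\%nodes, 1)), "\n";
```

Nothing in the walk depends on where the hash came from, so the same traversal would serve nodes typed in by hand, parsed from a flat file, or fetched from a database.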
From bernd.web at gmail.com Tue Jul 25 06:47:50 2006
From: bernd.web at gmail.com (Bernd Web)
Date: Tue, 25 Jul 2006 12:47:50 +0200
Subject: [Bioperl-l] Structure::IO
Message-ID: <716af09c0607250347w24e0d8dbj674758049bc0a4e2@mail.gmail.com>

Hi,

Does someone have experience with Bio::Structure::IO?
The example III.9.1 from the bptutorial.pl covers most, but what is,
e.g., the chain() method of Bio::Structure::Entry doing?

The POD states:
 Title   : chain
 Usage   : @chains  = $structure->chain($chain);
 Function: Connects a (or a list of) Chain objects to a Bio::Structure::Entry.
 Returns : list of Bio::Structure::Residue objects
 Args    : One Residue or a reference to an array of Residue objects

But in, e.g.,

my $stream = Bio::Structure::IO->new(-file => $filename,
                                     -format => 'pdb');
while ( my $struc = $stream->next_structure() ) {
   for my $chain ($struc->get_chains) {
      my $chainid = $chain->id;
      my @chains = $struc->chain($chain);
   }
}

I get Bio::Structure::Chain=HASH(0x9f1ab50).
What is the function of the chain method, and how is it used?

Best regards,
bernd

From bernd.web at gmail.com Tue Jul 25 07:44:28 2006
From: bernd.web at gmail.com (Bernd Web)
Date: Tue, 25 Jul 2006 13:44:28 +0200
Subject: [Bioperl-l] SeqUtils
Message-ID: <716af09c0607250444y3e005fb1t4e20094fd8db993d@mail.gmail.com>

Hi,

With Bio::SeqUtils it may be nice to support 3-letter codes in capitals
only, too. Currently

my $string = Bio::SeqUtils->seq3in($seqobj, 'METGLYTER');

gives XXX in $string->seq.
Possibly the capitals in MetGlyTer are used to find the amino acid
codes? If not, maybe it's easy to implement case-insensitive, or
all-capitals, AA codes in SeqUtils?

In addition, about the POD: maybe it's better not to use $string, since
Bio::SeqUtils->seq3in does not return a string but a Bio::PrimarySeq object.

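Until the module handles this itself, one possible workaround is to normalize the case of the three-letter string before passing it on. The sketch below is only that: normalize_three_letter() is a hypothetical helper, not part of Bio::SeqUtils, and it assumes that seq3in() matches mixed-case codes such as Met, as the behaviour described above suggests.

```perl
use strict;
use warnings;

# Hypothetical helper (not in Bio::SeqUtils): rewrite every three-letter
# group as "Xxx", so 'METGLYTER' and 'metglyter' both become 'MetGlyTer'.
sub normalize_three_letter {
    my ($codes) = @_;
    die "length of '$codes' is not a multiple of three\n"
        if length($codes) % 3;
    # Lowercase each group, then uppercase its first letter
    $codes =~ s/([A-Za-z]{3})/ucfirst(lc($1))/ge;
    return $codes;
}

print normalize_three_letter('METGLYTER'), "\n";    # prints MetGlyTer
```

The normalized string could then be handed to seq3in() in place of the all-capitals one.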
Regards, Bernd From cjfields at uiuc.edu Tue Jul 25 08:28:01 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 25 Jul 2006 07:28:01 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C5C2B3.1020304@sendu.me.uk> References: <000101c6af68$f27521a0$15327e82@pyrimidine> <44C54683.70707@sendu.me.uk> <6E129BAF-08E3-4BDA-8D1C-F853BE7231C9@uiuc.edu> <44C5C2B3.1020304@sendu.me.uk> Message-ID: <05718B29-6432-4ACA-A10F-2B384CB68191@uiuc.edu> Look, you explaining this to me, as you see it, does not convince me that its the correct or right way to do it. Okay? Can we agree on that? I do not think that Species and Taxonomy are the same thing. A species should not hold more than one node. A species, by definition, is a rank in Taxonomy, and is a node, not a full Taxonomy, so Bio::Species should be a Node, not a Taxonomy. I don't see how I can be any clearer... The fact that it may work is beyond the point. That's like putting duct tape on a leak to me. Why not just simplify Bio::Species into a Node? Or make it into a Node and get rid of it altogether. You are going to do what you want to do, regardless of what I say. Seems to be par for the course here. I'm REALLY tired of arguing the point. Okay? Just drop it. I have other priorities in life besides goddamned bioperl right now... Chris On Jul 25, 2006, at 2:05 AM, Sendu Bala wrote: > Chris Fields wrote: >> >> There is one thing I will make perfectly clear here: there should >> never, ever be enforced lookups for SeqIO (even using caches), though >> I have no problem having optional ones. This is something I have >> stated before and what you propose below steers dangerously in that >> direction. Where, for instance, do you store the lineage from a >> GenBank file? Do you want to do a series of Tax lookups to restore >> that data? I think that the number one complaint for sequence >> parsing is speed, which would only get slower with lookups (even >> cached). 
> > I already gave a code example of exactly how Bio::Taxonomy is perfect > for storing the lineage data in a GenBank file with or without a > database lookup. I think perhaps at the time you first read this you > basically ignored it because you had trouble with the idea of adding > nodes to a species. If you have been glossing over my argument, it may > be instructive to go over what I've been saying with a clear eye. > Anyway, here it is again, and remember in this example, > Bio::Species isa > Bio::Taxonomy: > > > ## the fully-manual way > my $species = new Bio::Species; > my $node = new Bio::Taxonomy::Node(-name => 'Saccharomyces > cerevisiae', > -rank => 'species', -object_id > => 1, > -parent_id => 2); > my $n2 = new Bio::Taxonomy::Node(-name => 'Saccharomyces', > -object_id => 2, -parent_id => 3); > # (no assumption that 'Saccharomyces' is the genus, so rank() > undefined) > my $n3 = [etc] > $species->add_node($node); > $species->add_node($n2); > [etc] > > ## Using a factory without db access > # assume that Bio::Taxonomy::GenbankFactory implements > # some modified Bio::Taxonomy::FactoryI > my $factory = Bio::Taxonomy::GenbankFactory->new(); > my $species = $factory->generate(-classification => ['Saccharomyces > cerevisiae', 'Saccharomyces', > 'Saccharomycetaceae' ...]); > # the generate() method above just does the fully-manual way for you > > ## Using a factory with db access > # assume that Bio::Taxonomy::EntrezFactory implements some > # modified Bio::Taxonomy::FactoryI and uses Bio::DB::Taxonomy::entrez > # to get the nodes > my $factory = Bio::Taxonomy::EntrezFactory->new(); > my $species = $factory->fetch(-scientifc_name => 'Saccharomyces > cerevisiae'); > > > So now do you see how we're able to do the Genbank no-db way and the > db-using way with the same object model? We're able to do it the same, > sane way because a Node is just a node; you can make them yourself > manually, or retrieve them from a database. 
Once you stick them in a > Taxonomy you can then (potentially) ask all the questions of the data > that you can with existing Bio::Species. No cruft is required anywhere > at all. All the Taxonomy classes can be 'pure', while only > Bio::Species > has to have backward-compatibility methods. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From bix at sendu.me.uk Tue Jul 25 08:52:03 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Tue, 25 Jul 2006 13:52:03 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <05718B29-6432-4ACA-A10F-2B384CB68191@uiuc.edu> References: <000101c6af68$f27521a0$15327e82@pyrimidine> <44C54683.70707@sendu.me.uk> <6E129BAF-08E3-4BDA-8D1C-F853BE7231C9@uiuc.edu> <44C5C2B3.1020304@sendu.me.uk> <05718B29-6432-4ACA-A10F-2B384CB68191@uiuc.edu> Message-ID: <44C613F3.7070903@sendu.me.uk> Chris Fields wrote: > A species should not hold more than one node. A species, by > definition, is a rank in Taxonomy, and is a node, not a full > Taxonomy, so Bio::Species should be a Node, not a Taxonomy. I don't > see how I can be any clearer... Right, we have differing viewpoints because you're concerned with what Bio::Species /should/ be, based on the name of the file and perhaps its original intent, whilst I am treating it as what it actually /is/, which is an object that is used to contain information about multiple taxonomic nodes. > The fact that it may work is beyond the point. That's like putting > duct tape on a leak to me. Why not just simplify Bio::Species into a > Node? Or make it into a Node and get rid of it altogether. Bio::Species, again ignore the name, is just a thing that lets us store and retrieve a certain set of data. If we simplified it into a pure Node, it could no longer do that job. 
If we just get rid of it all together it can no longer do its job. By making it a Bio::Taxonomy it can continue to do its job without having to have Node objects with cruft. It would also gain the useful methods of Bio::Taxonomy at the same time. I really don't mean to upset you, and I apologise for having done so. I've been presenting what I thought was a logical argument in favour of Bio::Species as Bio::Taxonomy, and waiting to see if anyone would come up with a logical argument why that would be inappropriate, or why something else would be better. I'm not saying you're wrong and I'm certainly listening and would change my choice based on what you have to say. I don't think it's fair to say that disregarding what you have to say is 'par for the course' - I already /have/ regarded what you had to say in this thread and ended up doing scientific_name() as purely what we get from the database. From hlapp at gmx.net Tue Jul 25 09:47:47 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Tue, 25 Jul 2006 09:47:47 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C5C2B3.1020304@sendu.me.uk> References: <000101c6af68$f27521a0$15327e82@pyrimidine> <44C54683.70707@sendu.me.uk> <6E129BAF-08E3-4BDA-8D1C-F853BE7231C9@uiuc.edu> <44C5C2B3.1020304@sendu.me.uk> Message-ID: On Jul 25, 2006, at 3:05 AM, Sendu Bala wrote: > [...] > ## the fully-manual way > my $species = new Bio::Species; > my $node = new Bio::Taxonomy::Node(-name => 'Saccharomyces > cerevisiae', > -rank => 'species', -object_id > => 1, > -parent_id => 2); If this is meant as an example for the use cases I enumerated, then you wouldn't have the parent_id from a Genbank file. However, you didn't have that before either, so no problem. 
> my $n2 = new Bio::Taxonomy::Node(-name => 'Saccharomyces',
>                                   -object_id => 2, -parent_id => 3);
> # (no assumption that 'Saccharomyces' is the genus, so rank()
> undefined)

I think in a confident parse you want to assign 'genus' if there's
little doubt, for example 'Saccharomyces cerevisiae'. Not sure
whether there are weird viri whose names look innocuous but in
reality the name doesn't follow binomial convention.

> my $n3 = [etc]
> $species->add_node($node);
> $species->add_node($n2);

I know why you are doing this, but seeing this people will hit a
mental snag. You should listen to Chris' refusal to see the sense in
this as an indication that many people down the road won't see the
sense either.

So instead, make the logical model in your design more obvious, which
I think ultimately will help maintainability as well. For example:

my $taxonomy = Bio::Taxonomy->new();
my $node = new Bio::Taxonomy::Node(-name => 'Saccharomyces cerevisiae',
                                   -rank => 'species', -object_id => 1,
                                   -parent_id => 2);
my $n2 = new Bio::Taxonomy::Node(-name => 'Saccharomyces',
                                 -object_id => 2, -parent_id => 3);
$taxonomy->add_node($node);
$taxonomy->add_node($n2);

my $species = Bio::Species->new(-lineage => $taxonomy);
print $species->binomial();
print $species->genus();
# this may trigger a lookup if a taxonomy db handle has been set, e.g.:
# $taxonomy->db_handle(Bio::DB::Taxonomy->new(-source => 'entrez'));
print $species->classification();

> [etc]
>
> ## Using a factory without db access
> # assume that Bio::Taxonomy::GenbankFactory implements
> # some modified Bio::Taxonomy::FactoryI
> my $factory = Bio::Taxonomy::GenbankFactory->new();
> my $species = $factory->generate(-classification => ['Saccharomyces
> cerevisiae', 'Saccharomyces',
> 'Saccharomycetaceae' ...]);
> # the generate() method above just does the fully-manual way for you

Except the method name would be create_object(), the parameter would
be a hash ref, and the return value would be a Bio::TaxonomyI
compliant object:

my $taxonomy = $factory->create_object({-classification =>
                                          ['Saccharomyces cerevisiae',
                                           'Saccharomyces',
                                           'Saccharomycetaceae' ...]});
my $species = Bio::Species->new(-lineage => $taxonomy);

>
> ## Using a factory with db access
> # assume that Bio::Taxonomy::EntrezFactory implements some
> # modified Bio::Taxonomy::FactoryI and uses Bio::DB::Taxonomy::entrez
> # to get the nodes
> my $factory = Bio::Taxonomy::EntrezFactory->new();

The logic where to do a lookup on should not be duplicated here. It
only belongs under Bio::DB::Taxonomy::*.

> my $species = $factory->fetch(-scientifc_name => 'Saccharomyces
> cerevisiae');

Likewise, use the methods defined in Bio::DB::Taxonomy, and again,
the return type is Bio::Taxonomy, which you would pass to
Bio::Species->new().

	-hilmar
-- 
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From hlapp at gmx.net Tue Jul 25 09:54:14 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Tue, 25 Jul 2006 09:54:14 -0400
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <190F1365-1F8B-401F-8FBA-8DE39D851111@uiuc.edu>
References: <000101c6af68$f27521a0$15327e82@pyrimidine>
	<44C54683.70707@sendu.me.uk>
	<6E129BAF-08E3-4BDA-8D1C-F853BE7231C9@uiuc.edu>
	<49AA1351-5741-4E19-AFCE-EEA5532531B6@gmx.net>
	<190F1365-1F8B-401F-8FBA-8DE39D851111@uiuc.edu>
Message-ID: <793AFD5C-D220-493F-BE11-B9023DC9F569@gmx.net>

We intend on having everyone who wants correct taxonomy parsing
results for the entire kingdom of life to define his/her
authoritative taxonomy database, be it local or not, be it HTTP or
SQL queried.

If you don't care about the correctness of the taxonomy parse, or if
the taxonomy information in the flat file is trivially parseable
because it conforms to standard binomial convention, then whatever is
to be put in place needs to work fine regardless of whether a
taxonomy database is defined or not.

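The "dirty bit" idea from earlier in the thread, combined with the optional lookup described here, might be sketched roughly as follows. This is an illustration under stated assumptions, not BioPerl code: parse_organism(), genus_of(), the 'trusted' flag and the lookup callback are all hypothetical names, and the regex is deliberately stricter than real nomenclature requires.

```perl
use strict;
use warnings;

# Hypothetical parser step: try a strict binomial match first and mark
# anything else as untrustworthy instead of guessing.
sub parse_organism {
    my ($text) = @_;
    # Strict binomial: capitalised genus, lower-case epithet, nothing else
    if ($text =~ /^([A-Z][a-z]+) ([a-z]+)$/) {
        return { raw => $text, genus => $1, species => $2, trusted => 1 };
    }
    # Anything else keeps only the raw text, with the dirty bit set
    return { raw => $text, trusted => 0 };
}

# The lookup is optional: it is consulted only for untrusted parses and
# only if the caller supplied one (a code ref standing in for a
# taxonomy database handle).
sub genus_of {
    my ($parsed, $lookup) = @_;
    return $parsed->{genus} if $parsed->{trusted};
    return $lookup ? $lookup->($parsed->{raw}) : undef;
}

my $easy = parse_organism('Homo sapiens');
my $hard = parse_organism('Influenza A virus (A/Puerto Rico/8/34(H1N1))');

print genus_of($easy), "\n";                                    # no lookup needed
print defined genus_of($hard) ? "resolved\n" : "unresolved\n";  # no lookup given
```

With this shape, parsing itself never touches a database, and a client that has not configured one still gets an answer for the high-confidence cases.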
-hilmar On Jul 25, 2006, at 1:53 AM, Chris Fields wrote: > So do we intend on having everyone who installs bioperl have a local > copy of the taxonomy dumpfile? Or perform a remote lookup via > Entrez? Seems a bit extreme. > > I would like the option of not having the lookup run; as I mentioned > to Sendu, one of the biggest complaints about bioperl is speed. > Additional lookups won't help on that end. > > Chris > > On Jul 24, 2006, at 10:31 PM, Hilmar Lapp wrote: > >> >> On Jul 24, 2006, at 10:29 PM, Chris Fields wrote: >> >>> [...] >>> We could go back and forth on what Jason really intended. [...] The >>> reality is he's not here and you're willing to do the job. >> >> Right. And, knowing Jason, I think he'd be perfectly fine with seeing >> his original idea develop in a possibly different direction, provided >> it will all work nicely in the end. I'm willing to take the beating >> on me if that doesn't turn out to be true ... >> >>> >>> There is one thing I will make perfectly clear here: there should >>> never, ever be enforced lookups for SeqIO (even using caches), >> >> You certainly don't want taxonomy lookups during the parsing stage, >> and also not for the client requesting properties of the species that >> have been parsed with high confidence, i.e., genus and species for a >> straightforward binomial like 'Homo sapiens'. >> >> Writing sequences, IMHO, doesn't have to be as fast. It may be better >> to emit strict format a bit slower rather than sloppy format a bit >> faster. >> >> Upon parsing, one idea could be for the flat file parser to set a >> dirty bit in the parsed out species if the parsed text didn't follow >> strict binomial conventions, hence the parser may have made a mistake >> and if a client requests the information it is better to lookup the >> correct values from a taxonomy database. I.e., you could try with a >> strict regex first that would imply a high-confidence result. 
If that >> fails you don't give up but mark the result as untrustworthy. >> >> >>> [...] >>> This would have been MUCH easier if all three of us could have gone >>> to the local bar for a beer and discussed it. We should just take >>> the time out to videoconference next time. >> >> You're not honestly suggesting that a videoconference is better than >> having beer together? >> >> Enjoy your trip, and thanks for hanging in there in the discussion, I >> appreciate it. >> >> -hilmar >> -- >> =========================================================== >> : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : >> =========================================================== >> >> >> >> >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > > Christopher Fields > Postdoctoral Researcher > Lab of Dr. Robert Switzer > Dept of Biochemistry > University of Illinois Urbana-Champaign > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From cjfields at uiuc.edu Tue Jul 25 10:58:29 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 25 Jul 2006 09:58:29 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <793AFD5C-D220-493F-BE11-B9023DC9F569@gmx.net> Message-ID: <002601c6affa$ca4433f0$15327e82@pyrimidine> Agreed. I fully support the addition of an optional lookup; it gives much more flexibility SeqIO re: your previous examples of screening sequence streams for sequences that are primate, mitochondrial, etc. The key word I want to emphasize is 'optional', not 'enforced'. I appreciate what Sendu is trying to do; I really do. 
I think carrying over an object named 'Bio::Species' into Taxonomy is too confusing (your 'contagion' analogy, as it were). The 'species' concept (biologically speaking here, not talking about the Bioperl class) is a taxonomic rank (i.e. part of a taxonomy). I'm trying to take a biologist's point of view here. What is a 'species'? Or, if we were to stick strictly with using NCBI definitions, what is a 'species'? The NCBI definition of 'species' is simply a rank in a lineage, so it is (in Bioperl terms) a Node. If we were to follow that line of reasoning, why also have a Species object represent a Taxonomy as well? It's way too confusing. Sendu's repeatedly stating "a Species is a Taxonomy" makes some sense in a BioPerl world only, as we're speaking about a class that has been around for a long time, one that acted as a container of sorts for sequence data. And I understand what he intends to do. Conceptually speaking here, though, the way it is laid out, a Bio::Species object can hold a Node that represents a 'species' rank, as well as a 'genus' Node, and a 'family' node, and on and on. That's not a 'species', that's a taxonomy. So just call it a Taxonomy. The object itself (Bio::Species) never truly represented a 'species' anyway, biologically speaking, every time it held sequence data. It could be a subspecies, strain, plasmid, unknown, or an unclassified rank ('no rank') or environmental sample. It really held a fancier representation of a node, as based on the TaxID. My final point is, saying "a species is a taxonomy" to the rest of the biological world doesn't make sense. Maybe it makes sense to you and I and Sendu, in our little Bioperl world. But to the thousands of users out there who don't completely grok the Bioperl class structure, it's just confusing. If I were to get an object back that was labeled Bio::Species, as a biologist I would expect it to be part of a taxonomy, not the actual Taxonomy itself. 
So, why not cut to the chase: if we are to fundamentally change the concept of what Bio::Species is by making it a Taxonomy/TaxonomyI or whatever, why not just use a Taxonomy object altogether and not bother with Bio::Species at all? Deprecate it. BTW, I'll be in Connecticut for five days at UConn. So I hope to escape the heat for a bit. Thanks for listening to my side of things. Chris > -----Original Message----- > From: Hilmar Lapp [mailto:hlapp at gmx.net] > Sent: Tuesday, July 25, 2006 8:54 AM > To: Chris Fields > Cc: Sendu Bala; bioperl-l at bioperl.org > Subject: Re: [Bioperl-l] Bio::*Taxonomy* changes > > We intend on having everyone who wants correct taxonomy parsing > results for the entire kingdom of life to define his/her > authoritative taxonomy database, be it local or not, be it HTTP or > SQL queried. > > If you don't care about the correctness of the taxonomy parse, or if > the taxonomy information in the flat file is trivially parseable > because it conforms to standard binomial convention, then whatever is > to be put in place needs to work fine regardless of whether a > taxonomy database is defined or not. > > -hilmar > > On Jul 25, 2006, at 1:53 AM, Chris Fields wrote: > > > So do we intend on having everyone who installs bioperl have a local > > copy of the taxonomy dumpfile? Or perform a remote lookup via > > Entrez? Seems a bit extreme. > > > > I would like the option of not having the lookup run; as I mentioned > > to Sendu, one of the biggest complaints about bioperl is speed. > > Additional lookups won't help on that end. > > > > Chris > > > > On Jul 24, 2006, at 10:31 PM, Hilmar Lapp wrote: > > > >> > >> On Jul 24, 2006, at 10:29 PM, Chris Fields wrote: > >> > >>> [...] > >>> We could go back and forth on what Jason really intended. [...] The > >>> reality is he's not here and you're willing to do the job. > >> > >> Right. 
And, knowing Jason, I think he'd be perfectly fine with seeing > >> his original idea develop in a possibly different direction, provided > >> it will all work nicely in the end. I'm willing to take the beating > >> on me if that doesn't turn out to be true ... > >> > >>> > >>> There is one thing I will make perfectly clear here: there should > >>> never, ever be enforced lookups for SeqIO (even using caches), > >> > >> You certainly don't want taxonomy lookups during the parsing stage, > >> and also not for the client requesting properties of the species that > >> have been parsed with high confidence, i.e., genus and species for a > >> straightforward binomial like 'Homo sapiens'. > >> > >> Writing sequences, IMHO, doesn't have to be as fast. It may be better > >> to emit strict format a bit slower rather than sloppy format a bit > >> faster. > >> > >> Upon parsing, one idea could be for the flat file parser to set a > >> dirty bit in the parsed out species if the parsed text didn't follow > >> strict binomial conventions, hence the parser may have made a mistake > >> and if a client requests the information it is better to lookup the > >> correct values from a taxonomy database. I.e., you could try with a > >> strict regex first that would imply a high-confidence result. If that > >> fails you don't give up but mark the result as untrustworthy. > >> > >> > >>> [...] > >>> This would have been MUCH easier if all three of us could have gone > >>> to the local bar for a beer and discussed it. We should just take > >>> the time out to videoconference next time. > >> > >> You're not honestly suggesting that a videoconference is better than > >> having beer together? > >> > >> Enjoy your trip, and thanks for hanging in there in the discussion, I > >> appreciate it. 
> >> > >> -hilmar > >> -- > >> =========================================================== > >> : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > >> =========================================================== > >> > >> > >> > >> > >> > >> _______________________________________________ > >> Bioperl-l mailing list > >> Bioperl-l at lists.open-bio.org > >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > Christopher Fields > > Postdoctoral Researcher > > Lab of Dr. Robert Switzer > > Dept of Biochemistry > > University of Illinois Urbana-Champaign > > > > > > > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > From cjfields at uiuc.edu Tue Jul 25 11:36:40 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 25 Jul 2006 10:36:40 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: Message-ID: <003301c6b000$203cc560$15327e82@pyrimidine> > On Jul 25, 2006, at 3:05 AM, Sendu Bala wrote: > > > [...] > > ## the fully-manual way > > my $species = new Bio::Species; > > my $node = new Bio::Taxonomy::Node(-name => 'Saccharomyces > > cerevisiae', > > -rank => 'species', -object_id > > => 1, > > -parent_id => 2); > > If this is meant as an example for the use cases I enumerated, then > you wouldn't have the parent_id from a Genbank file. However, you > didn't have that before either, so no problem. > > > my $n2 = new Bio::Taxonomy::Node(-name => 'Saccharomyces', > > -object_id => 2, -parent_id => 3); > > # (no assumption that 'Saccharomyces' is the genus, so rank() > > undefined) > > I think in a confident parse you want to assign 'genus' if there's > little doubt, for example 'Saccharomyces cerevisiae'. 
Not sure
> whether there are weird viri whose names look innocuous but in
> reality the name doesn't follow binomial convention.
>
> > my $n3 = [etc]
> > $species->add_node($node);
> > $species->add_node($n2);
>
> I know why you are doing this, but seeing this people will hit a
> mental snag. You should listen to Chris' refusal to see the sense in
> this as an indication that many people down the road won't see the
> sense either.

Thanks for pointing that out. I think there is only a small, fundamental
difference in our views here. I'm trying to view this as an outsider would,
a biologist not familiar with the Bioperl class structure. I understand what
Sendu's trying to accomplish but it's really confusing to someone not
familiar with what Bio::Species is.

Hilmar, you had pointed out several times that Bio::Species and
Bio::Taxonomy shouldn't directly intermingle. My original thought for
genbank.pm _read_GenBank_Species() was this, copied and pasted from my local
genbank.pm. It's sort of extreme, but it passes tests just fine.

sub _read_GenBank_Species {
    my( $self,$buffer) = @_;

    $_ = $$buffer;

    my @organelles = qw(plastid chloroplast mitochondrion);

    my( $source_data, $common_name, @class, $ns_name, $organelle,
        $source_flag, $sci_name, $abbr );
    while (defined($_) || defined($_ = $self->_readline())) {
        # de-HTMLify (links that may be encountered here don't contain
        # escaped '>', so a simple-minded approach suffices)
        s/<[^>]+>//g;
        if ( /^SOURCE\s+(.*)/o ) {
            $source_data = $1;
            $source_data =~ s/\.$//; # remove trailing dot
            # does it have a GenBank common name in parentheses?
            $common_name = $source_data =~ m{\((.*)\)}xms;
            # organelle? If we find additional odd ones,
            # add to @organelle
            $organelle = grep { $_ =~ $source_data } @organelles;
            $source_flag = 1;
        } elsif ( /^\s{2}ORGANISM\s+(.*)/o ) {
            $sci_name = $1;
            $source_flag = 0;
        } elsif ($source_flag) { # no ORGANISM
            $common_name .= $source_data;
            $common_name =~ s/\n//g;
            $common_name =~ s/\s+/ /g;
            $source_flag = 0;
        } elsif ( /^\s+(.+)/o ) { # lineage information
            my $line = $1;
            # only split on ';' or '.' so that classification
            # that is 2 words will still get matched, use
            # map() to remove trailing/leading spaces
            push(@class, map { s/^\s+//; s/\s+$//; $_; }
                 split /[;\.]+/, $line)
                if ( $line =~ /(;|\.)/ );
        } else { # reach end of GenBank tax info
            last;
        }
        $_ = undef; # Empty $_ to trigger read of next line
    }
    $$buffer = $_;

    @class = reverse @class;

    my $make = Bio::Taxonomy::Node->new();
    $make->common_name( $common_name ) if $common_name;
    $make->scientific_name($sci_name) if $sci_name;
    # could use SimpleValue objs here instead
    $make->classification( @class ) if @class;
    $make->organelle($organelle) if $organelle;
    return $make;
}

# back in next_seq...grab the TaxID from 'source'
# seqfeature
# could check organelle() here as well

# add taxon_id from source if available
if($species && ($feat->primary_tag eq 'source') &&
   $feat->has_tag('db_xref') && (! $species->ncbi_taxid())) {
    foreach my $tagval ($feat->get_tag_values('db_xref')) {
        if(index($tagval,"taxon:") == 0) {
            $species->ncbi_taxid(substr($tagval,6));
            last;
        }
    }
}

In other words, remove the extra parsing of genus() species() subspecies
etc. All GenBank sequences have a node represented in NCBI's tax database
(I checked it out). Even plasmids, unknowns, environmental samples.

Chris

> So instead, make the logical model in your design more obvious, which
> I think ultimately will help maintainability as well.
For example: > > my $taxonomy = Bio::Taxonomy->new(); > my $node = new Bio::Taxonomy::Node(-name => 'Saccharomyces cerevisiae', > -rank => 'species', -object_id > => 1, > -parent_id => 2); > my $n2 = new Bio::Taxonomy::Node(-name => 'Saccharomyces', > -object_id => 2, -parent_id => 3); > $taxonomy->add_node($node); > $taxonomy->add_node($n2); > > my $species = Bio::Species->new(-lineage => $taxonomy); > print $species->binomial(); > print $species->genus(); > # this may trigger a lookup if a taxonomy db handle has been set, e.g.: > # $taxonomy->db_handle(Bio::DB::Taxonomy->new(-source => 'entrez')); > print $species->classification(); > > > > [etc] > > > > ## Using a factory without db access > > # assume that Bio::Taxonomy::GenbankFactory implements > > # some modified Bio::Taxonomy::FactoryI > > my $factory = Bio::Taxonomy::GenbankFactory->new(); > > my $species = $factory->generate(-classification => ['Saccharomyces > > cerevisiae', 'Saccharomyces', > > 'Saccharomycetaceae' ...]); > > # the generate() method above just does the fully-manual way for you > > Except the method name would be create_object(), the parameter would > be a hash ref, and the return value would be a Bio::TaxonomyI > compliant object: > > my $taxonomy = $factory->create_object({-classification => > ['Saccharomyces > cerevisiae', 'Saccharomyces', > 'Saccharomycetaceae' ...]}); > my $species = Bio::Species->new(-lineage => $taxonomy); > > > > > > ## Using a factory with db access > > # assume that Bio::Taxonomy::EntrezFactory implements some > > # modified Bio::Taxonomy::FactoryI and uses Bio::DB::Taxonomy::entrez > > # to get the nodes > > my $factory = Bio::Taxonomy::EntrezFactory->new(); > > The logic where to do a lookup on should not be duplicated here. It > only belongs under Bio::DB::Taxonomy::*. 
> > > my $species = $factory->fetch(-scientifc_name => 'Saccharomyces > > cerevisiae'); > > Likewise, use the methods defined in Bio::DB::Taxonomy, and again, > the return type is Bio::Taxonomy, which you would pass to > Bio::Species->new(). > > -hilmar > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From bix at sendu.me.uk Tue Jul 25 13:49:04 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Tue, 25 Jul 2006 18:49:04 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <003301c6b000$203cc560$15327e82@pyrimidine> References: <003301c6b000$203cc560$15327e82@pyrimidine> Message-ID: <44C65990.4080500@sendu.me.uk> Chris Fields wrote: > If I were to get an object back that was labeled Bio::Species, as a > biologist I would expect it to be part of a taxonomy, not the actual > Taxonomy itself. I think this is the most important sentence in the discussion. Ok, so it's clear to me that a better solution is needed than my Bio::Taxonomy-related proposal. Sorry for being so slow on the uptake. I also needed to start trying to code my Taxonomy proposal to see some issues with it. [... in another email...] > I'm trying to view this as an outsider would, > a biologist not familiar with the Bioperl class structure. Ok, let's come up with a proposal that makes sense to the biologist and better matches Jason's original idea. ---- long post follows; there's a summary at the end As a biologist when I consider a species I have the following primary questions. 
Let's see how we would answer them using a) Bio::Species and genbank.pm as
they are now, b) Bio::Species if it was a 'pure' Bio::Taxonomy::Node with no
cruft (or if we just dropped Bio::Species and used Node directly), and Chris'
updated genbank.pm. Let's say we got our species information from a genbank
file where the scientific name and tax id are available to be parsed out.

# What is the species' name?
a) Not guaranteed to be correct.
b) Correct thanks to recent changes to Node, just use scientific_name()

# What is the lineage of this species?
a) I can get a classification array with classification(). It's a bit
rubbish though, I can't tell what any of the array elements are supposed to
be.
b) A pure Node wouldn't store the lineage on itself. There are two obvious
solutions: 1) add cruft to Node by giving it a classification() method -
works as well/bad as a). 2) call get_Lineage_Nodes(), which has the benefit
of telling me what rank each ancestor was, if that information had been in
the file (more likely, if Node was generated from database). Problem:
get_Lineage_Nodes() only works if it can
    $self->db_handle->get_Taxonomy_Node(-taxonid => $self->parent_id);
which obviously doesn't work if the nodes in our lineage didn't come from a
database, but from the parsing of a genbank flat file. As we parse the
genbank file we can certainly make nodes for each word in the list:

inside genbank.pm...
    @class = reverse @class;
    my @nodes;
    my $fake_id = 1;
    foreach my $sci_name (@class) {
        push(@nodes, new Bio::Taxonomy::Node(-name => $sci_name,
                                             -object_id => $fake_id++,
                                             -parent_id => $fake_id));
    }

But how do we keep these nodes and make them returnable later by
get_Lineage_Nodes? Perhaps:

    my $taxonomy = new Bio::Taxonomy;
    foreach my $node (@nodes) {
        $taxonomy->add_node($node);
    }
    ...
    my $make = Bio::Taxonomy::Node->new();
    ...
    $make->db_handle($taxonomy);

Bio::Taxonomy would have to implement get_Taxonomy_Node (it has get_node
which only accepts a rank).
Of course this is ugly, storing a Taxonomy in our database handle. We could have a new Bio::DB::Taxonomy:: class instead, that treated a classification array like a database? It could have the added bonus of building up an entire database internally as more input arrays are given to it, able to therefore give each node a unique but consistent id. It would break if one time you gave it qw(Homo Primates) and another time qw(Homo Hominidae Primates), however. Ideas? # What if I don't want the whole lineage, just to know what a specific rank like genus is for my species? a) use genus(), but not guaranteed to be correct. b) two solutions: 1) add cruft to Node by adding a genus() method: as good/bad as a). 2) use get_Lineage_Nodes() or get_Parent_Node() until you find a node with your rank() of interest. Same problems as for lineage question, but also it would be nicer to have a get_node('rank_name') style method. But such a method belongs in something like Bio::Taxonomy, not Node. At the very least a method like genus() would be implemented using pure Node methods like get_Parent_Node(), returning undefined if no parent had a rank() of 'genus', never guessing it. # Is this species the same as another species? a) Not guaranteed to be correct. (no unique id so forced to compare names) b) Correct answer by using object_id() method, along with Chris' change to genbank.pm. # What is the most recent common ancestor of this species and another? a) Can't be answered. b) Use get_LCA_Node(), but same issues as the lineage question, since get_LCA_Node requires a working get_Lineage_Nodes(). It also requires correct (unique) ids for all nodes in all lineages to give the guaranteed correct answer. But at least you /might/ get the correct answer even using only the data in genbank files and no db lookup. ---- summary: It seems like the main problem with Node right now is that it has classification() and things like genus(). 
I propose pure Node method solutions to answer the questions classification()
and genus() were implemented to answer, but in a better, cruft-free way.

Bio::DB::Taxonomy::genbank anyone?

Then if you started with a Species/Node generated by a genbank parse, and
wanted certain questions answered correctly, you only have to set a different
db_handle(). The Node only stores the static and hopefully correct
information about itself, whilst all other questions go via db_handle, so you
can dynamically swap back and forth between databases depending on if you
need speed or accuracy.

From cjfields at uiuc.edu Tue Jul 25 14:24:12 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Tue, 25 Jul 2006 13:24:12 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44C65990.4080500@sendu.me.uk>
Message-ID: <000001c6b017$873176a0$15327e82@pyrimidine>

Sendu, you'll have to make the changes how you see fit. You see my point now,
which is great. From my perspective, all the object type (used to contain
taxonomy file information) needs to contain is the scientific name and common
names like the SOURCE line abbreviated name and the actual GenBank common
name, if present. All the other cruft (i.e. genus/species/subspecies) can be
excised, and the proper taxonomic information, if wanted, could be accessed
via the object and its TaxID. Organelle and lineage information needs to be
retained (for the non-taxonomists) and could be stored in that object, bumped
to SimpleValue objects, or just set (alternative, since the data is small)
using a get/set value within the sequence object itself. This would be the
bare-bones approach, which Node can fulfill.

I also like Hilmar's proposal about including optional lookups, which greatly
increases the flexibility when screening sequences. This will likely require
a more complicated object structure (i.e. taxonomy with nodes). You suggested
a Taxonomy-like object which would work; but don't force Bio::Species into
the mix.
Why not just use a simple Bio::Taxonomy object for that (Hilmar's point)?
When one asks for $species->species, they'll get a Node or Taxonomy,
whichever is used (that's up to you). The Node represents a more-barebones
variation, while the Taxonomy object scheme would be more fully-realized.
Either way will work for me. Just don't call it 'species'. ;>

Once this is all done, will we really have a need for Bio::Species? That's my
other point. The only real use for it was as a container object for sequence
data. That job is now done via a Taxonomy/Node object. The only real use it
would have is as a container for taxonomic information for species ranks or
below. I think Node/Taxonomy can handle even that though, so now it's also
redundant. If a class is not useful and is redundant, maybe it should be
deprecated.

Anyway, I can't get involved anymore at this point; I'm too busy with getting
ready for the Kadner Institute next week. Good luck!

Chris

From hlapp at gmx.net Tue Jul 25 15:18:00 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Tue, 25 Jul 2006 15:18:00 -0400
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <000001c6b017$873176a0$15327e82@pyrimidine>
References: <000001c6b017$873176a0$15327e82@pyrimidine>
Message-ID: <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net>

On Jul 25, 2006, at 2:24 PM, Chris Fields wrote:

> Once this is all done, will we really have a need for Bio::Species?

No, except for backwards compatibility. Phasing it out will go over a couple
of releases. E.g., v1.6.x could have a deprecation warning in the
documentation. v1.7+ would have deprecation warnings in the code written to
stderr.

Just as an aside, we can't just drastically change the return type of a
method.
Instead, if at all possible, there should be a new method so that the old can
be phased out over time but otherwise not changed. I.e., don't change
$seq->species() to now all of a sudden return a node or taxonomic lineage,
even if initially Bio::Species is returned with some magic under the hood.
Instead, create something like

    # return a Bio::Taxonomy::Node:
    my $taxon = $seq->taxon();

    # alternative approach: return a lineage (taxonomy)
    # this would be Bio::TaxonomyI compliant
    my $lineage = $seq->lineage();

The former would require the lineage (and organelle for completeness)
information to be either easily (though not necessarily directly) accessible
through the node, or added as annotation.

-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From cjfields at uiuc.edu Tue Jul 25 15:30:40 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Tue, 25 Jul 2006 14:30:40 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net>
Message-ID: <000101c6b020$d09bc7b0$15327e82@pyrimidine>

Sounds good to me. I'm fine with any way that it's worked out, either
Taxonomy or Node-based, as long as there is no Bio::Species-based confusion
re: Taxonomy, and that this eventually leads to getting rid of Bio::Species
altogether.

Have fun, guys! (hey, probably the shortest response I have written)...

Chris

From cjfields at uiuc.edu Tue Jul 25 22:16:36 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Tue, 25 Jul 2006 21:16:36 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44C65990.4080500@sendu.me.uk>
References: <003301c6b000$203cc560$15327e82@pyrimidine> <44C65990.4080500@sendu.me.uk>
Message-ID: <1748BD60-EFBE-4872-94A4-7CA3D58BE458@uiuc.edu>

One last thing before I shut off bioperl for a week and concentrate on
Connecticut;

On Jul 25, 2006, at 12:49 PM, Sendu Bala wrote:

> Chris Fields wrote:
>> If I were to get an object back that was labeled Bio::Species, as a
>> biologist I would expect it to be part of a taxonomy, not the actual
>> Taxonomy itself.
>
> I think this is the most important sentence in the discussion.
Ok, so
> it's clear to me that a better solution is needed than my
> Bio::Taxonomy-related proposal. Sorry for being so slow on the
> uptake. I
> also needed to start trying to code my Taxonomy proposal to see some
> issues with it.

...

Again, thanks for noticing that.

> ---- summary:
>
> It seems like the main problem with Node right now is that it has
> classification() and things like genus(). I propose pure Node method
> solutions to answer the questions classification() and genus() were
> implemented to answer, but in a better, cruft-free way.
>
> Bio::DB::Taxonomy::genbank anyone?

Ach... You're compromising here; that's not like you.

I think you're making this too complicated by trying too many things at once.
Don't think sudden dramatic changes in the API. Sneak changes in in a way
that doesn't scare users away, then let them get used to the new way of
grabbing Tax data. Make your point that it's more accurate to do it this way
(you'll have defenders in Hilmar and me, BTW).

Do this (start with genbank.pm):

1) Switch out Bio::Species with Node or Taxonomy; relocate other information
   temporarily (Bio::Species, get/sets in Seq object, SimpleValue). Leave
   Bio::Species in for the time being, but don't bother making any additional
   changes to it.
2) Make sure next_seq() and write_seq() work and pass tests. Add additional
   tests for the Tax/Node object (you could even use the tax dump data you
   recently added for more complicated tests).
3) Add in additional stuff bit by bit until it is where you would like it.
4) Make sure parsing is kosher with the latest release notes. Probably should
   make sure write_seq follows what the release notes state to some degree.

And, really, you won't break anything with genbank.pm organelle() parsing. If
you look at the module the organelle isn't even touched in next_seq() or
_read_GenBank_Species(), so it was broken to begin with!

My proposal, though extreme, was to remove genus() etc (which you wanted as
well with Node). You could leave this cruft for the time being in
Bio::Species, which could still act as a sequence tax info holder object. It
just won't be the >default< Seq tax information object, which would be
Bio::Taxonomy or Node. Hence Hilmar's suggestion to use a $seq->taxon()
method to return a Node/Taxonomy, and a $seq->species() would still return a
Bio::Species object. It's redundant, but only for the time being, and the
redundant information wouldn't have a major memory footprint anyway (not like
the feature table or the full sequence might).

Any information that isn't stored in whatever Tax object you use (i.e.
lineage or organelle) could be stored temporarily in another fashion, such as
a get/set in Seq or SimpleValue object, to make next_seq/write_seq work (such
as $seq->organelle() or $seq->classification(), instead of
$seq->species->organelle and so on).

Hilmar then suggests, around 1.6-ish release, note the changes made to SeqIO
towards Bio::Taxonomy-based objects, and indicate that Bio::Species via
species() and its associated methods will be deprecated around 1.7 (gives
everybody notice on API issues). Then add warnings to Bio::Species in 1.7
noting the deprecation, then remove from core completely in 1.8 - 2.0.

One last thing, which is minor really: I remember seeing something about
having Nodes with 'no rank' ignored unless a flag is used. That may be bad
news for some organisms in sequence files where the TaxID is for a 'no rank'
rank, such as environmental samples. May want to think about that here.

I'm hoping the releases will start popping out a bit more periodically than
they have been. There have been volunteers to release periodic updates for
bug fixes etc. If I get a chance I'll try keeping up. Don't count on it
though. The conference is 7am-9pm most days, for five days straight!
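
The phase-out recapped above (a new taxon() accessor next to a species() that keeps its return value but starts warning) could be sketched roughly as follows. This is plain Perl with no BioPerl dependency; the package name Sketch::Seq, the $WARN_ON_DEPRECATED switch and the placeholder strings are hypothetical and only stand in for the real sequence, Bio::Species and node objects.

```perl
#!/usr/bin/perl
use strict;
use warnings;

package Sketch::Seq;   # hypothetical stand-in for a sequence object

# 0 mimics the "documented only" deprecation stage,
# 1 mimics the later stage that also warns on stderr.
our $WARN_ON_DEPRECATED = 1;

sub new {
    my ($class, %args) = @_;
    return bless { species => $args{species}, taxon => $args{taxon} }, $class;
}

# New accessor: callers migrate to this at their own pace.
sub taxon {
    my $self = shift;
    $self->{taxon} = shift if @_;
    return $self->{taxon};
}

# Old accessor: same return value as before, plus an optional warning.
sub species {
    my $self = shift;
    warn "species() is deprecated; use taxon() instead\n"
        if $WARN_ON_DEPRECATED;
    $self->{species} = shift if @_;
    return $self->{species};
}

package main;

my $seq = Sketch::Seq->new(
    species => 'legacy species object',   # placeholder values, not real objects
    taxon   => 'taxonomy node object',
);

# Collect warnings instead of printing them, so they are easy to inspect.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

my $old = $seq->species;   # old API still works, but warns
my $new = $seq->taxon;     # new API, silent

print "old: $old\nnew: $new\nwarnings: ", scalar(@warnings), "\n";
```

The point of the sketch is only that the old method never changes what it returns; existing callers keep working and merely get nagged until they switch.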
Chris > > Then if you started with a Species/Node generated by a genbank parse, > and wanted certain questions answered correctly, you only have to > set a > different db_handle(). The Node only stores the static and hopefully > correct information about itself, whilst all other questions go via > db_handle, so you can dynamically swap back and forth between > databases > depending on if you need speed or accuracy. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From vrramnar at student.cs.uwaterloo.ca Tue Jul 25 22:44:17 2006 From: vrramnar at student.cs.uwaterloo.ca (vrramnar at student.cs.uwaterloo.ca) Date: Tue, 25 Jul 2006 22:44:17 -0400 Subject: [Bioperl-l] SNP reference file download In-Reply-To: <775C8983-D81E-474C-B4F5-C765ABEF4A2C@uiuc.edu> References: <000001c6b01f$bfd54e20$15327e82@pyrimidine> <1153868024.44c6a0f83fce6@www.nexusmail.uwaterloo.ca> <775C8983-D81E-474C-B4F5-C765ABEF4A2C@uiuc.edu> Message-ID: <1153881857.44c6d7016775f@www.nexusmail.uwaterloo.ca> Hey Chris, I believe I updated all those modules already as I downloaded the entire DB.tar from Bioperl live. 
Here is my code:

#!/usr/bin/perl -w

use Bio::Perl;
use Bio::DB::EUtilities;

my @ids = qw(rs4986950);
# With the "rs" before the number the warning says: "no returned links"
# Without the "rs" before the number the warning says: "No databases returned; empty linkset"

my $elink = Bio::DB::EUtilities->new( -eutil  => 'elink',
                                      -id     => \@ids,
                                      -db     => 'omim',
                                      -dbfrom => 'snp');
$elink->get_response;
print "IDs: ", join q(,), $elink->get_ids;

Which gives the following error:

-------------------- WARNING ---------------------
MSG: No databases returned; empty linkset
---------------------------------------------------

------------- EXCEPTION -------------
MSG: Must use database to access IDs
STACK Bio::DB::EUtilities::ElinkData::get_LinkIds_by_db /Library/Perl/5.8.6/Bio/DB/EUtilities/ElinkData.pm:201
STACK Bio::DB::EUtilities::get_ids /Library/Perl/5.8.6/Bio/DB/EUtilities.pm:482
STACK toplevel getOmimNum:13
--------------------------------------

All I really want is the OMIM id number under the section: NCBI Resource
Links from the page:
http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?rs=1800562

Any idea why this still isn't working??

Rohan

Quoting Chris Fields :

> Odd, I thought XML::Simple was part of the 5.8 core. Guess I was
> wrong. I plan on changing this to a more robust parser soon (likely
> XML::SAX or XML::Twig, which will also require a download).
>
> That warning occurs when if you don't have a link to OMIM present (No
> databases returned; empty linkset). The way Elink works is it stores
> internal data in a separate object (ELinkData) contained in an
> internal cache. The method get_ids() works for all EUtilities to
> retrieve IDs, even from ELink objects. The unique problem with ELink
> is, since you can search multiple databases. you can retrieve
> multiple sets of IDs.
>
> If you haven't done it, update your EUtilities; the problem is
> similar to one I fixed today (I stated something about updating in my
> last post).
Also, update the main Bio::DB::EUtilities and > Bio::GenericWebDBI as well (the last is the base class from which > EUtilities is based). The 'Count:1' was a debugging statement I > forgot to remove a while ago which I changed in CVS yesterday. It's > possible that commit had other changes which I forgot about. > > Sorry about that, but it is still experimental (emphasis on the > 'mental'). > > Chris > > On Jul 25, 2006, at 5:53 PM, vrramnar at student.cs.uwaterloo.ca wrote: > > > > > Hey Chris, > > > > Ignore the last email, I fixed that problem and downloaded/ > > installed the > > required XML modules. > > > > However, I am now getting this error message: > > > > -------------------- WARNING --------------------- > > MSG: No databases returned; empty linkset > > --------------------------------------------------- > > Count: 1 > > > > ------------- EXCEPTION ------------- > > MSG: Must use database to access IDs > > STACK Bio::DB::EUtilities::ElinkData::get_LinkIds_by_db /Library/ > > Perl/5.8.6/Bio/ > > DB/EUtilities/ElinkData.pm:201 > > STACK Bio::DB::EUtilities::get_ids /Library/Perl/5.8.6/Bio/DB/ > > EUtilities.pm:483 > > STACK toplevel getOmimNum:15 > > > > -------------------------------------- > > > > What does this mean?? > > > > Rohan > > > > Quoting Chris Fields : > > > >> Okay, had to fix an odd bug from ELink due to the way NCBI returns > >> data. > >> > >> You'll need to update the EUtilities modules in bioperl from CVS > >> to make > >> sure this works. 
> >>
> >> This is how it's done:
>

----------------------------------------
This mail sent through www.mywaterloo.ca

From cjfields at uiuc.edu Wed Jul 26 01:01:41 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Wed, 26 Jul 2006 00:01:41 -0500
Subject: [Bioperl-l] SNP reference file download
In-Reply-To: <1153881857.44c6d7016775f@www.nexusmail.uwaterloo.ca>
References: <000001c6b01f$bfd54e20$15327e82@pyrimidine> <1153868024.44c6a0f83fce6@www.nexusmail.uwaterloo.ca> <775C8983-D81E-474C-B4F5-C765ABEF4A2C@uiuc.edu> <1153881857.44c6d7016775f@www.nexusmail.uwaterloo.ca>
Message-ID:

The below ID doesn't have any OMIM linked data, hence the warning. The
problem is that NCBI, when it doesn't find a link, doesn't send something
constructive to tell you that. It sends the original ID encoded in XML, but
no actual DBs or ID data links. That's what the warning means. I'll make the
original warning a bit more direct: No databases returned; no IDs found.

The thrown error is from a logic problem; I have fixed it and committed to
CVS.

Here's the web page: no OMIM data there either...
http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?rs=4986950

Try changing your ID list to this:

    my @ids = qw(4986950 1800562);

You should get back only one ID (only one has an OMIM number). By the way,
the SNP data ID is only the digits (don't include the 'rs' designation).

Chris

Christopher Fields Postdoctoral Researcher Lab of Dr.
Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From bix at sendu.me.uk Wed Jul 26 05:19:29 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Wed, 26 Jul 2006 10:19:29 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <1748BD60-EFBE-4872-94A4-7CA3D58BE458@uiuc.edu> References: <003301c6b000$203cc560$15327e82@pyrimidine> <44C65990.4080500@sendu.me.uk> <1748BD60-EFBE-4872-94A4-7CA3D58BE458@uiuc.edu> Message-ID: <44C733A1.9070201@sendu.me.uk> Chris Fields wrote: > >> It seems like the main problem with Node right now is that it has >> classification() and things like genus(). I propose pure Node method >> solutions to answer the questions classification() and genus() were >> implemented to answer, but in a better, cruft-free way. >> >> Bio::DB::Taxonomy::genbank anyone? > > Ach... You're compromising here; No, I don't think so. Let me explain... (another very long email, but with the same conclusion as above) > 1) Switch out Bio::Species with Node or Taxonomy; relocate other > information temporarily (Bio::Species, get/sets in Seq object, > SimpleValue). Leave Bio::Species in for the time being, but don't > bother making any additional changes to it. [...] > Hence Hilmar's suggestion to use a $seq->taxon() method to return a > Node/Taxonomy, and a $seq->species() would still return a > Bio::Species object. It's redundant, As I see it, the problem to be solved is this: a) A node should just be a node, holding only information about itself (but this can include information on who its parent is, and methods relating to getting its parents/children as new objects - but the data of its parents/children must never be stored on itself). b) Bio::Species isn't very good at its job; you can't ask reasonable taxonomic questions of it and get correct answers. c) We need to transition Bio::Species to something better - something that lets us do the same job as Bio::Species, but do it better. 
An important aspect of 'better' is that we can switch from the taxonomic information in a genbank file or similar to the information in a taxonomic database if we want certain taxonomic questions answered correctly. But also, we should be able to answer all questions with a good chance of a correct answer even without database access/installation. There are a variety of possible solutions. How can we decide which is best? What would a good solution be? The 'something better' we transition Bio::Species to will become the preferred (or at least de facto standard) way of dealing with taxonomic information in bioperl. This taxonomic module (or set of modules) must be able to model taxonomic information anywhere it is found - databases or genbank files or anything else. If it can't, it would be fundamentally flawed. d) We can immediately discount any solution that involves storing some taxonomic information outside of the tax module. If we find ourselves putting lineage data in a genbank file in SimpleValue objects or similar, we can be pretty sure we've used a poor solution to the problem. That would be a compromise. e) If the thing we transition Bio::Species to can't do everything Bio::Species did (doing it in a different and better way is fine of course), it's not suitable for transitioning to (this is why Node needed all the cruft added to it before it was a suitable candidate). If it /can/ do everything Bio::Species did, there would be no harm immediately making Bio::Species inherit from the new tax module, reimplementing Bio::Species as necessary but making no API change. So any solution that would /require/ $seq->taxon() and $seq->species() wouldn't be a good one, and would be a compromise. 
But we do want to get rid of Bio::Species eventually, so I'm not saying we shouldn't have a $seq->taxon() or similar, only that either method would give you the same type of object with the same methods available ($seq->taxon->isa('tax module') && ($seq->species->isa('Bio::Species') && $seq->species->isa('tax module'))). I see 2 possible solutions to the problem. What should 'tax module' be? 1) Bio::Taxonomy or other similar class that is a container of multiple nodes. Naively this makes logical sense since one of the jobs Bio::Species has is to store a lineage, and a lineage is best represented as a set of Nodes. So let's have a single object with all our Nodes in it. Problems: Bio::Taxonomy itself, as currently written, is fundamentally flawed. It requires that you know the ranks and order of ranks of all your input nodes before you input them. It requires that all ranks have unique names. It doesn't handle ranks of 'no rank'. You can't have more than one lineage in an instance because you can't have two nodes with the same rank. If you don't know the ranks of your nodes (i.e. genbank) there is no way to maintain the order of your lineage because there is no modelling of parent/child. I had planned to re-write it such that the rank-centric implementation was removed and we had a parent/child implementation instead. But then there is nothing to stop you adding nodes that are disconnected from the others, creating a broken mess. Bio::Taxonomy::Tree might have been a little more suitable because it implements Bio::Tree::TreeI, but sadly it is also rank-centric and actually requires input of both Bio::Species and Bio::Taxonomy objects to its most useful methods. More important than issues with current implementations of node-container classes, such classes are unable to let us solve problem c) in a good way, and also leave us potentially storing in memory Node objects representing the same taxonomic node multiple times in different instances of the node-container.
For problem c) if we were to switch from genbank nodes to database the solution is to delete all the nodes in the container and then get them all again from the database. What if you didn't even have a lineage-related question? You've just retrieved 10s of nodes from the database for no reason (and then store them), when all you wanted was accurate information on the node you were interested in. All in all, it's pretty horrible. Unsuitable implementations plus excess database retrieval plus massive waste of memory with duplicated nodes does not equal a good solution. 2) Bio::Taxonomy::Node. First, solve problem a) by getting rid of methods binomial(), species(), genus(), sub_species(), variant(), organelle(), classification() and show_all(). Except for organelle() which doesn't belong in taxonomy, all of these Bio::Species 'questions' can still be answered by Node - just not in a single method call. I outlined how to answer them in the previous post. For backward compatibility make Bio::Species a Node and implement the suggested way of answering the questions the proper 'Node' way under those methods. Problems: Well, those questions can't actually be answered by Node if the starting point was genbank data or manually created Nodes. The solution is clean and simple: Bio::DB::Taxonomy::genbank or perhaps better named Bio::DB::Taxonomy::list (because it makes a taxonomy database from an ordered list of names - I don't see anything inherently wrong or ugly with that). Then everything magically just works. We get all the power to ask all our questions that Node has already when working with the ncbi database, but we get it when working with genbank data. We suffer none of the problems of a node-container class. We can easily switch databases on the fly. What's not to like? 
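
A minimal sketch of the "database from an ordered list of names" idea described above, in plain Perl with no BioPerl dependency. The package names ListTaxDB and TaxNode and all of their methods are hypothetical, chosen only for illustration; they are not BioPerl classes and this is not the proposed Bio::DB::Taxonomy::list implementation. The point it tries to show is that a node holds only its own name plus a handle to the database, and that parents are fetched from the database as new objects rather than stored on the node.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for a list-backed taxonomy database: it is built
# from an ordered list of names (root first), as found in a lineage line.
package ListTaxDB;

sub new {
    my ($class, @names) = @_;
    my (%known, %parent_of);
    for my $i (0 .. $#names) {
        $known{ $names[$i] } = 1;
        # The parent of each name is simply the name before it in the list.
        $parent_of{ $names[$i] } = $names[ $i - 1 ] if $i > 0;
    }
    return bless { known => \%known, parent_of => \%parent_of }, $class;
}

# Return a new node object for a name, or undef if the name is unknown.
sub get_node {
    my ($self, $name) = @_;
    return unless exists $self->{known}{$name};
    return TaxNode->new(name => $name, db => $self);
}

# Return the parent of a node as a new node object, or undef at the root.
sub get_parent {
    my ($self, $node) = @_;
    my $parent_name = $self->{parent_of}{ $node->name };
    return unless defined $parent_name;
    return $self->get_node($parent_name);
}

# Hypothetical node: stores only its own name and a database handle.
package TaxNode;

sub new {
    my ($class, %args) = @_;
    return bless { name => $args{name}, db => $args{db} }, $class;
}

sub name { return $_[0]{name} }

# Ask the database for the parent instead of storing it on the node.
sub parent {
    my $self = shift;
    return $self->{db}->get_parent($self);
}

# Walk up through the parents to rebuild the lineage, root first.
sub lineage_names {
    my $self = shift;
    my @names;
    for (my $n = $self; defined $n; $n = $n->parent) {
        unshift @names, $n->name;
    }
    return @names;
}

package main;

# Illustrative lineage, as it might be read from a sequence file.
my $db = ListTaxDB->new(
    qw(Eukaryota Metazoa Chordata Mammalia Primates Hominidae Homo),
    'Homo sapiens',
);

my $node = $db->get_node('Homo sapiens');
print join('; ', $node->lineage_names), "\n";
```

Because the node never caches its ancestors, the original lineage line can be rebuilt on demand, and nothing about other nodes is duplicated on the node itself.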
From bix at sendu.me.uk Wed Jul 26 06:00:01 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Wed, 26 Jul 2006 11:00:01 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net> References: <000001c6b017$873176a0$15327e82@pyrimidine> <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net> Message-ID: <44C73D21.3010301@sendu.me.uk> Hilmar Lapp wrote: > Instead, create something like > > # return a Bio::Taxonomy::Node: > my $taxon = $seq->taxon(); Yes, but $seq->species() would also > # alternative approach: return a lineage (taxonomy) > # this would be Bio::TaxonomyI compliant > my $lineage = $seq->lineage(); I've since come to the conclusion that anything Taxonomy-ish would be inappropriate - see recent post. > The former would require the lineage (and organelle for completeness) > information to be either easily (though not necessarily directly) > accessible through the node, or added as annotation. That specifically is the main problem with Node as it is now. You shouldn't store information about the lineage (essentially information about other nodes) on the node object itself. Storing it as annotation on the Node or elsewhere is terrible: you lose all the power of Node and can no longer ask any lineage-related questions. There is no need for this split in functionality - when you don't have database access and just some genbank files, you can't answer any taxonomic questions involving lineage, vs. when you do have database access suddenly you can start doing useful things. My proposed solution is that bioperl's taxonomy model always lets you answer the same questions regardless of your source for taxonomic information - see recent post. 
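
A small sketch of what "the same questions regardless of your source" could look like, again in plain Perl. Everything here is an assumption made for illustration: the packages FileLineageDB, CuratedDB and SketchNode, the parent_name() method, and the sample data are all hypothetical and are not part of BioPerl. Only the idea of a switchable db_handle on a node comes from the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical backend 1: lineage exactly as listed in a sequence file.
package FileLineageDB;

sub new {
    my ($class, @names) = @_;
    my %parent;
    for my $i (1 .. $#names) {
        $parent{ $names[$i] } = $names[ $i - 1 ];
    }
    return bless { parent => \%parent }, $class;
}

sub parent_name { return $_[0]{parent}{ $_[1] } }

# Hypothetical backend 2: a curated child => parent table, standing in
# for a real taxonomy database.
package CuratedDB;

sub new {
    my ($class, %parent) = @_;
    return bless { parent => {%parent} }, $class;
}

sub parent_name { return $_[0]{parent}{ $_[1] } }

# Hypothetical node: holds its own name and a replaceable database handle.
package SketchNode;

sub new {
    my ($class, %args) = @_;
    return bless { name => $args{name}, db_handle => $args{db_handle} }, $class;
}

sub name { return $_[0]{name} }

# Combined getter/setter, so the backend can be switched on the fly.
sub db_handle {
    my $self = shift;
    $self->{db_handle} = shift if @_;
    return $self->{db_handle};
}

# The same lineage question works with whichever backend is attached.
sub lineage {
    my $self  = shift;
    my @names = ($self->name);
    while (defined(my $p = $self->db_handle->parent_name($names[0]))) {
        unshift @names, $p;
    }
    return @names;
}

package main;

# Illustrative data only.
my $from_file = FileLineageDB->new('Eukaryota', 'Metazoa', 'Homo', 'Homo sapiens');
my $curated   = CuratedDB->new(
    'Homo sapiens' => 'Homo',
    'Homo'         => 'Hominidae',
    'Hominidae'    => 'Primates',
);

my $node = SketchNode->new(name => 'Homo sapiens', db_handle => $from_file);
my @file_lineage = $node->lineage;

$node->db_handle($curated);    # switch source, keep the same node object
my @curated_lineage = $node->lineage;

print join(' > ', @file_lineage),    "\n";
print join(' > ', @curated_lineage), "\n";
```

The calling code asks the identical question before and after the switch; only the quality of the answer depends on the source.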
From cjfields at uiuc.edu Wed Jul 26 08:16:29 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Wed, 26 Jul 2006 07:16:29 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C733A1.9070201@sendu.me.uk> References: <003301c6b000$203cc560$15327e82@pyrimidine> <44C65990.4080500@sendu.me.uk> <1748BD60-EFBE-4872-94A4-7CA3D58BE458@uiuc.edu> <44C733A1.9070201@sendu.me.uk> Message-ID: <0155C616-8E07-49C8-B4A1-2B77B4DE09F3@uiuc.edu> > ... > > I see 2 possible solutions to the problem. What should 'tax module' > be?: > > 1) Bio::Taxonomy or other similar class that is a container of > multiple > nodes. Naively this makes logical sense since one of the jobs > Bio::Species has is to store a lineage, and a lineage is best > represented as a set of Nodes. So let's have a single object with all > our Nodes in it. Problems: > > Bio::Taxonomy itself, as currently written, is fundamentally > flawed. It > requires that you know the ranks and order of ranks of all your input > nodes before you input them. It requires that all ranks have unique > names. It doesn't handle ranks of 'no rank'. You can't have more than > one lineage in an instance because you can't have two nodes with the > same rank. If you don't know the ranks of your nodes (ie. genbank) > there > is no way to maintain the order of your lineage because there is no > modelling of parent/child. > I had planned to re-write it such that the rank-centric implementation > was removed and we had parent/child implementation instead. But then > there is nothing to stop you adding nodes that are disconnected > from the > others, creating a broken mess. > > Bio::Taxonomy::Tree might have been a little more suitable because it > implements Bio::Tree::TreeI, but sadly it is also rank-centric and > actually requires input of both Bio::Species and Bio::Taxonomy objects > to its most useful methods. 
> > More important than issues with current implementations of > node-container classes, such classes are unable to let us solve > problem > c) in a good way, and also leave us potentially storing in memory Node > objects representing the same taxonomic node multiple times in > different > instances of the node-container. For problem c) if we were to switch > from genbank nodes to database the solution is to delete all the nodes > in the container and then get them all again from the database. > What if > you didn't even have a lineage-related question? You've just retrieved > 10s of nodes from the database for no reason (and then store them), > when > all you wanted was accurate information on the node you were > interested in. > > All in all, it's pretty horrible. Unsuitable implementations plus > excess > database retrieval plus massive waste of memory with duplicated nodes > does not equal a good solution. > > > 2) Bio::Taxonomy::Node. First, solve problem a) by getting rid of > methods binomial(), species(), genus(), sub_species(), > variant(), organelle(), classification() and show_all(). Except for > organelle() which doesn't belong in taxonomy, all of these > Bio::Species > 'questions' can still be answered by Node - just not in a single > method > call. I outlined how to answer them in the previous post. For backward > compatibility make Bio::Species a Node and implement the suggested way > of answering the questions the proper 'Node' way under those methods. > Problems: > > Well, those questions can't actually be answered by Node if the > starting > point was genbank data or manually created Nodes. The solution is > clean > and simple: Bio::DB::Taxonomy::genbank or perhaps better named > Bio::DB::Taxonomy::list (because it makes a taxonomy database from an > ordered list of names - I don't see anything inherently wrong or ugly > with that). Then everything magically just works. 
We get all the power > to ask all our questions that Node has already when working with the > ncbi database, but we get it when working with genbank data. We suffer > none of the problems of a node-container class. We can easily switch > databases on the fly. That 'broken mess' (referring to Bio::Taxonomy) is up to the user. You could make it more stringent (i.e. only allow connected nodes, starting with a single initiating node then building from there), though I don't think that's necessary as most people would probably use some sort of factory to generate a taxonomy (a warning might be appropriate). You would have to watch out for potential circular structures. Have it do what you want. I believe the original intent of Taxonomy was to allow building a full-fledged taxonomic structure, so it should stay that way. Sendu, you have to realize this is up to how you want to implement it. We're giving you the freedom to do what you want to Bio::Taxonomy. Of course, if we think you're off we'll reel you back in, but you seem to be on the right track. Realize that the only contentious issue here is that horrible lineage line in the GenBank file. We should have a way to rebuild it as it was from the original file (i.e. not rebuild it from scratch with DB lookups by default). However, you should also have the option to rebuild it from lookups (i.e. correctly), which you could do with a Taxonomy. Note this Bio::Taxonomy method: classify Title : classify Usage : @obj[][0-1] = taxonomy->classify($species); Function: return a ranked classification Returns : @obj of taxa and ranks as word pairs separated by "@" Args : Bio::Species object As Bio::Species will be deprecated, you can use that method in a dual, sneaky way: 1) directly store the lineage information, 2) return the real one (DB lookups) if needed (e.g. if some flag is set). And, if a Bio::Species argument is used, do what the docs state (catch it early on with an if block and return within it).
Bio::Species, as used within genbank.pm, doesn't use Bio::Taxonomy in any way. I don't know if you even need to retain its original purpose here; you might be able to get away with changing the fundamental way this method works altogether. That's up to you. my 2c Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From bix at sendu.me.uk Wed Jul 26 08:49:05 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Wed, 26 Jul 2006 13:49:05 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <0155C616-8E07-49C8-B4A1-2B77B4DE09F3@uiuc.edu> References: <003301c6b000$203cc560$15327e82@pyrimidine> <44C65990.4080500@sendu.me.uk> <1748BD60-EFBE-4872-94A4-7CA3D58BE458@uiuc.edu> <44C733A1.9070201@sendu.me.uk> <0155C616-8E07-49C8-B4A1-2B77B4DE09F3@uiuc.edu> Message-ID: <44C764C1.9010804@sendu.me.uk> Chris Fields wrote: > We're giving you the freedom to do what you want to Bio::Taxonomy. I don't want to do anything with Bio::Taxonomy any more. I've already shown that it isn't suitable for the job. Regardless of how it is implemented, the entire idea of a class that contains Nodes isn't appropriate, for reasons already stated. > Realize that the only contentious issue here is > that horrible lineage line in the GenBank file. We should have a way to > rebuild it as it was from the original file (i.e. not rebuild it from > scratch with DB lookups by default). However, you should also have the > option to rebuild it from lookups (i.e. correctly), which you could do > with a Taxonomy. And I've already shown how rebuilding with a Taxonomy is very far from ideal, while switching db_handle on a Node would be perfect. Why are you now advocating Taxonomy when there is no reason to? 
> Note this Bio::Taxonomy method: > > classify > > Title : classify > Usage : @obj[][0-1] = taxonomy->classify($species); > Function: return a ranked classification > Returns : @obj of taxa and ranks as word pairs separated by "@" > Args : Bio::Species object Note that all this method does is let you combine a list of rank names with the classification array in a Bio::Species, spitting out some weird data structure. It is only of interest to Bio::Taxonomy::Tree. We're in the situation where we don't know the rank names corresponding to the classification array in a Bio::Species generated by genbank et al. So classify() is of zero value. > As Bio::Species will be deprecated, you can use that method in a dual, > sneaky way: 1) directly store the lineage information, No. Lineage information must be in the form of Nodes or you can't answer lineage-related taxonomic questions. > 2) return the real one (DB lookups) if needed Messy. Doing it with Node would be far superior. Again, Node works all the time, while Taxonomy would work badly or not at all some of the time. Rather than suggest ways of using Taxonomy, tell me what is wrong with my current Node plan. From cjfields at uiuc.edu Wed Jul 26 11:15:28 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Wed, 26 Jul 2006 10:15:28 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C764C1.9010804@sendu.me.uk> Message-ID: <002801c6b0c6$59279fa0$15327e82@pyrimidine> I advocate anything but Bio::Species that allows you the option to use lookups for correct taxonomic information and not guesswork (current Bio::Species). So, you could pretty much replace Species immediately with a DB-aware container object with simple get/sets. As of now, that would be that Node or Taxonomy. I have done this already, just haven't committed it yet. And, when I mentioned having freedom to do what you want with Bio::Taxonomy, that includes all of it (including Node, Tree, etc). 
We just want it to be reasonable and not 'duct tape' for the various Bio::Species mistakes of the past. I don't think the problem here is really that complicated (still, the only thing is the lineage stuff in a sequence file, right?). > > As Bio::Species will be deprecated, you can use that method in a dual, > > sneaky way: 1) directly store the lineage information, > > No. Lineage information must be in the form of Nodes or you can't answer > lineage-related taxonomic questions. You must have a way to store the 'horrible lineage information' data, as is, for those users who do not care about taxonomy and just want to convert seq streams. You shouldn't burden the everyday user with something that is pretty specialized, this being finding correct taxonomic information based on DB lookups for a particular reason (screening sequences, as Hilmar pointed out, was one possibility). I don't care how, but store lineage information as it appears in the file (scalar string) or in a simple data structure (array, maybe?) capable of retaining the information in some way. There are many many ways of doing this which I have previously pointed out; take your pick. Hilmar, in a previous post, told me to take a step back and contemplate a world w/o Bio::Species, where you would design a system capable of dealing with sequence file taxonomic data in a way that allows you to get correct tax information when needed via NCBI Taxonomy data, yet not sacrifice speed if you're just interested in converting sequences via SeqIO. Would you design a Bio::Species class, then? Would you attempt to spend time parsing out species and genus information, when the correct data is sitting on the NCBI server or in a local flatfile? No. You would retain the minimal data necessary in an object for reading and writing data, but have the >option< available to run a lookup. Therefore, Bio::Taxonomy::Node was born. A little prematurely, yes. Probably needed to bake a bit more... 
Anyway, we must eventually sever our reliance on Bio::Species in order to deprecate it, so the lineage information must be contained, as it appears in the file, somewhere else. And my point with the classify() Bio::Taxonomy method is not to use it as is; you could sneak in your own data if needed. It was an example of a possible way of containing the lineage data, but not meant to be an absolute way. It's up to you how you want to implement it. I think the classes that are currently in place are more than capable of handling the job. Hence my statement before that you are trying to get too many things going right out of the starting gate. Start simply by replacing Bio::Species, then worry about other issues. If you think that a specialized class would work, fine, but IMHO I don't think it's absolutely necessary. I had proposed such a class before (more like a Bio::Species-like Tax object) but was shut down, and rightly so; it's unnecessarily complicated and 'contaminates' Bio::Taxonomy with extra unnecessary methods (classification(), genus(), and so on). My last proposal was to eventually strip out the unreliable taxonomic parsing in the various SeqIO modules and replace it with something simple, which seemed to be a consensus among us all. This has to do with Hilmar's post-apocalyptic vision of a Bio::Species-free world. That will eventually happen, and Bioperl will eventually switch over completely to Bio::Taxonomy::Whatever. And Bio::Species can join BPLite and other deprecated modules in the BioPerl Boot Hill. But for now, that can't happen. We all strive for the best information possible. However, you can't sacrifice the needs of other users, a majority of whom probably don't care squat about taxonomy, for your (our) own needs. As I have repeatedly stated, simple is good. We can't just usurp the API for our own wishes w/o warning, so the change has to be gradual and Bio::Species must stick around for the time being.
And we must make it optional to have DB lookups or the villagers will be storming the castle. Listen, Sendu. If you can wait a couple of weeks for further discussion then we can slog on with this. But right now I just don't have any more time for this, sorry. You can have the last word and I'll respond when I get back. Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Sendu Bala > Sent: Wednesday, July 26, 2006 7:49 AM > To: bioperl-l at lists.open-bio.org > Subject: Re: [Bioperl-l] Bio::*Taxonomy* changes > > Chris Fields wrote: > > We're giving you the freedom to do what you want to Bio::Taxonomy. > > I don't want to do anything with Bio::Taxonomy any more. I've already > shown that it isn't suitable for the job. Regardless of how it is > implemented, the entire idea of a class that contains Nodes isn't > appropriate, for reasons already stated. > > > > Realize that the only contentious issue here is > > that horrible lineage line in the GenBank file. We should have a way to > > rebuild it as it was from the original file (i.e. not rebuild it from > > scratch with DB lookups by default). However, you should also have the > > option to rebuild it from lookups (i.e. correctly), which you could do > > with a Taxonomy. > > And I've already shown how rebuilding with a Taxonomy is very far from > ideal, while switching db_handle on a Node would be perfect. Why are you > now advocating Taxonomy when there is no reason to? > > > > Note this Bio::Taxonomy method: > > > > classify > > > > Title : classify > > Usage : @obj[][0-1] = taxonomy->classify($species); > > Function: return a ranked classification > > Returns : @obj of taxa and ranks as word pairs separated by "@" > > Args : Bio::Species object > > Note that all this method does is let you combine a list of rank names > with the classification array in a Bio::Species, spitting out some weird > data structure. 
It is only of interest to Bio::Taxonomy::Tree. > We're in the situation where we don't know the rank names corresponding > to the classification array in a Bio::Species generated by genbank et > al. So classify() is of zero value. > > > > As Bio::Species will be deprecated, you can use that method in a dual, > > sneaky way: 1) directly store the lineage information, > > No. Lineage information must be in the form of Nodes or you can't answer > lineage-related taxonomic questions. > > > > 2) return the real one (DB lookups) if needed > > Messy. Doing it with Node would be far superior. > > > Again, Node works all the time, while Taxonomy would work badly or not > at all some of the time. Rather than suggest ways of using Taxonomy, > tell me what is wrong with my current Node plan. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From morissardj at gmail.com Wed Jul 26 10:59:54 2006 From: morissardj at gmail.com (Morissard jérome) Date: Wed, 26 Jul 2006 14:59:54 +0000 (UTC) Subject: [Bioperl-l] Accessing TRANSFAC matrices References: <20060717145000.0aig7ymcmurk4wsk@webmail.embl.de> <44BEA9FB.1070009@utk.edu> Message-ID: Hi, this may help you:
http://morissardjerome.free.fr/Data/files/matrices.zip From hlapp at gmx.net Wed Jul 26 11:36:32 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Wed, 26 Jul 2006 11:36:32 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C73D21.3010301@sendu.me.uk> References: <000001c6b017$873176a0$15327e82@pyrimidine> <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net> <44C73D21.3010301@sendu.me.uk> Message-ID: On Jul 26, 2006, at 6:00 AM, Sendu Bala wrote: > Hilmar Lapp wrote: >> Instead, create something like >> >> # return a Bio::Taxonomy::Node: >> my $taxon = $seq->taxon(); > > Yes, but $seq->species() would also $seq->species() would return a Bio::Species object which may not be more than a thin shell anymore around an implementation that delegates almost everything to a lineage object (Bio::Taxonomy). $seq->taxon() in contrast need not return such a backwards-compatible construct. > >> # alternative approach: return a lineage (taxonomy) >> # this would be Bio::TaxonomyI compliant >> my $lineage = $seq->lineage(); > > I've since come to the conclusion that anything Taxonomy-ish would be > inappropriate - see recent post. Not sure which one you mean, and please don't reference really long emails, you're asking a lot of other people to organize your thoughts for them. At any rate, my point is that if you only name it appropriately you can avoid misconceptions about what is being returned. The fact that it's confusing to return a taxonomy from a method called species() doesn't mean it's equally bad to return a lineage (which is a limited taxonomy) from a method called lineage(). > [...] > > My proposed solution is that bioperl's taxonomy model always lets you > answer the same questions regardless of your source for taxonomic > information - see recent post. See above ... And I'd rather see some code or API examples than extensive elaborations. 
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at gmx.net Wed Jul 26 11:38:50 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Wed, 26 Jul 2006 11:38:50 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C733A1.9070201@sendu.me.uk> References: <003301c6b000$203cc560$15327e82@pyrimidine> <44C65990.4080500@sendu.me.uk> <1748BD60-EFBE-4872-94A4-7CA3D58BE458@uiuc.edu> <44C733A1.9070201@sendu.me.uk> Message-ID: <3E7DE5A0-C094-4894-BAB9-03F62EA98A3D@gmx.net> On Jul 26, 2006, at 5:19 AM, Sendu Bala wrote: > Chris Fields wrote: >> >>> It seems like the main problem with Node right now is that it has >>> classification() and things like genus(). I propose pure Node method >>> solutions to answer the questions classification() and genus() were >>> implemented to answer, but in a better, cruft-free way. >>> >>> Bio::DB::Taxonomy::genbank anyone? >> >> Ach... You're compromising here; > > No, I don't think so. Let me explain... > (another very long email, but with the same conclusion as above) Sorry, can you summarize this in a few sentences? If you do want feedback from me you really need to be more concise. -hilmar > > >> 1) Switch out Bio::Species with Node or Taxonomy; relocate other >> information temporarily (Bio::Species, get/sets in Seq object, >> SimpleValue). Leave Bio::Species in for the time being, but don't >> bother making any additional changes to it. > [...] >> Hence Hilmar's suggestion to use a $seq->taxon() method to return a >> Node/Taxonomy, and a $seq->species() would still return a >> Bio::Species object. 
It's redundant, > > As I see it, the problem to be solved is this: > > a) A node should just be a node, holding only information about itself > (but this can include information on who its parent is, and methods > relating to getting its parents/children as new objects - but the data > of its parents/children must never be stored on itself). > > b) Bio::Species isn't very good at its job; you can't ask reasonable > taxonomic questions of it and get correct answers. > > c) We need to transition Bio::Species to something better - something > that lets us do the same job as Bio::Species, but do it better. An > important aspect of 'better' is that we can switch from the taxonomic > information in a genbank file or similar to the information in a > taxonomic database if we want certain taxonomic questions answered > correctly. But also, we should be able to answer all questions with a > good chance of a correct answer even without database access/ > installation. > > There are a variety of possible solutions. How can we decide which is > best? What would a good solution be? > > The 'something better' we transition Bio::Species to will become the > preferred (or at least de facto standard) way of dealing with > taxonomic > information in bioperl. This taxonomic module (or set of modules) must > be able to model taxonomic information anywhere it is found - > databases > or genbank files or anything else. If it can't, it would be > fundamentally flawed. > > d) We can immediately discount any solution that involves storing some > taxonomic information outside of the tax module. If we find ourselves > putting lineage data in a genbank file in SimpleValue objects or > similar, we can be pretty sure we've used a poor solution to the > problem. That would be a compromise. 
> > e) If the thing we transition Bio::Species to can't do everything > Bio::Species did (doing it in a different and better way is fine of > course), it's not suitable for transitioning to (this is why Node > needed > all the cruft added to it before it was a suitable candidate). If it > /can/ do everything Bio::Species did, there would be no harm > immediately > making Bio::Species inherit from the new tax module, reimplementing > Bio::Species as necessary but making no API change. So any solution > that > would /require/ $seq->taxon() and $seq->species() wouldn't be a good > one, and would be a compromise. But we do want to get rid of > Bio::Species eventually, so I'm not saying we shouldn't have a > $seq->taxon() or similar, only that either method would give you the > same type of object with the same methods available > ($seq->taxon->isa('tax module') && ($seq->species->isa('Bio::Species') > && $seq->species->isa('tax module')). > > > I see 2 possible solutions to the problem. What should 'tax module' > be?: > > 1) Bio::Taxonomy or other similar class that is a container of > multiple > nodes. Naively this makes logical sense since one of the jobs > Bio::Species has is to store a lineage, and a lineage is best > represented as a set of Nodes. So let's have a single object with all > our Nodes in it. Problems: > > Bio::Taxonomy itself, as currently written, is fundamentally > flawed. It > requires that you know the ranks and order of ranks of all your input > nodes before you input them. It requires that all ranks have unique > names. It doesn't handle ranks of 'no rank'. You can't have more than > one lineage in an instance because you can't have two nodes with the > same rank. If you don't know the ranks of your nodes (ie. genbank) > there > is no way to maintain the order of your lineage because there is no > modelling of parent/child. 
> I had planned to re-write it such that the rank-centric implementation > was removed and we had parent/child implementation instead. But then > there is nothing to stop you adding nodes that are disconnected > from the > others, creating a broken mess. > > Bio::Taxonomy::Tree might have been a little more suitable because it > implements Bio::Tree::TreeI, but sadly it is also rank-centric and > actually requires input of both Bio::Species and Bio::Taxonomy objects > to its most useful methods. > > More important than issues with current implementations of > node-container classes, such classes are unable to let us solve > problem > c) in a good way, and also leave us potentially storing in memory Node > objects representing the same taxonomic node multiple times in > different > instances of the node-container. For problem c) if we were to switch > from genbank nodes to database the solution is to delete all the nodes > in the container and then get them all again from the database. > What if > you didn't even have a lineage-related question? You've just retrieved > 10s of nodes from the database for no reason (and then store them), > when > all you wanted was accurate information on the node you were > interested in. > > All in all, it's pretty horrible. Unsuitable implementations plus > excess > database retrieval plus massive waste of memory with duplicated nodes > does not equal a good solution. > > > 2) Bio::Taxonomy::Node. First, solve problem a) by getting rid of > methods binomial(), species(), genus(), sub_species(), > variant(), organelle(), classification() and show_all(). Except for > organelle() which doesn't belong in taxonomy, all of these > Bio::Species > 'questions' can still be answered by Node - just not in a single > method > call. I outlined how to answer them in the previous post. For backward > compatibility make Bio::Species a Node and implement the suggested way > of answering the questions the proper 'Node' way under those methods. 
> Problems: > > Well, those questions can't actually be answered by Node if the > starting > point was genbank data or manually created Nodes. The solution is > clean > and simple: Bio::DB::Taxonomy::genbank or perhaps better named > Bio::DB::Taxonomy::list (because it makes a taxonomy database from an > ordered list of names - I don't see anything inherently wrong or ugly > with that). Then everything magically just works. We get all the power > to ask all our questions that Node has already when working with the > ncbi database, but we get it when working with genbank data. We suffer > none of the problems of a node-container class. We can easily switch > databases on the fly. > > What's not to like? > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From jay at jays.net Wed Jul 26 11:32:53 2006 From: jay at jays.net (Jay Hannah) Date: Wed, 26 Jul 2006 08:32:53 -0700 Subject: [Bioperl-l] Anyone else at OSCON right now? Message-ID: <44C78B25.80503@jays.net> Any other BioPerl'ers here in Portland for OSCON? I'd love to chat about your life w/ BioPerl. I'm here until Saturday morning. j http://oscon.kwiki.org/index.cgi?JayHannah From adamnkraut at gmail.com Wed Jul 26 10:32:42 2006 From: adamnkraut at gmail.com (Adam Kraut) Date: Wed, 26 Jul 2006 10:32:42 -0400 Subject: [Bioperl-l] Structure::IO In-Reply-To: <716af09c0607250347w24e0d8dbj674758049bc0a4e2@mail.gmail.com> References: <716af09c0607250347w24e0d8dbj674758049bc0a4e2@mail.gmail.com> Message-ID: <134ede0b0607260732u79f0dea2if8f4ea98a5e03524@mail.gmail.com> Hi bernd, Can you better explain what it is you want to do with pdb files? 
From your example it looks like you want to do something with each chain, but it is unclear what you want to do here: my @chains = $struc->chain($chain); With that said, I was never able to use Bio::Structure in the way that I wanted. I now use the MMTSB Perl libraries instead: http://mmtsb.scripps.edu/cgi-bin/tooldoc?perlpackages Specifically the Molecule module may be useful here. Regards, Adam On 7/25/06, Bernd Web wrote: > > Hi, > > Does someone have experience with Bio::Structure::IO? > The example III.9.1 from the bptutorial.pl covers most, but what is e.g. > the > chain() method of Bio::Structure::Entry doing? The POD states: > > Title : chain > Usage : @chains = $structure->chain($chain); > Function: Connects a (or a list of) Chain objects to a > Bio::Structure::Entry. > Returns : list of Bio::Structure::Residue objects > Args : One Residue or a reference to an array of Residue objects > > But in e.g > my $stream = Bio::Structure::IO->new(-file => $filename, > -format => 'pdb'); > while ( my $struc = $stream->next_structure() ) { > for my $chain ($struc->get_chains) { > my $chainid = $chain->id; > my @chains = $struc->chain($chain); > } > } > > I get Bio::Structure::Chain=HASH(0x9f1ab50). > > What is the function of the chain method and how to use it? > > Best regards, > bernd > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- Adam N. Kraut National Resource for Biomedical Supercomputing http://www.nrbsc.org/sb/ From bix at sendu.me.uk Wed Jul 26 12:11:25 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Wed, 26 Jul 2006 17:11:25 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <002801c6b0c6$59279fa0$15327e82@pyrimidine> References: <002801c6b0c6$59279fa0$15327e82@pyrimidine> Message-ID: <44C7942D.6050603@sendu.me.uk> Chris Fields wrote: >> No. 
Lineage information must be in the form of Nodes or you can't answer
>> lineage-related taxonomic questions.
>
> You must have a way to store the 'horrible lineage information' data, as is,
> for those users who do not care about taxonomy and just want to convert seq
> streams. You shouldn't burden the everyday user with something that is
> pretty specialized, this being finding correct taxonomic information based
> on DB lookups for a particular reason (screening sequences, as Hilmar
> pointed out, was one possibility).

I am certainly not requiring that anyone find 'correct taxonomic
information'. The whole reason I am backing my current proposal is that
it works equally well with or without access to NCBI's taxonomy database.
Your proposals work poorly without access to such.

> I don't care how, but store lineage information as it appears in the file
> (scalar string) or in a simple data structure (array, maybe?) capable of
> retaining the information in some way. There are many many ways of doing
> this which I have previously pointed out; take your pick.

I've taken my pick.

To set:
my $db = new Bio::DB::Taxonomy(-source => 'list', -lineage => \@lineage);
$node->db_handle($db);

To get:
@lineage = map { $_->scientific_name } $node->get_Lineage_Nodes;

That is as simple as it is going to get in a world where we have 'pure'
Nodes or any other kind of pure taxonomic class. If you want to hide the
taxonomic complexity from end-users who want to make and store their own
lineage of their species without having to know the details of how
bioperl's taxonomy modules are supposed to work, tell them to use
Bio::Species:

To set:
$species->classification(@lineage);

To get:
@lineage = $species->classification;

Of course in this example I propose that behind the scenes Bio::Species
is a Bio::Taxonomy::Node and just implements classification() the pure
Node way, given above.
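
A minimal sketch of what a list-backed taxonomy source could do internally, assuming it only needs parent links built from ordered lineages. The package ListTaxonomy and its methods add_lineage, lineage_of and lca are hypothetical names invented for this illustration; they are not BioPerl API and not the proposed Bio::DB::Taxonomy 'list' source itself.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a list-backed taxonomy source.
# Not BioPerl API: package and method names are invented for illustration.
package ListTaxonomy;

sub new {
    my ($class) = @_;
    # parent_of maps a taxon name to its parent's name (undef for the root)
    return bless { parent_of => {} }, $class;
}

# Add a lineage ordered from most specific to most general,
# e.g. ('Homo sapiens', 'Homo', 'Mammalia', 'Eukaryota').
sub add_lineage {
    my ($self, @names) = @_;
    for my $i (0 .. $#names) {
        my $parent = $i < $#names ? $names[$i + 1] : undef;
        # Ancestors shared between lineages are stored only once.
        $self->{parent_of}{ $names[$i] } = $parent
            unless exists $self->{parent_of}{ $names[$i] };
    }
    return $self;
}

# Return the taxon itself followed by its ancestors up to the root.
sub lineage_of {
    my ($self, $name) = @_;
    my @lineage;
    while (defined $name && exists $self->{parent_of}{$name}) {
        push @lineage, $name;
        $name = $self->{parent_of}{$name};
    }
    return @lineage;
}

# Most recent common ancestor of two taxa; returns nothing if none is shared.
sub lca {
    my ($self, $first, $second) = @_;
    my %seen = map { $_ => 1 } $self->lineage_of($first);
    for my $ancestor ($self->lineage_of($second)) {
        return $ancestor if $seen{$ancestor};
    }
    return;
}

package main;

my $db = ListTaxonomy->new;
$db->add_lineage('Homo sapiens', 'Homo', 'Mammalia', 'Eukaryota');
$db->add_lineage('Mus musculus', 'Mus',  'Mammalia', 'Eukaryota');

print join(', ', $db->lineage_of('Homo sapiens')), "\n";
print $db->lca('Homo sapiens', 'Mus musculus'), "\n";
```

The point of the sketch is only that an ordered list of names carries enough information to answer lineage questions without any external database; a real implementation would return node objects rather than plain strings.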
Let me make my requirement very clear: the solution must allow you to find the most recent common ancestor of two solution-objects without access to the NCBI taxonomy database, using exactly the same method call you would use if you /did/ have access to the NCBI taxonomy database. The method in question shouldn't need any special-case code depending on the presence or absence of NCBI taxonomy database. That's the litmus test. I'll tend to reject any solution that fails. From bix at sendu.me.uk Wed Jul 26 12:25:41 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Wed, 26 Jul 2006 17:25:41 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <3E7DE5A0-C094-4894-BAB9-03F62EA98A3D@gmx.net> References: <003301c6b000$203cc560$15327e82@pyrimidine> <44C65990.4080500@sendu.me.uk> <1748BD60-EFBE-4872-94A4-7CA3D58BE458@uiuc.edu> <44C733A1.9070201@sendu.me.uk> <3E7DE5A0-C094-4894-BAB9-03F62EA98A3D@gmx.net> Message-ID: <44C79785.6050705@sendu.me.uk> Hilmar Lapp wrote: > > On Jul 26, 2006, at 5:19 AM, Sendu Bala wrote: > >>>> It seems like the main problem with Node right now is that it has >>>> classification() and things like genus(). I propose pure Node method >>>> solutions to answer the questions classification() and genus() were >>>> implemented to answer, but in a better, cruft-free way. >>>> >>>> Bio::DB::Taxonomy::genbank anyone? > > Sorry, can you summarize this in a few sentences? If you do want > feedback from me you really need to be more concise. A bad solution-module stores any kind of taxonomic information outside of the solution-module or in an inconsistent form. By 'inconsistent' I mean, sometimes you store the name of a taxonomic rank with $node->node_name, other times you store it in an array or scalar held directly on the solution-module or elsewhere. Bio::Taxonomy specifically is not usable. 
Generally speaking, classes that are containers of multiple nodes are also inappropriate, because they result in excess database retrieval and excess storage of duplicated information amongst instances of such classes. Bio::Taxonomy::Node combined with Bio::DB::Taxonomy::list would probably be ideal. From cjfields at uiuc.edu Wed Jul 26 12:49:40 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Wed, 26 Jul 2006 11:49:40 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: Message-ID: <000001c6b0d3$7d936ec0$15327e82@pyrimidine> Hilmar, apologies ahead of time for not being too concise! It's my last hurrah on this thread. No, really! ... > > Yes, but $seq->species() would also > > $seq->species() would return a Bio::Species object which may not be > more than a thin shell anymore around an implementation that > delegates almost everything to a lineage object (Bio::Taxonomy). > > $seq->taxon() in contrast need not return such a backwards-compatible > construct. In genbank.pm _read_GenBank_Species (initial implementation, to switch out Bio::Species with Taxonomy/Node object): 1) Assign data to both Bio::Species (as currently implemented) and Bio::Taxonomy::Node (new way). 2) Assign organelle to Bio::Species and the Seq object get/set organelle(). 3) Assign lineage information to Bio::Species and as an array to the Seq object get/set lineage(). Replace the get/set above with your method of choice, just no Bio::Species. In genbank.pm write_seq() 1) if DB_lookup flag is defined, use $seq->taxon() to build lineage 2) If not, use $seq->lineage(). The fine details (how do you build the lineage?!?) can be worked out along the way. The wonders of CVS! The Taxonomy class used here could be returned using Hilmar's $seq->taxon() and Bio::Species can be returned via $seq->species(). Makes perfect sense! Separated! Nothing complicated about it. Nice and clean. And Bio::Species can eventually be shown the exit door. Elvis has left the building... 
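
The write-side choice described above (use a taxonomy lookup when one is requested, otherwise fall back to the lineage as read from the file) might be sketched as follows. This is an illustration under stated assumptions, not genbank.pm code: the function lineage_for_output, the hash keys taxon_name and lineage, and the lookup callback are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical sketch of choosing a lineage when writing a record.
# $seq is a plain hash standing in for a sequence object; the keys
# 'taxon_name' and 'lineage' and the $lookup callback are assumptions.
sub lineage_for_output {
    my ($seq, $lookup) = @_;

    # With a lookup available (the "DB_lookup" case), ask the taxonomy source.
    if ($lookup && defined $seq->{taxon_name}) {
        my @from_db = $lookup->($seq->{taxon_name});
        return @from_db if @from_db;
    }

    # Otherwise fall back to the lineage exactly as it was read from the file.
    return @{ $seq->{lineage} || [] };
}

my %seq = (
    taxon_name => 'Homo sapiens',
    lineage    => ['Eukaryota', 'Mammalia', 'Homo'],
);

# A toy lookup table standing in for a real taxonomy database.
my %toy_db = ('Homo sapiens' => ['Eukaryota', 'Metazoa', 'Mammalia', 'Homo']);
my $lookup = sub { @{ $toy_db{ $_[0] } || [] } };

print join('; ', lineage_for_output(\%seq)), "\n";           # lineage from the file
print join('; ', lineage_for_output(\%seq, $lookup)), "\n";  # lineage from the lookup
```

Falling back when the lookup returns nothing keeps plain format conversion working for users who never configure a taxonomy database.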
Organelle-specific sequence TaxIDs, as they refer to the organism and not
the organelle, could be placed elsewhere, preferably somewhere more
accessible such as $seq->organelle(). And lineage, similarly, could be
placed in $seq->lineage(), which would store it as a raw string or as an
array. There are many other ways I had pointed out (SimpleValue, Node,
etc); I don't care, as long as we eventually sever the Bio::Species tumor
from SeqIO.

...
> ...And I'd rather see some code or API examples than
> extensive elaborations.
>
> -hilmar

Hilmar's right; working code does speak louder than words. The energy
spent in writing up full expositions is better spent elsewhere, hence: I
need to get back to work! Wish I could contribute more.

Chris

From bix at sendu.me.uk Wed Jul 26 13:13:43 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Wed, 26 Jul 2006 18:13:43 +0100
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To:
References: <000001c6b017$873176a0$15327e82@pyrimidine> <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net> <44C73D21.3010301@sendu.me.uk>
Message-ID: <44C7A2C7.2070100@sendu.me.uk>

Hilmar Lapp wrote:
> On Jul 26, 2006, at 6:00 AM, Sendu Bala wrote:
>
>> Hilmar Lapp wrote:
>>> Instead, create something like
>>>
>>> # return a Bio::Taxonomy::Node:
>>> my $taxon = $seq->taxon();
>> Yes, but $seq->species() would also
>
> $seq->species() would return a Bio::Species object which may not be
> more than a thin shell anymore around an implementation that
> delegates almost everything to a lineage object (Bio::Taxonomy).

I actually forgot to finish that sentence. I was going to suggest
Bio::Species isa Bio::Taxonomy::Node and would indeed delegate most of
its implementation to Node.

>>> # alternative approach: return a lineage (taxonomy)
>>> # this would be Bio::TaxonomyI compliant
>>> my $lineage = $seq->lineage();
>> I've since come to the conclusion that anything Taxonomy-ish would be
>> inappropriate - see recent post.
>
> The fact that it's confusing to return a taxonomy from a method called species()
> doesn't mean it's equally bad to return a lineage (which is a limited
> taxonomy) from a method called lineage().

You wouldn't need to though. If you want a lineage you could ask your
node for its lineage. There's no point in having a whole other class that
contains a node and all its ancestor nodes, when to get the ancestors of
a node all you have to do is $node->get_Lineage_Nodes().

>> My proposed solution is that bioperl's taxonomy model always lets you
>> answer the same questions regardless of your source for taxonomic
>> information - see recent post.
>
> See above ... And I'd rather see some code or API examples

The fine details of the following may be slightly off, but it's just to
provide an example. I'll use Test.pm syntax.

my @human = ('Homo sapiens', 'Homo', 'Mammalia', 'Eukaryota');
my @mouse = ('Mus musculus', 'Mus', 'Mammalia', 'Eukaryota');


Old way with Node
-----------------

my $h_node = new Bio::Taxonomy::Node(-classification => [@human]);
my $m_node = new Bio::Taxonomy::Node(-classification => [@mouse]);

@human = map { $_->scientific_name } $h_node->get_Lineage_Nodes;
ok scalar(@human), 0; # failure to work as expected

@human = $h_node->classification;
ok join(", ", @human), "Homo sapiens, Homo, Mammalia, Eukaryota";

my $lca = $h_node->get_LCA_Node($m_node);
ok $lca, undef; # failure to do anything useful because our lineage data
                # is in an array, not in nodes

# try again with entrez - must make brand new objects
my $db = new Bio::DB::Taxonomy(-source => 'entrez');
$h_node = $db->get_Taxonomy_Node(-name => 'Homo sapiens');
$m_node = $db->get_Taxonomy_Node(-name => 'Mus musculus');

@human = map { $_->scientific_name } $h_node->get_Lineage_Nodes;
ok join(", ", @human) eq "Homo sapiens, Homo, Homo/Pan/Gorilla group,
Hominidae, ..."; # now it works!

$lca = $h_node->get_LCA_Node($m_node);
ok $lca->scientific_name, 'Mammalia'; # and now this works!


Old way with Bio::Species
-------------------------

# forget about it, Species has nothing like a get_LCA_Node()


Proposed way with Node
----------------------

my $db = new Bio::DB::Taxonomy(-source => 'list', -lineage => [@human]);
my $h_node = $db->get_Taxonomy_Node(-name => 'Homo sapiens');
$db->add_lineage(@mouse); # or make a new db
my $m_node = $db->get_Taxonomy_Node(-name => 'Mus musculus');

@human = map { $_->scientific_name } $h_node->get_Lineage_Nodes;
ok join(", ", @human), "Homo sapiens, Homo, Mammalia, Eukaryota";
# works as expected

my $lca = $h_node->get_LCA_Node($m_node);
ok $lca->scientific_name, 'Mammalia'; # works first time

# try again with entrez - just change the db_handle
$h_node->db_handle(new Bio::DB::Taxonomy(-source => 'entrez'));

@human = map { $_->scientific_name } $h_node->get_Lineage_Nodes;
ok join(", ", @human) eq "Homo sapiens, Homo, Homo/Pan/Gorilla group,
Hominidae, ...";

$lca = $h_node->get_LCA_Node($m_node);
ok $lca->scientific_name, 'Mammalia';


Proposed way with Bio::Species
------------------------------

# (Bio::Species isa Bio::Taxonomy::Node, implements its methods like
# above)

my $h_species = new Bio::Species(-classification => [@human]);
my $m_species = new Bio::Species(-classification => [@mouse]);

@human = map { $_->scientific_name } $h_species->get_Lineage_Nodes;
ok join(", ", @human), "Homo sapiens, Homo, Mammalia, Eukaryota";

@human = $h_species->classification;
ok join(", ", @human), "Homo sapiens, Homo, Mammalia, Eukaryota";

my $lca = $h_species->get_LCA_Node($m_species);
ok $lca->scientific_name, 'Mammalia';

# trying again with entrez behaves as per proposed Node, above

From angshu96 at gmail.com Wed Jul 26 13:15:35 2006
From: angshu96 at gmail.com (Angshu Kar)
Date: Wed, 26 Jul 2006 12:15:35 -0500
Subject: [Bioperl-l] WUBLASTP parsing problem
Message-ID:

Hi,

Does WU-BLASTP have something to do with the length of the sequence
names (or the sequence names)?
What is happening here is I use fasta format proteins to build the blast
(I do a distributed blastp) report. But when I parse the report (using
bioperl), the query column remains empty for some results as :

          328857    6.6e-135
          325331    6.3e-114
          325329    1.0e-113
          325332    1.7e-113
          325330    2.7e-113
          .
          .
          .

while for some its perfect as:

267750    280003    7.5e-301
267750    348279    7.5e-301
267750    345867    2.0e-300
267750    251915    2.0e-300
267750    346539    6.7e-300
.
.
.

Some of my sequences are as:

IMGA|AC159872_38.1 hypothetical protein AC159872.12 35121-35051 H EGN_Mt050401 20060209 TIGR 1671.m00013
mrsciilhnmivederdtyaqrwtefeqpggngsstpqpystelrdpdvhhklqtdlvkh
iwikfgmyrd*

And part of the blastp (the one where I'm facing the issue) result is as:

                                                                     Smallest
                                                                       Sum
                                                              High  Probability
Sequences producing High-scoring Segment Pairs:              Score  P(N)      N

gi|33333045|gb|AAQ11687.1| MADS box protein [Triticum aes...  1318  6.6e-135  1
gi|47681327|gb|AAT37484.1| MADS5 protein [Dendrocalamus l...  1120  6.3e-114  1
gi|47681331|gb|AAT37486.1| MADS7 protein [Dendrocalamus l...  1118  1.0e-113  1
gi|47681325|gb|AAT37483.1| MADS4 protein [Dendrocalamus l...  1116  1.7e-113  1
gi|47681329|gb|AAT37485.1| MADS6 protein [Dendrocalamus l...  1114  2.7e-113  1
gi|47681323|gb|AAT37482.1| MADS3 protein [Dendrocalamus l...  1114  2.7e-113  1
11674.m04224|LOC_Os08g41950|protein K-box region, putative     976  1.1e-98   1
gi|28630961|gb|AAO45877.1| MADS5 [Lolium perenne]              967  1.0e-97   1
gi|44888605|gb|AAS48129.1| AGAMOUS LIKE9-like protein [Ho...   964  2.1e-97   1
11674.m04223|LOC_Os08g41950|protein K-box region, putative     899  1.6e-90   1
gi|34979580|gb|AAQ83834.1| MADS box protein [Asparagus of...   875  5.8e-88   1

Could you please let me know if I'm missing something? Has the gi got to
do anything with this?

Thanking you,
Angshu

From cain.cshl at gmail.com Wed Jul 26 12:19:26 2006
From: cain.cshl at gmail.com (Scott Cain)
Date: Wed, 26 Jul 2006 12:19:26 -0400
Subject: [Bioperl-l] Installing staden io_lib on windows?
Message-ID: <1153930767.2632.5.camel@localhost.localdomain>

Hi all,

I'm wondering if anyone has tried to install Staden's io_lib on Windows,
and if so, how did it go? I am not much of a Windows person, but I've
tried to make it under cygwin only to get this message:

make all-recursive
make[1]: Entering directory `/home/scott/io_lib-1.9.2'
Making all in read
make[2]: Entering directory `/home/scott/io_lib-1.9.2/read'
if gcc -DHAVE_CONFIG_H -I. -I. -I.. -I.. -I../include -I../read -I../alf -I../abi -I../ctf -I../ztr -I../plain -I../scf -I../sff -I../exp_file -I../utils -I/usr/local/include -g -O2 -MT Read.o -MD -MP -MF ".deps/Read.Tpo" -c -o Read.o Read.c; \
then mv -f ".deps/Read.Tpo" ".deps/Read.Po"; else rm -f ".deps/Read.Tpo"; exit 1; fi
In file included from Read.h:43,
                 from Read.c:40:
../utils/os.h:346:2: #error Must define SP_BIG_ENDIAN or SP_LITTLE_ENDIAN in Makefile
make[2]: *** [Read.o] Error 1
make[2]: Leaving directory `/home/scott/io_lib-1.9.2/read'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/home/scott/io_lib-1.9.2'
make: *** [all] Error 2

I'm guessing there is a flag I can pass to the configure script to get
the endian-ness right, but I don't know (and I don't know if this is just
the beginning of a long, fruitless road :-) ). I would like to use Bio::SCF
(from CPAN) in conjunction with the trace glyph in BioGraphics to view
traces in GBrowse.

Thanks for any advice,
Scott

--
------------------------------------------------------------------------
Scott Cain, Ph. D.                                 cain.cshl at gmail.com
GMOD Coordinator (http://www.gmod.org/)                     216-392-3087
Cold Spring Harbor Laboratory
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part URL: From morissardj at gmail.com Wed Jul 26 16:49:58 2006 From: morissardj at gmail.com (leverdeterre) Date: Wed, 26 Jul 2006 13:49:58 -0700 (PDT) Subject: [Bioperl-l] Accessing TRANSFAC matrices In-Reply-To: References: <44BEA9FB.1070009@utk.edu> Message-ID: <5510746.post@talk.nabble.com> i'm happy for helping you i'have done a page whitch can interrest you http://morissardjerome.free.fr/Data/index.html there is more information about the 397 matrix file ( in the 3 first line) and i'm done all the logo file . ++ -- View this message in context: http://www.nabble.com/Re%3A-Accessing-TRANSFAC-matrices-tf1969467.html#a5510746 Sent from the Perl - Bioperl-L forum at Nabble.com. From morissardj at gmail.com Wed Jul 26 17:15:19 2006 From: morissardj at gmail.com (leverdeterre) Date: Wed, 26 Jul 2006 14:15:19 -0700 (PDT) Subject: [Bioperl-l] Blast Output Parsing In-Reply-To: References: Message-ID: <5511136.post@talk.nabble.com> and without Bioperl i think that may help you http://morissardjerome.free.fr/perl/blastparser.html -- View this message in context: http://www.nabble.com/Blast-Output-Parsing-tf1974691.html#a5511136 Sent from the Perl - Bioperl-L forum at Nabble.com. From osborne1 at optonline.net Wed Jul 26 17:00:50 2006 From: osborne1 at optonline.net (Brian Osborne) Date: Wed, 26 Jul 2006 17:00:50 -0400 Subject: [Bioperl-l] SeqUtils In-Reply-To: <716af09c0607250444y3e005fb1t4e20094fd8db993d@mail.gmail.com> Message-ID: Bernd, That's easily done, changed both POD and code. Brian O. On 7/25/06 7:44 AM, "Bernd Web" wrote: > Hi, > > With Bio::SeqUtils it may be nice to support 3 letter codes with > capitals only, too. > Now > > my $string = Bio::SeqUtils->seq3in($seqobj, 'METGLYTER'); > > will give in $string->seq: XXX. > > Possibly the capitals in MetGlyTer are used to find the amino acids codes? 
> If not maybe it's easy to implement case-insensitive, or all-capitals > for AA codes in SeqUtils? > > In addition about the POD: maybe it's better not use use $string since > Bio::SeqUtils->seq3in does not return a string but a Bio::PrimarySeq > object. > > Regards, > Bernd > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From osborne1 at optonline.net Wed Jul 26 17:24:34 2006 From: osborne1 at optonline.net (Brian Osborne) Date: Wed, 26 Jul 2006 17:24:34 -0400 Subject: [Bioperl-l] Structure::IO In-Reply-To: <716af09c0607250347w24e0d8dbj674758049bc0a4e2@mail.gmail.com> Message-ID: Bernd, I'm not following your question. The POD in the latest Bio::Structure::Entry shows: =head2 chain() Title : chain Usage : @chains = $structure->chain($chain); Function: Connects a Chain or a list of Chain objects to a Bio::Structure::Entry. Returns : List of Bio::Structure::Chain objects Args : A Chain or a reference to an array of Chain objects =cut Which is not what you've copied and pasted. What version of Bioperl do you use? Brian O. On 7/25/06 6:47 AM, "Bernd Web" wrote: > Hi, > > Does someone have experience with Bio::Structure::IO? > The example III.9.1 from the bptutorial.pl covers most, but what is e.g. the > chain() method of Bio::Structure::Entry doing? The POD states: > > Title : chain > Usage : @chains = $structure->chain($chain); > Function: Connects a (or a list of) Chain objects to a Bio::Structure::Entry. > Returns : list of Bio::Structure::Residue objects > Args : One Residue or a reference to an array of Residue objects > > But in e.g > my $stream = Bio::Structure::IO->new(-file => $filename, > -format => 'pdb'); > while ( my $struc = $stream->next_structure() ) { > for my $chain ($struc->get_chains) { > my $chainid = $chain->id; > my @chains = $struc->chain($chain); > } > } > > I get Bio::Structure::Chain=HASH(0x9f1ab50). 
> > What is the function of the chain method and how to use it? > > Best regards, > bernd > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From hlapp at gmx.net Thu Jul 27 01:06:52 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 27 Jul 2006 01:06:52 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C7A2C7.2070100@sendu.me.uk> References: <000001c6b017$873176a0$15327e82@pyrimidine> <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net> <44C73D21.3010301@sendu.me.uk> <44C7A2C7.2070100@sendu.me.uk> Message-ID: <86D32350-0412-40AC-8C34-51DCD3064711@gmx.net> I think this looks like a great solution. You could also name Bio::DB::Taxonomy::list as Bio::DB::Taxonomy::inmemory because it really isn't much else than an in-memory database (of limited content if you populate it from flat-file sequence annotation). The only reservation I have is that you'd have methods on Node that don't really operate on the node instance but rather operate on the taxonomy (database) behind the scenes. That's what I would have used Bio::Taxonomy for, not so much as a container than as a class with (conceptually) 'static' methods corresponding to those that are now in Node, like get_Lineage_Nodes(). They would optionally accept a db_handle too, or use a default one set as an attribute. However, leaving/having these methods on Node really isn't such a big deal and I'm sure would even be preferred by many people for the sake of simplicity. So overall I think you should just go ahead. -hilmar On Jul 26, 2006, at 1:13 PM, Sendu Bala wrote: > > The fine details of the following may be slightly off, but it's > just to > provide an example. I'll use Test.pm syntax. > > my @human = qw('Homo sapiens' Homo Mammalia Eukaryota); > my @mouse = qw('Mus musculus' Mus Mammalia Eukaryota); > > > [...] 
> Proposed way with Node > ---------------------- > > my $db = new Bio::DB::Taxonomy(-source => 'list', -lineage => @human); > my $h_node = $db->get_Taxonomy_Node(-name => 'Homo sapiens'); > $db->add_lineage(@mouse); # or make a new db > my $m_node = $db->get_Taxonomy_Node(-name => 'Mus musculus'); > > @human = map { $_->scientific_name } $h_node->get_Lineage_Nodes; > ok join(", ", @human), "Homo sapiens, Homo, Mammalia, Eukaryota"; > # works as expected > > my $lca = $h_node->get_LCA_Node($m_node); > ok $lca->scientific_name, 'Mammalia'; # works first time > > # try again with entrez - just change the db_handle > $h_node->db_handle(new Bio:DB::Taxonomy(-source => 'entrez'); > > @human = map { $_->scientific_name } $h_node->get_Lineage_Nodes; > ok join(", ", @human) eq "Homo sapiens, Homo, Homo/Pan/Gorilla group, > Hominidae, ..."; > > $lca = $h_node->get_LCA_Node($m_node); > ok $lca->scientific_name, 'Mammalia'; > > [...] -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From bix at sendu.me.uk Thu Jul 27 03:07:22 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 27 Jul 2006 08:07:22 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <86D32350-0412-40AC-8C34-51DCD3064711@gmx.net> References: <000001c6b017$873176a0$15327e82@pyrimidine> <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net> <44C73D21.3010301@sendu.me.uk> <44C7A2C7.2070100@sendu.me.uk> <86D32350-0412-40AC-8C34-51DCD3064711@gmx.net> Message-ID: <44C8662A.3080904@sendu.me.uk> Hilmar Lapp wrote: > The only reservation I have is that you'd have methods on Node that > don't really operate on the node instance but rather operate on the > taxonomy (database) behind the scenes. 
That's what I would have used
> Bio::Taxonomy for, not so much as a container than as a class with
> (conceptually) 'static' methods corresponding to those that are now
> in Node, like get_Lineage_Nodes().

Yes, I had the same reservation. But it somehow seemed reasonable for me
to ask a node for its lineage, though I draw the line at having a method
like get_node('rank_name'). That's the only thing Bio::Taxonomy would
have been good for, so it's a trade off between some nice methods and the
problems inherent in a node-container class.

Though, perhaps we almost have the best of both worlds, since the
database is effectively a container without the problems:

$node->db_handle->get_Taxonomy_Node(-rank => 'rank_name',
                                    -lineage_of => $node);

?

> So overall I think you should just go ahead.

Great, will do.

From maximilianh at gmail.com Thu Jul 27 04:56:44 2006
From: maximilianh at gmail.com (Maximilian Haeussler)
Date: Thu, 27 Jul 2006 10:56:44 +0200
Subject: [Bioperl-l] TRANSFAC matrices, open acces
Message-ID: <76f031ae0607270156lde64ff9m590257d98899fe64@mail.gmail.com>

Actually, the fact that the transfac matrices belong to a company is
quite inconvenient for biologists and for bioinformatics analyses in
this field. There are some projects in which volunteers annotate
cis-sequences at regular intervals and put the data into the public
domain; one of them is the oreganno database, http://www.oreganno.org/.
Its first annotation jamboree will be held in Gent at the end of this
year. If you're interested in cis-sequences, want to meet others who
are, and are willing to contribute some annotation effort, don't
hesitate to come to Gent: it's conveniently placed in the middle of
Europe and registration costs almost nothing.
http://www.dmbr.ugent.be/bioit/contents/regcreative/

One day, hopefully, journals will oblige authors to put their sequences
in a common format into GenBank, but as long as regulation is not seen
as an important part of genome annotation, a lot of manual annotation
will have to be done.

cheers
max

> On 26/07/06, leverdeterre wrote:
> >
> > i'm happy for helping you
> > i'have done a page whitch can interrest you
> > http://morissardjerome.free.fr/Data/index.html
> >
> > there is more information about the 397 matrix file ( in the 3 first line)
> > and i'm done all the logo file .
> >
> > ++
> > --
> > View this message in context: http://www.nabble.com/Re%3A-Accessing-TRANSFAC-matrices-tf1969467.html#a5510746
> > Sent from the Perl - Bioperl-L forum at Nabble.com.
> >
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >
>

-- 
Maximilian Haeussler,
CNRS/INRA Gif-sur-Yvette, France
tel: +33 6 12 82 76 16
skype: maximilianhaeussler

From morissardj at gmail.com Thu Jul 27 05:10:19 2006
From: morissardj at gmail.com (leverdeterre)
Date: Thu, 27 Jul 2006 02:10:19 -0700 (PDT)
Subject: [Bioperl-l] Accessing TRANSFAC matrices
In-Reply-To: <5510746.post@talk.nabble.com>
References: <44BEA9FB.1070009@utk.edu> <5510746.post@talk.nabble.com>
Message-ID: <5517747.post@talk.nabble.com>

Sorry, I removed all this data because it is the property of TRANSFAC.

http://www.gene-regulation.com/pub/databases/transfac/doc/misc.html

The TRANSFAC® database is free for users from non-profit organizations
only. Users from commercial enterprises have to license the TRANSFAC®
database and accompanying programs.

-- 
View this message in context: http://www.nabble.com/Re%3A-Accessing-TRANSFAC-matrices-tf1969467.html#a5517747
Sent from the Perl - Bioperl-L forum at Nabble.com.
From maximilianh at gmail.com Thu Jul 27 04:44:47 2006 From: maximilianh at gmail.com (Maximilian Haeussler) Date: Thu, 27 Jul 2006 10:44:47 +0200 Subject: [Bioperl-l] Accessing TRANSFAC matrices In-Reply-To: <5510746.post@talk.nabble.com> References: <44BEA9FB.1070009@utk.edu> <5510746.post@talk.nabble.com> Message-ID: <76f031ae0607270144of6ff9cbtbd9f3045bbc4e6e1@mail.gmail.com> I'm pretty sure that you are not allowed to distribute these matrices: http://www.gene-regulation.com/pub/databases/transfac/doc/misc.html [well...but if you don't care and biobase doesn't complain... actually anyone can scrape the matrices from the website with wget.] max On 26/07/06, leverdeterre wrote: > > i'm happy for helping you > i'have done a page whitch can interrest you > http://morissardjerome.free.fr/Data/index.html > > there is more information about the 397 matrix file ( in the 3 first line) > and i'm done all the logo file . > > ++ > -- > View this message in context: http://www.nabble.com/Re%3A-Accessing-TRANSFAC-matrices-tf1969467.html#a5510746 > Sent from the Perl - Bioperl-L forum at Nabble.com. > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > From bix at sendu.me.uk Thu Jul 27 05:55:01 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 27 Jul 2006 10:55:01 +0100 Subject: [Bioperl-l] TRANSFAC matrices, open acces In-Reply-To: <76f031ae0607270156lde64ff9m590257d98899fe64@mail.gmail.com> References: <76f031ae0607270156lde64ff9m590257d98899fe64@mail.gmail.com> Message-ID: <44C88D75.7040102@sendu.me.uk> Maximilian Haeussler wrote: > Actually, the fact that the transfac matrices are belonging to a > company is quite inconvenient for biologists and bioinformatics > analyses working in this field. The public version is adequate though. It would certainly be useful to have Bioperl access to transfac and other regulation databases. 
I'll probably write some suitable modules if no one beats me to it. From sdavis2 at mail.nih.gov Thu Jul 27 07:43:09 2006 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Thu, 27 Jul 2006 07:43:09 -0400 Subject: [Bioperl-l] TRANSFAC matrices, open acces In-Reply-To: <44C88D75.7040102@sendu.me.uk> Message-ID: On 7/27/06 5:55 AM, "Sendu Bala" wrote: > Maximilian Haeussler wrote: >> Actually, the fact that the transfac matrices are belonging to a >> company is quite inconvenient for biologists and bioinformatics >> analyses working in this field. > > The public version is adequate though. It would certainly be useful to > have Bioperl access to transfac and other regulation databases. I'll > probably write some suitable modules if no one beats me to it. I haven't used it in a while, but the TFBS family of modules are, if I recall correctly, bioperl-compatible, though not part of bioperl. In any case, for those who aren't aware, it might be worth a quick look: http://forkhead.cgb.ki.se/TFBS/ Sean From bix at sendu.me.uk Thu Jul 27 08:01:03 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 27 Jul 2006 13:01:03 +0100 Subject: [Bioperl-l] TRANSFAC matrices, open acces In-Reply-To: References: Message-ID: <44C8AAFF.6060100@sendu.me.uk> Sean Davis wrote: > > On 7/27/06 5:55 AM, "Sendu Bala" wrote: > >> Maximilian Haeussler wrote: >>> Actually, the fact that the transfac matrices are belonging to a >>> company is quite inconvenient for biologists and bioinformatics >>> analyses working in this field. > >> The public version is adequate though. It would certainly be useful to >> have Bioperl access to transfac and other regulation databases. I'll >> probably write some suitable modules if no one beats me to it. > > I haven't used it in a while, but the TFBS family of modules are, if I > recall correctly, bioperl-compatible, though not part of bioperl. In any > case, for those who aren't aware, it might be worth a quick look: Yes. 
It only has online access to Transfac though, and the inheritance and returned objects are TFBS specific so you miss out on whatever goodness there may be in the rest of bioperl. Still, recommended to use if you want programmatic access to Transfac matrices right now. From bernd.web at gmail.com Thu Jul 27 06:14:13 2006 From: bernd.web at gmail.com (Bernd Web) Date: Thu, 27 Jul 2006 12:14:13 +0200 Subject: [Bioperl-l] Structure::IO In-Reply-To: References: <716af09c0607250347w24e0d8dbj674758049bc0a4e2@mail.gmail.com> Message-ID: <716af09c0607270314u4e2b1eb8y6c1b87f5b3abd8e1@mail.gmail.com> Hi Thanks for your notes. The text I pasted comes from http://doc.bioperl.org/releases/bioperl-1.5.1/ but indeed Entry.pm (v1.25 2006/07/04) shows a different POD. I am trying to get annotation out of PDB. ID is not a problem, but I would like to have the HEADER and possibly comment fields to a (FastA) description line, but how? Bio::Structure::Entry v.1.25 does not list the annotation method in the POD anymore (due to a missing empty line before =head). $struc->annotation still exists; I can get the keys but not the values with $struc->annotation($struc->seqres) (Can't locate object method "get_Annotations" via package "Bio::PrimarySeq"). (Example script attached). The POD states: annotation: $obj->annotation($seq_obj). So I thought of a PrimarySeq object to pass to annotation. The PrimarySeq object ($struc->seqres) does not contain a description: $struc->seqres->desc is uninitialized. Is it possible to get annotation from header/comments fields with Bio::Structure? Best regards, Bernd On 7/26/06, Brian Osborne wrote: > Bernd, > > I'm not following your question. The POD in the latest Bio::Structure::Entry > shows: > > =head2 chain() > > Title : chain > Usage : @chains = $structure->chain($chain); > Function: Connects a Chain or a list of Chain objects to a > Bio::Structure::Entry. 
> Returns : List of Bio::Structure::Chain objects > Args : A Chain or a reference to an array of Chain objects > > =cut > > Which is not what you've copied and pasted. What version of Bioperl do you > use? > > Brian O. > > > > On 7/25/06 6:47 AM, "Bernd Web" wrote: > > > Hi, > > > > Does someone have experience with Bio::Structure::IO? > > The example III.9.1 from the bptutorial.pl covers most, but what is e.g. the > > chain() method of Bio::Structure::Entry doing? The POD states: > > > > Title : chain > > Usage : @chains = $structure->chain($chain); > > Function: Connects a (or a list of) Chain objects to a Bio::Structure::Entry. > > Returns : list of Bio::Structure::Residue objects > > Args : One Residue or a reference to an array of Residue objects > > > > But in e.g > > my $stream = Bio::Structure::IO->new(-file => $filename, > > -format => 'pdb'); > > while ( my $struc = $stream->next_structure() ) { > > for my $chain ($struc->get_chains) { > > my $chainid = $chain->id; > > my @chains = $struc->chain($chain); > > } > > } > > > > I get Bio::Structure::Chain=HASH(0x9f1ab50). > > > > What is the function of the chain method and how to use it? 
> >
> > Best regards,
> > bernd
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
>
>
-------------- next part --------------
#!/usr/bin/perl -w

use warnings;
use strict;
use Bio::Structure::IO;

my $filename = $ARGV[0];
my $stream = Bio::Structure::IO->new( -file => $filename,
                                      -format => 'pdb');

while ( my $struc = $stream->next_structure() ) {
    print "SEQRES DESC: ", $struc->seqres->desc, "\n";
    print join(" ", keys %{$struc->annotation($struc->seqres)}), "\n";
    print join(" ", keys %{$struc->annotation()}), "\n";
    print join(" ", values %{$struc->annotation()}), "\n"; #(partly) works
    print join(" ", values %{$struc->annotation($struc->seqres)}), "\n"; #does not work
}

From bix at sendu.me.uk Thu Jul 27 09:31:54 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Thu, 27 Jul 2006 14:31:54 +0100
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <86D32350-0412-40AC-8C34-51DCD3064711@gmx.net>
References: <000001c6b017$873176a0$15327e82@pyrimidine>
	<9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net>
	<44C73D21.3010301@sendu.me.uk>
	<44C7A2C7.2070100@sendu.me.uk>
	<86D32350-0412-40AC-8C34-51DCD3064711@gmx.net>
Message-ID: <44C8C04A.8070504@sendu.me.uk>

Hilmar Lapp wrote:
>
> So overall I think you should just go ahead.

One last suggestion for discussion:

It may be appropriate to rename Bio::Taxonomy::Node to clarify that
Node has no particular reliance on or association with Bio::Taxonomy or
the other modules in Bio/Taxonomy/.

How about calling it Bio::Taxon?

It is more obvious what to expect from something called 'Bio::Taxon'
when you know that it is the new 'Bio::Species': like Bio::Species but
for any taxon. It also makes the class 'top-level', which I think most
people are happier using; it seems like things in sub-directories are
more for advanced users.
From hlapp at gmx.net Thu Jul 27 09:44:25 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 27 Jul 2006 09:44:25 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C8C04A.8070504@sendu.me.uk> References: <000001c6b017$873176a0$15327e82@pyrimidine> <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net> <44C73D21.3010301@sendu.me.uk> <44C7A2C7.2070100@sendu.me.uk> <86D32350-0412-40AC-8C34-51DCD3064711@gmx.net> <44C8C04A.8070504@sendu.me.uk> Message-ID: <077A86AB-174D-426E-B65F-0954A72A6019@gmx.net> I don't think the top-level or sub-directory matters at all and I don't want anybody to get used to the notion that that may imply anything (except possibly better thought-out structure for the sub- directory level). For instance RichSeq is what all rich annotation sequence format parsers return, yet it is in a sub-directory. I don't any real objection to Bio::Taxon though if that's what you'd like to name it - although, what will happen to the Bio::Taxonomy hierarchy then? Phased out? -hilmar On Jul 27, 2006, at 9:31 AM, Sendu Bala wrote: > Hilmar Lapp wrote: >> >> So overall I think you should just go ahead. > > One last suggestion for discussion: > > It may be appropriate is to rename Bio::Taxonomy::Node to clarify that > Node has no particular reliance on or association with > Bio::Taxonomy or > the other modules in Bio/Taxonomy/. > > How about calling it Bio::Taxon? > > It is more obvious what to expect from something called 'Bio::Taxon' > when you know that it is the new 'Bio::Species': like Bio::Species but > for any taxon. It also makes the class 'top-level' which I think most > people are happier using; seems like things in sub-directories are > more > for advanced users. 
> _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From cjfields at uiuc.edu Thu Jul 27 09:48:32 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 27 Jul 2006 08:48:32 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C8662A.3080904@sendu.me.uk> Message-ID: <002a01c6b183$59779880$15327e82@pyrimidine> Sounds good to me; agree with Hilmar's suggestion of 'in_memory' as well, but it's your choice. Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Sendu Bala > Sent: Thursday, July 27, 2006 2:07 AM > To: bioperl-l at bioperl.org > Subject: Re: [Bioperl-l] Bio::*Taxonomy* changes > > Hilmar Lapp wrote: > > The only reservation I have is that you'd have methods on Node that > > don't really operate on the node instance but rather operate on the > > taxonomy (database) behind the scenes. That's what I would have used > > Bio::Taxonomy for, not so much as a container than as a class with > > (conceptually) 'static' methods corresponding to those that are now > > in Node, like get_Lineage_Nodes(). > > Yes, I had the same reservation. But it somehow seemed reasonable for me > to ask a node for its lineage, though I draw the line at having a method > like get_node('rank_name'). That's the only thing Bio::Taxonomy would > have been good for, so it's a trade off between some nice methods and > the problems inherent in a node-container class. > > Though, perhaps we almost have the best of both worlds, since the > database is effectively a container without the problems: > $node->db_handle->get_Taxonomy_Node(-rank 'rank_name', > -lineage_of => $node); ? 
> > > > So overall I think you should just go ahead. > > Great, will do. > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From osborne1 at optonline.net Thu Jul 27 09:44:33 2006 From: osborne1 at optonline.net (Brian Osborne) Date: Thu, 27 Jul 2006 09:44:33 -0400 Subject: [Bioperl-l] Structure::IO In-Reply-To: <716af09c0607270314u4e2b1eb8y6c1b87f5b3abd8e1@mail.gmail.com> Message-ID: Bernd, I'll need to take a look a closer look at the POD but from your description it seems it's wrong, or certainly incomplete. To get the HEADER line you'll do something like: my $stream = Bio::Structure::IO->new(-file => $filename, -format => 'pdb'); my $struc = $stream->next_structure(); my $anncoll = $struc->annotation; my @headers = $anncoll->get_Annotations('header'); This implies that all these top-level annotations are associated with the entry, not with the chains. I don't use Bio::Structure so don't assume this is true for all annotations. There are 2 ways to explore this further. One is to look at t/StructIO.t or other tests, useful examples are frequently found in the tests. The other is to run your script in the debugger: >perl -d pdb.pl 1CAM.pdb By examining the variables your script creates using the "x" command you get to see exactly where strings are stored and what the names of the keys are, this is how I found the HEADER line. Type "h" for the debugger's Help. Brian O. On 7/27/06 6:14 AM, "Bernd Web" wrote: > Hi > > Thanks for your notes. The text I pasted comes from > http://doc.bioperl.org/releases/bioperl-1.5.1/ but indeed Entry.pm > (v1.25 2006/07/04) shows a different POD. > > I am trying to get annotation out of PDB. ID is not a problem, but I > would like to have the HEADER and possibly comment fields to a (FastA) > description line, but how? 
> > Bio::Structure::Entry v.1.25 does not list the annotation method in > the POD anymore (due to a missing empty line before =head). > $struc->annotation still exists; I can get the keys but not the values > with $struc->annotation($struc->seqres) (Can't locate object method > "get_Annotations" via package "Bio::PrimarySeq"). > (Example script attached). > > The POD states: annotation: $obj->annotation($seq_obj). So I thought > of a PrimarySeq object to pass to annotation. > > The PrimarySeq object ($struc->seqres) does not contain a description: > $struc->seqres->desc is uninitialized. > > Is it possible to get annotation from header/comments fields with > Bio::Structure? > > Best regards, > Bernd > > > On 7/26/06, Brian Osborne wrote: >> Bernd, >> >> I'm not following your question. The POD in the latest Bio::Structure::Entry >> shows: >> >> =head2 chain() >> >> Title : chain >> Usage : @chains = $structure->chain($chain); >> Function: Connects a Chain or a list of Chain objects to a >> Bio::Structure::Entry. >> Returns : List of Bio::Structure::Chain objects >> Args : A Chain or a reference to an array of Chain objects >> >> =cut >> >> Which is not what you've copied and pasted. What version of Bioperl do you >> use? >> >> Brian O. >> >> >> >> On 7/25/06 6:47 AM, "Bernd Web" wrote: >> >>> Hi, >>> >>> Does someone have experience with Bio::Structure::IO? >>> The example III.9.1 from the bptutorial.pl covers most, but what is e.g. the >>> chain() method of Bio::Structure::Entry doing? The POD states: >>> >>> Title : chain >>> Usage : @chains = $structure->chain($chain); >>> Function: Connects a (or a list of) Chain objects to a >>> Bio::Structure::Entry. 
>>> Returns : list of Bio::Structure::Residue objects >>> Args : One Residue or a reference to an array of Residue objects >>> >>> But in e.g >>> my $stream = Bio::Structure::IO->new(-file => $filename, >>> -format => 'pdb'); >>> while ( my $struc = $stream->next_structure() ) { >>> for my $chain ($struc->get_chains) { >>> my $chainid = $chain->id; >>> my @chains = $struc->chain($chain); >>> } >>> } >>> >>> I get Bio::Structure::Chain=HASH(0x9f1ab50). >>> >>> What is the function of the chain method and how to use it? >>> >>> Best regards, >>> bernd >>> _______________________________________________ >>> Bioperl-l mailing list >>> Bioperl-l at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> >> >> From aaron.j.mackey at gsk.com Thu Jul 27 08:54:05 2006 From: aaron.j.mackey at gsk.com (aaron.j.mackey at gsk.com) Date: Thu, 27 Jul 2006 08:54:05 -0400 Subject: [Bioperl-l] Installing staden io_lib on windows? In-Reply-To: <1153930767.2632.5.camel@localhost.localdomain> Message-ID: Hi Scott, > In file included from Read.h:43, > from Read.c:40: > ../utils/os.h:346:2: #error Must define SP_BIG_ENDIAN or > SP_LITTLE_ENDIAN in Makefile os.h has a bunch of #ifdef statements that check for platforms, and there isn't one for cygwin (but there is for MinGW) Try running configure with "--CFLAGS=-DSP_LITTLE_ENDIAN" or somesuch Also take a look at the MinGW section of os.h to see if there are others you will likely need (e.g. NOPIPE, NOLOCKF, etc) Alternatively, you may want to just edit the original os.h to duplicate the MinGW section with the appropriate compiler constant for CYGWIN (__CYGWIN__ I'm guessing, but don't really know for sure). 
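
Before choosing which macro to pass in CFLAGS, it may help to confirm the byte order of the host. The following is a minimal sketch in plain Perl, not part of io_lib or BioPerl; the helper names host_is_little_endian and endian_macro are made up for illustration, and the macro names are taken from the #error text quoted above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Config;

# Pack the 16-bit value 1 in native byte order and inspect the first byte:
# it is 1 on a little-endian host and 0 on a big-endian one.
sub host_is_little_endian {
    return unpack('C', pack('S', 1)) == 1;
}

# Hypothetical helper: pick the macro named in the os.h #error message.
sub endian_macro {
    return host_is_little_endian() ? 'SP_LITTLE_ENDIAN' : 'SP_BIG_ENDIAN';
}

# Perl's own build-time record of the byte order, for comparison.
print "Perl byteorder: $Config{byteorder}\n";
print "Suggested define: -D", endian_macro(), "\n";
```

The result only says which of the two macros matches the machine; whether io_lib needs further defines on Cygwin (NOPIPE, NOLOCKF and so on) still has to be checked in os.h as described above.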
Good luck, -Aaron From bix at sendu.me.uk Thu Jul 27 10:06:23 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 27 Jul 2006 15:06:23 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <077A86AB-174D-426E-B65F-0954A72A6019@gmx.net> References: <000001c6b017$873176a0$15327e82@pyrimidine> <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net> <44C73D21.3010301@sendu.me.uk> <44C7A2C7.2070100@sendu.me.uk> <86D32350-0412-40AC-8C34-51DCD3064711@gmx.net> <44C8C04A.8070504@sendu.me.uk> <077A86AB-174D-426E-B65F-0954A72A6019@gmx.net> Message-ID: <44C8C85F.2010104@sendu.me.uk> Hilmar Lapp wrote: > I don't think the top-level or sub-directory matters at all and I don't > want anybody to get used to the notion that that may imply anything > (except possibly better thought-out structure for the sub-directory > level). For instance RichSeq is what all rich annotation sequence format > parsers return, yet it is in a sub-directory. Well, I'm not aware that I've ever used a RichSeq ;). But your point is taken. > I don't any real objection to Bio::Taxon though if that's what you'd > like to name it - although, what will happen to the Bio::Taxonomy > hierarchy then? Phased out? At the moment it seems to me that the Bio::Taxonomy modules (excluding Node) aren't really usable. Jason wrote a comment in t/TaxonTree.t which tests Taxon and Tree: ## I am pretty sure this module is going the way of the dodo bird so ## I am not sure how much work to put into fixing the tests/module FactoryI is strange (it isn't intended to work like any other Bioperl factory) and there are no implementers of it, while Taxonomy.pm itself would be redundant after my Node changes and has implementation issues, though it may make more sense to some people. My vote is phase out. What is the actual process involved in renaming a module in Bioperl? 
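
A minimal sketch of one way a rename could be staged in plain Perl, assuming the old class name is kept for a while as a thin subclass that warns once and defers to the new class. The package names My::Taxon and My::Taxonomy::Node are hypothetical stand-ins, and plain warn is used here rather than any BioPerl-specific mechanism; this is not a description of an established BioPerl procedure.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical new home of the class (stand-in for the renamed module).
package My::Taxon;

sub new {
    my ($class, %args) = @_;
    my $self = { name => $args{-name} };
    return bless $self, $class;
}

sub scientific_name { return $_[0]->{name} }

# Hypothetical old name, kept as a thin subclass so existing callers
# keep working while they migrate.
package My::Taxonomy::Node;

use parent -norequire, 'My::Taxon';

my $warned = 0;

sub new {
    my ($class, @args) = @_;
    # Warn once per process rather than on every construction.
    warn "My::Taxonomy::Node is deprecated; use My::Taxon instead\n"
        unless $warned++;
    return $class->SUPER::new(@args);
}

package main;

# Old-style caller: still works, but triggers the deprecation warning.
my $node = My::Taxonomy::Node->new(-name => 'Homo sapiens');
print $node->scientific_name, "\n";
```

Because the old class inherits from the new one, objects built under the old name still pass isa checks for the new name, so callers can be switched over one at a time.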
From hlapp at gmx.net Thu Jul 27 10:29:09 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 27 Jul 2006 10:29:09 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C8C85F.2010104@sendu.me.uk> References: <000001c6b017$873176a0$15327e82@pyrimidine> <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net> <44C73D21.3010301@sendu.me.uk> <44C7A2C7.2070100@sendu.me.uk> <86D32350-0412-40AC-8C34-51DCD3064711@gmx.net> <44C8C04A.8070504@sendu.me.uk> <077A86AB-174D-426E-B65F-0954A72A6019@gmx.net> <44C8C85F.2010104@sendu.me.uk> Message-ID: How do you mean 'process'? You create a new module, and then you deprecate the ones you're phasing out. If possible you rewrite the implementation to use the new module. Not sure this answers your question? -hilmar On Jul 27, 2006, at 10:06 AM, Sendu Bala wrote: > Hilmar Lapp wrote: >> I don't think the top-level or sub-directory matters at all and I >> don't >> want anybody to get used to the notion that that may imply anything >> (except possibly better thought-out structure for the sub-directory >> level). For instance RichSeq is what all rich annotation sequence >> format >> parsers return, yet it is in a sub-directory. > > Well, I'm not aware that I've ever used a RichSeq ;). But your > point is > taken. > > >> I don't any real objection to Bio::Taxon though if that's what you'd >> like to name it - although, what will happen to the Bio::Taxonomy >> hierarchy then? Phased out? > > At the moment it seems to me that the Bio::Taxonomy modules (excluding > Node) aren't really usable. 
Jason wrote a comment in t/TaxonTree.t > which > tests Taxon and Tree: > > ## I am pretty sure this module is going the way of the dodo bird so > ## I am not sure how much work to put into fixing the tests/module > > FactoryI is strange (it isn't intended to work like any other Bioperl > factory) and there are no implementers of it, while Taxonomy.pm itself > would be redundant after my Node changes and has implementation > issues, > though it may make more sense to some people. > > My vote is phase out. > > > What is the actual process involved in renaming a module in Bioperl? > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From cjfields at uiuc.edu Thu Jul 27 10:29:39 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 27 Jul 2006 09:29:39 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <077A86AB-174D-426E-B65F-0954A72A6019@gmx.net> Message-ID: <003101c6b189$17f5d2e0$15327e82@pyrimidine> I'll respond to both here: > Sendu Bala wrote: > > One last suggestion for discussion: > > It may be appropriate is to rename Bio::Taxonomy::Node to clarify that > Node has no particular reliance on or association with Bio::Taxonomy or > the other modules in Bio/Taxonomy/. > > How about calling it Bio::Taxon? > > It is more obvious what to expect from something called 'Bio::Taxon' > when you know that it is the new 'Bio::Species': like Bio::Species but > for any taxon. It also makes the class 'top-level' which I think most > people are happier using; seems like things in sub-directories are more > for advanced users. Hilmar explains the namespace issue with Bioperl more concisely below. 
You should still be able to use a Node in a Taxonomy, but then again you
should also be able to use a Taxon in a Taxonomy as well (by definition,
a Taxon is part of a Taxonomy as it is a taxonomic unit). The whole
"looking at this from a biologist's perspective" thing again...

http://en.wikipedia.org/wiki/Taxon

BTW, what exactly is Bio::Taxonomy::Taxon used for? Looks like it is
used more for building taxonomic trees than anything, so shouldn't it be
moved to Bio::Tree::Taxon (that name isn't used)? Then you could use
Bio::Taxonomy::Taxon for your purposes.

See, the only concern I have with using the name Bio::Taxon is people
confusing it with Bio::Taxonomy itself or with Bio::Taxonomy::Taxon.
Though I agree that the name makes sense for what you want.

> Hilmar Lapp wrote:
>
> I don't think the top-level or sub-directory matters at all and I
> don't want anybody to get used to the notion that that may imply
> anything (except possibly better thought-out structure for the
> sub-directory level). For instance RichSeq is what all rich annotation
> sequence format parsers return, yet it is in a sub-directory.
>
> I don't any real objection to Bio::Taxon though if that's what you'd
> like to name it - although, what will happen to the Bio::Taxonomy
> hierarchy then? Phased out?
>
> -hilmar

I'm not sure how many people out there use Bio::Taxonomy. I think they
use the tree-building modules in Bio::Tree more than anything. And there
haven't been any panicked users protesting at the gates yet about the
many posts for Bio::Taxonomy changes (well, except me, and 'I got
better').
Chris From cjfields at uiuc.edu Thu Jul 27 10:54:06 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 27 Jul 2006 09:54:06 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C8C85F.2010104@sendu.me.uk> Message-ID: <003201c6b18c$829330e0$15327e82@pyrimidine> > > I don't any real objection to Bio::Taxon though if that's what you'd > > like to name it - although, what will happen to the Bio::Taxonomy > > hierarchy then? Phased out? > > At the moment it seems to me that the Bio::Taxonomy modules (excluding > Node) aren't really usable. Jason wrote a comment in t/TaxonTree.t which > tests Taxon and Tree: > > ## I am pretty sure this module is going the way of the dodo bird so > ## I am not sure how much work to put into fixing the tests/module > > FactoryI is strange (it isn't intended to work like any other Bioperl > factory) and there are no implementers of it, while Taxonomy.pm itself > would be redundant after my Node changes and has implementation issues, > though it may make more sense to some people. > > My vote is phase out. > > > What is the actual process involved in renaming a module in Bioperl? This is how many times the phrase "Bio::Taxonomy" is used in Bioperl in directory Bio (which should catch any namespace usage like Node, etc.): Instances: 2 BP Module : Bio::DB::Taxonomy Instances: 4 BP Module : Bio::DB::Taxonomy::entrez Instances: 7 BP Module : Bio::DB::Taxonomy::flatfile Instances: 1 BP Module : Bio::Expression::Platform Instances: 1 BP Module : Bio::SeqIO::genbank Instances: 22 BP Module : Bio::Taxonomy Instances: 8 BP Module : Bio::Taxonomy::FactoryI Instances: 17 BP Module : Bio::Taxonomy::Node Instances: 20 BP Module : Bio::Taxonomy::Taxon Instances: 39 BP Module : Bio::Taxonomy::Tree Hmm, not much. Almost all hits are within Bio::DB::taxonomy or Bio::Taxonomy. The SeqIO::genbank was my change BTW; just haven't tossed the code yet. 
Therefore, the only one left that would be affected (outside of Bio::Taxonomy and Bio::DB::Taxonomy) is Allen Day's Bio::Expression::Platform class, which uses Bio::DB::Taxonomy::entrez to grab Nodes; that would just be changed over to whatever class you plan on using. And that class hasn't been documented at all outside the methods. Furthermore, judging by the mail list archives the Bio::Taxonomy modules had very little usage outside of Node. Jason mentioned on an old post that he could never get Bio::Taxonomy::Taxon/Tree to work and that Dan Kortschak had moved (Dan's last post was in 2003). Hence the test file comments. And you make a good point with Bio::Taxonomy::FactoryI. I agree, if the modules haven't served a useful purpose they should be phased out. Chris From cjfields at uiuc.edu Thu Jul 27 11:15:25 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 27 Jul 2006 10:15:25 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: Message-ID: <003301c6b18f$7d114000$15327e82@pyrimidine> Wow, we're doing a little bioperl spring cleaning here! I agree with Hilmar: create a new module (Bio::Taxon), which claims the namespace, and deprecate the old ones. How 'broken', exactly, is Bio::Taxonomy? The idea behind it seems just (container for Nodes) but maybe it should just be reconfigured, and all the classes in directory Bio/Taxonomy deprecated. Or should we start from scratch completely? Don't know if it has been attempted but it would be nice to have a way for building taxonomic trees from Node/Taxon information using a Taxonomy-like container object. I like the way NCBI does something along these lines with BLAST output now. BTW, thanks guys for a rousing discussion! 
Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Hilmar Lapp > Sent: Thursday, July 27, 2006 9:29 AM > To: Sendu Bala > Cc: bioperl-l at lists.open-bio.org > Subject: Re: [Bioperl-l] Bio::*Taxonomy* changes > > How do you mean 'process'? You create a new module, and then you > deprecate the ones you're phasing out. If possible you rewrite the > implementation to use the new module. > > Not sure this answers your question? > > -hilmar > > On Jul 27, 2006, at 10:06 AM, Sendu Bala wrote: > > > Hilmar Lapp wrote: > >> I don't think the top-level or sub-directory matters at all and I > >> don't > >> want anybody to get used to the notion that that may imply anything > >> (except possibly better thought-out structure for the sub-directory > >> level). For instance RichSeq is what all rich annotation sequence > >> format > >> parsers return, yet it is in a sub-directory. > > > > Well, I'm not aware that I've ever used a RichSeq ;). But your > > point is > > taken. > > > > > >> I don't any real objection to Bio::Taxon though if that's what you'd > >> like to name it - although, what will happen to the Bio::Taxonomy > >> hierarchy then? Phased out? > > > > At the moment it seems to me that the Bio::Taxonomy modules (excluding > > Node) aren't really usable. Jason wrote a comment in t/TaxonTree.t > > which > > tests Taxon and Tree: > > > > ## I am pretty sure this module is going the way of the dodo bird so > > ## I am not sure how much work to put into fixing the tests/module > > > > FactoryI is strange (it isn't intended to work like any other Bioperl > > factory) and there are no implementers of it, while Taxonomy.pm itself > > would be redundant after my Node changes and has implementation > > issues, > > though it may make more sense to some people. > > > > My vote is phase out. > > > > > > What is the actual process involved in renaming a module in Bioperl? 
From hlapp at gmx.net Thu Jul 27 11:29:04 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 27 Jul 2006 11:29:04 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <003101c6b189$17f5d2e0$15327e82@pyrimidine> References: <003101c6b189$17f5d2e0$15327e82@pyrimidine> Message-ID: On Jul 27, 2006, at 10:29 AM, Chris Fields wrote: > See, the only concern I have with using the name Bio::Taxon is people > confusing it with Bio::Taxonomy itself or with > Bio::Taxonomy::Taxon. Though > I agree that the name makes sense for what you want. I don't think Bio::Taxonomy is used a lot in earnest if at all, so you could even test the waters by deprecating them right away: put warning statements there and see whether anybody complains about the cluttered terminal screens. If this goes into snapshot releases and release candidates leading up to 1.6 then they may be phased out right away. Unless anybody on the list has strong objections? Anybody using Bio::Taxonomy? 
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From skirov at utk.edu Thu Jul 27 09:57:19 2006 From: skirov at utk.edu (skirov) Date: Thu, 27 Jul 2006 09:57:19 -0400 Subject: [Bioperl-l] TRANSFAC matrices, open acces Message-ID: <44E2E794@webmail.utk.edu> Just a quick note- Bio::Matrix::PSM::IO can parse matrix.dat if you can get it- and as far as I can tell this is not easy- you have to contact the company to get access and it is not clear what their conditions are. This is the reason I have decided not to maintain the transfac parser. Stefan >===== Original Message From Sendu Bala ===== >Sean Davis wrote: >> >> On 7/27/06 5:55 AM, "Sendu Bala" wrote: >> >>> Maximilian Haeussler wrote: >>> Actually, the fact that the transfac matrices are belonging to a >>> company is quite inconvenient for biologists and bioinformatics >>> analyses working in this field. > > >>> The public version is adequate though. It would certainly be useful to >>> have Bioperl access to transfac and other regulation databases. I'll >>> probably write some suitable modules if no one beats me to it. >> >> I haven't used it in a while, but the TFBS family of modules are, if I >> recall correctly, bioperl-compatible, though not part of bioperl. In any >> case, for those who aren't aware, it might be worth a quick look: > >Yes. It only has online access to Transfac though, and the inheritance >and returned objects are TFBS specific so you miss out on whatever >goodness there may be in the rest of bioperl. > >Still, recommended to use if you want programmatic access to Transfac >matrices right now. 
From bix at sendu.me.uk Thu Jul 27 12:30:38 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 27 Jul 2006 17:30:38 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: References: <000001c6b017$873176a0$15327e82@pyrimidine> <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net> <44C73D21.3010301@sendu.me.uk> <44C7A2C7.2070100@sendu.me.uk> <86D32350-0412-40AC-8C34-51DCD3064711@gmx.net> <44C8C04A.8070504@sendu.me.uk> <077A86AB-174D-426E-B65F-0954A72A6019@gmx.net> <44C8C85F.2010104@sendu.me.uk> Message-ID: <44C8EA2E.8030000@sendu.me.uk> Hilmar Lapp wrote: > How do you mean 'process'? You create a new module, and then you > deprecate the ones you're phasing out. If possible you rewrite the > implementation to use the new module. > > Not sure this answers your question? I guess. I was thinking of just making Bio::Taxonomy::Node isa Bio::Taxon and then simply removing all the code from Node, leaving just some perldoc that said it had been renamed? Or should there be some methods that issue a warning and then call SUPER? From hlapp at gmx.net Thu Jul 27 12:38:30 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 27 Jul 2006 12:38:30 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C8EA2E.8030000@sendu.me.uk> References: <000001c6b017$873176a0$15327e82@pyrimidine> <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net> <44C73D21.3010301@sendu.me.uk> <44C7A2C7.2070100@sendu.me.uk> <86D32350-0412-40AC-8C34-51DCD3064711@gmx.net> <44C8C04A.8070504@sendu.me.uk> <077A86AB-174D-426E-B65F-0954A72A6019@gmx.net> <44C8C85F.2010104@sendu.me.uk> <44C8EA2E.8030000@sendu.me.uk> Message-ID: <881BB312-DF1B-43F6-A38D-8B543738244F@gmx.net> That's what I said could be possible here, on much shorter notice than we'd usually do, due to the low usage. 
Eventually deprecated modules should also be physically removed, so you want to prepare for that. (removing a module breaks scripts that used it; issuing a warning alerts to this being forthcoming.) -hilmar On Jul 27, 2006, at 12:30 PM, Sendu Bala wrote: > Hilmar Lapp wrote: >> How do you mean 'process'? You create a new module, and then you >> deprecate the ones you're phasing out. If possible you rewrite the >> implementation to use the new module. >> >> Not sure this answers your question? > > I guess. I was thinking of just making Bio::Taxonomy::Node isa > Bio::Taxon and then simply removing all the code from Node, leaving > just > some perldoc that said it had been renamed? > > Or should there be some methods that issue a warning and then call > SUPER? > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From sanges at biogem.it Thu Jul 27 12:37:05 2006 From: sanges at biogem.it (Remo Sanges) Date: Thu, 27 Jul 2006 18:37:05 +0200 Subject: [Bioperl-l] TRANSFAC matrices, open acces In-Reply-To: <44E2E794@webmail.utk.edu> References: <44E2E794@webmail.utk.edu> Message-ID: <44C8EBB1.5070709@biogem.it> Here is also my 2c on TFBS: skirov wrote: >Just a quick note- Bio::Matrix::PSM::IO can parse matrix.dat if you can get >it- and as far as I can tell this is not easy- you have to contact the company >to get access and it is not clear what their conditions are. This is the >reason I have decided not to maintain the transfac parser. 
>Stefan > > >>===== Original Message From Sendu Bala ===== >>Sean Davis wrote: >> >> >>>On 7/27/06 5:55 AM, "Sendu Bala" wrote: >>> >>> >>> >>>>Maximilian Haeussler wrote: >>>>Actually, the fact that the transfac matrices are belonging to a >>>>company is quite inconvenient for biologists and bioinformatics >>>>analyses working in this field. >>>> >>>> >>>>The public version is adequate though. It would certainly be useful to >>>>have Bioperl access to transfac and other regulation databases. I'll >>>>probably write some suitable modules if no one beats me to it. >>>> >>>> >>>I haven't used it in a while, but the TFBS family of modules are, if I >>>recall correctly, bioperl-compatible, though not part of bioperl. In any >>>case, for those who aren't aware, it might be worth a quick look: >>> >>> >>Yes. It only has online access to Transfac though >> TFBS::DB::LocalTRANSFAC - can parse local transfac matrices (matrix.dat) >>, and the inheritance >>and returned objects are TFBS specific so you miss out on whatever >>goodness there may be in the rest of bioperl. >> >> >> In TFBS there are modules which inherit from Bio::SeqFeature::Generic and Bio::Root::Root. See for example TFBS::Site. So probably it is not so bad.... Here is the link cut from Sean's e-mail: http://forkhead.cgb.ki.se/TFBS/ HTH Remo From osborne1 at optonline.net Thu Jul 27 12:49:26 2006 From: osborne1 at optonline.net (Brian Osborne) Date: Thu, 27 Jul 2006 12:49:26 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <881BB312-DF1B-43F6-A38D-8B543738244F@gmx.net> Message-ID: Sendu, And add the module name or names to the DEPRECATED file. Brian O. On 7/27/06 12:38 PM, "Hilmar Lapp" wrote: > Eventually deprecated modules should also be physically removed From MEC at stowers-institute.org Thu Jul 27 13:28:03 2006 From: MEC at stowers-institute.org (Cook, Malcolm) Date: Thu, 27 Jul 2006 12:28:03 -0500 Subject: [Bioperl-l] TRANSFAC matrices, open acces Message-ID: re: >Yes. 
It only has online access to Transfac though, not quite true. It does support access to local transfac data files if you have them. --Malcolm From cjfields at uiuc.edu Thu Jul 27 13:45:28 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 27 Jul 2006 12:45:28 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <881BB312-DF1B-43F6-A38D-8B543738244F@gmx.net> Message-ID: <000301c6b1a4$73ef3fd0$15327e82@pyrimidine> Makes sense to me. From my previous post the only bioperl class that used it was Bio::Expression::Platform, and that only for grabbing Node objects from Bio::DB::Taxonomy::entrez (so, change it to use whatever object Bio::DB::Taxonomy returns). I couldn't find anything else in the core outside of the Bio::DB::Taxonomy and Bio::Taxonomy classes and tests that use them. There aren't even any scripts or examples. If you implement Bio::Root::RootI (and pretty much everything does), you could use warn() or deprecated() for these easily: ... Title : warn Usage : $object->warn("Warning message"); Function: Places a warning. What happens now is down to the verbosity of the object (value of $obj->verbose) verbosity 0 or not set => small warning verbosity -1 => no warning verbosity 1 => warning with stack trace verbosity 2 => converts warnings into throw ... Title : deprecated Usage : $obj->deprecated("Method X is deprecated"); Function: Prints a message about deprecation unless verbose is < 0 (which means be quiet) Returns : none Args : Message string to print to STDERR ... Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Hilmar Lapp > Sent: Thursday, July 27, 2006 11:39 AM > To: Sendu Bala > Cc: bioperl-l at bioperl.org > Subject: Re: [Bioperl-l] Bio::*Taxonomy* changes > > That's what I said could be possible here on much shorter notice that > we'd do usually due to the low usage. 
> > Eventually deprecated modules should also be physically removed, so > you want to prepare for that. (removing a module breaks scripts that > used it; issuing a warning alerts to this being forthcoming.) > > -hilmar > > On Jul 27, 2006, at 12:30 PM, Sendu Bala wrote: > > > Hilmar Lapp wrote: > >> How do you mean 'process'? You create a new module, and then you > >> deprecate the ones you're phasing out. If possible you rewrite the > >> implementation to use the new module. > >> > >> Not sure this answers your question? > > > > I guess. I was thinking of just making Bio::Taxonomy::Node isa > > Bio::Taxon and then simply removing all the code from Node, leaving > > just > > some perldoc that said it had been renamed? > > > > Or should there be some methods that issue a warning and then call > > SUPER? > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From bix at sendu.me.uk Thu Jul 27 15:30:47 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 27 Jul 2006 20:30:47 +0100 Subject: [Bioperl-l] TRANSFAC matrices, open acces In-Reply-To: References: Message-ID: <44C91467.5050001@sendu.me.uk> Cook, Malcolm wrote: > re: > >> Yes. It only has online access to Transfac though, > > not quite true. It does support access to local transfac data files if > you have them. And to local Jaspar files. I wasn't clear, but I meant for the 'only' to modify 'online'. Ie. it doesn't give you access to any other online databases. 
From bix at sendu.me.uk Thu Jul 27 15:55:32 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 27 Jul 2006 20:55:32 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <003101c6b189$17f5d2e0$15327e82@pyrimidine> References: <003101c6b189$17f5d2e0$15327e82@pyrimidine> Message-ID: <44C91A34.1040406@sendu.me.uk> Chris Fields wrote: > BTW, what exactly is Bio::Taxonomy::Taxon used for? Looks like it is used > more for building taxonomic trees than anything, so shouldn't it be moved to > Bio::Tree::Taxon (that name isn't used)? Then you could use > Bio::Taxonomy::Taxon for your purposes. It actually seemed more like a possible replacement for Bio::Taxonomy::Node. Thanks to its Tree::NodeI implementation it has the big advantage over Bio::Taxonomy::Node that you can access the lineage without a database. From the programmer's point of view it seemed more natural, being able to create nodes and add descendants. I decided against it because I felt the added complexity wasn't really worth it, and Bio::Taxonomy::Node had some of its own advantages. If this turns out to be the wrong choice, my Bio::Taxon can always be reimplemented to also implement Tree::NodeI in the future. > See, the only concern I have with using the name Bio::Taxon is people > confusing it with Bio::Taxonomy itself or with Bio::Taxonomy::Taxon. Though > I agree that the name makes sense for what you want. I don't think you'd confuse it directly with Bio::Taxonomy, but you could certainly waste some time thinking it was appropriate to stick Bio::Taxon objects in Bio::Taxonomy objects - theoretically it might work but ultimately you'd just be wasting your time. I'll make sure the docs in the Taxonomy modules steer people in the right direction. 
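The rename-by-inheritance idea raised earlier in this thread (make the old class isa the new one, strip its code, and have it warn) might look roughly like the sketch below. It is only an illustration under stated assumptions: My::Taxon and My::Taxonomy::Node are hypothetical stand-ins for the proposed Bio::Taxon and the existing Bio::Taxonomy::Node, and Perl's built-in warn is used so the example runs without BioPerl installed. Inside BioPerl itself one would presumably call the warn() or deprecated() methods of Bio::Root::RootI instead.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the new class (the thread proposes Bio::Taxon).
{
    package My::Taxon;

    sub new {
        my ($class, %args) = @_;
        # Bless into whatever class was asked for, so subclasses work too.
        return bless { name => $args{-name} }, $class;
    }

    sub name { return $_[0]->{name} }
}

# Hypothetical stand-in for the old class (Bio::Taxonomy::Node in the thread).
# It keeps no logic of its own: it inherits everything and only adds a warning.
{
    package My::Taxonomy::Node;
    our @ISA = ('My::Taxon');

    sub new {
        my ($class, @args) = @_;
        warn "$class is deprecated; use My::Taxon instead\n";
        return $class->SUPER::new(@args);
    }
}

package main;

# Old scripts keep working, but now see a warning on STDERR.
my $node = My::Taxonomy::Node->new(-name => 'Homo sapiens');
print $node->name, "\n";
```

Existing scripts that construct the old class keep working and get a warning, which is the alert Hilmar asks for before the module is physically removed.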
From bix at sendu.me.uk Thu Jul 27 16:18:06 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 27 Jul 2006 21:18:06 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <003301c6b18f$7d114000$15327e82@pyrimidine> References: <003301c6b18f$7d114000$15327e82@pyrimidine> Message-ID: <44C91F7E.2040000@sendu.me.uk> Chris Fields wrote: > How 'broken', exactly, is Bio::Taxonomy? Its certainly usable as-is, but there are some gotchas. # It has an acknowledged weakness of not coping with multiple ranks of the same name (notably 'no rank'). # You can't have 2 nodes with the same rank (so can only build a single lineage, not a whole menagerie). # You must supply a list of all your rank names correctly ordered before you can add any nodes (or trust that the default list is satisfactory - it won't be if you have just a single 'no rank'). # You simply don't need it if you have Bio::Taxonomy::Nodes with db_handle set, or Bio::Taxonomy::Taxons. In my opinion, the burden is just too great for this ever to have been a 'fun' module to use. It was only required so that people could manually create their own Bio::Taxonomy::Nodes and form a lineage without a database. > Don't know if it has been attempted but it would be nice to have a way for > building taxonomic trees from Node/Taxon information using a Taxonomy-like > container object. I like the way NCBI does something along these lines with > BLAST output now. Not really sure what you mean. I don't think you'd require a container object to do any particular task. Can you clarify? From clarsen at vecna.com Thu Jul 27 15:59:50 2006 From: clarsen at vecna.com (Chris Larsen) Date: Thu, 27 Jul 2006 15:59:50 -0400 (EDT) Subject: [Bioperl-l] Working code Message-ID: <7263.70.106.6.26.1154030390.squirrel@mail.vecna.com> Hey gang, You said you wanted to see working code: ------------------------------------------- > ...And I'd rather see some code or API examples than > extensive elaborations. 
> > -hilmar Hilmar's right; working code does speaks louder than words. -Chris -------------------------------------------- So here's some: http://www.biohealthbase.org/GSearch/ We've just released the v2 of Bioinformatic Resource Center's website "Biohealthbase". Earlier I pointed out BHB v1 to the list; then we had implemented GBrowse on top of GUS 3. There was some data processing using BioPerl packages to generate well-formatted data for the Oracle instance. But new micro-organisms are added now, so we have Francisella, Mycobacterium, Microsporidia, Giardia, and Influenza. They are under GUS 3.5. We also now have some web-capable BLASTing under there (Please no spam!) And multiple sequence alignments and dendrograms are to come, using MUSCLE instead of ClustalW. Currently, a Bioperl I/O module accepts the output from BLAST and writes up some HTML, then our web app on another server displays the URL content. But we will improve on this model in v3 for MSA et al. since the requirements are different for multiple vs single alignments. Thanks again for the open source! Chris ---------------------------- Christopher Larsen, Ph.D. Senior Scientist Vecna Technologies, Inc. 5004 Lehigh Rd College Park, MD 20740-3821 e: clarsen at vecna.com ph: (240) 737-1625 f: (301) 699-3180 From skirov at utk.edu Thu Jul 27 09:56:45 2006 From: skirov at utk.edu (skirov) Date: Thu, 27 Jul 2006 09:56:45 -0400 Subject: [Bioperl-l] TRANSFAC matrices, open acces Message-ID: <44E2E5B9@webmail.utk.edu> Just a quick note- Bio::Matrix::PSM::IO can parse matrix.dat if you can get it- and as far as I can tell this is not easy- you have to contact the company to get access and it is not clear what their conditions are. 
Stefan >===== Original Message From Sendu Bala ===== >Sean Davis wrote: >> >> On 7/27/06 5:55 AM, "Sendu Bala" wrote: >> >>> Maximilian Haeussler wrote: >>> Actually, the fact that the transfac matrices are belonging to a >>> company is quite inconvenient for biologists and bioinformatics >>> analyses working in this field. > > >>> The public version is adequate though. It would certainly be useful to >>> have Bioperl access to transfac and other regulation databases. I'll >>> probably write some suitable modules if no one beats me to it. >> >> I haven't used it in a while, but the TFBS family of modules are, if I >> recall correctly, bioperl-compatible, though not part of bioperl. In any >> case, for those who aren't aware, it might be worth a quick look: > >Yes. It only has online access to Transfac though, and the inheritance >and returned objects are TFBS specific so you miss out on whatever >goodness there may be in the rest of bioperl. > >Still, recommended to use if you want programmatic access to Transfac >matrices right now. >_______________________________________________ >Bioperl-l mailing list >Bioperl-l at lists.open-bio.org >http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Thu Jul 27 21:19:51 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 27 Jul 2006 20:19:51 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C91F7E.2040000@sendu.me.uk> References: <003301c6b18f$7d114000$15327e82@pyrimidine> <44C91F7E.2040000@sendu.me.uk> Message-ID: <3DAB9065-3633-4D50-B97E-41F2BB58C6EB@uiuc.edu> ... >> Don't know if it has been attempted but it would be nice to have a >> way for >> building taxonomic trees from Node/Taxon information using a >> Taxonomy-like >> container object. I like the way NCBI does something along these >> lines with >> BLAST output now. > > Not really sure what you mean. I don't think you'd require a container > object to do any particular task. Can you clarify? 
Let's say you start with a list of sequence IDs from a BLAST report and wanted to find the taxonomic relationship between the BLAST hits. NCBI does something similar to this in their last few BLAST output revisions from the CGI interface; they have a link which contains the organisms ranked taxonomically in various ways. There is probably a Bioperl-specific way of doing this but I haven't spent the effort yet working out how. No big deal, really. I have PLENTY else to work on. Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From R.Birnie at leeds.ac.uk Fri Jul 28 05:39:34 2006 From: R.Birnie at leeds.ac.uk (Richard Birnie) Date: Fri, 28 Jul 2006 10:39:34 +0100 Subject: [Bioperl-l] whole genome annotation Message-ID: Hello all, I'm just trying to familiarise myself with BioPerl and I'm a little overwhelmed by the sheer volume of information available on the wiki. I'm hoping someone can point in the right direction through the labyrinth. This may become a little longwinded but I'll try and get all the annoying newbie questions out of the way in one go. Let me try and explain what I'm aiming for. I have some CGH data downloaded from the Progenetix database (http://www.progenetix.de/~pgscripts/progenetix/Aboutprogenetix.html), this data is simplified to record simply gain/loss/amplification of whole chromosome bands at 862 band resolution to facilitate the combination of data from multiple different studies. What I'd like to be able to do is download a copy of the human genome sequence with annotation describing the locations of chromosome bands and preferably of known genes. I then want to be able to manipulate the genome data based on the CGH data to mimic deletions. 
The ultimate goal of this is to be able to feed the manipulated genome data into a program (metashark) that predicts the structure of metabolic networks based on genome annotation compared to a reference genome, in this case a complete 'normal' human genome, and see what effect that has on the metabolic pathways. I appreciate that is a bit vague but that's sort of my problem, I'm not a bioinformatician really so I'm not sure of the details of what I want. I just happen to have a question to answer and bioperl seems the way to go (for this project and more generally). I've started looking at the HOWTOs and read the main bioperl tutorial. I also looked at the CGL comparative genomics library but I haven't penetrated far into that yet. I'm ok with basic perl although not much object oriented stuff. I don't really have much experience with handling sequence data on a whole genome scale either, a few genbank files for my favourite genes is fine but I need some guidance to work on this scale. What I'm looking for is someone to give me a start. I'd greatly appreciate it if someone could spell out the general steps for downloading a complete copy of the human genome and its annotations (if this is even a feasible approach) and how to put it all together. Not actual code, just the general concept for each step and which tools from the bioperl set would be most appropriate for each step so that I can focus on what I need to read about, even a little pseudo-code if I'm lucky. If I can get the genome data downloaded and set up properly I'll work out how to apply the CGH data to it myself. If example code for what I'm trying to describe is included somewhere, great, could someone point to where. Thanks for your patience. 
best regards, Richard Dr Richard Birnie Scientific Officer Section of Pathology and Tumour Biology Welcome Brenner Building, LIMM St James University Hospital Beckett St, Leeds, LS9 7TF Tel:0113 3438624 e-mail: r.birnie at leeds.ac.uk From sdavis2 at mail.nih.gov Fri Jul 28 07:59:17 2006 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Fri, 28 Jul 2006 07:59:17 -0400 Subject: [Bioperl-l] whole genome annotation In-Reply-To: References: Message-ID: <44C9FC15.3040503@mail.nih.gov> Richard Birnie wrote: > Hello all, > > I'm just trying to familiarise myself with BioPerl and I'm a little overwhelmed by the sheer volume of information available on the wiki. I'm hoping someone can point in the right direction through the labyrinth. This may become a little longwinded but I'll try and get all the annoying newbie questions out of the way in one go. > > Let me try and explain what I'm aiming for. I have some CGH data downloaded from the Progenetix database (http://www.progenetix.de/~pgscripts/progenetix/Aboutprogenetix.html), this data is simplified to record simply gain/loss/amplification of whole chromosome bands at 862 band resolution to facilitate the combination of data from multiple different studies. > > What I'd like to be able to do is download a copy of the human genome sequence with annotation describing the locations of chromosome bands and preferably of known genes. I then want to be able to manipulate the genome data based on the CGH data to mimic deletions. The ultimate goal of this is to be able to feed the manipulated genome data into a program (metashark) that predicts the structure of metabolic networks based on genome annotation compared to a reference genome, in this case a complete 'normal' human genome and see what effect that has on the metabolic pathways. > > I appreciate that is a bit vague but thats sort of my problem, I'm not a bioinformatician really so I'm not sue of the details of what I want. 
I just happen to have an question to answer and bioperl seems the way to go (for this project and more generally). I've started looking at the HOWTOs and read the main bioperl tutorial. I also looked at the CGL comparative genomics library but I haven't penetrated far into that yet. I'm ok with basic perl although not much object oriented stuff. I don't really have much experience with handling sequence data on a whole genome scale either, a few genbank files for my favourite genes is fine but I need some guidance to work on this scale. > > What I'm looking for is someone to give me a start. I'd greatly appreciate it if someone could spell out the general steps for downloading a complete copy of the human genome and its annotations (if this is even a feasible approach) and how to put it all together. Not actual code just the general concept for each step and which tools from the bioperl set would be most appropriate for each step so that I can focus what I need to read about, even a little pseudo-code if I'm lucky. If I can get the genome data downloaded and setup properly I'll work out how to apply the CGH data to it myself. > > If example code for what I'm trying to describe is included somewhere, great could someone point to where. Hi, Richard. Bioperl is good for many things, but for simply grabbing all the locations of human genes in the genome and chromosome band locations, I wouldn't use bioperl. It sounds to me like you are interested in getting the genes associated with each chromosomal band? If so, just download the cytoband.txt and refFlat.txt files from the UCSC genome browser site. cytoband.txt contains the base pair locations for each of the cytobands. refFlat.txt contains the base pair locations of "refseq" genes. It is then simply a matter of finding overlapping regions (genes with cytobands) to determine which genes are in which cytobands. Since the files are tab-delimited text, they are very easy to work with (in perl, excel, python, ...). 
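The overlap step can be sketched in plain Perl along the following lines. This is an illustration, not a tested pipeline: it assumes the cytoband file has the columns chrom, start, end, band name, stain, and that the refFlat file starts with gene name, accession, chrom, strand, txStart, txEnd, with 0-based half-open coordinates. Check that layout against the files you actually download. The rows below are made-up toy data, and the helper names parse_bands and bands_for_region are invented for this example.

```perl
use strict;
use warnings;

# Parse cytoband lines (assumed layout: chrom, start, end, band name, stain).
sub parse_bands {
    my (@lines) = @_;
    my %bands;    # chrom => [ [start, end, name], ... ]
    for my $line (@lines) {
        chomp $line;
        my ($chrom, $start, $end, $name) = split /\t/, $line;
        push @{ $bands{$chrom} }, [ $start, $end, $name ];
    }
    return \%bands;
}

# Return the names of all bands overlapping the half-open region
# [start, end) on the given chromosome, in file order.
sub bands_for_region {
    my ($bands, $chrom, $start, $end) = @_;
    return map  { $_->[2] }
           grep { $_->[0] < $end && $start < $_->[1] }
           @{ $bands->{$chrom} || [] };
}

# Made-up toy rows, standing in for lines read from the two files.
my @band_lines = (
    "chr1\t0\t1000\tp2\tgneg",
    "chr1\t1000\t2500\tp1\tgpos50",
    "chr1\t2500\t4000\tq1\tgneg",
);
my @gene_lines = (
    "GENE_A\tNM_TOY1\tchr1\t+\t200\t800",
    "GENE_B\tNM_TOY2\tchr1\t-\t900\t1200",
    "GENE_C\tNM_TOY3\tchr1\t+\t2500\t3000",
);

my $bands = parse_bands(@band_lines);
for my $line (@gene_lines) {
    my ($gene, $acc, $chrom, $strand, $tx_start, $tx_end) = split /\t/, $line;
    my @hits = bands_for_region($bands, $chrom, $tx_start, $tx_end);
    print join("\t", $gene, $chrom, join(",", @hits)), "\n";
}
```

A linear scan per gene is fine for a few hundred bands per chromosome. A gene that straddles a band boundary is reported in both bands.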
Don't get me wrong--I really appreciate the power of bioperl, but in this case, your task lends itself to a simpler (and MUCH) faster approach. Sean From R.Birnie at leeds.ac.uk Fri Jul 28 08:21:46 2006 From: R.Birnie at leeds.ac.uk (Richard Birnie) Date: Fri, 28 Jul 2006 13:21:46 +0100 Subject: [Bioperl-l] whole genome annotation References: <44C9FC15.3040503@mail.nih.gov> Message-ID: -----Original Message----- From: Sean Davis [mailto:sdavis2 at mail.nih.gov] Sent: Fri 7/28/2006 12:59 To: Richard Birnie Cc: bioperl-l at lists.open-bio.org Subject: Re: [Bioperl-l] whole genome annotation Richard Birnie wrote: > Hello all, > > I'm just trying to familiarise myself with BioPerl and I'm a little overwhelmed by the sheer volume of information available on the wiki. I'm hoping someone can point in the right direction through the labyrinth. This may become a little longwinded but I'll try and get all the annoying newbie questions out of the way in one go. > > Let me try and explain what I'm aiming for. I have some CGH data downloaded from the Progenetix database (http://www.progenetix.de/~pgscripts/progenetix/Aboutprogenetix.html), this data is simplified to record simply gain/loss/amplification of whole chromosome bands at 862 band resolution to facilitate the combination of data from multiple different studies. > > What I'd like to be able to do is download a copy of the human genome sequence with annotation describing the locations of chromosome bands and preferably of known genes. I then want to be able to manipulate the genome data based on the CGH data to mimic deletions. The ultimate goal of this is to be able to feed the manipulated genome data into a program (metashark) that predicts the structure of metabolic networks based on genome annotation compared to a reference genome, in this case a complete 'normal' human genome and see what effect that has on the metabolic pathways. 
> > I appreciate that is a bit vague but thats sort of my problem, I'm not a bioinformatician really so I'm not sue of the details of what I want. I just happen to have an question to answer and bioperl seems the way to go (for this project and more generally). I've started looking at the HOWTOs and read the main bioperl tutorial. I also looked at the CGL comparative genomics library but I haven't penetrated far into that yet. I'm ok with basic perl although not much object oriented stuff. I don't really have much experience with handling sequence data on a whole genome scale either, a few genbank files for my favourite genes is fine but I need some guidance to work on this scale. > > What I'm looking for is someone to give me a start. I'd greatly appreciate it if someone could spell out the general steps for downloading a complete copy of the human genome and its annotations (if this is even a feasible approach) and how to put it all together. Not actual code just the general concept for each step and which tools from the bioperl set would be most appropriate for each step so that I can focus what I need to read about, even a little pseudo-code if I'm lucky. If I can get the genome data downloaded and setup properly I'll work out how to apply the CGH data to it myself. > > If example code for what I'm trying to describe is included somewhere, great could someone point to where. Hi, Richard. Bioperl is good for many things, but for simply grabbing all the locations of human genes in the genome and chromosome band locations, I wouldn't use bioperl. It sounds to me like you are interested in getting the genes associated with each chromosomal band? If so, just download the cytoband.txt and refFlat.txt files from the UCSC genome browser site. cytoband.txt contains the base pair locations for each of the cytobands. refFlat.txt contains the base pair locations of "refseq" genes. 
It is then simply a matter of finding overlapping regions (genes with cytobands) to determine which genes are in which cytobands. Since the files are tab-delimited text, they are very easy to work with (in perl, excel, python, ...). Don't get me wrong--I really appreciate the power of bioperl, but in this case, your task lends itself to a simpler (and MUCH) faster approach. Sean Thanks for the response Sean, getting the genes associated with each band is certainly part of what I need and your suggestion will help with that. I did look at the UCSC site but as you know there is such a volume of info on there I didn't really know which files I needed. However my main goal requires slightly more. What I want to be able to do is take the chromosomal band annotation info and compare that against the CGH data I have. From this I'd like to then be able say "OK band 8q13.1 (or whatever) is deleted, so make a copy of the genome with the actual sequence associated with that band removed." Then I could feed both sequences into metashark which predicts the structure of metabolic pathways based on genome annotation and see what effect deleting that region of DNA has on the structure of the metabolic network. Knowing which genes are involved is necessary for identifying what are the important components within the region. Are there tools in Bioperl for making this comparison? It can probably be reduced to a straight comparison of data structures so I may just use regular perl for this part unless there is anything designed for purpose. The thing I was struggling with was how to store and manipulate genomic sequence data in such quantities. Since this morning I've had a better look at the CGL library and associated datastore module, I think I can do it using these tools but I'm having a few dependency issues getting it installed right now. So I'll go back to wrestling with that. 
regards, Richard From valiente at lsi.upc.edu Fri Jul 28 08:10:19 2006 From: valiente at lsi.upc.edu (Gabriel Valiente) Date: Fri, 28 Jul 2006 15:10:19 +0300 Subject: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields) In-Reply-To: References: Message-ID: >>> At the moment it seems to me that the Bio::Taxonomy modules >>> (excluding >>> Node) aren't really usable. I've been using Bio::Taxonomy and Bio::DB::Taxonomy a lot, they are very useful modules. Whatever Bio::Taxon or Bio::Taxonomy::Taxon turns out to be, please do keep the Bio::DB::Taxonomy functionality. BTW, does anybody know how to include branch lengths in Bio::DB::Taxonomy? Thanks a lot, Gabriel From y.itan at ucl.ac.uk Fri Jul 28 08:07:32 2006 From: y.itan at ucl.ac.uk (Yuval Itan) Date: Fri, 28 Jul 2006 13:07:32 +0100 Subject: [Bioperl-l] Getting sequences by base pair locations Message-ID: <05a8f85bdc36dc7e370d0935b6394f9d@ucl.ac.uk> Hello all, I was BLATing a few hundred human genes against the chimp genome, and kept the best chimp hits for every human gene. I have the base pair start and end location for every chimp hit, and I need to get the sequence for each of these chimp hits. Here is an example for a few chimp hits bp locations: Start End 142854 144504 154479 155198 153066 167370 163146 163559 I have one chimp genome file (about 3GB) including all chromosomes, but I could also get one file per chromosome if that would make things easier. Does anyone have a script or a link for an interface that can do the job? Thank you very much.
From hlapp at gmx.net Fri Jul 28 08:59:43 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Fri, 28 Jul 2006 08:59:43 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields) In-Reply-To: References: Message-ID: <233D3060-5CF7-4DF7-8EF6-6762CF45B94D@gmx.net> If I understand Sendu's proposal correctly then the existing methods in Bio::DB::Taxonomy will remain largely unchanged (methods may be added though). Can you describe briefly what you use Bio::Taxonomy for, e.g., which methods you use primarily and the context? -hilmar On Jul 28, 2006, at 8:10 AM, Gabriel Valiente wrote: >>>> At the moment it seems to me that the Bio::Taxonomy modules >>>> (excluding >>>> Node) aren't really usable. > > I've been using Bio::Taxonomy and Bio::DB::Taxonomy a lot, they are > very useful modules. Whatever Bio::Taxon or Bio::Taxonomy::Taxon > turns out to be, please do keep the Bio::DB::Taxonomy functionality. > > BTW, does anybody know how to include branch lengths in > Bio::DB::Taxonomy? > > Thanks a lot, > > Gabriel > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From bix at sendu.me.uk Fri Jul 28 09:01:44 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Fri, 28 Jul 2006 14:01:44 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields) In-Reply-To: References: Message-ID: <44CA0AB8.7040205@sendu.me.uk> Gabriel Valiente wrote: >>>> At the moment it seems to me that the Bio::Taxonomy modules >>>> (excluding >>>> Node) aren't really usable. > > I've been using Bio::Taxonomy Can I ask how you've been using it? > and Bio::DB::Taxonomy a lot, they are > very useful modules.
Whatever Bio::Taxon or Bio::Taxonomy::Taxon > turns out to be, please do keep the Bio::DB::Taxonomy functionality. Bio::DB::Taxonomy is staying virtually unaltered. > BTW, does anybody know how to include branch lengths in > Bio::DB::Taxonomy? At the moment, you don't 'include' anything at all in the DB modules yourself, since they are read-only. They give you Nodes which you can alter afterwards. I plan to add something like a 'distance to parent' in Node (Bio::Taxon) so you can work out branch lengths; you can't do that yet. From bix at sendu.me.uk Fri Jul 28 09:13:44 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Fri, 28 Jul 2006 14:13:44 +0100 Subject: [Bioperl-l] Getting sequences by base pair locations In-Reply-To: <05a8f85bdc36dc7e370d0935b6394f9d@ucl.ac.uk> References: <05a8f85bdc36dc7e370d0935b6394f9d@ucl.ac.uk> Message-ID: <44CA0D88.3000404@sendu.me.uk> Yuval Itan wrote: > Hello all, > > I was BLATing a few hundred human genes against the chimp genome, and > kept the best chimp hits for every human gene. > I have the base pair start and end location for every chimp hit, and I > need to get the sequence for each of these chimp hits. Here is an > example for a few chimp hits bp locations: > > Start End* > *142854 144504 > 154479 155198 > 153066 167370 > 163146 163559 > > I have one chimp genome file (about 3GB) including all chromosomes, but > I could also get one file per chromosome if that would make things > easier. Does anyone have a script or a link for an interface that can do > the job? If your genome file is in some standard format, use SeqIO. http://www.bioperl.org/wiki/HOWTO:SeqIO And then get the sequence corresponding to the correct chromosome and get the desired chunk with subseq(); http://www.bioperl.org/wiki/HOWTO:Beginners#The_Sequence_Object You'd also have to make sure that the data used during the blat is exactly the same data you have in your big file. 
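A minimal sketch of the SeqIO route described above, assuming a FASTA-format genome file and assuming each hit also records which chromosome it came from (the posted example lists only start and end). The tiny file, the chr_demo identifier and the %hits layout are made up for illustration; Bio::SeqIO->new, next_seq, display_id and subseq are the standard BioPerl calls.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use Bio::SeqIO;

# Hypothetical stand-in for the genome file: one tiny "chromosome".
my $dir   = tempdir(CLEANUP => 1);
my $fasta = "$dir/genome.fa";
open my $fh, '>', $fasta or die "Cannot write $fasta: $!";
print $fh ">chr_demo\nACGTACGTAC\n";
close $fh;

# Hits grouped by sequence id (assumed layout); coordinates are
# 1-based and inclusive, which is what subseq() expects.
my %hits = ( chr_demo => [ [3, 6], [1, 4] ] );

my @chunks;
my $in = Bio::SeqIO->new(-file => $fasta, -format => 'fasta');
while (my $seq = $in->next_seq) {
    # Skip sequences that have no hits recorded against them.
    my $regions = $hits{ $seq->display_id } or next;
    for my $region (@$regions) {
        my ($start, $end) = @$region;
        push @chunks, $seq->subseq($start, $end);
    }
}
print "$_\n" for @chunks;
```

Each chromosome is read into memory in full before subseq() is called, so on a 3GB genome this is simple but probably slow; it mainly makes sense when the hits sit on a handful of sequences.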
From sdavis2 at mail.nih.gov Fri Jul 28 09:28:02 2006 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Fri, 28 Jul 2006 09:28:02 -0400 Subject: [Bioperl-l] whole genome annotation In-Reply-To: References: <44C9FC15.3040503@mail.nih.gov> Message-ID: <44CA10E2.8010205@mail.nih.gov> Richard Birnie wrote: > > -----Original Message----- > From: Sean Davis [mailto:sdavis2 at mail.nih.gov] > Sent: Fri 7/28/2006 12:59 > To: Richard Birnie > Cc: bioperl-l at lists.open-bio.org > Subject: Re: [Bioperl-l] whole genome annotation > > Richard Birnie wrote: > >>Hello all, >> >>I'm just trying to familiarise myself with BioPerl and I'm a little overwhelmed by the sheer volume of information available on the wiki. I'm hoping someone can point in the right direction through the labyrinth. This may become a little longwinded but I'll try and get all the annoying newbie questions out of the way in one go. >> >>Let me try and explain what I'm aiming for. I have some CGH data downloaded from the Progenetix database (http://www.progenetix.de/~pgscripts/progenetix/Aboutprogenetix.html), this data is simplified to record simply gain/loss/amplification of whole chromosome bands at 862 band resolution to facilitate the combination of data from multiple different studies. >> >>What I'd like to be able to do is download a copy of the human genome sequence with annotation describing the locations of chromosome bands and preferably of known genes. I then want to be able to manipulate the genome data based on the CGH data to mimic deletions. The ultimate goal of this is to be able to feed the manipulated genome data into a program (metashark) that predicts the structure of metabolic networks based on genome annotation compared to a reference genome, in this case a complete 'normal' human genome and see what effect that has on the metabolic pathways. >> >>I appreciate that is a bit vague but thats sort of my problem, I'm not a bioinformatician really so I'm not sue of the details of what I want. 
I just happen to have an question to answer and bioperl seems the way to go (for this project and more generally). I've started looking at the HOWTOs and read the main bioperl tutorial. I also looked at the CGL comparative genomics library but I haven't penetrated far into that yet. I'm ok with basic perl although not much object oriented stuff. I don't really have much experience with handling sequence data on a whole genome scale either, a few genbank files for my favourite genes is fine but I need some guidance to work on this scale. >> >>What I'm looking for is someone to give me a start. I'd greatly appreciate it if someone could spell out the general steps for downloading a complete copy of the human genome and its annotations (if this is even a feasible approach) and how to put it all together. Not actual code just the general concept for each step and which tools from the bioperl set would be most appropriate for each step so that I can focus what I need to read about, even a little pseudo-code if I'm lucky. If I can get the genome data downloaded and setup properly I'll work out how to apply the CGH data to it myself. >> >>If example code for what I'm trying to describe is included somewhere, great could someone point to where. > > > Hi, Richard. > > Bioperl is good for many things, but for simply grabbing all the > locations of human genes in the genome and chromosome band locations, I > wouldn't use bioperl. It sounds to me like you are interested in > getting the genes associated with each chromosomal band? If so, just > download the cytoband.txt and refFlat.txt files from the UCSC genome > browser site. cytoband.txt contains the base pair locations for each of > the cytobands. refFlat.txt contains the base pair locations of "refseq" > genes. It is then simply a matter of finding overlapping regions (genes > with cytobands) to determine which genes are in which cytobands. 
Since > the files are tab-delimited text, they are very easy to work with (in > perl, excel, python, ...). Don't get me wrong--I really appreciate the > power of bioperl, but in this case, your task lends itself to a simpler > (and MUCH) faster approach. > > Sean > > Thanks for the response Sean, > > getting the genes associated with each band is certainly part of what I need and your suggestion will help with that. I did look at the UCSC site but as you know there is such a volume of info on there I didn't really know which files I needed. > > However my main goal requires slightly more. What I want to be able to do is take the chromosomal band annotation info and compare that against the CGH data I have. From this I'd like to then be able say "OK band 8q13.1 (or whatever) is deleted, so make a copy of the genome with the actual sequence associated with that band removed." Then I could feed both sequences into metashark which predicts the structure of metabolic pathways based on genome annotation and see what effect deleting that region of DNA has on the structure of the metabolic network. Knowing which genes are involved is necessary for identifying what are the important components within the region. Are there tools in Bioperl for making this comparison? It can probably be reduced to a straight comparison of data structures so I may just use regular perl for this part unless there is anything designed for purpose. > > The thing I was struggling with was how to store and manipulate genomic sequence data in such quantities. Since this morning I've had a better look at the CGL library and associated datastore module, I think I can do it using these tools but I'm having a few dependency issues getting it installed right now. So I'll go back to wrestling with that. Ahh. I see. Metashark actually searches the remaining sequence in the human genome? 
If that is the case, then you need the start and end positions of the chromosomal bands, which you can download from the UCSC genome browser. Follow the links to download and then to the genome of your choice and finally get the cytoband.txt file. The other piece of the puzzle is the Bio::DB::Fasta module. It allows extremely fast access to a set of fasta files, which it first indexes. Here is the documentation for it: http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/DB/Fasta.html You could imagine making a hash keyed by chromosome band whose values are hashes holding the start and end of each band. For each CGH experiment, find those regions that are deleted. Exclude those when looping through all the chromosome bands, pulling the sequence using Bio::DB::Fasta for each band and writing that to a file for metashark. Sean From sdavis2 at mail.nih.gov Fri Jul 28 09:30:52 2006 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Fri, 28 Jul 2006 09:30:52 -0400 Subject: [Bioperl-l] Getting sequences by base pair locations In-Reply-To: <05a8f85bdc36dc7e370d0935b6394f9d@ucl.ac.uk> References: <05a8f85bdc36dc7e370d0935b6394f9d@ucl.ac.uk> Message-ID: <44CA118C.7010401@mail.nih.gov> Yuval Itan wrote: > Hello all, > > I was BLATing a few hundred human genes against the chimp genome, and > kept the best chimp hits for every human gene. > I have the base pair start and end location for every chimp hit, and I > need to get the sequence for each of these chimp hits. Here is an > example for a few chimp hits bp locations: > > Start End > 142854 144504 > 154479 155198 > 153066 167370 > 163146 163559 > > I have one chimp genome file (about 3GB) including all chromosomes, but > I could also get one file per chromosome if that would make things > easier. Does anyone have a script or a link for an interface that can do > the job?
See this module: http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/DB/Fasta.html Sean From osborne1 at optonline.net Fri Jul 28 09:35:02 2006 From: osborne1 at optonline.net (Brian Osborne) Date: Fri, 28 Jul 2006 09:35:02 -0400 Subject: [Bioperl-l] whole genome annotation In-Reply-To: Message-ID: Richard, A good starting point is a FAQ page we created that describes various ways of extracting genomic sequence: http://www.bioperl.org/wiki/Getting_Genomic_Sequences Check that out, and Sean's suggestion, and write back to bioperl-l if you have questions. One thing that this page doesn't really address is the special challenge that comes with working with very large sequences, this is something you might have to consider as well. You also asked about downloading the human genome and its annotations. There's also more than one way to do this as well. You'd have access to this data if you used the ENSEMBL API but you can get the Genbank files at ftp://ftp.ncbi.nih.gov/genomes/. Having said that I should add that one of the advantages of the ENSEMBL API approach is that you don't have to download the entire genome. Don't know what machine you're working on but, again, trying to manipulate very large sequences may tax your computer as well as your patience. Brian O. On 7/28/06 5:39 AM, "Richard Birnie" wrote: > Hello all, > > I'm just trying to familiarise myself with BioPerl and I'm a little > overwhelmed by the sheer volume of information available on the wiki. I'm > hoping someone can point in the right direction through the labyrinth. This > may become a little longwinded but I'll try and get all the annoying newbie > questions out of the way in one go. > > Let me try and explain what I'm aiming for. 
I have some CGH data downloaded > from the Progenetix database > (http://www.progenetix.de/~pgscripts/progenetix/Aboutprogenetix.html), this > data is simplified to record simply gain/loss/amplification of whole > chromosome bands at 862 band resolution to facilitate the combination of data > from multiple different studies. > > What I'd like to be able to do is download a copy of the human genome sequence > with annotation describing the locations of chromosome bands and preferably of > known genes. I then want to be able to manipulate the genome data based on the > CGH data to mimic deletions. The ultimate goal of this is to be able to feed > the manipulated genome data into a program (metashark) that predicts the > structure of metabolic networks based on genome annotation compared to a > reference genome, in this case a complete 'normal' human genome and see what > effect that has on the metabolic pathways. > > I appreciate that is a bit vague but thats sort of my problem, I'm not a > bioinformatician really so I'm not sue of the details of what I want. I just > happen to have an question to answer and bioperl seems the way to go (for this > project and more generally). I've started looking at the HOWTOs and read the > main bioperl tutorial. I also looked at the CGL comparative genomics library > but I haven't penetrated far into that yet. I'm ok with basic perl although > not much object oriented stuff. I don't really have much experience with > handling sequence data on a whole genome scale either, a few genbank files for > my favourite genes is fine but I need some guidance to work on this scale. > > What I'm looking for is someone to give me a start. I'd greatly appreciate it > if someone could spell out the general steps for downloading a complete copy > of the human genome and its annotations (if this is even a feasible approach) > and how to put it all together. 
Not actual code just the general concept for > each step and which tools from the bioperl set would be most appropriate for > each step so that I can focus what I need to read about, even a little > pseudo-code if I'm lucky. If I can get the genome data downloaded and setup > properly I'll work out how to apply the CGH data to it myself. > > If example code for what I'm trying to describe is included somewhere, great > could someone point to where. > > Thanks for your patience. > best regards, > Richard > > > > Dr Richard Birnie > Scientific Officer > Section of Pathology and Tumour Biology > Welcome Brenner Building, LIMM > St James University Hospital > Beckett St, Leeds, LS9 7TF > Tel:0113 3438624 > e-mail: r.birnie at leeds.ac.uk > > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From sdavis2 at mail.nih.gov Fri Jul 28 09:41:45 2006 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Fri, 28 Jul 2006 09:41:45 -0400 Subject: [Bioperl-l] Getting sequences by base pair locations In-Reply-To: <44CA0D88.3000404@sendu.me.uk> References: <05a8f85bdc36dc7e370d0935b6394f9d@ucl.ac.uk> <44CA0D88.3000404@sendu.me.uk> Message-ID: <44CA1419.3030100@mail.nih.gov> Sendu Bala wrote: > Yuval Itan wrote: > >>Hello all, >> >>I was BLATing a few hundred human genes against the chimp genome, and >>kept the best chimp hits for every human gene. >>I have the base pair start and end location for every chimp hit, and I >>need to get the sequence for each of these chimp hits. Here is an >>example for a few chimp hits bp locations: >> >>Start End* >>*142854 144504 >>154479 155198 >>153066 167370 >>163146 163559 >> >>I have one chimp genome file (about 3GB) including all chromosomes, but >>I could also get one file per chromosome if that would make things >>easier. Does anyone have a script or a link for an interface that can do >>the job? 
> > > If your genome file is in some standard format, use SeqIO. > http://www.bioperl.org/wiki/HOWTO:SeqIO > > And then get the sequence corresponding to the correct chromosome and > get the desired chunk with subseq(); > http://www.bioperl.org/wiki/HOWTO:Beginners#The_Sequence_Object My guess is that Yuval will need random access to the sequences. With seqIO, this is possible with a relatively large amount of memory, but Bio::DB::Fasta might be the better bet. Alternatively, make a custom track (see the documentation for doing so at the UCSC genome browser site), upload it, and then getting the DNA is trivial with just a couple of mouseclicks. This method also has the advantage of being able to do things like viewing the data in genome coordinates and allows the possibility of doing interections with known chimp genes so you could find hits that don't overlap known chimp genes, for example. Sean From valiente at lsi.upc.edu Fri Jul 28 09:53:10 2006 From: valiente at lsi.upc.edu (Gabriel Valiente) Date: Fri, 28 Jul 2006 16:53:10 +0300 Subject: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields) In-Reply-To: <001301c6b24b$da38ba80$15327e82@pyrimidine> References: <001301c6b24b$da38ba80$15327e82@pyrimidine> Message-ID: <1D2EF357-E8DE-4F15-824A-9C9359389685@lsi.upc.edu> > Would be nice to know how you use Bio::Taxonomy. You are the first > here who > seems to have a use for it. I'm using it to obtain a reference taxonomy for a set of organisms, against which to assess a phylogeny obtained by the usual protocol (fetch rRNA sequences, align them, obtain a distance matrix, cluster). 
Roughly:

    use Bio::DB::Taxonomy;

    my $nodesfile = "nodes.dmp";
    my $namesfile = "names.dmp";
    my $db = new Bio::DB::Taxonomy(-source    => 'flatfile',
                                   -directory => "./db/",
                                   -nodesfile => $nodesfile,
                                   -namesfile => $namesfile);

    my @species = (...);
    for my $ncbi_name (@species) {
        my $ncbi_id = $db->get_taxonid($ncbi_name);
        my $node    = $db->get_Taxonomy_Node(-taxonid => $ncbi_id);
        my @lineage = get_lineage_nodes($node);
        # ...
    }

Here, get_lineage_nodes could be added as a method to Bio::Taxonomy::Node or equivalent:

    sub get_lineage_nodes {
        my $node = shift;
        my @lineage;
        while ($node->node_name ne "root") {
            $node = $node->get_Parent_Node;
            unshift @lineage, $node;
        }
        return @lineage;
    }

I've also written a method to merge the full lineages of a set of Bio::Taxonomy::Node objects into a Bio::Tree::Tree object. I'd be glad to contribute it as well, but I'm not sure where it would fit.

> As for branch lengths, I think you're confusing
> 'taxonomy' (classification
> of organisms based on just about anything) with
> 'phylogeny' (evolutionary
> relatedness). Note in the Wikipedia article below the use of the term
> 'phylogenetic taxonomy', which is the classification of organisms
> based on
> evolutionary relationships.
>
> http://en.wikipedia.org/wiki/Taxonomy
>
> http://en.wikipedia.org/wiki/Phylogeny
>
> NCBI has a disclaimer about the Taxonomy database that is related
> to this:
>
> http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=howcite
>
> There are HOWTOs on tree manipulation, population genetics, and
> PAML on the
> wiki which might be a good start for Bioperl phylogenetic methods:
>
> http://www.bioperl.org/wiki/HOWTO:Trees
>
> http://www.bioperl.org/wiki/HOWTO:PAML
>
> http://www.bioperl.org/wiki/HOWTO:PopGen

Thanks a lot. Let me check it and get back to the discussion later on.
Gabriel > Chris > >> -----Original Message----- >> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- >> bounces at lists.open-bio.org] On Behalf Of Gabriel Valiente >> Sent: Friday, July 28, 2006 7:10 AM >> To: bioperl-l at lists.open-bio.org >> Subject: Re: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields) >> >>>>> At the moment it seems to me that the Bio::Taxonomy modules >>>>> (excluding >>>>> Node) aren't really usable. >> >> I've been using Bio::Taxonomy and Bio::DB::Taxonomy a lot, they are >> very useful modules. Whatever Bio::Taxon or Bio::Taxonomy::Taxon >> turns out to be, please do keep the Bio::DB::Taxonomy functionality. >> >> BTW, does anybody know how to include branch lengths in >> Bio::DB::Taxonomy? >> >> Thanks a lot, >> >> Gabriel >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l From R.Birnie at leeds.ac.uk Fri Jul 28 09:56:15 2006 From: R.Birnie at leeds.ac.uk (Richard Birnie) Date: Fri, 28 Jul 2006 14:56:15 +0100 Subject: [Bioperl-l] whole genome annotation References: Message-ID: Thanks folks, That should be enough to get me going. At least I can see the wood for the trees now. Richard Dr Richard Birnie Scientific Officer Section of Pathology and Tumour Biology Welcome Brenner Building, LIMM St James University Hospital Beckett St, Leeds, LS9 7TF Tel:0113 3438624 e-mail: r.birnie at leeds.ac.uk -----Original Message----- From: Brian Osborne [mailto:osborne1 at optonline.net] Sent: Fri 7/28/2006 14:35 To: Richard Birnie; bioperl-l at lists.open-bio.org Subject: Re: [Bioperl-l] whole genome annotation Richard, A good starting point is a FAQ page we created that describes various ways of extracting genomic sequence: http://www.bioperl.org/wiki/Getting_Genomic_Sequences Check that out, and Sean's suggestion, and write back to bioperl-l if you have questions. 
One thing that this page doesn't really address is the special challenge that comes with working with very large sequences, this is something you might have to consider as well. You also asked about downloading the human genome and its annotations. There's also more than one way to do this as well. You'd have access to this data if you used the ENSEMBL API but you can get the Genbank files at ftp://ftp.ncbi.nih.gov/genomes/. Having said that I should add that one of the advantages of the ENSEMBL API approach is that you don't have to download the entire genome. Don't know what machine you're working on but, again, trying to manipulate very large sequences may tax your computer as well as your patience. Brian O. On 7/28/06 5:39 AM, "Richard Birnie" wrote: > Hello all, > > I'm just trying to familiarise myself with BioPerl and I'm a little > overwhelmed by the sheer volume of information available on the wiki. I'm > hoping someone can point in the right direction through the labyrinth. This > may become a little longwinded but I'll try and get all the annoying newbie > questions out of the way in one go. > > Let me try and explain what I'm aiming for. I have some CGH data downloaded > from the Progenetix database > (http://www.progenetix.de/~pgscripts/progenetix/Aboutprogenetix.html), this > data is simplified to record simply gain/loss/amplification of whole > chromosome bands at 862 band resolution to facilitate the combination of data > from multiple different studies. > > What I'd like to be able to do is download a copy of the human genome sequence > with annotation describing the locations of chromosome bands and preferably of > known genes. I then want to be able to manipulate the genome data based on the > CGH data to mimic deletions. 
The ultimate goal of this is to be able to feed > the manipulated genome data into a program (metashark) that predicts the > structure of metabolic networks based on genome annotation compared to a > reference genome, in this case a complete 'normal' human genome and see what > effect that has on the metabolic pathways. > > I appreciate that is a bit vague but thats sort of my problem, I'm not a > bioinformatician really so I'm not sue of the details of what I want. I just > happen to have an question to answer and bioperl seems the way to go (for this > project and more generally). I've started looking at the HOWTOs and read the > main bioperl tutorial. I also looked at the CGL comparative genomics library > but I haven't penetrated far into that yet. I'm ok with basic perl although > not much object oriented stuff. I don't really have much experience with > handling sequence data on a whole genome scale either, a few genbank files for > my favourite genes is fine but I need some guidance to work on this scale. > > What I'm looking for is someone to give me a start. I'd greatly appreciate it > if someone could spell out the general steps for downloading a complete copy > of the human genome and its annotations (if this is even a feasible approach) > and how to put it all together. Not actual code just the general concept for > each step and which tools from the bioperl set would be most appropriate for > each step so that I can focus what I need to read about, even a little > pseudo-code if I'm lucky. If I can get the genome data downloaded and setup > properly I'll work out how to apply the CGH data to it myself. > > If example code for what I'm trying to describe is included somewhere, great > could someone point to where. > > Thanks for your patience. 
> best regards, > Richard > > > > Dr Richard Birnie > Scientific Officer > Section of Pathology and Tumour Biology > Welcome Brenner Building, LIMM > St James University Hospital > Beckett St, Leeds, LS9 7TF > Tel:0113 3438624 > e-mail: r.birnie at leeds.ac.uk > > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Fri Jul 28 09:43:47 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Fri, 28 Jul 2006 08:43:47 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields) In-Reply-To: Message-ID: <001301c6b24b$da38ba80$15327e82@pyrimidine> Now I get personal email? Yikes! Sendu has indicated that Bio::DB::Taxonomy will stay essentially unchanged. If anything changes, it >may< be the class used to hold the Node information. Would be nice to know how you use Bio::Taxonomy. You are the first here who seems to have a use for it. As for branch lengths, I think you're confusing 'taxonomy' (classification of organisms based on just about anything) with 'phylogeny' (evolutionary relatedness). Note in the Wikipedia article below the use of the term 'phylogenetic taxonomy', which is the classification of organisms based on evolutionary relationships. 
http://en.wikipedia.org/wiki/Taxonomy http://en.wikipedia.org/wiki/Phylogeny NCBI has a disclaimer about the Taxonomy database that is related to this: http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=howcite There are HOWTOs on tree manipulation, population genetics, and PAML on the wiki which might be a good start for Bioperl phylogenetic methods: http://www.bioperl.org/wiki/HOWTO:Trees http://www.bioperl.org/wiki/HOWTO:PAML http://www.bioperl.org/wiki/HOWTO:PopGen Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Gabriel Valiente > Sent: Friday, July 28, 2006 7:10 AM > To: bioperl-l at lists.open-bio.org > Subject: Re: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields) > > >>> At the moment it seems to me that the Bio::Taxonomy modules > >>> (excluding > >>> Node) aren't really usable. > > I've been using Bio::Taxonomy and Bio::DB::Taxonomy a lot, they are > very useful modules. Whatever Bio::Taxon or Bio::Taxonomy::Taxon > turns out to be, please do keep the Bio::DB::Taxonomy functionality. > > BTW, does anybody know how to include branch lengths in > Bio::DB::Taxonomy? > > Thanks a lot, > > Gabriel > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Fri Jul 28 10:15:38 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Fri, 28 Jul 2006 09:15:38 -0500 Subject: [Bioperl-l] Getting sequences by base pair locations In-Reply-To: <44CA118C.7010401@mail.nih.gov> Message-ID: <001401c6b250$4e3c2490$15327e82@pyrimidine> Yuval, You can also do this remotely if the file you want is in GenBank (and you don't want to store the data locally). The nice thing about this approach is that any seqfeatures in the GenBank file within the requested region are also returned.
Note that if data is stored in a RefSeq file you'll need to add the parameter '-no_redirect => 1,' to the Bio::DB::GenBank object. I would NOT recommend this for huge numbers of sequences (>2000) as you would be spamming NCBI with thousands of repeated requests; if you did have a relatively large number you could run this overnight, which is what I do. Bio::DB::Fasta would be better if you have tons of hits.

Use this in a loop to grab the sequences one at a time based on the start and stop positions (and strand, if you need it):

    # Below is from Bio::DB::GenBank POD, with some modifications.
    # Meant to run inside a loop that sets $start, $end, $strand and $sf.
    my $factory = Bio::DB::GenBank->new(
        -seq_start => $start,
        -seq_stop  => $end,
        -strand    => $strand    # 1=plus, 2=minus
    );
    my $seq_obj;
    eval { $seq_obj = $factory->get_Seq_by_acc($sf->seq_id); };
    if ($@) {
        print STDERR "Unable to retrieve from $start to $end.\n";
        print STDERR "Error!\n$@";
        print STDERR "Attempting to move on...\n";
        next;
    }
    print STDERR "Got sequence: ", $seq_obj->description, "\n";
    print STDERR "\tLength: ", $seq_obj->length, "\n";
    my $sf_len = $sf->length;

The eval{} block keeps a failed retrieval (for example a network error) from ending the run: the object throws an exception, eval catches it, the error is logged to STDERR, and the loop moves on to the next hit.

Chris

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of Sean Davis
> Sent: Friday, July 28, 2006 8:31 AM
> To: Yuval Itan
> Cc: bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] Getting sequences by base pair locations
>
> Yuval Itan wrote:
> > Hello all,
> >
> > I was BLATing a few hundred human genes against the chimp genome, and
> > kept the best chimp hits for every human gene.
> > I have the base pair start and end location for every chimp hit, and I
> > need to get the sequence for each of these chimp hits.
Here is an > > example for a few chimp hits bp locations: > > > > Start End* > > *142854 144504 > > 154479 155198 > > 153066 167370 > > 163146 163559 > > > > I have one chimp genome file (about 3GB) including all chromosomes, but > > I could also get one file per chromosome if that would make things > > easier. Does anyone have a script or a link for an interface that can do > > the job? > > See this module: > > http://doc.bioperl.org/releases/bioperl-current/bioperl- > live/Bio/DB/Fasta.html > > Sean > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Fri Jul 28 10:35:21 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Fri, 28 Jul 2006 09:35:21 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields) In-Reply-To: <1D2EF357-E8DE-4F15-824A-9C9359389685@lsi.upc.edu> Message-ID: <001501c6b253$0fed08a0$15327e82@pyrimidine> > use Bio::DB::Taxonomy; > I've also written a method to merge the full lineages of a set of > Bio::Taxonomy::Node object into a Bio::Tree::Tree object. I'd be glad > to contribute it as well, but I'm not sure where it would fit. Ah, that would be great (I had mentioned something along these lines to do with BLAST reports). But does this actually use Bio::Taxonomy directly? Taxonomy::Node does not inherit methods from Bio::Taxonomy AFAIK. So, anything that Sendu does may not dramatically impact your code. Sendu? You might need to address some of this to Sendu. Big changes are afoot for Bio::Taxonomy and Bio::Taxonomy::Node. He's heading that up. Chris > ... > Thanks a lot. Let me check it and get back to the discussion later on. > > Gabriel > > > Chris > > ... 
From cjfields at uiuc.edu Fri Jul 28 10:37:09 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Fri, 28 Jul 2006 09:37:09 -0500 Subject: [Bioperl-l] Getting sequences by base pair locations In-Reply-To: <44CA1419.3030100@mail.nih.gov> Message-ID: <001601c6b253$4ec57170$15327e82@pyrimidine> ... > > If your genome file is in some standard format, use SeqIO. > > http://www.bioperl.org/wiki/HOWTO:SeqIO > > > > And then get the sequence corresponding to the correct chromosome and > > get the desired chunk with subseq(); > > http://www.bioperl.org/wiki/HOWTO:Beginners#The_Sequence_Object > > My guess is that Yuval will need random access to the sequences. With > seqIO, this is possible with a relatively large amount of memory, but > Bio::DB::Fasta might be the better bet. Agreed. This is one of the bioperl 'speed' issue areas: http://www.bioperl.org/wiki/Project_priority_list Bio::DB::Fasta returns a specialized PrimarySeq object which gets around the current speed issues with SeqIO. > Alternatively, make a custom track (see the documentation for doing so > at the UCSC genome browser site), upload it, and then getting the DNA is > trivial with just a couple of mouseclicks. This method also has the > advantage of being able to do things like viewing the data in genome > coordinates and allows the possibility of doing interections with known > chimp genes so you could find hits that don't overlap known chimp genes, > for example. > > Sean Would be nice to have a more automated and direct way of doing something along these lines within bioperl (with the obvious caveat of not spamming the server). You can currently retrieve chunks of sequence based on start, stop, strand from GenBank. Ah, one can dream... 
Chris From bix at sendu.me.uk Fri Jul 28 10:38:20 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Fri, 28 Jul 2006 15:38:20 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields) In-Reply-To: <1D2EF357-E8DE-4F15-824A-9C9359389685@lsi.upc.edu> References: <001301c6b24b$da38ba80$15327e82@pyrimidine> <1D2EF357-E8DE-4F15-824A-9C9359389685@lsi.upc.edu> Message-ID: <44CA215C.2070607@sendu.me.uk> Gabriel Valiente wrote: >> Would be nice to know how you use Bio::Taxonomy. You are the first >> here who >> seems to have a use for it. > > I'm using it to obtain a reference taxonomy for a set of organisms, > against which to assess a phylogeny obtained by the usual protocol > (fetch rRNA sequences, align them, obtain a distance matrix, > cluster). Roughly: > > use Bio::DB::Taxonomy; Ah, we were specifically wondering if you had used Bio/Taxonomy.pm, not Taxonomy modules in general. Again, DB::Taxonomy usage will be unaffected. > Here, get_lineage_nodes could be added as a method to > Bio::Taxonomy::Node or equivalent: > > sub get_lineage_nodes{ > my $node = shift; > my @lineage; > while ($node->node_name ne "root") { > $node = $node->get_Parent_Node; > unshift @lineage, $node; > } > return @lineage; > } I think you must have an older version of bioperl. Bio::Taxonomy::Node has a method get_Lineage_Nodes() which more or less does exactly that. > I've also written a method to merge the full lineages of a set of > Bio::Taxonomy::Node object into a Bio::Tree::Tree object. I'd be glad > to contribute it as well, but I'm not sure where it would fit. 
Post it and I'll see if it will fit anywhere :) From cuiw at ncbi.nlm.nih.gov Fri Jul 28 09:46:50 2006 From: cuiw at ncbi.nlm.nih.gov (Cui, Wenwu (NIH/NLM/NCBI) [C]) Date: Fri, 28 Jul 2006 09:46:50 -0400 Subject: [Bioperl-l] Getting sequences by base pair locations In-Reply-To: <05a8f85bdc36dc7e370d0935b6394f9d@ucl.ac.uk> Message-ID: <18C407FD4FFB424292D769FBD68C1987C7C254@NIHCESMLBX8.nih.gov> Maybe the easiest way is to use LWP to get the webpage. Here is an example for CHIMP1A:10:12345678:12348888: http://www.ensembl.org/Pan_troglodytes/exportview?format=fasta&l=10%3A12 345678-12348888&action=export&_format=Text&output=txt&submit=Continue+%3 E%3E Wenwu Cui ________________________________ From: Yuval Itan [mailto:y.itan at ucl.ac.uk] Sent: Friday, July 28, 2006 8:08 AM To: bioperl-l at lists.open-bio.org Subject: [Bioperl-l] Getting sequences by base pair locations Hello all, I was BLATing a few hundred human genes against the chimp genome, and kept the best chimp hits for every human gene. I have the base pair start and end location for every chimp hit, and I need to get the sequence for each of these chimp hits. Here is an example for a few chimp hits bp locations: Start End 142854 144504 154479 155198 153066 167370 163146 163559 I have one chimp genome file (about 3GB) including all chromosomes, but I could also get one file per chromosome if that would make things easier. Does anyone have a script or a link for an interface that can do the job? Thank you very much. 
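A minimal sketch of what "get the sequence for each start/end pair" amounts to, in plain Perl with no BioPerl dependency. It assumes 1-based inclusive coordinates and that each hit also records which chromosome it is on (the table in the question does not show one). The subseq function, the chrToy sequence and the hit list are hypothetical, made up for illustration. For a 3 GB genome file an indexed reader such as Bio::DB::Fasta, suggested elsewhere in this thread, is the practical route; this only shows the coordinate arithmetic.

```perl
use strict;
use warnings;

# Return the bases from $start to $end (1-based, inclusive) of $seq.
# A negative $strand returns the reverse complement instead.
sub subseq {
    my ($seq, $start, $end, $strand) = @_;
    die "bad range $start..$end\n"
        if $start < 1 || $end < $start || $end > length $seq;
    # substr is 0-based, so shift the start by one.
    my $sub = substr($seq, $start - 1, $end - $start + 1);
    if (defined $strand && $strand < 0) {
        $sub = reverse $sub;               # scalar context: reverses the string
        $sub =~ tr/ACGTacgt/TGCAtgca/;     # complement each base
    }
    return $sub;
}

# Hypothetical toy data standing in for a chromosome and two hits on it.
my %genome = (chrToy => 'ACGTTGCAAGGCTTAA');
my @hits   = (['chrToy', 3, 8, 1], ['chrToy', 3, 8, -1]);

for my $hit (@hits) {
    my ($chr, $start, $end, $strand) = @$hit;
    print join("\t", $chr, $start, $end, $strand,
               subseq($genome{$chr}, $start, $end, $strand)), "\n";
}
```

Keeping the chromosome name with every hit matters: start and end alone are ambiguous once the genome has more than one sequence.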
From valiente at lsi.upc.edu Fri Jul 28 10:49:28 2006 From: valiente at lsi.upc.edu (Gabriel Valiente) Date: Fri, 28 Jul 2006 17:49:28 +0300 Subject: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields) In-Reply-To: <001501c6b253$0fed08a0$15327e82@pyrimidine> References: <001501c6b253$0fed08a0$15327e82@pyrimidine> Message-ID: <5563CD94-DC99-46A3-A56A-485D4A4D3031@lsi.upc.edu> >> use Bio::DB::Taxonomy; > > > >> I've also written a method to merge the full lineages of a set of >> Bio::Taxonomy::Node object into a Bio::Tree::Tree object. I'd be glad >> to contribute it as well, but I'm not sure where it would fit. > > Ah, that would be great (I had mentioned something along these > lines to do > with BLAST reports). But does this actually use Bio::Taxonomy > directly? > Taxonomy::Node does not inherit methods from Bio::Taxonomy AFAIK. So, > anything that Sendu does may not dramatically impact your code. > Sendu? It is a general algorithm I devised that takes a set of paths and builds up a tree. The input paths are full lineages coming from Bio::DB::Taxonomy, the output tree is a Bio::Tree::Tree. This is why I said I don't know exactly where it would belong, it looks to me more like a standalone script than a Bio::Taxonomy or Bio::Tree method. Gabriel > You might need to address some of this to Sendu. Big changes are > afoot for > Bio::Taxonomy and Bio::Taxonomy::Node. He's heading that up. > > Chris > >> ... >> Thanks a lot. Let me check it and get back to the discussion later >> on. >> >> Gabriel >> >>> Chris >>> > ... 
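The idea described in this message (take a set of root-to-leaf paths, merge them into one tree) can be sketched in plain Perl with nested hashes standing in for Bio::Tree objects. This is not Gabriel's script: merge_path and to_newick here are hypothetical helpers, and the lineages are abbreviated toy data rather than output from Bio::DB::Taxonomy. Names use underscores so they need no quoting in Newick.

```perl
use strict;
use warnings;

# Merge one root-to-leaf path (array ref of names) into a nested-hash tree.
# Each key is a child name; its value is the subtree below that child.
sub merge_path {
    my ($tree, $path) = @_;
    my $node = $tree;
    for my $name (@$path) {
        $node->{$name} ||= {};    # reuse an existing child or create it
        $node = $node->{$name};
    }
    return $tree;
}

# Serialise a subtree as Newick text, children sorted for stable output.
sub to_newick {
    my ($name, $subtree) = @_;
    my @kids = sort keys %$subtree;
    return $name unless @kids;
    return '(' . join(',', map { to_newick($_, $subtree->{$_}) } @kids) . ')' . $name;
}

# Toy lineages (assumed for illustration, root omitted).
my @lineages = (
    [qw(Hominidae Pongo Pongo_pygmaeus)],
    [qw(Hominidae Homininae Gorilla Gorilla_gorilla)],
    [qw(Hominidae Homininae Pan Pan_troglodytes)],
    [qw(Hominidae Homininae Homo Homo_sapiens)],
);

my %tree;
merge_path(\%tree, $_) for @lineages;

my $newick = to_newick('root', \%tree) . ';';
print "$newick\n";
```

Shared prefixes collapse automatically because a child hash is only created the first time its name is seen, which is the same effect a recursive merge over tree nodes would have.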
From sdavis2 at mail.nih.gov Fri Jul 28 11:21:09 2006 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Fri, 28 Jul 2006 11:21:09 -0400 Subject: [Bioperl-l] Getting sequences by base pair locations In-Reply-To: <001601c6b253$4ec57170$15327e82@pyrimidine> References: <001601c6b253$4ec57170$15327e82@pyrimidine> Message-ID: <44CA2B65.8070906@mail.nih.gov> Chris Fields wrote: > Would be nice to have a more automated and direct way of doing something > along these lines within bioperl (with the obvious caveat of not spamming > the server). You can currently retrieve chunks of sequence based on start, > stop, strand from GenBank. The ENSembl API has some features that can be useful for these types of things. I, personally, have a mirror of the UCSC mysql database (very easy to do with just rsync and mysql) and try to turn questions like these into SQL queries. That, combined with Bio::DB::Fasta, can make a useful automated pipeline for getting arbitrary sequences associated with genomic locations meeting specific criteria. It is much faster than anything one can do over the web and doesn't have access limitations. Sean From cjfields at uiuc.edu Fri Jul 28 11:27:17 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Fri, 28 Jul 2006 10:27:17 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields) In-Reply-To: <5563CD94-DC99-46A3-A56A-485D4A4D3031@lsi.upc.edu> Message-ID: <000001c6b25a$4f9392b0$15327e82@pyrimidine> > It is a general algorithm I devised that takes a set of paths and > builds up a tree. The input paths are full lineages coming from > Bio::DB::Taxonomy, the output tree is a Bio::Tree::Tree. This is why > I said I don't know exactly where it would belong, it looks to me > more like a standalone script than a Bio::Taxonomy or Bio::Tree method. > > Gabriel Agreed. 
You could submit the script as an example here if it is short, or via Bugzilla as an enhancement request: http://bugzilla.open-bio.org/ It could be added to the scripts\ or examples\ directory in bioperl-core. Chris From valiente at lsi.upc.edu Fri Jul 28 12:35:20 2006 From: valiente at lsi.upc.edu (Gabriel Valiente) Date: Fri, 28 Jul 2006 19:35:20 +0300 Subject: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields) In-Reply-To: <000001c6b25a$4f9392b0$15327e82@pyrimidine> References: <000001c6b25a$4f9392b0$15327e82@pyrimidine> Message-ID: <3DB992C6-DF16-42B9-8C36-F3B5C8CCBDE7@lsi.upc.edu> >> It is a general algorithm I devised that takes a set of paths and >> builds up a tree. The input paths are full lineages coming from >> Bio::DB::Taxonomy, the output tree is a Bio::Tree::Tree. This is why >> I said I don't know exactly where it would belong, it looks to me >> more like a standalone script than a Bio::Taxonomy or Bio::Tree >> method. >> >> Gabriel > > Agreed. You could submit the script as an example here if it is > short, or > via Bugzilla as an enhancement request: > > http://bugzilla.open-bio.org/ > > It could be added to the scripts\ or examples\ directory in bioperl- > core. Here it is. Please check it and include for instance as taxonomy2tree.PLS in the scripts/tree or scripts/taxonomy directory. Disclaimer: I'm also publishing part of this code in a conference paper. The script is already fully functional but anyway, I have a couple of improvements in mind. The minor one is provision for cmdline input. How would you like to input an array of names? 
The other one is to remove internal node labels and contract elementary paths, for instance reducing the tree: (((((((((((((((((((((((((((("Pongo pygmaeus")Pongo,(("Gorilla gorilla")Gorilla,("Pan troglodytes")Pan,("Homo sapiens")Homo)"Homo/ Pan/Gorilla group")Hominidae)Hominoidea)Catarrhini)Simiiformes) Primates)Euarchontoglires)Eutheria)Theria)Mammalia)Amniota)Tetrapoda) Sarcopterygii)Euteleostomi)Teleostomi)"Gnathostomata ") Vertebrata)"Craniata ")Chordata)Deuterostomia)Coelomata) Bilateria)Eumetazoa)Metazoa)"Fungi/Metazoa group")Eukaryota)"cellular organisms")root; to the tree: ("Pongo pygmaeus",("Gorilla gorilla","Pan troglodytes","Homo sapiens")); It is certainly easy to remove all internal node labels. On the other hand, I've been working on contraction of elementary paths for quite a while, but always got stuck with internals of the Bio::Tree methods to remove nodes. Thus, so far the only working code I have consists of removing elementary branches while making a deep copy of the tree, which certainly is not quite elegant... 
Thanks a lot,

Gabriel

#!/usr/bin/perl -w

# Author: Gabriel Valiente
# Purpose: Bio::DB::Taxonomy -> Bio::Tree::Tree

use strict;
use Bio::DB::Taxonomy;
use Bio::Tree::Tree;
use Bio::Tree::Node;
use Bio::TreeIO;

my $nodesfile = "nodes.dmp";
my $namesfile = "names.dmp";

my $db = new Bio::DB::Taxonomy(-source    => 'flatfile',
                               -directory => "./db/",
                               -nodesfile => $nodesfile,
                               -namesfile => $namesfile);

# the input to the script is an array of species names

my @species = ('Orangutan', 'Gorilla', 'Chimpanzee', 'Human');

my $root = new Bio::Tree::Node(-id => "root");
my $tree = new Bio::Tree::Tree(-root => $root);

# the full lineages of the species are merged into a tree

for my $name (@species) {
    my $ncbi_id = $db->get_taxonid($name);
    if ($ncbi_id) {
        my $node = $db->get_Taxonomy_Node(-taxonid => $ncbi_id);
        my @lineage = get_lineage_nodes($node);
        shift @lineage; # discard root
        push @lineage, $node;
        merge_path($root, \@lineage);
    } else {
        warn "no NCBI Taxonomy node for species ",$name,"\n";
    }
}

# the tree is output in Newick format

my $output = new Bio::TreeIO(-format => 'newick');
$output->write_tree($tree);

# the actual merging of full lineages is performed by a recursive method

sub merge_path {
    my $root = shift;
    my $path = shift;
    my @path = @{$path};
    if (@path) {
        my $top = shift @path;
        my @children = grep { $_->id eq $top->node_name } $root->each_Descendent;
        if (@children) { # $root has a $child with id eq $top name
            my $child = shift @children;
            merge_path($child,\@path);
        } else { # add $top and @path below $root
            my $node = $root;
            unshift @path, $top;
            while (@path) {
                my $top = shift @path;
                my $name = $top->node_name;
                my $child = new Bio::Tree::Node(-id => "$name");
                $node->add_Descendent($child);
                $node = $child;
            }
        }
    }
}

# the full lineage of a species is recovered by traversing the taxonomy

sub get_lineage_nodes{
    my $node = shift;
    my @lineage;
    while ($node->node_name ne "root") {
        $node = $node->get_Parent_Node;
        unshift @lineage, $node;
    }
    return @lineage;
}

=head1 NAME

taxonomy2tree - builds a taxonomic tree based on
the full lineages of a set of species names

=head1 DESCRIPTION

This script requires that the bioperl-run pkg be also installed.

Providing the nodes.dmp and names.dmp files from the NCBI Taxonomy
dump (see Bio::DB::Taxonomy::flatfile for more info) is only necessary
on the first time running. This will create the local indexes and may
take quite a long time. However once created, these indexes will allow
fast access for species to taxon id OR taxon id to species name lookups.

=cut

From MEC at stowers-institute.org Fri Jul 28 12:44:43 2006
From: MEC at stowers-institute.org (Cook, Malcolm)
Date: Fri, 28 Jul 2006 11:44:43 -0500
Subject: [Bioperl-l] Getting sequences by base pair locations
Message-ID:

There are many options.

But, it looks like you only have start and end coordinates! How do you
know which chromosome/contig the hit was on?

Assuming you have this, if you did the blat with a local copy of the blat
program and the genome, then in addition to the blat command, you have the
twoBitToFa command which can extract the hits from the blat index (see
http://genome.ucsc.edu/goldenPath/help/blatSpec.html )

Or did you do the blat at ucsc?

Malcolm Cook
Database Applications Manager, Bioinformatics
Stowers Institute for Medical Research

oh - I replied similarly in the Bio BB forum, but it is held for moderation
so am replying here as well

________________________________

From: bioperl-l-bounces at lists.open-bio.org
[mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Yuval Itan
Sent: Friday, July 28, 2006 7:08 AM
To: bioperl-l at lists.open-bio.org
Subject: [Bioperl-l] Getting sequences by base pair locations

Hello all,

I was BLATing a few hundred human genes against the chimp genome, and
kept the best chimp hits for every human gene.
I have the base pair start and end location for every chimp hit, and I
need to get the sequence for each of these chimp hits.
Here is an example for a few chimp hits bp locations: Start End 142854 144504 154479 155198 153066 167370 163146 163559 I have one chimp genome file (about 3GB) including all chromosomes, but I could also get one file per chromosome if that would make things easier. Does anyone have a script or a link for an interface that can do the job? Thank you very much. From osborne1 at optonline.net Fri Jul 28 13:25:12 2006 From: osborne1 at optonline.net (Brian Osborne) Date: Fri, 28 Jul 2006 13:25:12 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields) In-Reply-To: <3DB992C6-DF16-42B9-8C36-F3B5C8CCBDE7@lsi.upc.edu> Message-ID: Gabriel, It looks like most of the Bioperl scripts use Getopt::Long. It's documentation says, in part: Options can take multiple values at once, for example --coordinates 52.2 16.4 --rgbcolor 255 255 149 This can be accomplished by adding a repeat specifier to the option specification. Repeat specifiers are very similar to the {...} repeat specifiers that can be used with regular expression patterns. For example, the above command line would be handled as follows: GetOptions('coordinates=f{2}' => \@coor, 'rgbcolor=i{3}' => \@color); So the arguments are space-delimited on the command line. Is the problem that the names can be binomial? Brian O. On 7/28/06 12:35 PM, "Gabriel Valiente" wrote: > The minor one is provision for cmdline input. > How would you like to input an array of names? From golharam at umdnj.edu Fri Jul 28 14:03:39 2006 From: golharam at umdnj.edu (Ryan Golhar) Date: Fri, 28 Jul 2006 14:03:39 -0400 Subject: [Bioperl-l] Bio::Align::DNAStatistics module has errors? Message-ID: <01a701c6b270$28232130$2f01a8c0@GOLHARMOBILE1> This is from the description: This object contains routines for calculating various statistics and distances for DNA alignments. The routines are not well tested and do contain errors at this point. Work is underway to correct them, but do not expect this code to give you the right answer currently! 
Use dnadist/distmat in the PHYLIP or EMBOSS packages to calculate the
distances.

Any idea what the errors are and what is/is not usable?

From lzhtom at hotmail.com Fri Jul 28 22:00:23 2006
From: lzhtom at hotmail.com (zhihua li)
Date: Sat, 29 Jul 2006 02:00:23 +0000
Subject: [Bioperl-l] how to get annotations (especially ensembl IDs) for a list of genes?
Message-ID:

Hi all,

I have a list of like 300 genes (actually their refseq IDs). Now I wanna
get more information (annotations) for each of the genes. Specifically, I
want a mapping of the refseq IDs to Ensembl gene IDs.

I know how to do it through a web page. But I'm wondering if I can also do
it via bioperl, by using some modules or packages. Can anyone help me out
here?

Thanks a lot!

From jason.stajich at duke.edu Sat Jul 29 01:18:50 2006
From: jason.stajich at duke.edu (Jason Stajich)
Date: Fri, 28 Jul 2006 22:18:50 -0700
Subject: [Bioperl-l] Bio::Align::DNAStatistics module has errors?
Message-ID:

I think that msg was CYA by me at some point - I am pretty sure I made
tests based on numbers from PHYLIP and EMBOSS but was hoping for someone
else to help. At this point I have no reliable time to really work on it,
but I hope someone who is interested in it will give it a whirl.

There may be some boundary cases that don't work where seqs are too short
or have a zero number of a particular nt but in general the nums should
jive. I am not sure about all the NG Ks and Ka as I didn't write those but
I believe Richard vetted them pretty well first. There are a couple of
methods not implemented too - am always hopeful other people will see this
as a great starting point and roll up their sleeves to join in...

-jason
--
Jason Stajich
Duke University
http://www.duke.edu/~jes12

From bix at sendu.me.uk Sat Jul 29 03:25:38 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Sat, 29 Jul 2006 08:25:38 +0100
Subject: [Bioperl-l] how to get annotations (especially ensembl IDs) for a list of genes?
In-Reply-To: References: Message-ID: <44CB0D72.20104@sendu.me.uk> zhihua li wrote: > Hi all, > > I have a list of like 300 genes (actually their refseq IDs). Now I > wanna get more information (annotations) for each of the genes. > Speficially, I want a mapping of the refseq IDs to Ensembl gene IDs. > > I know how to do it through a web page. But I'm wondering if I can also > do it via bioperl One possible way is to use the Ensembl perl API: http://www.ensembl.org/info/software/core/core_tutorial.html You'd get a gene or transcript adapator and use fetch_all_by_external_name() iirc. I'm aware that not every entrez id can be mapped that way, but perhaps most if not all refseqs will work. From bix at sendu.me.uk Sat Jul 29 03:54:52 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Sat, 29 Jul 2006 08:54:52 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields) In-Reply-To: <001301c6b24b$da38ba80$15327e82@pyrimidine> References: <001301c6b24b$da38ba80$15327e82@pyrimidine> Message-ID: <44CB144C.6050507@sendu.me.uk> Chris Fields wrote: > > As for branch lengths, I think you're confusing 'taxonomy' (classification > of organisms based on just about anything) with 'phylogeny' (evolutionary > relatedness). Note in the Wikipedia article below the use of the term > 'phylogenetic taxonomy', which is the classification of organisms based on > evolutionary relationships. > > http://en.wikipedia.org/wiki/Taxonomy > > http://en.wikipedia.org/wiki/Phylogeny Indeed. The two can be considered closely intertwined - if you were making a phylogeny you might hang it on a taxonomy. At any rate, you need to know a bunch of evolutionarily related species names before you start work, and Bio::Taxonomy::Node has been as good a place as any to get that. 
> There are HOWTOs on tree manipulation, population genetics, and PAML on the
> wiki which might be a good start for Bioperl phylogenetic methods:
>
> http://www.bioperl.org/wiki/HOWTO:Trees

Which is why the Trees HOWTO talks about taxa, and a number of the
Taxonomy modules have phylogenetic methods like get_lca. (And why there
is Bio::Taxonomy::Taxon and Tree.)

I suppose this is another reason to make Bio::Taxonomy::Node (ne
Bio::Taxon) implement Bio::Tree::NodeI.

(for these reasons I don't think Gabriel's method is best suited to being
a script - it's something you might do all the time, as a matter of
course. If Bio::Taxon was a Bio::Tree::NodeI you would just do
my $tree = new Bio::Tree::Tree(-root => $bio_taxon); and blamo, instant
phylogenetic taxonomy)

From cjfields at uiuc.edu Sat Jul 29 07:49:29 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Sat, 29 Jul 2006 06:49:29 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields)
In-Reply-To: <44CB144C.6050507@sendu.me.uk>
References: <001301c6b24b$da38ba80$15327e82@pyrimidine> <44CB144C.6050507@sendu.me.uk>
Message-ID: <5C765CAF-4976-45D1-BDA4-1E432081FF4D@uiuc.edu>

On Jul 29, 2006, at 2:54 AM, Sendu Bala wrote:

> Chris Fields wrote:
>>
>> As for branch lengths, I think you're confusing 'taxonomy'
>> (classification of organisms based on just about anything) with
>> 'phylogeny' (evolutionary relatedness). Note in the Wikipedia article
>> below the use of the term 'phylogenetic taxonomy', which is the
>> classification of organisms based on evolutionary relationships.
>>
>> http://en.wikipedia.org/wiki/Taxonomy
>>
>> http://en.wikipedia.org/wiki/Phylogeny
>
> Indeed. The two can be considered closely intertwined - if you were
> making a phylogeny you might hang it on a taxonomy. At any rate, you
> need to know a bunch of evolutionarily related species names before you
> start work, and Bio::Taxonomy::Node has been as good a place as any to
> get that.
Intertwined, yes, but not exactly the same. Hence the NCBI disclaimer I mentioned: How to reference the NCBI taxonomy database The NCBI taxonomy database is not a primary source for taxonomic or phylogenetic information. Furthermore, the database does not follow a single taxonomic treatise but rather attempts to incorporate phylogenetic and taxonomic knowledge from a variety of sources, including the published literature, web-based databases, and the advice of sequence submitters and outside taxonomy experts. Consequently, the NCBI taxonomy database is not a phylogenetic or taxonomic authority and should not be cited as such. >> There are HOWTOs on tree manipulation, population genetics, and >> PAML on the >> wiki which might be a good start for Bioperl phylogenetic methods: >> >> http://www.bioperl.org/wiki/HOWTO:Trees > > Which is why the Trees HOWTO talks about taxa, and a number of the > Taxonomy modules have phylogenetic methods like get_lca. (And why > there > is Bio::Taxonomy::Taxon and Tree.) Are we still thinking about deprecating those? I have seen very little mention of those modules from the mail list archives, and Jason mentioned that Bio::Taxonomy::Taxon hasn't been modified in a long time. > I suppose this is another reason to make Bio::Taxonomy::Node (ne > Bio::Taxon) implement Bio::Tree::NodeI. > > (for these reasons I don't think Gabriel's method isn't best > appropriate > as a script - it's something you might do all the time, as a matter of > course. If Bio::Taxon wasa Bio::Tree::NodeI you would just do my > $tree = > new Bio::Tree::Tree(-root => $bio_taxon); and blamo, instant > phylogenetic taxonomy) Brian already deposited the script (see bioperl-guts). You could use it for the methods, of course noting Gabriel's contribution. Sounds like a good plan to me ; > Christopher Fields Postdoctoral Researcher Lab of Dr. 
Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign

From nabil at broad.mit.edu Sun Jul 30 00:28:00 2006
From: nabil at broad.mit.edu (Nabil Hafez)
Date: Sun, 30 Jul 2006 00:28:00 -0400
Subject: [Bioperl-l] Strangeness in parsing blast file
Message-ID: <44CC3550.5070105@broad.mit.edu>

Hi,

I am having a somewhat similar problem to what was posted in
http://bioperl.org/pipermail/bioperl-l/2006-May/021416.html
however, I have read through all of that thread and I don't believe what I
am experiencing is the exact same problem. I also realize that the Bioperl
version 1.5.1 fixes a problem with blast parsing.

My problem: My blastresults file parses fine and everything works
swimmingly if I pass the blast output file by name such as

$blast_result = 'test.blastout';

however when I do

$blast_result = &do_blast($sample_fasta);

even though in both cases $blast_result evaluates to "test.blastout", the
parsing doesn't work; more specifically, it gets an undefined value for
$result in

while( my $result = $report_obj->next_result ) {

Sorry for the long email - any help would be appreciated,

Thanks -
Nabil

The code... non-relevant parts trimmed for size constraints... debugging
from working and non-working versions after the code.

use strict;
use Bio::SearchIO;
use Getopt::Std;
use List::Util qw(shuffle);
use Benchmark;

my ($inputfile, $samplefile, $blastfile, $blast_result, $blast_report, $blast_verbose); #files generated

#------------------#
# Subroutine Calls #
#------------------#
my $test = &create_sample_file($inputfile); #inputfile being a fasta file containing nucleotide sequence
$blast_result = &do_blast($test);
##$blast_result = 'test.blastout'; #when this is uncommented and replaces the previous two lines, with test.blastout being normal blast output, the script works fine.
&parse_blast($blast_result);

#######################
# create_sample_file
#
# Input: Original Fasta File
#
# Output: Fasta file containing randomly sampled reads
#
#
sub create_sample_file {
    my $in = shift;
    my $linecount = 0;
    my @lines;

    $samplefile = $in . "_sample";

    #Determine total # of reads in input fasta
    $totalreads = `$grep -c '>' $inputfile`;
    $totalreads =~ s/\s+//;
    chomp $totalreads;

    if ($totalreads > 1000) { #sample if more than 1000 reads
        $sample_reads = sprintf("%.0f", $totalreads * ($per_to_sample/100)); #number of reads to sample
    }
    else { #otherwise use all reads
        $sample_reads = $totalreads;
    }

    $/ = '>'; #define fasta record input separator
    open (IN, "<$in") or die "Cannot open $in $!\n";
    open (OUT, ">$samplefile") or die "Cannot open $samplefile $!\n";

    while (<IN>) { #read lines into an array
        chomp;
        push (@lines, $_);
    }

    @lines = shuffle(@lines); #shuffle array

    foreach (@lines) {
        print OUT ">$_" if $linecount <= $sample_reads; #output to file sampled number of reads
        $linecount++;
    }

    close IN;
    close OUT;
    return $samplefile;
}#end create_sample_file

#######################
# do_blast
#
# Input: Fasta File containing SCREENED sampled reads
#
# Output: Blast File
#
#
sub do_blast {
    my $bf = shift;
    my $blastoutput = $bf . ".blastout";

    print "Blasting against $db ...\n";
    `blast/blast-2.2.10/bin/megablast -d /prodinfo/proddata_ntblastdb/nt -e 1e-50 -p 99 -D 2 -i test -o test.blastout`;
    return $blastoutput;
}#end do_blast

#######################
# parse_blast
#
# Input: Blast file
#
# Output: Creates hash containing best hit for each read
#
#
sub parse_blast {
    my $blastoutfile = shift;

    if (!
-e $blastoutfile) { die "$blastoutfile does not exist $!\n"; } print "Parsing blast hits ...\n"; my $report_obj = new Bio::SearchIO(-verbose => 1, -format => 'blast', -file => $blastoutfile); die "no valid $report_obj" unless defined $report_obj; while( my $result = $report_obj->next_result ) { die "no valid $result" unless defined $result; while( my $hit = $result->next_hit ) { while( my $hsp = $hit->next_hsp ) { my $name = $result->query_name; my $hitDesc = $hit->description; my $length = $hsp->length('total'); my $per_id = sprintf("%.2f", $hsp->percent_identity); my $eval = $hsp->evalue; next if (defined $blast_results{$name} && $blast_results{$name}->[0] > $length); #only keep best hit for any read $blast_results{$name} = [$length, $per_id, $eval, $hitDesc]; #store in hash of arrays } } } } #end parse_blast Debug of NON-working blast-parse: main::(454/scripts/fasta_blasta_mb.pl:151): 151: my $sample_fasta = &create_sample_file($inputfile); DB<2> n main::(454/scripts/fasta_blasta_mb.pl:152): 152: $blast_result = &do_blast($sample_fasta); DB<2> x $sample_fasta 0 'G782.2005-08-16-16-48.fasta_sample' DB<3> n Blasting against NT ... main::(454/scripts/fasta_blasta_mb.pl:154): 154: &parse_blast($blast_result); DB<3> s main::parse_blast(454/scripts/fasta_blasta_mb.pl:293): 293: my $blastoutfile = shift; DB<3> s main::parse_blast(454/scripts/fasta_blasta_mb.pl:295): 295: if (! -e $blastoutfile) { DB<3> x $blastoutfile 0 'G782.2005-08-16-16-48.fasta_sample.blastout' DB<4> s main::parse_blast(454/scripts/fasta_blasta_mb.pl:299): 299: print "Parsing blast hits ...\n"; DB<4> s Parsing blast hits ... 
main::parse_blast(454/scripts/fasta_blasta_mb.pl:302): 302: my $report_obj = new Bio::SearchIO(-verbose => 1, 303: -format => 'blast', 304: -file => $blastoutfile);#or die "Could not open blast report $!"; DB<4> s Bio::SearchIO::new(/util/lib/perl5/site_perl/5.8.0/Bio/SearchIO.pm:129): 129: my($caller, at args) = @_; DB<4> r scalar context return from Bio::SearchIO::new: '_file' => 'G782.2005-08-16-16-48.fasta_sample.blastout' '_filehandle' => GLOB(0x8cef40c) -> *Symbol::GEN1 FileHandle({*Symbol::GEN1}) => fileno(3) '_flush_on_write' => 1 '_handler' => Bio::SearchIO::IteratedSearchResultEventBuilder=HASH(0x9505724) '_factories' => HASH(0x95054c0) 'hit' => Bio::Factory::ObjectFactory=HASH(0x95017b8) '_loaded_types' => HASH(0x9506c0c) 'Bio::Search::Hit::BlastHit' => 1 '_root_verbose' => 0 'interface' => 'Bio::Search::Hit::HitI' 'type' => 'Bio::Search::Hit::BlastHit' 'hsp' => Bio::Factory::ObjectFactory=HASH(0x9500e10) '_loaded_types' => HASH(0x9506c18) 'Bio::Search::HSP::GenericHSP' => 1 '_root_verbose' => 0 'interface' => 'Bio::Search::HSP::HSPI' 'type' => 'Bio::Search::HSP::GenericHSP' 'iteration' => Bio::Factory::ObjectFactory=HASH(0x9506c60) '_loaded_types' => HASH(0x9506af8) 'Bio::Search::Iteration::GenericIteration' => 1 '_root_verbose' => 0 'interface' => 'Bio::Search::Iteration::IterationI' 'type' => 'Bio::Search::Iteration::GenericIteration' 'result' => Bio::Factory::ObjectFactory=HASH(0x9504c80) '_loaded_types' => HASH(0x9501f74) 'Bio::Search::Result::BlastResult' => 1 '_root_verbose' => 0 'interface' => 'Bio::Search::Result::ResultI' 'type' => 'Bio::Search::Result::BlastResult' '_inclusion_threshold' => 0.001 '_root_verbose' => 1 '_handler_cache' => Bio::SearchIO::IteratedSearchResultEventBuilder=HASH(0x9505724) -> REUSED_ADDRESS '_notfirsttime' => 0 '_reporttype' => '' '_root_cleanup_methods' => ARRAY(0x8cde434) 0 CODE(0x82a9aec) -> &Bio::Root::IO::_io_cleanup in /util/lib/perl5/site_perl/5.8.0/Bio/Root/IO.pm:544-570 1 CODE(0x82a9aec) -> REUSED_ADDRESS 
'_root_verbose' => 1 main::parse_blast(454/scripts/fasta_blasta_mb.pl:307): 307: die "no valid $report_obj" unless defined $report_obj; DB<4> s main::parse_blast(454/scripts/fasta_blasta_mb.pl:310): 310: while( my $result = $report_obj->next_result ) { DB<4> s Bio::SearchIO::blast::next_result(/util/lib/perl5/site_perl/5.8.0/Bio/SearchIO/blast.pm:389): 389: my ($self) = @_; DB<4> r scalar context return from Bio::SearchIO::blast::next_result: undef Bio::SearchIO::DESTROY(/util/lib/perl5/site_perl/5.8.0/Bio/SearchIO.pm:438): 438: my $self = shift; DB<4> r scalar context return from Bio::SearchIO::DESTROY: '' Bio::Root::Root::DESTROY(/util/lib/perl5/site_perl/5.8.0/Bio/Root/Root.pm:404): 404: my $self = shift; DB<4> r scalar context return from Bio::Root::Root::DESTROY: undef Bio::Root::Root::DESTROY(/util/lib/perl5/site_perl/5.8.0/Bio/Root/Root.pm:404): 404: my $self = shift; DB<4> r scalar context return from Bio::Root::Root::DESTROY: undef Bio::Root::Root::DESTROY(/util/lib/perl5/site_perl/5.8.0/Bio/Root/Root.pm:404): 404: my $self = shift; DB<4> r scalar context return from Bio::Root::Root::DESTROY: undef Bio::Root::Root::DESTROY(/util/lib/perl5/site_perl/5.8.0/Bio/Root/Root.pm:404): 404: my $self = shift; DB<4> r scalar context return from Bio::Root::Root::DESTROY: undef Bio::Root::Root::DESTROY(/util/lib/perl5/site_perl/5.8.0/Bio/Root/Root.pm:404): 404: my $self = shift; DB<4> r scalar context return from Bio::Root::Root::DESTROY: undef main::(454/scripts/fasta_blasta_mb.pl:155): 155: &output_results(); DB<4> x $result 0 undef Debug of WORKING blast-parse: Parsing blast hits ... 
main::parse_blast(454/scripts/fasta_blasta_mb.pl:302): 302: my $report_obj = new Bio::SearchIO(-verbose => 1, 303: -format => 'blast', 304: -file => $blastoutfile);#or die "Could not open blast report $!"; DB<3> s Bio::SearchIO::new(/util/lib/perl5/site_perl/5.8.0/Bio/SearchIO.pm:129): 129: my($caller, at args) = @_; DB<3> r scalar context return from Bio::SearchIO::new: '_file' => 'G782.2005-08-16-16-48.fasta_sample.blastout' '_filehandle' => GLOB(0x8763100) -> *Symbol::GEN1 FileHandle({*Symbol::GEN1}) => fileno(3) '_flush_on_write' => 1 '_handler' => Bio::SearchIO::IteratedSearchResultEventBuilder=HASH(0x8ab3be4) '_factories' => HASH(0x8ab1594) 'hit' => Bio::Factory::ObjectFactory=HASH(0x8a7b7c0) '_loaded_types' => HASH(0x8abee10) 'Bio::Search::Hit::BlastHit' => 1 '_root_verbose' => 0 'interface' => 'Bio::Search::Hit::HitI' 'type' => 'Bio::Search::Hit::BlastHit' 'hsp' => Bio::Factory::ObjectFactory=HASH(0x8a87200) '_loaded_types' => HASH(0x8abee1c) 'Bio::Search::HSP::GenericHSP' => 1 '_root_verbose' => 0 'interface' => 'Bio::Search::HSP::HSPI' 'type' => 'Bio::Search::HSP::GenericHSP' 'iteration' => Bio::Factory::ObjectFactory=HASH(0x8abee64) '_loaded_types' => HASH(0x8abecfc) 'Bio::Search::Iteration::GenericIteration' => 1 '_root_verbose' => 0 'interface' => 'Bio::Search::Iteration::IterationI' 'type' => 'Bio::Search::Iteration::GenericIteration' 'result' => Bio::Factory::ObjectFactory=HASH(0x8a81a84) '_loaded_types' => HASH(0x8a96ce8) 'Bio::Search::Result::BlastResult' => 1 '_root_verbose' => 0 'interface' => 'Bio::Search::Result::ResultI' 'type' => 'Bio::Search::Result::BlastResult' '_inclusion_threshold' => 0.001 '_root_verbose' => 1 '_handler_cache' => Bio::SearchIO::IteratedSearchResultEventBuilder=HASH(0x8ab3be4) -> REUSED_ADDRESS '_notfirsttime' => 0 '_reporttype' => '' '_root_cleanup_methods' => ARRAY(0x8762efc) 0 CODE(0x82a9aec) -> &Bio::Root::IO::_io_cleanup in /util/lib/perl5/site_perl/5.8.0/Bio/Root/IO.pm:544-570 1 CODE(0x82a9aec) -> REUSED_ADDRESS 
'_root_verbose' => 1 main::parse_blast(454/scripts/fasta_blasta_mb.pl:307): 307: die "no valid $report_obj" unless defined $report_obj; DB<3> s main::parse_blast(454/scripts/fasta_blasta_mb.pl:310): 310: while( my $result = $report_obj->next_result ) { DB<3> s Bio::SearchIO::blast::next_result(/util/lib/perl5/site_perl/5.8.0/Bio/SearchIO/blast.pm:389): 389: my ($self) = @_; DB<3> r blast.pm: unrecognized line Reference: Zheng Zhang, Scott Schwartz, Lukas Wagner, and Webb Miller (2000), blast.pm: unrecognized line "A greedy algorithm for aligning DNA sequences", blast.pm: unrecognized line J Comput Biol 2000; 7(1-2):203-14. blast.pm: unrecognized line Score E Got NCBI HSP score=354, evalue 0.0 scalar context return from Bio::SearchIO::blast::next_result: '_algorithm' => 'MEGABLAST' '_algorithm_version' => '2.2.10 [Oct-19-2004]' '_dbentries' => 4249067 '_dbletters' => 17735149364 '_dbname' => 'All GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS,GSS,environmental samples or phase 0, 1 or 2 HTGS sequences) ' '_hitindex' => 0 '_hits' => ARRAY(0x8b2acd0) empty array '_inclusion_threshold' => 0.001 '_iteration_count' => 1 '_iteration_index' => 0 '_iterations' => ARRAY(0x8b2ac4c) 0 Bio::Search::Iteration::GenericIteration=HASH(0x8b1cacc) '_newhits_below_threshold' => ARRAY(0x8b1ca84) 0 Bio::Search::Hit::BlastHit=HASH(0x8b1cf64) '_accession' => 'AE004091' '_algorithm' => 'MEGABLAST' '_description' => 'Pseudomonas aeruginosa PAO1, complete genome' '_hsps' => ARRAY(0x8b1ceb0) 0 Bio::Search::HSP::GenericHSP=HASH(0x8b2098c) '_algorithm' => 'MEGABLAST' '_frac_conserved' => HASH(0x8b266a0) 'hit' => 0.991803278688525 'query' => 0.991803278688525 'total' => 0.991803278688525 '_frac_identical' => HASH(0x8b2658c) 'hit' => 0.991803278688525 'query' => 0.991803278688525 'total' => 0.991803278688525 '_gaps' => HASH(0x8b24d94) 'hit' => 0 'query' => 0 'total' => 0 '_gsf_tag_hash' => HASH(0x8b20998) empty hash '_hit_string' => 
'cctgacctccgctcaactgcgcaaatacgccagcgccggtcggccgttccccgaagggcgcctgctggccgcctcctgccacgacgcggaggaactggccctggctgcctcgatgggagtggagttcgtcaccctttcgccggtacagccgaccgagagccatcccggcgagccggcgctgggttgggacaaggccgccgaactgatcgccggcttcaaccagccggtctacctgctgggtggcctcggtccgcagcaagccgagcaggcttgggagcatggagcccagggcgtggcgggtatccgtgcgttctggccgggcggcctttgacggtggaatgaagaaaaaaggaggcttcggcctcc' '_homology_string' => '|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ||||||||||||||||||||||||||||||||||||||||| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||' etc...... From torsten.seemann at infotech.monash.edu.au Sun Jul 30 01:41:30 2006 From: torsten.seemann at infotech.monash.edu.au (Torsten Seemann) Date: Sun, 30 Jul 2006 15:41:30 +1000 Subject: [Bioperl-l] Strangeness in parsing blast file In-Reply-To: <44CC3550.5070105@broad.mit.edu> References: <44CC3550.5070105@broad.mit.edu> Message-ID: <44CC468A.40700@infotech.monash.edu.au> > sub do_blast { > my $bf = shift; > my $blastoutput = $bf . ".blastout"; > print "Blasting against $db ...\n"; > `blast/blast-2.2.10/bin/megablast -d /prodinfo/proddata_ntblastdb/nt > -e 1e-50 -p 99 -D 2 -i test -o test.blastout`; > return $blastoutput; > }#end do_blast Should "-o test.blastoutput" be "-o $blastoutput" ? Otherwise you are returning the name of a non-existent file, which naturally Bio::SearchIO will not be able to find a blast result in. Alternatively use Bio::Tools::Run::StandaloneBlast to invoke megablast rather than back-ticks - that way you avoid any intermediate file and get a Bio::SearchIO object back directly. 
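One way to rule out the filename mismatch described above is to derive the output name in exactly one place and use it both as the `-o` argument and as the return value. The following is only a sketch: `megablast_command` and its named arguments are hypothetical, and the flags are simply the ones from the posted command line. It builds the argument list without running anything.

```perl
use strict;
use warnings;

# Hypothetical helper: derive the output name once and use it both in the
# command line and as the return value, so the two cannot drift apart.
sub megablast_command {
    my (%arg) = @_;
    my $out = "$arg{input}.blastout";
    my @cmd = (
        $arg{program},
        '-d', $arg{db},
        '-e', $arg{evalue},
        '-p', $arg{percent_id},
        '-D', 2,
        '-i', $arg{input},
        '-o', $out,
    );
    return ($out, @cmd);
}

# Placeholder values only; nothing is executed here.
my ($out, @cmd) = megablast_command(
    program    => 'megablast',
    db         => 'nt',
    evalue     => '1e-50',
    percent_id => 99,
    input      => 'test',
);
print "would run:   @cmd\n";
print "would parse: $out\n";
```

Returning the list form of the command also makes it easy to hand the same arguments to `system` without going through a shell.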
--Torsten

From nabil at broad.mit.edu Sun Jul 30 10:11:03 2006
From: nabil at broad.mit.edu (Nabil Hafez)
Date: Sun, 30 Jul 2006 10:11:03 -0400
Subject: [Bioperl-l] Strangeness in parsing blast file
In-Reply-To: <44CC468A.40700@infotech.monash.edu.au>
References: <44CC3550.5070105@broad.mit.edu> <44CC468A.40700@infotech.monash.edu.au>
Message-ID: <44CCBDF7.2010601@broad.mit.edu>

I had modified the variable names a bit to make them more readable than in my actual code; in my code the argument really is -o $blastoutput. Like I said, the blast portion works absolutely fine, i.e. the do_blast subroutine is fully functional.

Here's a cut and paste from my actual code:

    my $MBLAST = "/prodinfo/prod3pty/blast/blast-2.2.10/bin/megablast";
    my $blastdb = "/prodinfo/proddata_ntblastdb/nt";
    my $e_val = "1e-50"; #Default e-value Getopt_long
    my $percent_id = "99"; #Default percentage identity
    my $per_to_sample ="10"; #Default for percentage of reads to sample

    sub do_blast {
        my $bf = shift;
        my $blastoutput = $bf . ".blastout";
        print "Blasting against $db ...\n";
        `$MBLAST -d $blastdb -e $e_val -p $percent_id -D 2 -i $bf -o $blastoutput`;
        return $blastoutput;
    }

I will try your suggestion to use Bio::Tools::Run::StandaloneBlast. Is megablast supported by this module?

Thanks
Nabil

From bix at sendu.me.uk Sun Jul 30 12:20:54 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Sun, 30 Jul 2006 17:20:54 +0100
Subject: [Bioperl-l] Strangeness in parsing blast file
In-Reply-To: <44CCBDF7.2010601@broad.mit.edu>
References: <44CC3550.5070105@broad.mit.edu> <44CC468A.40700@infotech.monash.edu.au> <44CCBDF7.2010601@broad.mit.edu>
Message-ID: <44CCDC66.2030604@sendu.me.uk>

Nabil Hafez wrote:
> I had modified the variables a bit to try and make them more readable
> than what is in my code, in my code -o $blastoutput is
> what it is, like I said, the blast portion works absolutely fine - i.e.
> the do_blast sub routine is fully functional.

How do you know?

> `$MBLAST -d $blastdb -e $e_val -p $percent_id -D 2 -i $bf -o
> $blastoutput`;

Does this command definitely produce exactly the same file as the one you use to show that parse_blast() does sometimes work (when you avoid using do_blast())?

Btw,
http://perldoc.perl.org/perlfaq8.html#What's-wrong-with-using-backticks-in-a-void-context%3f

> I will try your suggestion to use the Bio::Tools::Run::StandaloneBlast,
> is megablast supported by this module?

No, it isn't. You could cheat and call _runblast() directly (give it an executable string and a string of args to megablast), and provide -outfile to new().

From nabil at broad.mit.edu Sun Jul 30 20:13:16 2006
From: nabil at broad.mit.edu (Nabil Hafez)
Date: Sun, 30 Jul 2006 20:13:16 -0400
Subject: [Bioperl-l] Strangeness in parsing blast file
In-Reply-To: <44CCDC66.2030604@sendu.me.uk>
References: <44CC3550.5070105@broad.mit.edu> <44CC468A.40700@infotech.monash.edu.au> <44CCBDF7.2010601@broad.mit.edu> <44CCDC66.2030604@sendu.me.uk>
Message-ID: <44CD4B1C.5070907@broad.mit.edu>

Sendu Bala wrote:
>Nabil Hafez wrote:
>>I had modified the variables a bit to try and make them more readable
>>than what is in my code, in my code -o $blastoutput is
>>what it is, like I said, the blast portion works absolutely fine - i.e.
>>the do_blast sub routine is fully functional.
>
>How do you know?

Because it creates a file containing all of the blast output. This works every time: a file is created with the blast output.

>> `$MBLAST -d $blastdb -e $e_val -p $percent_id -D 2 -i $bf -o
>>$blastoutput`;
>
>Does this command definitely produce exactly the same file as the one
>you use to show that parse_blast() does sometimes work (when you avoid
>using do_blast())?

Yes, it is exactly the same file: do_blast() produces the file, and when the script then fails to parse it and ends, the blast output file is still there in my directory. If I re-run the script and just feed in the name of the file that was created, it parses just fine. So basically the parsing works whenever I feed it an existing blast output file, but it can't seem to parse that same file when it has just been created and passed to the parse_blast() subroutine.

>Btw,
>http://perldoc.perl.org/perlfaq8.html#What's-wrong-with-using-backticks-in-a-void-context%3f

Good to know. Thanks.

>>I will try your suggestion to use the Bio::Tools::Run::StandaloneBlast,
>>is megablast supported by this module?
>
>No, it doesn't. You could cheat and call _runblast() directly (give it
>an executable string and a string of args to megablast), and provide
>-outfile to new().

I still don't think the blast is a problem since I get perfect blast output every time.
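One way to test the claim that the blast step is not the problem, along the lines of the "How do you know?" question and the perlfaq8 note on backticks in a void context, is to run the external program with `system` and check both its exit status and the output file before handing the file name to the parser. This is a sketch under stated assumptions, not a diagnosis of the failure: `run_and_check` is a hypothetical helper, and the demo uses the running perl as a stand-in for megablast.

```perl
use strict;
use warnings;

# Hypothetical helper: run an external command (list form, no shell) and
# make sure it exited cleanly and left a non-empty output file behind.
sub run_and_check {
    my ($outfile, @cmd) = @_;
    my $rc = system(@cmd);
    die "could not start '$cmd[0]': $!\n" if $rc == -1;
    die sprintf("'%s' failed (wait status %d)\n", $cmd[0], $?) if $? != 0;
    die "'$outfile' is missing or empty\n" unless -s $outfile;
    return $outfile;
}

# Stand-in for megablast: the running perl writes one line to a file.
my $out = "run_and_check_demo.$$.out";
run_and_check($out, $^X, '-e',
    'open(my $fh, ">", shift) or die $!; print {$fh} "MEGABLAST\n"; close $fh',
    $out);
print "output file ready for parsing: $out\n";
unlink $out;
```

If a check like this passes and the parser still returns no result on the freshly written file, the blast step itself is less likely to be the cause.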
From cjfields at uiuc.edu Sun Jul 30 22:52:16 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Sun, 30 Jul 2006 21:52:16 -0500
Subject: [Bioperl-l] Strangeness in parsing blast file
In-Reply-To: <44CD4B1C.5070907@broad.mit.edu>
References: <44CC3550.5070105@broad.mit.edu> <44CC468A.40700@infotech.monash.edu.au> <44CCBDF7.2010601@broad.mit.edu> <44CCDC66.2030604@sendu.me.uk> <44CD4B1C.5070907@broad.mit.edu>
Message-ID: <81C49D1F-0468-4B63-8D7A-09E1C48573F0@uiuc.edu>

As an aside, BLAST 2.2.13 or later cannot be parsed using Bioperl 1.5.1. You have to update to the latest bioperl-live (from CVS).

Chris

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign

From bix at sendu.me.uk Mon Jul 31 04:29:28 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Mon, 31 Jul 2006 09:29:28 +0100
Subject: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields)
In-Reply-To: <5C765CAF-4976-45D1-BDA4-1E432081FF4D@uiuc.edu>
References: <001301c6b24b$da38ba80$15327e82@pyrimidine> <44CB144C.6050507@sendu.me.uk> <5C765CAF-4976-45D1-BDA4-1E432081FF4D@uiuc.edu>
Message-ID: <44CDBF68.2040803@sendu.me.uk>

Chris Fields wrote:
> On Jul 29, 2006, at 2:54 AM, Sendu Bala wrote:
>
>>> http://www.bioperl.org/wiki/HOWTO:Trees
>> Which is why the Trees HOWTO talks about taxa, and a number of the
>> Taxonomy modules have phylogenetic methods like get_lca. (And why
>> there is Bio::Taxonomy::Taxon and Tree.)
>
> Are we still thinking about deprecating those?
I have seen very > little mention of those modules from the mail list archives, and > Jason mentioned that Bio::Taxonomy::Taxon hasn't been modified in a > long time. Yes, they would both be redundant and nonsensical with the planned changes to Bio::Species. From Xianjun.Dong at bccs.uib.no Mon Jul 31 07:55:59 2006 From: Xianjun.Dong at bccs.uib.no (Xianjun Dong) Date: Mon, 31 Jul 2006 13:55:59 +0200 Subject: [Bioperl-l] PAML + Codeml problem.. In-Reply-To: 4A98ACB8EC146149872BAC9A132A582C277AC4@icex5.ic.ac.uk Message-ID: <1154346960.6517.19.camel@lauvtre.ii.uib.no> Hi, I have a problem during running the Codeml Wiki-HOWTO code: Here is the error message: //////////////////////////////////////////////////////////////// [xianjund at lauvtre kaks]$ perl paml.pl test.fa -------------------- WARNING --------------------- MSG: There was an error - see error_string for the program output STACK Bio::Tools::Run::Phylo::PAML::Codeml::run /Home/extern/xianjund/src/bioperl/bioperl-run/Bio/Tools/Run/Phylo/PAML/Codeml.pm:581 STACK toplevel paml.pl:61 ------------- EXCEPTION: Bio::Root::NotImplemented ------------- MSG: Unknown format of PAML output STACK: Error::throw STACK: Bio::Root::Root::throw /usr/lib/perl5/site_perl/5.8.5/Bio/Root/Root.pm:328 STACK: Bio::Tools::Phylo::PAML::_parse_summary /usr/lib/perl5/site_perl/5.8.5/Bio/Tools/Phylo/PAML.pm:359 STACK: Bio::Tools::Phylo::PAML::next_result /usr/lib/perl5/site_perl/5.8.5/Bio/Tools/Phylo/PAML.pm:224 STACK: paml.pl:62 ---------------------------------------------------------------- //////////////////////////////////////////////////////////////// My test sequence is: >ENST00000361390 
ATACCCATGGCCAACCTCCTACTCCTCATTGTACCCATTCTAATCGCAATGGCATTCCTAATGCTTACCGAACGAAAAATTCTAGGCTATATACAACTACGCAAAGGCCCCAACGTTGTAGGCCCCTACGGGCTACTACAACCCTTCGCTGACGCCATAAAACTCTTCACCAAAGAGCCCCTAAAACCCGCCACATCTACCATCACCCTCTACATCACCGCCCCGACCTTAGCTCTCACCATCGCTCTTCTACTATGAACCCCCCTCCCCATACCCAACCCCCTGGTCAACCTCAACCTAGGCCTCCTATTTATTCTAGCCACCTCTAGCCTAGCCGTTTACTCAATCCTCTGATCAGGGTGAGCATCAAACTCAAACTACGCCCTGATCGGCGCACTGCGAGCAGTAGCCCAAACAATCTCATATGAAGTCACCCTAGCCATCATTCTACTATCAACATTACTAATAAGTGGCTCCTTTAACCTCTCCACCCTTATCACAACACAAGAACACCTCTGATTACTCCTGCCATCATGACCCTTGGCCATAATATGATTTATCTCCACACTAGCAGAGACCAACCGAACCCCCTTCGACCTTGCCGAAGGGGAGTCCGAACTAGTCTCAGGCTTCAACATCGAATACGCCGCAGGCCCCTTCGCCCTATTCTTCATAGCCGAATACACAAACATTATTATAATAAACACCCTCACCACTACAATCTTCCTAGGAACAACATATGACGCACTCTCCCCTGAACTCTACACAACATATTTTGTCACCAAGACCCTACTTCTAACCTCCCTGTTCTTATGAATTCGAACAGCATACCCCCGATTCCGCTACGACCAACTCATACACCTCCTATGAAAAAACTTCCTACCACTCACCCTAGCATTACTTATATGATATGTCTCCATACCCATTACAATCTCCAGCATTCCCCCTCAAACCTAA >ENSMUST00000082392 GTGTTCTTTATTAATATCCTAACACTCCTCGTCCCCATTCTAATCGCCATAGCCTTCCTAACATTAGTAGAACGCAAAATCTTAGGGTACATACAACTACGAAAAGGCCCTAACATTGTTGGTCCATACGGCATTTTACAACCATTTGCAGACGCCATAAAATTATTTATAAAAGAACCAATACGCCCTTTAACAACCTCTATATCCTTATTTATTATTGCACCTACCCTATCACTCACACTAGCATTAAGTCTATGAGTTCCCCTACCAATACCACACCCATTAATTAATTTAAACCTAGGGATTTTATTTATTTTAGCAACATCTAGCCTATCAGTTTACTCCATTCTATGATCAGGATGAGCCTCAAACTCCAAATACTCACTATTCGGAGCTTTACGAGCCGTAGCCCAAACAATTTCATATGAAGTAACCATAGCTATTATCCTTTTATCAGTTCTATTAATAAATGGATCCTACTCTCTACAAACACTTATTACAACCCAAGAACACATATGATTACTTCTGCCAGCCTGACCCATAGCCATAATATGATTTATCTCAACCCTAGCAGAAACAAACCGGGCCCCCTTCGACCTGACAGAAGGAGAATCAGAATTAGTATCAGGGTTTAACGTAGAATACGCAGCCGGCCCATTCGCGTTATTCTTTATAGCAGAGTACACTAACATTATTCTAATAAACGCCCTAACAACTATTATCTTCCTAGGACCCCTATACTATATCAATTTACCAGAACTCTACTCAACTAACTTCATAATAGAAGCTCTACTACTATCATCAACATTCCTATGGATCCGAGCATCTTATCCACGCTTCCGTTACGATCAACTTATACATCTTCTATGAAAAAACTTTCTACCCCTAACACTAGCATTATGTATGTGACATATTTCTTTACCAATTTTTACAGCGGGAGTACCACCATACATATAG Sure, I checked it. There is some stop codon in it. 
If I replace it with non-stop codon, it works. For example, >ENST00000361390 ATACCCATGGCCAACCTCCTACTCCTCATTGTACCCATTCcaaTCGCAATGGCATTCCcaaTGCTTACCGAACGAAAAATTCcaaGCTATATACAACTACGCAAAGGCCCCAACGTTGcaaGCCCCTACGGGCTACTACAACCCTTCGCcaaCGCCAcaaAACTCTTCACCAAAGAGCCCCcaaAACCCGCCACATCTACCATCACCCTCTACATCACCGCCCCGACCTcaaCTCTCACCATCGCTCTTCTACTAcaaACCCCCCTCCCCATACCCAACCCCCTGGTCAACCTCAACCcaaGCCTCCTATTTATTCcaaCCACCTCcaaCCcaaCCGTTTACTCAATCCTCcaaTCAGGGcaaGCATCAAACTCAAACTACGCCCcaaTCGGCGCACTGCGAGCAGcaaCCCAAACAATCTCATAcaaAGTCACCCcaaCCATCATTCTACTATCAACATTACcaacaaGTGGCTCCTTcaaCCTCTCCACCCTTATCACAACACAAGAACACCTCcaaTTACTCCTGCCATCAcaaCCCTTGGCCAcaaTAcaaTTTATCTCCACACcaaCAGAGACCAACCGAACCCCCTTCGACCTTGCCGAAGGGGAGTCCGAACcaaTCTCAGGCTTCAACATCGAATACGCCGCAGGCCCCTTCGCCCTATTCTTCAcaaCCGAATACACAAACATTATTAcaacaaACACCCTCACCACTACAATCTTCCcaaGAACAACATAcaaCGCACTCTCCCCcaaACTCTACACAACATATTTTGTCACCAAGACCCTACTTCcaaCCTCCCTGTTCTTAcaaATTCGAACAGCATACCCCCGATTCCGCTACGACCAACTCATACACCTCCTAcaaAAAAACTTCCTACCACTCACCCcaaCATTACTTATAcaaTATGTCTCCATACCCATTACAATCTCCAGCATTCCCCCTCAAACCcaa >ENSMUST00000082392 
GTGTTCTTTATcaaTATCCcaaCACTCCTCGTCCCCATTCcaaTCGCCAcaaCCTTCCcaaCATcaacaaAACGCAAAATCTcaaGGTACATACAACTACGAAAAGGCCCcaaCATTGTTGGTCCATACGGCATTTTACAACCATTTGCAGACGCCAcaaAATTATTTAcaaAAGAACCAATACGCCCTTcaaCAACCTCTATATCCTTATTTATTATTGCACCTACCCTATCACTCACACcaaCATcaaGTCTAcaaGTTCCCCTACCAATACCACACCCATcaaTcaaTTcaaACCcaaGGATTTTATTTATTTcaaCAACATCcaaCCTATCAGTTTACTCCATTCTAcaaTCAGGAcaaGCCTCAAACTCCAAATACTCACTATTCGGAGCTTTACGAGCCGcaaCCCAAACAATTTCATAcaaAGcaaCCAcaaCTATTATCCTTTTATCAGTTCTATcaacaaATGGATCCTACTCTCTACAAACACTTATTACAACCCAAGAACACATAcaaTTACTTCTGCCAGCCcaaCCCAcaaCCAcaaTAcaaTTTATCTCAACCCcaaCAGAAACAAACCGGGCCCCCTTCGACCcaaCAGAAGGAGAATCAGAATcaaTATCAGGGTTcaaCGcaaAATACGCAGCCGGCCCATTCGCGTTATTCTTTAcaaCAGAGTACACcaaCATTATTCcaacaaACGCCCcaaCAACTATTATCTTCCcaaGACCCCTATACTATATCAATTTACCAGAACTCTACTCAACcaaCTTCAcaacaaAAGCTCTACTACTATCATCAACATTCCTATGGATCCGAGCATCTTATCCACGCTTCCGTTACGATCAACTTATACATCTTCTAcaaAAAAACTTTCTACCCCcaaCACcaaCATTATGTATGcaaCATATTTCTTTACCAATTTTTACAGCGGGAGTACCACCATACATAcaa But my question is: it does not occur in the codon position (say, the third codon's position is not a times of 3). Why it effect the result? And also there is code to filter out the stop codon in the sample code (as the following shown) /////////////////////////////// if( $pseq =~ /\*/ && $pseq !~ /\*$/ ) { warn("provided a CDS sequence with a stop codon, PAML will choke!"); exit(0); } # Tcoffee can't handle '*' even if it is trailing $pseq =~ s/\*//g; ///////////////////////////// So, when translate back from aa_aln to dna_aln, there should be no stop codon included. SO, why it can not pass? Thanks for answer! 
P.S: attach my code here:
/////////////////////////////////////////////////////////
#!/usr/bin/perl -w
use strict;
use Bio::Tools::Run::Phylo::PAML::Codeml;
use Bio::Tools::Run::Alignment::Clustalw;

# for projecting alignments from protein to R/DNA space
use Bio::Align::Utilities qw(aa_to_dna_aln);

# for input of the sequence data
use Bio::SeqIO;
use Bio::AlignIO;

my $aln_factory = Bio::Tools::Run::Alignment::Clustalw->new('quiet'=>1);
my $seqdata = shift || 'test.fa';

my $seqio = new Bio::SeqIO(-file => $seqdata,
                           -format => 'fasta');
my %seqs;
my @prots;
# process each sequence
while ( my $seq = $seqio->next_seq ) {
    $seqs{$seq->display_id} = $seq;
    # translate them into protein
    my $protein = $seq->translate();
    my $pseq = $protein->seq();
    if( $pseq =~ /\*/ &&
        $pseq !~ /\*$/ ) {
        warn("provided a CDS sequence with a stop codon, PAML will choke!");
        exit(0);
    }
    # Tcoffee can't handle '*' even if it is trailing
    $pseq =~ s/\*//g;
    $protein->seq($pseq);
    push @prots, $protein;
}

if( @prots < 2 ) {
    warn("Need at least 2 CDS sequences to proceed");
    exit(0);
}

# open(OUT, ">align_output.txt") || die("cannot open output align_output for writing");
# Align the sequences with clustalw
my $aa_aln = $aln_factory->align(\@prots);
# project the protein alignment back to CDS coordinates
my $dna_aln = aa_to_dna_aln($aa_aln, \%seqs);

my @each = $dna_aln->each_seq();

my $kaks_factory = Bio::Tools::Run::Phylo::PAML::Codeml->new
    ( -params => { 'runmode' => -2,
                   'seqtype' => 1,
                 },
      -save_tempfiles => 1,
      -verbose => 1);

# set the alignment object
$kaks_factory->alignment($dna_aln);

# run the KaKs analysis
my ($rc,$parser) = $kaks_factory->run();
my $result = $parser->next_result;
my $MLmatrix = $result->get_MLmatrix();

my @otus = $result->get_seqs();
# this gives us a mapping from the PAML order of sequences back to
# the input order (since names get truncated)
my @pos = map {
    my $c= 1;
    foreach my $s ( @each ) {
        last if( $s->display_id eq $_->display_id );
        $c++;
    }
    $c;
} @otus;

print join("\t", qw(SEQ1 SEQ2 Ka Ks Ka/Ks PROT_PERCENTID CDNA_PERCENTID)),"\n";
for( my $i = 0; $i < (scalar @otus -1) ; $i++) {
    for( my $j = $i+1; $j < (scalar @otus); $j++ ) {
        my $sub_aa_aln = $aa_aln->select_noncont($pos[$i],$pos[$j]);
        my $sub_dna_aln = $dna_aln->select_noncont($pos[$i],$pos[$j]);
        print join("\t", $otus[$i]->display_id,
                   $otus[$j]->display_id,
                   $MLmatrix->[$i]->[$j]->{'dN'},
                   $MLmatrix->[$i]->[$j]->{'dS'},
                   $MLmatrix->[$i]->[$j]->{'omega'},
                   sprintf("%.2f",$sub_aa_aln->percentage_identity),
                   sprintf("%.2f",$sub_dna_aln->percentage_identity),
                   ), "\n";
    }
}

--
Xianjun Dong
PhD Student
Computational Biology Unit
Bergen Center for Computational Science
University of Bergen
Høyteknologisenteret, Thormøhlensgate 55
N-5008 Bergen,Norway.
Webpage: http://www.ii.uib.no/~xianjund/
MSN: sterding at hotmail.com
Phone No: +47 - 55584354 (office)
          +47 - 47361688 (mobile)
Fax No: +47 - 55584295

From golharam at umdnj.edu Mon Jul 31 11:20:33 2006
From: golharam at umdnj.edu (Ryan Golhar)
Date: Mon, 31 Jul 2006 11:20:33 -0400
Subject: [Bioperl-l] PAML + Codeml problem..
In-Reply-To: <1154346960.6517.19.camel@lauvtre.ii.uib.no>
Message-ID: <027201c6b4b4$ddc201f0$2f01a8c0@GOLHARMOBILE1>

Hi Xianjun,

I just did some work on this module including the example.

>> it does not occur in the codon position
>>(say, the third codon's position is not a times of 3).
>>Why it effect the result?

If I'm interpreting your question correctly, the stop codons in your sequence occur in-frame. This is why it is choking.

>>So, when translate back from aa_aln to dna_aln, there should be no stop codon included. SO, why it can not pass?

The Ka and Ks statistics are not calculated based on the protein sequence; they are calculated based on the DNA sequence. The protein sequence is used to provide an alignment for the codons of the DNA sequence. Checking the protein sequence for * is an easier way to identify in-frame stop codons than scanning the DNA sequence.
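For comparison, the DNA-level scan mentioned above can be sketched in plain Perl. This is only an illustration under stated assumptions: `inframe_stops` is a hypothetical helper (not part of BioPerl), the reading frame is assumed to start at the first base, and the standard genetic code's stop codons (TAA, TAG, TGA) are assumed; a different translation table would change which codons count as stops.

```perl
use strict;
use warnings;

# Hypothetical helper: return the 1-based codon numbers of in-frame stop
# codons in a CDS, ignoring a stop in the last complete codon.
# Assumes frame 1 and the standard genetic code.
sub inframe_stops {
    my ($cds) = @_;
    my $dna = uc $cds;
    my $ncodons = int(length($dna) / 3);
    my @stops;
    for my $i (0 .. $ncodons - 1) {
        my $codon = substr($dna, 3 * $i, 3);
        next unless $codon =~ /^(?:TAA|TAG|TGA)$/;
        next if $i == $ncodons - 1;    # a terminal stop is expected
        push @stops, $i + 1;
    }
    return @stops;
}

# ATG TAA CCC TGA: the TAA at codon 2 is in-frame, the final TGA is terminal.
my @hits = inframe_stops('ATGTAACCCTGA');
print "in-frame stops at codons: @hits\n";

# ATG CTA ACC: the letters TAA occur, but straddle two codons.
my @none = inframe_stops('ATGCTAACC');
print "out-of-frame TAA gives ", scalar(@none), " hits\n";
```

The second call shows why a stop-like triplet only matters when it starts at a position that is a multiple of three from the start of the reading frame.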
The two checks for stop codons you mentioned are to check for stop codons within the sequence without worry for the last amino acid. The second part remove the * at the end of the sequence (not the middle). If you want to remove the in-frame stop codons, you can, but do so before translating it to protein sequences. Ryan -----Original Message----- From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Xianjun Dong Sent: Monday, July 31, 2006 7:56 AM To: bioperl-l at lists.open-bio.org Subject: [Bioperl-l] PAML + Codeml problem.. Hi, I have a problem during running the Codeml Wiki-HOWTO code: Here is the error message: //////////////////////////////////////////////////////////////// [xianjund at lauvtre kaks]$ perl paml.pl test.fa -------------------- WARNING --------------------- MSG: There was an error - see error_string for the program output STACK Bio::Tools::Run::Phylo::PAML::Codeml::run /Home/extern/xianjund/src/bioperl/bioperl-run/Bio/Tools/Run/Phylo/PAML/C odeml.pm:581 STACK toplevel paml.pl:61 ------------- EXCEPTION: Bio::Root::NotImplemented ------------- MSG: Unknown format of PAML output STACK: Error::throw STACK: Bio::Root::Root::throw /usr/lib/perl5/site_perl/5.8.5/Bio/Root/Root.pm:328 STACK: Bio::Tools::Phylo::PAML::_parse_summary /usr/lib/perl5/site_perl/5.8.5/Bio/Tools/Phylo/PAML.pm:359 STACK: Bio::Tools::Phylo::PAML::next_result /usr/lib/perl5/site_perl/5.8.5/Bio/Tools/Phylo/PAML.pm:224 STACK: paml.pl:62 ---------------------------------------------------------------- //////////////////////////////////////////////////////////////// My test sequence is: >ENST00000361390 ATACCCATGGCCAACCTCCTACTCCTCATTGTACCCATTCTAATCGCAATGGCATTCCTAATGCTTACCGAA CGAAAAATTCTAGGCTATATACAACTACGCAAAGGCCCCAACGTTGTAGGCCCCTACGGGCTACTACAACCC TTCGCTGACGCCATAAAACTCTTCACCAAAGAGCCCCTAAAACCCGCCACATCTACCATCACCCTCTACATC ACCGCCCCGACCTTAGCTCTCACCATCGCTCTTCTACTATGAACCCCCCTCCCCATACCCAACCCCCTGGTC 
AACCTCAACCTAGGCCTCCTATTTATTCTAGCCACCTCTAGCCTAGCCGTTTACTCAATCCTCTGATCAGGG TGAGCATCAAACTCAAACTACGCCCTGATCGGCGCACTGCGAGCAGTAGCCCAAACAATCTCATATGAAGTC ACCCTAGCCATCATTCTACTATCAACATTACTAATAAGTGGCTCCTTTAACCTCTCCACCCTTATCACAACA CAAGAACACCTCTGATTACTCCTGCCATCATGACCCTTGGCCATAATATGATTTATCTCCACACTAGCAGAG ACCAACCGAACCCCCTTCGACCTTGCCGAAGGGGAGTCCGAACTAGTCTCAGGCTTCAACATCGAATACGCC GCAGGCCCCTTCGCCCTATTCTTCATAGCCGAATACACAAACATTATTATAATAAACACCCTCACCACTACA ATCTTCCTAGGAACAACATATGACGCACTCTCCCCTGAACTCTACACAACATATTTTGTCACCAAGACCCTA CTTCTAACCTCCCTGTTCTTATGAATTCGAACAGCATACCCCCGATTCCGCTACGACCAACTCATACACCTC CTATGAAAAAACTTCCTACCACTCACCCTAGCATTACTTATATGATATGTCTCCATACCCATTACAATCTCC AGCATTCCCCCTCAAACCTAA >ENSMUST00000082392 GTGTTCTTTATTAATATCCTAACACTCCTCGTCCCCATTCTAATCGCCATAGCCTTCCTAACATTAGTAGAA CGCAAAATCTTAGGGTACATACAACTACGAAAAGGCCCTAACATTGTTGGTCCATACGGCATTTTACAACCA TTTGCAGACGCCATAAAATTATTTATAAAAGAACCAATACGCCCTTTAACAACCTCTATATCCTTATTTATT ATTGCACCTACCCTATCACTCACACTAGCATTAAGTCTATGAGTTCCCCTACCAATACCACACCCATTAATT AATTTAAACCTAGGGATTTTATTTATTTTAGCAACATCTAGCCTATCAGTTTACTCCATTCTATGATCAGGA TGAGCCTCAAACTCCAAATACTCACTATTCGGAGCTTTACGAGCCGTAGCCCAAACAATTTCATATGAAGTA ACCATAGCTATTATCCTTTTATCAGTTCTATTAATAAATGGATCCTACTCTCTACAAACACTTATTACAACC CAAGAACACATATGATTACTTCTGCCAGCCTGACCCATAGCCATAATATGATTTATCTCAACCCTAGCAGAA ACAAACCGGGCCCCCTTCGACCTGACAGAAGGAGAATCAGAATTAGTATCAGGGTTTAACGTAGAATACGCA GCCGGCCCATTCGCGTTATTCTTTATAGCAGAGTACACTAACATTATTCTAATAAACGCCCTAACAACTATT ATCTTCCTAGGACCCCTATACTATATCAATTTACCAGAACTCTACTCAACTAACTTCATAATAGAAGCTCTA CTACTATCATCAACATTCCTATGGATCCGAGCATCTTATCCACGCTTCCGTTACGATCAACTTATACATCTT CTATGAAAAAACTTTCTACCCCTAACACTAGCATTATGTATGTGACATATTTCTTTACCAATTTTTACAGCG GGAGTACCACCATACATATAG Sure, I checked it. There is some stop codon in it. If I replace it with non-stop codon, it works. 
For example, >ENST00000361390 ATACCCATGGCCAACCTCCTACTCCTCATTGTACCCATTCcaaTCGCAATGGCATTCCcaaTGCTTACCGAA CGAAAAATTCcaaGCTATATACAACTACGCAAAGGCCCCAACGTTGcaaGCCCCTACGGGCTACTACAACCC TTCGCcaaCGCCAcaaAACTCTTCACCAAAGAGCCCCcaaAACCCGCCACATCTACCATCACCCTCTACATC ACCGCCCCGACCTcaaCTCTCACCATCGCTCTTCTACTAcaaACCCCCCTCCCCATACCCAACCCCCTGGTC AACCTCAACCcaaGCCTCCTATTTATTCcaaCCACCTCcaaCCcaaCCGTTTACTCAATCCTCcaaTCAGGG caaGCATCAAACTCAAACTACGCCCcaaTCGGCGCACTGCGAGCAGcaaCCCAAACAATCTCATAcaaAGTC ACCCcaaCCATCATTCTACTATCAACATTACcaacaaGTGGCTCCTTcaaCCTCTCCACCCTTATCACAACA CAAGAACACCTCcaaTTACTCCTGCCATCAcaaCCCTTGGCCAcaaTAcaaTTTATCTCCACACcaaCAGAG ACCAACCGAACCCCCTTCGACCTTGCCGAAGGGGAGTCCGAACcaaTCTCAGGCTTCAACATCGAATACGCC GCAGGCCCCTTCGCCCTATTCTTCAcaaCCGAATACACAAACATTATTAcaacaaACACCCTCACCACTACA ATCTTCCcaaGAACAACATAcaaCGCACTCTCCCCcaaACTCTACACAACATATTTTGTCACCAAGACCCTA CTTCcaaCCTCCCTGTTCTTAcaaATTCGAACAGCATACCCCCGATTCCGCTACGACCAACTCATACACCTC CTAcaaAAAAACTTCCTACCACTCACCCcaaCATTACTTATAcaaTATGTCTCCATACCCATTACAATCTCC AGCATTCCCCCTCAAACCcaa >ENSMUST00000082392 GTGTTCTTTATcaaTATCCcaaCACTCCTCGTCCCCATTCcaaTCGCCAcaaCCTTCCcaaCATcaacaaAA CGCAAAATCTcaaGGTACATACAACTACGAAAAGGCCCcaaCATTGTTGGTCCATACGGCATTTTACAACCA TTTGCAGACGCCAcaaAATTATTTAcaaAAGAACCAATACGCCCTTcaaCAACCTCTATATCCTTATTTATT ATTGCACCTACCCTATCACTCACACcaaCATcaaGTCTAcaaGTTCCCCTACCAATACCACACCCATcaaTc aaTTcaaACCcaaGGATTTTATTTATTTcaaCAACATCcaaCCTATCAGTTTACTCCATTCTAcaaTCAGGA caaGCCTCAAACTCCAAATACTCACTATTCGGAGCTTTACGAGCCGcaaCCCAAACAATTTCATAcaaAGca aCCAcaaCTATTATCCTTTTATCAGTTCTATcaacaaATGGATCCTACTCTCTACAAACACTTATTACAACC CAAGAACACATAcaaTTACTTCTGCCAGCCcaaCCCAcaaCCAcaaTAcaaTTTATCTCAACCCcaaCAGAA ACAAACCGGGCCCCCTTCGACCcaaCAGAAGGAGAATCAGAATcaaTATCAGGGTTcaaCGcaaAATACGCA GCCGGCCCATTCGCGTTATTCTTTAcaaCAGAGTACACcaaCATTATTCcaacaaACGCCCcaaCAACTATT ATCTTCCcaaGACCCCTATACTATATCAATTTACCAGAACTCTACTCAACcaaCTTCAcaacaaAAGCTCTA CTACTATCATCAACATTCCTATGGATCCGAGCATCTTATCCACGCTTCCGTTACGATCAACTTATACATCTT CTAcaaAAAAACTTTCTACCCCcaaCACcaaCATTATGTATGcaaCATATTTCTTTACCAATTTTTACAGCG GGAGTACCACCATACATAcaa But my 
question is: it does not occur at a codon position (say, the position of
the third one is not a multiple of 3). Why does it affect the result?

Also, there is code to filter out the stop codon in the sample code (shown
below):

///////////////////////////////
if( $pseq =~ /\*/ && $pseq !~ /\*$/ ) {
    warn("provided a CDS sequence with a stop codon, PAML will choke!");
    exit(0);
}
# Tcoffee can't handle '*' even if it is trailing
$pseq =~ s/\*//g;
/////////////////////////////

So, when translating back from aa_aln to dna_aln, there should be no stop
codon included. So why can it not pass?

Thanks for any answer!

P.S.: my code is attached here:

/////////////////////////////////////////////////////////
#!/usr/bin/perl -w
use strict;
use Bio::Tools::Run::Phylo::PAML::Codeml;
use Bio::Tools::Run::Alignment::Clustalw;

# for projecting alignments from protein to R/DNA space
use Bio::Align::Utilities qw(aa_to_dna_aln);

# for input of the sequence data
use Bio::SeqIO;
use Bio::AlignIO;

my $aln_factory = Bio::Tools::Run::Alignment::Clustalw->new('quiet'=>1);
my $seqdata = shift || 'test.fa';
my $seqio = new Bio::SeqIO(-file => $seqdata, -format => 'fasta');
my %seqs;
my @prots;

# process each sequence
while ( my $seq = $seqio->next_seq ) {
    $seqs{$seq->display_id} = $seq;
    # translate them into protein
    my $protein = $seq->translate();
    my $pseq = $protein->seq();
    if( $pseq =~ /\*/ && $pseq !~ /\*$/ ) {
        warn("provided a CDS sequence with a stop codon, PAML will choke!");
        exit(0);
    }
    # Tcoffee can't handle '*' even if it is trailing
    $pseq =~ s/\*//g;
    $protein->seq($pseq);
    push @prots, $protein;
}

if( @prots < 2 ) {
    warn("Need at least 2 CDS sequences to proceed");
    exit(0);
}

# open(OUT, ">align_output.txt") || die("cannot open output align_output for writing");

# Align the sequences with clustalw
my $aa_aln = $aln_factory->align(\@prots);

# project the protein alignment back to CDS coordinates
my $dna_aln = aa_to_dna_aln($aa_aln, \%seqs);
my @each = $dna_aln->each_seq();

my $kaks_factory =
    Bio::Tools::Run::Phylo::PAML::Codeml->new
    ( -params => { 'runmode' => -2,
                   'seqtype' => 1,
                 },
      -save_tempfiles => 1,
      -verbose => 1);

# set the alignment object
$kaks_factory->alignment($dna_aln);

# run the KaKs analysis
my ($rc,$parser) = $kaks_factory->run();
my $result = $parser->next_result;
my $MLmatrix = $result->get_MLmatrix();
my @otus = $result->get_seqs();

# this gives us a mapping from the PAML order of sequences back to
# the input order (since names get truncated)
my @pos = map {
    my $c = 1;
    foreach my $s ( @each ) {
        last if( $s->display_id eq $_->display_id );
        $c++;
    }
    $c;
} @otus;

print join("\t", qw(SEQ1 SEQ2 Ka Ks Ka/Ks PROT_PERCENTID CDNA_PERCENTID)),"\n";
for( my $i = 0; $i < (scalar @otus -1) ; $i++) {
    for( my $j = $i+1; $j < (scalar @otus); $j++ ) {
        my $sub_aa_aln = $aa_aln->select_noncont($pos[$i],$pos[$j]);
        my $sub_dna_aln = $dna_aln->select_noncont($pos[$i],$pos[$j]);
        print join("\t",
                   $otus[$i]->display_id,
                   $otus[$j]->display_id,
                   $MLmatrix->[$i]->[$j]->{'dN'},
                   $MLmatrix->[$i]->[$j]->{'dS'},
                   $MLmatrix->[$i]->[$j]->{'omega'},
                   sprintf("%.2f",$sub_aa_aln->percentage_identity),
                   sprintf("%.2f",$sub_dna_aln->percentage_identity),
                  ), "\n";
    }
}

--
Xianjun Dong
PhD Student
Computational Biology Unit
Bergen Center for Computational Science
University of Bergen
Høyteknologisenteret, Thormøhlensgate 55
N-5008 Bergen, Norway.
Webpage: http://www.ii.uib.no/~xianjund/
MSN: sterding at hotmail.com
Phone No: +47 - 55584354 (office)
          +47 - 47361688 (mobile)
Fax No: +47 - 55584295

_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l

From nabil at broad.mit.edu Mon Jul 31 14:57:48 2006
From: nabil at broad.mit.edu (Nabil Hafez)
Date: Mon, 31 Jul 2006 14:57:48 -0400
Subject: [Bioperl-l] Strangeness in parsing blast file
In-Reply-To: <44CC3550.5070105@broad.mit.edu>
References: <44CC3550.5070105@broad.mit.edu>
Message-ID: <44CE52AC.4080108@broad.mit.edu>

I have figured out the problem - not a problem with Bioperl.

In my create_sample_file() subroutine I defined

    $/ = '>'; #define fasta record input separator

when it should have been this

    local $/ = "\n>";

The use of local made a big difference.

Thanks to all for your help.
Nabil Hafez

Nabil Hafez wrote:
> Hi,
> I am having a somewhat similar problem to what was posted in
> http://bioperl.org/pipermail/bioperl-l/2006-May/021416.html
> however, I have read through all of that thread and I don't believe what
> I am experiencing is the exact same problem. I also realize that
> Bioperl version 1.5.1 fixes a problem with blast parsing.
>
> My problem:
> My blastresults file parses fine and everything works swimmingly if I
> pass the blast output file by name, such as
>     $blast_result = 'test.blastout';
>
> however when I do
>     $blast_result = &do_blast($sample_fasta);
>
> even though in both cases $blast_result evaluates to "test.blastout", the
> parsing doesn't work; more specifically, it gets an undefined value for
> $result in
>     while( my $result = $report_obj->next_result ) {
>
> Sorry for the long email - any help would be appreciated,
> Thanks - Nabil
>
>
> The code... non-relevant parts trimmed for size constraints... debugging
> from working and non-working versions after the code.
>
> use strict;
> use Bio::SearchIO;
> use Getopt::Std;
> use List::Util qw(shuffle);
> use Benchmark;
>
> my ($inputfile, $samplefile, $blastfile, $blast_result, $blast_report,
>     $blast_verbose); #files generated
>
>
> #------------------#
> # Subroutine Calls #
> #------------------#
>
> my $test = &create_sample_file($inputfile); #inputfile being a fasta file containing nucleotide sequence
> $blast_result = &do_blast($test);
> ##$blast_result = 'test.blastout'; #when this is uncommented and replaces the previous two lines, with test.blastout being normal blast output, the script works fine.
> &parse_blast($blast_result);
>
>
> #######################
> # create_sample_file
> #
> # Input: Original Fasta File
> #
> # Output: Fasta file containing randomly sampled reads
> #
> #
> sub create_sample_file {
>     my $in = shift;
>     my $linecount = 0;
>     my @lines;
>
>     $samplefile = $in . "_sample";
>
>     #Determine total # of reads in input fasta
>     $totalreads = `$grep -c '>' $inputfile`;
>     $totalreads =~ s/\s+//;
>     chomp $totalreads;
>
>     if ($totalreads > 1000) { #sample if more than 1000 reads
>         $sample_reads = sprintf("%.0f", $totalreads * ($per_to_sample/100)); #number of reads to sample
>     }
>     else { #otherwise use all reads
>         $sample_reads = $totalreads;
>     }
>
>     $/ = '>'; #define fasta record input separator
>
>     open (IN, "<$in") or die "Cannot open $in $!\n";
>     open (OUT, ">$samplefile") or die "Cannot open $samplefile $!\n";
>
>
>     while (<IN>) { #read lines into an array
>         chomp;
>         push (@lines, $_);
>     }
>
>     @lines = shuffle(@lines); #shuffle array
>     foreach (@lines) {
>         print OUT ">$_" if $linecount <= $sample_reads; #output to file sampled number of reads
>         $linecount++;
>     }
>
>     close IN;
>     close OUT;
>
>     return $samplefile;
>
> }#end create_sample_file
>
>
> #######################
> # do_blast
> #
> # Input: Fasta File containing SCREENED sampled reads
> #
> # Output: Blast File
> #
> #
>
> sub do_blast {
>     my $bf = shift;
>     my $blastoutput = $bf .
".blastout"; > > print "Blasting against $db ...\n"; > > `blast/blast-2.2.10/bin/megablast -d /prodinfo/proddata_ntblastdb/nt > -e 1e-50 -p 99 -D 2 -i test -o test.blastout`; > > return $blastoutput; > > }#end do_blast > > > > ####################### > # parse_blast > # > # Input: Blast file > # > # Output: Creates hash containing best hit for each read > # > # > > sub parse_blast { > my $blastoutfile = shift; > > if (! -e $blastoutfile) { > die "$blastoutfile does not exist $!\n"; > } > > print "Parsing blast hits ...\n"; > > > my $report_obj = new Bio::SearchIO(-verbose => 1, > -format => 'blast', > -file => $blastoutfile); > > > die "no valid $report_obj" unless defined $report_obj; > > > while( my $result = $report_obj->next_result ) { > die "no valid $result" unless defined $result; > while( my $hit = $result->next_hit ) { > while( my $hsp = $hit->next_hsp ) { > my $name = $result->query_name; > my $hitDesc = $hit->description; > my $length = $hsp->length('total'); > my $per_id = sprintf("%.2f", $hsp->percent_identity); > my $eval = $hsp->evalue; > next if (defined $blast_results{$name} && > $blast_results{$name}->[0] > $length); #only keep best hit for any read > $blast_results{$name} = [$length, $per_id, $eval, $hitDesc]; > #store in hash of arrays > } > } > } > > } #end parse_blast > > > > > > Debug of NON-working blast-parse: > > main::(454/scripts/fasta_blasta_mb.pl:151): > 151: my $sample_fasta = &create_sample_file($inputfile); > DB<2> n > main::(454/scripts/fasta_blasta_mb.pl:152): > 152: $blast_result = &do_blast($sample_fasta); > DB<2> x $sample_fasta > 0 'G782.2005-08-16-16-48.fasta_sample' > DB<3> n > Blasting against NT ... > main::(454/scripts/fasta_blasta_mb.pl:154): > 154: &parse_blast($blast_result); > DB<3> s > main::parse_blast(454/scripts/fasta_blasta_mb.pl:293): > 293: my $blastoutfile = shift; > DB<3> s > main::parse_blast(454/scripts/fasta_blasta_mb.pl:295): > 295: if (! 
-e $blastoutfile) { > DB<3> x $blastoutfile > 0 'G782.2005-08-16-16-48.fasta_sample.blastout' > DB<4> s > main::parse_blast(454/scripts/fasta_blasta_mb.pl:299): > 299: print "Parsing blast hits ...\n"; > DB<4> s > Parsing blast hits ... > main::parse_blast(454/scripts/fasta_blasta_mb.pl:302): > 302: my $report_obj = new Bio::SearchIO(-verbose => 1, > 303: -format => 'blast', > 304: -file => > $blastoutfile);#or die "Could not open blast report $!"; > DB<4> s > Bio::SearchIO::new(/util/lib/perl5/site_perl/5.8.0/Bio/SearchIO.pm:129): > 129: my($caller, at args) = @_; > DB<4> r > scalar context return from Bio::SearchIO::new: '_file' => > 'G782.2005-08-16-16-48.fasta_sample.blastout' > '_filehandle' => GLOB(0x8cef40c) > -> *Symbol::GEN1 > FileHandle({*Symbol::GEN1}) => fileno(3) > '_flush_on_write' => 1 > '_handler' => > Bio::SearchIO::IteratedSearchResultEventBuilder=HASH(0x9505724) > '_factories' => HASH(0x95054c0) > 'hit' => Bio::Factory::ObjectFactory=HASH(0x95017b8) > '_loaded_types' => HASH(0x9506c0c) > 'Bio::Search::Hit::BlastHit' => 1 > '_root_verbose' => 0 > 'interface' => 'Bio::Search::Hit::HitI' > 'type' => 'Bio::Search::Hit::BlastHit' > 'hsp' => Bio::Factory::ObjectFactory=HASH(0x9500e10) > '_loaded_types' => HASH(0x9506c18) > 'Bio::Search::HSP::GenericHSP' => 1 > '_root_verbose' => 0 > 'interface' => 'Bio::Search::HSP::HSPI' > 'type' => 'Bio::Search::HSP::GenericHSP' > 'iteration' => Bio::Factory::ObjectFactory=HASH(0x9506c60) > '_loaded_types' => HASH(0x9506af8) > 'Bio::Search::Iteration::GenericIteration' => 1 > '_root_verbose' => 0 > 'interface' => 'Bio::Search::Iteration::IterationI' > 'type' => 'Bio::Search::Iteration::GenericIteration' > 'result' => Bio::Factory::ObjectFactory=HASH(0x9504c80) > '_loaded_types' => HASH(0x9501f74) > 'Bio::Search::Result::BlastResult' => 1 > '_root_verbose' => 0 > 'interface' => 'Bio::Search::Result::ResultI' > 'type' => 'Bio::Search::Result::BlastResult' > '_inclusion_threshold' => 0.001 > '_root_verbose' => 1 > 
'_handler_cache' => > Bio::SearchIO::IteratedSearchResultEventBuilder=HASH(0x9505724) > -> REUSED_ADDRESS > '_notfirsttime' => 0 > '_reporttype' => '' > '_root_cleanup_methods' => ARRAY(0x8cde434) > 0 CODE(0x82a9aec) > -> &Bio::Root::IO::_io_cleanup in > /util/lib/perl5/site_perl/5.8.0/Bio/Root/IO.pm:544-570 > 1 CODE(0x82a9aec) > -> REUSED_ADDRESS > '_root_verbose' => 1 > main::parse_blast(454/scripts/fasta_blasta_mb.pl:307): > 307: die "no valid $report_obj" unless defined $report_obj; > DB<4> s > main::parse_blast(454/scripts/fasta_blasta_mb.pl:310): > 310: while( my $result = $report_obj->next_result ) { > DB<4> s > Bio::SearchIO::blast::next_result(/util/lib/perl5/site_perl/5.8.0/Bio/SearchIO/blast.pm:389): > 389: my ($self) = @_; > DB<4> r > scalar context return from Bio::SearchIO::blast::next_result: undef > Bio::SearchIO::DESTROY(/util/lib/perl5/site_perl/5.8.0/Bio/SearchIO.pm:438): > 438: my $self = shift; > DB<4> r > scalar context return from Bio::SearchIO::DESTROY: '' > Bio::Root::Root::DESTROY(/util/lib/perl5/site_perl/5.8.0/Bio/Root/Root.pm:404): > 404: my $self = shift; > DB<4> r > scalar context return from Bio::Root::Root::DESTROY: undef > Bio::Root::Root::DESTROY(/util/lib/perl5/site_perl/5.8.0/Bio/Root/Root.pm:404): > 404: my $self = shift; > DB<4> r > scalar context return from Bio::Root::Root::DESTROY: undef > Bio::Root::Root::DESTROY(/util/lib/perl5/site_perl/5.8.0/Bio/Root/Root.pm:404): > 404: my $self = shift; > DB<4> r > scalar context return from Bio::Root::Root::DESTROY: undef > Bio::Root::Root::DESTROY(/util/lib/perl5/site_perl/5.8.0/Bio/Root/Root.pm:404): > 404: my $self = shift; > DB<4> r > scalar context return from Bio::Root::Root::DESTROY: undef > Bio::Root::Root::DESTROY(/util/lib/perl5/site_perl/5.8.0/Bio/Root/Root.pm:404): > 404: my $self = shift; > DB<4> r > scalar context return from Bio::Root::Root::DESTROY: undef > main::(454/scripts/fasta_blasta_mb.pl:155): > 155: &output_results(); > DB<4> x $result > 0 undef > > > > Debug 
of WORKING blast-parse: > Parsing blast hits ... > main::parse_blast(454/scripts/fasta_blasta_mb.pl:302): > 302: my $report_obj = new Bio::SearchIO(-verbose => 1, > 303: -format => 'blast', > 304: -file => > $blastoutfile);#or die "Could not open blast report $!"; > DB<3> s > Bio::SearchIO::new(/util/lib/perl5/site_perl/5.8.0/Bio/SearchIO.pm:129): > 129: my($caller, at args) = @_; > DB<3> r > scalar context return from Bio::SearchIO::new: '_file' => > 'G782.2005-08-16-16-48.fasta_sample.blastout' > '_filehandle' => GLOB(0x8763100) > -> *Symbol::GEN1 > FileHandle({*Symbol::GEN1}) => fileno(3) > '_flush_on_write' => 1 > '_handler' => > Bio::SearchIO::IteratedSearchResultEventBuilder=HASH(0x8ab3be4) > '_factories' => HASH(0x8ab1594) > 'hit' => Bio::Factory::ObjectFactory=HASH(0x8a7b7c0) > '_loaded_types' => HASH(0x8abee10) > 'Bio::Search::Hit::BlastHit' => 1 > '_root_verbose' => 0 > 'interface' => 'Bio::Search::Hit::HitI' > 'type' => 'Bio::Search::Hit::BlastHit' > 'hsp' => Bio::Factory::ObjectFactory=HASH(0x8a87200) > '_loaded_types' => HASH(0x8abee1c) > 'Bio::Search::HSP::GenericHSP' => 1 > '_root_verbose' => 0 > 'interface' => 'Bio::Search::HSP::HSPI' > 'type' => 'Bio::Search::HSP::GenericHSP' > 'iteration' => Bio::Factory::ObjectFactory=HASH(0x8abee64) > '_loaded_types' => HASH(0x8abecfc) > 'Bio::Search::Iteration::GenericIteration' => 1 > '_root_verbose' => 0 > 'interface' => 'Bio::Search::Iteration::IterationI' > 'type' => 'Bio::Search::Iteration::GenericIteration' > 'result' => Bio::Factory::ObjectFactory=HASH(0x8a81a84) > '_loaded_types' => HASH(0x8a96ce8) > 'Bio::Search::Result::BlastResult' => 1 > '_root_verbose' => 0 > 'interface' => 'Bio::Search::Result::ResultI' > 'type' => 'Bio::Search::Result::BlastResult' > '_inclusion_threshold' => 0.001 > '_root_verbose' => 1 > '_handler_cache' => > Bio::SearchIO::IteratedSearchResultEventBuilder=HASH(0x8ab3be4) > -> REUSED_ADDRESS > '_notfirsttime' => 0 > '_reporttype' => '' > '_root_cleanup_methods' => 
ARRAY(0x8762efc) > 0 CODE(0x82a9aec) > -> &Bio::Root::IO::_io_cleanup in > /util/lib/perl5/site_perl/5.8.0/Bio/Root/IO.pm:544-570 > 1 CODE(0x82a9aec) > -> REUSED_ADDRESS > '_root_verbose' => 1 > main::parse_blast(454/scripts/fasta_blasta_mb.pl:307): > 307: die "no valid $report_obj" unless defined $report_obj; > DB<3> s > main::parse_blast(454/scripts/fasta_blasta_mb.pl:310): > 310: while( my $result = $report_obj->next_result ) { > DB<3> s > Bio::SearchIO::blast::next_result(/util/lib/perl5/site_perl/5.8.0/Bio/SearchIO/blast.pm:389): > 389: my ($self) = @_; > DB<3> r > blast.pm: unrecognized line Reference: Zheng Zhang, Scott Schwartz, > Lukas Wagner, and Webb Miller (2000), > blast.pm: unrecognized line "A greedy algorithm for aligning DNA > sequences", > blast.pm: unrecognized line J Comput Biol 2000; 7(1-2):203-14. > blast.pm: unrecognized > line > Score E > Got NCBI HSP score=354, evalue 0.0 > scalar context return from Bio::SearchIO::blast::next_result: > '_algorithm' => 'MEGABLAST' > '_algorithm_version' => '2.2.10 [Oct-19-2004]' > '_dbentries' => 4249067 > '_dbletters' => 17735149364 > '_dbname' => 'All GenBank+EMBL+DDBJ+PDB sequences (but no EST, > STS,GSS,environmental samples or phase 0, 1 or 2 HTGS sequences) ' > '_hitindex' => 0 > '_hits' => ARRAY(0x8b2acd0) > empty array > '_inclusion_threshold' => 0.001 > '_iteration_count' => 1 > '_iteration_index' => 0 > '_iterations' => ARRAY(0x8b2ac4c) > 0 Bio::Search::Iteration::GenericIteration=HASH(0x8b1cacc) > '_newhits_below_threshold' => ARRAY(0x8b1ca84) > 0 Bio::Search::Hit::BlastHit=HASH(0x8b1cf64) > '_accession' => 'AE004091' > '_algorithm' => 'MEGABLAST' > '_description' => 'Pseudomonas aeruginosa PAO1, complete genome' > '_hsps' => ARRAY(0x8b1ceb0) > 0 Bio::Search::HSP::GenericHSP=HASH(0x8b2098c) > '_algorithm' => 'MEGABLAST' > '_frac_conserved' => HASH(0x8b266a0) > 'hit' => 0.991803278688525 > 'query' => 0.991803278688525 > 'total' => 0.991803278688525 > '_frac_identical' => HASH(0x8b2658c) > 'hit' => 
0.991803278688525 > 'query' => 0.991803278688525 > 'total' => 0.991803278688525 > '_gaps' => HASH(0x8b24d94) > 'hit' => 0 > 'query' => 0 > 'total' => 0 > '_gsf_tag_hash' => HASH(0x8b20998) > empty hash > '_hit_string' => > 'cctgacctccgctcaactgcgcaaatacgccagcgccggtcggccgttccccgaagggcgcctgctggccgcctcctgccacgacgcggaggaactggccctggctgcctcgatgggagtggagttcgtcaccctttcgccggtacagccgaccgagagccatcccggcgagccggcgctgggttgggacaaggccgccgaactgatcgccggcttcaaccagccggtctacctgctgggtggcctcggtccgcagcaagccgagcaggcttgggagcatggagcccagggcgtggcgggtatccgtgcgttctggccgggcggcctttgacggtggaatgaagaaaaaaggaggcttcggcctcc' > '_homology_string' => > '|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| > ||||||||||||||||||||||||||||||||||||||||| > ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| > |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||' > etc...... > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From andreo_beck at yahoo.com Mon Jul 31 22:59:30 2006 From: andreo_beck at yahoo.com (Andreo Beck) Date: Mon, 31 Jul 2006 19:59:30 -0700 (PDT) Subject: [Bioperl-l] $hit_object->frac_aligned_hit/$hit_object->frac_aligned_query Message-ID: <20060801025930.96806.qmail@web55705.mail.re3.yahoo.com> Hi, Can $hit_object->frac_aligned_hit or $hit_object->frac_aligned_query give outputs > 1 ? I get some > 1 values. Does using the parentheses (e.g. $hit_object->frac_aligned_hit()) make any difference? Thanks, Andy --------------------------------- Talk is cheap. Use Yahoo! Messenger to make PC-to-Phone calls. Great rates starting at 1?/min. 
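On the record-separator fix Nabil Hafez reports in the "Strangeness in
parsing blast file" thread above: a plain assignment to $/ stays in effect
for the rest of the program, so any later code that reads a file line by
line sees '>'-delimited chunks instead of lines, which is presumably why
the BLAST report parser returned nothing. The sketch below is only an
illustration of that scoping behaviour, not Nabil's script. The sub names
(read_fasta_records, count_lines) and the two-record input are made up for
the example.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Read FASTA records with a temporary record separator. 'local' restores
# the previous value of $/ as soon as this sub returns.
sub read_fasta_records {
    my ($path) = @_;
    local $/ = "\n>";
    open(my $in, '<', $path) or die "Cannot open $path: $!\n";
    my @records;
    while (my $rec = <$in>) {
        chomp $rec;            # strips a trailing "\n>" if present
        $rec =~ s/^>//;        # the first record keeps its leading '>'
        $rec =~ s/\n\z//;      # the last record ends in a bare newline
        push @records, $rec if length $rec;
    }
    close $in;
    return @records;
}

# Count "lines" using whatever value of $/ is currently in effect.
sub count_lines {
    my ($path) = @_;
    open(my $in, '<', $path) or die "Cannot open $path: $!\n";
    my $n = 0;
    while (my $line = <$in>) { $n++ }
    close $in;
    return $n;
}

# Hypothetical two-record input written to a temporary file.
my ($fh, $fasta) = tempfile(UNLINK => 1);
print {$fh} ">read1\nACGT\nTTGA\n>read2\nGGCC\n";
close $fh;

my @records = read_fasta_records($fasta);
print scalar(@records), " records read\n";                        # 2
print "separator restored\n" if $/ eq "\n";
print count_lines($fasta), " lines with the default separator\n"; # 5

# The original bug: without 'local' the change leaks to every later read.
$/ = '>';
print count_lines($fasta), " chunks after a global assignment\n"; # 3
$/ = "\n";                                                        # undo
```

Wrapping the assignment in a sub or a bare block with local keeps the
change confined to the code that needs it.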
From andreo_beck at yahoo.com Mon Jul 31 22:56:45 2006
From: andreo_beck at yahoo.com (Andreo Beck)
Date: Mon, 31 Jul 2006 19:56:45 -0700 (PDT)
Subject: [Bioperl-l] $hit_object->frac_aligned_hit/$hit_object->frac_aligned_query
Message-ID: <20060801025645.12106.qmail@web55703.mail.re3.yahoo.com>

Hi,

Can $hit_object->frac_aligned_hit or $hit_object->frac_aligned_query
give outputs > 1 ? I get them.

Does using the parentheses (e.g. $hit_object->frac_aligned_hit()) make
any difference?

Thanks,
Andy
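On Andreo Beck's two questions about frac_aligned_hit and
frac_aligned_query: in Perl, a method call with no arguments behaves the
same with or without the trailing parentheses, so that part should make no
difference. As for values above 1, one way an aligned fraction can exceed
1 is when overlapping HSP ranges are summed without being merged first.
That is offered as a guess, not as a description of what BioPerl does
internally. The sketch below uses a hypothetical Toy::Hit class (not a
BioPerl class) with made-up ranges to show both points.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a hit object; NOT a BioPerl class.
package Toy::Hit;

sub new {
    my ($class, %args) = @_;
    return bless {%args}, $class;
}

# Naive fraction: sum of HSP lengths over sequence length.
# Overlapping HSPs are counted twice, so the result can exceed 1.
sub frac_naive {
    my ($self) = @_;
    my $sum = 0;
    $sum += $_->[1] - $_->[0] + 1 for @{ $self->{hsps} };
    return $sum / $self->{length};
}

# Tiled fraction: merge overlapping ranges before counting, so each
# position contributes at most once and the result stays at or below 1.
sub frac_tiled {
    my ($self) = @_;
    my @sorted = sort { $a->[0] <=> $b->[0] } @{ $self->{hsps} };
    my ($covered, $end) = (0, 0);
    for my $range (@sorted) {
        my ($s, $e) = @$range;
        next if $e <= $end;             # fully inside what is covered
        $s = $end + 1 if $s <= $end;    # trim the overlapping part
        $covered += $e - $s + 1;
        $end = $e;
    }
    return $covered / $self->{length};
}

package main;

# Made-up data: two HSPs overlapping on positions 41..80 of a
# 100-residue sequence.
my $hit = Toy::Hit->new(length => 100, hsps => [ [1, 80], [41, 100] ]);

printf "naive: %.2f\n", $hit->frac_naive;     # 1.40
printf "tiled: %.2f\n", $hit->frac_tiled();   # 1.00
print "parentheses make no difference\n"
    if $hit->frac_naive == $hit->frac_naive();
```

If the real values look wrong, printing the start and end of each HSP for
the hit is a quick way to see whether overlapping ranges are involved.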
From bix at sendu.me.uk Wed Jul 5 08:37:34 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Wed, 05 Jul 2006 13:37:34 +0100 Subject: [Bioperl-l] checkout_all fails on biodata Message-ID: <44ABB28E.2000203@sendu.me.uk> I'm doing: cvs -d:ext:sendu at dev.open-bio.org:/home/repository/bioperl co bioperl_all to check out all the bioperl packages at once. However it only checks out core, db, pedigree, pipeline and run before failing on biodata: cvs checkout: Updating biodata cvs checkout: failed to create lock directory for `/home/repository/bioperl/biodata' (/home/repository/bioperl/biodata/#cvs.lock): Permission denied cvs checkout: failed to obtain dir lock in repository `/home/repository/bioperl/biodata' cvs [checkout aborted]: read lock failed - giving up This failure is consistent for me (had it multiple times, different days, never worked). Biodata isn't even mentioned as a possible package at http://bioperl.org/wiki/Using_CVS. What is it? Could it be moved to the end of the alias list so it is checked out last, letting all the other packages be checked out before failure? PS. neither biodata nor pipeline are mentioned as a package on that wiki page or at http://bioperl.org/wiki/Category:BioPerl_Packages. Are there yet more packages? Cheers, Sendu. From hlapp at gmx.net Wed Jul 5 08:55:42 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Wed, 5 Jul 2006 08:55:42 -0400 Subject: [Bioperl-l] checkout_all fails on biodata In-Reply-To: <44ABB28E.2000203@sendu.me.uk> References: <44ABB28E.2000203@sendu.me.uk> Message-ID: <9C3FE120-6B3D-4C81-99E1-921E81926CCA@gmx.net> Should have been fixed - I can cvs update. did you try again? On Jul 5, 2006, at 8:37 AM, Sendu Bala wrote: > I'm doing: > > cvs -d:ext:sendu at dev.open-bio.org:/home/repository/bioperl co > bioperl_all > > to check out all the bioperl packages at once. 
However it only checks > out core, db, pedigree, pipeline and run before failing on biodata: > > cvs checkout: Updating biodata > cvs checkout: failed to create lock directory for > `/home/repository/bioperl/biodata' > (/home/repository/bioperl/biodata/#cvs.lock): Permission denied > cvs checkout: failed to obtain dir lock in repository > `/home/repository/bioperl/biodata' > cvs [checkout aborted]: read lock failed - giving up > > This failure is consistent for me (had it multiple times, different > days, never worked). > > Biodata isn't even mentioned as a possible package at > http://bioperl.org/wiki/Using_CVS. What is it? Could it be moved to > the > end of the alias list so it is checked out last, letting all the other > packages be checked out before failure? > > PS. neither biodata nor pipeline are mentioned as a package on that > wiki > page or at http://bioperl.org/wiki/Category:BioPerl_Packages. Are > there > yet more packages? > > Cheers, > Sendu. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From bix at sendu.me.uk Wed Jul 5 09:03:50 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Wed, 05 Jul 2006 14:03:50 +0100 Subject: [Bioperl-l] checkout_all fails on biodata In-Reply-To: <9C3FE120-6B3D-4C81-99E1-921E81926CCA@gmx.net> References: <44ABB28E.2000203@sendu.me.uk> <9C3FE120-6B3D-4C81-99E1-921E81926CCA@gmx.net> Message-ID: <44ABB8B6.5040707@sendu.me.uk> Hilmar Lapp wrote: > Should have been fixed - I can cvs update. did you try again? Still doesn't work, no change. I can manually check out the other packages, I just can't do it with bioperl_all alias. 
co bioperl-biodata fails because: cvs server: cannot find module `bioperl-biodata' - ignored cvs [checkout aborted]: cannot expand modules (not that I want it - if its no longer a bioperl package can it be removed from the alias?) From hlapp at gmx.net Wed Jul 5 09:41:27 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Wed, 5 Jul 2006 09:41:27 -0400 Subject: [Bioperl-l] checkout_all fails on biodata In-Reply-To: <44ABB8B6.5040707@sendu.me.uk> References: <44ABB28E.2000203@sendu.me.uk> <9C3FE120-6B3D-4C81-99E1-921E81926CCA@gmx.net> <44ABB8B6.5040707@sendu.me.uk> Message-ID: The idea was once that Bioperl, Biojava, etc had all those unit tests that use specific sample data which take up quite a bit of space. Unifying the largely redundant test data into a single shared repository would save quite a bit of space and therefore download/ update time for people who work on/use more than one Bio* project. However, this was never fully implemented AFAIK. I.e., you don't need biodata. I guess it could be removed from the alias since it's not integrated anyway. Any other opinions? I also forwarded your report to root-l as I couldn't find the offending (stale) lock file. -hilmar On Jul 5, 2006, at 9:03 AM, Sendu Bala wrote: > Hilmar Lapp wrote: >> Should have been fixed - I can cvs update. did you try again? > > Still doesn't work, no change. I can manually check out the other > packages, I just can't do it with bioperl_all alias. > > co bioperl-biodata fails because: > cvs server: cannot find module `bioperl-biodata' - ignored > cvs [checkout aborted]: cannot expand modules > > (not that I want it - if its no longer a bioperl package can it be > removed from the alias?) 
> _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From cjfields at uiuc.edu Wed Jul 5 09:48:03 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Wed, 5 Jul 2006 08:48:03 -0500 Subject: [Bioperl-l] checkout_all fails on biodata In-Reply-To: <44ABB8B6.5040707@sendu.me.uk> Message-ID: <000f01c6a039$a7a24f10$15327e82@pyrimidine> Bioperl-data was a directory started up a few years ago to hold various data files for testing and as examples (BLAST file examples, GenBank files, etc), somewhat like the t/data directory but cleaned up a bit more. It hasn't been updated in a while. Regardless, you should be able to check it out. As for the problem, looks like Hilmar's checking up on a possible lock file issue. Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Sendu Bala > Sent: Wednesday, July 05, 2006 8:04 AM > To: bioperl-l at lists.open-bio.org > Subject: Re: [Bioperl-l] checkout_all fails on biodata > > Hilmar Lapp wrote: > > Should have been fixed - I can cvs update. did you try again? > > Still doesn't work, no change. I can manually check out the other > packages, I just can't do it with bioperl_all alias. > > co bioperl-biodata fails because: > cvs server: cannot find module `bioperl-biodata' - ignored > cvs [checkout aborted]: cannot expand modules > > (not that I want it - if its no longer a bioperl package can it be > removed from the alias?) 
> _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Wed Jul 5 11:06:30 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Wed, 5 Jul 2006 10:06:30 -0500 Subject: [Bioperl-l] checkout_all fails on biodata In-Reply-To: Message-ID: <001901c6a044$999a14b0$15327e82@pyrimidine> I use TortoiseCVS via WinXP and I'm getting the same issue that Sendu has: --------------------------- In C:\Perl\src: "C:\Program Files\TortoiseCVS\cvs.exe" "-q" "--lf" "checkout" "-P" "bioperl_all" CVSROOT=:ext:cjfields at dev.open-bio.org:/home/repository/bioperl ... cvs checkout: failed to create lock directory for `/home/repository/bioperl/biodata' (/home/repository/bioperl/biodata/#cvs.lock): Permission denied cvs checkout: failed to obtain dir lock in repository `/home/repository/bioperl/biodata' cvs [checkout aborted]: read lock failed - giving up cvs.exe checkout: in directory bioperl: cvs.exe checkout: cannot open CVS/Entries for reading: No such file or directory --------------------------- I had the same problem with schema (BioSQL) a while back. I tried again, and... --------------------------- cvs checkout: failed to create lock directory for `/home/repository/bioperl/biosql-schema' (/home/repository/bioperl/biosql-schema/#cvs.lock): Permission denied cvs checkout: failed to obtain dir lock in repository `/home/repository/bioperl/biosql-schema' cvs [checkout aborted]: read lock failed - giving up cvs.exe checkout: in directory .: cvs.exe checkout: cannot open CVS/Entries for reading: No such file or directory --------------------------- I believe it had something to do with CVS commit privileges (i.e. I had none for schema, which was fine). So maybe this is a permissions issue via the lock file? 
Looking at the alias:

bioperl_all -d bioperl &core &db &run &pipeline &pedigree &biodata &schema &network &microarray

This may mean that anyone w/o commit privs for any of the above
(specifically schema and biodata) who tries checkout/update using
bioperl_all may run into this problem. Since it's not integrated I don't
see a problem with removing it from the alias, but if we follow the same
line of logic (and privileges are the issue) then schema must be removed
as well. To me it doesn't make much sense to not include schema, though,
since we can checkout/update bioperl-db.

BTW, I like the idea of biodata as you've outlined it. It would be nice to
gear the test suite towards a more general set of data for all the Bio*
projects versus having each one come with its own, and the data could be
updated a bit more frequently than t/data is. Seems like it would
definitely save a large chunk of real estate for the distributions. Anyone
who wanted to run the full test suite would have to download biodata
separately, but that's not a bad compromise. Though, if this is/was its
intent, why would it need a lock file?

Chris

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of Hilmar Lapp
> Sent: Wednesday, July 05, 2006 8:41 AM
> To: Sendu Bala
> Cc: bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] checkout_all fails on biodata
>
> The idea was once that Bioperl, Biojava, etc had all those unit tests
> that use specific sample data which take up quite a bit of space.
> Unifying the largely redundant test data into a single shared
> repository would save quite a bit of space and therefore download/
> update time for people who work on/use more than one Bio* project.
>
> However, this was never fully implemented AFAIK. I.e., you don't need
> biodata. I guess it could be removed from the alias since it's not
> integrated anyway.
>
> Any other opinions?
> > I also forwarded your report to root-l as I couldn't find the > offending (stale) lock file. > > -hilmar > > On Jul 5, 2006, at 9:03 AM, Sendu Bala wrote: > > > Hilmar Lapp wrote: > >> Should have been fixed - I can cvs update. did you try again? > > > > Still doesn't work, no change. I can manually check out the other > > packages, I just can't do it with bioperl_all alias. > > > > co bioperl-biodata fails because: > > cvs server: cannot find module `bioperl-biodata' - ignored > > cvs [checkout aborted]: cannot expand modules > > > > (not that I want it - if its no longer a bioperl package can it be > > removed from the alias?) > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Wed Jul 5 11:36:33 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Wed, 5 Jul 2006 10:36:33 -0500 Subject: [Bioperl-l] Bio::SearchIO::hmmer hsp behaviour In-Reply-To: Message-ID: <001a01c6a048$cb802420$15327e82@pyrimidine> Okay, I managed to figure out what the problem was. I committed a fix in CVS for the initial bug (Selvi's missing hits). Still has one HSP per hit for now; I think it will take a bit more effort to get a BLAST-like multi HSP/hit up and running. Selvi, update from CVS to see if that works. 
Chris

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org
> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Chris Fields
> Sent: Friday, June 30, 2006 12:44 PM
> To: Sendu Bala; Jason Stajich
> Cc: bioperl-l at lists.open-bio.org list
> Subject: Re: [Bioperl-l] Bio::SearchIO::hmmer hsp behaviour
>
> I'll try looking at it this weekend. A suggested workaround is to
> either try setting -A for no alignments or setting it to a high
> number to retrieve all of them. It's pretty serious as the error
> silently dumps those domains, so those using automated annotation
> pipelines would miss it unless they are also checking the raw output.
>
> You could design a SearchIO::hmmpfam parser then expand it to take in
> hmmsearch output at a later point, or keep them separate. I like the
> idea of having modules that are more specific about what they parse;
> seems at some point you reach serious code bloat and maintenance
> becomes an issue. Look at SearchIO::blast; it parses various text
> BLAST output very well but with some serious obfuscation. Just don't
> know how productive it would be to separate out the PSI-BLAST and
> bl2seq stuff since they are pretty close to a standard BLAST
> report... oh well.
>
> To Jason: good luck on your move. Drop us a line here to let us
> know everything went well.
>
> Chris
>
> On Jun 30, 2006, at 11:14 AM, Sendu Bala wrote:
>
> > Chris Fields wrote:
> >> It may have been just simpler to have it be one HSP (domain) per Hit
> >> (model) as that's how the reports are generated. My reasoning was
> >> that using the one domain per model made sense based on what you are
> >> actually trying to do, which is annotate the sequence based on the
> >> order the domain appears. Most others may not view it that way,
> >> which is fine. One can always gather the relevant HSPs, convert to
> >> seqfeatures, then sort them if order is important, I suppose.
> >>
> >> I would say, if the overall consensus is to modify it to have
> >> multiple domain hits per model (similar to BLAST) then Sendu should
> >> go ahead and make those changes then announce it on the list so no
> >> one can gripe about it later. My main concern was not changing
> >> things so dramatically that it'll break for someone
> >
> > Going on your earlier suggestion, I was thinking about making
> > SearchIO::hmmpfam instead, which would get used if you set the
> > format to 'hmmpfam' instead of the generic 'hmmer' when making a
> > SearchIO. I suppose I would make a SearchIO::hmmsearch as well, if
> > necessary.
> >
> > [...]
> >> that the reported bug about missing hits (Bug 2036) is fixed as well.
> >
> > However, having never made a SearchIO plugin before, it will be some
> > time before I get my head around it. I'll want to make one the current
> > HOWTO:SearchIO way before I can think about doing it a better way
> > (hashes) as well. So I can say I'll make a move on this at some
> > point in the future, but if someone wants to fix Bug 2036 in the
> > meantime, they are welcome to. Again as suggested, my priority is
> > Bio::Map right now.
>
> Christopher Fields
> Postdoctoral Researcher
> Lab of Dr. Robert Switzer
> Dept of Biochemistry
> University of Illinois Urbana-Champaign

From arareko at campus.iztacala.unam.mx Wed Jul 5 11:38:14 2006
From: arareko at campus.iztacala.unam.mx (Mauricio Herrera Cuadra)
Date: Wed, 05 Jul 2006 10:38:14 -0500
Subject: [Bioperl-l] checkout_all fails on biodata
In-Reply-To: <001901c6a044$999a14b0$15327e82@pyrimidine>
References: <001901c6a044$999a14b0$15327e82@pyrimidine>
Message-ID: <44ABDCE6.7090906@campus.iztacala.unam.mx>

Same problem here. I've never used the bioperl_all alias before (I always
check out dirs individually), but to me it seems like a privileges issue,
as Chris suggests. I also browsed through the whole repository on
dev.open-bio.org and didn't find such a lock file. I guess Chris D. or
Jason will know better what's happening here.

Mauricio.

--
MAURICIO HERRERA CUADRA
arareko at campus.iztacala.unam.mx
Laboratorio de Genética
Unidad de Morfofisiología y Función
Facultad de Estudios Superiores Iztacala, UNAM

From bix at sendu.me.uk Thu Jul 6 04:41:57 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Thu, 06 Jul 2006 09:41:57 +0100
Subject: [Bioperl-l] Bio::Map changes
In-Reply-To: <449A9AF9.2000305@sendu.me.uk>
References: <44985915.8010607@sendu.me.uk> <449A9AF9.2000305@sendu.me.uk>
Message-ID: <44ACCCD5.3030309@sendu.me.uk>

Sendu Bala wrote:
> The next step is to tidy up all of Bio::Map*, which involves a major
> reimplementation of the whole system
[...]
> The reimplementation will make Position central to the model, allowing
> for lots of other things to work properly without anything becoming
> inconsistent (as is currently the case).

This is now done. It uses a new PositionHandler class behind the scenes.

The next step is to introduce relative positioning across the board,
possibly in a way that makes OrderedPosition redundant or an implementer
of the system.

Has anyone here ever used Bio::Map* modules for anything?
I would appreciate you sending me your code, especially if you've used
MapIO, Physical (encompassing Clone, Contig, FPCMarker,
OrderedPositionWithDistance) or LinkageMap (encompassing LinkagePosition,
OrderedPosition, Microsatellite) since these have insufficient tests at
the moment.

From nidage at yahoo.com Thu Jul 6 14:13:12 2006
From: nidage at yahoo.com (sss lll)
Date: Thu, 6 Jul 2006 11:13:12 -0700 (PDT)
Subject: [Bioperl-l] PrimarySeqI object Exception
Message-ID: <20060706181312.63581.qmail@web35607.mail.mud.yahoo.com>

Hi there,

I encountered a problem while calling module PrimarySeqI, with the
following code:

  my $db = Bio::DB::Fasta->new($fafile);
  my $obj = $db->get_Seq_by_id($array_gene_name[$p]);
  $seqio->write_seq($obj);

The error message was:

  MSG: Did not provide a valid Bio::PrimarySeqI object
  STACK Bio::SeqIO::fasta::write_seq
  /usr/lib/perl5/site_perl/5.8.0/Bio/SeqIO/fasta.pm:178

We think it had to do with the length of the gene name. For example, the
following gene name was a problem:

  gi|59711891|ref|YP_204667.1| NAD-specific glutamate dehydrogenase [Vibrio fischeri ES114]*VIB*COG2902*E

Any ideas on what happened?

Thanks

From rmb32 at cornell.edu Thu Jul 6 19:11:00 2006
From: rmb32 at cornell.edu (Robert Buels)
Date: Thu, 06 Jul 2006 16:11:00 -0700
Subject: [Bioperl-l] parser for GeneSeqer
In-Reply-To: <2FB066C7-12E6-46D8-8F4A-BD096BE2A0CA@uiuc.edu>
References: <44A558F2.2050304@cornell.edu> <2FB066C7-12E6-46D8-8F4A-BD096BE2A0CA@uiuc.edu>
Message-ID: <44AD9884.6040507@cornell.edu>

The Annotation/Annotatable stuff was going to be talked about at the
GMOD meeting that just happened, wasn't it? What's the scoop on that?

Rob

Chris Fields wrote:
> If you plan on generating seqfeatures from this output you could check
> out the Bio::Tools core modules for examples. There are a few there
> that take program output and convert them to Bio::SeqFeature::Generic
> objects, including Bio::Tools::RNAMotif and Bio::Tools::tRNAscanSE. If
> alignments are involved you might want something like
> Bio::SeqFeature::FeaturePair. Not sure about using the
> SeqFeature::Annotation or others; I thought that some of the
> Annotation/Annotatable stuff might be changing soon but I may be wrong.
>
> Chris
>
> On Jun 30, 2006, at 12:01 PM, Robert Buels wrote:
>
>> Hi all,
>>
>> I find myself needing a parser for GeneSeqer output, so I'm writing one
>> (which I will submit for your consideration when it's working). In a
>> nutshell, GeneSeqer is a (kind of old) program for aligning a bunch of
>> ESTs to genomic sequence, then using those alignments to predict where
>> in the genomic sequence the genes are. So really what you get from this
>> is a bunch of hierarchical features.
>>
>> I don't really know where I should put it in the bioperl hierarchy
>> though. Probably FeatureIO?
>>
>> And what's the current fashion for objects it should emit?
>> Bio::SeqFeature::Generic? Bio::SeqFeature::Annotated?
>>
>> Rob
>>
>> --Robert Buels
>> SGN Bioinformatics Analyst
>> 252A Emerson Hall, Cornell University
>> Ithaca, NY 14853
>> Tel: 503-889-8539
>> rmb32 at cornell.edu
>> http://www.sgn.cornell.edu
>
> Christopher Fields
> Postdoctoral Researcher
> Lab of Dr. Robert Switzer
> Dept of Biochemistry
> University of Illinois Urbana-Champaign

--
Robert Buels
SGN Bioinformatics Analyst
252A Emerson Hall, Cornell University
Ithaca, NY 14853
Tel: 503-889-8539
rmb32 at cornell.edu
http://www.sgn.cornell.edu

From hlapp at gmx.net Thu Jul 6 19:27:31 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Thu, 6 Jul 2006 19:27:31 -0400
Subject: [Bioperl-l] parser for GeneSeqer
In-Reply-To: <44AD9884.6040507@cornell.edu>
References: <44A558F2.2050304@cornell.edu> <2FB066C7-12E6-46D8-8F4A-BD096BE2A0CA@uiuc.edu> <44AD9884.6040507@cornell.edu>
Message-ID: <6B530ED6-5825-47C4-A677-2C75E0F97E26@gmx.net>

No scoop b/c no time. I am busy w/ a grant and Lincoln had to leave
early as well on Friday. Sorry.

--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From cjfields at uiuc.edu Thu Jul 6 19:28:09 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 6 Jul 2006 18:28:09 -0500
Subject: [Bioperl-l] parser for GeneSeqer
In-Reply-To: <44AD9884.6040507@cornell.edu>
Message-ID: <000001c6a153$d78b83c0$15327e82@pyrimidine>

Not any word yet. Been pretty quiet, likely b/c everybody was there
planning a roadmap for v1.6.

Chris

From hlapp at gmx.net Thu Jul 6 19:41:44 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Thu, 6 Jul 2006 19:41:44 -0400
Subject: [Bioperl-l] parser for GeneSeqer
In-Reply-To: <000001c6a153$d78b83c0$15327e82@pyrimidine>
References: <000001c6a153$d78b83c0$15327e82@pyrimidine>
Message-ID:

Uhm - roadmap - I guess yes, but more that of the Golden State, or
other states on the way, for Jason.

--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From cjfields at uiuc.edu Thu Jul 6 19:49:23 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 6 Jul 2006 18:49:23 -0500
Subject: [Bioperl-l] parser for GeneSeqer
In-Reply-To:
Message-ID: <000101c6a156$cee60bc0$15327e82@pyrimidine>

Oh well. There's always BOSC...

Chris

From osborne1 at optonline.net Thu Jul 6 21:06:32 2006
From: osborne1 at optonline.net (Brian Osborne)
Date: Thu, 06 Jul 2006 21:06:32 -0400
Subject: [Bioperl-l] PrimarySeqI object Exception
In-Reply-To: <20060706181312.63581.qmail@web35607.mail.mud.yahoo.com>
Message-ID:

sss lll,

What this error means is that $obj, which is what's passed to the
write_seq method, is not a valid Sequence object. What identifier is
$array_gene_name[$p]?

Brian O.
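As an aside on Brian's point: the exception suggests that $obj was not a sequence object by the time it reached write_seq, which is what one would expect if the lookup came back empty for that identifier. Below is a minimal sketch of a guard, written in plain Perl so that it runs without BioPerl installed. The names looks_like_primary_seq and My::FakeSeq are hypothetical and made up for illustration; they are not part of BioPerl, and the check is an assumption about what "valid" means here, not a copy of the library's own test.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);

# Hypothetical helper (not part of BioPerl): true only if $obj is a blessed
# object whose class claims to implement Bio::PrimarySeqI.
sub looks_like_primary_seq {
    my ($obj) = @_;
    return (blessed($obj) && $obj->isa('Bio::PrimarySeqI')) ? 1 : 0;
}

# Stand-in class for demonstration only; real objects would come from BioPerl.
{
    package My::FakeSeq;
    our @ISA = ('Bio::PrimarySeqI');
    sub new { return bless {}, shift }
}

my $found   = My::FakeSeq->new;   # resembles the result of a successful lookup
my $missing = undef;              # resembles the result of a failed lookup

for my $case ([found => $found], [missing => $missing]) {
    my ($label, $obj) = @$case;
    if (looks_like_primary_seq($obj)) {
        print "$label: safe to pass to write_seq\n";
    }
    else {
        print "$label: skip, not a sequence object (was the ID in the index?)\n";
    }
}
```

In the original snippet the same idea would amount to testing $obj before calling $seqio->write_seq($obj) and reporting the identifier that failed, which should make it easier to answer the question about $array_gene_name[$p].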
From rmb32 at cornell.edu Thu Jul 6 21:24:40 2006
From: rmb32 at cornell.edu (Robert Buels)
Date: Thu, 06 Jul 2006 18:24:40 -0700
Subject: [Bioperl-l] FeatureIO tests failing on linux debian sarge
Message-ID: <44ADB7D8.7080102@cornell.edu>

I am stumped. On a fresh checkout from cvs (as of like 10 seconds ago):

rob at rubisco:/usr/local/lib/site_perl/bioperl-live$ perl -v

This is perl, v5.8.4 built for i386-linux-thread-multi

Copyright 1987-2004, Larry Wall

Perl may be copied only under the terms of either the Artistic License or
the GNU General Public License, which may be found in the Perl 5 source kit.

Complete documentation for Perl, including FAQ lists, should be found on
this system using `man perl' or `perldoc perl'. If you have access to the
Internet, point your browser at http://www.perl.com/, the Perl Home Page.
rob at rubisco:/usr/local/lib/site_perl/Bio$ perl t/FeatureIO.t
1..22
ok 1
ok 2
ok 3
ok 4
ok 5
ok 6
Can't locate object method "get_Annotations" via package
"Bio::SeqFeature::Annotated" at
/usr/local/lib/site_perl/Bio/SeqFeature/Annotated.pm line 292, line 2.
ok 7 # Cannot complete FeatureIO tests
ok 8 # Cannot complete FeatureIO tests
ok 9 # Cannot complete FeatureIO tests
ok 10 # Cannot complete FeatureIO tests
ok 11 # Cannot complete FeatureIO tests
ok 12 # Cannot complete FeatureIO tests
ok 13 # Cannot complete FeatureIO tests
ok 14 # Cannot complete FeatureIO tests
ok 15 # Cannot complete FeatureIO tests
ok 16 # Cannot complete FeatureIO tests
ok 17 # Cannot complete FeatureIO tests
ok 18 # Cannot complete FeatureIO tests
ok 19 # Cannot complete FeatureIO tests
ok 20 # Cannot complete FeatureIO tests
ok 21 # Cannot complete FeatureIO tests
ok 22 # Cannot complete FeatureIO tests

However, the same code runs fine on my debian unstable machine (perl
5.8.8). Perhaps this is a bug in debian stable's perl?

I did some poking around through the code, changing @ISA = qw/.../ to
use base, switching the order of inclusion in the ISA at the top of
Bio::SeqFeature::Annotated, no dice.

Anybody able to reproduce this? Anyone have any ideas?

Rob

--
Robert Buels
SGN Bioinformatics Analyst
252A Emerson Hall, Cornell University
Ithaca, NY 14853
Tel: 503-889-8539
rmb32 at cornell.edu
http://www.sgn.cornell.edu

From cjfields at uiuc.edu Thu Jul 6 22:30:25 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 6 Jul 2006 21:30:25 -0500
Subject: [Bioperl-l] FeatureIO tests failing on linux debian sarge
In-Reply-To: <44ADB7D8.7080102@cornell.edu>
Message-ID: <000001c6a16d$4dd7e6e0$15327e82@pyrimidine>

I don't get any issues (all tests pass), except for a few warning messages,
which is normal; some ontology handling is not implemented.

Usually when running tests I use 'perl -I. t/test.t' to force it to use
the core directory first. You might try that to see if it 'fixes' the problem.
If it does, there may be another bioperl installation in @INC being used instead of your current directory. Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Robert Buels > Sent: Thursday, July 06, 2006 8:25 PM > To: bioperl-l at bioperl.org > Subject: [Bioperl-l] FeatureIO tests failing on linux debian sarge > > I am stumped. On a fresh checkout from cvs (as of like 10 seconds ago): > > > rob at rubisco:/usr/local/lib/site_perl/bioperl-live$ perl -v > > This is perl, v5.8.4 built for i386-linux-thread-multi > > Copyright 1987-2004, Larry Wall > > Perl may be copied only under the terms of either the Artistic License > or the > GNU General Public License, which may be found in the Perl 5 source kit. > > Complete documentation for Perl, including FAQ lists, should be found on > this system using `man perl' or `perldoc perl'. If you have access to the > Internet, point your browser at http://www.perl.com/, the Perl Home Page. > > rob at rubisco:/usr/local/lib/site_perl/Bio$ perl t/FeatureIO.t > 1..22 > ok 1 > ok 2 > ok 3 > ok 4 > ok 5 > ok 6 > Can't locate object method "get_Annotations" via package > "Bio::SeqFeature::Annotated" at > /usr/local/lib/site_perl/Bio/SeqFeature/Annotated.pm line 292, > line 2. 
> ok 7 # Cannot complete FeatureIO tests > ok 8 # Cannot complete FeatureIO tests > ok 9 # Cannot complete FeatureIO tests > ok 10 # Cannot complete FeatureIO tests > ok 11 # Cannot complete FeatureIO tests > ok 12 # Cannot complete FeatureIO tests > ok 13 # Cannot complete FeatureIO tests > ok 14 # Cannot complete FeatureIO tests > ok 15 # Cannot complete FeatureIO tests > ok 16 # Cannot complete FeatureIO tests > ok 17 # Cannot complete FeatureIO tests > ok 18 # Cannot complete FeatureIO tests > ok 19 # Cannot complete FeatureIO tests > ok 20 # Cannot complete FeatureIO tests > ok 21 # Cannot complete FeatureIO tests > ok 22 # Cannot complete FeatureIO tests > > However, same code runs fine on my debian unstable machine (perl > 5.8.8). Perhaps this is a bug in debian stable's perl? > > I did some poking around through the code, changing @ISA = qw/.../ to > use base, switching the order of inclusion in the ISA at the top of > Bio::SeqFeature::Annotated, no dice. > > Anybody able to reproduce this? Anyone have any ideas? > > Rob > > -- > Robert Buels > SGN Bioinformatics Analyst > 252A Emerson Hall, Cornell University > Ithaca, NY 14853 > Tel: 503-889-8539 > rmb32 at cornell.edu > http://www.sgn.cornell.edu > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l

From chandan.kr.singh at gmail.com Fri Jul 7 01:23:40 2006
From: chandan.kr.singh at gmail.com (CHANDAN SINGH)
Date: Fri, 7 Jul 2006 10:53:40 +0530
Subject: [Bioperl-l] PrimarySeqI object Exception
In-Reply-To:
References: <20060706181312.63581.qmail@web35607.mail.mud.yahoo.com>
Message-ID: <2d4f320607062223y520a1375lb30cf40c1c883702@mail.gmail.com>

Hi

By default, the id is the first word encountered, i.e. the first string after ">", separated from the rest by a space.
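A minimal sketch of that rule, assuming the default behaviour is simply "everything after '>' up to the first whitespace" (fasta_id is a hypothetical helper written for illustration, not a BioPerl function):

```perl
use strict;
use warnings;

# Illustration only: take the first whitespace-delimited word after ">"
# as the ID, mirroring the default described above.
sub fasta_id {
    my ($header) = @_;
    my ($id) = $header =~ /^>(\S+)/;
    return $id;
}

# Header built from the gene name quoted earlier in this thread.
my $header = '>gi|59711891|ref|YP_204667.1| NAD-specific glutamate dehydrogenase [Vibrio fischeri ES114]*VIB*COG2902*E';
print fasta_id($header), "\n";   # prints gi|59711891|ref|YP_204667.1|
```

Under that assumption, a lookup with the full description line as the key would probably not find the entry, while the short ID would.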
The sample id you mentioned in your first mail contains spaces and, as I mentioned in my previous mail, I am sure the ids made by indexing and the ones you are using in the array are different. You can see the ids used in indexing by using

    @ids = $db->ids();
    print join("\n", @ids);

Cheers
Chandan

On 7/7/06, Brian Osborne wrote: > > sss lll, > > What this error means is that $obj is not a valid Sequence object, this is > what's passed to the write_seq method. What identifier is > $array_gene_name[$p]? > > Brian O. > > > On 7/6/06 2:13 PM, "sss lll" wrote: > > > Hi there, > > > > I encountered a problem while calling module > > PrimarySeqI, with the following code: > > > > my $db=Bio::DB::Fasta->new($fafile); > > my $obj=$db->get_Seq_by_id($array_gene_name[$p]); > > $seqio->write_seq($obj); > > > > The error message was: > > MSG: Did not provide a valid Bio::PrimarySeqI object > > STACK Bio::SeqIO::fasta::write_seq > > /usr/lib/perl5/site_perl/5.8.0/Bio/SeqIO/fasta.pm:178 > > > > We think it had to do with the lengh of the gene name. > > For example the following gene name was a problem: > > > > gi|59711891|ref|YP_204667.1| NAD-specific glutamate > > dehydrogenase [Vibrio fischeri ES114]*VIB*COG2902*E > > > > Any ideas on what happened? > > > > Thanks > > > > __________________________________________________ > > Do You Yahoo!? > > Tired of spam? Yahoo! 
Mail has the best spam protection around > > http://mail.yahoo.com > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > From selvik at ufl.edu Fri Jul 7 12:07:03 2006 From: selvik at ufl.edu (Selvi Kadirvel) Date: Fri, 7 Jul 2006 12:07:03 -0400 Subject: [Bioperl-l] Bio::SearchIO::hmmer hsp behaviour In-Reply-To: <001a01c6a048$cb802420$15327e82@pyrimidine> References: <001a01c6a048$cb802420$15327e82@pyrimidine> Message-ID: <1A5235F4-87E6-42D7-9796-7FEB8F7C04E5@ufl.edu> Chris: I just tried it out, and it looks like this solution works fine for me. Thank you for the fix! -Selvi On Jul 5, 2006, at 11:36 AM, Chris Fields wrote: > Okay, I managed to figure out what the problem was. I committed a > fix in > CVS for the initial bug (Selvi's missing hits). Still has one HSP > per hit > for now; I think it will take a bit more effort to get a BLAST-like > multi > HSP/hit up and running. > > Selvi, update from CVS to see if that works. > > Chris > >> -----Original Message----- >> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- >> bounces at lists.open-bio.org] On Behalf Of Chris Fields >> Sent: Friday, June 30, 2006 12:44 PM >> To: Sendu Bala; Jason Stajich >> Cc: bioperl-l at lists.open-bio.org list >> Subject: Re: [Bioperl-l] Bio::SearchIO::hmmer hsp behaviour >> >> I'll try looking at it this weekend. A suggested workaround is to >> either try setting -A for no alignments or setting it to a high >> number to retrieve all of them. It's pretty serious as the error >> silently dumps those domains, so for those using automated annotation >> pipelines would miss it unless they are also checking the raw output. 
>> >> You could design a SearchIO::hmmpfam parser then expand it to take in >> hmmsearch output at a later point, or keep them separate. I like the >> idea of having modules that are more specific about what they parse; >> seems at some point you reach serious code bloat and maintenance >> becomes an issue. Look at SearchIO::blast; it parses various text >> BLAST output very well but with some serious obfuscation. Just don't >> know how productive it would be to separate out the PSI-BLAST and >> bl2seq stuff since they are pretty close to a standard BLAST >> report... oh well. >> >> To Jason : good luck on your move. Drop us a line here to let us >> know everything went well. >> >> Chris >> >> On Jun 30, 2006, at 11:14 AM, Sendu Bala wrote: >> >>> Chris Fields wrote: >>>> It may have been just simpler to have it be one HSP (domain) per >>>> Hit >>>> (model) as that's how the reports are generated. My reasoning was >>>> that >>>> using the one domain per model made sense based on what you are >>>> actually >>>> trying to do, which is annotate the sequence based on the order the >>>> domain appears. Most others may not view it that way, which is >>>> fine. >>>> One can always gather the relevant HSP's, convert to seqfeatures, >>>> then >>>> sort them if order is important, I suppose. >>>> >>>> I would say, if the overall consensus is to modify it to have >>>> multiple >>>> domain hits per model (similar to BLAST) then Sendu should go >>>> ahead and >>>> make those changes then announce it on the list so no one can gripe >>>> about it later. My main concern was not changing things so >>>> dramatically >>>> that it'll break for someone >>> >>> Going on your earlier suggestion, I was thinking about making >>> SearchIO::hmmpfam instead, which would get used if you set the >>> format to >>> 'hmmpfam' instead of the generic 'hmmer' when making a SearchIO. I >>> suppose I would make a SearchIO::hmmsearch as well, if necessary. >>> >>> >>> [...] 
>>>> that the reported bug about missing hits (Bug 2036) is fixed as >>>> well. >>> >>> However, having never made a SearchIO plugin before, it will be some >>> time before I get my head around it. I'll want to make one the >>> current >>> HOWTO:SearchIO way before I can think about doing it a better way >>> (hashes) as well. So I can say I'll make a move on this at some >>> point in >>> the future, but if someone wants to fix Bug 2036 in the mean time, >>> they >>> are welcome to. Again as suggested, my priority is Bio::Map right >>> now. >>> _______________________________________________ >>> Bioperl-l mailing list >>> Bioperl-l at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> >> Christopher Fields >> Postdoctoral Researcher >> Lab of Dr. Robert Switzer >> Dept of Biochemistry >> University of Illinois Urbana-Champaign >> >> >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > > From cjfields at uiuc.edu Fri Jul 7 12:16:30 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Fri, 7 Jul 2006 11:16:30 -0500 Subject: [Bioperl-l] Bio::SeqFeatureI spliced_seq Message-ID: <002a01c6a1e0$b4e2b360$15327e82@pyrimidine> There is a reported bug (Bug 2039) which I found an easy fix for; the issue is that spliced_seq, as currently implemented, has two optional arguments: my ($self, $db, $nosort) = @_; $db is-a Bio::DB::RandomAccessI; $nosort is a flag so that locations aren't sorted before splicing, which is crux of the bug. So, to set $nosort you must also set $db to either undef or a Bio::DB::RandomAccessI (a point not made in the docs and not immediately clear to the user). Would it make more sense to have something like this (using $self->_rearrange to get the options)? 
    my $seq = $sf->spliced_seq(-nosort => 1);
    my $seq = $sf->spliced_seq(-db => $db);
    my $seq = $sf->spliced_seq(-nosort => 1, -db => $db);

Christopher Fields
Postdoctoral Researcher - Switzer Lab
Dept. of Biochemistry
University of Illinois Urbana-Champaign

From vebaev at gmail.com Sat Jul 8 16:59:40 2006
From: vebaev at gmail.com (Vesselin Baev)
Date: Sat, 08 Jul 2006 23:59:40 +0300
Subject: [Bioperl-l] BLAST running options
Message-ID: <44B01CBC.9070404@gmail.com>

Hi,
I'm parsing BLAST results, but I need a BLAST option to limit and decrease the number of results. I'm blasting an oligo of about 40 nt and I need only results that cover the full length of the query (40), either matching exactly or with mismatches (not more than 10). I do not want the large number of results that BLAST gives me for shorter matches.

Does anyone know what kind of BLAST option to use?
Thanks

-- 
------------------------------------------------
University of Plovdiv
Faculty of Biology
Dept. Molecular Biology and Plant Physiology
Tzar Asen 24
Plovdiv 4000, BULGARIA
vebaev at gmail.com
tel.00359889034044

From cjfields at uiuc.edu Sat Jul 8 19:15:29 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Sat, 8 Jul 2006 18:15:29 -0500
Subject: [Bioperl-l] BLAST running options
In-Reply-To: <44B01CBC.9070404@gmail.com>
References: <44B01CBC.9070404@gmail.com>
Message-ID: <95D47990-9B63-444D-B386-04219D21DC39@uiuc.edu>

There were some posts about this a few months back.

http://bioperl.org/pipermail/bioperl-l/2006-April/021341.html

Essentially, most responders suggested not using BLAST, but I believe there were a few who gave pointers.

Chris

On Jul 8, 2006, at 3:59 PM, Vesselin Baev wrote: > Hi, > I'm parsing Blast results, but I need an Blast option to limit > limit and > decrease the Blast number of results. > I'm blasting an oligo about 40nt and I need only results which are > with > mismatches (not more than 10) or exactly matching but in the length as > the query - 40. 
> I do not want all the big amount of results that blast gave me about > shorter matching. > > Do anyone knows what king of BLAST option to use? > Thanks > > -- > ------------------------------------------------ > > University of Plovdiv > Faculty of Biology > Dept. Molecular Biology and Plant Physiology > Tzar Asen 24 > Plovdiv 4000, BULGARIA > vebaev at gmail.com > tel.00359889034044 > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign

From cjfields at uiuc.edu Mon Jul 10 17:09:12 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Mon, 10 Jul 2006 16:09:12 -0500
Subject: [Bioperl-l] How to use gi2taxonid
Message-ID: <000301c6a465$182025d0$15327e82@pyrimidine>

Hubert,

In case you didn't get this going, there may be another option now. I have started work on a new set of modules called Bio::DB::EUtilities in bioperl-live, intended as a back-end for NCBI database searches. It can be used directly if needed though. You can use EPost/Elink to directly retrieve the taxonIDs using the following script (pass a file containing the protein/nucleotide primary ID on command line). The below retrieves taxonid's using protein GI's:

    use Bio::DB::EUtilities;
    my @ids;

    while (my $id = <>) {
        chomp $id;
        push @ids, $id;
    }

    my $epost = Bio::DB::EUtilities->new(
        -eutil => 'epost',
        -db    => 'protein',
        -id    => \@ids,
    );

    $epost->get_response;

    my $elink = Bio::DB::EUtilities->new(
        -eutil  => 'elink',
        -cookie => $epost->next_cookie,
        -db     => 'taxonomy',
    );

    $elink->get_response;

    my @tax_ids = $elink->get_db_ids;

Chris

> hi, > I have downloaded the gi2taxonid file to get the taxonid for a GI > number > taken from a report as recommended here, but I don't know how to > use the > gi2taxonid file. 
> Jason wrote in a previous post that you have to make a DB_File out of > it, but I don't know how....and finally tie it to a hash.... > Can anybody give me a hint how to use it..... my final goal is to get > the taxonomy. > > thanks > Hubert Christopher Fields Postdoctoral Researcher - Switzer Lab Dept. of Biochemistry University of Illinois Urbana-Champaign From hubert.prielinger at gmx.at Mon Jul 10 19:53:26 2006 From: hubert.prielinger at gmx.at (Hubert Prielinger) Date: Mon, 10 Jul 2006 17:53:26 -0600 Subject: [Bioperl-l] How to use gi2taxonid In-Reply-To: <000301c6a465$182025d0$15327e82@pyrimidine> References: <000301c6a465$182025d0$15327e82@pyrimidine> Message-ID: <44B2E876.2020200@gmx.at> Hi Chris, thanks for your response, actually I have done it with the EUtils, because I have only accession ids and there is no possibility to retrieve the taxonomy directly for an accession id. Because the xml files you retrieve are very small, I first assign accession id to esearch, parse the Uid from the xml file, assign Uid to esummary, parse tax id from xml and finally, assign tax id to esummary again and retrieve taxonomy and parse it..... I know a little bit intricatley, but it works fine.....thanks regards Hubert Chris Fields wrote: > Hubert, > > In case you didn't get this going, there may be another option now. I have > started work on a new set of modules called Bio::DB::EUtilities in > bioperl-live, intended as a back-end for NCBI database searches. It can be > used directly if needed though. You can use EPost/Elink to directly > retrieve the taxonIDs using the following script (pass a file containing the > protein/nucleotide primary ID on command line). 
The below retrieves > taxonid's using protein GI's: > > > use Bio::DB::EUtilities; > my @ids; > > while (my $id = <>) { > chomp $id; > push @ids, $id; > } > > my $epost = Bio::DB::EUtilities->new( > -eutil => 'epost', > -db => 'protein', > -id => \@ids, > ); > > $epost->get_response; > > my $elink = Bio::DB::EUtilities->new( > -eutil => 'elink', > -cookie => $epost->next_cookie, > -db => 'taxonomy', > ); > > $elink->get_response; > > my @tax_ids = $elink->get_db_ids; > > > > Chris > > >> hi, >> I have downloaded the gi2taxonid file to get the taxonid for a GI >> number >> taken from a report as recommended here, but I don't know how to >> use the >> gi2taxonid file. >> Jason wrote in a previous post that you have to make a DB_File out of >> it, but I don't know how....and finally tie it to a hash.... >> Can anybody give me a hint how to use it..... my final goal is to get >> the taxonomy. >> >> thanks >> Hubert >> > > Christopher Fields > Postdoctoral Researcher - Switzer Lab > Dept. of Biochemistry > University of Illinois Urbana-Champaign > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > From MEC at stowers-institute.org Mon Jul 10 20:25:11 2006 From: MEC at stowers-institute.org (Cook, Malcolm) Date: Mon, 10 Jul 2006 19:25:11 -0500 Subject: [Bioperl-l] Bio::Tools::Fgenesh bug? and fix? Message-ID: I am finding the Bio::Tools::Fgenesh parser to incorrectly handle the feature coordinates on - strand predictions. In particular, start & end are deliberately reversed if the strand is '-'. I guess this was a holdover from Genscan.pm and wasn't really tested !?!?! Or, perhaps fgenesh v 2.4 which I am running has different output in this respect compared to the version 2.0, against which this module was written. Or, perhaps my understanding is blotto (known to happen). Does anyone know for sure? If I comment out selected lines... 
    # if($predobj->strand() == 1) {
        $predobj->start($start);
        $predobj->end($end);
    # } else {
    #     $predobj->end($start);
    #     $predobj->start($end);
    # }

... then the GFF produced by my naive fgenesh2gff script below is correct (at least w.r.t. strand and coordinates - GFF compatibility purists might wince).

Should I commit this change to head?

Malcolm Cook
Database Applications Manager, Bioinformatics
Stowers Institute for Medical Research

    #!/usr/bin/env perl

    # fgenesh2gff
    # PURPOSE: parse fgenesh output into gff
    # USAGE: fgenesh fish somefish.dna | fgenesh2gff > somefish.dna.fgenesh.gff

    use strict;
    use warnings;
    use Bio::Tools::Fgenesh;
    use Bio::FeatureIO;
    use Bio::Tools::GFF;    # the writer class actually used below

    # Remaining options should name files to process, but if none, process
    # standard input:
    @ARGV = ('-') unless @ARGV;
    my $fgenesh = Bio::Tools::Fgenesh->new(-fh => \*ARGV);

    my $featureout = Bio::Tools::GFF->new(
        -gff_version => 2,    #whatever ;)
    );
    my $IDNUM = 0;
    while (my $gene = $fgenesh->next_prediction()) {
        my $ID = "fgenesh" . ++$IDNUM;
        $gene->add_tag_value('ID', $ID);
        $featureout->write_feature($gene);
        foreach ($gene->exons()) {
            $_->add_tag_value('Parent', $ID);
            $_->seq_id($gene->seq_id);
            $featureout->write_feature($_);
        }
    }
    $fgenesh->close();

    exit 0;

From chris at dwan.org Mon Jul 10 22:06:41 2006
From: chris at dwan.org (Christopher Dwan)
Date: Mon, 10 Jul 2006 22:06:41 -0400
Subject: [Bioperl-l] Bio::Tools::Fgenesh bug? and fix?
In-Reply-To:
References:
Message-ID:

I'm not surprised that there are parts that don't work right; I copied genscan.pm and made the absolute minimal changes required to get what I needed working. Haven't touched it since.

Please feel free to do what needs to be done, and sorry about the mess.

-Chris Dwan

On Jul 10, 2006, at 8:25 PM, Cook, Malcolm wrote: > I am finding the Bio::Tools::Fgenesh parser to incorrectly handle the > feature coordinates on - strand predictions. > > In particular, start & end are deliberately reversed if the strand is > '-'. 
> > I guess this was a holdover from Genscan.pm and wasn't really tested > !?!?! > > Or, perhaps fgenesh v 2.4 which I am running has different output in > this respect compared to the version 2.0, against which this module > was > written. > > Or, perhaps my understanding is blotto (known to happen). > > Does anyone know for sure? > > If I comment out selected lines... > > # if($predobj->strand() == 1) { > $predobj->start($start); > $predobj->end($end); > # } else { > # $predobj->end($start); > # $predobj->start($end); > # } > > ... then GFF produced by my naive fgenesh2gff script below is correct > (at least w.r.t. strand and coordinates - GFF compatibility purists > might wince). > > Should I commit this change to head? > > > Malcolm Cook > Database Applications Manager, Bioinformatics > Stowers Institute for Medical Research > > > > #!/usr/bin/env perl > > # fgenesh2gff > # PURPOSE: parse fgenesh output into gff > # USAGE: fgenesh fish somefish.dna | fgenesh2gff > > somefish.dna.fgenesh.gff > > use strict; > use warnings; > use Bio::Tools::Fgenesh; > use Bio::FeatureIO; > > # Remaining options should name files to process, but if none, process > # standard input: > @ARGV = ('-') unless @ARGV; > my $fgenesh = Bio::Tools::Fgenesh->new(-fh => \*ARGV); > > my $featureout = new Bio::Tools::GFF( > -gff_version => 2, #whatever ;) > ); > my $IDNUM = 0; > while (my $gene = $fgenesh->next_prediction()) { > my $ID = "fgenesh" . ++ $IDNUM; > $gene->add_tag_value('ID', $ID); > $featureout->write_feature($gene); > foreach ($gene->exons()) { > $_->add_tag_value('Parent', $ID); > $_->seq_id($gene->seq_id); > $featureout->write_feature($_); > } > } > $fgenesh->close(); > > exit 0; > From rvosa at sfu.ca Tue Jul 11 04:58:46 2006 From: rvosa at sfu.ca (Rutger Vos) Date: Tue, 11 Jul 2006 01:58:46 -0700 Subject: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading? 
Message-ID: <44B36846.8070103@sfu.ca>

Dear all,

would it be possible to overload Bio::Root::RootI's 'throw' method to accept an additional, optional (positional) argument to define the exception class, e.g. using Exception::Class:

    # ...somewhere ...

    sub makefh {
        my ( $self, $filename ) = @_;
        open my $fh, '<', $filename
            or $self->throw("Can't open file: $!", 'Bio::Exceptions::FileIO'); # NOTE second argument
        return $fh;
    }

    #.... somewhere else
    my $fh;
    eval {
        $fh = $obj->makefh('data.txt');
    };
    if ( $@ and $@->isa('Bio::Exceptions::FileIO') ) {
        # something's wrong with the file?
    }

-- 
++++++++++++++++++++++++++++++++++++++++++++++++++++
Rutger Vos, PhD. candidate
Department of Biological Sciences
Simon Fraser University
8888 University Drive
Burnaby, BC, V5A1S6
Phone: 604-291-5625
Fax: 604-291-3496
Personal site: http://www.sfu.ca/~rvosa
FAB* lab: http://www.sfu.ca/~fabstar
Bio::Phylo: http://search.cpan.org/~rvosa/Bio-Phylo/
++++++++++++++++++++++++++++++++++++++++++++++++++++

From khoiwal_tara at yahoo.co.in Tue Jul 11 08:19:17 2006
From: khoiwal_tara at yahoo.co.in (Khoiwal Tara)
Date: Tue, 11 Jul 2006 05:19:17 -0700 (PDT)
Subject: [Bioperl-l] Need help in needle parser
Message-ID: <20060711121918.64429.qmail@web8513.mail.in.yahoo.com>

Hi,
I want to parse the output of needle. I tried but wasn't able to get the expected output.

My code is as follows:

    #!/usr/local/bin/perl

    use strict;
    use warnings;
    use Bio::AlignIO;
    my $needleReport = $ARGV[0];

    my $in = Bio::AlignIO->new(-format => 'emboss', -file => $needleReport);

    while (my $align = $in->next_aln()) {
        print "Alignment Length:" . $align->length() . "\n";
        print "Percentage Identity:" . $align->percentage_identity() . "\n";
        print "Consensus string:" . $align->consensus_string() . "\n";
        print "Number of sequences:" . $align->no_sequences() . "\n";
        print "Number of residues:" . $align->no_residues() . "\n";
    }

But it doesn't go inside the while loop. Please help me.
How to find the alignment position for the query sequence on the target sequence from the needle output? Where can i find the good documentation on needle parser and its usage? Good document on bioperl for beginners. Regards, Tara Khoiwal. --------------------------------- Sneak preview the all-new Yahoo.com. It's not radically different. Just radically better. From cjfields at uiuc.edu Tue Jul 11 08:59:07 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 11 Jul 2006 07:59:07 -0500 Subject: [Bioperl-l] Need help in needle parser In-Reply-To: <20060711121918.64429.qmail@web8513.mail.in.yahoo.com> References: <20060711121918.64429.qmail@web8513.mail.in.yahoo.com> Message-ID: <250EEE60-48D0-4844-B0C0-13E17E60965C@uiuc.edu> perldoc Bio::AlignIO perldoc Bio::AlignIO::needle http://www.bioperl.org/wiki/FAQ http://www.bioperl.org/wiki/HOWTO:Beginners http://www.bioperl.org/wiki/Bptutorial.pl http://www.catb.org/~esr/faqs/smart-questions.html Google is your friend! If it isn't entering the while loop, there are two possibilities: 1) Something is wrong with the file 2) The parser isn't reading the file correctly In order to know which, we will need to see the alignment itself. Chris On Jul 11, 2006, at 7:19 AM, Khoiwal Tara wrote: > Hi, > I want to parse the output of needle.I tried but didn't able to > get expected output. > > My code is as follows: > > #!/usr/local/bin/perl > > use strict; > use warnings; > use Bio::AlignIO; > my $needleReport = $ARGV[0]; > > my $in = new Bio::AlignIO(-format => 'emboss',-file =>$needleReport); > > while(my $align = $in->next_aln()){ > print "Alignment Length:".$align->length()."\n"; > print "Percentage Identity:".$align->percentage_identity()."\n"; > print "Consensus string:".$align->consensus_string()."\n"; > print "Number of sequences:".$align->no_sequence()."\n"; > print "Number of residues:".$align->no_residues()."\n"; > } > > But it doesn't go inside the while loop. > Pls help me. 
> How to find the alignment position for the query sequence on the > target sequence from the needle output? > Where can i find the good documentation on needle parser and its > usage? > Good document on bioperl for beginners. > > Regards, > Tara Khoiwal. > > > --------------------------------- > Sneak preview the all-new Yahoo.com. It's not radically different. > Just radically better. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From cjfields at uiuc.edu Tue Jul 11 09:13:23 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 11 Jul 2006 08:13:23 -0500 Subject: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading? In-Reply-To: <44B36846.8070103@sfu.ca> References: <44B36846.8070103@sfu.ca> Message-ID: <954D76EB-F8D3-4652-957B-6ED4BBAEE209@uiuc.edu> I suppose you could; Bio::Root::Root does that using Error.pm (if it is installed). It almost sounds like what Bio::Root::Root does is what you want, but you want a little more information when exceptions are thrown maybe? from perldoc Bio::Root::Root: ... # Alternatively, using the new typed exception syntax in the throw() call: $obj->throw( -class => 'Bio::Root::BadParameter', -text => "Can not open file $file", -value => $file); ... Typed Exception Syntax The typed exception syntax of throw() has the advantage of plainly indicating the nature of the trouble, since the name of the class is included in the title of the exception output. To take advantage of this capability, you must specify arguments as named parameters in the throw() call. Here are the parameters: -class name of the class of the exception. 
This should be one of the classes defined in Bio::Root::Exception, or a custom error of yours that extends one of the exceptions defined in Bio::Root::Exception. -text a sensible message for the exception -value the value causing the exception or $!, if appropriate. Note that Bio::Root::Exception does not need to be imported into your module (or script) namespace in order to throw exceptions via Bio::Root::Root::throw(), since Bio::Root::Root imports it. Chris On Jul 11, 2006, at 3:58 AM, Rutger Vos wrote: > Dear all, > > would it be possible to overload Bio::Root::RootI's 'throw' method to > accept an additional, optional (positional) argument to define the > exception class, e.g. using Exception::Class: > > # ...somewhere ... > > sub makefh { > my ( $self, $filename ) = @_; > open my $fh, '<' $filename or $self->throw("Can't open file: $!", > 'Bio::Exceptions::FileIO'); # NOTE second argument > return $fh; > } > > #.... somewhere else > my $fh; > eval { > $fh = $obj->makefh( 'data.txt'); > } > if ( $@ and $@->isa('Bio::Exceptions::FileIO') ) { > # something's wrong with the file? > } > > -- > ++++++++++++++++++++++++++++++++++++++++++++++++++++ > Rutger Vos, PhD. candidate > Department of Biological Sciences > Simon Fraser University > 8888 University Drive > Burnaby, BC, V5A1S6 > Phone: 604-291-5625 > Fax: 604-291-3496 > Personal site: http://www.sfu.ca/~rvosa > FAB* lab: http://www.sfu.ca/~fabstar > Bio::Phylo: http://search.cpan.org/~rvosa/Bio-Phylo/ > ++++++++++++++++++++++++++++++++++++++++++++++++++++ > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. 
Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign

From cjfields at uiuc.edu Tue Jul 11 11:25:32 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Tue, 11 Jul 2006 10:25:32 -0500
Subject: [Bioperl-l] Need help in needle parser
In-Reply-To: <20060711132601.46368.qmail@web8510.mail.in.yahoo.com>
Message-ID: <001601c6a4fe$3ff7ca10$15327e82@pyrimidine>

There are a few odd things about the data you sent; the FASTA files aren't FASTA format (they are raw) and the needle output doesn't have sequence names. You could try running these through needle with descriptors to see if that helps, but it is very likely my option #2 (i.e. the parser doesn't recognize the format). There is a thread on the mailing list about this issue:

http://thread.gmane.org/gmane.comp.lang.perl.bio.general/8926/focus=8935

Basically, it looks like the needle output has changed dramatically in EMBOSS v3. Jason's suggested options from the above thread (as well as mine):

...
I think the "emboss" format changed in 3.0.0
solutions:
a) fix the AlignIO::emboss parser to handle both flavors (old and new)
b) have it output MSF format and use AlignIO::msf.
...

So, as a workaround, use MSF output. I won't have time to look at this anytime soon as I'm busy at $job and getting ready for a summer institute; I'll submit this as a bug to see if someone else can tackle it before I get back in early August.

Chris

_____

From: Khoiwal Tara [mailto:khoiwal_tara at yahoo.co.in]
Sent: Tuesday, July 11, 2006 8:26 AM
To: Chris Fields
Subject: Re: [Bioperl-l] Need help in needle parser

I am sending my test data to you. I have two fasta files, "GenomicSeq.fasta" and "TranscriptSeq.fasta". I ran needle on these files as follows:

$ needle GenomicSeq.fasta TranscriptSeq.fasta outfile.needle

So the output of needle is stored in outfile.needle. I am also attaching the output file. Please check it and tell me if it has any problem. Is my output file correct?

Thanks and Regards,
Tara.
Chris Fields wrote: perldoc Bio::AlignIO perldoc Bio::AlignIO::needle http://www.bioperl.org/wiki/FAQ http://www.bioperl.org/wiki/HOWTO:Beginners http://www.bioperl.org/wiki/Bptutorial.pl http://www.catb.org/~esr/faqs/smart-questions.html Google is your friend! If it isn't entering the while loop, there are two possibilities: 1) Something is wrong with the file 2) The parser isn't reading the file correctly In order to know which, we will need to see the alignment itself. Chris On Jul 11, 2006, at 7:19 AM, Khoiwal Tara wrote: > Hi, > I want to parse the output of needle.I tried but didn't able to > get expected output. > > My code is as follows: > > #!/usr/local/bin/perl > > use strict; > use warnings; > use Bio::AlignIO; > my $needleReport = $ARGV[0]; > > my $in = new Bio::AlignIO(-format => 'emboss',-file =>$needleReport); > > while(my $align = $in->next_aln()){ > print "Alignment Length:".$align->length()."\n"; > print "Percentage Identity:".$align->percentage_identity()."\n"; > print "Consensus string:".$align->consensus_string()."\n"; > print "Number of sequences:".$align->no_sequence()."\n"; > print "Number of residues:".$align->no_residues()."\n"; > } > > But it doesn't go inside the while loop. > Pls help me. > How to find the alignment position for the query sequence on the > target sequence from the needle output? > Where can i find the good documentation on needle parser and its > usage? > Good document on bioperl for beginners. > > Regards, > Tara Khoiwal. > > > --------------------------------- > Sneak preview the all-new Yahoo.com. It's not radically different. > Just radically better. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign _____ Do you Yahoo!? Next-gen email? Have it all with the all-new Yahoo! 
Mail Beta. From MEC at stowers-institute.org Tue Jul 11 11:56:40 2006 From: MEC at stowers-institute.org (Cook, Malcolm) Date: Tue, 11 Jul 2006 10:56:40 -0500 Subject: [Bioperl-l] Bio::Tools::Fgenesh bug? and fix? Message-ID: Got it. Commits made. Thanks for the history lesson. Cheers, Malcolm Cook >-----Original Message----- >From: Christopher Dwan [mailto:chris at dwan.org] >Sent: Monday, July 10, 2006 9:07 PM >To: Cook, Malcolm >Cc: bioperl-l >Subject: Re: Bio::Tools::Fgenesh bug? and fix? > > >I'm not surprised that there are parts that don't work right, I coped >genscan.pm and made the absolute minimal changes required to get what >I needed working. Haven't touched it since. > >Please feel free to do what needs to be done, and sorry about the mess. > >-Chris Dwan > >On Jul 10, 2006, at 8:25 PM, Cook, Malcolm wrote: > >> I am finding the Bio::Tools::Fgenesh parser to incorrectly handle the >> feature coordinates on - strand predictions. >> >> In particular, start & end are deliberately reversed if the strand is >> '-'. >> >> I guess this was a holdover from Genscan.pm and wasn't really tested >> !?!?! >> >> Or, perhaps fgenesh v 2.4 which I am running has different output in >> this respect compared to the version 2.0, against which this module >> was >> written. >> >> Or, perhaps my understanding is blotto (known to happen). >> >> Does anyone know for sure? >> >> If I comment out selected lines... >> >> # if($predobj->strand() == 1) { >> $predobj->start($start); >> $predobj->end($end); >> # } else { >> # $predobj->end($start); >> # $predobj->start($end); >> # } >> >> ... then GFF produced by my naive fgenesh2gff script below is correct >> (at least w.r.t. strand and coordinates - GFF compatibility purists >> might wince). >> >> Should I commit this change to head? 
>> >> >> Malcolm Cook >> Database Applications Manager, Bioinformatics >> Stowers Institute for Medical Research >> >> >> >> #!/usr/bin/env perl >> >> # fgenesh2gff >> # PURPOSE: parse fgenesh output into gff >> # USAGE: fgenesh fish somefish.dna | fgenesh2gff > >> somefish.dna.fgenesh.gff >> >> use strict; >> use warnings; >> use Bio::Tools::Fgenesh; >> use Bio::FeatureIO; >> >> # Remaining options should name files to process, but if >none, process >> # standard input: >> @ARGV = ('-') unless @ARGV; >> my $fgenesh = Bio::Tools::Fgenesh->new(-fh => \*ARGV); >> >> my $featureout = new Bio::Tools::GFF( >> -gff_version => 2, #whatever ;) >> ); >> my $IDNUM = 0; >> while (my $gene = $fgenesh->next_prediction()) { >> my $ID = "fgenesh" . ++ $IDNUM; >> $gene->add_tag_value('ID', $ID); >> $featureout->write_feature($gene); >> foreach ($gene->exons()) { >> $_->add_tag_value('Parent', $ID); >> $_->seq_id($gene->seq_id); >> $featureout->write_feature($_); >> } >> } >> $fgenesh->close(); >> >> exit 0; >> > > From cjfields at uiuc.edu Tue Jul 11 12:04:49 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 11 Jul 2006 11:04:49 -0500 Subject: [Bioperl-l] Need help in needle parser In-Reply-To: <20060711132601.46368.qmail@web8510.mail.in.yahoo.com> Message-ID: <000101c6a503$bd982eb0$15327e82@pyrimidine> Okay, I take that back. Bio::AlignIO::emboss does parse EMBOSS v3 needle output! The fact that it doesn't parse your alignment is b/c there are no sequence descriptors in the file for the sequences (your FASTA files didn't have them either). Modifying the file to contain descriptions for both the alignment and the 'Aligned_sequences:' section gets your test alignment to work. I consider this a feature and not a bug; how would others be able to distinguish between numerous sequences in an alignment w/o identifiers of some sort? It shouldn't just toss this out without a warning however; I'll try to add a little exception handling. 
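Since the missing descriptors seem to come from the raw input files, one way to follow this up is to give each raw sequence a FASTA header before re-running needle. What follows is only a minimal sketch in plain Perl: raw_to_fasta, the identifier 'GenomicSeq' and the 8-column width in the demo call are made up for illustration and are not part of BioPerl or EMBOSS, and whether the regenerated needle report then parses has not been checked here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl or EMBOSS): wrap a raw
# sequence string in a FASTA record so that it carries an identifier.
sub raw_to_fasta {
    my ($id, $raw, $width) = @_;
    $width ||= 60;                          # default line width (assumption)
    (my $seq = $raw) =~ s/\s+//g;           # drop whitespace and newlines
    my @lines = $seq =~ /(.{1,$width})/g;   # split into fixed-width lines
    return ">$id\n" . join("\n", @lines) . "\n";
}

# Demo with a short made-up sequence wrapped at 8 columns.
print raw_to_fasta('GenomicSeq', "ACGTACGTAC\nGTACGT\n", 8);
```

The header line is what supplies the sequence name; the wrapping width is cosmetic.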
BTW, one line is incorrect in your script; it should be print "Number of sequences:".$align->no_sequences()."\n"; you have print "Number of sequences:".$align->no_sequence()."\n"; Chris _____ From: Khoiwal Tara [mailto:khoiwal_tara at yahoo.co.in] Sent: Tuesday, July 11, 2006 8:26 AM To: Chris Fields Subject: Re: [Bioperl-l] Need help in needle parser I am sending my testing data to you. I have two fasta files "GenomicSeq.fasta" and "TranscriptSeq.fasta". I ran needle on these files as follows: $ needle GenomicSeq.fasta TranscriptSeq.fasta outfile.needle So the out put of the needle will get stored in outfile.needle. I am attaching the output file also. Please check it and tell me if it has any problem. Is my output file is correct? Thanks and Regards, Tara. Chris Fields wrote: perldoc Bio::AlignIO perldoc Bio::AlignIO::needle http://www.bioperl.org/wiki/FAQ http://www.bioperl.org/wiki/HOWTO:Beginners http://www.bioperl.org/wiki/Bptutorial.pl http://www.catb.org/~esr/faqs/smart-questions.html Google is your friend! If it isn't entering the while loop, there are two possibilities: 1) Something is wrong with the file 2) The parser isn't reading the file correctly In order to know which, we will need to see the alignment itself. Chris On Jul 11, 2006, at 7:19 AM, Khoiwal Tara wrote: > Hi, > I want to parse the output of needle.I tried but didn't able to > get expected output. > > My code is as follows: > > #!/usr/local/bin/perl > > use strict; > use warnings; > use Bio::AlignIO; > my $needleReport = $ARGV[0]; > > my $in = new Bio::AlignIO(-format => 'emboss',-file =>$needleReport); > > while(my $align = $in->next_aln()){ > print "Alignment Length:".$align->length()."\n"; > print "Percentage Identity:".$align->percentage_identity()."\n"; > print "Consensus string:".$align->consensus_string()."\n"; > print "Number of sequences:".$align->no_sequence()."\n"; > print "Number of residues:".$align->no_residues()."\n"; > } > > But it doesn't go inside the while loop. 
> Pls help me. > How to find the alignment position for the query sequence on the > target sequence from the needle output? > Where can i find the good documentation on needle parser and its > usage? > Good document on bioperl for beginners. > > Regards, > Tara Khoiwal. > > > --------------------------------- > Sneak preview the all-new Yahoo.com. It's not radically different. > Just radically better. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign _____ Do you Yahoo!? Next-gen email? Have it all with the all-new Yahoo! Mail Beta. From wrp at virginia.edu Tue Jul 11 14:05:29 2006 From: wrp at virginia.edu (William R. Pearson) Date: Tue, 11 Jul 2006 14:05:29 -0400 Subject: [Bioperl-l] Course announcement: CSHL Computational Genomics Course In-Reply-To: References: Message-ID: <45D80228-35DE-44B0-9E11-48EC76CE0DE7@virginia.edu> Course announcement - Application deadline, July 15, 2006 ================================================================ Cold Spring Harbor COMPUTATIONAL & COMPARATIVE GENOMICS November 8 - 14, 2006 Application Deadline: July 15, 2006 INSTRUCTORS: Pearson, William, Ph.D., University of Virginia, Charlottesville, VA Smith, Randall, Ph.D., SmithKline Beecham Pharmaceuticals, King of Prussia, PA Beyond BLAST and FASTA - Alignment: from proteins to genomes - This course presents a comprehensive overview of the theory and practice of computational methods for extracting the maximum amount of information from protein and DNA sequence similarity through sequence database searches, statistical analysis, and multiple sequence alignment, and genome scale alignment. 
Additional topics include gene finding, identifying signals in unaligned sequences, and integration of genetic and sequence information in biological databases. The course combines lectures with hands-on exercises; students are encouraged to pose challenging sequence analysis problems using their own data. The course makes extensive use of local WWW pages to present problem sets and the computing tools to solve them. Students use Windows and Mac workstations attached to a UNIX server; participants should be comfortable using the Unix operating system and a Unix text editor. The course is designed for biologists seeking advanced training in biological sequence analysis, computational biology core resource directors and staff, and for scientists in other disciplines, such as computer science, who wish to survey current research problems in biological sequence analysis and comparative genomics. The primary focus of the Computational and Comparative Genomics Course is the theory and practice of algorithms used in computational biology, with the goal of using current methods more effectively and developing new algorithms. Cold Spring Harbor also offers a "Programming for Biology" course, which focuses more on software development. Over the past few years, the course has been expanded to cover more algorithms and exercises on comparative genomics and genome databases. For additional information and the lecture schedule and problem sets for the 2005 course, see: http://fasta.bioch.virginia.edu/cshl05 ================================================================ To apply to the course, fill out the form at: http://meetings.cshl.edu/courses/courseapplication.asp ================================================================ From rvosa at sfu.ca Tue Jul 11 14:58:25 2006 From: rvosa at sfu.ca (Rutger Vos) Date: Tue, 11 Jul 2006 11:58:25 -0700 Subject: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading? 
In-Reply-To: <954D76EB-F8D3-4652-957B-6ED4BBAEE209@uiuc.edu> References: <44B36846.8070103@sfu.ca> <954D76EB-F8D3-4652-957B-6ED4BBAEE209@uiuc.edu> Message-ID: <44B3F4D1.7090804@sfu.ca> I must have overlooked this. I think it does what I want. So could I do something like: $obj->thow_not_implemented( -class => 'Bio::Root::NotImplemented' ); ...in interfaces? Chris Fields wrote: > I suppose you could; Bio::Root::Root does that using Error.pm (if it > is installed). It almost sounds like what Bio::Root::Root does is > what you want, but you want a little more information when exceptions > are thrown maybe? > > from perldoc Bio::Root::Root: > > ... > # Alternatively, using the new typed exception syntax in > the throw() call: > > $obj->throw( -class => 'Bio::Root::BadParameter', > -text => "Can not open file $file", > -value => $file); > ... > > Typed Exception Syntax > > The typed exception syntax of throw() has the advantage of > plainly > indicating the nature of the trouble, since the name of the > class is > included in the title of the exception output. > > To take advantage of this capability, you must specify > arguments as > named parameters in the throw() call. Here are the parameters: > > -class > name of the class of the exception. This should be one > of the > classes defined in Bio::Root::Exception, or a custom > error of yours > that extends one of the exceptions defined in > Bio::Root::Exception. > > -text > a sensible message for the exception > > -value > the value causing the exception or $!, if appropriate. > > Note that Bio::Root::Exception does not need to be imported > into your > module (or script) namespace in order to throw exceptions via > Bio::Root::Root::throw(), since Bio::Root::Root imports it. 
> > > Chris > > On Jul 11, 2006, at 3:58 AM, Rutger Vos wrote: > > >> Dear all, >> >> would it be possible to overload Bio::Root::RootI's 'throw' method to >> accept an additional, optional (positional) argument to define the >> exception class, e.g. using Exception::Class: >> >> # ...somewhere ... >> >> sub makefh { >> my ( $self, $filename ) = @_; >> open my $fh, '<' $filename or $self->throw("Can't open file: $!", >> 'Bio::Exceptions::FileIO'); # NOTE second argument >> return $fh; >> } >> >> #.... somewhere else >> my $fh; >> eval { >> $fh = $obj->makefh( 'data.txt'); >> } >> if ( $@ and $@->isa('Bio::Exceptions::FileIO') ) { >> # something's wrong with the file? >> } >> >> -- >> ++++++++++++++++++++++++++++++++++++++++++++++++++++ >> Rutger Vos, PhD. candidate >> Department of Biological Sciences >> Simon Fraser University >> 8888 University Drive >> Burnaby, BC, V5A1S6 >> Phone: 604-291-5625 >> Fax: 604-291-3496 >> Personal site: http://www.sfu.ca/~rvosa >> FAB* lab: http://www.sfu.ca/~fabstar >> Bio::Phylo: http://search.cpan.org/~rvosa/Bio-Phylo/ >> ++++++++++++++++++++++++++++++++++++++++++++++++++++ >> >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> > > Christopher Fields > Postdoctoral Researcher > Lab of Dr. Robert Switzer > Dept of Biochemistry > University of Illinois Urbana-Champaign > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > -- ++++++++++++++++++++++++++++++++++++++++++++++++++++ Rutger Vos, PhD. 
candidate Department of Biological Sciences Simon Fraser University 8888 University Drive Burnaby, BC, V5A1S6 Phone: 604-291-5625 Fax: 604-291-3496 Personal site: http://www.sfu.ca/~rvosa FAB* lab: http://www.sfu.ca/~fabstar Bio::Phylo: http://search.cpan.org/~rvosa/Bio-Phylo/ ++++++++++++++++++++++++++++++++++++++++++++++++++++ From hlapp at gmx.net Tue Jul 11 15:05:03 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Tue, 11 Jul 2006 15:05:03 -0400 Subject: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading? In-Reply-To: <44B36846.8070103@sfu.ca> References: <44B36846.8070103@sfu.ca> Message-ID: <18C839F9-B099-4A4A-9957-2BF4EB7CFB85@gmx.net> I think it does this already, except that I believe you need to create the exception object and initialize with the message upfront. Steve, can you comment? Is this at least somewhat right? -hilmar On Jul 11, 2006, at 4:58 AM, Rutger Vos wrote: > Dear all, > > would it be possible to overload Bio::Root::RootI's 'throw' method to > accept an additional, optional (positional) argument to define the > exception class, e.g. using Exception::Class: > > # ...somewhere ... > > sub makefh { > my ( $self, $filename ) = @_; > open my $fh, '<' $filename or $self->throw("Can't open file: $!", > 'Bio::Exceptions::FileIO'); # NOTE second argument > return $fh; > } > > #.... somewhere else > my $fh; > eval { > $fh = $obj->makefh( 'data.txt'); > } > if ( $@ and $@->isa('Bio::Exceptions::FileIO') ) { > # something's wrong with the file? > } > > -- > ++++++++++++++++++++++++++++++++++++++++++++++++++++ > Rutger Vos, PhD. 
candidate > Department of Biological Sciences > Simon Fraser University > 8888 University Drive > Burnaby, BC, V5A1S6 > Phone: 604-291-5625 > Fax: 604-291-3496 > Personal site: http://www.sfu.ca/~rvosa > FAB* lab: http://www.sfu.ca/~fabstar > Bio::Phylo: http://search.cpan.org/~rvosa/Bio-Phylo/ > ++++++++++++++++++++++++++++++++++++++++++++++++++++ > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at gmx.net Tue Jul 11 15:05:54 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Tue, 11 Jul 2006 15:05:54 -0400 Subject: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading? In-Reply-To: <954D76EB-F8D3-4652-957B-6ED4BBAEE209@uiuc.edu> References: <44B36846.8070103@sfu.ca> <954D76EB-F8D3-4652-957B-6ED4BBAEE209@uiuc.edu> Message-ID: <297D4770-A963-4039-8D90-987CC570BA94@gmx.net> Alright - well spotted Chris. This is what I was looking for. On Jul 11, 2006, at 9:13 AM, Chris Fields wrote: > I suppose you could; Bio::Root::Root does that using Error.pm (if it > is installed). It almost sounds like what Bio::Root::Root does is > what you want, but you want a little more information when exceptions > are thrown maybe? > > from perldoc Bio::Root::Root: > > ... > # Alternatively, using the new typed exception syntax in > the throw() call: > > $obj->throw( -class => 'Bio::Root::BadParameter', > -text => "Can not open file $file", > -value => $file); > ... > > Typed Exception Syntax > > The typed exception syntax of throw() has the advantage of > plainly > indicating the nature of the trouble, since the name of the > class is > included in the title of the exception output. 
> > To take advantage of this capability, you must specify > arguments as > named parameters in the throw() call. Here are the parameters: > > -class > name of the class of the exception. This should be one > of the > classes defined in Bio::Root::Exception, or a custom > error of yours > that extends one of the exceptions defined in > Bio::Root::Exception. > > -text > a sensible message for the exception > > -value > the value causing the exception or $!, if appropriate. > > Note that Bio::Root::Exception does not need to be imported > into your > module (or script) namespace in order to throw exceptions via > Bio::Root::Root::throw(), since Bio::Root::Root imports it. > > > Chris > > On Jul 11, 2006, at 3:58 AM, Rutger Vos wrote: > >> Dear all, >> >> would it be possible to overload Bio::Root::RootI's 'throw' method to >> accept an additional, optional (positional) argument to define the >> exception class, e.g. using Exception::Class: >> >> # ...somewhere ... >> >> sub makefh { >> my ( $self, $filename ) = @_; >> open my $fh, '<' $filename or $self->throw("Can't open file: $!", >> 'Bio::Exceptions::FileIO'); # NOTE second argument >> return $fh; >> } >> >> #.... somewhere else >> my $fh; >> eval { >> $fh = $obj->makefh( 'data.txt'); >> } >> if ( $@ and $@->isa('Bio::Exceptions::FileIO') ) { >> # something's wrong with the file? >> } >> >> -- >> ++++++++++++++++++++++++++++++++++++++++++++++++++++ >> Rutger Vos, PhD. 
candidate >> Department of Biological Sciences >> Simon Fraser University >> 8888 University Drive >> Burnaby, BC, V5A1S6 >> Phone: 604-291-5625 >> Fax: 604-291-3496 >> Personal site: http://www.sfu.ca/~rvosa >> FAB* lab: http://www.sfu.ca/~fabstar >> Bio::Phylo: http://search.cpan.org/~rvosa/Bio-Phylo/ >> ++++++++++++++++++++++++++++++++++++++++++++++++++++ >> >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > > Christopher Fields > Postdoctoral Researcher > Lab of Dr. Robert Switzer > Dept of Biochemistry > University of Illinois Urbana-Champaign > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From cjfields at uiuc.edu Tue Jul 11 16:42:35 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 11 Jul 2006 15:42:35 -0500 Subject: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading? In-Reply-To: <44B3F4D1.7090804@sfu.ca> Message-ID: <000b01c6a52a$8c77f8c0$15327e82@pyrimidine> Bio::Root::Root doesn't overload throw_not_implemented from Bio::Root::RootI; from the comments looks like Steve C and Ewan B couldn't work out some of the Error.pm issues. Judging by the POD for Bio::Root::RootI, throw_not_implemented doesn't accept arguments; it throws a Bio::Root::NotImplemented exception automatically. Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Rutger Vos > Sent: Tuesday, July 11, 2006 1:58 PM > To: Chris Fields > Cc: 'Bioperl List' > Subject: Re: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading? > > I must have overlooked this. 
I think it does what I want. So could I do > something like: > > $obj->thow_not_implemented( -class => 'Bio::Root::NotImplemented' ); > > ...in interfaces? > > Chris Fields wrote: > > I suppose you could; Bio::Root::Root does that using Error.pm (if it > > is installed). It almost sounds like what Bio::Root::Root does is > > what you want, but you want a little more information when exceptions > > are thrown maybe? > > > > from perldoc Bio::Root::Root: > > > > ... > > # Alternatively, using the new typed exception syntax in > > the throw() call: > > > > $obj->throw( -class => 'Bio::Root::BadParameter', > > -text => "Can not open file $file", > > -value => $file); > > ... > > > > Typed Exception Syntax > > > > The typed exception syntax of throw() has the advantage of > > plainly > > indicating the nature of the trouble, since the name of the > > class is > > included in the title of the exception output. > > > > To take advantage of this capability, you must specify > > arguments as > > named parameters in the throw() call. Here are the parameters: > > > > -class > > name of the class of the exception. This should be one > > of the > > classes defined in Bio::Root::Exception, or a custom > > error of yours > > that extends one of the exceptions defined in > > Bio::Root::Exception. > > > > -text > > a sensible message for the exception > > > > -value > > the value causing the exception or $!, if appropriate. > > > > Note that Bio::Root::Exception does not need to be imported > > into your > > module (or script) namespace in order to throw exceptions via > > Bio::Root::Root::throw(), since Bio::Root::Root imports it. > > > > > > Chris > > > > On Jul 11, 2006, at 3:58 AM, Rutger Vos wrote: > > > > > >> Dear all, > >> > >> would it be possible to overload Bio::Root::RootI's 'throw' method to > >> accept an additional, optional (positional) argument to define the > >> exception class, e.g. using Exception::Class: > >> > >> # ...somewhere ... 
> >> > >> sub makefh { > >> my ( $self, $filename ) = @_; > >> open my $fh, '<' $filename or $self->throw("Can't open file: $!", > >> 'Bio::Exceptions::FileIO'); # NOTE second argument > >> return $fh; > >> } > >> > >> #.... somewhere else > >> my $fh; > >> eval { > >> $fh = $obj->makefh( 'data.txt'); > >> } > >> if ( $@ and $@->isa('Bio::Exceptions::FileIO') ) { > >> # something's wrong with the file? > >> } > >> > >> -- > >> ++++++++++++++++++++++++++++++++++++++++++++++++++++ > >> Rutger Vos, PhD. candidate > >> Department of Biological Sciences > >> Simon Fraser University > >> 8888 University Drive > >> Burnaby, BC, V5A1S6 > >> Phone: 604-291-5625 > >> Fax: 604-291-3496 > >> Personal site: http://www.sfu.ca/~rvosa > >> FAB* lab: http://www.sfu.ca/~fabstar > >> Bio::Phylo: http://search.cpan.org/~rvosa/Bio-Phylo/ > >> ++++++++++++++++++++++++++++++++++++++++++++++++++++ > >> > >> > >> _______________________________________________ > >> Bioperl-l mailing list > >> Bioperl-l at lists.open-bio.org > >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > >> > > > > Christopher Fields > > Postdoctoral Researcher > > Lab of Dr. Robert Switzer > > Dept of Biochemistry > > University of Illinois Urbana-Champaign > > > > > > > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > > > > > > > -- > ++++++++++++++++++++++++++++++++++++++++++++++++++++ > Rutger Vos, PhD. 
candidate > Department of Biological Sciences > Simon Fraser University > 8888 University Drive > Burnaby, BC, V5A1S6 > Phone: 604-291-5625 > Fax: 604-291-3496 > Personal site: http://www.sfu.ca/~rvosa > FAB* lab: http://www.sfu.ca/~fabstar > Bio::Phylo: http://search.cpan.org/~rvosa/Bio-Phylo/ > ++++++++++++++++++++++++++++++++++++++++++++++++++++ > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From frederick.partridge at st-johns.oxford.ac.uk Tue Jul 11 17:23:28 2006 From: frederick.partridge at st-johns.oxford.ac.uk (Frederick Partridge) Date: Tue, 11 Jul 2006 22:23:28 +0100 (BST) Subject: [Bioperl-l] Get nucleotide sequence when expecting protein from genpept Message-ID: I am trying to retrieve various protein sequences from genpept using get_Seq_by_acc. All of them work ok, except one T16005: If I try and retrieve it with a reduced program: #!usr/bin/perl -w use strict; use Bio::Perl; use Bio::SeqIO; my $genpept = new Bio::DB::GenPept; my $seq = $genpept->get_Seq_by_acc('T16005'); print ($seq->seq(),'\n'); I get back a nucleotide sequence, which is another sequence at NCBI with the same accession number. (I thought these were meant to be unique? but evidently not.) I am using bioperl 1.5.1, perl 5.8.1, Mac OS 10.3 Could anyone help me to get this protein sequence with my program? Many thanks, Freddie Partridge University of Oxford From qfdong at iastate.edu Tue Jul 11 17:32:56 2006 From: qfdong at iastate.edu (Qunfeng) Date: Tue, 11 Jul 2006 16:32:56 -0500 Subject: [Bioperl-l] Get nucleotide sequence when expecting protein from genpept In-Reply-To: References: Message-ID: <6.1.2.0.2.20060711163128.08086570@qfdong.mail.iastate.edu> This particular protein record (acc#T16005) was imported from PIR. In other words, this is not an original GenBank protein record. When GenBank imports protein records from other DB, it keeps their original acc#. 
However, gi# should be unique. Q At 04:23 PM 7/11/2006, Frederick Partridge wrote: >I am trying to retrieve various protein sequences from genpept using >get_Seq_by_acc. All of them work ok, except one T16005: > > >If I try and retrieve it with a reduced program: > > >#!usr/bin/perl -w > >use strict; > >use Bio::Perl; >use Bio::SeqIO; > >my $genpept = new Bio::DB::GenPept; > >my $seq = $genpept->get_Seq_by_acc('T16005'); > >print ($seq->seq(),'\n'); > > > >I get back a nucleotide sequence, which is another sequence at NCBI with >the same accession number. (I thought these were meant to be unique? but >evidently not.) > > >I am using bioperl 1.5.1, perl 5.8.1, Mac OS 10.3 > > >Could anyone help me to get this protein sequence with my program? > > >Many thanks, > > > >Freddie Partridge > >University of Oxford > > >_______________________________________________ >Bioperl-l mailing list >Bioperl-l at lists.open-bio.org >http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Tue Jul 11 18:05:09 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 11 Jul 2006 17:05:09 -0500 Subject: [Bioperl-l] Get nucleotide sequence when expecting protein fromgenpept In-Reply-To: Message-ID: <000001c6a536$141befb0$15327e82@pyrimidine> It's an imported PIR record, so there probably is no accession recorded in the database. I think NCBI uses a fallback to nucleotide if it can't find a particular accession via protein. Using the primary ID (the GI#, 7498730) works. Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Frederick Partridge > Sent: Tuesday, July 11, 2006 4:23 PM > To: bioperl-l at lists.open-bio.org > Subject: [Bioperl-l] Get nucleotide sequence when expecting protein > fromgenpept > > > > I am trying to retrieve various protein sequences from genpept using > get_Seq_by_acc. 
All of them work ok, except one T16005: > > > If I try and retrieve it with a reduced program: > > > #!usr/bin/perl -w > > use strict; > > use Bio::Perl; > use Bio::SeqIO; > > my $genpept = new Bio::DB::GenPept; > > my $seq = $genpept->get_Seq_by_acc('T16005'); > > print ($seq->seq(),'\n'); > > > > I get back a nucleotide sequence, which is another sequence at NCBI with > the same accession number. (I thought these were meant to be unique? but > evidently not.) > > > I am using bioperl 1.5.1, perl 5.8.1, Mac OS 10.3 > > > Could anyone help me to get this protein sequence with my program? > > > Many thanks, > > > > Freddie Partridge > > University of Oxford > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Tue Jul 11 18:47:38 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 11 Jul 2006 17:47:38 -0500 Subject: [Bioperl-l] Get nucleotide sequence when expecting proteinfromgenpept In-Reply-To: <000001c6a536$141befb0$15327e82@pyrimidine> Message-ID: <000201c6a53c$03970ed0$15327e82@pyrimidine> Okay, now try this: use Bio::DB::GenPept; use Bio::SeqIO; my $factory = Bio::DB::GenPept->new(-format => 'fasta'); my $seqin = $factory->get_Stream_by_acc('T16005'); my $seqout = Bio::SeqIO->new(-fh => \*STDOUT, -format => 'fasta'); while (my $seq = $seqin->next_seq) { $seqout->write_seq($seq); } This returns both the nucleotide sequence and the correct protein sequence; the protein was returned second for some reason, so get_Seq_by_acc misses it while get_Stream_by_acc doesn't. I have notified NCBI about this issue, but they will likely just tell me to use the GI number for searches as they are unique. Probably a good warning for anyone using accessions for all their work (I use the GI myself). 
Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Chris Fields > Sent: Tuesday, July 11, 2006 5:05 PM > To: 'Frederick Partridge'; bioperl-l at lists.open-bio.org > Subject: Re: [Bioperl-l] Get nucleotide sequence when expecting > proteinfromgenpept > > It's an imprted PIR record, so there probably is no accession recorded in > the database. I think NCBI uses a fallback to nucleotide if it can't find > a > particular accession via protein. Using the primary ID (the GI#, 7498730) > works. > > Chris > > > -----Original Message----- > > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > > bounces at lists.open-bio.org] On Behalf Of Frederick Partridge > > Sent: Tuesday, July 11, 2006 4:23 PM > > To: bioperl-l at lists.open-bio.org > > Subject: [Bioperl-l] Get nucleotide sequence when expecting protein > > fromgenpept > > > > > > > > I am trying to retrieve various protein sequences from genpept using > > get_Seq_by_acc. All of them work ok, except one T16005: > > > > > > If I try and retrieve it with a reduced program: > > > > > > #!usr/bin/perl -w > > > > use strict; > > > > use Bio::Perl; > > use Bio::SeqIO; > > > > my $genpept = new Bio::DB::GenPept; > > > > my $seq = $genpept->get_Seq_by_acc('T16005'); > > > > print ($seq->seq(),'\n'); > > > > > > > > I get back a nucleotide sequence, which is another sequence at NCBI with > > the same accession number. (I thought these were meant to be unique? but > > evidently not.) > > > > > > I am using bioperl 1.5.1, perl 5.8.1, Mac OS 10.3 > > > > > > Could anyone help me to get this protein sequence with my program? 
> > > > > > Many thanks, > > > > > > > > Freddie Partridge > > > > University of Oxford > > > > > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From Steve_Chervitz at affymetrix.com Tue Jul 11 20:21:16 2006 From: Steve_Chervitz at affymetrix.com (Steve Chervitz) Date: Tue, 11 Jul 2006 17:21:16 -0700 Subject: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading? In-Reply-To: <18C839F9-B099-4A4A-9957-2BF4EB7CFB85@gmx.net> Message-ID: The Bio::Root::Root object is rigged to use the Error.pm module if available, so you can throw and catch exception objects derived from Error. The motivation here was to provide a recommended path for folks that want to use more structured exception handling logic in their bioperl code. There are a number of pre-defined subclasses of exceptions that cover common problems (such as FileOpenException), but you can also define your own. See a list of the predefined exceptions as well as some how-to docs in the POD for Bio::Root::Exception: http://search.cpan.org/~birney/bioperl-1.4/Bio/Root/Exception.pm There's a bunch more info about Bioperl exception fun available from the bioperl distribution under the examples/root directory. See the README in that directory to get oriented. There are a number of demo scripts there, too. 
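For readers following the thread, the class-based catch that the original question was after can be sketched in plain Perl with nothing but eval and die. This is a minimal illustration under stated assumptions, not the Bio::Root::Root or Error.pm implementation: the class My::Exception::FileIO and the file name are invented for the example. It also fixes two slips in the snippet quoted in this thread (a missing comma after '<' in the open call and a missing semicolon after the eval block).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Scalar::Util qw(blessed);

# Hypothetical exception class for illustration only; it is not part of
# BioPerl, Error.pm or Exception::Class.
package My::Exception::FileIO;

sub new {
    my ($class, %args) = @_;
    return bless { %args }, $class;
}
sub text  { return $_[0]->{text} }
sub throw { my $class = shift; die $class->new(@_) }

package main;

# Open a file for reading, throwing a typed exception on failure.
sub makefh {
    my ($filename) = @_;
    open my $fh, '<', $filename
        or My::Exception::FileIO->throw(
            text => "Can't open file '$filename': $!");
    return $fh;
}

my $fh = eval { makefh('no_such_file.txt') };
if (my $err = $@) {
    if (blessed($err) && $err->isa('My::Exception::FileIO')) {
        # Something is wrong with the file; handle it here.
        print "File problem: ", $err->text, "\n";
    }
    else {
        die $err;    # not one of ours, so re-throw
    }
}
```

Checking blessed() before calling isa() keeps the handler safe when $@ holds a plain string, which is what an untyped die produces.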
Bio::Root::Root doesn't know anything about Exception::Class, but I see you can use it with Error.pm as described here:

http://search.cpan.org/~drolsky/Exception-Class-1.23/lib/Exception/Class.pm#OTHER_EXCEPTION_MODULES_(try%2Fcatch_syntax)

Cheers,
Steve

> From: Hilmar Lapp
> Date: Tue, 11 Jul 2006 15:05:03 -0400
> To: Rutger Vos
> Cc: Bioperl, Steve Chervitz
> Subject: Re: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading?
>
> I think it does this already, except that I believe you need to
> create the exception object and initialize with the message upfront.
>
> Steve, can you comment? Is this at least somewhat right?
>
> -hilmar
>
> On Jul 11, 2006, at 4:58 AM, Rutger Vos wrote:
>
>> Dear all,
>>
>> would it be possible to overload Bio::Root::RootI's 'throw' method to
>> accept an additional, optional (positional) argument to define the
>> exception class, e.g. using Exception::Class:
>>
>> # ...somewhere ...
>>
>> sub makefh {
>>     my ( $self, $filename ) = @_;
>>     open my $fh, '<', $filename or $self->throw("Can't open file: $!",
>>         'Bio::Exceptions::FileIO'); # NOTE second argument
>>     return $fh;
>> }
>>
>> #.... somewhere else
>> my $fh;
>> eval {
>>     $fh = $obj->makefh( 'data.txt');
>> };
>> if ( $@ and $@->isa('Bio::Exceptions::FileIO') ) {
>>     # something's wrong with the file?
>> }
>>
>> --
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Rutger Vos, PhD. candidate
>> Department of Biological Sciences
>> Simon Fraser University
>> 8888 University Drive
>> Burnaby, BC, V5A1S6
>> Phone: 604-291-5625
>> Fax: 604-291-3496
>> Personal site: http://www.sfu.ca/~rvosa
>> FAB* lab: http://www.sfu.ca/~fabstar
>> Bio::Phylo: http://search.cpan.org/~rvosa/Bio-Phylo/
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> --
> ===========================================================
> : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
> ===========================================================

From Steve_Chervitz at affymetrix.com Tue Jul 11 21:07:06 2006
From: Steve_Chervitz at affymetrix.com (Steve_Chervitz)
Date: Tue, 11 Jul 2006 18:07:06 -0700
Subject: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading?
In-Reply-To: <000b01c6a52a$8c77f8c0$15327e82@pyrimidine>
References: <000b01c6a52a$8c77f8c0$15327e82@pyrimidine>
Message-ID: <337D69E3-4075-426B-A53E-31D37ED6CA2E@affymetrix.com>

On Jul 11, 2006, at 1:42 PM, Chris Fields wrote:

> Bio::Root::Root doesn't overload throw_not_implemented from
> Bio::Root::RootI; from the comments looks like Steve C and Ewan B
> couldn't work out some of the Error.pm issues.

The issue (I believe) was that Bio::Root::RootI::throw_not_implemented was doing some checking for the presence of Error.pm and calling Error::throw. I changed it so that this fanciness only happens in Root.pm.

> Judging by the POD for Bio::Root::RootI, throw_not_implemented doesn't
> accept arguments; it throws a Bio::Root::NotImplemented exception
> automatically.

Looking at the code now, throw_not_implemented() does not throw a Bio::Root::NotImplemented exception. It just throws a simple, unclassed message.
We could allow it to throw an exception of class Bio::Root::NotImplemented by changing this code:

    if( $self->can('throw') ) {
        $self->throw($message);
    }...

to this:

    if( $self->can('throw') ) {
        $self->throw(-text  => $message,
                     -class => 'Bio::Root::NotImplemented');
    }...

This does not create any dependency on Error.pm, but permits it to be used if available. If Error.pm is not loaded, the only change is that the class string is included in the error message, which is kind of handy.

Trouble would occur if the implementing class:

* does not derive from Bio::Root::Root,
* does not import Bio::Root::Exception,
* fails to implement a method which gets called, and
* Error.pm is available.

I don't know if such implementations exist in bioperl now, but I suspect they would be rare (and discouraged).

Steve

> Chris
>
>> -----Original Message-----
>> From: bioperl-l-bounces at lists.open-bio.org
>> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Rutger Vos
>> Sent: Tuesday, July 11, 2006 1:58 PM
>> To: Chris Fields
>> Cc: 'Bioperl List'
>> Subject: Re: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading?
>>
>> I must have overlooked this. I think it does what I want. So could I do
>> something like:
>>
>>     $obj->throw_not_implemented( -class => 'Bio::Root::NotImplemented' );
>>
>> ...in interfaces?
>>
>> Chris Fields wrote:
>>> I suppose you could; Bio::Root::Root does that using Error.pm (if it
>>> is installed). It almost sounds like what Bio::Root::Root does is
>>> what you want, but you want a little more information when exceptions
>>> are thrown maybe?
>>>
>>> from perldoc Bio::Root::Root:
>>>
>>> ...
>>>     # Alternatively, using the new typed exception syntax in the throw() call:
>>>
>>>     $obj->throw( -class => 'Bio::Root::BadParameter',
>>>                  -text  => "Can not open file $file",
>>>                  -value => $file);
>>> ...
>>>
>>> Typed Exception Syntax
>>>
>>> The typed exception syntax of throw() has the advantage of plainly
>>> indicating the nature of the trouble, since the name of the class is
>>> included in the title of the exception output.
>>>
>>> To take advantage of this capability, you must specify arguments as
>>> named parameters in the throw() call. Here are the parameters:
>>>
>>> -class
>>>     name of the class of the exception. This should be one of the
>>>     classes defined in Bio::Root::Exception, or a custom error of yours
>>>     that extends one of the exceptions defined in Bio::Root::Exception.
>>>
>>> -text
>>>     a sensible message for the exception
>>>
>>> -value
>>>     the value causing the exception or $!, if appropriate.
>>>
>>> Note that Bio::Root::Exception does not need to be imported into your
>>> module (or script) namespace in order to throw exceptions via
>>> Bio::Root::Root::throw(), since Bio::Root::Root imports it.
>>>
>>> Chris

From cjfields at uiuc.edu Tue Jul 11 23:27:37 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Tue, 11 Jul 2006 22:27:37 -0500
Subject: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading?
In-Reply-To: <337D69E3-4075-426B-A53E-31D37ED6CA2E@affymetrix.com>
References: <000b01c6a52a$8c77f8c0$15327e82@pyrimidine> <337D69E3-4075-426B-A53E-31D37ED6CA2E@affymetrix.com>
Message-ID: <18B66484-FE46-4F77-BDD1-B97085D04C95@uiuc.edu>

Makes sense to keep most of the magic in Root instead of RootI.pm. The POD for RootI does state that the class exception thrown is Bio::Root::NotImplemented, so we should probably either change the POD to reflect what really happens or change throw_not_implemented like you suggest (my vote is the latter). I don't think many (if any) implementing classes fall into your 'trouble' category, though I can't be sure how many actually import Bio::Root::Exception.

Chris

Christopher Fields
Postdoctoral Researcher
Lab of Dr.
Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From frederick.partridge at st-johns.oxford.ac.uk Wed Jul 12 11:16:33 2006 From: frederick.partridge at st-johns.oxford.ac.uk (Frederick Partridge) Date: Wed, 12 Jul 2006 16:16:33 +0100 (BST) Subject: [Bioperl-l] Get nucleotide sequence when expecting proteinfromgenpept In-Reply-To: <000201c6a53c$03970ed0$15327e82@pyrimidine> References: <000201c6a53c$03970ed0$15327e82@pyrimidine> Message-ID: On Tue, 11 Jul 2006, Chris Fields wrote: > This returns both the nucleotide sequence and the correct protein sequence; > the protein was returned second for some reason, so get_Seq_by_acc misses it > while get_Stream_by_acc doesn't. I have notified NCBI about this issue, but > they will likely just tell me to use the GI number for searches as they are > unique. Probably a good warning for anyone using accessions for all their > work (I use the GI myself). Thank you both for your help, I have converted to GIs and it works much better. As an aside, it might be nice to have a $hit->gi method as well as $hit->accession for parsing blast reports. (I now realise that you can derive the gi from $hit->name, but this might have encouraged me to start off using gi instead of accession numbers). Freddie Partridge University of Oxford From cjfields at uiuc.edu Wed Jul 12 11:39:39 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Wed, 12 Jul 2006 10:39:39 -0500 Subject: [Bioperl-l] Get nucleotide sequence when expecting proteinfromgenpept In-Reply-To: Message-ID: <000b01c6a5c9$635a7540$15327e82@pyrimidine> Problem is, you may or may not have GIs for a BLAST hit depending on how you retrieve the BLAST report, what interface you use, etc. NCBI is pretty ambiguous when it comes to GI vs. accession; the sequence database guys want you to use the GI for searches (since that's the unique ID for NCBI's databases) and don't promise getting the correct sequence using the accession. 
However, the BLAST interface guys have set up the BLAST CGI server to not return the GI by default (accessible through Bio::Tools::Run::RemoteBlast). Even more confusing, if you use the NCBI BLAST web interface, this option is turned on by default. Don't know what blastcl3 or blastall does, haven't checked in a while.

Anyway, this could be why there is no $hit->gi method for GenericHit/BlastHit. It could be added; I will need to look at SearchIO::blast/blastxml/blasttable to see how this is parsed out.

BTW, what I do as a work-around, when using RemoteBlast, is below (you could use the while loop to grab the GIs using SearchIO::blast if they are present in the BLAST report). This grabs all the GIs from the description line (not just the best hit).

    # sets retrieval header to include the GI always
    $Bio::Tools::Run::RemoteBlast::RETRIEVALHEADER{'NCBI_GI'} = 'yes';
    ...
    while ( my $hit = $result->next_hit) {
        my $description = $hit->description;
        while ($description =~ /gi\|(.*?)\|/g) {
            my $gi = $1;
            push @gis, $gi;
        }
    }

Chris

From Steve_Chervitz at affymetrix.com Wed Jul 12 14:53:22 2006
From: Steve_Chervitz at affymetrix.com (Steve_Chervitz)
Date: Wed, 12 Jul 2006 11:53:22 -0700
Subject: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading?
In-Reply-To: <18B66484-FE46-4F77-BDD1-B97085D04C95@uiuc.edu>
References: <000b01c6a52a$8c77f8c0$15327e82@pyrimidine> <337D69E3-4075-426B-A53E-31D37ED6CA2E@affymetrix.com> <18B66484-FE46-4F77-BDD1-B97085D04C95@uiuc.edu>
Message-ID: <3E119694-68C5-47A6-971B-8E035CBB6429@affymetrix.com>

For modules that derive from Bio::Root::Root, there's no need to import Bio::Root::Exception since the Root object does it.

I also favor adding the -class parameter to throw_not_implemented in RootI. I just committed this change in bioperl-live. I also added a test for it in t/RootI.t.

I haven't run the complete suite of tests after making this change, but I don't suspect there'll be any trouble (famous last words). Really, if any test leads to the calling of throw_not_implemented (besides the test I just added), that in itself is trouble.

Steve

On Jul 11, 2006, at 8:27 PM, Chris Fields wrote:

> Makes sense to keep most of the magic in Root instead of RootI.pm.
> The POD for RootI does state that the class exception thrown is
> Bio::Root::NotImplemented, so we should probably either change the
> POD to reflect what really happens or change throw_not_implemented
> like you suggest (my vote is the latter). I don't think many (if
> any) implementing classes fall into your 'trouble' category, though I
> can't be sure how many actually import Bio::Root::Exception.

From cjfields at uiuc.edu Wed Jul 12 15:23:33 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Wed, 12 Jul 2006 14:23:33 -0500
Subject: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading?
In-Reply-To: <3E119694-68C5-47A6-971B-8E035CBB6429@affymetrix.com>
Message-ID: <000901c6a5e8$aaca53e0$15327e82@pyrimidine>

Thanks Steve!
Chris
> >>>>> > >>>>> > >>>>> Chris > >>>>> > >>>>> On Jul 11, 2006, at 3:58 AM, Rutger Vos wrote: > >>>>> > >>>>> > >>>>>> Dear all, > >>>>>> > >>>>>> would it be possible to overload Bio::Root::RootI's 'throw' > >>>>>> method to > >>>>>> accept an additional, optional (positional) argument to define > >>>>>> the > >>>>>> exception class, e.g. using Exception::Class: > >>>>>> > >>>>>> # ...somewhere ... > >>>>>> > >>>>>> sub makefh { > >>>>>> my ( $self, $filename ) = @_; > >>>>>> open my $fh, '<' $filename or $self->throw("Can't open > >>>>>> file: $!", > >>>>>> 'Bio::Exceptions::FileIO'); # NOTE second argument > >>>>>> return $fh; > >>>>>> } > >>>>>> > >>>>>> #.... somewhere else > >>>>>> my $fh; > >>>>>> eval { > >>>>>> $fh = $obj->makefh( 'data.txt'); > >>>>>> } > >>>>>> if ( $@ and $@->isa('Bio::Exceptions::FileIO') ) { > >>>>>> # something's wrong with the file? > >>>>>> } > >>>>>> > >>>>>> -- > >>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++ > >>>>>> Rutger Vos, PhD. candidate > >>>>>> Department of Biological Sciences > >>>>>> Simon Fraser University > >>>>>> 8888 University Drive > >>>>>> Burnaby, BC, V5A1S6 > >>>>>> Phone: 604-291-5625 > >>>>>> Fax: 604-291-3496 > >>>>>> Personal site: http://www.sfu.ca/~rvosa > >>>>>> FAB* lab: http://www.sfu.ca/~fabstar > >>>>>> Bio::Phylo: http://search.cpan.org/~rvosa/Bio-Phylo/ > >>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++ > >>>>>> > >>>>>> > >>>>>> _______________________________________________ > >>>>>> Bioperl-l mailing list > >>>>>> Bioperl-l at lists.open-bio.org > >>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l > >>>>>> > >>>>> > >>>>> Christopher Fields > >>>>> Postdoctoral Researcher > >>>>> Lab of Dr. 
Robert Switzer > >>>>> Dept of Biochemistry > >>>>> University of Illinois Urbana-Champaign > >>>>> > >>>>> > >>>>> > >>>>> _______________________________________________ > >>>>> Bioperl-l mailing list > >>>>> Bioperl-l at lists.open-bio.org > >>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l > >>>>> > >>>>> > >>>>> > >>>>> > >>>> > >>>> -- > >>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++ > >>>> Rutger Vos, PhD. candidate > >>>> Department of Biological Sciences > >>>> Simon Fraser University > >>>> 8888 University Drive > >>>> Burnaby, BC, V5A1S6 > >>>> Phone: 604-291-5625 > >>>> Fax: 604-291-3496 > >>>> Personal site: http://www.sfu.ca/~rvosa > >>>> FAB* lab: http://www.sfu.ca/~fabstar > >>>> Bio::Phylo: http://search.cpan.org/~rvosa/Bio-Phylo/ > >>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++ > >>>> > >>>> > >>>> _______________________________________________ > >>>> Bioperl-l mailing list > >>>> Bioperl-l at lists.open-bio.org > >>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l > >>> > >>> _______________________________________________ > >>> Bioperl-l mailing list > >>> Bioperl-l at lists.open-bio.org > >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l > >> > > > > Christopher Fields > > Postdoctoral Researcher > > Lab of Dr. 
Robert Switzer > > Dept of Biochemistry > > University of Illinois Urbana-Champaign > > > > > > > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From dsche at uga.edu Thu Jul 13 14:55:03 2006 From: dsche at uga.edu (Dongsheng Che) Date: Thu, 13 Jul 2006 14:55:03 -0400 (EDT) Subject: [Bioperl-l] remoteBlast problem Message-ID: <20060713145503.CIV61560@punts2.cc.uga.edu> To whom it may concern: I'm trying to do blast search remotely, so I downloaded bioperl-1.5, and followed the installation procedure, ie, perl Makefile.PL, make, make test. make install. I know there are some installation failure during the installation. Since my main purpose is to get remoteBlast worked, I don't want bother to figure out all failures. but I run remote Blast, it gave me some erorrs from examples (bptutorial). ------------------------------------------------------------- Beginning run_remoteblast example... Use of uninitialized value in numeric lt (<) at bptutorial.pl line 3303. **Warning**: Couldn't connect to NCBI with Bio::Tools::Run::StandAloneBlast.pm! Probably no network access. Skipping Test ---------------------------------------------------------------- I wondering what cause the problem. Thanks in advance! Dongsheng From vrramnar at student.cs.uwaterloo.ca Thu Jul 13 18:39:19 2006 From: vrramnar at student.cs.uwaterloo.ca (vrramnar at student.cs.uwaterloo.ca) Date: Thu, 13 Jul 2006 18:39:19 -0400 Subject: [Bioperl-l] Remote Blast - Blast Human Genome Message-ID: <1152830359.44b6cb97ef16c@www.nexusmail.uwaterloo.ca> Hello Again, I have another question regarding Remote blast but this time using Genome Blast. 
Here is the link:

http://www.ncbi.nlm.nih.gov/genome/seq/BlastGen/BlastGen.cgi?taxid=9606

which again uses the main Blast web site:

http://www.ncbi.nlm.nih.gov/BLAST/Blast.cgi

Again, I am not sure what to add or what HEADER information to change within my script.

Here is my program, which is the same as in the last email:

#!/usr/bin/perl -w

use Bio::Perl;
use Bio::Tools::Run::RemoteBlast;

my $prog  = "blastn";
my $db    = "refseq_genomic";
my $e_val = 0.01;

my @params = ( '-prog'   => $prog,
               '-data'   => $db,
               '-expect' => $e_val );

my $factory = Bio::Tools::Run::RemoteBlast->new(@params);
$Bio::Tools::Run::RemoteBlast::HEADER{'WWW_BLAST_TYPE'} = '????';   # <----- what do I put here
#$Bio::Tools::Run::RemoteBlast::HEADER{'?????'} = '????';           # <--- Do I need to add any other values to the form inputs

$factory->submit_blast("blast.in");
my $v = 1;

while ( my @rids = $factory->each_rid ) {
    foreach my $rid (@rids) {
        my $rc = $factory->retrieve_blast($rid);
        if ( !ref($rc) ) {
            if ( $rc < 0 ) {
                $factory->remove_rid($rid);
            }
            print STDERR "." if ( $v > 0 );
            sleep 5;
        }
        else {
            my $result   = $rc->next_result();
            my $filename = $result->query_name() . "\.out";
            $factory->save_output($filename);
            $factory->remove_rid($rid);
            print "\nQuery Name: ", $result->query_name(), "\n";
        }
    }
}

Both of my questions are very similar, in that I know how to use remote blast but am not sure what to change to access the specific blast I want.

Again, any help would be very appreciated!!

Rohan

From vrramnar at student.cs.uwaterloo.ca Thu Jul 13 18:31:38 2006
From: vrramnar at student.cs.uwaterloo.ca (vrramnar at student.cs.uwaterloo.ca)
Date: Thu, 13 Jul 2006 18:31:38 -0400
Subject: [Bioperl-l] Remote Blast - SNP data base
Message-ID: <1152829898.44b6c9cab7a3a@www.nexusmail.uwaterloo.ca>

Hello,

1. I was wondering if anyone knew how to use SNP Blast via the Remote Blast module?
Basically, I want to blast my sequence against the dbSNP database. You can normally do this through NCBI's website:

http://www.ncbi.nlm.nih.gov/SNP/snp_blastByOrg.cgi

The site basically takes your info and submits it to the main blast site:

http://www.ncbi.nlm.nih.gov/BLAST/Blast.cgi

I am just not sure what settings to change within my script. I have something like this:

#!/usr/bin/perl -w

use Bio::Perl;
use Bio::Tools::Run::RemoteBlast;

my $prog  = "blastn";
my $db    = "refseq_genomic";   # <--- What db should I use??
my $e_val = 0.01;

my @params = ( '-prog'   => $prog,
               '-data'   => $db,
               '-expect' => $e_val );

my $factory = Bio::Tools::Run::RemoteBlast->new(@params);

$factory->submit_blast("blast.in");   # <--- Name of my file in fasta format
my $v = 1;

while ( my @rids = $factory->each_rid ) {
    foreach my $rid (@rids) {
        my $rc = $factory->retrieve_blast($rid);
        if ( !ref($rc) ) {
            if ( $rc < 0 ) {
                $factory->remove_rid($rid);
            }
            print STDERR "." if ( $v > 0 );
            sleep 5;
        }
        else {
            my $result   = $rc->next_result();
            my $filename = $result->query_name() . "\.out";
            $factory->save_output($filename);
            $factory->remove_rid($rid);
            print "\nQuery Name: ", $result->query_name(), "\n";
        }
    }
}

I think something like this should be added to have the correct form inputs, but I am unsure:

$Bio::Tools::Run::RemoteBlast::HEADER{'???'} = '????';

Any help on this topic would be greatly appreciated!!

Rohan

From cjfields at uiuc.edu Thu Jul 13 20:42:57 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 13 Jul 2006 19:42:57 -0500
Subject: [Bioperl-l] remoteBlast problem
In-Reply-To: <20060713145503.CIV61560@punts2.cc.uga.edu>
Message-ID: <000401c6a6de$737fe570$15327e82@pyrimidine>

1) Before I get wound up in the obvious here, you need to upgrade to CVS; RemoteBlast and SearchIO::blast were fixed after v1.5.1 (i.e.
in CVS) to account for changes in BLAST output at the NCBI 2) The Bio::Tools::Run::StandAloneBlast.pm bit worried me a little, so I did a little digging; that's a typo. Now corrected in CVS, along with some BPLite cruft left over. 3) Speaking bluntly? Come on. The error is stated as plainly as possible. No? How about this (note the arrows): -----------> **Warning**: Couldn't connect to NCBI with -----------> Bio::Tools::Run::StandAloneBlast.pm! -----------> Probably no network access. Skipping Test Check your network connections, preferably AFTER you update to CVS. It's possible that it's a proxy issue, but that should also be fixed in CVS. Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Dongsheng Che > Sent: Thursday, July 13, 2006 1:55 PM > To: bioperl-l at bioperl.org > Subject: [Bioperl-l] remoteBlast problem > > To whom it may concern: > > I'm trying to do blast search remotely, so I downloaded bioperl-1.5, and > followed the installation procedure, ie, perl Makefile.PL, make, make > test. make install. I know there are some installation failure during the > installation. > > Since my main purpose is to get remoteBlast worked, I don't want bother to > figure out all failures. but I run remote Blast, it gave me some erorrs > from examples (bptutorial). > ------------------------------------------------------------- > Beginning run_remoteblast example... > Use of uninitialized value in numeric lt (<) at bptutorial.pl line 3303. > > > **Warning**: Couldn't connect to NCBI with > Bio::Tools::Run::StandAloneBlast.pm! > Probably no network access. > Skipping Test > ---------------------------------------------------------------- > > I wondering what cause the problem. > > Thanks in advance! 
> > Dongsheng > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Thu Jul 13 21:56:30 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 13 Jul 2006 20:56:30 -0500 Subject: [Bioperl-l] Remote Blast - Blast Human Genome In-Reply-To: <1152830359.44b6cb97ef16c@www.nexusmail.uwaterloo.ca> Message-ID: <000501c6a6e8$b9c24d20$15327e82@pyrimidine> I added a method to RemoteBlast in bioperl-live (CVS) if you want to play with changing the URL. I have been thinking about doing this for a bit now but I already see problems. Here's the issue: the BLAST page you see is NOT the NCBI BLAST page (note the differences in the URL) but a user-friendly request page, generated on the fly by Genome, to submit BLAST requests for the relevant database. So changing the URL will not work (even by adding extra parameters); you only get the original HTML web page. You could try changing the database or limiting the search using an Entrez term (which you should be able to include in the request, probably by adding it to the HEADER). Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of vrramnar at student.cs.uwaterloo.ca > Sent: Thursday, July 13, 2006 5:39 PM > To: bioperl-l at lists.open-bio.org > Subject: [Bioperl-l] Remote Blast - Blast Human Genome > > > Hello Again, > > I have another question regarding Remote blast but this time using Genome > Blast. > > Here is the link: > > http://www.ncbi.nlm.nih.gov/genome/seq/BlastGen/BlastGen.cgi?taxid=9606 > > which again uses the main Blast web site: > > http://www.ncbi.nlm.nih.gov/BLAST/Blast.cgi > > Again I am not sure what to add or what HEADER information to change > within my > script. 
> > Here is my program, which was the same as the last email: > > #!/usr/bin/perl -w > > use Bio::Perl; > use Bio::Tools::Run::RemoteBlast; > > my $prog = "blastn"; > my $db = "refseq_genomic"; > my $e_val = 0.01; > > my @params = ( '-prog' => $prog, > '-data' => $db, > '-expect' => $e_val); > > my $factory = new Bio::Tools::Run::RemoteBlast->new(@params); > $Bio::Tools::Run::RemoteBlast::HEADER{'WWW_BLAST_TYPE'} = '????'; <----- > what > do I put here > #$Bio::Tools::Run::RemoteBlast::HEADER{'?????'} = '????'; <--- Do I need > to add > any other values to the form inputs > > $factory->submit_blast("blast.in"); > $v = 1; > > while (my @rids = $factory->each_rid) > { foreach my $rid ( @rids ) > { my $rc = $factory->retrieve_blast($rid); > if( !ref($rc) ) > { if( $rc < 0 ) > { $factory->remove_rid($rid); > } > print STDERR "." if ( $v > 0 ); > sleep 5; > } > else > { my $result = $rc->next_result(); > my $filename = $result->query_name()."\.out"; > $factory->save_output($filename); > $factory->remove_rid($rid); > print "\nQuery Name: ", $result->query_name(), "\n"; > } > } > } > > > Both of my questions are very similiar as in I know how to use remote > blast but > not sure what to change to access the specific blast I want. > > Again, any help would be very appreciated!! > > Rohan > > > > ---------------------------------------- > This mail sent through www.mywaterloo.ca > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From smart_bioit at yahoo.com Fri Jul 14 13:25:51 2006 From: smart_bioit at yahoo.com (raj sharma) Date: Fri, 14 Jul 2006 10:25:51 -0700 (PDT) Subject: [Bioperl-l] advice Message-ID: <20060714172551.60978.qmail@web37309.mail.mud.yahoo.com> i have one problem in perl i want to make one program which whn run online can download required data from data bank to local server frm where i shld start --------------------------------- Yahoo! 
Music Unlimited - Access over 1 million songs.Try it free. From charlesh at stedwards.edu Sat Jul 15 15:29:46 2006 From: charlesh at stedwards.edu (Charles Hauser) Date: Sat, 15 Jul 2006 14:29:46 -0500 Subject: [Bioperl-l] Finding locations of a string within a fasta file Message-ID: <040D0FF5-0BD1-49DA-B66C-32E359F254BA@stedwards.edu> All, I'm trying to determine where (the start .. end positions) within a genomic scaffold sequence gaps occur. The gaps are denoted as runs of N's. Suggestions on how to easily retrieve this would be appreciated. ch From cjfields at uiuc.edu Sat Jul 15 17:22:15 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Sat, 15 Jul 2006 16:22:15 -0500 Subject: [Bioperl-l] Finding locations of a string within a fasta file In-Reply-To: <040D0FF5-0BD1-49DA-B66C-32E359F254BA@stedwards.edu> Message-ID: <000001c6a854$bee47400$15327e82@pyrimidine> You can retrieve the original GenBank CONTIG file using Bio::DB::GenBank if the format is set to 'gb' (it is now set to 'gbwithparts' by default. The CONTIG lines are currently stored in a series of Bio::Annotation::SimpleValue objects; get the accessions using the following script. 

use strict;
use warnings;

use Bio::DB::GenBank;
use Bio::SeqIO;    # needed for the Bio::SeqIO->new call below

my $factory = Bio::DB::GenBank->new(-format => 'gb');

my $seq = $factory->get_Seq_by_id(shift);

my $seqout = Bio::SeqIO->new(-fh     => \*STDOUT,
                             -format => 'genbank');

# greps only annotations with CONTIG tagname, joins all together
my $contig = join '', grep {$_->tagname eq 'CONTIG'} $seq->get_Annotations();

# split each region, getting rid of gaps and join(), then split into acc/span
for (grep {$_ !~ m{gap|join}} split ',', $contig) {
    my ($acc, $span) = split ':', $_;
    $span =~ s{\)}{}g;    # spurious ')'
    print "ACC: $acc\n\tSpan:$span\n";
}

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Charles Hauser
> Sent: Saturday, July 15, 2006 2:30 PM
> To: bioperl-l at lists.open-bio.org
> Subject: [Bioperl-l] Finding locations of a string within a fasta file
>
> All,
>
> I'm trying to determine where (the start .. end positions) within a genomic scaffold sequence gaps occur.
> The gaps are denoted as runs of N's.
>
> Suggestions on how to easily retrieve this would be appreciated.
>
> ch

From sudhaneti at yahoo.com Sat Jul 15 15:26:01 2006
From: sudhaneti at yahoo.com (Sudha Gunturu)
Date: Sat, 15 Jul 2006 12:26:01 -0700 (PDT)
Subject: [Bioperl-l] BLOSUM matrix
Message-ID: <20060715192601.36517.qmail@web53315.mail.yahoo.com>

Being a beginner at Perl programming, I was wondering if anyone can help me with an implementation of the BLOSUM 65 matrix for the following alignments, or in general. Any input or websites to help with this are appreciated.

AILCAA
ALLLAA
ILIICL

Thanks
Sudha
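
A substitution matrix such as BLOSUM is normally used as the scoring table inside an alignment algorithm. The following is only a minimal sketch of a global (Needleman-Wunsch style) alignment in plain Perl, under stated assumptions: the names %sub_score, $GAP and global_align are made up for illustration, the scores cover only the residues A, I, L and C and are placeholder values rather than entries copied from a BLOSUM 65 table, and a simple linear gap penalty is assumed.

```perl
use strict;
use warnings;

# Substitution scores as a hash of hashes: $sub_score{X}{Y}.
# PLACEHOLDER values for illustration only, not taken from a BLOSUM 65 table.
# Residues missing from this table would produce "uninitialized" warnings.
my %sub_score = (
    A => { A =>  4, I => -1, L => -1, C =>  0 },
    I => { A => -1, I =>  4, L =>  2, C => -1 },
    L => { A => -1, I =>  2, L =>  4, C => -1 },
    C => { A =>  0, I => -1, L => -1, C =>  9 },
);
my $GAP = -4;    # linear gap penalty (an assumption)

# Globally align two strings; returns (score, aligned_first, aligned_second).
sub global_align {
    my ( $s, $t ) = @_;
    my @s = split //, $s;
    my @t = split //, $t;
    my ( $n, $m ) = ( scalar @s, scalar @t );

    # Fill the dynamic programming table F, where F[i][j] is the best score
    # for aligning the first i residues of $s with the first j of $t.
    my @F;
    $F[$_][0] = $_ * $GAP for 0 .. $n;
    $F[0][$_] = $_ * $GAP for 0 .. $m;
    for my $i ( 1 .. $n ) {
        for my $j ( 1 .. $m ) {
            my $best = $F[ $i - 1 ][ $j - 1 ] + $sub_score{ $s[ $i - 1 ] }{ $t[ $j - 1 ] };
            my $up   = $F[ $i - 1 ][$j] + $GAP;
            my $left = $F[$i][ $j - 1 ] + $GAP;
            $best = $up   if $up > $best;
            $best = $left if $left > $best;
            $F[$i][$j] = $best;
        }
    }

    # Trace back from the bottom-right corner to recover one best alignment.
    my ( $i, $j ) = ( $n, $m );
    my ( $a1, $a2 ) = ( '', '' );
    while ( $i > 0 || $j > 0 ) {
        if (   $i > 0
            && $j > 0
            && $F[$i][$j] == $F[ $i - 1 ][ $j - 1 ] + $sub_score{ $s[ $i - 1 ] }{ $t[ $j - 1 ] } )
        {
            $a1 = $s[ $i - 1 ] . $a1;
            $a2 = $t[ $j - 1 ] . $a2;
            $i--;
            $j--;
        }
        elsif ( $i > 0 && $F[$i][$j] == $F[ $i - 1 ][$j] + $GAP ) {
            $a1 = $s[ $i - 1 ] . $a1;    # residue in $s against a gap
            $a2 = '-' . $a2;
            $i--;
        }
        else {
            $a1 = '-' . $a1;             # gap against a residue in $t
            $a2 = $t[ $j - 1 ] . $a2;
            $j--;
        }
    }
    return ( $F[$n][$m], $a1, $a2 );
}

my ( $total, $top, $bottom ) = global_align( 'AILCAA', 'ALLLAA' );
print "score $total\n$top\n$bottom\n";
```

A hash of hashes is used here instead of a plain array so that scores can be looked up directly by residue letter. For real work the full 20x20 table would be read from a published matrix file rather than typed in by hand.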
From charlesh at stedwards.edu Sun Jul 16 19:32:38 2006 From: charlesh at stedwards.edu (Charles Hauser) Date: Sun, 16 Jul 2006 18:32:38 -0500 Subject: [Bioperl-l] Finding locations of a string within a fasta file In-Reply-To: <000001c6a854$bee47400$15327e82@pyrimidine> References: <000001c6a854$bee47400$15327e82@pyrimidine> Message-ID: Hi Chris, Thanks for the info. Unfortunately, I was not clear that the sequence is unannotated, i.e. there is no GenBank record. I need to extract the locations of the gaps from a raw fasta file. ch On Jul 15, 2006, at 4:22 PM, Chris Fields wrote: > You can retrieve the original GenBank CONTIG file using > Bio::DB::GenBank if > the format is set to 'gb' (it is now set to 'gbwithparts' by > default. The > CONTIG lines are currently stored in a series of > Bio::Annotation::SimpleValue objects; get the accessions using the > following > script. > > use strict; > use warnings; > > use Bio::DB::GenBank; > > my $factory = Bio::DB::GenBank->new(-format => 'gb'); > > my $seq = $factory->get_Seq_by_id(shift); > > my $seqout = Bio::SeqIO->new(-fh => \*STDOUT, > -format => 'genbank'); > > # greps only annotations with CONTIG tagname, joins all together > my $contig = join '', grep {$_->tagname eq 'CONTIG'} > $seq->get_Annotations(); > > # split each region, getting rid of gaps and join(), then split into > acc/span > for (grep {$_ !~ m{gap|join}} > split ',', $contig) { > my ($acc, $span) = split ':', $_; > $span =~ s{\)}{}g; # spurious ')' > print "ACC: $acc\n\tSpan:$span\n"; > } > > > >> -----Original Message----- >> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- >> bounces at lists.open-bio.org] On Behalf Of Charles Hauser >> Sent: Saturday, July 15, 2006 2:30 PM >> To: bioperl-l at lists.open-bio.org >> Subject: [Bioperl-l] Finding locations of a string within a fasta >> file >> >> All, >> >> I'm trying to determine where (the start .. end positions) within a >> genomic scaffold sequence gaps occur. 
>> The gaps are denoted as runs of N's. >> >> Suggestions on how to easily retrieve this would be appreciated. >> >> ch >> >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > From torsten.seemann at infotech.monash.edu.au Sun Jul 16 22:23:51 2006 From: torsten.seemann at infotech.monash.edu.au (Torsten Seemann) Date: Mon, 17 Jul 2006 12:23:51 +1000 Subject: [Bioperl-l] advice In-Reply-To: <20060714172551.60978.qmail@web37309.mail.mud.yahoo.com> References: <20060714172551.60978.qmail@web37309.mail.mud.yahoo.com> Message-ID: <44BAF4B7.8090508@infotech.monash.edu.au> raj sharma wrote: > i have one problem in perl is this Bio::Perl related? > i want to make one program which whn run online do you mean runs on a web server as a CGI script, or access on-line data? > can download required data from data bank to local server which databank - genbank or ... ? > frm where i shld start http://www.oreilly.com/catalog/lperl3/ -- Dr Torsten Seemann http://www.vicbioinformatics.com Victorian Bioinformatics Consortium, Monash University, Australia From torsten.seemann at infotech.monash.edu.au Sun Jul 16 22:21:31 2006 From: torsten.seemann at infotech.monash.edu.au (Torsten Seemann) Date: Mon, 17 Jul 2006 12:21:31 +1000 Subject: [Bioperl-l] Finding locations of a string within a fasta file In-Reply-To: <040D0FF5-0BD1-49DA-B66C-32E359F254BA@stedwards.edu> References: <040D0FF5-0BD1-49DA-B66C-32E359F254BA@stedwards.edu> Message-ID: <44BAF42B.8080102@infotech.monash.edu.au> > I'm trying to determine where (the start .. end positions) within a > genomic scaffold sequence gaps occur. > The gaps are denoted as runs of N's. > Suggestions on how to easily retrieve this would be appreciated. First you need to get the sequence into a string within Perl. As your email Subject: says it is in the Fasta file, you need to 1. open the fasta file - see Bio::SeqIO 2. 
read first sequence (as an object) - see next_seq()
3. get the string of the sequence in the object - see seq()

Then you could just use the inbuilt Perl function index() to loop through all the occurrences of 'N' - type 'perldoc -f index' for help. Alternatively, use regexp matching, e.g. m/(N+)/g, and the pos() function.

--
Dr Torsten Seemann http://www.vicbioinformatics.com
Victorian Bioinformatics Consortium, Monash University, Australia

From sudhaneti at yahoo.com Sun Jul 16 22:33:20 2006
From: sudhaneti at yahoo.com (Sudha Gunturu)
Date: Sun, 16 Jul 2006 19:33:20 -0700 (PDT)
Subject: [Bioperl-l] BLOSUM matrix
In-Reply-To: <44BAF316.9020301@infotech.monash.edu.au>
Message-ID: <20060717023320.6402.qmail@web53313.mail.yahoo.com>

Sorry for not being clear with my question. Let me try to explain. I want to implement dynamic programming using BLOSUM as the scoring matrix.

1. I want to know how to define the values of BLOSUM in an array.
2. What functions are suitable for global alignment of two sequences, etc.

Being a beginner programmer, I want some direction, books, and good websites which can help me in achieving the implementation. It would be great if someone can walk me through this.

Thanks
Sudha

Torsten Seemann wrote:
Sudha,

> Being a beginner perl programming, was wondering if anyone can help me with implementation of BLOSUM 65 matrix for the following alignments or in general. Any inputs, websites to help with this are appreciated.
> AILCAA
> ALLLAA
> ILIICL

The BLOSUM65 matrix does not define a method for alignment, it just provides some parameters. Perhaps you should read this first:

http://en.wikipedia.org/wiki/Sequence_alignment

--
Dr Torsten Seemann http://www.vicbioinformatics.com
Victorian Bioinformatics Consortium, Monash University, Australia
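
For the separate gap-location thread ("Finding locations of a string within a fasta file"), the regexp approach Torsten describes, m/(N+)/g together with pos(), might look roughly like the sketch below. The function name find_n_runs and the tab-separated output are assumptions made for illustration; Bio::SeqIO is only loaded when a FASTA file name is given on the command line. Coordinates are reported 1-based and inclusive.

```perl
use strict;
use warnings;

# Return a list of [start, end] pairs (1-based, inclusive), one for every
# run of N (upper or lower case) in the given sequence string.
sub find_n_runs {
    my ($seq) = @_;
    my @runs;
    while ( $seq =~ /(N+)/gi ) {
        # pos() is the 0-based offset just past the match, which equals
        # the 1-based position of the last N in the run.
        my $end   = pos($seq);
        my $start = $end - length($1) + 1;
        push @runs, [ $start, $end ];
    }
    return @runs;
}

# Only touch BioPerl when a FASTA file is supplied as the first argument.
if (@ARGV) {
    require Bio::SeqIO;
    my $in = Bio::SeqIO->new( -file => $ARGV[0], -format => 'fasta' );
    while ( my $seq_obj = $in->next_seq ) {
        for my $run ( find_n_runs( $seq_obj->seq ) ) {
            printf "%s\t%d\t%d\n", $seq_obj->display_id, @$run;
        }
    }
}
```

Run as, for example, `perl find_gaps.pl scaffolds.fa` (both file names are hypothetical); each output line gives a sequence id followed by the start and end of one gap.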
From torsten.seemann at infotech.monash.edu.au Sun Jul 16 22:16:54 2006 From: torsten.seemann at infotech.monash.edu.au (Torsten Seemann) Date: Mon, 17 Jul 2006 12:16:54 +1000 Subject: [Bioperl-l] BLOSUM matrix In-Reply-To: <20060715192601.36517.qmail@web53315.mail.yahoo.com> References: <20060715192601.36517.qmail@web53315.mail.yahoo.com> Message-ID: <44BAF316.9020301@infotech.monash.edu.au> Sudha, > Being a beginner perl programming, was wondering if anyone can help me with implementation of BLOSUM 65 matrix for the following alignments or in general. Any inputs, websites to help with this are appreciated. > AILCAA > ALLLAA > ILIICL The BLOSUM65 matrix does not define a method for alignment, it just provides some parameters. Perhaps you should read this first: http://en.wikipedia.org/wiki/Sequence_alignment -- Dr Torsten Seemann http://www.vicbioinformatics.com Victorian Bioinformatics Consortium, Monash University, Australia From smart_bioit at yahoo.com Mon Jul 17 00:21:41 2006 From: smart_bioit at yahoo.com (raj sharma) Date: Sun, 16 Jul 2006 21:21:41 -0700 (PDT) Subject: [Bioperl-l] advice In-Reply-To: <44BAF4B7.8090508@infotech.monash.edu.au> Message-ID: <20060717042141.9069.qmail@web37307.mail.mud.yahoo.com> hi trston well here i will make u clear my problem i want to make one data base of marine species u can say this as mirror of data so at present whn i click there on line data base of ncbi gets open so i want to dowload data of marine species (ny one) nd whn ever i click on tht link local data which i have downloaded shld open nd data shld also b updated online after some time waiting for ur reply --------------------------------- Do you Yahoo!? Next-gen email? Have it all with the all-new Yahoo! Mail Beta. 
From cjfields at uiuc.edu Mon Jul 17 00:51:20 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Sun, 16 Jul 2006 23:51:20 -0500 Subject: [Bioperl-l] BLOSUM matrix In-Reply-To: <20060717023320.6402.qmail@web53313.mail.yahoo.com> References: <20060717023320.6402.qmail@web53313.mail.yahoo.com> Message-ID: Hmm, beginner programmer, wants to learn perl? Here are some directions: http://learn.perl.org/ Start with Schwartz's latest incarnation of Learning Perl, then work your way up to Intermediate Perl (I think Mastering Perl is on the horizon...) For some pointers using Perl and bioinformatics, pick up Tisdall's books Beginning/Mastering Perl for Bioinformatics. This is really a list for bioperl, not perl and bioinformatics (thought the two cross here all the time!). We normally don't mind answering questions but we typically don't do people's homework unless we're unusually bored. And we can be excessively cranky when someone repeatedly posts requests for something that shouldn't take much reading and Googling to find out. Again, we're not into that homework gig, i.e. 'walking you through it' is tantamount to 'doing it for you.' 1) Arrays and how to use them are in Learning Perl; there are probably better ways to do this than an array, though... 2) Use Torsten's link to get you started. Chris On Jul 16, 2006, at 9:33 PM, Sudha Gunturu wrote: > Sorry for not being clear with my question. Let me try to > explain. I want to Implement dynamic programing using Blosum as > scoring matrix. > > 1. I want to know how to define the values of Blosum in an array. > 2. What functions are suitable for global alignment of two > sequences. Etc., > > Being a beginer programer want some direction, books, and good > websites which can help me in achieving the implementation. It > would be great if someone can walk me through this. 
> > Thanks > Sudha > > Torsten Seemann wrote: > Sudha, > >> Being a beginner perl programming, was wondering if anyone can >> help me with implementation of BLOSUM 65 matrix for the following >> alignments or in > general. Any inputs, websites to help with this are appreciated. >> AILCAA >> ALLLAA >> ILIICL > > The BLOSUM65 matrix does not define a method for alignment, it just > provides some parameters. Perhaps you should read this first: > > http://en.wikipedia.org/wiki/Sequence_alignment > > -- > Dr Torsten Seemann http://www.vicbioinformatics.com > Victorian Bioinformatics Consortium, Monash University, Australia > > > > > --------------------------------- > Do you Yahoo!? > Everyone is raving about the all-new Yahoo! Mail Beta. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From cjfields at uiuc.edu Mon Jul 17 01:01:53 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 17 Jul 2006 00:01:53 -0500 Subject: [Bioperl-l] advice In-Reply-To: <20060717042141.9069.qmail@web37307.mail.mud.yahoo.com> References: <20060717042141.9069.qmail@web37307.mail.mud.yahoo.com> Message-ID: <82C51420-A18B-4DEA-A519-CE1D7B9C7B10@uiuc.edu> This is a Bioperl list. If you don't have a Bioperl-related question, you will very likely get testy replies. I don't believe that you quite understand Torsten's response, so I'll just copy-and-paste from a reply I just gave a second ago to save myself the typing: Hmm, beginner programmer, wants to learn perl? Here are some directions: http://learn.perl.org/ Start with Schwartz's latest incarnation of Learning Perl, then work your way up to Intermediate Perl (I think Mastering Perl is on the horizon...) 
For some pointers using Perl and bioinformatics, pick up Tisdall's books Beginning/Mastering Perl for Bioinformatics. This is really a list for bioperl, not perl and bioinformatics (thought the two cross here all the time!). We normally don't mind answering questions but we typically don't do people's homework unless we're unusually bored. And we can be excessively cranky when someone repeatedly posts requests for something that shouldn't take much reading and Googling to find out. Again, we're not into that homework gig, i.e. 'walking you through it' is tantamount to 'doing it for you.' For your particular instance, you might want to brush up on web services, CGI, and a little web etiquette. http://catb.org/esr/faqs/smart-questions.html I think you may be waiting for a long time for a reply! Chris On Jul 16, 2006, at 11:21 PM, raj sharma wrote: > > hi trston > well here i will make u clear my problem > > i want to make one data base of marine species u can say this as > mirror of data > > > so at present whn i click there on line data base of ncbi gets open > > so i want to dowload data of marine species (ny one) > nd whn ever i click on tht link local data which i have > downloaded shld open > nd data shld also b updated online after some time > > waiting for ur reply > > > > > > > --------------------------------- > Do you Yahoo!? > Next-gen email? Have it all with the all-new Yahoo! Mail Beta. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. 
Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From bmoore at genetics.utah.edu Mon Jul 17 01:25:32 2006 From: bmoore at genetics.utah.edu (Barry Moore) Date: Sun, 16 Jul 2006 23:25:32 -0600 Subject: [Bioperl-l] advice In-Reply-To: <20060714172551.60978.qmail@web37309.mail.mud.yahoo.com> Message-ID: By reading this: http://catb.org/esr/faqs/smart-questions.html -----Original Message----- From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of raj sharma Sent: Friday, July 14, 2006 11:26 AM To: bioperl-l at lists.open-bio.org Subject: [Bioperl-l] advice i have one problem in perl i want to make one program which whn run online can download required data from data bank to local server frm where i shld start --------------------------------- Yahoo! Music Unlimited - Access over 1 million songs.Try it free. _______________________________________________ Bioperl-l mailing list Bioperl-l at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/bioperl-l From bmoore at genetics.utah.edu Mon Jul 17 01:34:58 2006 From: bmoore at genetics.utah.edu (Barry Moore) Date: Sun, 16 Jul 2006 23:34:58 -0600 Subject: [Bioperl-l] advice In-Reply-To: <20060717042141.9069.qmail@web37307.mail.mud.yahoo.com> Message-ID: If you're on a Unix-type system, look at wget --mirror and its variations. 
B -----Original Message----- From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of raj sharma Sent: Sunday, July 16, 2006 10:22 PM To: Torsten Seemann Subject: Re: [Bioperl-l] advice hi trston well here i will make u clear my problem i want to make one data base of marine species u can say this as mirror of data so at present whn i click there on line data base of ncbi gets open so i want to dowload data of marine species (ny one) nd whn ever i click on tht link local data which i have downloaded shld open nd data shld also b updated online after some time waiting for ur reply --------------------------------- Do you Yahoo!? Next-gen email? Have it all with the all-new Yahoo! Mail Beta. _______________________________________________ Bioperl-l mailing list Bioperl-l at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/bioperl-l From bix at sendu.me.uk Mon Jul 17 10:32:13 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Mon, 17 Jul 2006 15:32:13 +0100 Subject: [Bioperl-l] Bio::Map changes In-Reply-To: <44ACCCD5.3030309@sendu.me.uk> References: <44985915.8010607@sendu.me.uk> <449A9AF9.2000305@sendu.me.uk> <44ACCCD5.3030309@sendu.me.uk> Message-ID: <44BB9F6D.10005@sendu.me.uk> Sendu Bala wrote: > Sendu Bala wrote: >> The reimplementation will make Position central to the model, allowing >> for lots of other things to work properly without anything becoming >> inconsistent (as is currently the case). > > This is now done. It uses a new PositionHandler class behind the scenes. > > The next step is to introduce relative positioning across the board This is now done. It uses a new Relative class to describe what a given position is relative to. I also made Bio::Map::MapI an AnnotatableI and SimpleMap an implementor. I think this pretty much brings an end to my changes to Bio::Map. Unless anyone thinks the changes lack sanity, I think the API of the new things should be somewhat stable. 
> possibly in a way that makes OrderedPosition redundant or an implementer > of the system. I haven't yet touched the other kinds of Positions to update/remove them. Docs in general could probably do with an update/ improvement. I plan to do this 'soon'. From golharam at umdnj.edu Mon Jul 17 10:13:20 2006 From: golharam at umdnj.edu (Ryan Golhar) Date: Mon, 17 Jul 2006 10:13:20 -0400 Subject: [Bioperl-l] advice In-Reply-To: Message-ID: <004301c6a9ab$2896c8b0$2f01a8c0@GOLHARMOBILE1> I apologize that this is off-topic, but it is an interesting email. Notice the lack of vowels (whn, ny, nd, shld, b) however in other words, the vowels are clearly included. Am I getting old or is "internet spelling" starting to differ from "english spelling"? Or is it that the younger generation (not that I'm old...a mere 32 is not old), using shorthand for frequently used words? -----Original Message----- From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Barry Moore Sent: Monday, July 17, 2006 1:35 AM To: raj sharma Cc: bioperl-l Subject: Re: [Bioperl-l] advice If you're on a unix type system look at wget -mirror and it's variations. B -----Original Message----- From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of raj sharma Sent: Sunday, July 16, 2006 10:22 PM To: Torsten Seemann Subject: Re: [Bioperl-l] advice hi trston well here i will make u clear my problem i want to make one data base of marine species u can say this as mirror of data so at present whn i click there on line data base of ncbi gets open so i want to dowload data of marine species (ny one) nd whn ever i click on tht link local data which i have downloaded shld open nd data shld also b updated online after some time waiting for ur reply --------------------------------- Do you Yahoo!? Next-gen email? Have it all with the all-new Yahoo! Mail Beta. 
_______________________________________________ Bioperl-l mailing list Bioperl-l at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/bioperl-l _______________________________________________ Bioperl-l mailing list Bioperl-l at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/bioperl-l From arareko at campus.iztacala.unam.mx Mon Jul 17 11:31:09 2006 From: arareko at campus.iztacala.unam.mx (Mauricio Herrera Cuadra) Date: Mon, 17 Jul 2006 10:31:09 -0500 Subject: [Bioperl-l] advice In-Reply-To: <004301c6a9ab$2896c8b0$2f01a8c0@GOLHARMOBILE1> References: <004301c6a9ab$2896c8b0$2f01a8c0@GOLHARMOBILE1> Message-ID: <44BBAD3D.2040203@campus.iztacala.unam.mx> Maybe it's a new "obscure" perl6 syntax :) Ryan Golhar wrote: > I apologize that this is off-topic, but it is an interesting email. > Notice the lack of vowels (whn, ny, nd, shld, b) however in other > words, the vowels are clearly included. > > Am I getting old or is "internet spelling" starting to differ from > "english spelling"? Or is it that the younger generation (not that I'm > old...a mere 32 is not old), using shorthand for frequently used words? > > > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org > [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Barry Moore > Sent: Monday, July 17, 2006 1:35 AM > To: raj sharma > Cc: bioperl-l > Subject: Re: [Bioperl-l] advice > > > If you're on a unix type system look at wget -mirror and it's > variations. 
> > B > > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org > [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of raj sharma > Sent: Sunday, July 16, 2006 10:22 PM > To: Torsten Seemann > Subject: Re: [Bioperl-l] advice > > > hi trston > well here i will make u clear my problem > > i want to make one data base of marine species u can say this as > mirror of data > > > so at present whn i click there on line data base of ncbi gets open > > so i want to dowload data of marine species (ny one) > nd whn ever i click on tht link local data which i have downloaded > shld open > nd data shld also b updated online after some time > > waiting for ur reply > > > > > > > --------------------------------- > Do you Yahoo!? > Next-gen email? Have it all with the all-new Yahoo! Mail Beta. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- MAURICIO HERRERA CUADRA arareko at campus.iztacala.unam.mx Laboratorio de Genética Unidad de Morfofisiología y Función Facultad de Estudios Superiores Iztacala, UNAM From cjfields at uiuc.edu Mon Jul 17 12:09:27 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 17 Jul 2006 11:09:27 -0500 Subject: [Bioperl-l] advice In-Reply-To: <004301c6a9ab$2896c8b0$2f01a8c0@GOLHARMOBILE1> Message-ID: <000a01c6a9bb$6478ba90$15327e82@pyrimidine> Ha! I *almost* added something about that. I thought his vowel keys were broken for a bit, maybe from pounding the keyboard with extreme frustration! 
As an aside, doesn't Damian Conway say something about the non-use of vowels in 'Perl Best Practices?' I think it was in relation to variables, though... Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Ryan Golhar > Sent: Monday, July 17, 2006 9:13 AM > To: 'bioperl-l' > Subject: Re: [Bioperl-l] advice > > I apologize that this is off-topic, but it is an interesting email. > Notice the lack of vowels (whn, ny, nd, shld, b) however in other > words, the vowels are clearly included. > > Am I getting old or is "internet spelling" starting to differ from > "english spelling"? Or is it that the younger generation (not that I'm > old...a mere 32 is not old), using shorthand for frequently used words? > > > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org > [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Barry Moore > Sent: Monday, July 17, 2006 1:35 AM > To: raj sharma > Cc: bioperl-l > Subject: Re: [Bioperl-l] advice > > > If you're on a unix type system look at wget -mirror and it's > variations. > > B > > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org > [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of raj sharma > Sent: Sunday, July 16, 2006 10:22 PM > To: Torsten Seemann > Subject: Re: [Bioperl-l] advice > > > hi trston > well here i will make u clear my problem > > i want to make one data base of marine species u can say this as > mirror of data > > > so at present whn i click there on line data base of ncbi gets open > > so i want to dowload data of marine species (ny one) > nd whn ever i click on tht link local data which i have downloaded > shld open > nd data shld also b updated online after some time > > waiting for ur reply > > > > > > > --------------------------------- > Do you Yahoo!? > Next-gen email? Have it all with the all-new Yahoo! Mail Beta. 
> _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From bix at sendu.me.uk Mon Jul 17 12:31:37 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Mon, 17 Jul 2006 17:31:37 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes Message-ID: <44BBBB69.6000906@sendu.me.uk> I see strange node names via Bio::DB::Taxonomy::flatfile: use Bio::DB::Taxonomy; my $db = new Bio::DB::Taxonomy(-source => 'flatfile', -directory => $taxonomy_dir, -nodesfile => $taxonomy_dir.'nodes.dmp', -namesfile => $taxonomy_dir.'names.dmp'); my $tax_id = 89593; my $node = $db->get_Taxonomy_Node($tax_id); print "node $tax_id has name '", @{$node->name('common')}, "' and rank '", $node->rank, "'\n"; Results in: node 89593 has name 'Craniata ' and rank 'subphylum' Other examples: node 2 has name 'Bacteria ' and rank 'superkingdom' node 1386 has name 'Bacillus ' and rank 'genus' node 7776 has name 'Gnathostomata ' and rank 'superclass' etc. For me the bits in <> are inappropriate and shouldn't be there. The NCBI website agrees, and you won't see these things if you use -source => 'entrez'. Should they be removed by the flatfile parser as a matter of course, with no warnings or option? Or do people want them? Typically they are just the name of the parent node, so I don't see why anyone would /need/ them, and I argue it's invalid for parent node information to be duplicated here. If there are no objections I'll strip the <> bits. 
I also plan to make $node->name('scientific', 'sapiens'); set and get the node name, and have flatfile and entrez store all common names with $obj->name('common', 'human', 'man');. As these changes will make the implementation match the docs I don't see any problems, except that flatfile users will now find the node name in a different place (@{$node->name('scientific')} instead of @{$node->name('common')}). I'll also fix the problem with node names for ranks species and lower, as discussed in thread 'Bio::DB::Taxonomy:: mishandles species, subspecies/variant names', in the way I suggested there. If anyone can see a problem with any of these changes, let me know asap. From hlapp at gmx.net Mon Jul 17 13:53:17 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Mon, 17 Jul 2006 13:53:17 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BBBB69.6000906@sendu.me.uk> References: <44BBBB69.6000906@sendu.me.uk> Message-ID: <8C935CDE-BEFF-48BC-8CC8-0C3FFEF2EF17@gmx.net> Sound good to me. BTW NCBI guarantees (well, promises) that there will only be one node name of class 'scientific'. -hilmar On Jul 17, 2006, at 12:31 PM, Sendu Bala wrote: > I see strange node names via Bio::DB::Taxonomy::flatfile: > > use Bio::DB::Taxonomy; > > my $db = new Bio::DB::Taxonomy(-source => 'flatfile', -directory => > $taxonomy_dir, -nodesfile => $taxonomy_dir.'nodes.dmp', -namesfile => > $taxonomy_dir.'names.dmp'); > > my $tax_id = 89593; > my $node = $db->get_Taxonomy_Node($tax_id); > > print "node $tax_id has name '", @{$node->name('common')}, "' and rank > '", $node->rank, "'\n"; > > Results in: > node 89593 has name 'Craniata ' and rank 'subphylum' > > Other examples: > node 2 has name 'Bacteria ' and rank 'superkingdom' > node 1386 has name 'Bacillus ' and rank 'genus' > node 7776 has name 'Gnathostomata ' and rank 'superclass' > etc. > > For me the bits in <> are inappropriate and shouldn't be there. 
The > NCBI > website agrees, and you won't see these things if you use -source => > 'entrez'. Should they be removed by the flatfile parser as a matter of > course, with no warnings or option? Or do people want them? Typically > they are just the name of the parent node, so I don't see why anyone > would /need/ them, and I argue it's invalid for parent node > information > to be duplicated here. > > If there are no objections I'll strip the <> bits. I also plan to make > $node->name('scientific', 'sapiens'); set and get the node name, and > have flatfile and entrez store all common names with > $obj->name('common', 'human', 'man');. As these changes will make the > implementation match the docs I don't see any problems, except that > flatfile users will now find the node name in a different place > (@{$node->name('scientific')} instead of @{$node->name('common')}). > > I'll also fix the problem with node names for ranks species and lower, > as discussed in thread 'Bio::DB::Taxonomy:: mishandles species, > subspecies/variant names', in the way I suggested there. > > If anyone can see a problem with any of these changes, let me know > asap. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From cjfields at uiuc.edu Mon Jul 17 14:31:08 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 17 Jul 2006 13:31:08 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <8C935CDE-BEFF-48BC-8CC8-0C3FFEF2EF17@gmx.net> Message-ID: <001d01c6a9cf$2cf50f60$15327e82@pyrimidine> I agree. Would be nice to get this to play well with weird bacterial names! 
I plan on doing some behind-the-scenes work on Bio::DB::Taxonomy::entrez at some point soon to test out Bio::DB::EUtilities as the user agent; it currently uses Bio::Root::HTTPget, I think. Reason I'm doing this is to quickly get tax info based on any primary ID, primarily for grabbing related Tax information from the sequence GI w/o parsing the sequence for the TaxID; this uses NCBI's ELink which I've now implemented. I'll make sure everything passes tests before I commit. Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Hilmar Lapp > Sent: Monday, July 17, 2006 12:53 PM > To: Sendu Bala > Cc: bioperl-l at lists.open-bio.org > Subject: Re: [Bioperl-l] Bio::*Taxonomy* changes > > Sound good to me. > > BTW NCBI guarantees (well, promises) that there will only be one node > name of class 'scientific'. > > -hilmar > > On Jul 17, 2006, at 12:31 PM, Sendu Bala wrote: > > > I see strange node names via Bio::DB::Taxonomy::flatfile: > > > > use Bio::DB::Taxonomy; > > > > my $db = new Bio::DB::Taxonomy(-source => 'flatfile', -directory => > > $taxonomy_dir, -nodesfile => $taxonomy_dir.'nodes.dmp', -namesfile => > > $taxonomy_dir.'names.dmp'); > > > > my $tax_id = 89593; > > my $node = $db->get_Taxonomy_Node($tax_id); > > > > print "node $tax_id has name '", @{$node->name('common')}, "' and rank > > '", $node->rank, "'\n"; > > > > Results in: > > node 89593 has name 'Craniata ' and rank 'subphylum' > > > > Other examples: > > node 2 has name 'Bacteria ' and rank 'superkingdom' > > node 1386 has name 'Bacillus ' and rank 'genus' > > node 7776 has name 'Gnathostomata ' and rank 'superclass' > > etc. > > > > For me the bits in <> are inappropriate and shouldn't be there. The > > NCBI > > website agrees, and you won't see these things if you use -source => > > 'entrez'. Should they be removed by the flatfile parser as a matter of > > course, with no warnings or option? 
Or do people want them? Typically > > they are just the name of the parent node, so I don't see why anyone > > would /need/ them, and I argue it's invalid for parent node > > information > > to be duplicated here. > > > > If there are no objections I'll strip the <> bits. I also plan to make > > $node->name('scientific', 'sapiens'); set and get the node name, and > > have flatfile and entrez store all common names with > > $obj->name('common', 'human', 'man');. As these changes will make the > > implementation match the docs I don't see any problems, except that > > flatfile users will now find the node name in a different place > > (@{$node->name('scientific')} instead of @{$node->name('common')}). > > > > I'll also fix the problem with node names for ranks species and lower, > > as discussed in thread 'Bio::DB::Taxonomy:: mishandles species, > > subspecies/variant names', in the way I suggested there. > > > > If anyone can see a problem with any of these changes, let me know > > asap. 
> > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From bix at sendu.me.uk Mon Jul 17 14:09:44 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Mon, 17 Jul 2006 19:09:44 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <8C935CDE-BEFF-48BC-8CC8-0C3FFEF2EF17@gmx.net> References: <44BBBB69.6000906@sendu.me.uk> <8C935CDE-BEFF-48BC-8CC8-0C3FFEF2EF17@gmx.net> Message-ID: <44BBD268.2060308@sendu.me.uk> Hilmar Lapp wrote: >> I also plan to make $node->name('scientific', 'sapiens'); set and >> get the node name, [...] users will now find the node name in [...] >> @{$node->name('scientific')} > > BTW NCBI guarantees (well, promises) that there will only be one node > name of class 'scientific'. Yes, which is why I feel the API for name() isn't ideal, but thought it would be best to play along. Would having a new scientific_name() method be better, which gets/sets a single value? Perhaps it could just be a more 'sane' shorthand to setting @{$node->name('scientific')} to a list with only the supplied name, and getting ${$node->name('scientific')}[0] ? 
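[Editorial sketch] The shorthand proposed in the message above could look roughly like the following. This is only an illustration under stated assumptions: Sketch::Node is a hypothetical stand-in class, not the real Bio::Taxonomy::Node, and its name() method merely imitates the "class of name, then list of names" behaviour described in the thread (returning an array reference). scientific_name() wraps it so that exactly one value is set or returned.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for a taxonomy node; NOT the real Bio::Taxonomy::Node.
package Sketch::Node;

sub new { return bless { names => {} }, shift }

# name($class)          returns an array ref of the names stored for that class.
# name($class, @names)  replaces the stored names for that class first.
sub name {
    my ($self, $class, @names) = @_;
    $self->{names}{$class} = [@names] if @names;
    return $self->{names}{$class} || [];
}

# Shorthand: there is only ever one scientific name, so get/set a plain scalar.
sub scientific_name {
    my ($self, $name) = @_;
    $self->name('scientific', $name) if defined $name;
    return $self->name('scientific')->[0];
}

package main;

my $node = Sketch::Node->new;
$node->scientific_name('sapiens');           # same effect as name('scientific', 'sapiens')
$node->name('common', 'human', 'man');

print $node->scientific_name, "\n";          # single value, no dereferencing needed
print scalar @{ $node->name('scientific') }, "\n";   # the list form still holds one entry
```

The design choice here is that the list-based name() stays the single source of truth, so code that already reads @{$node->name('scientific')} keeps working while new code can use the scalar accessor.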
From hlapp at gmx.net Mon Jul 17 15:31:55 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Mon, 17 Jul 2006 15:31:55 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BBD268.2060308@sendu.me.uk> References: <44BBBB69.6000906@sendu.me.uk> <8C935CDE-BEFF-48BC-8CC8-0C3FFEF2EF17@gmx.net> <44BBD268.2060308@sendu.me.uk> Message-ID: <5B62229C-BAB7-4320-BBAE-87A483B0EC15@gmx.net> Yes I think $node->scientific_name() as shorthand would be good to have. Same BTW for $node->common_names() (which would return an array). -hilmar On Jul 17, 2006, at 2:09 PM, Sendu Bala wrote: > Hilmar Lapp wrote: >>> I also plan to make $node->name('scientific', 'sapiens'); set and >>> get the node name, [...] users will now find the node name in [...] >>> @{$node->name('scientific')} >> >> BTW NCBI guarantees (well, promises) that there will only be one node >> name of class 'scientific'. > > Yes, which is why I feel the API for name() isn't ideal, but > thought it > would be best to play along. Would having a new scientific_name() > method > be better, which gets/sets a single value? Perhaps it could just be a > more 'sane' shorthand to setting @{$node->name('scientific')} to a > list > with only the supplied name, and getting ${$node->name > ('scientific')}[0] ? 
> _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From cjfields at uiuc.edu Mon Jul 17 16:44:18 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 17 Jul 2006 15:44:18 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <5B62229C-BAB7-4320-BBAE-87A483B0EC15@gmx.net> Message-ID: <000001c6a9e1$c6b51610$15327e82@pyrimidine> There was some interest in getting Bio::Species to delegate to Bio::Taxonomy::Node, so having scientific_name() would help quite a bit since the name used on the ORGANISM line is the scientific name (well, is supposed to be; famous last words). Don't know about SwissProt, EMBL, and others though... Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Hilmar Lapp > Sent: Monday, July 17, 2006 2:32 PM > To: Sendu Bala > Cc: bioperl-l at lists.open-bio.org > Subject: Re: [Bioperl-l] Bio::*Taxonomy* changes > > Yes I think $node->scientific_name() as shorthand would be good to > have. Same BTW for $node->common_names() (which would return an array). > > -hilmar > > On Jul 17, 2006, at 2:09 PM, Sendu Bala wrote: > > > Hilmar Lapp wrote: > >>> I also plan to make $node->name('scientific', 'sapiens'); set and > >>> get the node name, [...] users will now find the node name in [...] > >>> @{$node->name('scientific')} > >> > >> BTW NCBI guarantees (well, promises) that there will only be one node > >> name of class 'scientific'. > > > > Yes, which is why I feel the API for name() isn't ideal, but > > thought it > > would be best to play along. Would having a new scientific_name() > > method > > be better, which gets/sets a single value? 
Perhaps it could just be a > > more 'sane' shorthand to setting @{$node->name('scientific')} to a > > list > > with only the supplied name, and getting ${$node->name > > ('scientific')}[0] ? > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From vrramnar at student.cs.uwaterloo.ca Mon Jul 17 16:46:32 2006 From: vrramnar at student.cs.uwaterloo.ca (vrramnar at student.cs.uwaterloo.ca) Date: Mon, 17 Jul 2006 16:46:32 -0400 Subject: [Bioperl-l] Remote Blast - Blast Human Genome In-Reply-To: <000501c6a6e8$b9c24d20$15327e82@pyrimidine> References: <000501c6a6e8$b9c24d20$15327e82@pyrimidine> Message-ID: <1153169192.44bbf728056fd@www.nexusmail.uwaterloo.ca> Hi Chris, 1. I have tried changing the database to snp or dbSNP but neither works. It seems that depending on which type of blast you use(ie, Genome Blast, Blast SNP, normal blast such as blastn, etc...) you see a different listing of databases available for querys. Since you mention that the Blast page I see was generated by Genome, where could I go to see a complete listing of databases I can query?? Or if you knew off hand which database to search if I only wanted dbSNP hits? 2. You also mention, I can limit the search by using Entrez terms. Do you mean like: $Bio::Tools::Run::RemoteBlast::HEADER{'ENTREZ_QUERY'} = 'abc'; where 'abc' is the name of the subject with which you would only like to see result of. For example if you put it as 'Homo sapiens[Organism]' then only human sequences would be in hit lists. 
If this is what you mean, what would I change it to, to see only hits from dbSNP? Thanks for the ongoing help, Rohan Quoting Chris Fields : > I added a method to RemoteBlast in bioperl-live (CVS) if you want to play > with changing the URL. I have been thinking about doing this for a bit now > but I already see problems. > > Here's the issue: the BLAST page you see is NOT the NCBI BLAST page (note > the differences in the URL) but a user-friendly request page, generated on > the fly by Genome, to submit BLAST requests for the relevant database. So > changing the URL will not work (even by adding extra parameters); you only > get the original HTML web page. > > You could try changing the database or limiting the search using an Entrez > term (which you should be able to include in the request, probably by adding > it to the HEADER). > > Chris > > > -----Original Message----- > > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > > bounces at lists.open-bio.org] On Behalf Of vrramnar at student.cs.uwaterloo.ca > > Sent: Thursday, July 13, 2006 5:39 PM > > To: bioperl-l at lists.open-bio.org > > Subject: [Bioperl-l] Remote Blast - Blast Human Genome > > > > > > Hello Again, > > > > I have another question regarding Remote blast but this time using Genome > > Blast. > > > > Here is the link: > > > > http://www.ncbi.nlm.nih.gov/genome/seq/BlastGen/BlastGen.cgi?taxid=9606 > > > > which again uses the main Blast web site: > > > > http://www.ncbi.nlm.nih.gov/BLAST/Blast.cgi > > > > Again I am not sure what to add or what HEADER information to change > > within my > > script. 
> > > > Here is my program, which was the same as the last email: > > > > #!/usr/bin/perl -w > > > > use Bio::Perl; > > use Bio::Tools::Run::RemoteBlast; > > > > my $prog = "blastn"; > > my $db = "refseq_genomic"; > > my $e_val = 0.01; > > > > my @params = ( '-prog' => $prog, > > '-data' => $db, > > '-expect' => $e_val); > > > > my $factory = new Bio::Tools::Run::RemoteBlast->new(@params); > > $Bio::Tools::Run::RemoteBlast::HEADER{'WWW_BLAST_TYPE'} = '????'; <----- > > what > > do I put here > > #$Bio::Tools::Run::RemoteBlast::HEADER{'?????'} = '????'; <--- Do I need > > to add > > any other values to the form inputs > > > > $factory->submit_blast("blast.in"); > > $v = 1; > > > > while (my @rids = $factory->each_rid) > > { foreach my $rid ( @rids ) > > { my $rc = $factory->retrieve_blast($rid); > > if( !ref($rc) ) > > { if( $rc < 0 ) > > { $factory->remove_rid($rid); > > } > > print STDERR "." if ( $v > 0 ); > > sleep 5; > > } > > else > > { my $result = $rc->next_result(); > > my $filename = $result->query_name()."\.out"; > > $factory->save_output($filename); > > $factory->remove_rid($rid); > > print "\nQuery Name: ", $result->query_name(), "\n"; > > } > > } > > } > > > > > > Both of my questions are very similiar as in I know how to use remote > > blast but > > not sure what to change to access the specific blast I want. > > > > Again, any help would be very appreciated!! 
> > > > Rohan > > > > > > > > ---------------------------------------- > > This mail sent through www.mywaterloo.ca > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > ---------------------------------------- This mail sent through www.mywaterloo.ca From cjfields at uiuc.edu Mon Jul 17 17:25:54 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 17 Jul 2006 16:25:54 -0500 Subject: [Bioperl-l] Remote Blast - Blast Human Genome In-Reply-To: <1153169192.44bbf728056fd@www.nexusmail.uwaterloo.ca> Message-ID: <001001c6a9e7$962b56c0$15327e82@pyrimidine> Okay, I think I may know what's going on a little more now with NCBI's BLAST interface. Looks like any NCBI BLAST query must use the default URL and so must set up to proper GET/PUT commands to retrieve everything correctly. Here's the API description for it all: http://www.ncbi.nlm.nih.gov/BLAST/Doc/urlapi.html You could try setting the database to 'snp' or something along those lines instead of 'nr'; or you could see what the name of the database is when you use the web form and try setting it to that. According to this page, this should be possible: http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=helpsnpfaq.section.SearchdbSNP _test._Search_dbSNP_Using_B The Entrez Query limit was a recommendation for limiting your search to a set of sequences for human, for instance. I'll try looking into it a bit more but I'm pretty busy. If you find anything out you should probably post it here . Chris > Hi Chris, > > 1. I have tried changing the database to snp or dbSNP but neither works. > It > seems that depending on which type of blast you use(ie, Genome Blast, > Blast SNP, > normal blast such as blastn, etc...) you see a different listing of > databases > available for querys. 
Since you mention that the Blast page I see was > generated > by Genome, where could I go to see a complete listing of databases I can > query?? > Or if you knew off hand which database to search if I only wanted dbSNP > hits? > > 2. You also mention, I can limit the search by using Entrez terms. Do you > mean > like: > $Bio::Tools::Run::RemoteBlast::HEADER{'ENTREZ_QUERY'} = 'abc'; > where 'abc' is the name of the subject with which you would only like to > see > result of. For example if you put it as 'Homo sapiens[Organism]' then only > human > sequences would be in hit lists. > If this is what you mean, what would I change it to, to see only hits from > dbSNP? > > Thanks for the ongoing help, > > Rohan > > Quoting Chris Fields : > > > I added a method to RemoteBlast in bioperl-live (CVS) if you want to > play > > with changing the URL. I have been thinking about doing this for a bit > now > > but I already see problems. > > > > Here's the issue: the BLAST page you see is NOT the NCBI BLAST page > (note > > the differences in the URL) but a user-friendly request page, generated > on > > the fly by Genome, to submit BLAST requests for the relevant database. > So > > changing the URL will not work (even by adding extra parameters); you > only > > get the original HTML web page. > > > > You could try changing the database or limiting the search using an > Entrez > > term (which you should be able to include in the request, probably by > adding > > it to the HEADER). > > > > Chris > > > > > -----Original Message----- > > > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > > > bounces at lists.open-bio.org] On Behalf Of > vrramnar at student.cs.uwaterloo.ca > > > Sent: Thursday, July 13, 2006 5:39 PM > > > To: bioperl-l at lists.open-bio.org > > > Subject: [Bioperl-l] Remote Blast - Blast Human Genome > > > > > > > > > Hello Again, > > > > > > I have another question regarding Remote blast but this time using > Genome > > > Blast. 
> > >
> > > Here is the link:
> > >
> > > http://www.ncbi.nlm.nih.gov/genome/seq/BlastGen/BlastGen.cgi?taxid=9606
> > >
> > > which again uses the main Blast web site:
> > >
> > > http://www.ncbi.nlm.nih.gov/BLAST/Blast.cgi
> > >
> > > Again I am not sure what to add or what HEADER information to change
> > > within my script.
> > >
> > > Here is my program, which was the same as the last email:
> > >
> > > #!/usr/bin/perl -w
> > >
> > > use Bio::Perl;
> > > use Bio::Tools::Run::RemoteBlast;
> > >
> > > my $prog  = "blastn";
> > > my $db    = "refseq_genomic";
> > > my $e_val = 0.01;
> > >
> > > my @params = ( '-prog'   => $prog,
> > >                '-data'   => $db,
> > >                '-expect' => $e_val );
> > >
> > > my $factory = Bio::Tools::Run::RemoteBlast->new(@params);
> > >
> > > # <--- what do I put here?
> > > $Bio::Tools::Run::RemoteBlast::HEADER{'WWW_BLAST_TYPE'} = '????';
> > > # <--- Do I need to add any other values to the form inputs?
> > > #$Bio::Tools::Run::RemoteBlast::HEADER{'?????'} = '????';
> > >
> > > $factory->submit_blast("blast.in");
> > > my $v = 1;    # verbosity: print progress dots while waiting
> > >
> > > while ( my @rids = $factory->each_rid ) {
> > >     foreach my $rid (@rids) {
> > >         my $rc = $factory->retrieve_blast($rid);
> > >         if ( !ref($rc) ) {
> > >             # not ready yet, or failed (negative return code)
> > >             if ( $rc < 0 ) {
> > >                 $factory->remove_rid($rid);
> > >             }
> > >             print STDERR "." if ( $v > 0 );
> > >             sleep 5;
> > >         }
> > >         else {
> > >             my $result   = $rc->next_result();
> > >             my $filename = $result->query_name() . ".out";
> > >             $factory->save_output($filename);
> > >             $factory->remove_rid($rid);
> > >             print "\nQuery Name: ", $result->query_name(), "\n";
> > >         }
> > >     }
> > > }
> > >
> > > Both of my questions are very similar, in that I know how to use remote
> > > blast but am not sure what to change to access the specific blast I want.
> > >
> > > Again, any help would be very appreciated!!
> > > > > > Rohan > > > > > > > > > > > > ---------------------------------------- > > > This mail sent through www.mywaterloo.ca > > > _______________________________________________ > > > Bioperl-l mailing list > > > Bioperl-l at lists.open-bio.org > > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > > > > ---------------------------------------- > This mail sent through www.mywaterloo.ca From bix at sendu.me.uk Mon Jul 17 17:33:26 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Mon, 17 Jul 2006 22:33:26 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <000001c6a9e1$c6b51610$15327e82@pyrimidine> References: <000001c6a9e1$c6b51610$15327e82@pyrimidine> Message-ID: <44BC0226.1080605@sendu.me.uk> Chris Fields wrote: > There was some interest in getting Bio::Species to delegate to > Bio::Taxonomy::Node, so having scientific_name() would help quite a bit > since the name used on the ORGANISM line is the scientific name (well, is > supposed to be; famous last words). Can you clarify exactly what you mean here? Preferably with an example? ORGANISM line of which file format? The reason I ask is that I still feel we need to do parsing of the names for species rank and lower: # The 'scientific name' for humans could be considered to be 'Homo sapiens'. # Taxid 9606 in the NCBI taxonomy database has rank 'species' and ScientificName 'Homo sapiens'. # For sanity, Bio::*Taxonomy* likes to interpret this ScientificName as 'sapiens' so that the genus is not held redundantly. It provides a binomial() method to give you 'Homo sapiens' again if you want it. # I plan on maintaining this; scientific_name() would give you the non-redundant sibling-unique name 'sapiens'. binomial() on a species rank and lower would give you 'Homo sapiens' (presumably grabbing the 'Homo' from the parent node with rank 'genus', or similar). Good, bad or ugly? 
I would prefer it works like this and we agree to differ with NCBI on what the 'scientific name' of a species node should be. Bio::Species can still delegate to Bio::Taxonomy::Node by calling binomial() (which I propose will actually give the correct answer, even for bacteria and viruses). Perhaps the short-hand (and the classifier used in name()) shouldn't mention the word 'scientific' to avoid confusion? But a) what else would we call it?, and b) for all ranks above species it /is/ the scientific name. From hlapp at gmx.net Mon Jul 17 19:47:24 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Mon, 17 Jul 2006 19:47:24 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BC0226.1080605@sendu.me.uk> References: <000001c6a9e1$c6b51610$15327e82@pyrimidine> <44BC0226.1080605@sendu.me.uk> Message-ID: <64AB2D48-B342-4FC4-BB4E-9927B24C972C@gmx.net> I don't think we should differ from NCBI in places where the connection between a method name and the NCBI data file is obvious or otherwise we will confuse people and send them into traps. $node->scientific_name() should simply report what NCBI reports. For simple species this will be identical to what $node->binomial() returns, but for others it may not, e.g., strains, varieties, etc or the weird world of viri and bacteria. This will also absolve us from retaining the business logic for how to construct the scientific name from genus, species, and possibly strain or whatever. binomial() isn't part of the NCBI taxonomy definition, so you have freedom there to report what suits you. -hilmar On Jul 17, 2006, at 5:33 PM, Sendu Bala wrote: > Chris Fields wrote: >> There was some interest in getting Bio::Species to delegate to >> Bio::Taxonomy::Node, so having scientific_name() would help quite >> a bit >> since the name used on the ORGANISM line is the scientific name >> (well, is >> supposed to be; famous last words). > > Can you clarify exactly what you mean here? Preferably with an > example? 
> ORGANISM line of which file format? > The reason I ask is that I still feel we need to do parsing of the > names > for species rank and lower: > > # The 'scientific name' for humans could be considered to be 'Homo > sapiens'. > # Taxid 9606 in the NCBI taxonomy database has rank 'species' and > ScientificName 'Homo sapiens'. > # For sanity, Bio::*Taxonomy* likes to interpret this > ScientificName as > 'sapiens' so that the genus is not held redundantly. It provides a > binomial() method to give you 'Homo sapiens' again if you want it. > # I plan on maintaining this; scientific_name() would give you the > non-redundant sibling-unique name 'sapiens'. binomial() on a species > rank and lower would give you 'Homo sapiens' (presumably grabbing the > 'Homo' from the parent node with rank 'genus', or similar). > > Good, bad or ugly? I would prefer it works like this and we agree to > differ with NCBI on what the 'scientific name' of a species node > should > be. Bio::Species can still delegate to Bio::Taxonomy::Node by calling > binomial() (which I propose will actually give the correct answer, > even > for bacteria and viruses). > > Perhaps the short-hand (and the classifier used in name()) shouldn't > mention the word 'scientific' to avoid confusion? But a) what else > would > we call it?, and b) for all ranks above species it /is/ the > scientific name. 
> _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From osborne1 at optonline.net Mon Jul 17 20:52:04 2006 From: osborne1 at optonline.net (Brian Osborne) Date: Mon, 17 Jul 2006 20:52:04 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BC0226.1080605@sendu.me.uk> Message-ID: Sendu, The string "sapiens" is not what a biology textbook would call a scientific name. You're going to have to respect decades of convention and have scientific_name() return the genus and species name. Brian O. On 7/17/06 5:33 PM, "Sendu Bala" wrote: > # I plan on maintaining this; scientific_name() would give you the > non-redundant sibling-unique name 'sapiens'. binomial() on a species > rank and lower would give you 'Homo sapiens' (presumably grabbing the > 'Homo' from the parent node with rank 'genus', or similar). From cjfields at uiuc.edu Mon Jul 17 21:36:12 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 17 Jul 2006 20:36:12 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BC0226.1080605@sendu.me.uk> References: <000001c6a9e1$c6b51610$15327e82@pyrimidine> <44BC0226.1080605@sendu.me.uk> Message-ID: <1345AB61-E7AB-447A-AB40-2170244404B2@uiuc.edu> On Jul 17, 2006, at 4:33 PM, Sendu Bala wrote: > Chris Fields wrote: >> There was some interest in getting Bio::Species to delegate to >> Bio::Taxonomy::Node, so having scientific_name() would help quite >> a bit >> since the name used on the ORGANISM line is the scientific name >> (well, is >> supposed to be; famous last words). > > Can you clarify exactly what you mean here? Preferably with an > example? > ORGANISM line of which file format? 
> The reason I ask is that I still feel we need to do parsing of the
> names for species rank and lower:

Sorry, should have clarified; GenBank sequence format. Here's the link:

http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html

The ORGANISM annotation line for a GenBank record contains the formal scientific name for the organism along with the lineage. I believe SwissProt/EMBL and several other RichSeq formats do the same. The lineage that is also present is almost always abbreviated, so it's not always possible to determine the formal rankings strictly from the file with any real degree of reliability (hence the past problems with Bio::Species).

> # The 'scientific name' for humans could be considered to be 'Homo
> sapiens'.
> # Taxid 9606 in the NCBI taxonomy database has rank 'species' and
> ScientificName 'Homo sapiens'.
> # For sanity, Bio::*Taxonomy* likes to interpret this ScientificName as
> 'sapiens' so that the genus is not held redundantly. It provides a
> binomial() method to give you 'Homo sapiens' again if you want it.
> # I plan on maintaining this; scientific_name() would give you the
> non-redundant sibling-unique name 'sapiens'. binomial() on a species
> rank and lower would give you 'Homo sapiens' (presumably grabbing the
> 'Homo' from the parent node with rank 'genus', or similar).

I think you should use scientific_name to designate the full formal scientific name for an organism according to the way NCBI describes it for that particular node (nothing more, except removing the <> stuff you mentioned earlier) and as it would appear for the ORGANISM line. Otherwise you'll run into serious species/subspecies/strain headaches (see below). If you want real genus/species (i.e. nothing extra, like strains or subspecies), separate them out and store them using a genus/species get/set if possible; binomial() will then give back the two-name genus-species designation.

Here are a couple of examples (this is XML, retrieved using EUtilities).
These were retrieved by NCBI TaxID, using Elink from a list of protein GI's (~700 of them total), so they represent the actual NCBI TaxID linked with the sequence file. If you try breaking these apart into species, what happens to the strain/subspecies stuff? Notice that many of these nodes, which come directly from protein GI's, also have no rank.

...
376686
Flavobacterium johnsoniae UW101
    Flavobacterium johnsoniae NBRC 14942
    Flavobacterium johnsoniae IFO 14942
    Flavobacterium johnsoniae IAM 14304
    Flavobacterium johnsoniae MYX.1.1.1
    Flavobacterium johnsoniae NCIB 11054
    Flavobacterium johnsoniae DSM 2064
    Flavobacterium johnsoniae LMG 1341
    Flavobacterium johnsoniae ATCC 17061
    Flavobacterium johnsoniae strain UW101
    Flavobacterium johnsoniae str. UW101
986
no rank
Bacteria
...
370552
Streptococcus pyogenes MGAS10270
    Streptococcus pyogenes strain MGAS10270
    Streptococcus pyogenes str. MGAS10270
301448
no rank
Bacteria
...
224308
Bacillus subtilis subsp. subtilis str. 168
    Bacillus subtilis subsp. subtilis 168
135461
no rank
Bacteria

> Good, bad or ugly? I would prefer it works like this and we agree to
> differ with NCBI on what the 'scientific name' of a species node should
> be. Bio::Species can still delegate to Bio::Taxonomy::Node by calling
> binomial() (which I propose will actually give the correct answer, even
> for bacteria and viruses).

This is where I would strongly disagree (though I agree that the way NCBI uses 'scientific name' is a bit off). We are using the NCBI tax database, and as such we are somewhat at the mercy of the NCBI tax nomenclature, unfortunately. If NCBI decides to change their official definition for the scientific name to something that makes a bit more sense, the XML and dump data will reflect that and we won't have many problems adapting since the scientific name will always conform to their definition. But if we split the information up ad hoc then we are bound for disaster; it's just way too much headache to worry about.
We could always point to the official NCBI definition as the one we adopt and then assign the tagged information from the node directly to scientific_name (no globbing together at all). Bio::Species could delegate likewise from the ORGANISM line, so there are no piecemeal attempts to get Humpty Dumpty to fit back together again.

You could go through and get the lineage from the XML/dump file data and try to sort the genus/species out, then paste it all back together (fingers crossed!), but I think it's more headache than it's worth to split these up, then hope that you can paste them back together again and always expect to get the same results.

Chris

> Perhaps the short-hand (and the classifier used in name()) shouldn't
> mention the word 'scientific' to avoid confusion? But a) what else would
> we call it?, and b) for all ranks above species it /is/ the
> scientific name.

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign

From cjfields at uiuc.edu Mon Jul 17 21:55:28 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Mon, 17 Jul 2006 20:55:28 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To:
References:
Message-ID: <07041A82-BE01-4998-AF26-35212CA1F0F2@uiuc.edu>

I agree with Hilmar's assessment, not b/c I disagree with your definition of scientific name or the reasoning Sendu proposes. I think we are somewhat bound to NCBI's nomenclature for their tax database. If we veer away from NCBI's definition for 'scientific name' it will just confuse users and lead to more trouble than it's worth, frankly. If we stick with it then any changes NCBI makes should be easier to deal with.
Leaving the scientific_name as NCBI designates it, though it probably disagrees with ~99% of the world's textbooks, may be the most maintainable solution. Now, binomial() on the other hand... Chris On Jul 17, 2006, at 7:52 PM, Brian Osborne wrote: > Sendu, > > The string "sapiens" is not what a biology textbook would call a > scientific > name. You're going to have to respect decades of convention and have > scientific_name() return the genus and species name. > > Brian O. > > > On 7/17/06 5:33 PM, "Sendu Bala" wrote: > >> # I plan on maintaining this; scientific_name() would give you the >> non-redundant sibling-unique name 'sapiens'. binomial() on a species >> rank and lower would give you 'Homo sapiens' (presumably grabbing the >> 'Homo' from the parent node with rank 'genus', or similar). > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From hlapp at gmx.net Mon Jul 17 22:06:01 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Mon, 17 Jul 2006 22:06:01 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <07041A82-BE01-4998-AF26-35212CA1F0F2@uiuc.edu> References: <07041A82-BE01-4998-AF26-35212CA1F0F2@uiuc.edu> Message-ID: On Jul 17, 2006, at 9:55 PM, Chris Fields wrote: > Leaving the scientific_name as NCBI designates it, though it probably > disagrees with ~99% of the world's textbooks, may be the most > maintainable solution. It doesn't disagree, it's quite like what the world's textbooks give you as a 'scientific name'. 
-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From cjfields at uiuc.edu Tue Jul 18 00:24:50 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Mon, 17 Jul 2006 23:24:50 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To:
References: <07041A82-BE01-4998-AF26-35212CA1F0F2@uiuc.edu>
Message-ID: <7BCA093B-90FB-4B0A-91FD-A6E0B34C96DD@uiuc.edu>

If you mean genus-species, then yes. But parent nodes? If you trust Wikipedia, the scientific name == binomial nomenclature. Which could mean no subspecies, strains, etc. if one were to be really strict about it, though that may be a grey area; I'm no taxonomist.

http://en.wikipedia.org/wiki/Scientific_name

The parent nodes shouldn't have a scientific name if one were to adhere strictly to the standard definition above, but NCBI refers to the names for the parent nodes as 'scientific name' (the XML element is still ScientificName, just like the child node). I'm not sure what the tax dump file is, though, so that may be different.

Here's the lineage for Taxid 312284 (marine actinobacterium PHSC20C1). I cut out the irrelevant bits and just show the lineage with all the parent nodes, taxID, and rank:

131567  cellular organisms                           no rank
2       Bacteria                                     superkingdom
201174  Actinobacteria                               phylum
1760    Actinobacteria (class)                       class
52018   unclassified Actinobacteria                  no rank
78537   unclassified Actinobacteria (miscellaneous)  no rank
....

Seems to me the easiest thing to do here, when looking at a particular node, is to use scientific_name() to hold that particular element for the node and have binomial represent the true 'scientific name', much as Sendu proposed.
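
To make the 'no rank' problem in that lineage concrete, here is a rough sketch in plain Perl. It is not BioPerl code: the data is just the lineage listed for TaxID 312284, and closest_ranked_parent() is a made-up helper name used only for illustration. It walks from the immediate parent towards the root and reports the first entry that carries a real rank.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Lineage for TaxID 312284 as listed in the thread: [taxid, name, rank],
# ordered from the root down to the immediate parent.
my @lineage = (
    [ 131567, 'cellular organisms',                          'no rank' ],
    [ 2,      'Bacteria',                                    'superkingdom' ],
    [ 201174, 'Actinobacteria',                              'phylum' ],
    [ 1760,   'Actinobacteria (class)',                      'class' ],
    [ 52018,  'unclassified Actinobacteria',                 'no rank' ],
    [ 78537,  'unclassified Actinobacteria (miscellaneous)', 'no rank' ],
);

# Hypothetical helper (not part of any BioPerl module): return the first
# entry, searching from the immediate parent upwards, whose rank is not
# 'no rank'. Returns undef if every entry is unranked.
sub closest_ranked_parent {
    my @entries = @_;
    for my $entry ( reverse @entries ) {
        my ( $taxid, $name, $rank ) = @$entry;
        return { taxid => $taxid, name => $name, rank => $rank }
            if $rank ne 'no rank';
    }
    return;
}

my $parent = closest_ranked_parent(@lineage);
printf "%s (%s, taxid %d)\n",
    $parent->{name}, $parent->{rank}, $parent->{taxid};
```

For this lineage the two nearest parents are unranked, so the first ranked one is the class node; there is no genus anywhere to build a binomial from.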
It would also make life much easier when parsing GenBank/SwissProt/EMBL (SeqIO) to have the data designating the formal scientific name (according to NCBI) be assigned to a scientific_name() get/set method in Bio::Species for later writing; then if we want to delegate this over to Bio::Taxonomy::Node from Bio::Species it would be that much easier. This would also get around some of the problems I have been seeing with bacterial names when passing GenBank data through SeqIO, since you wouldn't be required to glop the name together from the way Bio::Species tried to guess the lineage. Chris On Jul 17, 2006, at 9:06 PM, Hilmar Lapp wrote: > > On Jul 17, 2006, at 9:55 PM, Chris Fields wrote: > >> Leaving the scientific_name as NCBI designates it, though it probably >> disagrees with ~99% of the world's textbooks, may be the most >> maintainable solution. > > It doesn't disagree, it's quite like what the world's textbooks give > you as a 'scientific name'. > > -hilmar > > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. 
Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From bix at sendu.me.uk Tue Jul 18 03:27:49 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Tue, 18 Jul 2006 08:27:49 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <64AB2D48-B342-4FC4-BB4E-9927B24C972C@gmx.net> References: <000001c6a9e1$c6b51610$15327e82@pyrimidine> <44BC0226.1080605@sendu.me.uk> <64AB2D48-B342-4FC4-BB4E-9927B24C972C@gmx.net> Message-ID: <44BC8D75.1080806@sendu.me.uk> Hilmar Lapp wrote: > I don't think we should differ from NCBI in places where the > connection between a method name and the NCBI data file is obvious or > otherwise we will confuse people and send them into traps. > > $node->scientific_name() should simply report what NCBI reports. For > simple species this will be identical to what $node->binomial() > returns, but for others it may not, e.g., strains, varieties, etc or > the weird world of viri and bacteria. Ok, well this certainly seems to be consensus so I'll abide. > This will also absolve us from retaining the business logic for how > to construct the scientific name from genus, species, and possibly > strain or whatever. What about the existing genus(), species(), sub_species() and variant() methods? There would be no need for any logic to join things together, but I would still like to be able to get just 'sapiens' from somewhere. Can I use species() for that purpose (though again, species is strictly 'Homo sapiens')? Likewise sub_species() and variant() could hold the remaining non-redundant names. Or should all of these be deprecated because they don't really have a place in a generic Node class? What about node_name()? Yet another synonym of scientific_name? (right now it grabs the common name(s)). Ugh. What should I do with the classification array? Should it hold the raw ScientificName like: join(',', $node->classification) eq 'Homo sapiens, Homo, Homo/Pan/Gorilla group [...]'? 
Or should it be like: join(',', $node->classification) eq 'sapiens, Homo, Homo/Pan/Gorilla group [...]'? The latter is how it currently works (when it works correctly); I would rather fix it than lose the logic completely, but if we're staying true to proper classification (vs. what a programmer might expect), I guess I must use the raw ScientificName? > binomial() isn't part of the NCBI taxonomy definition, so you have > freedom there to report what suits you. I don't think binomial() would serve any useful purpose now, however. I can either deprecate it or make it a synonym of scientific_name() or both. Or binomial() can be a version of scientific_name() that complains if you use it on a rank higher or lower than species. As for species() et al., it may have no place in a generic Node class. Thoughts? From bix at sendu.me.uk Tue Jul 18 04:43:43 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Tue, 18 Jul 2006 09:43:43 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BBBB69.6000906@sendu.me.uk> References: <44BBBB69.6000906@sendu.me.uk> Message-ID: <44BC9F3F.2040500@sendu.me.uk> Sendu Bala wrote: [snip proposed changes to Bio::DB::Taxonomy::* and Bio::Taxonomy::Node] > If anyone can see a problem with any of these changes, let me know asap. I've just realised that there are currently no tests for Bio::DB::Taxonomy::flatfile, and that the ones for entrez get skipped. Node doesn't get an especially thorough work-out either (in the skipped section). I'm guessing it's not feasible to include the full taxdump from NCBI (~40MB) in t/data... do people think it would be reasonable to create some sort of small subset of the data? I could just pull out the lines from names.dmp and nodes.dmp relevant to a few example organisms. Say, for human and a tricky bacteria and virus? For the purposes of running the test, where should the index files be kept? In t/data with the .dmp files or in /tmp? Should the test script delete them afterwards, or leave them be? 
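
A rough sketch of how such a subset could be pulled out, in plain Perl with no BioPerl dependency. The assumptions are mine: the NCBI dump rows use "\t|\t" between fields and end with "\t|", only the first few columns are shown, and the sample rows are trimmed stand-ins (the parents above genus are collapsed to the root for brevity, so they are not real NCBI rows). parse_rows() and subset_for() are hypothetical helper names.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build toy dump text in the "field\t|\tfield\t|\n" layout.
sub make_dump {
    return join '', map { join( "\t|\t", @$_ ) . "\t|\n" } @_;
}

# nodes.dmp stand-in: tax_id, parent tax_id, rank (real files have more columns).
my $nodes_dmp = make_dump(
    [ 1,    1,    'no rank' ],
    [ 9605, 1,    'genus' ],          # real parent chain collapsed to root
    [ 9606, 9605, 'species' ],
    [ 2,    1,    'superkingdom' ],
);

# names.dmp stand-in: tax_id, name_txt, unique name, name class.
my $names_dmp = make_dump(
    [ 1,    'root',         '', 'scientific name' ],
    [ 9605, 'Homo',         '', 'scientific name' ],
    [ 9606, 'Homo sapiens', '', 'scientific name' ],
    [ 9606, 'human',        '', 'genbank common name' ],
    [ 2,    'Bacteria',     '', 'scientific name' ],
);

# Split dump text into rows of fields.
sub parse_rows {
    my ($text) = @_;
    my @rows;
    for my $line ( split /\n/, $text ) {
        $line =~ s/\t\|$//;                          # strip row terminator
        push @rows, [ split /\t\|\t/, $line, -1 ];   # keep empty fields
    }
    return @rows;
}

# Keep only the rows for the wanted taxids and all of their ancestors.
sub subset_for {
    my ( $nodes_text, $names_text, @wanted ) = @_;
    my %parent_of = map { $_->[0] => $_->[1] } parse_rows($nodes_text);

    my %keep;
    for my $taxid (@wanted) {
        # The root is its own parent, so the loop stops once it is marked.
        while ( defined $taxid && !$keep{$taxid} ) {
            $keep{$taxid} = 1;
            $taxid = $parent_of{$taxid};
        }
    }

    my $filter = sub {
        my ($text) = @_;
        return join '',
            grep { /^(\d+)\t/ && $keep{$1} }
            map  { "$_\n" } split /\n/, $text;
    };
    return ( $filter->($nodes_text), $filter->($names_text) );
}

my ( $mini_nodes, $mini_names ) = subset_for( $nodes_dmp, $names_dmp, 9606 );
print $mini_nodes, $mini_names;
```

Run against the real names.dmp and nodes.dmp with a handful of taxids, the same idea should give files small enough for t/data while keeping every parent the tests need to walk.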
The entrez tests are skipped to 'avoid blocking', but the test only makes 2 entrez queries with a sleep(3) in-between. Basically, I don't think there's ever any reason to skip. Shall I remove the skip? Lots of other database-accessing tests in the test suite just go right ahead and access their database, no problem. Cheers, Sendu. From torsten.seemann at infotech.monash.edu.au Mon Jul 17 23:53:02 2006 From: torsten.seemann at infotech.monash.edu.au (Torsten Seemann) Date: Tue, 18 Jul 2006 13:53:02 +1000 Subject: [Bioperl-l] advice In-Reply-To: <000a01c6a9bb$6478ba90$15327e82@pyrimidine> References: <000a01c6a9bb$6478ba90$15327e82@pyrimidine> Message-ID: <44BC5B1E.5080600@infotech.monash.edu.au> > Ha ! I *almost* added something about that. I thought his vowel keys were > broken for a bit, maybe from pounding the keyboard with extreme frustration! The wide variety of pronunciation of English around the world can be mostly blamed on those damned vowels... so perhaps removing them helps one to reach a wider audience :-) > As an aside, doesn't Damian Conway say something about the non-use of vowels > in 'Perl Best Practices?' I think it was in relation to variables, > though... Yeah, on page 46 he says NOT to remove vowels in variable names, use prefixes instead. Apprntly rmvng vwls mks t hrdr t ndrstnd stff. (Actually, I studied at Monash University under Damian Conway, and recall his ridiculing of Perl, so I found it kind of ironic that he ended up changing the Perl landscape so significantly! 
He even wrote an internal publication "theStyle - a guide to C programming style" in about 1990 in which he violates some of his later Perl Best Practices :-)

--
Dr Torsten Seemann http://www.vicbioinformatics.com
Victorian Bioinformatics Consortium, Monash University, Australia

From sharma.animesh at gmail.com Tue Jul 18 03:58:41 2006
From: sharma.animesh at gmail.com (Animesh Sharma)
Date: Tue, 18 Jul 2006 13:28:41 +0530
Subject: [Bioperl-l] PDB file parser (Separates chain-sequence and chain-structure)
Message-ID: <156674e60607180058r653fa8fesbc654508c9c19b5b@mail.gmail.com>

Hi Chris,

I have written a small script to separate the chains in a PDB file. It stores the sequence (fasta format) and structure (pdb format) in separate files, with a middle name according to the chain each contains. If the PDB file has only one chain, it creates a file with "default" as the middle name. E.g.,

perl pdb_chain_extract.pl 1HCO.pdb

will create 4 files with names:

1HCO.A.fas (Sequence of Chain A in fasta format)
1HCO.A.pdb (Structure of Chain A in pdb format)
1HCO.B.fas (Sequence of Chain B in fasta format)
1HCO.B.pdb (Structure of Chain B in pdb format)

I wrote it in the spirit of your example script given at

http://cvs.bioperl.org/cgi-bin/viewcvs/viewcvs.cgi/bioperl-live/examples/structure/structure-io.pl?rev=1.2&content-type=text/vnd.viewcvs-markup

Can this be included in the example scripts too?

Thanks and regards,
Animesh
--
______________________"The Answer Lies in Genome"______________________
http://fuzzylife.org/animesh/ +919868580004

-------------- next part --------------
A non-text attachment was scrubbed...
Name: pdb_chain_extract.pl
Type: application/octet-stream
Size: 2593 bytes
Desc: not available
URL:

From bix at sendu.me.uk Tue Jul 18 09:20:34 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Tue, 18 Jul 2006 14:20:34 +0100
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44BCAE08.8070307@ebi.ac.uk>
References: <44BCAE08.8070307@ebi.ac.uk>
Message-ID: <44BCE022.5000502@sendu.me.uk>

I thought I'd post this here in case anyone wants to discuss the points Nadeem brings up. As far as I can see it is acceptable to remove the <> bits so I still plan to do so.

Nadeem Faruque wrote: [off-list, posted here with permission]
> In case you didn't realise, odd node names such as 'Gnathostomata
> ' are created to uniquify some tax nodes that have identical
> scientific names, eg there are 8 entries for Rhodotorula.
>
> When we parse the ncbi tax dump we store this column as UNIQUE_NAME but
> I don't think that we actually use it for anything within the EMBL
> nucleotide sequence bank.
[...]
> Also, I note that there are 548 non-unique NAME_TXT of class 'scientific
> name', so the UNIQUE_NAME column may be of use to someone (though given
> the strength of using a taxid directly I don't see why you'd want to).

Indeed. And given that we are building a taxonomy with nodes, it doesn't matter that two different nodes in the entire taxonomy tree share the same name: the position in the tree is implicitly unique. So if you find yourself with a node called 'Rhodotorula' you can find out which one it is by looking at the closest ranked parent.

That said, for 'Rhodotorula ' the closest ranked parent is 'Sporidiobolales' and not 'Sporidiobolaceae'. Is that a problem? Do we need to care about this word 'Sporidiobolaceae' that is effectively just a synonym of 'Sporidiobolales'?

[Nadeem later replied "...I can't imagine the <> value to be of any use.".
He also clarified that if species have identical names and you store those, you can't work out what the corresponding taxid is. Without the <> bit you need some other information, like the classification. I think this other information will be present in input file formats and it must be up to the user to store the extra when outputting from bioperl] From osborne1 at optonline.net Tue Jul 18 10:50:48 2006 From: osborne1 at optonline.net (Brian Osborne) Date: Tue, 18 Jul 2006 10:50:48 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BC9F3F.2040500@sendu.me.uk> Message-ID: Sendu, The idea to create mini *dmp files is a good one, I think. With respect to temporary files I'm fairly sure that most tests that use them create them some where in t/data and then delete them after. Brian O. On 7/18/06 4:43 AM, "Sendu Bala" wrote: > (~40MB) in t/data... do people think it would be reasonable to create > some sort of small subset of the data? I could just pull out the lines > from names.dmp and nodes.dmp relevant to a few example organisms. Say, > for human and a tricky bacteria and virus? > For the purposes of running the test, where should the index files be > kept? In t/data with the .dmp files or in /tmp? Should the test script > delete them afterwards, or leave them be? From cjfields at uiuc.edu Tue Jul 18 11:44:07 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 18 Jul 2006 10:44:07 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BC8D75.1080806@sendu.me.uk> Message-ID: <003201c6aa81$01db9a30$15327e82@pyrimidine> > What about the existing genus(), species(), sub_species() and variant() > methods? There would be no need for any logic to join things together, > but I would still like to be able to get just 'sapiens' from somewhere. > Can I use species() for that purpose (though again, species is strictly > 'Homo sapiens')? Likewise sub_species() and variant() could hold the > remaining non-redundant names. 
> Or should all of these be deprecated because they don't really have a
> place in a generic Node class?

This is where Hilmar suggests that you have a bit of freedom in doing what
you want, as with binomial(). So species() should return species
('sapiens'), genus() should return genus, etc. At that level there will need
to be some additional data munging, since the ranks below species seem to
include the entire name, not just the species. But this could be done from
the lineage if all nodes are present and tagged as such.

> What about node_name()? Yet another synonym of scientific_name? (right
> now it grabs the common name(s)). Ugh.

I agree things need cleaning up. You could always make node_name() an alias
for scientific_name(), though it could just be deprecated.

> What should I do with the classification array? Should it hold the raw
> ScientificName like:
> join(',', $node->classification) eq 'Homo sapiens, Homo,
> Homo/Pan/Gorilla group [...]'?
> Or should it be like:
> join(',', $node->classification) eq 'sapiens, Homo, Homo/Pan/Gorilla
> group [...]'?

Don't know what the dump file gives; the XML output using efetch via entrez
has the raw lineage (as appears in a GenBank sequence file) and the actual
full lineage with TaxID, rank, 'scientific name,' in the actual lineage
order. I think one problem area will be the 'no rank' designations in the
lineage. Note that the below example also has a species and no genus;
tricky!

    312284
    marine actinobacterium PHSC20C1
    marine actinobacterium strain PHSC20C1
    marine actinobacterium str. PHSC20C1
    78537
    species
    Bacteria
    ...
    cellular organisms; Bacteria; Actinobacteria; Actinobacteria (class);
    unclassified Actinobacteria; unclassified Actinobacteria (miscellaneous)

    131567  cellular organisms                            no rank
    2       Bacteria                                      superkingdom
    201174  Actinobacteria                                phylum
    1760    Actinobacteria (class)                        class
    52018   unclassified Actinobacteria                   no rank
    78537   unclassified Actinobacteria (miscellaneous)   no rank

> The latter is how it currently works (when it works correctly); I would
> rather fix it than lose the logic completely, but if we're staying true
> to proper classification (vs. what a programmer might expect), I guess I
> must use the raw ScientificName?
>
> > binomial() isn't part of the NCBI taxonomy definition, so you have
> > freedom there to report what suits you.
>
> I don't think binomial() would serve any useful purpose now, however. I
> can either deprecate it or make it a synonym of scientific_name() or
> both. Or binomial() can be a version of scientific_name() that complains
> if you use it on a rank higher or lower than species. As for species()
> et al., it may have no place in a generic Node class. Thoughts?

The use of scientific_name() in this context would be more to conform with
what NCBI defines it as rather than as the actual definition; this should be
explicitly stated as such in POD and is more for long-term maintainability.
No matter what is done here, you will have some degree of confusion: those
who want strict adherence to the term 'scientific name' and those who want
the method to conform to NCBI's definition. Better to document the reasoning
for it in some way than risk the random masses complaining.

We could use binomial() for the 'scientific name' as the rest of the world
knows it (as in binomial nomenclature), having it built from genus-species
like you had originally suggested. That's what Hilmar suggested as an
'experimental' area of sorts, since NCBI doesn't use that particular term in
its taxonomy definition.
Chris From cjfields at uiuc.edu Tue Jul 18 11:48:36 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 18 Jul 2006 10:48:36 -0500 Subject: [Bioperl-l] advice In-Reply-To: <44BC5B1E.5080600@infotech.monash.edu.au> Message-ID: <003301c6aa81$a34fd8e0$15327e82@pyrimidine> Guess Dr. Conway became a Perl convert. The reviews of the book state that the 'best practices' really come from his experience as a Perl programmer over the last couple of decades, so maybe he learned something since 1990. Chris > > Ha ! I *almost* added something about that. I thought his vowel keys > were > > broken for a bit, maybe from pounding the keyboard with extreme > frustration! > > The wide variety of pronunciation of English around the world can be > mostly blamed on those damned vowels... so perhaps removing them helps > one to reach a wider audience :-) > > > As an aside, doesn't Damian Conway say something about the non-use of > vowels > > in 'Perl Best Practices?' I think it was in relation to variables, > > though... > > Yeah, on page 46 he says NOT to remove vowels in variable names, use > prefixes instead. Apprntly rmvng vwls mks t hrdr t ndrstnd stff. > > (Actually, I studied at Monash University under Damian Conway, and > recall his ridiculing of Perl, so I found it kind of ironic that he > ended up changing the Perl landscape so significantly! 
He even wrote an > internal publication "theStyle - a guide to C programming style" in > about 1990 in which he violates some of his later Perl Best Practices :-) > > -- > Dr Torsten Seemann http://www.vicbioinformatics.com > Victorian Bioinformatics Consortium, Monash University, Australia > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Tue Jul 18 12:05:48 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 18 Jul 2006 11:05:48 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BC9F3F.2040500@sendu.me.uk> Message-ID: <003401c6aa84$08ff6c80$15327e82@pyrimidine> > I've just realised that there are currently no tests for > Bio::DB::Taxonomy::flatfile, and that the ones for entrez get skipped. > Node doesn't get an especially thorough work-out either (in the skipped > section). > > I'm guessing it's not feasible to include the full taxdump from NCBI > (~40MB) in t/data... do people think it would be reasonable to create > some sort of small subset of the data? I could just pull out the lines > from names.dmp and nodes.dmp relevant to a few example organisms. Say, > for human and a tricky bacteria and virus? > For the purposes of running the test, where should the index files be > kept? In t/data with the .dmp files or in /tmp? Should the test script > delete them afterwards, or leave them be? I would place a small section in t/data or several individual examples in a subdirectory thereof (t/data/taxonomy). > The entrez tests are skipped to 'avoid blocking', but the test only > makes 2 entrez queries with a sleep(3) in-between. Basically, I don't > think there's ever any reason to skip. Shall I remove the skip? Lots of > other database-accessing tests in the test suite just go right ahead and > access their database, no problem. 
Depends on whether there is someone out there who doesn't have a network
connection (and there always is). The DB.t tests skip based on testing for
the environment variable BIOPERLDEBUG:

    1..121
    ok 1 # Skipping tests which require remote servers - set env variable BIOPERLDEBUG to test

You could always do something along those lines, or add a test for a network
connection using an eval block and skip the tests if the network test fails.
But there you run the risk of the tests failing not because of code problems
but because of remote server issues; I've seen this happen with SwissProt
and GenBank testing before during peak hours.

Chris

> Cheers,
> Sendu.

From bix at sendu.me.uk Tue Jul 18 13:03:54 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Tue, 18 Jul 2006 18:03:54 +0100
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <003201c6aa81$01db9a30$15327e82@pyrimidine>
References: <003201c6aa81$01db9a30$15327e82@pyrimidine>
Message-ID: <44BD147A.9020103@sendu.me.uk>

Chris Fields wrote:
>> What about the existing genus(), species(), sub_species() and variant()
>> methods? There would be no need for any logic to join things together,
>> but I would still like to be able to get just 'sapiens' from somewhere.
>> Can I use species() for that purpose (though again, species is strictly
>> 'Homo sapiens')? Likewise sub_species() and variant() could hold the
>> remaining non-redundant names. Or should all of these be deprecated
>> because they don't really have a place in a generic Node class?
>
> This is where Hilmar suggests that you have a bit of freedom in doing what
> you want, as with binomial(). So species() should return species
> ('sapiens'), genus return genus, etc.
[regarding changes to Bio::Taxonomy::Node]

Actually, I'm really strongly leaning toward getting rid of the following
methods and new() options (and giving up entirely on being able to keep
'sapiens' somewhere):

-organelle, organelle()
-division, division()
-sub_species, sub_species()
-variant, variant()
species(), validate_species_name()
genus()
binomial()

As far as I can see none of these methods have any place in a generic Node
class. If you want to know what your species is you have to be rank()
'species' and you just call scientific_name(). The above kind of methods
belong in something like Bio::Species or similar, NOT in Node. Does anyone
disagree? Can anyone offer a justification for keeping these methods?

Changes I haven't yet discussed but have already made (but not committed):

*parent_taxon_id = \&parent_id;
*common_name = \&common_names;
-factory and factory() removed, since there is no
Bio::Taxonomy::FactoryI-implementing module, nothing in Node to make use of
a factory once set, and a factory seems redundant when we're a node with a
-dbh.
validate_name() removed because it just returns 1.

>> What about node_name()? Yet another synonym of scientific_name? (right
>> now it grabs the common name(s)). Ugh.
>
> I agree things need cleaning up. You could always make node_name() an alias
> for scientific_name() though it could just be deprecated.

Actually, I've gone with node_name as the 'pure' and best method to set the
name of your node with, and made scientific_name an alias of it (though it
behaves as suggested earlier in the thread).

>> What should I do with the classification array? Should it hold the raw
>> ScientificName like:
>> join(',', $node->classification) eq 'Homo sapiens, Homo,
>> Homo/Pan/Gorilla group [...]'?

(I've decided to do it the above way for consistency with scientific_name)

>> Or should it be like:
>> join(',', $node->classification) eq 'sapiens, Homo, Homo/Pan/Gorilla
>> group [...]'?
> > Don't know what the dump file gives; the XML output using efetch via entrez > has the raw lineage (as appears in a GenBank sequence file) and the actual > full lineage with TaxID, rank, 'scientific name,' in the actual lineage > order. I think one problem area will be the 'no rank' designations in the > lineage. Note that the below example also has a species and no genus; > tricky! Currently, flatfile and entrez ignore nodes with a rank of 'no rank' when they build the classification array. I had no intention of changing this behaviour. > 1760 > Actinobacteria (class) > class Ugh. I guess my proposal to remove <> bits via flatfile extends to removing () bits via entrez. We don't need unique names; we can use object_id() when uniqueness matters. >> I don't think binomial() would serve any useful purpose now, however. > > We could use binomial() for the 'scientific name' as the rest of the world > knows it (as in binomial nomenclature), having it built from genus-species > like you had originally suggested. No, see above. I don't think it makes the slightest bit of sense for a Node to go around trying to build things from a parent it may or may not have. Again, binomial() is a method for something like Bio::Species, not a generic Node class. From cjfields at uiuc.edu Tue Jul 18 15:34:29 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 18 Jul 2006 14:34:29 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BD147A.9020103@sendu.me.uk> Message-ID: <000901c6aaa1$328dd3d0$15327e82@pyrimidine> ... 
> [regarding changes to Bio::Taxonomy::Node] > > Actually, I'm really strongly leaning toward getting rid of the > following methods and new() options (and giving up entirely on being > able to keep 'sapiens' somewhere): > > -organelle, organelle() > -division, division() > -sub_species, sub_species() > -variant, variant() > species(), validate_species_name() > genus() > binomial() > > As far as I can see none of these methods have any place in a generic > Node class. If you want to know what your species is you have to be > rank() 'species' and you just call scientific_name(). The above kind of > methods belong in something like Bio::Species or similar, NOT in Node. > Does anyone disagree? Can anyone offer a justification for keeping these > methods? Bio::Species and Bio::Taxonomy::Node are closely linked and plans are to have Bio::Species delegate methods to Bio::Taxonomy::Node. So any changes to Node will affect Bio::Species to some degree. If you can get the lineage from XML, you could set many of these based on the rank given. Jason uses XML::Twig in Bio::DB::Taxonomy::entrez to parse out the XML data into Bio::Taxonomy::Node objects; it shouldn't be difficult to leave some methods based on rank (genus, species, etc) as simple get/set methods for the time being and leave the heavy lifting to the modules dealing directly with the data. Bio::Species could then delegate data/methods over to Bio::Taxonomy::Node fairly easily. If there is no genus/species data to be grabbed (either it doesn't exist or isn't present for some reason), then simply leave it as undef. That's also why I thought binomial() could stick around; if you have both the genus() and species() you could grab both using binomial(), building in special cases or error handling in case genus() or species() or both return undef. I don't see the problem in keeping this as long as users know what it means: by detailing the method in POD. If someone complains we tell them to RTFM. 
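
A minimal sketch of the binomial() idea described above, assuming plain
get/set accessors for genus() and species() on a node-like object. The
package name Sketch::Node and its constructor are hypothetical and are not
the Bio::Taxonomy::Node API; the only point is that binomial() returns undef
when either part is missing:

```perl
use strict;
use warnings;

# Hypothetical stand-in for a node with simple get/set rank accessors.
package Sketch::Node;

sub new {
    my ($class, %args) = @_;
    return bless { %args }, $class;
}

# Simple get/set accessors; values that were never set stay undef.
sub genus {
    my $self = shift;
    $self->{genus} = shift if @_;
    return $self->{genus};
}

sub species {
    my $self = shift;
    $self->{species} = shift if @_;
    return $self->{species};
}

# Build "Genus species" only when both parts are known.
sub binomial {
    my $self = shift;
    my ($genus, $species) = ($self->genus, $self->species);
    return unless defined $genus && defined $species;
    return "$genus $species";
}

package main;

my $human = Sketch::Node->new(genus => 'Homo', species => 'sapiens');
print $human->binomial, "\n";

# A species-rank node whose lineage carries no genus.
my $no_genus = Sketch::Node->new(species => 'PHSC20C1');
my $name     = $no_genus->binomial;
print defined $name ? $name : 'undef', "\n";
```

Whether a missing genus should yield undef, a warning, or an exception is a
design choice for the POD to spell out; the sketch simply picks undef.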
> Changes I haven't yet discussed but have already made (but not committed): > > *parent_taxon_id = \&parent_id; > *common_name = \&common_names; > -factory and factory() removed, since there is no > Bio::Taxonomy::FactoryI-implementing module, nothing in Node to make use > of a factory once set, and a factory seems redundant when we're a node > with a -dbh. > validate_name() removed because it just returns 1. > ... > Actually, I've gone with node_name as the 'pure' and best method to set > the name of your node with, and made scientific_name an alias of it > (though it behaves as suggested earlier in the thread). I don't have any problem with that. As long as it conforms somewhat to the NCBI definition to prevent confusion I think it's okay. > >> What should I do with the classification array? Should it hold the raw > >> ScientificName like: > >> join(',', $node->classification) eq 'Homo sapiens, Homo, > >> Homo/Pan/Gorilla group [...]'? > > (I've decided to do it the above way for consistency with scientific_name) I think that's fine. ... > Currently, flatfile and entrez ignore nodes with a rank of 'no rank' > when they build the classification array. I had no intention of changing > this behaviour. If you ignore nodes with 'no rank' there will be major problems when retrieving certain TaxID's from protein/nucleotide sequences. I had posted some sample XML from many NCBI TaxIDs taken from sequence files and via ELink and a good many of those nodes (most of them from genome projects) have 'no rank'. 376686 Flavobacterium johnsoniae UW101 ... 986 no rank ... 373903 Halothermothrix orenii H 168 ... 31909 no rank These aren't 'edge cases' anymore but now are pretty common from genome sequencing. I would just assign 'no rank' to rank() and have the node retained for DB purposes. It seems that the tax dump loses quite a bit of information somewhere along the way that shows up in the XML. Or am I wrong? > > 1760 > > Actinobacteria (class) > > class > > Ugh. 
> I guess my proposal to remove <> bits via flatfile extends to
> removing () bits via entrez. We don't need unique names; we can use
> object_id() when uniqueness matters.

The XML parsing in Taxonomy::entrez will take care of the tags and retain
the character data in between. It would be a matter of setting the parser
correctly to grab the relevant data and assign it properly.

> >> I don't think binomial() would serve any useful purpose now, however.
> >
> > We could use binomial() for the 'scientific name' as the rest of the world
> > knows it (as in binomial nomenclature), having it built from genus-species
> > like you had originally suggested.
>
> No, see above. I don't think it makes the slightest bit of sense for a
> Node to go around trying to build things from a parent it may or may not
> have. Again, binomial() is a method for something like Bio::Species, not
> a generic Node class.

Bio::Species, from what I gather, was initially created to hold the tax data
from GenBank/EMBL/SwissProt (RichSeq) files and is not DB-aware.
Bio::Taxonomy::Node was supposed to be like Bio::Species and also be
DB-aware:

http://thread.gmane.org/gmane.comp.lang.perl.bio.general/4284/focus=4321

Again, Bio::Species methods are supposed to (eventually) delegate to
Bio::Taxonomy::Node, so the two are closely linked along with their methods.

Any way we go about it here (keeping certain methods and tossing others,
changing the data returned, etc), it looks like there will be API issues
down the road which will directly affect anyone using tax data. That affects
bioperl-db directly as well as any other bioperl-based DB's which rely on
tax data. So we need to tread a bit carefully when making major changes to
make sure that they work for bioperl-db and anywhere else that may require
it.
Chris From cjfields at uiuc.edu Tue Jul 18 15:41:31 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 18 Jul 2006 14:41:31 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BD147A.9020103@sendu.me.uk> Message-ID: <000a01c6aaa2$2b4f50c0$15327e82@pyrimidine> Sendu et al, I'll play around with adding a quick method to Bio::Species for scientific_name(); if I can get it to play nice with Bio::SeqIO::genbank and it passes tests I'll commit it. Chris From golharam at umdnj.edu Tue Jul 18 15:36:54 2006 From: golharam at umdnj.edu (Ryan Golhar) Date: Tue, 18 Jul 2006 15:36:54 -0400 Subject: [Bioperl-l] advice In-Reply-To: <003301c6aa81$a34fd8e0$15327e82@pyrimidine> Message-ID: <00a501c6aaa1$86edb620$2f01a8c0@GOLHARMOBILE1> Right. There was a chain letter going around the internet for awhile about how you can leave out certain letters and the human brain will still be able to correctly interpret what the word is supposed to be. Either that or it was something about how Europe was adopting a new variation of English and after many successions it started to sound/look like German. > The wide variety of pronunciation of English around the world can be > mostly blamed on those damned vowels... so perhaps removing them helps > one to reach a wider audience :-) > > > As an aside, doesn't Damian Conway say something about the non-use > > of > vowels > > in 'Perl Best Practices?' I think it was in relation to variables, > > though... > > Yeah, on page 46 he says NOT to remove vowels in variable names, use > prefixes instead. Apprntly rmvng vwls mks t hrdr t ndrstnd stff. 
From cjfields at uiuc.edu Tue Jul 18 17:44:29 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Tue, 18 Jul 2006 16:44:29 -0500
Subject: [Bioperl-l] Bio::SeqIO::genbank and Bio::Species
Message-ID: <000001c6aab3$58ee7bd0$15327e82@pyrimidine>

For a given GenBank file, you'll have the following (this is from NCBI's
current flatfile format,
http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html):

LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p
            (AXL2) and Rev7p (REV7) genes, complete cds.
ACCESSION   U49845
VERSION     U49845.1  GI:1293613
KEYWORDS    .
SOURCE      Saccharomyces cerevisiae (baker's yeast)
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes;
            Saccharomycetales; Saccharomycetaceae; Saccharomyces.
...

The SOURCE line above, according to NCBI, contains an abbreviated name and a
common name (optional); it can also apparently contain additional
information, such as organelles and so on. The ORGANISM line contains NCBI's
definition of the formal scientific name (see the related thread on Taxonomy
proposed changes) along with lineage information.

Currently, Bio::SeqIO::genbank and Bio::Species are very inconsistent with
bacterial names, so when I process everything through SeqIO I get:

SOURCE      Mycobacterium tuberculosis H37Rv H37Rv
  ORGANISM  Mycobacterium tuberculosis

SOURCE      Mycobacterium tuberculosis CDC1551 CDC1551
  ORGANISM  Mycobacterium tuberculosis

SOURCE      Mycobacterium avium subsp. paratuberculosis K-10 paratuberculosis K-10
  ORGANISM  Mycobacterium avium subsp.

SOURCE      Bacillus sp. NRRL B-14911 NRRL B-14911
  ORGANISM  Bacillus sp.

I have added a scientific_name() method to Bio::Species to contain the
string on the ORGANISM line and replace it as is, which seems to work well
(doesn't chop the name down).

The bigger issue is the mess with the SOURCE line.
This stems from adding back information from sub_species(), which I don't
think needs to be done as it's supposed to be an abbreviated name.

Anybody mind if I try splitting up the original SOURCE line data into
organelle(), abbreviated_name(), and common_name()? This will change
common_name a bit (so, instead of 'Saccharomyces cerevisiae' it will give
'baker's yeast') but will also conform more to the NCBI definition of
'common name.' Also, organelle info isn't handled yet; I could toy with
adding support for it. Any objections?

I may proceed to do the same with EMBL, SwissProt, and others that use
Bio::Species if this works out.

Christopher Fields
Postdoctoral Researcher - Switzer Lab
Dept. of Biochemistry
University of Illinois Urbana-Champaign

From bix at sendu.me.uk Tue Jul 18 18:50:37 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Tue, 18 Jul 2006 23:50:37 +0100
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <000901c6aaa1$328dd3d0$15327e82@pyrimidine>
References: <000901c6aaa1$328dd3d0$15327e82@pyrimidine>
Message-ID: <44BD65BD.4030501@sendu.me.uk>

Chris Fields wrote:
> ...
>> [regarding changes to Bio::Taxonomy::Node]
>>
>> Actually, I'm really strongly leaning toward getting rid of the
>> following methods and new() options (and giving up entirely on being
>> able to keep 'sapiens' somewhere):
>>
>> -organelle, organelle()
>> -division, division()
>> -sub_species, sub_species()
>> -variant, variant()
>> species(), validate_species_name()
>> genus()
>> binomial()
>
> Bio::Species and Bio::Taxonomy::Node are closely linked and plans are to
> have Bio::Species delegate methods to Bio::Taxonomy::Node. So any changes
> to Node will affect Bio::Species to some degree.

I see from the original postings that Node was intended to be like Species,
but I don't think it makes the slightest bit of sense. A /single/ Node need
only (must only!) represent the information for a single node in the
taxonomy. Or else what do these objects mean?
What is the object model? It's bad bad bad for it to be sensible one way
(when you're making your own taxonomy by making your own nodes) and
nonsensical another (when we stuff in methods so that Bio::Species is
happy). The way Node is written right now, and what you're suggesting, is
that we stuff the entire Taxonomy into the Node. Well, except that you don't
even have methods for every taxonomic level - there is genus() but no
subphylum(). I can't emphasise strongly enough how insane all this is.

The correct thing for Bio::Species to interact with is Bio::Taxonomy.
Bio::Taxonomy is a collection of Nodes and has the sort of methods that
Bio::Species would need to delegate its current functionality.

I'm quite willing to do a proper overhaul here so everything makes sense.
You either make your own nodes and add these to a Taxonomy or use a factory
(which would use a Bio::DB::Taxonomy presumably). A Taxonomy lets you
discover the classification of any node it contains. Bio::Species could
implement a method like genus() by:

$node = $taxonomy->get_node('genus') || return;
return $node->scientific_name;

Bio::Taxonomy isn't perfect, but I can certainly get it to do its job. I'd
probably make it rank-name and order independent for starters.

Bio::Taxonomy::Node needs to be reduced right down to just hold data about
the node it represents, and possibly its parent node id (or other way of
getting to its parent). So now I'm proposing dropping the classification()
method from Node as well. It's simply not necessary; Bio::Taxonomy should
give you that information.

Bio::Taxonomy::FactoryI doesn't make much sense to me at the moment from its
docs, but it could be used to build a Taxonomy (that seems to be its intent,
I'm just not sure what some of the methods are really supposed to do) such
that Node might not even need any methods for getting its parent or child
nodes. The Factory or Taxonomy might be able to deal with that.
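
The division of labour proposed above can be sketched as follows. This is an
illustration under stated assumptions, not the existing Bio::Taxonomy or
Bio::Taxonomy::Node code: the Sketch::* package names, add_node() and the
array-based storage are hypothetical, the ids are arbitrary placeholders,
and get_node() is assumed to look a node up by rank as in the two-line
snippet above.

```perl
use strict;
use warnings;

# Hypothetical node: holds only its own data plus its parent's id.
package Sketch::Node;

sub new {
    my ($class, %args) = @_;
    return bless { %args }, $class;
}

sub object_id       { return $_[0]{object_id} }
sub parent_id       { return $_[0]{parent_id} }
sub rank            { return $_[0]{rank} }
sub scientific_name { return $_[0]{scientific_name} }

# Hypothetical taxonomy: a collection of nodes that answers
# lineage-level questions so that the nodes do not have to.
package Sketch::Taxonomy;

sub new {
    my $class = shift;
    return bless { nodes => [] }, $class;
}

sub add_node {
    my ($self, $node) = @_;
    push @{ $self->{nodes} }, $node;
    return $self;
}

# Return the first node with the given rank, or nothing if none exists.
sub get_node {
    my ($self, $rank) = @_;
    for my $node (@{ $self->{nodes} }) {
        return $node if $node->rank eq $rank;
    }
    return;
}

# Rank-specific questions are answered by delegating to the right node.
sub genus {
    my $self = shift;
    my $node = $self->get_node('genus') || return;
    return $node->scientific_name;
}

sub binomial {
    my $self = shift;
    my $node = $self->get_node('species') || return;
    return $node->scientific_name;
}

package main;

# Ids below are arbitrary placeholders, not real taxids.
my $taxonomy = Sketch::Taxonomy->new;
$taxonomy->add_node(Sketch::Node->new(
    object_id => 1, parent_id => 0,
    rank => 'genus', scientific_name => 'Homo'));
$taxonomy->add_node(Sketch::Node->new(
    object_id => 2, parent_id => 1,
    rank => 'species', scientific_name => 'Homo sapiens'));

print $taxonomy->genus, "\n";
print $taxonomy->binomial, "\n";
```

With this split a node never needs to know about ranks other than its own,
and a missing rank simply comes back as undef from the taxonomy.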
In short, I'm proposing a major change to Bio::Taxonomy::Node (make it just a node), and minor changes to (& implementation of) Bio::Taxonomy and Bio::Taxonomy::FactoryI such that they actually get used to do their jobs. > That's also why I thought binomial() could stick around; if you have both > the genus() and species() you could grab both using binomial(), building in > special cases or error handling in case genus() or species() or both return > undef. binomial() would belong in (and is present in) Bio::Taxonomy. But in any case, it's not needed there either; if you want the binomial you just ask for the scientific_name of the species node in your Taxonomy, since this now contains the actual scientific name == binomial. binomial() in Bio::Taxonomy could be reimplemented as: $node = $self->get_node('species') || return; return $node->scientific_name; >> Currently, flatfile and entrez ignore nodes with a rank of 'no rank' >> when they build the classification array. I had no intention of changing >> this behaviour. > > If you ignore nodes with 'no rank' there will be major problems when > retrieving certain TaxID's from protein/nucleotide sequences. This is only for the classification array, which is meaningless anyway (there only for file-format compatibility). If you want the real information you ask your Bio::Taxonomy (which asks each of its nodes). This is the whole point of having Bio::Taxonomy in the first place. It gives you great flexibility to do whatever you want to do. >>> 1760 >>> Actinobacteria (class) >>> class >> Ugh. I guess my proposal to remove <> bits via flatfile extends to >> removing () bits via entrez. We don't need unique names; we can use >> object_id() when uniqueness matters. > > The XML parsing in Taxonomy::entrez will take care of the and retains > the character data in between. You misunderstood. I meant the <> bits I discussed at the very start of this thread, that flatfile gives you. 
Here I'm referring to getting rid of ' (class)' as well.

> Any way we go about it here (keeping certain methods and tossing others,
> changing the data returned, etc), it looks like there will be API issues
> down the road which will directly affect anyone using tax data. That
> affects bioperl-db directly as well as any other bioperl-based DB's which
> rely on tax data. So we need to tread a bit carefully when making major
> changes to make sure that they work for bioperl-db and anywhere else that
> may require it.

Does anything make serious use of the current Bio::Taxonomy code? Or are
they using Bio::Species?

From cjfields at uiuc.edu Wed Jul 19 00:38:05 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Tue, 18 Jul 2006 23:38:05 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44BD65BD.4030501@sendu.me.uk>
References: <000901c6aaa1$328dd3d0$15327e82@pyrimidine> <44BD65BD.4030501@sendu.me.uk>
Message-ID:

I think we should wait a bit for any dramatic changes but implement the ones
there seems to be a consensus on. I understand your reasoning for taking
this on but I'm not sure completely revamping Bio::Taxonomy w/o input from
the core developers is wise, especially since we do NOT know who uses it,
why they use it, and how changing/removing methods will affect their code.

We are doing nothing productive here by constantly butting heads on this and
having different opinions on what we think Bio::Taxonomy/Bio::Species is
best suited for, when neither one of us is actually sure about who uses it
and why. A reasonable solution is there but we must rely on outside opinions
in order to reach it, so I propose a short moratorium on changes to
Bio::Taxonomy/Bio::Species that radically redefine the API on either class.

BTW, for anybody following, I'm perfectly comfortable if Sendu takes the
lead on this and implements his changes; I'm just not sure about stripping
the class down to the bare minimum.
So far, the only thing that has been proposed (and accepted by all) is that scientific_name() hold the data for that tag in a node. I think most here would agree that's fine; I've already added a get/set to Bio::Species but haven't committed it yet. However, what you propose doing below is refactoring the code and changing the API. I agree there needs to be an overhaul but we can't do this w/o guidance or input from the GBE (Great Bioperl Elders). I would like some of the 'senior' core developers chime in a bit more on their thoughts on this. Jason also mentioned somewhere that any changes for Taxonomy/ Species should be tracked on the wiki somewhere as well to make sure everything is kosher and keep users up-to-date. I would like his input here but I think he's still incommunicado at the moment. Chris On Jul 18, 2006, at 5:50 PM, Sendu Bala wrote: > Chris Fields wrote: >> ... >>> [regarding changes to Bio::Taxonomy::Node] >>> >>> Actually, I'm really strongly leaning toward getting rid of the >>> following methods and new() options (and giving up entirely on being >>> able to keep 'sapiens' somewhere): >>> >>> -organelle, organelle() >>> -division, division() >>> -sub_species, sub_species() >>> -variant, variant() >>> species(), validate_species_name() >>> genus() >>> binomial() >> >> Bio::Species and Bio::Taxonomy::Node are closely linked and plans >> are to >> have Bio::Species delegate methods to Bio::Taxonomy::Node. So any >> changes >> to Node will affect Bio::Species to some degree. > > I see from the original postings that Node was intended to be like > Species, but I don't think it makes the slightest bit of sense. A > /single/ Node need only (must only!) represent the information for a > single node in the taxonomy. Or else what do these objects mean? > What is > the object model? 
It's bad bad bad for it to be sensible one way (when > you're making your own taxonomy by making your own nodes) and > nonsensical another (when we stuff in methods so that Bio::Species is > happy). The way Node is written right now, and what you're suggesting, > is that we stuff the entire Taxonomy into the Node. Well, except that > you don't even have methods for every taxonomic level - there is > genus() > but no subphylum(). I can't emphasise strongly enough how insane all > this is. > > The correct thing for Bio::Species to interact with is Bio::Taxonomy. > Bio::Taxonomy is a collection of Nodes and has the sort of methods > that > Bio::Species would need to delegate its current functionality. > > I'm quite willing to do a proper overhaul here so everything makes > sense. You either make your own nodes and add these to a Taxonomy > or use > a factory (which would use a Bio::DB::Taxonomy presumably). A Taxonomy > lets you discover the classification of any node it contains. > Bio::Species could implement a method like genus() by: > $node = $taxonomy->get_node('genus') || return; > return $node->scientific_name; > > Bio::Taxonomy isn't perfect, but I can certainly get it to do its job. > I'd probably make it rank-name and order independent for starters. > > Bio::Taxonomy::Node needs to be reduced right down to just hold data > about the node it represents, and possibly its parent node id (or > other > way of getting to its parent). So now I'm proposing dropping the > classification() method from Node as well. It's simply not necessary; > Bio::Taxonomy should give you that information. > > Bio::Taxnomoy::FactoryI doesn't make much sense to me at the moment > from > its docs, but it could be used to build a Taxonomy (that seems to > be its > intent, I'm just not sure what some of the methods are really supposed > to do) such that Node might not even need any methods for getting its > parent or child nodes. The Factory or Taxonomy might be able to deal > with that. 
> > In short, I'm proposing a major change to Bio::Taxonomy::Node (make it > just a node), and minor changes to (& implementation of) Bio::Taxonomy > and Bio::Taxonomy::FactoryI such that they actually get used to do > their > jobs. > > >> That's also why I thought binomial() could stick around; if you >> have both >> the genus() and species() you could grab both using binomial(), >> building in >> special cases or error handling in case genus() or species() or >> both return >> undef. > > binomial() would belong in (and is present in) Bio::Taxonomy. But > in any > case, it's not needed there either; if you want the binomial you just > ask for the scientific_name of the species node in your Taxonomy, > since > this now contains the actual scientific name == binomial. > > binomial() in Bio::Taxonomy could be reimplemented as: > $node = $self->get_node('species') || return; > return $node->scientific_name; > > >>> Currently, flatfile and entrez ignore nodes with a rank of 'no rank' >>> when they build the classification array. I had no intention of >>> changing >>> this behaviour. >> >> If you ignore nodes with 'no rank' there will be major problems when >> retrieving certain TaxID's from protein/nucleotide sequences. > > This is only for the classification array, which is meaningless anyway > (there only for file-format compatibility). If you want the real > information you ask your Bio::Taxonomy (which asks each of its nodes). > This is the whole point of having Bio::Taxonomy in the first place. > > It gives you great flexibility to do whatever you want to do. > > >>>> 1760 >>>> Actinobacteria (class) >>>> class >>> Ugh. I guess my proposal to remove <> bits via flatfile extends to >>> removing () bits via entrez. We don't need unique names; we can use >>> object_id() when uniqueness matters. >> >> The XML parsing in Taxonomy::entrez will take care of the >> and retains >> the character data in between. > > You misunderstood. 
I meant the <> bits I discussed at the very > start of > this thread, that flatfile gives you. Here I'm referring to getting > rid > of ' (class)' as well. > > >> Any way we go about it here (keeping certain methods and tossing >> others, >> changing the data returned, etc), it looks like there will be API >> issues >> down the road which will directly affect anyone using tax data. That >> affects bioperl-db directly as well as any other bioperl-based >> DB's which >> rely on tax data. So we need to tread a bit carefully when making >> major >> changes to make sure that they work for bioperl-db and anywhere >> else that >> may require it. > > Does anything make serious use of the current Bio::Taxonomy code? > Or are > they using Bio::Species? > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From ong at embl.de Wed Jul 19 03:51:48 2006 From: ong at embl.de (ong at embl.de) Date: Wed, 19 Jul 2006 09:51:48 +0200 Subject: [Bioperl-l] Fwd: Re: BioPerl query Message-ID: <20060719095148.f71b1v3p7qosk440@webmail.embl.de> HI, Anybody have an answer to the below query? Thanks. Regards, Ong ----- Forwarded message from birney at ebi.ac.uk ----- Date: Wed, 19 Jul 2006 08:16:06 +0100 From: Ewan Birney Reply-To: Ewan Birney Subject: Re: BioPerl query To: ong at embl.de On 18 Jul 2006, at 10:26, ong at embl.de wrote: > Dear Birney, > > Good day i wish to get your advise on how do i print out the PSM > matrix from > the code below. Thanks > I would ask this message on the bioperl list, not to me directly. 
> Regards, > Ong > > use Bio::Matrix::PSM::IO; > > my $psmIO=new Bio::Matrix::PSM::IO(-file=>'matrix.dat', > -format=>'transfac'); > while (my $psm=$psmIO->next_psm) { > my $id=$psm->id; > my $an=$psm->accession_number; > my $re = $psm->regexp; > #my $l=$psm->width; > my $cons=$psm->IUPAC; > print"$id\t$an\t$re\t$l\t$cons\t$psm\n"; > } ----- End forwarded message ----- From rmb32 at cornell.edu Tue Jul 18 20:06:02 2006 From: rmb32 at cornell.edu (Robert Buels) Date: Tue, 18 Jul 2006 17:06:02 -0700 Subject: [Bioperl-l] bioperl pulls, xml parsers push, and things get complicated Message-ID: <44BD776A.1080402@cornell.edu> Hi all, Here's a kind of abstract question about Bioperl and XML parsing: I'm thinking about writing a bioperl parser for genomethreader XML, and I'm sort of mulling over the 'impedance mismatch' between the way bioperl Bio::*IO::* modules work and the way all of the current XML parsers work. Bioperl uses a 'pull' model, where every time you want a new chunk of stuff, you call $io_object->next_thing. All the XML parsers (including XML::SAX, XML::Parser::PerlSAX and XML::Twig) use a 'push' model, where every time they parse a chunk, they call _your_ code, usually via a subroutine reference you've given to the XML parser when you start it up. From what I can tell, current Bioperl IO modules that parse XML are using push parsers to parse the whole document, holding stuff in memory, then spoon-feeding it in chunks to the calling program when it calls next_*(). This is fine until the input XML gets really big, in which case you can quickly run out of memory. Does anybody have good ideas for nice, robust ways of writing a bioperl IO module for really big input XML files? There don't seem to be any perl pull parsers for XML. 
All I've dug up so far would be having the XML push parser running in a different thread or process, pushing chunks of data into a pipe or similar structure that blocks the progress of the push parser until the pulling bioperl code wants the next piece of data, but there are plenty of ugly issues with that, whether one were to use perl threads for it (aaagh!) or fork and push some kind of intermediate format through a pipe or socket between the two processes (eek!). So, um, if you've read this far, do you have any ideas? Rob -- Robert Buels SGN Bioinformatics Analyst 252A Emerson Hall, Cornell University Ithaca, NY 14853 Tel: 503-889-8539 rmb32 at cornell.edu http://www.sgn.cornell.edu From alc at sanger.ac.uk Wed Jul 19 06:55:12 2006 From: alc at sanger.ac.uk (Avril Coghlan) Date: Wed, 19 Jul 2006 11:55:12 +0100 Subject: [Bioperl-l] parsing est2genome output Message-ID: <1153306513.27383.12.camel@deskpro104.dynamic.sanger.ac.uk> An embedded and charset-unspecified text was scrubbed... Name: not available URL: From bernd.web at gmail.com Wed Jul 19 07:36:08 2006 From: bernd.web at gmail.com (Bernd Web) Date: Wed, 19 Jul 2006 13:36:08 +0200 Subject: [Bioperl-l] SearchIO HOWTO Message-ID: <716af09c0607190436n5fdd5576m23887051aaf95f8e@mail.gmail.com> Hi, On http://www.bioperl.org/wiki/HOWTO:SearchIO there is a great HOWTO on parsing your BLAST report. In the Table of methods, the third line from the bottom is: "HSP alignment Not available in this report Bio::SimpleAlign object " Would it not be good to add the get_aln method ( $hsp->get_aln) ? The line in "Using the methods" my $alignment_as_string = $alnIO->write_aln($aln); may be confusing: $alignment_as_string will be "1" on success and the alignment is printed to STDOUT. Should IO::String be introduced here to set up a string filehandle? 
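One way to get the written text into a string can be sketched with core Perl alone: open a filehandle onto a scalar and hand that handle to the writer. The write_to_handle sub below is a hypothetical stand-in for a writer such as write_aln, used only so the sketch runs without BioPerl installed; IO::String should be able to fill the same role as the in-memory open.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a writer like Bio::AlignIO's write_aln():
# it prints to the handle it was given and returns 1 on success.
sub write_to_handle {
    my ($fh, @lines) = @_;
    print {$fh} "$_\n" for @lines;
    return 1;
}

my $buffer = '';

# Core Perl (5.8 and later) can open a filehandle onto a scalar.
open(my $fh, '>', \$buffer) or die "Cannot open in-memory handle: $!";

# Made-up alignment lines, purely for illustration.
my $status = write_to_handle($fh, 'seq1  ACGT', 'seq2  AC-T');
close($fh);

print "status: $status\n";   # the return value is only a success flag
print $buffer;               # the written text itself ends up in $buffer
```

Assuming Bio::AlignIO accepts any filehandle through -fh, the same $fh could be passed to Bio::AlignIO->new and the alignment text read back from $buffer once the handle is closed.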
Best regards, Bernd From hlapp at gmx.net Wed Jul 19 09:40:47 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Wed, 19 Jul 2006 09:40:47 -0400 Subject: [Bioperl-l] bioperl pulls, xml parsers push, and things get complicated In-Reply-To: <44BD776A.1080402@cornell.edu> References: <44BD776A.1080402@cornell.edu> Message-ID: <73755CCF-2966-4580-BBEF-1F8A94CDC55D@gmx.net> In the past the way this was done for potentially big XML files is to use regex-based extraction of chunks that correspond to a object you want to return per call to next_XXX(). That chunk would then be passed on to the XML parser under the hood. This only gets problematic once even the chunks are huge, or the name of the element that encloses your chunk can be ambiguous with what's in your text. The latter is unlikely though if you include the angle brackets. I believe this is how at least some bioperl parsers for XML-based formats were written, and it seemed to work fine. -hilmar On Jul 18, 2006, at 8:06 PM, Robert Buels wrote: > Hi all, > > Here's a kind of abstract question about Bioperl and XML parsing: > > I'm thinking about writing a bioperl parser for genomethreader XML, > and > I'm sort of mulling over the 'impedence mismatch' between the way > bioperl Bio::*IO::* modules work and the way all of the current XML > parsers work. Bioperl uses a 'pull' model, where every time you > want a > new chunk of stuff, you call $io_object->next_thing. All the XML > parsers (including XML::SAX, XML::Parser::PerlSAX and XML::Twig) use a > 'push' model, where every time they parse a chunk, they call _your_ > code, usually via a subroutine reference you've given to the XML > parser > when you start it up. > > From what I can tell, current Bioperl IO modules that parse XML are > using push parsers to parse the whole document, holding stuff in > memory, > then spoon-feeding it in chunks to the calling program when it calls > next_*(). 
This is fine until the input XML gets really big, in which > case you can quickly run out of memory. > > Does anybody have good ideas for nice, robust ways of writing a > bioperl > IO module for really big input XML files? There don't seem to be any > perl pull parsers for XML. All I've dug up so far would be having the > XML push parser running in a different thread or process, pushing > chunks > of data into a pipe or similar structure that blocks the progress > of the > push parser until the pulling bioperl code wants the next piece of > data, > but there are plenty of ugly issues with that, whether one were too > use > perl threads for it (aaagh!) or fork and push some kind of > intermediate > format through a pipe or socket between the two processes (eek!). > > So, um, if you've read this far, do you have any ideas? > > Rob > > -- > Robert Buels > SGN Bioinformatics Analyst > 252A Emerson Hall, Cornell University > Ithaca, NY 14853 > Tel: 503-889-8539 > rmb32 at cornell.edu > http://www.sgn.cornell.edu > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From jay at jays.net Wed Jul 19 09:43:52 2006 From: jay at jays.net (Jay Hannah) Date: Wed, 19 Jul 2006 08:43:52 -0500 (CDT) Subject: [Bioperl-l] Walking multiple bioentries using bioperl-db Message-ID: Howdy -- I'm using bioperl-db + biosql-schema + mySQL. I can now successfully build a biosql-schema instance in mySQL, load taxonomy, then using bioperl-db load a GenBank file from disk, commiting the sequences I want. For a given accession number + version + namespace, I can tell bioperl-db to delete that from mySQL and it does. Yay!! I'll be throwing a "Using bioperl-db" document onto the wiki over the next week. 
What I am currently baffled by: How do I ask bioperl-db to walk over multiple bioentries in my database so I can do things with them? The simplest possible example: print a list of all bioentries in my database. It is trivially easy to just query mySQL directly, but if I'm reading / understanding the documentation correctly bioperl-db intends to be database schema and RDBMS agnostic. In that case, I should use bioperl-db to walk my records. So, how do I do that? Is Bio::DB::Query::BioQuery the way to do this? The only way? If so then can someone help me understand the datacollections() and where() methods? perldoc Bio::DB::Query::BioQuery # all mouse sequences loaded under namespace ensembl that # have receptor in their description $query->datacollections(["Bio::PrimarySeqI e", "Bio::Species=>Bio::PrimarySeqI sp", "BioNamespace=>Bio::PrimarySeqI db"]); $query->where(["sp.binomial like 'Mus *'", "e.desc like '*receptor*'", "db.namespace = 'ensembl'"]); # all mouse sequences loaded under namespace ensembl that # have receptor in their description, and that also have a # cross-reference with SWISS as the database $query->datacollections(["Bio::PrimarySeqI e", "Bio::Species=>Bio::PrimarySeqI sp", "BioNamespace=>Bio::PrimarySeqI db", "Bio::Annotation::DBLink xref", I'm bewildered by this API. Please forgive my ignorance. 1) How do I get *all* bioentries out of my database? 2) Say I did want just the "namespace" 'Pico' (one of my biodatabase.name's). Where did "BioNamespace=>Bio::PrimarySeqI db"]); come from? How was I supposed to figure out the left hand side of that mapping? The right hand side? If that line wasn't sitting in that document was there a way for me to figure it out as a *user* of bioperl-db? Or would I need to be a *programmer* of bioperl-db reading source to figure this out? Where did "db.namespace = 'ensembl'"]); come from? Again, do I have to read source code to know how to invoke that magic? Sorry if I sound like a jerk. That is not my intention. 
Hopefully I can document the answers for future bioperl-db'ers. Thanks in advance, j my current plaything: http://openlab.jays.net From cjfields at uiuc.edu Wed Jul 19 10:34:48 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Wed, 19 Jul 2006 09:34:48 -0500 Subject: [Bioperl-l] bioperl pulls, xml parsers push, and things get complicated In-Reply-To: <44BD776A.1080402@cornell.edu> Message-ID: <002801c6ab40$7cfcd980$15327e82@pyrimidine> The Bio::SearchIO modules are supposed to work like a SAX parser, where results are returned as the report is parsed b/c of the occurrence of specific 'events' (start_element, end_element, and so on). However, the actual behaviour for each module changes depending on the report type and the author's intention. There was a thread about a month ago on HMMPFAM report parsing where there was some contention as to how to build hits(models)/HSPs(domains). HMMPFAM output has one HSP per hit and is sorted on the sequence length so a particular hit can appear more than once, depending on how many times it hits along the sequence length itself. So, to gather all the HSPs together under one hit you would have to parse the entire report and build up a Hit/HSP tree, then use the next_hit/next_hsp iterators to parse through everything. Currently it just reports Hit/HSP pairs and it is up to the user to build that tree. In contrast, BLAST output should be capable of throwing hit/HSP clusters on the fly based on the report output, but is quite slow (even the XML output crawls). Jason thinks it's b/c of object inheritance and instantiation; I think it's probably more complicated than that (there are a ton of method calls which tend to slow things down quite a bit as well). 
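For the HMMPFAM case described above, where the same hit can turn up in several Hit/HSP pairs, the regrouping that is currently left to the user might look roughly like the following. This is only a sketch: the pairs are plain hashes with made-up names and coordinates rather than SearchIO objects, and it holds every pair in memory, which is the very cost being discussed in this thread.

```perl
use strict;
use warnings;

# Hypothetical flat stream of hit/HSP pairs, in the order a report lists them.
my @pairs = (
    { hit => 'PF00001', start => 10,  end => 80  },
    { hit => 'PF00002', start => 95,  end => 140 },
    { hit => 'PF00001', start => 150, end => 220 },
);

my %hsps_for;    # hit name => array ref of its HSP records
my @hit_order;   # hit names in order of first appearance

for my $pair (@pairs) {
    my $name = $pair->{hit};
    # Remember the hit the first time it is seen, then file the HSP under it.
    push @hit_order, $name unless exists $hsps_for{$name};
    push @{ $hsps_for{$name} }, $pair;
}

# Walk the regrouped structure: one line per hit, listing all of its HSPs.
for my $name (@hit_order) {
    my @ranges = map { "$_->{start}-$_->{end}" } @{ $hsps_for{$name} };
    print "${name}: @ranges\n";
}
```

Keeping @hit_order alongside the hash preserves the order in which hits first appear, since Perl hashes do not retain insertion order.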
I would say try using SearchIO, but instead of relying directly on object handler calls to create Hit/HSP objects using an object factory (which is where I think a majority of the speed is lost), build the data internally on the fly using start_element/end_element, then return hashes instead based on the element type triggered using end_element. As an aside, I'm trying to switch the SearchIO::blastxml over to XML::SAX (using XML::SAX::ExpatXS/expat) and plan on switching it over to using hashes at some point, possibly starting off with a different SearchIO plugin module. If you have other suggestions (XML parser of choice, ways to speed up parsing/retrieve data) we would be glad to hear them. Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Robert Buels > Sent: Tuesday, July 18, 2006 7:06 PM > To: bioperl-l at bioperl.org > Subject: [Bioperl-l] bioperl pulls, xml parsers push,and things get > complicated > > Hi all, > > Here's a kind of abstract question about Bioperl and XML parsing: > > I'm thinking about writing a bioperl parser for genomethreader XML, and > I'm sort of mulling over the 'impedence mismatch' between the way > bioperl Bio::*IO::* modules work and the way all of the current XML > parsers work. Bioperl uses a 'pull' model, where every time you want a > new chunk of stuff, you call $io_object->next_thing. All the XML > parsers (including XML::SAX, XML::Parser::PerlSAX and XML::Twig) use a > 'push' model, where every time they parse a chunk, they call _your_ > code, usually via a subroutine reference you've given to the XML parser > when you start it up. > > From what I can tell, current Bioperl IO modules that parse XML are > using push parsers to parse the whole document, holding stuff in memory, > then spoon-feeding it in chunks to the calling program when it calls > next_*(). 
This is fine until the input XML gets really big, in which > case you can quickly run out of memory. > > Does anybody have good ideas for nice, robust ways of writing a bioperl > IO module for really big input XML files? There don't seem to be any > perl pull parsers for XML. All I've dug up so far would be having the > XML push parser running in a different thread or process, pushing chunks > of data into a pipe or similar structure that blocks the progress of the > push parser until the pulling bioperl code wants the next piece of data, > but there are plenty of ugly issues with that, whether one were too use > perl threads for it (aaagh!) or fork and push some kind of intermediate > format through a pipe or socket between the two processes (eek!). > > So, um, if you've read this far, do you have any ideas? > > Rob > > -- > Robert Buels > SGN Bioinformatics Analyst > 252A Emerson Hall, Cornell University > Ithaca, NY 14853 > Tel: 503-889-8539 > rmb32 at cornell.edu > http://www.sgn.cornell.edu > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Wed Jul 19 10:44:30 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Wed, 19 Jul 2006 09:44:30 -0500 Subject: [Bioperl-l] SearchIO HOWTO In-Reply-To: <716af09c0607190436n5fdd5576m23887051aaf95f8e@mail.gmail.com> Message-ID: <002901c6ab41$d7f61350$15327e82@pyrimidine> The information in that table is referring to the BLAST report example before the table itself. However, I can tell you that using that report works (sorry if the text wrapping here mangles the output), so the table information is erroneous. I'll do some updating on that. 
Chris

Here's the script:

use Bio::SearchIO;
use Bio::AlignIO;

my $parser = Bio::SearchIO->new (-file   => shift @ARGV,
                                 -format => 'blast');
my $aln_out = Bio::AlignIO->new(-fh     => \*STDOUT,
                                -format => 'clustalw');

while (my $result = $parser->next_result) {
    while (my $hit = $result->next_hit) {
        while (my $hsp = $hit->next_hsp) {
            $aln_out->write_aln($hsp->get_aln);
        }
    }
}

Output (via STDOUT):

------------------------------------
CLUSTAL W(1.81) multiple sequence alignment


gi|20521485|dbj|AP004641.2/2896-3051    DMGRCSSGCNRYPEPMTPDTMIKLYREKEGLGAYIWMPTPDMSTEGRVQMLP
gb|443893|124775/197-246                DIVQNSSGCNRYPEPMTPDTMIKLYRE-EGL-AYIWMPTPDMSTEGRVQMLP
                                        *: : ********************** *** ********************
------------------------------------

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of Bernd Web
> Sent: Wednesday, July 19, 2006 6:36 AM
> To: bioperl-l at bioperl.org
> Subject: [Bioperl-l] SearchIO HOWTO
>
> Hi,
>
> On http://www.bioperl.org/wiki/HOWTO:SearchIO there is a great HOWTO
> parse your BLAST report.
> In the Table of methods, the third line from the bottom is:
> "HSP alignment Not available in this report Bio::SimpleAlign object "
>
> Would it not be good to add the get_aln method ( $hsp->get_aln) ?
>
> The line in "Using the methods"
> my $alignment_as_string = $alnIO->write_aln($aln);
>
> may be confusing: $alignment_as_string will be "1" on success and the
> alignment is printed to STDIO. Should IO::String be introduced here
> too set up a string filehandle?
> > > Best regards, > Bernd > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Wed Jul 19 10:55:02 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Wed, 19 Jul 2006 09:55:02 -0500 Subject: [Bioperl-l] ListSummaries delay apologies Message-ID: <002a01c6ab43$508aa5a0$15327e82@pyrimidine> Sorry about the delay for the ListSummaries the past couple months; things have been pretty hectic here which has put me really behind on them (it hasn't ever been my top priority, anyway). We're getting papers ready for publication, I am going to a summer institute in a few weeks, and research (as always) is full steam ahead. Just so everybody knows, I haven't given up on them, and plan on getting caught up after I get back from the institute in Connecticut (beginning of August). Cheers! Christopher Fields Postdoctoral Researcher - Switzer Lab Dept. of Biochemistry University of Illinois Urbana-Champaign From hlapp at gmx.net Wed Jul 19 11:31:50 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Wed, 19 Jul 2006 11:31:50 -0400 Subject: [Bioperl-l] Walking multiple bioentries using bioperl-db In-Reply-To: References: Message-ID: <62DA6CBC-CD0E-46A7-A669-71FFC808041B@gmx.net> On Jul 19, 2006, at 9:43 AM, Jay Hannah wrote: > Howdy -- > > I'm using bioperl-db + biosql-schema + mySQL. > > I can now successfully build a biosql-schema instance in mySQL, load > taxonomy, then using bioperl-db load a GenBank file from disk, > committing > the sequences I want. For a given accession number + version + > namespace, > I can tell bioperl-db to delete that from mySQL and it does. Yay!! > I'll be > throwing a "Using bioperl-db" document onto the wiki over the next > week. Excellent! > > What I am currently baffled by: > > How do I ask bioperl-db to walk over multiple bioentries in my > database so > I can do things with them? 
The simplest possible example: print a > list of > all bioentries in my database. > > It is trivially easy to just query mySQL directly, but if I'm > reading / > understanding the documentation correctly bioperl-db intends to be > database schema and RDBMS agnostic. In that case, I should use > bioperl-db > to walk my records. So, how do I do that? Bioperl-db indeed intends to be schema(-variant) and RDBMS agnostic, but that doesn't mean that you have to be as well. If you find it trivially easy to query your database using SQL and DBI and you don't care about being RDBMS or schema-variant agnostic, then by all means don't feel obligated to go through the bioperl-db API for querying. Note you can obtain the DBI database handle being used by a persistence adaptor by calling dbh(): my $dbh = $adaptor->dbh(); (The advantage of this is that you use the same connection, and therefore the same machinery for obtaining connection parameters and building the DSN that the rest of bioperl-db uses. Also, you have the ability to see transactions in progress that have not been committed yet by the adaptor.) What you should not do through SQL directly is modify (UPDATE & DELETE) entities which bioperl-db also holds in a cache (by default, terms and dbxrefs), unless you also take care to clear the cache of the respective adaptor. > > Is Bio::DB::Query::BioQuery the way to do this? The only way? Well, yes, unless you want to use SQL directly (which is not a despised option, see above). > > If so then can someone help me understand the datacollections() and > where() methods? datacollections() in essence corresponds to the FROM clause in a SQL statement, including JOIN statements. '=>' joins two entities in a 1:n relationship, '<=>' joins two entities in an n:n relationship. Instead of the table(s) you give the (Bioperl) objects that are to be joined, and bioperl-db will translate the objects to database entities, i.e., tables. Each object may be followed by an alias. 
The alias makes it easier to refer to the object (entity) in the query constraint part (where()). A single alias following a join expression will always apply to the master object (table). > > perldoc Bio::DB::Query::BioQuery > > # all mouse sequences loaded under namespace ensembl that > # have receptor in their description > $query->datacollections(["Bio::PrimarySeqI e", > "Bio::Species=>Bio::PrimarySeqI sp", > "BioNamespace=>Bio::PrimarySeqI > db"]); This is short for

    $query->datacollections([
        # enumerate the objects we need:
        "Bio::PrimarySeqI e", "Bio::Species sp", "BioNamespace db",
        # specify master-detail relationships
        "Bio::Species=>Bio::PrimarySeqI",
        "BioNamespace=>Bio::PrimarySeqI"]);

because the alias following the join statement applies to the master entity. > $query->where(["sp.binomial like 'Mus *'", > "e.desc like '*receptor*'", > "db.namespace = 'ensembl'"]); The where() method corresponds to the WHERE clause in SQL. The default logical operator between constraints is AND. There is more documentation on the syntax of expressing constraints in Bio::DB::Query::QueryConstraint. The column for which to constrain the value is given as the attribute (method) of the (bioperl) object. If there are multiple objects in the 'datacollections' then you need to qualify each attribute by prefixing it with the object, or the alias assigned in datacollections(), followed by a dot, corresponding to typical OO syntax. > > # all mouse sequences loaded under namespace ensembl that > # have receptor in their description, and that also have a > # cross-reference with SWISS as the database > $query->datacollections(["Bio::PrimarySeqI e", > "Bio::Species=>Bio::PrimarySeqI sp", > "BioNamespace=>Bio::PrimarySeqI db", > "Bio::Annotation::DBLink xref", > > I'm bewildered by this API. Please forgive my ignorance. I understand. This part of the API is by far the one with the skimpiest documentation. 
There are a considerable number of tests in t/query.t which may serve as examples. They also are known to work if their tests don't fail. The tests don't actually execute any query, instead some internal guts are used to test the translation to SQL, so if you know SQL you may be able to understand better what's going on by seeing the object- level query and the SQL-level query side-by-side. > > 1) How do I get *all* bioentries out of my database? Your datacollections would consist of the single object Bio::SeqI (or Bio::PrimarySeqI if you didn't want any annotation), and there would be no query constraint: my $query = Bio::DB::Query::BioQuery->new(-datacollections=> ["Bio::SeqI"]); > > 2) Say I did want just the "namespace" 'Pico' (one of my > biodatabase.name's). Where did > > "BioNamespace=>Bio::PrimarySeqI db"]); > > come from? How was I supposed to figure out the left hand side of that > mapping? The right hand side? If that line wasn't sitting in that > document > was there a way for me to figure it out as a *user* of bioperl-db? You would not know from Bioperl itself. The right hand side is a Bioperl class. The left hand side is a kludge because Bioperl does not have a namespace class, instead objects that have a namespace implement the Bio::IdentifiableI interface directly. This kind of one class mapping to two database entities (biodatabase is a table separate from, in fact a master for, bioentry) is extremely cumbersome to express in a generic way, so I chose to create a Bio::DB::Persistent::BioNamespace class to represent that for the purpose of queries. > Or would I need to be a *programmer* of bioperl-db reading source > to figure > this out? Where did > > "db.namespace = 'ensembl'"]); > > come from? Again, do I have to read source code to know how to invoke > that magic? 
Well, I'm not sure even reading the source code clears it all up ;) As I said before, the part before the dot is the alias or object, the part after is the attribute (or method) to be constrained. > > Sorry if I sound like a jerk. That is not my intention. Hopefully I > can > document the answers for future bioperl-db'ers. No problem, that's fine - and whatever you would be willing to contribute to documentation would be highly appreciated. -hilmar > > Thanks in advance, > > j > my current plaything: http://openlab.jays.net > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From aaron.j.mackey at gsk.com Wed Jul 19 09:48:55 2006 From: aaron.j.mackey at gsk.com (aaron.j.mackey at gsk.com) Date: Wed, 19 Jul 2006 09:48:55 -0400 Subject: [Bioperl-l] bioperl pulls, xml parsers push, and things get complicated In-Reply-To: <44BD776A.1080402@cornell.edu> Message-ID: There are 3rd generation XML "Pull" parsers (also called "StAX" for Streaming API for XML), but they seem to still be stuck in Java land (e.g. "MXP1") You could probably use POE to setup a state machine that used XML::Twig to "push" units of XML content onto a stack, to be read by your "next_*" pull method (where the XML::Twig push "stalled" until the "next_*" method was called, and vice versa). -Aaron bioperl-l-bounces at lists.open-bio.org wrote on 07/18/2006 08:06:02 PM: > Hi all, > > Here's a kind of abstract question about Bioperl and XML parsing: > > I'm thinking about writing a bioperl parser for genomethreader XML, and > I'm sort of mulling over the 'impedence mismatch' between the way > bioperl Bio::*IO::* modules work and the way all of the current XML > parsers work. 
Bioperl uses a 'pull' model, where every time you want a > new chunk of stuff, you call $io_object->next_thing. All the XML > parsers (including XML::SAX, XML::Parser::PerlSAX and XML::Twig) use a > 'push' model, where every time they parse a chunk, they call _your_ > code, usually via a subroutine reference you've given to the XML parser > when you start it up. > > From what I can tell, current Bioperl IO modules that parse XML are > using push parsers to parse the whole document, holding stuff in memory, > then spoon-feeding it in chunks to the calling program when it calls > next_*(). This is fine until the input XML gets really big, in which > case you can quickly run out of memory. > > Does anybody have good ideas for nice, robust ways of writing a bioperl > IO module for really big input XML files? There don't seem to be any > perl pull parsers for XML. All I've dug up so far would be having the > XML push parser running in a different thread or process, pushing chunks > of data into a pipe or similar structure that blocks the progress of the > push parser until the pulling bioperl code wants the next piece of data, > but there are plenty of ugly issues with that, whether one were too use > perl threads for it (aaagh!) or fork and push some kind of intermediate > format through a pipe or socket between the two processes (eek!). > > So, um, if you've read this far, do you have any ideas? 
> > Rob > > -- > Robert Buels > SGN Bioinformatics Analyst > 252A Emerson Hall, Cornell University > Ithaca, NY 14853 > Tel: 503-889-8539 > rmb32 at cornell.edu > http://www.sgn.cornell.edu > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > From arareko at campus.iztacala.unam.mx Wed Jul 19 12:20:21 2006 From: arareko at campus.iztacala.unam.mx (Mauricio Herrera Cuadra) Date: Wed, 19 Jul 2006 11:20:21 -0500 Subject: [Bioperl-l] bioperl pulls, xml parsers push, and things get complicated In-Reply-To: <002801c6ab40$7cfcd980$15327e82@pyrimidine> References: <002801c6ab40$7cfcd980$15327e82@pyrimidine> Message-ID: <44BE5BC5.5040006@campus.iztacala.unam.mx> There are a lot of different XML processing strategies. Most fall into two categories: stream-based and tree-based. With the stream-based strategy, the parser continuously alerts a program to patterns in the XML. The parser functions like a pipeline, taking XML markup on one end and pumping out processed nuggets of data to your program. With the tree-based strategy, the parser keeps the data to itself until the very end, when it presents a complete model of the document to your program. The whole point of this strategy is that your program can pull out any data it needs, in any order. Most of the time I use tree-based strategies because they place all of the data into a structure which lets me access any internal node using array/hash references. The simplest parser for this is XML::Simple using XML::Parser as the 'preferred parser' (which is built on top of XML::Parser::Expat, which is a wrapper around the expat library). More advanced parsers (both stream and tree-based) are: * XML::LibXML (a wrapper for libxml2's C library) * XML::Grove (takes a tree and changes it into an object hierarchy. 
  Each node type is represented by a different class)
* XML::PYX (for repackaging XML as a stream of easily recognizable and transmutable symbols)
* XML::SimpleObject (changes a hierarchy of lists into a hierarchy of objects)
* XML::XPath (for writing expressions that pinpoint specific pieces of documents)

There are also some standards-based solutions like:

* XML::SAX (Simple API for XML) for event streams.
* XML::DOM (Document Object Model) for tree processing.

Your strategy of choice depends a lot on the type of XML files you want to parse. Understanding the structure of the files and deciding which data you want to extract from them is a fundamental step in choosing the appropriate method/parser.

Just my 2 cents :)

Regards,
Mauricio.

-- 
MAURICIO HERRERA CUADRA
arareko at campus.iztacala.unam.mx
Laboratorio de Genética
Unidad de Morfofisiología y Función
Facultad de Estudios Superiores Iztacala, UNAM

From cjfields at uiuc.edu Wed Jul 19 14:45:55 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Wed, 19 Jul 2006 13:45:55 -0500
Subject: [Bioperl-l] bioperl pulls, xml parsers push, and things get complicated
In-Reply-To: <44BE5BC5.5040006@campus.iztacala.unam.mx>
Message-ID: <000301c6ab63$91d31680$15327e82@pyrimidine>

Yeah, we use XML::SAX, with XML::SAX::ExpatXS and expat, for SearchIO::blastxml. It previously used XML::Parser::PerlSAX but that didn't support SAX2-based parsing. XML::Twig is also used quite a bit.

Jason added his thoughts about this to the wiki:

http://www.bioperl.org/wiki/XML_parsers

Personally, I use XML::Simple with EUtilities because the XML returned is remarkably simple and normally fairly short. The trick is making sure, when parsing data, to dereference everything properly, since XML::Simple stores everything in an elaborate data structure. I plan on switching to XML::SAX::ExpatXS or XML::Twig soon.

Chris

From rmb32 at cornell.edu Wed Jul 19 15:30:28 2006
From: rmb32 at cornell.edu (Robert Buels)
Date: Wed, 19 Jul 2006 12:30:28 -0700
Subject: [Bioperl-l] bioperl pulls, xml parsers push, and things get complicated
In-Reply-To:
References:
Message-ID: <44BE8854.8010301@cornell.edu>

POE is a really neat thing; I didn't know about it before. Something tells me, however, that I would have trouble convincing people to install POE as a dependency for a genomethreader output parser.
;-) I hope I'll have the opportunity to use it sometime.

For the curious, here's a nice intro to POE:
http://perl.com/pub/a/2001/01/poe.html
And the POE main site:
http://poe.perl.org/

Rob

aaron.j.mackey at GSK.COM wrote:
> There are 3rd generation XML "Pull" parsers (also called "StAX" for
> Streaming API for XML), but they seem to still be stuck in Java land (e.g.
> "MXP1")
>
> You could probably use POE to set up a state machine that used XML::Twig to
> "push" units of XML content onto a stack, to be read by your "next_*" pull
> method (where the XML::Twig push "stalled" until the "next_*" method was
> called, and vice versa).
>
> -Aaron

-- 
Robert Buels
SGN Bioinformatics Analyst
252A Emerson Hall, Cornell University
Ithaca, NY 14853
Tel: 503-889-8539
rmb32 at cornell.edu
http://www.sgn.cornell.edu

From dwaner at scitegic.com Wed Jul 19 15:47:58 2006
From: dwaner at scitegic.com (dwaner at scitegic.com)
Date: Wed, 19 Jul 2006 12:47:58 -0700
Subject: [Bioperl-l] EMBL release 87 format changes.
Message-ID:

BioPerl Users and Developers,

I have updated the EMBL SeqIO parser to work correctly with Release 87 of EMBL (June 19th, 2006). As suggested by Chris Fields in an earlier message, the EMBL parser now reads both new and old formats, but only writes the new format.

I don't think that my changes will affect most users, but if you are using the EMBL format, can you review the changes described below and speak up if anything looks like it could create a problem for you?

If I don't hear any objections soon, I will submit a patch to bugzilla.

Thanks,

- David

Parser changes:

- EMBL files no longer contain the "entry name". When reading old format files, the EMBL "entry name" from the ID line is used as the Bio::Seq::id and Bio::Seq::display_id, but when reading new format files, the accession number is used for these fields.

Changes to output:

- The ID line was changed to the new format.
- The SV line is never written; SV is now part of the ID line.
- "DNA" and "RNA" are no longer valid EMBL molecule types. They are now written as "unassigned DNA" and "unassigned RNA".
- Strictly speaking, EMBL format should only be used for nucleotide sequences. If the alphabet is 'protein', write_seq() emits a warning and writes the non-standard molecule type "AA" in the ID line.
- Because BioPerl sequences do not have a "data class" attribute, all sequences are written with a data class of "STD" in the ID line.
- The ID line contains the Bio::Seq::accession, unless it is missing, in which case the Bio::Seq::id is used.
- Molecule type is strictly validated. Non-EMBL values are output as "unassigned DNA" or "unassigned RNA", depending on the sequence alphabet.
- "Taxonomic division" is strictly validated. Non-EMBL values are output as "UNC".
- The taxonomic division code "UNK" is now written as "UNC" (unclassified).

Possible gotchas for some users:

- Because the EMBL entry name is no longer included anywhere in the file, when round-tripping from old format to new format the entry name will be lost.
- In order to ensure that BioPerl writes valid EMBL files, I have added strict validation to the writer for "molecule type" and "taxonomic division". This could present a problem for users who are using non-standard values for these fields, but I felt it was important to write files that adhere to the EMBL spec.
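The old-versus-new ID line handling described above can be sketched in a few lines of core Perl. This is only an illustration, not the patch itself and not the Bio::SeqIO::embl code: the helper name parse_embl_id_line, the hash keys, and the two sample ID lines are assumptions made for the example, and treating an old-layout line without "circular" as linear is likewise an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration only; NOT the Bio::SeqIO::embl code.
# Returns a hash ref describing the ID line, or undef if neither layout matches.
sub parse_embl_id_line {
    my ($line) = @_;

    # New layout (release 87 onwards): accession and version on the ID line,
    # no entry name, e.g.
    #   ID   X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP.
    if ($line =~ /^ID\s+([^;\s]+);\s+SV\s+(\d+);\s+(\w+);\s+([^;]+);\s+(\w+);\s+(\w+);\s+(\d+)\s+BP\./) {
        return {
            layout     => 'new',
            accession  => $1,
            version    => $2,
            topology   => $3,
            molecule   => $4,
            data_class => $5,
            division   => $6,
            length     => $7,
        };
    }

    # Old layout: entry name first, optional "circular" before the molecule, e.g.
    #   ID   TRBG361    standard; mRNA; PLN; 1859 BP.
    if ($line =~ /^ID\s+(\S+)\s+(\w+);\s+(?:(circular)\s+)?([^;]+);\s+(\w+);\s+(\d+)\s+BP\./) {
        return {
            layout     => 'old',
            entry_name => $1,
            data_class => $2,
            topology   => defined $3 ? 'circular' : 'linear',   # assumed default
            molecule   => $4,
            division   => $5,
            length     => $6,
        };
    }

    return;
}

# Pick the identifier the way the message describes: entry name for old
# files, accession number for new ones.
for my $line (
    'ID   X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP.',
    'ID   TRBG361    standard; mRNA; PLN; 1859 BP.',
) {
    my $id = parse_embl_id_line($line) or next;
    my $name = $id->{layout} eq 'new' ? $id->{accession} : $id->{entry_name};
    print "$id->{layout} layout: id=$name, $id->{length} BP\n";
}
```

Trying the new layout first keeps the two patterns from competing, since only new-style lines carry the "SV" field.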
From slenk at emich.edu Wed Jul 19 16:04:16 2006
From: slenk at emich.edu (Stephen Gordon Lenk)
Date: Wed, 19 Jul 2006 16:04:16 -0400
Subject: [Bioperl-l] bioperl pulls, xml parsers push, and things get complicated
Message-ID: <13edac5b13ed8208.13ed820813edac5b@emich.edu>

Hi,

I have found that POE fails to execute a periodic task after 32 iterations in a Perl thread; the failure is consistent on both XP and OS X. If I knew how to write up a defect for Perl I would do so (hint: how is this done? I'm *not* asking for an RTFM), and I am probably remiss for not having done it. I was going to write messages to a Controller Area Network (CAN) to control automotive widgets from Perl, but I wound up using a C executable (piped to from Perl) with its own threads to do this. Oh yes, I believe that bio lab systems can be done this way as well.

But ... POE is really neat if you think in state machine terms. I have an alternate architecture for my test harness (Perlizer) that would use POE to run tests with CAN and GPIB.

Steve Lenk

From cjfields at uiuc.edu Wed Jul 19 17:46:43 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Wed, 19 Jul 2006 16:46:43 -0500
Subject: [Bioperl-l] EMBL release 87 format changes.
In-Reply-To:
Message-ID: <000601c6ab7c$d39d8cd0$15327e82@pyrimidine>

You can go ahead and submit the patch to Bugzilla anyway. Comments about the proposed changes from the developers can be added there.

I think there's some confusion here, though: the EMBL SeqIO change you mentioned I committed is actually for Bio::SeqIO::swiss (SwissProt). I haven't touched Bio::SeqIO::embl (yet).
'swiss' format now reads old and new swiss data files and writes only new format; no major changes have been made to SeqIO::embl in about a year (and even that was a small one).

Chris

From stewarta at nmrc.navy.mil Wed Jul 19 18:00:26 2006
From: stewarta at nmrc.navy.mil (Andrew Stewart)
Date: Wed, 19 Jul 2006 18:00:26 -0400
Subject: [Bioperl-l] #bioperl
Message-ID:

Wandering about the new bioperl.org page, I noticed that there's never really been much mention of starting up a bioperl chat channel on IRC for casual bioperl discussion and support. This has worked really well for projects like MediaWiki, etc. I'll sit on the channel for a while and maybe we can see if the idea picks up.

Point your favorite IRC client to... (for Windows users I would suggest mIRC, for Mac I would suggest Colloquy)

server: irc.freenode.net
channel: #bioperl

Hope to see you there.

-- 
Andrew Stewart
Research Assistant, Genomics Team
Navy Medical Research Center (NMRC)
Biological Defense Research Directorate (BDRD)
BDRD Annex
12300 Washington Avenue, 2nd Floor
Rockville, MD 20852
email: stewarta at nmrc.navy.mil
phone: 301-231-6700 Ext 270

From rmb32 at cornell.edu Wed Jul 19 18:40:52 2006
From: rmb32 at cornell.edu (Robert Buels)
Date: Wed, 19 Jul 2006 15:40:52 -0700
Subject: [Bioperl-l] bioperl pulls, xml parsers push, and things get complicated
In-Reply-To: <002801c6ab40$7cfcd980$15327e82@pyrimidine>
References: <002801c6ab40$7cfcd980$15327e82@pyrimidine>
Message-ID: <44BEB4F4.1060407@cornell.edu>

Hi Chris,

It seems to me the SearchIO framework isn't really appropriate for genomethreader, since it's more of a gene prediction program than a search/alignment program. Also, w.r.t. XML parsing and buffering, I don't see how Bio::SearchIO is fundamentally different from the other bioperl IO systems: it still has a next_this(), next_that() interface, which means lots of buffering memory if you're doing your actual parsing with a push parser (or a tree parser, of course, which is buffering an expanded form of the entire document). It looks like it just adds another layer of method calls for parser events, allowing the SearchIO to make different kinds of objects and stuff. It looks like none of this changes the fact that these are all push parsers, and bioperl pulls, so you have to buffer a lot of stuff.

I guess the only really general strategies for reducing the buffering are (a) to break up the XML with regexps and such, like Hilmar said, or (b) to put your push parser in another process, and somehow keep it blocking in one of its callbacks until you're ready for its next data.

I think what I'll do with the gthxml parser is find a way to split the input XML into chunks and run a parser separately on each, like Hilmar said. If more performance is needed, maybe a multi-process approach would be appropriate, but not yet.
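A minimal sketch of the chunk-splitting idea, using only core Perl. It is not genomethreader-specific: the helper name make_chunk_iterator and the element names "report", "hit" and "hsp" are made up for the example. It also assumes the chunk element never nests inside itself and that its closing tag never appears inside a comment or CDATA section, which a regexp-based splitter cannot check.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a closure that yields one <$tag>...</$tag>
# chunk per call, so only a single chunk is held in memory at a time.
sub make_chunk_iterator {
    my ($fh, $tag) = @_;
    my $close = "</$tag>";

    return sub {
        local $/ = $close;    # read up to and including the next closing tag
        while (defined(my $buf = <$fh>)) {
            # Drop whatever precedes the opening tag (XML declaration,
            # wrapper element, whitespace between chunks).
            if ($buf =~ /(<\Q$tag\E[\s>].*\Q$close\E)/s) {
                return $1;
            }
        }
        return;               # input exhausted
    };
}

# Demo on an in-memory filehandle; a real module would pass the report file
# and hand each chunk to the XML parser of its choice.
my $xml = <<'END_XML';
<?xml version="1.0"?>
<report>
  <hit id="1"><hsp/></hit>
  <hit id="2"><hsp/><hsp/></hit>
</report>
END_XML

open my $fh, '<', \$xml or die "cannot open in-memory handle: $!";
my $next_chunk = make_chunk_iterator($fh, 'hit');
while (defined(my $chunk = $next_chunk->())) {
    print length($chunk), " bytes: $chunk\n";
}
```

Because the closure does its own reading, a next_*() method could call it once per request, which gives pull behaviour on top of whatever push or tree parser handles each chunk.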
Anyway, looking at blastxml, I have some ruminations, which fill the rest of this email:

Looking at SearchIO::blastxml, it looks like it's already using XML::SAX, which will use XML::SAX::ExpatXS if installed. Is that recent? Is blastxml faster when using the tempfile option than when putting the whole report in a string in memory?

If you're looking for speed gains, have you tried running some kind of profiling on it? Whenever one is out to optimize code, profiling should be stop number one. Almost every time, you will be surprised at what parts of the code are actually eating up the most time. Here's a perl profiling intro: http://perl.com/pub/a/2004/06/25/profiling.html . The profiling mechanism talked about in that article is kind of old; there are also a bunch of newer code profiling tools available on CPAN. I haven't used any of them, though. But yeah, I can't emphasize enough the importance of profiling if you're trying to optimize for speed.

As for memory, the blastxml parser suffers from the same handicap I was pondering at the start of this thread. To see what I mean, think of what would happen if there were somehow 10 million HSPs in one of the reports. It's buffering all of them before returning each result, and your machine could melt. :-) Things would be beautiful (and fast, probably) if next_hsp() would actually parse the next HSP in the report instead of just returning an HSP object that's sitting in memory. But there's not really anything that can be done about that, I don't think.

One nice thing: the blastxml parser's memory footprint doesn't really suffer if you have 100,000 blast reports in your input file, because it splits out the reports and parses each one individually. This I think is a good illustration of what Hilmar was talking about: breaking the input XML into chunks cuts down on the amount of buffering you have to do.
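For comparison, the separate-process idea raised earlier in the thread can be sketched with a plain fork and pipe from core Perl. This is a toy under stated assumptions, not a working SearchIO module: make_pull_iterator is a made-up name, the child just prints a fixed list where a real one would print from inside push-parser callbacks, and each record is assumed to fit on one line with no embedded newlines. It also relies on a real fork(), so it is aimed at Unix-like systems.

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical sketch: the child plays the push parser and writes records
# into a pipe; the parent pulls one record per call.
sub make_pull_iterator {
    my (@records) = @_;    # stand-in for what parser callbacks would emit

    pipe(my $reader, my $writer) or die "pipe failed: $!";
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ($pid == 0) {
        # Child: in a real module these prints would sit in parser callbacks.
        close $reader;
        print {$writer} "$_\n" for @records;
        close $writer;     # flushes anything still buffered
        POSIX::_exit(0);   # leave without running the parent's END blocks
    }

    close $writer;         # parent keeps only the read end
    my $done = 0;
    return sub {
        return if $done;
        my $line = <$reader>;
        if (!defined $line) {      # child closed the pipe: clean up
            $done = 1;
            close $reader;
            waitpid($pid, 0);
            return;
        }
        chomp $line;
        return $line;
    };
}

my $next_record = make_pull_iterator(map { "hsp $_" } 1 .. 3);
while (defined(my $rec = $next_record->())) {
    print "pulled: $rec\n";
}
```

The point of the pipe is that the child blocks in print once the operating system's pipe buffer is full, so the amount of unread data stays bounded no matter how large the report is. The cost is an intermediate serialization format and process management, which is the "eek!" part.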
As XML parsers go, I kind of like XML::Twig, because it manages to combine most of the easy use of a DOM/tree parser with the better memory usage and speed of a push parser (like SAX and XML::Parser). Within a parser callback, you have a DOM-like tree that's just the part of your XML document you're interested in at that time, and then you free that structure when you're done picking things out of it. I'm not sure how fast it is, though; probably not as fast as ExpatXS. At any rate, it is definitely a lot more intuitive to use than a more standard push parser, since if you make good choices about what elements to use as the roots of your twigs, you can often do your processing on a self-contained chunk and not have to keep track of a bunch of parse state like you typically need with a straight push parser like XML::Parser or a SAX parser.

Rob

Chris Fields wrote:
> The Bio::SearchIO modules are supposed to work like a SAX parser, where results
> are returned as the report is parsed b/c of the occurrence of specific
> 'events' (start_element, end_element, and so on). However, the actual
> behaviour for each module changes depending on the report type and the
> author's intention.
>
> There was a thread about a month ago on HMMPFAM report parsing where there
> was some contention as to how to build hits(models)/HSPs(domains). HMMPFAM
> output has one HSP per hit and is sorted on the sequence length so a
> particular hit can appear more than once, depending on how many times it
> hits along the sequence length itself. So, to gather all the HSPs together
> under one hit you would have to parse the entire report and build up a
> Hit/HSP tree, then use the next_hit/next_hsp iterators to parse through
> everything. Currently it just reports Hit/HSP pairs and it is up to the
> user to build that tree.
>
> In contrast, BLAST output should be capable of throwing hit/HSP clusters on
> the fly based on the report output, but is quite slow (even the XML output
> crawls).
Jason thinks it's b/c of object inheritance and instantiation; I > think it's probably more complicated than that (there are a ton of method > calls which tend to slow things down quite a bit as well). > > I would say try using SearchIO, but instead of relying directly on object > handler calls to create Hit/HSP objects using an object factory (which is > where I think a majority of the speed is lost), build the data internally on > the fly using start_element/end_element, then return hashes instead based on > the element type triggered using end_element. > > As an aside, I'm trying to switch the SearchIO::blastxml over to XML::SAX > (using XML::SAX::ExpatXS/expat) and plan on switching it over to using > hashes at some point, possibly starting off with a different SearchIO plugin > module. If you have other suggestions (XML parser of choice, ways to speed > up parsing/retrieve data) we would be glad to hear them. > > Chris > > > > >> -----Original Message----- >> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- >> bounces at lists.open-bio.org] On Behalf Of Robert Buels >> Sent: Tuesday, July 18, 2006 7:06 PM >> To: bioperl-l at bioperl.org >> Subject: [Bioperl-l] bioperl pulls, xml parsers push,and things get >> complicated >> >> Hi all, >> >> Here's a kind of abstract question about Bioperl and XML parsing: >> >> I'm thinking about writing a bioperl parser for genomethreader XML, and >> I'm sort of mulling over the 'impedance mismatch' between the way >> bioperl Bio::*IO::* modules work and the way all of the current XML >> parsers work. Bioperl uses a 'pull' model, where every time you want a >> new chunk of stuff, you call $io_object->next_thing. All the XML >> parsers (including XML::SAX, XML::Parser::PerlSAX and XML::Twig) use a >> 'push' model, where every time they parse a chunk, they call _your_ >> code, usually via a subroutine reference you've given to the XML parser >> when you start it up. 
>> >> From what I can tell, current Bioperl IO modules that parse XML are >> using push parsers to parse the whole document, holding stuff in memory, >> then spoon-feeding it in chunks to the calling program when it calls >> next_*(). This is fine until the input XML gets really big, in which >> case you can quickly run out of memory. >> >> Does anybody have good ideas for nice, robust ways of writing a bioperl >> IO module for really big input XML files? There don't seem to be any >> perl pull parsers for XML. All I've dug up so far would be having the >> XML push parser running in a different thread or process, pushing chunks >> of data into a pipe or similar structure that blocks the progress of the >> push parser until the pulling bioperl code wants the next piece of data, >> but there are plenty of ugly issues with that, whether one were to use >> perl threads for it (aaagh!) or fork and push some kind of intermediate >> format through a pipe or socket between the two processes (eek!). >> >> So, um, if you've read this far, do you have any ideas? 
>> >> Rob >> >> -- >> Robert Buels >> SGN Bioinformatics Analyst >> 252A Emerson Hall, Cornell University >> Ithaca, NY 14853 >> Tel: 503-889-8539 >> rmb32 at cornell.edu >> http://www.sgn.cornell.edu >> >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> > > -- Robert Buels SGN Bioinformatics Analyst 252A Emerson Hall, Cornell University Ithaca, NY 14853 Tel: 503-889-8539 rmb32 at cornell.edu http://www.sgn.cornell.edu From skirov at utk.edu Wed Jul 19 17:54:03 2006 From: skirov at utk.edu (Stefan Kirov) Date: Wed, 19 Jul 2006 17:54:03 -0400 Subject: [Bioperl-l] Accessing TRANSFAC matrices In-Reply-To: <20060717145000.0aig7ymcmurk4wsk@webmail.embl.de> References: <20060717145000.0aig7ymcmurk4wsk@webmail.embl.de> Message-ID: <44BEA9FB.1070009@utk.edu> I have nothing to do with TFBS (except for using it). I suggest you contact Boris Lenhard who is behind TFBS. Please also send bioperl questions to the list. Finally, I believe TRANSFAC does not distribute the data files anymore. However, if you find out this is not the case, please let me know. Stefan ong at embl.de wrote: >HI , > > Good day, i am trying to retrieve TRANSFAC matrices via TFBS Perl module, but >it happens that about 50 matrices are missing after M00359 do you have any idea? >Also i wish to try using the Bio::Matrix::PSM::IO object, but can you advise how >do i get the matrix.dat which is a transfac file? > > Tahnks and hear for you soon. 
> >REgards, >Ong > > From bix at sendu.me.uk Thu Jul 20 02:49:45 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 20 Jul 2006 07:49:45 +0100 Subject: [Bioperl-l] Accessing TRANSFAC matrices In-Reply-To: <44BEA9FB.1070009@utk.edu> References: <20060717145000.0aig7ymcmurk4wsk@webmail.embl.de> <44BEA9FB.1070009@utk.edu> Message-ID: <44BF2789.1090204@sendu.me.uk> Stefan Kirov wrote: > Finally, I believe TRANSFAC does not distribute the data files anymore. > However, if you find out this is not the case, please let me know. They get distributed as Transfac 'Pro', for which you need a license (money). > ong at embl.de wrote: >> good day, i am trying to retrieve TRANSFAC matrices via TFBS Perl module, but >> it happens that about 50 matrices are missing after M00359 do you have any idea? What is meant by this? Missing from where? At the least, M00360 is accessible via the website (public database). >> Also i wish to try using the Bio::Matrix::PSM::IO object, but can you advise how >> do i get the matrix.dat which is a transfac file? http://www.biobase-international.com/pages/index.php?id=174 From dhoworth at mrc-lmb.cam.ac.uk Thu Jul 20 05:19:22 2006 From: dhoworth at mrc-lmb.cam.ac.uk (Dave Howorth) Date: Thu, 20 Jul 2006 10:19:22 +0100 Subject: [Bioperl-l] bioperl pulls, xml parsers push, and things get complicated In-Reply-To: <13edac5b13ed8208.13ed820813edac5b@emich.edu> References: <13edac5b13ed8208.13ed820813edac5b@emich.edu> Message-ID: <44BF4A9A.60100@mrc-lmb.cam.ac.uk> Stephen Gordon Lenk wrote: > I have found that POE fails to execute a periodic task after 32 > iterations in a Perl thread, consistent failure on both XP and OSX - > if I knew how to write up a defect for Perl I would do this (hint ? > how is this done - I'm *not* asking RTFM etc) Generally: Go to http://search.cpan.org and search for the module (POE). Click on the distribution link, rather than the doc link (i.e. POE-0.3502, which takes you to http://search.cpan.org/~rcaputo/POE-0.3502/). 
Click on the View/Report Bugs link. Check through the existing bugs and if it's not there click on the Report a new bug link. Cheers, Dave From georg.otto at tuebingen.mpg.de Thu Jul 20 06:53:53 2006 From: georg.otto at tuebingen.mpg.de (Georg Otto) Date: Thu, 20 Jul 2006 12:53:53 +0200 Subject: [Bioperl-l] Features in SeqIO GenBank output Message-ID: Hi, this is probably a FAQ but I could not find anything to solve it. I want to get sequences from GenBank and save them in GenBank format. This works with the script shown below, but the "Features" part is missing and contains references instead (see below). How can I print out the complete GenBank entry? I am running Bioperl 1.5, Perl 5.8.6, Mac 10.4.7 Best, Georg Here is my script: use strict; use warnings; use Bio::Seq; use Bio::SeqIO; use Bio::DB::GenBank; my $acc = 'AB017118'; my $db_obj = Bio::DB::GenBank->new(); my $seq_obj = $db_obj-> get_Seq_by_acc($acc); my $out = Bio::SeqIO->new(-format => 'genbank', -file => '>output.gb'); $out->write_seq($seq_obj); Here is the output: LOCUS AB017118 2038 bp mRNA linear VRT 06-JUN-2006 DEFINITION Danio rerio mRNA for ornithine decarboxylase antizyme long isoform, complete cds. ACCESSION AB017118 VERSION AB017118.1 GI:4239978 KEYWORDS . SOURCE Danio rerio (zebrafish) ORGANISM Danio rerio Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Actinopterygii; Neopterygii; Teleostei; Ostariophysi; Cypriniformes; Cyprinidae; Danio. REFERENCE 1 AUTHORS Saito,T., Hascilowicz,T., Ohkido,I., Kikuchi,Y., Okamoto,H., Hayashi,S., Murakami,Y. and Matsufuji,S. TITLE Two zebrafish (Danio rerio) antizymes with different expression and activities JOURNAL Biochem. J. 345 PT 1, 99-106 (2000) PUBMED 10600644 REFERENCE 2 (bases 1 to 2038) AUTHORS Matsufuji,S. and Saito,T. 
TITLE Direct Submission JOURNAL Submitted (23-AUG-1998) Senya Matsufuji, Jikei University School of Medicine, Department of Biochemistry II; 3-25-8 Nishishinbashi, Minato-ku, Tokyo 105-8461, Japan (E-mail:senya at jikei.ac.jp, Tel:+81-3-3433-1111(ex.2276), Fax:+81-3-3436-3897) FEATURES Location/Qualifiers source 1..2038 /db_xref="Bio::Annotation::SimpleValue=HASH(0x19b9a28)" /mol_type="Bio::Annotation::SimpleValue=HASH(0x19b9b6c)" /dev_stage="Bio::Annotation::SimpleValue=HASH(0x19b9bb4)" /organism="Bio::Annotation::SimpleValue=HASH(0x19bfe18)" /clone_lib="Bio::Annotation::SimpleValue=HASH(0x19bfe60)" CDS join(45..224,226..702) /db_xref="Bio::Annotation::SimpleValue=HASH(0x19c0960)" /ribosomal_slippage="Bio::Annotation::SimpleValue=HASH(0x1 9beecc)" /codon_start=Bio::Annotation::SimpleValue=HASH(0x19bef14) /protein_id="Bio::Annotation::SimpleValue=HASH(0x19bef5c)" /translation="Bio::Annotation::SimpleValue=HASH(0x19befa4) " /product="Bio::Annotation::SimpleValue=HASH(0x19befec)" /note="Bio::Annotation::SimpleValue=HASH(0x19bf034)" CDS 45..227 /db_xref="Bio::Annotation::SimpleValue=HASH(0x19bee24)" /codon_start=Bio::Annotation::SimpleValue=HASH(0x19bf160) /protein_id="Bio::Annotation::SimpleValue=HASH(0x19bf1cc)" /translation="Bio::Annotation::SimpleValue=HASH(0x19c1830) " /note="Bio::Annotation::SimpleValue=HASH(0x19c1878)" polyA_signal 2017..2022 polyA_site 2038 /note="Bio::Annotation::SimpleValue=HASH(0x19bffc8)" BASE COUNT 439 a 377 c 532 g 690 t ORIGIN 1 cagcagccga gcgcacaggc cgccgtgaaa cctcccgagg ccggatggta aaatccaacc 1981 ttatcctcta tagtggtaca cctttgcttc tgtcataata aaaccattat ttaaagac // From cjfields at uiuc.edu Thu Jul 20 08:43:08 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 20 Jul 2006 07:43:08 -0500 Subject: [Bioperl-l] Features in SeqIO GenBank output In-Reply-To: References: Message-ID: <73C89D17-91FE-47E4-80C1-AA6A689FA14E@uiuc.edu> I'll give it a look. You might try upgrading to Bioperl 1.5.1 to see if this was fixed. 
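For what it's worth, text like Bio::Annotation::SimpleValue=HASH(0x19b9a28) is what Perl prints when an object is interpolated as a string instead of being asked for its value. A minimal sketch in plain Perl follows; My::SimpleValue is a hypothetical stand-in for illustration, not the real BioPerl class, and I am only assuming this is the kind of thing that went wrong in the 1.5 writer:

```perl
use strict;
use warnings;

# Hypothetical stand-in for an annotation object; NOT the BioPerl class.
{
    package My::SimpleValue;
    sub new   { my ($class, $value) = @_; return bless { value => $value }, $class }
    sub value { return $_[0]{value} }
}

my $tag = My::SimpleValue->new('Danio rerio');

# Interpolating the object itself prints its class name and address...
my $wrong = qq{/organism="$tag"};
# ...whereas asking the object for its value prints what a reader expects.
my $right = sprintf '/organism="%s"', $tag->value;

print "$wrong\n$right\n";
```

If you see this pattern in output, the data are usually intact and only the code doing the printing needs fixing, which fits with an upgrade being a plausible cure.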
Chris On Jul 20, 2006, at 5:53 AM, Georg Otto wrote: > > Hi, > > this is probably a FAQ but I could not find anything to solve it. > > I want to get sequences from GenBank and save them in GenBank > format. This works with the script shown below, but the "Features" > part is missing and contains references instead (see below). How can I > print out the complete GenBank entry? > > I am running Bioperl 1.5, Perl 5.8.6, Mac 10.4.7 > > Best, > > Georg > > [...] Christopher Fields Postdoctoral Researcher Lab of Dr.
Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From bix at sendu.me.uk Thu Jul 20 09:35:43 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 20 Jul 2006 14:35:43 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BBBB69.6000906@sendu.me.uk> References: <44BBBB69.6000906@sendu.me.uk> Message-ID: <44BF86AF.8080408@sendu.me.uk> Sendu Bala wrote: > node 2 has name 'Bacteria ' and rank 'superkingdom' > node 1386 has name 'Bacillus ' and rank 'genus' > node 7776 has name 'Gnathostomata ' and rank 'superclass' > etc. > > For me the bits in <> are inappropriate and shouldn't be there. > [...] > If there are no objections I'll strip the <> bits. I also plan to make > $node->name('scientific', 'sapiens'); set and get the node name, and > have flatfile and entrez store all common names with > $obj->name('common', 'human', 'man');. I'll describe all the changes I've now made and if no-one complains I'll commit. (I've also made these notes into bug 2047 for easier reference in the future.) Bio::DB::Taxonomy::flatfile --------------------------- # Bug-fixes Removed invalid requirement that all species nodes have at least 7 named-rank parents. The names->id solution used by get_taxonid() only stored that last id associated with a name. However the name used wasn't necessarily unique, such that multiple ids could match. names->id solution now remembers all ids that match a name. API-CHANGE: for this reason I've renamed get_taxonid() to get_taxonids() and it returns an array of ids in list context. For backward compatibility it returns one of the ids in scalar context, and *get_taxonid = \&get_taxonids. Added missing division ENV 'Environmental samples'. # Improvements Like Bio::DB::Taxonomy::entrez, flatfile now retrieves and stores the common names, genetic code and mitochondrial genetic code in each node it makes. 
NOTE: entrez also stores creation, publication and update dates, but this data is not available in the taxdump from NCBI ftp site. NOTE: the common names are stored in no particular order; the genbank common name in particular isn't necessarily the first in the list (cf. old entrez.pm behaviour). BEHAVIOUR-CHANGE: flatfile used to store within the nodes it makes the division as a three letter code, like 'PRI'. However, for consistency with entrez and the scientific_name() of the node the division is supposed to correspond to, it is now stored as the full name, like 'Primates'. The names->id solution also stores the artificially uniqued names like 'Craniata ', allowing you for the first time to retrieve the correct id. Previously the search would have simply failed completely. The names->id solution now handles nodes with scientific names of 'xyz (class)', allowing you to retrieve the id with both get_taxonids('xyz') and get_taxonids('xyz (class)'). Previously only the latter would work. NOTE: the previous 2 changes (and the issues with entrez, see below) make flatfile better at searching the taxonomy database than entrez module or the website, both in terms of speed and completeness of results. BEHAVIOUR-CHANGE: The scientific name field isn't touched in any way, always being sent directly to Bio::Taxonomy::Node->new(-name => $untouched) or the $node->classification() array. Previously, a species node would have its name converted from 'Homo sapiens' to 'sapiens', but the conversion mangled very badly certain other species names. Bio::DB::Taxonomy::entrez ------------------------- # Bug-fixes Special characters like ", ( and ) in the input query string to get_taxonid() result in the failure or inaccuracy of the search. These characters are now removed prior to submission, allowing for correct search results. 
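The clean-up step described above can be sketched in plain Perl. This is an assumption about the approach, not the code in entrez.pm; clean_query is a hypothetical name, and only the three characters mentioned above are stripped:

```perl
use strict;
use warnings;

# Hypothetical helper, not the actual entrez.pm code: strip the characters
# named above (double quote and parentheses) before building the query.
sub clean_query {
    my ($query) = @_;
    (my $clean = $query) =~ tr/"()//d;   # delete ", ( and )
    $clean =~ s/\s+/ /g;                 # collapse runs of whitespace
    $clean =~ s/^\s+|\s+$//g;            # trim both ends
    return $clean;
}

my $cleaned = clean_query('"Homo sapiens"');
print "$cleaned\n";
```

Deleting the characters outright, rather than escaping them, keeps the query simple at the cost of making some distinct inputs look identical to the search.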
API-CHANGE: entrez has always been able to return multiple ids that match a single input name, so I've renamed get_taxonid() to get_taxonids() and it returns an array of ids in list context. It returns one of the ids in scalar context. For backward compatibility, *get_taxonid = \&get_taxonids. NOTE: entrez modules (and website) cannot cope with '' in the query, failing searches like 'Craniata '. For this reason, if get_taxonids() is given a query with '' it will immediately return undefined, saving a pointless website access. If you want the id of 'Craniata ' you must search for 'Craniata', then get the node for each returned id to see which one has a parent node with a scientific_name() or common_names() case-insensitive matching to 'chordata'. # Improvements BEHAVIOUR-CHANGE: now throws on failure to retrieve data from website. BEHAVIOUR-CHANGE: the ScientificName field isn't touched except for s/ \(class\)$//, being sent directly to Bio::Taxonomy::Node->new(-name => $untouched) or the $node->classification() array. Previously, a species node would have its name converted from 'Homo sapiens' to 'sapiens', but the conversion mangled very badly certain other species names. BEHAVIOUR-CHANGE: all common names of a node are now stored in the resulting Node object with Bio::Taxonomy::Node->new(-common_names => \@names). This means that the Genbank common name is now just one amongst others, and isn't guaranteed to be the first in the list either. Bio::Taxonomy::Node ------------------- # Bug-fixes non-interesting fixes to get get_Children_Nodes(), get_Lineage_Nodes() and get_LCA_Node() to work correctly. classification() has a proper solution to finding the classification when the array wasn't manually set. # Improvements BEHAVIOUR-CHANGE: node_name() used to be an alias to name('common'). Now it is an alias to name('scientific'). 
NOTE: node_name is what is set when ->new(-name => $name) is set, so flatfile and entrez and user-created nodes now implicitly associate the name of the node they create with its scientific name. BEHAVIOUR-CHANGE: scientific_name() used to be an alias to binomial(). Now it is *scientific_name = \&node_name. binomial(), in addition to working the old way (assume first two elements of classification array are species and genus, combine them), will shortcut and return the scientific_name() if we are a node with rank 'species' and scientific_name is two words. This makes binomial() an effective synonym of scientific_name() when Nodes were constructed as per flatfile or entrez, and when it is used correctly on a species node. BEHAVIOUR-CHANGE: *parent_taxon_id = \&parent_id. (Previously, you could assign and retrieve different values to/from each method.) New method common_names() supersedes common_name(), returning a list of all common_names. For backward compatibility, returns one of the names in scalar context, and *common_name = \&common_names. -factory and factory() removed, since there is no Bio::Taxonomy::FactoryI-implementing module, nothing in Node to make use of a factory once set, and a factory seems redundant when we're a node with a -dbh. species() and genus() issue a warning when you try to use them on a node that isn't of rank 'species' (since they interact with the classification array and not names('method') like the other similar methods). validate_name() removed because it just returns 1. validate_species_name() removed because species() can (should) now contain the real species name, like 'Homo sapiens', not 'sapiens'. But it could also be any wonderfully complex thing, so there's nothing we can confidently check for as being 'correct'. t/Taxonomy.t ------------ Runs a slightly more comprehensive set of tests on entrez, which are now only skipped if data retrieval fails. Tests flatfile on a cut-down version of the taxdump. 
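For anyone unfamiliar with the two idioms used throughout the notes above (context-sensitive return values, and *old_name = \&new_name aliases for backward compatibility), here is a minimal sketch in plain Perl. My::TaxDB, its constructor and the ids are made up for illustration; this is not the flatfile or entrez code:

```perl
use strict;
use warnings;

{
    package My::TaxDB;    # hypothetical; stands in for a Bio::DB::Taxonomy backend

    sub new {
        my ($class, %ids_for_name) = @_;
        return bless { ids_for_name => {%ids_for_name} }, $class;
    }

    # List context: every id matching the name. Scalar context: one of them.
    sub get_taxonids {
        my ($self, $name) = @_;
        my @ids = @{ $self->{ids_for_name}{$name} || [] };
        return wantarray ? @ids : $ids[0];
    }

    # Backward-compatible alias: the old method name runs the new code.
    no warnings 'once';
    *get_taxonid = \&get_taxonids;
}

my $db = My::TaxDB->new(xyz => [101, 202]);   # made-up ids

my @all = $db->get_taxonids('xyz');   # list context
my $one = $db->get_taxonid('xyz');    # scalar context, via the old name

print "all: @all; one: $one\n";
```

Because the alias shares one subroutine, old callers in scalar context keep getting a single id while new callers in list context see every match.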
> I'll also fix the problem with node names for ranks species and lower, > as discussed in thread 'Bio::DB::Taxonomy:: mishandles species, > subspecies/variant names', in the way I suggested there. This hasn't been done per se, because we now store the real ScientificName so there is no 'mishandling' to fix. From bix at sendu.me.uk Thu Jul 20 09:49:04 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 20 Jul 2006 14:49:04 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BF86AF.8080408@sendu.me.uk> References: <44BBBB69.6000906@sendu.me.uk> <44BF86AF.8080408@sendu.me.uk> Message-ID: <44BF89D0.7090103@sendu.me.uk> Sendu Bala wrote: > > Bio::DB::Taxonomy::flatfile > > BEHAVIOUR-CHANGE: The scientific name field isn't touched in any way, > always being sent directly to Bio::Taxonomy::Node->new(-name => > $untouched) or the $node->classification() array. Previously, a species > node would have its name converted from 'Homo sapiens' to 'sapiens', but > the conversion mangled very badly certain other species names. [...] > Bio::DB::Taxonomy::entrez > > BEHAVIOUR-CHANGE: the ScientificName field isn't touched except for s/ > \(class\)$//, being sent directly to Bio::Taxonomy::Node->new(-name => > $untouched) or the $node->classification() array. Previously, a species > node would have its name converted from 'Homo sapiens' to 'sapiens', but > the conversion mangled very badly certain other species names. Oops. In both cases the scientific name has ' (class)' removed from it, but the original name (with ' (class)') is stored as one of the common names. From georg.otto at tuebingen.mpg.de Thu Jul 20 10:29:33 2006 From: georg.otto at tuebingen.mpg.de (Georg Otto) Date: Thu, 20 Jul 2006 16:29:33 +0200 Subject: [Bioperl-l] Features in SeqIO GenBank output References: <73C89D17-91FE-47E4-80C1-AA6A689FA14E@uiuc.edu> Message-ID: This indeed seems to be the case. After upgrading it works fine. Sorry for stealing your time. 
Georg Chris Fields writes: > I'll give it a look. You might try upgrading to Bioperl 1.5.1 to see > if this was fixed. > > Chris > > On Jul 20, 2006, at 5:53 AM, Georg Otto wrote: > >> [...] >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> 
http://lists.open-bio.org/mailman/listinfo/bioperl-l > > Christopher Fields > Postdoctoral Researcher > Lab of Dr. Robert Switzer > Dept of Biochemistry > University of Illinois Urbana-Champaign From prabubio at gmail.com Thu Jul 20 12:01:35 2006 From: prabubio at gmail.com (Prabu R) Date: Thu, 20 Jul 2006 21:31:35 +0530 Subject: [Bioperl-l] Blast Output Parsing Message-ID: Dear All! I am now trying to parse a Blast output using Perl. I have to extract each alignment and have to parse the alignment. I mean, I have to check whether a particular part of the given sequence got aligned 100%. Could anybody please tell me what module in Perl I have to use for this? I've tried Bio::SearchIO, but I couldn't find a method to get the alignment. Kindly help. Thanks, R. Prabu From cjfields at uiuc.edu Thu Jul 20 13:03:17 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 20 Jul 2006 12:03:17 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BF86AF.8080408@sendu.me.uk> Message-ID: <002901c6ac1e$66ea3820$15327e82@pyrimidine> These all seem fine to me. Fantastic work! I added some comments, but overall everything looks good. I still plan on switching Bio::DB::Taxonomy::entrez to use Bio::DB::EUtilities at some point but probably won't get around to it until August; I still need to write up tests for the EUtilities modules. I may add a method for retrieving tax data based on protein/nucleotide sequence primary ID and relevant sequence database, so you could directly retrieve the relevant TaxID w/o parsing sequences directly for them. This would mainly be useful if you gather GIs from a BLAST search, for instance. Anyway, I could add this in the base class Bio::DB::Taxonomy directly so one could use the retrieved TaxIDs for flat-file or entrez searches; this requires, of course, access to the remote Entrez database (it would use ELink). Would that be of interest? If so, I'll work on that and add relevant tests to Taxonomy.t when I can. 
> Bio::DB::Taxonomy::flatfile > --------------------------- ... > API-CHANGE: for this reason I've renamed get_taxonid() to get_taxonids() > and it returns an array of ids in list context. For backward > compatibility it returns one of the ids in scalar context, and > *get_taxonid = \&get_taxonids. Returning a scalar makes sense as long as it's noted in the POD. I have seen similar methods return an array ref based on wantarray instead of a scalar, but that largely depends on the complexity of the array (an array of hashes, for instance). ... > Bio::DB::Taxonomy::entrez > ------------------------- ... > NOTE: entrez modules (and website) cannot cope with '' in the > query, failing searches like 'Craniata '. For this reason, if > get_taxonids() is given a query with '' it will immediately > return undefined, saving a pointless website access. If you want the id > of 'Craniata ' you must search for 'Craniata', then get the > node for each returned id to see which one has a parent node with a > scientific_name() or common_names() case-insensitive matching to > 'chordata'. It may be something with the esearch interface, though the direct TaxBrowser query also seems to have problems with this: http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/ I'll try looking into it to see if there is a more direct way to get those (there probably isn't). > # Improvements > BEHAVIOUR-CHANGE: now throws on failure to retrieve data from website. > > BEHAVIOUR-CHANGE: the ScientificName field isn't touched except for s/ > \(class\)$//, being sent directly to Bio::Taxonomy::Node->new(-name => > $untouched) or the $node->classification() array. Previously, a species > node would have its name converted from 'Homo sapiens' to 'sapiens', but > the conversion mangled very badly certain other species names. This actually relates to the similar comment made for Bio::DB::Taxonomy::flatfile. The mangling probably depends on the current node and whether using flatfile or XML (entrez). 
Most of the odd XML examples I posted before, where the TaxID associated with a sequence had extra data, were a rank of 'no rank'. The species rank, if present, has a normal binomial name for : Flavobacterium johnsoniae UW101 ... Flavobacterium johnsoniae species Pseudomonas putida F1 ... Pseudomonas putida species Caldicellulosiruptor saccharolyticus DSM 8903 ... Caldicellulosiruptor saccharolyticus species The genus rank has one name; the subspecies rank has the full species name with 'subsp.' followed by the subspecies name. So, if using XML, one could use the taxon subelements stored in the XML element to sort out genus(), species(), subspecies(), and also higher order elements if someone wanted to implement them. This, of course, isn't necessary for the current changes, but down the road if anybody wanted it... ... > Bio::Taxonomy::Node > ------------------- ... > species() and genus() issue a warning when you try to use them on a node > that isn't of rank 'species' (since they interact with the > classification array and not names('method') like the other similar > methods). I would just have genus() and species() issue warnings if they aren't set to a particular value. So, if the current node is at the genus rank, genus() will be set but species() won't be. And no need to do additional checking! Fabulous work Sendu! Chris From cjfields at uiuc.edu Thu Jul 20 13:23:14 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 20 Jul 2006 12:23:14 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BF89D0.7090103@sendu.me.uk> Message-ID: <002a01c6ac21$2ed16190$15327e82@pyrimidine> Just thought of something... You had mentioned using a stripped-down version of Bio::Taxonomy::Node previously, which led to a bit of contention. 
One way to make everybody happy would be to create an interface class that contains the basic shared methods (Bio::Taxonomy::NodeI), then have the currently-named Bio::Taxonomy::Node (which could be renamed to Bio::Taxonomy::Species or something similar) implement those methods along with the current methods. Another class (your stripped down version, which could then be Bio::Taxonomy::Node) would also implement whatever base class methods were needed. They would both be Bio::Taxonomy::NodeI-implementing, so you could use either object type where required. |------Node NodeI----| |------Species Another option would be to have Bio::Taxonomy::Node itself stripped down, then have another class (Bio::Taxonomy::Species) inherit methods from it and also implement additional methods (genus(), species(), etc). Node----Species Would something like that be feasible? I favor the interface version as it sticks with the interface-implementation design that Bioperl has been migrating towards: http://www.bioperl.org/wiki/Advanced_BioPerl#Bioperl_Interface_design This would also help out with the whole Bio::Species issue; just have Bio::Taxonomy::Species replace it. Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Sendu Bala > Sent: Thursday, July 20, 2006 8:49 AM > To: bioperl-l at lists.open-bio.org > Subject: Re: [Bioperl-l] Bio::*Taxonomy* changes > > Sendu Bala wrote: > > > > Bio::DB::Taxonomy::flatfile > > > > BEHAVIOUR-CHANGE: The scientific name field isn't touched in any way, > > always being sent directly to Bio::Taxonomy::Node->new(-name => > > $untouched) or the $node->classification() array. Previously, a species > > node would have its name converted from 'Homo sapiens' to 'sapiens', but > > the conversion mangled very badly certain other species names. > [...] 
> > Bio::DB::Taxonomy::entrez > > > > BEHAVIOUR-CHANGE: the ScientificName field isn't touched except for s/ > > \(class\)$//, being sent directly to Bio::Taxonomy::Node->new(-name => > > $untouched) or the $node->classification() array. Previously, a species > > node would have its name converted from 'Homo sapiens' to 'sapiens', but > > the conversion mangled very badly certain other species names. > > Oops. In both cases the scientific name has ' (class)' removed from it, > but the original name (with ' (class)') is stored as one of the common > names. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Thu Jul 20 13:31:42 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 20 Jul 2006 12:31:42 -0500 Subject: [Bioperl-l] Blast Output Parsing In-Reply-To: Message-ID: <002b01c6ac22$5d75e1f0$15327e82@pyrimidine> Grab the HSPs, then call get_aln() on each HSP to generate a Bio::SimpleAlign object. You can then use Bio::AlignIO to generate the alignment output if needed, or use the Bio::SimpleAlign methods to get what you want. http://www.bioperl.org/wiki/HOWTO:Beginners http://www.bioperl.org/wiki/HOWTO:SearchIO http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/SimpleAlign.html Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Prabu R > Sent: Thursday, July 20, 2006 11:02 AM > To: bioperl-l at lists.open-bio.org > Subject: [Bioperl-l] Blast Output Parsing > > Dear All! > > I am now trying to parse a Blast output using PERL. > > I have to extract each alignment and have to parse the alignment. I mean, > I > have to check whether a particular part of the given sequence got aligned > 100%. > > Anybody please tell me what module in PERL I have to use for getting this. > > I've tried Bio::SearchIO.
But I didn't get any method to get the > alignment. > > Kindly help. > > Thanks, > R. Prabu > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From bix at sendu.me.uk Thu Jul 20 13:53:03 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 20 Jul 2006 18:53:03 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <002901c6ac1e$66ea3820$15327e82@pyrimidine> References: <002901c6ac1e$66ea3820$15327e82@pyrimidine> Message-ID: <44BFC2FF.3030704@sendu.me.uk> Chris Fields wrote: > > I still plan on switching Bio::DB::Taxonomy::entrez to use > Bio::DB::EUtilities at some point but probably won't get around to it until > August; If I may make two feature requests (you've probably already done them; if so, apologies)? a) Automatically enforce the 3-second wait rule when querying via the NCBI website. b) Automatically cache results locally in a reasonable way, such that repeated queries aiming to get the same result don't have to go via the website. > Anyway, I could add this in then base class Bio::DB::Taxonomy directly so > one could used the retrieved TaxIDs for flat-file or entrez searches; this > requires, of course, access to the remote Entrez database (it would use > ELink). Would that be of interest? Sorry, I don't really understand this paragraph. I'm unable to parse '...then base class Bio::DB::Taxonomy directly so...', for starters. >> Bio::Taxonomy::Node >> ------------------- > > ... > >> species() and genus() issue a warning when you try to use them on a node >> that isn't of rank 'species' (since they interact with the >> classification array and not names('method') like the other similar >> methods). > > I would just have genus() and species() issue warnings if they aren't set to > a particular value. So, if the current node is at the genus rank, genus() > will be set but species() won't be. And no need to do additional checking!
The problem is, genus() and species() are special cases that aren't normally directly set. They get their values from the classification array: genus() returns (classification())[1] and species() returns (classification())[0]. They set the same values. Doing this is only sane (though is still likely to be wrong, given that there can be ranks between species and genus) when the node is of rank 'species', hence the warnings. I imagine this is to work with pesky file formats like genbank, so I can't really change anything here without major overhaul. And my plans for overhaul involve getting rid of genus() and species(), so I'll just leave them be for now. Anyway, thanks for your comments and input into this thread! It's much appreciated. From bix at sendu.me.uk Thu Jul 20 13:55:56 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 20 Jul 2006 18:55:56 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <002a01c6ac21$2ed16190$15327e82@pyrimidine> References: <002a01c6ac21$2ed16190$15327e82@pyrimidine> Message-ID: <44BFC3AC.8010704@sendu.me.uk> Chris Fields wrote: > Just thought of something... > > You had mentioned using a stripped-down version of Bio::Taxonomy::Node > previously, which led to a bit of contention. One way to make everybody > happy would be to create an interface class that contains the basic shared > methods (Bio::Taxonomy::NodeI), then have the currently-named > Bio::Taxonomy::Node (which could be renamed to Bio::Taxonomy::Species or > something similar) implement those methods along with the current methods. > Another class (your stripped down version, which could then be > Bio::Taxonomy::Node) would also implement whatever base class methods were > needed. They would both be Bio::Taxonomy::NodeI-implementing, so you could > use either object type where required. > > |------Node > NodeI----| > |------Species [...] 
> I favor the interface version as it > sticks with the interface-implementation design that Bioperl has been > migrating towards: > > http://www.bioperl.org/wiki/Advanced_BioPerl#Bioperl_Interface_design > > This would also help out with the whole Bio::Species issue; just have > Bio::Taxonomy::Species replace it. Yes, this sounds good to me. Should I still wait until Jason/elders are able to comment before I start exploring this avenue? From cjfields at uiuc.edu Thu Jul 20 14:21:48 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 20 Jul 2006 13:21:48 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BFC3AC.8010704@sendu.me.uk> Message-ID: <000601c6ac29$5d533a90$15327e82@pyrimidine> I would say go ahead, why not? This would likely lead to the eventual deprecation of Bio::Species, which was in the cards anyway. The only problem I can foresee is which class to use with Bio::DB::Taxonomy*? I guess one could settle on one class by default and have the option to use another Bio::Taxonomy::NodeI-implementing class if you wanted more data/methods available... Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Sendu Bala > Sent: Thursday, July 20, 2006 12:56 PM > To: bioperl-l at bioperl.org > Subject: Re: [Bioperl-l] Bio::*Taxonomy* changes > > Chris Fields wrote: > > Just thought of something... > > > > You had mentioned using a stripped-down version of Bio::Taxonomy::Node > > previously, which led to a bit of contention. One way to make everybody > > happy would be to create an interface class that contains the basic > shared > > methods (Bio::Taxonomy::NodeI), then have the currently-named > > Bio::Taxonomy::Node (which could be renamed to Bio::Taxonomy::Species or > > something similar) implement those methods along with the current > methods. 
> > Another class (your stripped down version, which could then be > > Bio::Taxonomy::Node) would also implement whatever base class methods > were > > needed. They would both be Bio::Taxonomy::NodeI-implementing, so you > could > > use either object type where required. > > > > |------Node > > NodeI----| > > |------Species > [...] > > I favor the interface version as it > > sticks with the interface-implementation design that Bioperl has been > > migrating towards: > > > > http://www.bioperl.org/wiki/Advanced_BioPerl#Bioperl_Interface_design > > > > This would also help out with the whole Bio::Species issue; just have > > Bio::Taxonomy::Species replace it. > > Yes, this sounds good to me. Should I still wait until Jason/elders are > able to comment before I start exploring this avenue? > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From hlapp at gmx.net Thu Jul 20 14:24:19 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 20 Jul 2006 14:24:19 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BFC3AC.8010704@sendu.me.uk> References: <002a01c6ac21$2ed16190$15327e82@pyrimidine> <44BFC3AC.8010704@sendu.me.uk> Message-ID: On Jul 20, 2006, at 1:55 PM, Sendu Bala wrote: > > Yes, this sounds good to me. Should I still wait until Jason/elders > are > able to comment before I start exploring this avenue? Unless you're afraid that your suggestions are going too wild for our palate please do go ahead. The joy of CVS is we can always go back. For my part, I just haven't been able to keep up with the flurry of long emails ... 
I'll have to do some extensive bedtime reading (and then writing ;) soon I guess :-) -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From saunders at uchicago.edu Thu Jul 20 17:47:08 2006 From: saunders at uchicago.edu (Matthew A. Saunders) Date: Thu, 20 Jul 2006 16:47:08 -0500 (CDT) Subject: [Bioperl-l] installing bioperl Message-ID: Dear Bioperl representative, I have been trying to install bioperl (in order to ultimately run some Ensembl APIs) but I seem to be having some problems with the bioperl installation. I have followed the installation directions and I get to the last steps of the "make" process, yet this stage fails with the error message below. Can you possibly tell me what is the problem. I am not sure that I understand the command "make", but I think that it requires that there be a file named "makefile" in the given folder, when I look in my newly formed "bioperl-1.4" folder there is no "makefile" in there. Perhaps that is a problem. If so, how might I rectify the matter? Thanks! Matt ************************************************************* . . . Enjoy the rest of bioperl, which you can use after going 'make install' Checking if your kit is complete... Looks good /usr/bin/perl: symbol lookup error: /usr/lib64/perl5/5.8.5/x86_64-linux-thread-multi/auto/DB_File/DB_File.so: undefined symbol: db_version Running make test Make had some problems, maybe interrupted? Won't test Running make install Make had some problems, maybe interrupted? Won't install *************************************************************** ----------------------------------------------------- Matthew A. Saunders UNCF-MERCK Postdoctoral Research Fellow Dept. 
of Ecology and Evolution University of Chicago (773)834-3964 Skype: mattsaunders555 http://home.uchicago.edu/~saunders ------------------------------------------------------- From saunders at uchicago.edu Thu Jul 20 18:01:53 2006 From: saunders at uchicago.edu (Matthew A. Saunders) Date: Thu, 20 Jul 2006 17:01:53 -0500 (CDT) Subject: [Bioperl-l] installing bioperl In-Reply-To: References: Message-ID: In continuation to my described problem, I have just installed the bioperl-run file from the .tar.gz format and that was successful through the "perl Makefile.PL" and the "make" & "make test" phases. It is the "bioperl core" file that is still giving me the problems described below. Thanks! Matt ******************************** On Thu, 20 Jul 2006, Matthew A. Saunders wrote: > Dear Bioperl representative, > > I have been trying to install bioperl (in order to ultimately run some > Ensembl APIs) but I seem to be having some problems with the bioperl > installation. > > I have followed the installation directions and I get to the last steps of > the "make" process, yet this stage fails with the error message below. Can > you possibly tell me what is the problem. I am not sure that I understand > the command "make", but I think that it requires that there be a file named > "makefile" in the given folder, when I look in my newly formed "bioperl-1.4" > folder there is no "makefile" in there. Perhaps that is a problem. If so, > how might I rectify the matter? > > Thanks! > > Matt > > > ************************************************************* . . . > Enjoy the rest of bioperl, which you can use after going 'make install' > > Checking if your kit is complete... > Looks good > /usr/bin/perl: symbol lookup error: > /usr/lib64/perl5/5.8.5/x86_64-linux-thread-multi/auto/DB_File/DB_File.so: > undefined symbol: db_version > Running make test > Make had some problems, maybe interrupted? Won't test > Running make install > Make had some problems, maybe interrupted? 
Won't install > *************************************************************** > > > > ----------------------------------------------------- > Matthew A. Saunders > UNCF-MERCK Postdoctoral Research Fellow > > Dept. of Ecology and Evolution > University of Chicago > (773)834-3964 > Skype: mattsaunders555 > http://home.uchicago.edu/~saunders > ------------------------------------------------------- > > From bix at sendu.me.uk Thu Jul 20 18:47:33 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 20 Jul 2006 23:47:33 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <001001c6ac2f$02c8c3f0$15327e82@pyrimidine> References: <001001c6ac2f$02c8c3f0$15327e82@pyrimidine> Message-ID: <44C00805.7090403@sendu.me.uk> Chris Fields wrote: > As for caching, > do you mean caching of the tax information or the sequence ID information? Anything you get from entrez. > Caching of tax information would be great, but how would you go about it? I > can see how it would be easy to have a cache for the flatfile using a local > index, but not so much for XML data retrieved from Entrez (a > continually-appended local file, maybe, with an accompanying index file?). I didn't actually mean a stored file (though that would be possible with a tied hash or something: DB_File, just like flatfile), but an in-memory one for use during the course of program execution. A stored file would probably be dangerous because you wouldn't know whether the data had become stale, and checking whether it had would defeat the point. >> The problem is, genus() and species() are special cases that aren't >> normally directly set. They get their values from the classification >> array: genus() returns (classification())[1] and species() returns >> (classification())[0]. They set the same values. Doing this is only sane >> (though is still likely to be wrong, given that there can be ranks >> between species and genus) when the node is of rank 'species', hence the >> warnings.
>> >> I imagine this is to work with pesky file formats like genbank, so I >> can't really change anything here without major overhaul. And my plans >> for overhaul involve getting rid of genus() and species(), so I'll just >> leave them be for now. > > This would all depend on where the information came from; if the information > came from the Entrez XML element data: > [snip] > > The subspecies(), genus(), and species() could all be set from this instead > of the classification array. The problem lies then with the flatfile data > and how it would be parsed out, if that's at all possible with the flatfile > data. If not, I see why you would rather have this return a stripped-down > Bio::Taxonomy::Node object. > > I would have to look at how everything is indexed in > Bio::DB::Taxonomy::entrez, but I think it's feasible. entrez already parses through LineageEx to build the classification array. flatfile walks up all the parents to do the same. Having the information isn't the issue. We have the information. The methods genus() and species() need to work with the genbank file format; that is the problem. From MEC at stowers-institute.org Thu Jul 20 18:40:55 2006 From: MEC at stowers-institute.org (Cook, Malcolm) Date: Thu, 20 Jul 2006 17:40:55 -0500 Subject: [Bioperl-l] Remote Blast - Blast Human Genome Message-ID: Rohan, 'snp/human/human_snp' is the database name you need to use to blast into the human snp database at NCBI. See the following document for the full list (the link was provided to me via personal correspondence with the NCBI helpdesk). Very useful... Hmm, looking again, there appear now to be two versions: http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/blastdblist.html (last updated 2/7/2006) http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/remote_accessible_blastdblist.html (last updated 5/29/2006) Neither is linked to by any other document on the internet (google sez) including anywhere else at NCBI. Go figure.
It should be IMHO since this info is nowhere else collected. Of course it may be out of date, but it always has got me through. Good luck Malcolm Cook - mec at stowers-institute.org - 816-926-4449 Database Applications Manager - Bioinformatics Stowers Institute for Medical Research - Kansas City, MO USA >-----Original Message----- >From: bioperl-l-bounces at lists.open-bio.org >[mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Chris Fields >Sent: Monday, July 17, 2006 4:26 PM >To: vrramnar at student.cs.uwaterloo.ca; bioperl-l at lists.open-bio.org >Subject: Re: [Bioperl-l] Remote Blast - Blast Human Genome > >Okay, I think I may know what's going on a little more now >with NCBI's BLAST >interface. Looks like any NCBI BLAST query must use the >default URL and so >must set up to proper GET/PUT commands to retrieve everything >correctly. > >Here's the API description for it all: > >http://www.ncbi.nlm.nih.gov/BLAST/Doc/urlapi.html > >You could try setting the database to 'snp' or something along >those lines >instead of 'nr'; or you could see what the name of the >database is when you >use the web form and try setting it to that. According to >this page, this >should be possible: > >http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=helpsnpfaq.sectio >n.SearchdbSNP >_test._Search_dbSNP_Using_B > >The Entrez Query limit was a recommendation for limiting your >search to a >set of sequences for human, for instance. > >I'll try looking into it a bit more but I'm pretty busy. If you find >anything out you should probably post it here . > >Chris > >> Hi Chris, >> >> 1. I have tried changing the database to snp or dbSNP but >neither works. >> It >> seems that depending on which type of blast you use(ie, Genome Blast, >> Blast SNP, >> normal blast such as blastn, etc...) you see a different listing of >> databases >> available for querys. 
Since you mention that the Blast page I see was >> generated >> by Genome, where could I go to see a complete listing of >databases I can >> query?? >> Or if you knew off hand which database to search if I only >wanted dbSNP >> hits? >> >> 2. You also mention, I can limit the search by using Entrez >terms. Do you >> mean >> like: >> $Bio::Tools::Run::RemoteBlast::HEADER{'ENTREZ_QUERY'} = 'abc'; >> where 'abc' is the name of the subject with which you would >only like to >> see >> result of. For example if you put it as 'Homo >sapiens[Organism]' then only >> human >> sequences would be in hit lists. >> If this is what you mean, what would I change it to, to see >only hits from >> dbSNP? >> >> Thanks for the ongoing help, >> >> Rohan >> >> Quoting Chris Fields : >> >> > I added a method to RemoteBlast in bioperl-live (CVS) if >you want to >> play >> > with changing the URL. I have been thinking about doing >this for a bit >> now >> > but I already see problems. >> > >> > Here's the issue: the BLAST page you see is NOT the NCBI BLAST page >> (note >> > the differences in the URL) but a user-friendly request >page, generated >> on >> > the fly by Genome, to submit BLAST requests for the >relevant database. >> So >> > changing the URL will not work (even by adding extra >parameters); you >> only >> > get the original HTML web page. >> > >> > You could try changing the database or limiting the search using an >> Entrez >> > term (which you should be able to include in the request, >probably by >> adding >> > it to the HEADER). 
>> > >> > Chris >> > >> > > -----Original Message----- >> > > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- >> > > bounces at lists.open-bio.org] On Behalf Of >> vrramnar at student.cs.uwaterloo.ca >> > > Sent: Thursday, July 13, 2006 5:39 PM >> > > To: bioperl-l at lists.open-bio.org >> > > Subject: [Bioperl-l] Remote Blast - Blast Human Genome >> > > >> > > >> > > Hello Again, >> > > >> > > I have another question regarding Remote blast but this >time using >> Genome >> > > Blast. >> > > >> > > Here is the link: >> > > >> > > >> >http://www.ncbi.nlm.nih.gov/genome/seq/BlastGen/BlastGen.cgi?taxid=9606 >> > > >> > > which again uses the main Blast web site: >> > > >> > > http://www.ncbi.nlm.nih.gov/BLAST/Blast.cgi >> > > >> > > Again I am not sure what to add or what HEADER >information to change >> > > within my >> > > script. >> > > >> > > Here is my program, which was the same as the last email: >> > > >> > > #!/usr/bin/perl -w >> > > >> > > use Bio::Perl; >> > > use Bio::Tools::Run::RemoteBlast; >> > > >> > > my $prog = "blastn"; >> > > my $db = "refseq_genomic"; >> > > my $e_val = 0.01; >> > > >> > > my @params = ( '-prog' => $prog, >> > > '-data' => $db, >> > > '-expect' => $e_val); >> > > >> > > my $factory = new Bio::Tools::Run::RemoteBlast->new(@params); >> > > $Bio::Tools::Run::RemoteBlast::HEADER{'WWW_BLAST_TYPE'} >= '????'; <-- >> --- >> > > what >> > > do I put here >> > > #$Bio::Tools::Run::RemoteBlast::HEADER{'?????'} = >'????'; <--- Do I >> need >> > > to add >> > > any other values to the form inputs >> > > >> > > $factory->submit_blast("blast.in"); >> > > $v = 1; >> > > >> > > while (my @rids = $factory->each_rid) >> > > { foreach my $rid ( @rids ) >> > > { my $rc = $factory->retrieve_blast($rid); >> > > if( !ref($rc) ) >> > > { if( $rc < 0 ) >> > > { $factory->remove_rid($rid); >> > > } >> > > print STDERR "." 
if ( $v > 0 ); >> > > sleep 5; >> > > } >> > > else >> > > { my $result = $rc->next_result(); >> > > my $filename = $result->query_name()."\.out"; >> > > $factory->save_output($filename); >> > > $factory->remove_rid($rid); >> > > print "\nQuery Name: ", $result->query_name(), "\n"; >> > > } >> > > } >> > > } >> > > >> > > >> > > Both of my questions are very similar as in I know how >to use remote >> > > blast but >> > > not sure what to change to access the specific blast I want. >> > > >> > > Again, any help would be very appreciated!! >> > > >> > > Rohan >> > > >> > > >> > > >> > > ---------------------------------------- >> > > This mail sent through www.mywaterloo.ca >> > > _______________________________________________ >> > > Bioperl-l mailing list >> > > Bioperl-l at lists.open-bio.org >> > > http://lists.open-bio.org/mailman/listinfo/bioperl-l >> > >> >> >> >> >> ---------------------------------------- >> This mail sent through www.mywaterloo.ca > >_______________________________________________ >Bioperl-l mailing list >Bioperl-l at lists.open-bio.org >http://lists.open-bio.org/mailman/listinfo/bioperl-l > From cjfields at uiuc.edu Thu Jul 20 19:01:02 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 20 Jul 2006 18:01:02 -0500 Subject: [Bioperl-l] installing bioperl In-Reply-To: References: Message-ID: <68C6025D-A9FE-47F0-905C-28B79C4B843A@uiuc.edu> Did you run 'perl Makefile.PL', then 'make', then 'make install'? The 'perl Makefile.PL' step is what generates the Makefile. Something screwy with DB_File, apparently, is also going on. > /usr/lib64/perl5/5.8.5/x86_64-linux-thread-multi/auto/DB_File/ > DB_File.so: Try updating or reinstalling DB_File. Chris On Jul 20, 2006, at 4:47 PM, Matthew A. Saunders wrote: > Dear Bioperl representative, > > I have been trying to install bioperl (in order to ultimately run some > Ensembl APIs) but I seem to be having some problems with the > bioperl installation.
> > I have followed the installation directions and I get to the last > steps of > the "make" process, yet this stage fails with the error message below. > Can you possibly tell me what is the problem. I am not sure that I > understand the command "make", but I think that it requires that > there be > a file named "makefile" in the given folder, when I look in my newly > formed "bioperl-1.4" folder there is no "makefile" in there. > Perhaps that > is a problem. If so, how might I rectify the matter? > > Thanks! > > Matt > > > ************************************************************* . . > . > Enjoy the rest of bioperl, which you can use after going 'make > install' > > Checking if your kit is complete... > Looks good > /usr/bin/perl: symbol lookup error: > /usr/lib64/perl5/5.8.5/x86_64-linux-thread-multi/auto/DB_File/ > DB_File.so: > undefined symbol: db_version > Running make test > Make had some problems, maybe interrupted? Won't test > Running make install > Make had some problems, maybe interrupted? Won't install > *************************************************************** > > > > ----------------------------------------------------- > Matthew A. Saunders > UNCF-MERCK Postdoctoral Research Fellow > > Dept. of Ecology and Evolution > University of Chicago > (773)834-3964 > Skype: mattsaunders555 > http://home.uchicago.edu/~saunders > ------------------------------------------------------- > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From cjfields at uiuc.edu Thu Jul 20 19:02:08 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 20 Jul 2006 18:02:08 -0500 Subject: [Bioperl-l] Remote Blast - Blast Human Genome In-Reply-To: References: Message-ID: Nice to know! I'll add this to the wiki. 
Chris On Jul 20, 2006, at 5:40 PM, Cook, Malcolm wrote: > Rohan, > > 'snp/human/human_snp' is the database name you need to use to blast > into > human snp database at NCBI > > See the following document for the full list (which link was > provided to > me via personal correspondace with NCBI helpdesk). Very useful... > > Hmm, looming again, there appear now to be two versions: > > http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/blastdblist.html (last > updated 2/7/2006) > http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/ > remote_accessible_blastdbli > st.html (last uypdated 5/29/2006) > > Neither are linked to by any other document on the internet (google > sez) > including anywhere else at NCBI. Go figure. It should be IMHO since > this info is nowhere else collected. > > Of course it may be out of date, but it always has got me through. > > Good luck > > Malcolm Cook - mec at stowers-institute.org - 816-926-4449 > Database Applications Manager - Bioinformatics > Stowers Institute for Medical Research - Kansas City, MO USA > > [...] Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From vrramnar at student.cs.uwaterloo.ca Thu Jul 20 19:07:15 2006 From: vrramnar at student.cs.uwaterloo.ca (vrramnar at student.cs.uwaterloo.ca) Date: Thu, 20 Jul 2006 19:07:15 -0400 Subject: [Bioperl-l] Remote Blast - Blast Human Genome In-Reply-To: References: Message-ID: <1153436835.44c00ca39f2ee@www.nexusmail.uwaterloo.ca> Hi Malcolm, Thanks for the help, I actually figured this out today the same way you did through discussions with NCBI help deskng.
He mentioned the main site is:

http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/

But specifically:

http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/remote_accessible_blastdblist.html

So all you would need to do while using remoteblast is set your $db to one of the following:

snp/human_9606/human_9606    Human SNPs
snp/human_9606/rs_ch1        Human chr 1 SNPs
snp/human_9606/rs_ch10       Human chr 10 SNPs
snp/human_9606/rs_ch11       Human chr 11 SNPs
snp/human_9606/rs_ch12       Human chr 12 SNPs
snp/human_9606/rs_ch13       Human chr 13 SNPs
snp/human_9606/rs_ch14       Human chr 14 SNPs
snp/human_9606/rs_ch15       Human chr 15 SNPs
snp/human_9606/rs_ch16       Human chr 16 SNPs
snp/human_9606/rs_ch17       Human chr 17 SNPs
snp/human_9606/rs_ch18       Human chr 18 SNPs
snp/human_9606/rs_ch19       Human chr 19 SNPs
snp/human_9606/rs_ch2        Human chr 2 SNPs
snp/human_9606/rs_ch20       Human chr 20 SNPs
snp/human_9606/rs_ch21       Human chr 21 SNPs
snp/human_9606/rs_ch22       Human chr 22 SNPs
snp/human_9606/rs_ch3        Human chr 3 SNPs
snp/human_9606/rs_ch4        Human chr 4 SNPs
snp/human_9606/rs_ch5        Human chr 5 SNPs
snp/human_9606/rs_ch6        Human chr 6 SNPs
snp/human_9606/rs_ch7        Human chr 7 SNPs
snp/human_9606/rs_ch8        Human chr 8 SNPs
snp/human_9606/rs_ch9        Human chr 9 SNPs
snp/human_9606/rs_chMT       Human chr Mitochondrial SNPs
snp/human_9606/rs_chMulti    Human SNPs mapped to multiple locations
snp/human_9606/rs_chNotOn    Human SNPs not mapped
snp/human_9606/rs_chUn       Human SNPs mapped to unplaced contigs
snp/human_9606/rs_chX        Human chr x SNPs
snp/human_9606/rs_chY        Human chr y SNPs

The web site has a more complete list of all other databases available using the remoteblast module.

Rohan

Quoting "Cook, Malcolm" :

> Rohan,
>
> 'snp/human/human_snp' is the database name you need to use to blast into
> human snp database at NCBI
>
> See the following document for the full list (which link was provided to
> me via personal correspondence with NCBI helpdesk). Very useful...
> > Hmm, looming again, there appear now to be two versions: > > http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/blastdblist.html (last > updated 2/7/2006) > http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/remote_accessible_blastdbli > st.html (last uypdated 5/29/2006) > > Neither are linked to by any other document on the internet (google sez) > including anywhere else at NCBI. Go figure. It should be IMHO since > this info is nowhere else collected. > > Of course it may be out of date, but it always has got me through. > > Good luck > > Malcolm Cook - mec at stowers-institute.org - 816-926-4449 > Database Applications Manager - Bioinformatics > Stowers Institute for Medical Research - Kansas City, MO USA > > > > >-----Original Message----- > >From: bioperl-l-bounces at lists.open-bio.org > >[mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Chris Fields > >Sent: Monday, July 17, 2006 4:26 PM > >To: vrramnar at student.cs.uwaterloo.ca; bioperl-l at lists.open-bio.org > >Subject: Re: [Bioperl-l] Remote Blast - Blast Human Genome > > > >Okay, I think I may know what's going on a little more now > >with NCBI's BLAST > >interface. Looks like any NCBI BLAST query must use the > >default URL and so > >must set up to proper GET/PUT commands to retrieve everything > >correctly. > > > >Here's the API description for it all: > > > >http://www.ncbi.nlm.nih.gov/BLAST/Doc/urlapi.html > > > >You could try setting the database to 'snp' or something along > >those lines > >instead of 'nr'; or you could see what the name of the > >database is when you > >use the web form and try setting it to that. According to > >this page, this > >should be possible: > > > >http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=helpsnpfaq.sectio > >n.SearchdbSNP > >_test._Search_dbSNP_Using_B > > > >The Entrez Query limit was a recommendation for limiting your > >search to a > >set of sequences for human, for instance. > > > >I'll try looking into it a bit more but I'm pretty busy. 
If you find > >anything out you should probably post it here . > > > >Chris > > > >> Hi Chris, > >> > >> 1. I have tried changing the database to snp or dbSNP but > >neither works. > >> It > >> seems that depending on which type of blast you use(ie, Genome Blast, > >> Blast SNP, > >> normal blast such as blastn, etc...) you see a different listing of > >> databases > >> available for querys. Since you mention that the Blast page I see was > >> generated > >> by Genome, where could I go to see a complete listing of > >databases I can > >> query?? > >> Or if you knew off hand which database to search if I only > >wanted dbSNP > >> hits? > >> > >> 2. You also mention, I can limit the search by using Entrez > >terms. Do you > >> mean > >> like: > >> $Bio::Tools::Run::RemoteBlast::HEADER{'ENTREZ_QUERY'} = 'abc'; > >> where 'abc' is the name of the subject with which you would > >only like to > >> see > >> result of. For example if you put it as 'Homo > >sapiens[Organism]' then only > >> human > >> sequences would be in hit lists. > >> If this is what you mean, what would I change it to, to see > >only hits from > >> dbSNP? > >> > >> Thanks for the ongoing help, > >> > >> Rohan > >> > >> Quoting Chris Fields : > >> > >> > I added a method to RemoteBlast in bioperl-live (CVS) if > >you want to > >> play > >> > with changing the URL. I have been thinking about doing > >this for a bit > >> now > >> > but I already see problems. > >> > > >> > Here's the issue: the BLAST page you see is NOT the NCBI BLAST page > >> (note > >> > the differences in the URL) but a user-friendly request > >page, generated > >> on > >> > the fly by Genome, to submit BLAST requests for the > >relevant database. > >> So > >> > changing the URL will not work (even by adding extra > >parameters); you > >> only > >> > get the original HTML web page. 
> >> > > >> > You could try changing the database or limiting the search using an > >> Entrez > >> > term (which you should be able to include in the request, > >probably by > >> adding > >> > it to the HEADER). > >> > > >> > Chris > >> > > >> > > -----Original Message----- > >> > > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > >> > > bounces at lists.open-bio.org] On Behalf Of > >> vrramnar at student.cs.uwaterloo.ca > >> > > Sent: Thursday, July 13, 2006 5:39 PM > >> > > To: bioperl-l at lists.open-bio.org > >> > > Subject: [Bioperl-l] Remote Blast - Blast Human Genome > >> > > > >> > > > >> > > Hello Again, > >> > > > >> > > I have another question regarding Remote blast but this > >time using > >> Genome > >> > > Blast. > >> > > > >> > > Here is the link: > >> > > > >> > > > >> > >http://www.ncbi.nlm.nih.gov/genome/seq/BlastGen/BlastGen.cgi?taxid=9606 > >> > > > >> > > which again uses the main Blast web site: > >> > > > >> > > http://www.ncbi.nlm.nih.gov/BLAST/Blast.cgi > >> > > > >> > > Again I am not sure what to add or what HEADER > >information to change > >> > > within my > >> > > script. 
> >> > > > >> > > Here is my program, which was the same as the last email: > >> > > > >> > > #!/usr/bin/perl -w > >> > > > >> > > use Bio::Perl; > >> > > use Bio::Tools::Run::RemoteBlast; > >> > > > >> > > my $prog = "blastn"; > >> > > my $db = "refseq_genomic"; > >> > > my $e_val = 0.01; > >> > > > >> > > my @params = ( '-prog' => $prog, > >> > > '-data' => $db, > >> > > '-expect' => $e_val); > >> > > > >> > > my $factory = new Bio::Tools::Run::RemoteBlast->new(@params); > >> > > $Bio::Tools::Run::RemoteBlast::HEADER{'WWW_BLAST_TYPE'} > >= '????'; <-- > >> --- > >> > > what > >> > > do I put here > >> > > #$Bio::Tools::Run::RemoteBlast::HEADER{'?????'} = > >'????'; <--- Do I > >> need > >> > > to add > >> > > any other values to the form inputs > >> > > > >> > > $factory->submit_blast("blast.in"); > >> > > $v = 1; > >> > > > >> > > while (my @rids = $factory->each_rid) > >> > > { foreach my $rid ( @rids ) > >> > > { my $rc = $factory->retrieve_blast($rid); > >> > > if( !ref($rc) ) > >> > > { if( $rc < 0 ) > >> > > { $factory->remove_rid($rid); > >> > > } > >> > > print STDERR "." if ( $v > 0 ); > >> > > sleep 5; > >> > > } > >> > > else > >> > > { my $result = $rc->next_result(); > >> > > my $filename = $result->query_name()."\.out"; > >> > > $factory->save_output($filename); > >> > > $factory->remove_rid($rid); > >> > > print "\nQuery Name: ", $result->query_name(), "\n"; > >> > > } > >> > > } > >> > > } > >> > > > >> > > > >> > > Both of my questions are very similiar as in I know how > >to use remote > >> > > blast but > >> > > not sure what to change to access the specific blast I want. > >> > > > >> > > Again, any help would be very appreciated!! 
> >> > > > >> > > Rohan > >> > > > >> > > > >> > > > >> > > ---------------------------------------- > >> > > This mail sent through www.mywaterloo.ca > >> > > _______________________________________________ > >> > > Bioperl-l mailing list > >> > > Bioperl-l at lists.open-bio.org > >> > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > >> > > >> > >> > >> > >> > >> ---------------------------------------- > >> This mail sent through www.mywaterloo.ca > > > >_______________________________________________ > >Bioperl-l mailing list > >Bioperl-l at lists.open-bio.org > >http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > ---------------------------------------- This mail sent through www.mywaterloo.ca From vrramnar at student.cs.uwaterloo.ca Thu Jul 20 19:18:27 2006 From: vrramnar at student.cs.uwaterloo.ca (vrramnar at student.cs.uwaterloo.ca) Date: Thu, 20 Jul 2006 19:18:27 -0400 Subject: [Bioperl-l] SNP reference file download Message-ID: <1153437507.44c00f43b53d4@www.nexusmail.uwaterloo.ca> Hello All, I was wondering if anyone knew how to download an entire SNP reference file from NCBI?? Or even downloading the sequence data for a particular SNP. I know how to do this via Bio::DB::GenBank, Bio::DB::SwissP, etc.. when referring to NM_##### but when I try to access rs###### files I am unsure of what Bio::DB to point to, if there is one. For example, if I had the accession number: rs4986950 How could I retrieve NCBI's entire reference file for this SNP record OR just the SNP sequence relating to this accession number. 
Any help on this subject would greatly be appreciated,

Rohan

----------------------------------------
This mail sent through www.mywaterloo.ca

From cjfields at uiuc.edu Fri Jul 21 00:51:30 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 20 Jul 2006 23:51:30 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44C00805.7090403@sendu.me.uk>
References: <001001c6ac2f$02c8c3f0$15327e82@pyrimidine> <44C00805.7090403@sendu.me.uk>
Message-ID: <9777D0A4-5340-4367-879D-F9125393B4FD@uiuc.edu>

> I didn't actually mean a stored file (but that would be possible with a
> tied hash or something: DB_File, just like flatfile), but an in-memory
> one for use during the course of program execution. Stored file would
> probably be dangerous because you wouldn't know if the data has become
> stale or not - and checking to see if it wasn't would defeat the point.

Okay, that wouldn't be a problem. I currently use in-memory caches to hold NCBI history information and ELink information for EUtilities. It would just be a matter of doing the same for Bio::DB::Taxonomy.

...

> entrez already parses through LineageEx to build the classification
> array. flatfile walks up all the parents to do the same. Having the
> information isn't the issue. We have the information. The methods
> genus() and species() need to work with the genbank fileformat, that is
> the problem.

The original purpose for Bio::Species was a simple object to hold taxonomic information. This object was then used in an attempt to hold the basic organism information (scientific name, common name, lineage information, etc) contained in a RichSeq file, like GenBank, EMBL, SwissProt, etc. The problem: trying to determine which term in the lineage corresponds to which rank and what part of the organism's scientific name is the genus, the species, and so on based solely on the data in the file, which comes down to a best-guess scenario for many organisms.
It does work, but not equally well for all RichSeq files, not for every organism, and definitely not all the time.

So, yes, genus(), species(), binomial(), and other methods are present, but one must realize that parsing out the data into the appropriate object data using the various get/sets, with the obvious exceptions, is not the best way. Unless... you incorporate information available only outside the actual file itself (i.e. NCBI Taxonomy information). This is where Bio::Taxonomy seems to come along, as it's not species-specific (it can represent any rank) and is also DB-aware. Though Bio::Species was originally going to delegate all its data to Bio::Taxonomy::Node, I think the purpose was to eventually replace Bio::Species.

So, my question is, why not use a Bio::Taxonomy::Node-like class initially to contain the appropriate data for a GenBank file (just for read/write purposes)? This object, since it implements Bio::Taxonomy::NodeI, is also DB-aware and thus, if set up with a database could also get/set the appropriate object data correctly using the lineage data.

So, for instance, if I called

$species = $seq->species();

and wanted the classification, scientific_name(), common_name, and other information that is gleaned from the file, then there's no need for a lookup. Once you cross into the bounds of:

print $species->species();
print $species->genus();

then there's trouble, since we're working straight from the file (i.e. parsing is mainly correct, but still guesswork and sometimes wrong). But what if you could do something like this:

my $db = Bio::DB::Taxonomy->new(-source => 'entrez');

# normally not needed as this is set by default internally, but as a demo here...
$species->db_handle($db);

# reset the appropriate data (genus, species, etc) based on Entrez tax data
$species->reset_data();  # this method, BTW, doesn't exist yet but should
                         # be easy to implement
print $species->species();
my $parent = $species->get_Parent_Node;
my @child = $species->get_Children_Nodes;

...and so on

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign

From prabubio at gmail.com Fri Jul 21 02:17:41 2006
From: prabubio at gmail.com (Prabu R)
Date: Fri, 21 Jul 2006 11:47:41 +0530
Subject: [Bioperl-l] Blast Output Parsing
In-Reply-To: <002b01c6ac22$5d75e1f0$15327e82@pyrimidine>
References: <002b01c6ac22$5d75e1f0$15327e82@pyrimidine>
Message-ID:

It works great

Thanks a lot Mr.Chris.

R. Prabu

On 7/20/06, Chris Fields wrote:
>
> Grab the HSPs, then use get_aln() to generate a Bio::SimpleAlign object.
> You can then use Bio::AlignIO to generate the alignment output if needed, or
> use the Bio::SimpleAlign methods to get what you want.
>
> http://www.bioperl.org/wiki/HOWTO:Beginners
>
> http://www.bioperl.org/wiki/HOWTO:SearchIO
>
> http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/SimpleAlign.html
>
> Chris
>
> > -----Original Message-----
> > From: bioperl-l-bounces at lists.open-bio.org
> > [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Prabu R
> > Sent: Thursday, July 20, 2006 11:02 AM
> > To: bioperl-l at lists.open-bio.org
> > Subject: [Bioperl-l] Blast Output Parsing
> >
> > Dear All!
> >
> > I am now trying to parse a Blast output using PERL.
> >
> > I have to extract each alignment and have to parse the alignment. I mean, I
> > have to check whether a particular part of the given sequence got aligned
> > 100%.
> >
> > Anybody please tell me what module in PERL I have to use for getting this.
> >
> > I've tried Bio::SearchIO. But I didnt get any method to get the
> > alignment.
> >
> > Kindly help.
> >
> > Thanks,
> > R.
Prabu
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >

--
"Every noble work is at first impossible." - Thomas Carlyle

From mh6 at sanger.ac.uk Fri Jul 21 05:04:57 2006
From: mh6 at sanger.ac.uk (Michael Han)
Date: Fri, 21 Jul 2006 10:04:57 +0100
Subject: [Bioperl-l] PAML parser
Message-ID: <44C098B9.4090003@sanger.ac.uk>

Hi,

I have some questions about the PAML parser (Bio::Tools::Phylo::PAML in CVS HEAD). Maybe some of you could help.

If you call next_result, $self->_parse_summary might be called, which loops over $self->_readline .

Later in next_result when "while (defined ($_=$self->_readline))" is used isn't the filepointer/filehandle already at the end of the output file and should return undef breaking the parsing?

I added a crude seek($self->{_filehandle},0,0) after the _parse_summary and it seemed to work, but I wonder if I missed something obvious.

thanks,

Mike

From cjfields at uiuc.edu Fri Jul 21 08:22:01 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Fri, 21 Jul 2006 07:22:01 -0500
Subject: [Bioperl-l] PAML parser
In-Reply-To: <44C098B9.4090003@sanger.ac.uk>
References: <44C098B9.4090003@sanger.ac.uk>
Message-ID:

Normally when you parse a report you use a loop to iterate through results:

while (my $result = $parser->next_result) {
    # do work here
}

So returning undef is necessary to end the loop. This type of loop construct is common in BioPerl (and in Perl in general).

There is a HOWTO for PAML:

http://www.bioperl.org/wiki/HOWTO:PAML

Chris

On Jul 21, 2006, at 4:04 AM, Michael Han wrote:

> Hi,
>
> I have some questions about the PAML parser
> (Bio::Tools::Phylo::PAML in CVS HEAD). Maybe some of you could help.
>
> If you call next_result, $self->_parse_summary might be called,
> which loops over $self->_readline .
> > Later in next_result when "while (defined ($_=$self->_readline))" > is used isn't the filepointer/filehandle > already at the end of the output file and should return undef > breaking the parsing? > > I added a crude seek($self->{_filehandle},0,0) after the > _parse_summary and it seemed to work, but I wonder if I missed > something obvious. > > thanks, > > Mike > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From cjfields at uiuc.edu Fri Jul 21 11:50:20 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Fri, 21 Jul 2006 10:50:20 -0500 Subject: [Bioperl-l] SNP reference file download In-Reply-To: <1153437507.44c00f43b53d4@www.nexusmail.uwaterloo.ca> Message-ID: <000901c6acdd$5f38ddb0$15327e82@pyrimidine> You'll need the latest code from CVS; you could try (the highly experimental) Bio::DB::EUtilities to get the raw flatfile XML data, then pass everything through Bio::ClusterIO. Currently there isn't tempfile, file, or filehandle support for the EUtilities but I plan on adding this soon. You could also pipe STDOUT from one SNP retrieval script into STDIN for the ClusterIO. BTW, the EFetch object below accepts an array reference of primary IDs if you want to use them instead, so you don't need to run an ESearch query first. To do this you'll need to set the database parameter (-db => 'snp'); the database from the ESearch query is passed to EFetch via the Cookie object. 
Chris

use Bio::DB::EUtilities;
use Bio::ClusterIO;

# save XML to tempfile for read/write
open my $XMLDATA, '+>', 'tempfile.xml'
    or die "Cannot open tempfile.xml: $!";

# ESearch for term, place data in search history
my $esearch = Bio::DB::EUtilities->new(-eutil      => 'esearch',
                                       -term       => 'dihydroorotase',
                                       -db         => 'snp',
                                       -usehistory => 'y');

$esearch->get_response;

print STDERR "Count: ", $esearch->count, "\n";

# efetch is default EUtility
my $efetch = Bio::DB::EUtilities->new(-cookie  => $esearch->next_cookie,
                                      -rettype => 'flt'); # SNP flatfile

print $XMLDATA $efetch->get_response->content;

seek ($XMLDATA, 0, 0); # don't forget to rewind...

my $cio = Bio::ClusterIO->new(-format => 'dbsnp',
                              -fh     => $XMLDATA);

# $snp is a Bio::Variation::SNP object, see perldoc for methods
while (my $snp = $cio->next_cluster) {
    print "ID : ", $snp->id, "\n";
}

close $XMLDATA;

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org
> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of
> vrramnar at student.cs.uwaterloo.ca
> Sent: Thursday, July 20, 2006 6:18 PM
> To: bioperl-l at lists.open-bio.org
> Subject: [Bioperl-l] SNP reference file download
>
>
> Hello All,
>
> I was wondering if anyone knew how to download an entire SNP reference
> file from NCBI?? Or even downloading the sequence data for a particular SNP.
>
> I know how to do this via Bio::DB::GenBank, Bio::DB::SwissP, etc.. when
> referring to NM_##### but when I try to access rs###### files I am unsure
> of what Bio::DB to point to, if there is one.
>
> For example, if I had the accession number: rs4986950 How could I retrieve
> NCBI's entire reference file for this SNP record OR just the SNP sequence
> relating to this accession number.
> > Any help on this subject would greatly be appreciated, > > Rohan > > > ---------------------------------------- > This mail sent through www.mywaterloo.ca > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Sun Jul 23 15:09:48 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Sun, 23 Jul 2006 14:09:48 -0500 Subject: [Bioperl-l] obo_parser.t test warnings Message-ID: Hilmar, Sohel, Didn't know who to notify, so sorry in advance about cross-posting this to the list. I was running through cleaning up some bugs and found that obo_parser.t is throwing a ton of warnings: bayou-75:~/Chris/Bioperl/bioperl-live natashacapell$ perl -I. -w t/ obo_parser.t 1..40 "my" variable $val masks earlier declaration in same scope at Bio/ OntologyIO/obo.pm line 592. "my" variable $qh masks earlier declaration in same scope at Bio/ OntologyIO/obo.pm line 592. Use of uninitialized value in string eq at Bio/OntologyIO/obo.pm line 239, line 13. ... Good news: all tests pass! Cheers! Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From cjfields at uiuc.edu Sun Jul 23 16:53:32 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Sun, 23 Jul 2006 15:53:32 -0500 Subject: [Bioperl-l] Bio::Species/Bio::Taxonomy changes Message-ID: <7A7BF302-D84E-427B-ABC0-3EB4D19B33E2@uiuc.edu> Sendu, Hilmar, et al, I was looking through SeqIO::genbank and though I would bring up a couple of things to think about re: GenBank Taxonomy information. This is how NCBI defines the names used for SOURCE and ORGANISM according to the latest GenBank release notes: SOURCE - Common name of the organism or the name most frequently used in the literature. Mandatory keyword in all annotated entries/one or more records/includes one subkeyword. 
ORGANISM - Formal scientific name of the organism (first line) and taxonomic classification levels (second and subsequent lines). Mandatory subkeyword in all annotated entries/two or more records.

According to their sample file page (http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html), the SOURCE is this:

Free-format information including an abbreviated form of the organism name, sometimes followed by a molecule type. (See section 3.4.10 of the GenBank release notes for more info.)

The SOURCE can also include the organelle and also may include additional information, such as an abbreviated name and a common name in parentheses.

...
SOURCE      Saccharomyces cerevisiae (baker's yeast)
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes;
            Saccharomycetales; Saccharomycetaceae; Saccharomyces.
...

Setting scientific_name() isn't a problem; according to the above definition, it is the full name on the ORGANISM line. The lineage (or classification() array) is also straightforward. The common_name(), though, as used by Bio::SeqIO::genbank, is the entire SOURCE line (not just the abbreviated name, but the name and everything else). No additional parsing is performed on it. write_seq() also seems to do the wrong thing when rebuilding the SOURCE line, as the method writes the subspecies to the line.

I plan on using Bio::SeqIO::genbank as a guinea pig of sorts to try using Bio::Taxonomy::Node objects instead of Bio::Species, then get the parsing for these lines corrected and simplified. Essentially, the way NCBI describes it, the main name on the line is actually the free-form abbreviated name, the name in parentheses is the common name (optionally present), and the organelle precedes all of these if present.
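[Editorial sketch] The SOURCE layout described above (optional organelle first, then the free-form abbreviated name, then an optional common name in parentheses) can be illustrated with a few lines of plain Perl. This is only a minimal sketch under stated assumptions: parse_source_line() is a hypothetical helper, not part of Bio::SeqIO::genbank, and the organelle list is an illustrative guess rather than the set NCBI actually uses.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of Bio::SeqIO::genbank.
# The organelle list is illustrative only, not NCBI's full vocabulary.
my @ORGANELLES = ('mitochondrion', 'chloroplast', 'plastid', 'kinetoplast');

# Split a GenBank SOURCE line into organelle, abbreviated name and
# (optional) common name, following the layout described in the message.
sub parse_source_line {
    my ($line) = @_;
    my %parsed = (
        organelle        => undef,
        abbreviated_name => undef,
        common_name      => undef,
    );

    # drop the SOURCE keyword and surrounding whitespace
    (my $text = $line) =~ s/^\s*SOURCE\s+//;
    $text =~ s/\s+$//;

    # an organelle, if present, precedes everything else
    for my $org (@ORGANELLES) {
        if ($text =~ s/^\Q$org\E\s+//i) {
            $parsed{organelle} = $org;
            last;
        }
    }

    # a trailing "(...)" is taken to be the common name
    if ($text =~ s/\s*\(([^()]+)\)\s*$//) {
        $parsed{common_name} = $1;
    }

    # whatever remains is treated as the abbreviated name
    $parsed{abbreviated_name} = $text;
    return \%parsed;
}

my $src = parse_source_line("SOURCE      Saccharomyces cerevisiae (baker's yeast)");
for my $key (qw(organelle abbreviated_name common_name)) {
    my $value = defined $src->{$key} ? $src->{$key} : '(none)';
    print "$key: $value\n";
}
```

A real parser would still need a taxonomy lookup to decide genus and species reliably; the sketch only separates the three pieces that the SOURCE line itself makes explicit.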
I want to try getting common_name() to match the common name found for taxonomy (baker's yeast) rather than have it be a simple container, add an abbreviated_name() method for the name container for the SOURCE line, and have the organelle() method actually be used if an organelle is present (it doesn't seem to be set at the moment in SeqIO::genbank).

Right now, I have NO idea how EMBL, DDBJ, other formats deal with organism info; I would think that the main three (GenBank/EMBL-SwissProt/DDBJ) handle them similarly...(Famous Last Words)

I also propose (I'll probably get yelled at here) NOT actively supporting additional parsing of species, subspecies, etc directly from a file w/o a DB lookup. As in, leave species, subspecies, genus parsing from the flatfile as is (no longer support it) or remove it completely and leave them unset. I haven't looked, but I have a strong feeling that the species parsing in Bio::SeqIO is different from format to format. It really seems like more trouble than it's worth to maintain this, especially as there is perfectly valid Taxonomy database information available either locally using a flatfile or via Entrez.

If people want to have reliable $species->species or $species->genus for taxonomy information, they will need to have the db_handle() set for the Bio::Taxonomy::Node object and have a Node-based method to reset species, genus, etc to the tax database information (maybe reset_taxon or something along those lines).

Okay, rambled on enough. Any thoughts?

Christopher Fields Postdoctoral Researcher Lab of Dr.
Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From hlapp at gmx.net Sun Jul 23 19:40:45 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Sun, 23 Jul 2006 19:40:45 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BF86AF.8080408@sendu.me.uk> References: <44BBBB69.6000906@sendu.me.uk> <44BF86AF.8080408@sendu.me.uk> Message-ID: On Jul 20, 2006, at 9:35 AM, Sendu Bala wrote: > I'll describe all the changes I've now made and if no-one complains > I'll > commit. (I've also made these notes into bug 2047 for easier reference > in the future.) > > Bio::DB::Taxonomy::flatfile > --------------------------- > [...] > > BEHAVIOUR-CHANGE: flatfile used to store within the nodes it makes the > division as a three letter code, like 'PRI'. However, for consistency > with entrez and the scientific_name() of the node the division is > supposed to correspond to, it is now stored as the full name, like > 'Primates'. What about adding a method division_code() which would return the 3- letter abbreviation? The abbreviation may be needed by flat-file writers, so it may be handy to have in some cases. > > The names->id solution also stores the artificially uniqued names like > 'Craniata ', allowing you for the first time to retrieve the > correct id. Previously the search would have simply failed completely. > > The names->id solution now handles nodes with scientific names of 'xyz > (class)', allowing you to retrieve the id with both get_taxonids > ('xyz') > and get_taxonids('xyz (class)'). Previously only the latter would > work. Should angle brackets be allowed too? > > NOTE: the previous 2 changes (and the issues with entrez, see below) > make flatfile better at searching the taxonomy database than entrez > module or the website, both in terms of speed and completeness of > results. 
> > BEHAVIOUR-CHANGE: The scientific name field isn't touched in any way, > always being sent directly to Bio::Taxonomy::Node->new(-name => > $untouched) Maybe there should also be a -names parameter which accepts a hash reference with keys being the kind of name (scientific, common, etc) and the values being array references with the set of names of that kind? > or the $node->classification() array. Bio::Taxonomy::Node shouldn't have this attribute. It is legacy brought over from a flawed (because flat) object model in Bio::Species. > [...] > > Bio::DB::Taxonomy::entrez > ------------------------- > > # Bug-fixes > Special characters like ", ( and ) in the input query string to > get_taxonid() result in the failure or inaccuracy of the search. These > characters are now removed prior to submission, allowing for correct > search results. > API-CHANGE: entrez has always been able to return multiple ids that > match a single input name, so I've renamed get_taxonid() to > get_taxonids() and it returns an array of ids in list context. It > returns one of the ids in scalar context. For backward compatibility, > *get_taxonid = \&get_taxonids. Sounds good to me. > NOTE: entrez modules (and website) cannot cope with '' > in the > query, failing searches like 'Craniata '. For this > reason, if > get_taxonids() is given a query with '' it will immediately > return undefined, saving a pointless website access. If there is a 'next-best-thing' that is still semantically compatible with the API documentation, I would do that. In this case, if there is a in the query the entrez module should strip it and automatically use the rest for searching. If indeed multiple IDs match there should be a warning to inform the user that entrez cannot use the notation to limit the query results. In fact, you might as well provide an option to enable an automatic check for the correct branch for each ID if multiple ones are returned. 
I.e., if this option is enabled, the module would automatically query the parent nodes to see if is in the lineage, and if not will remove the respective ID from the result set. The reason you may want to make it optional is because it potentially costs time. (but in reality I'm not sure why a client will not want to enable the option - so maybe this should even be default) > If you want the id > of 'Craniata ' you must search for 'Craniata', then get the > node for each returned id to see which one has a parent node with a > scientific_name() or common_names() case-insensitive matching to > 'chordata'. Yep, see above. The more burden you can shield from the user the better. > [...] > Bio::Taxonomy::Node > ------------------- > [...] > classification() has a proper solution to finding the classification > when the array wasn't manually set. > > # Improvements > BEHAVIOUR-CHANGE: node_name() used to be an alias to name > ('common'). Now > it is an alias to name('scientific'). > NOTE: node_name is what is set when ->new(-name => $name) is set, so > flatfile and entrez and user-created nodes now implicitly associate > the > name of the node they create with its scientific name. I'm not even sure node_name() should just be deprecated. The methods falsely suggests that there is only a single and definitive name for the taxon node. In NCBI reality, this is only true for the scientific name of the node. In real reality, many nodes have multiple scientific names - taxonomy isn't static and therefore the scientific naming of nodes isn't either. > [...] > Thanks for the work, all other changes sound great. Thanks also to Chris for assisting! Looks like this is in much better shape now than before. 
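For illustration, the lineage check discussed above could look roughly like the following. This is only a sketch in plain Perl over a made-up in-memory table: the %nodes data, the ids, and the helper names lineage_names() and ids_for_unique_name() are hypothetical and are not part of Bio::DB::Taxonomy. It also assumes the disambiguating hint is written in angle brackets after the bare name.

```perl
use strict;
use warnings;

# Hypothetical toy taxonomy: id => { name, parent }. All ids are invented.
my %nodes = (
    1  => { name => 'root',        parent => undef },
    10 => { name => 'Chordata',    parent => 1 },
    11 => { name => 'Craniata',    parent => 10 },
    20 => { name => 'Brachiopoda', parent => 1 },
    21 => { name => 'Craniata',    parent => 20 },
);

# Return the lower-cased names of all ancestors of a node id.
sub lineage_names {
    my ($id) = @_;
    my @names;
    my $parent = $nodes{$id}{parent};
    while (defined $parent) {
        push @names, lc $nodes{$parent}{name};
        $parent = $nodes{$parent}{parent};
    }
    return @names;
}

# Split 'Name <hint>' into name and hint, look up every id for the bare
# name, then keep only the ids whose lineage contains the hint.
sub ids_for_unique_name {
    my ($query) = @_;
    my ($name, $hint) = $query =~ /^\s*(.*?)\s*(?:<([^>]+)>)?\s*$/;
    my @ids = sort { $a <=> $b }
              grep { lc $nodes{$_}{name} eq lc $name } keys %nodes;
    if (defined $hint) {
        my $want = lc $hint;
        @ids = grep { my $id = $_; grep { $_ eq $want } lineage_names($id) } @ids;
    }
    # all ids in list context, one of them in scalar context
    return wantarray ? @ids : $ids[0];
}

my @all  = ids_for_unique_name('Craniata');
my @only = ids_for_unique_name('Craniata <chordata>');
print "all: @all\n";
print "filtered: @only\n";
```

Each ambiguous id costs one walk up the tree here; against a remote database that would be one or more extra requests per id, which is the time cost mentioned above.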
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at gmx.net Sun Jul 23 19:44:23 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Sun, 23 Jul 2006 19:44:23 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BD147A.9020103@sendu.me.uk> References: <003201c6aa81$01db9a30$15327e82@pyrimidine> <44BD147A.9020103@sendu.me.uk> Message-ID: <6701FDC3-7630-48EB-9B21-5BB5C566424E@gmx.net> On Jul 18, 2006, at 1:03 PM, Sendu Bala wrote: > > [regarding changes to Bio::Taxonomy::Node] > > Actually, I'm really strongly leaning toward getting rid of the > following methods and new() options (and giving up entirely on being > able to keep 'sapiens' somewhere): > > -organelle, organelle() > -division, division() > -sub_species, sub_species() > -variant, variant() > species(), validate_species_name() > genus() > binomial() > > As far as I can see none of these methods have any place in a generic > Node class. I agree. Some of them are a special case for genbank files (organelle () etc), and the rest is legacy from Bio::Species. -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at gmx.net Sun Jul 23 20:48:22 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Sun, 23 Jul 2006 20:48:22 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <9777D0A4-5340-4367-879D-F9125393B4FD@uiuc.edu> References: <001001c6ac2f$02c8c3f0$15327e82@pyrimidine> <44C00805.7090403@sendu.me.uk> <9777D0A4-5340-4367-879D-F9125393B4FD@uiuc.edu> Message-ID: <21C7763A-B8FF-4E91-BAEC-E5F79E462DD3@gmx.net> On Jul 21, 2006, at 12:51 AM, Chris Fields wrote: > my $db = Bio::DB::Taxonomy->new(-source => 'entrez'); > > # normally not needed as this is set by default internally, but as a > demo here... 
> $species->db_handle($db); > > # reset the appropriate data (genus, species, etc) based on Entrez > tax data > $species->reset_data(); # this method, BTW, doesn't exist yet but > should be easy to implement Don't call this reset_data() as it may be misleading (usually reset() means to revert into a native or original state). Instead, you would use fetch_from_db() or something. However, it seems redundant to me to begin with. If we ignore for a second that the return value in the following isn't exactly compatible, why would you not just call $species = $db->get_Taxonomy_Node(-taxonid => $species->ncbi_taxid); So I think more than anything else, this should be made to work, and you would have a more seamless interface. > Short and sweet summary: > > Sendu volunteered making changes to Bio::Taxonomy::Node and related > modules; > we disagreed on exactly what changes should be made. Sendu wanted a > stripped-down version of Bio::Taxonomy::Node; I wanted one which would > support similar methods as in Bio::Species. Bio::Species should be considered legacy; I think it is flawed as an object model because it imposes a flat view on something which in reality is only a node in a tree and not flat at all. The only real need for the flat view came from the desire to write sequence files - for all other purposes the classification() etc attributes are useless anyway. I.e., binomial() and common_name() (corresponding to scientific_name () and names('common')) are the only real useful attributes, the rest is baggage for writing sequence files. The baggage should not be passed on to a better model ... Instead, there should be a separate module (in essence a Bio::Species factory) which can translate a Bio::Taxonomy::Node into a Bio::Species object - if the rank is 'species' or below. 
Alternatively, you could have a Bio::Taxonomy::SpeciesNode object which implements both APIs and can be initialized with either a Bio::Taxonomy::Node instance, or the combination of a Bio::Species and a db handle. At any rate, I think Bio::Taxonomy::Node should be stripped of legacy methods that are only there to achieve Bio::Species compatibility. > > I suggested have a common interface module, one for Node and > another for > Species; both implement the same interface methods (NodeI maybe), > so you > could use either a bare-bones Node or a full-fledged Species > object. I then > suggested this new version of Species could replace Bio::Species. > We could > worry about which one to use for Bio::DB::Taxonomy* later. I'm not following here... How would this look like? What would the API (s) be? > > We both agreed. Everybody's happy. Happiness is great, so maybe you shouldn't bother about me not following... > I still plan on switching Bio::DB::Taxonomy::entrez to use > Bio::DB::EUtilities at some point Wouldn't that rather be Bio::DB::Taxonomy::eutil? > I may > add a method for retrieving tax data based on protein/nucleotide > sequence > primary ID and relevant sequence database, so you could directly > retrieve > the relevant TaxID w/o parsing sequences directly for them. This > would > mainly be useful if you gather GIs from a BLAST search, for instance. > > Anyway, I could add this in then base class Bio::DB::Taxonomy > directly so > one could used the retrieved TaxIDs for flat-file or entrez > searches; this > requires, of course, access to the remote Entrez database (it would > use > ELink). Would that be of interest? If you add the API methods for this to the base class (which in this case is close in concept to an interface, much like Bio/SeqIO.pm), then make clear that not every database will allow you to implement this. 
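As a sketch of that last point, a base class can declare the method and fail loudly unless a backend overrides it. Everything here is hypothetical plain Perl: the package names Sketch::Taxonomy, Sketch::Taxonomy::Remote and Sketch::Taxonomy::Flatfile and the method taxonids_for_seq_ids() are invented for illustration and are not BioPerl API, and the lookup table stands in for a remote query.

```perl
use strict;
use warnings;

package Sketch::Taxonomy;    # hypothetical interface-like base class
sub new { my ($class, %args) = @_; return bless {%args}, $class }

# Declared here so callers can rely on the name, but not every
# backend can implement it; the default just reports that.
sub taxonids_for_seq_ids {
    my ($self) = @_;
    die ref($self) . " does not implement taxonids_for_seq_ids()\n";
}

package Sketch::Taxonomy::Remote;    # backend that can do the lookup
our @ISA = ('Sketch::Taxonomy');

sub taxonids_for_seq_ids {
    my ($self, @seq_ids) = @_;
    # A real backend would query a remote service; this sketch uses a
    # made-up lookup table passed to the constructor.
    return map  { $self->{table}{$_} }
           grep { exists $self->{table}{$_} } @seq_ids;
}

package Sketch::Taxonomy::Flatfile;  # backend without that capability
our @ISA = ('Sketch::Taxonomy');

package main;

# Sequence ids 1001/1002/9999 are invented for the example.
my $remote = Sketch::Taxonomy::Remote->new(
    table => { 1001 => 9606, 1002 => 4932 },
);
my @taxids = $remote->taxonids_for_seq_ids(1001, 1002, 9999);
print "taxids: @taxids\n";

my $flat = Sketch::Taxonomy::Flatfile->new;
my $ok = eval { $flat->taxonids_for_seq_ids(1001); 1 };
print "flatfile: $@" unless $ok;
```

Documenting the method in the base class while letting the default die keeps the calling code uniform; a caller that must work with any backend can wrap the call in eval as shown.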
>
>            |------Node
>   NodeI----|
>            |------Species
>
> Another option would be to have Bio::Taxonomy::Node itself stripped
> down, then have another class (Bio::Taxonomy::Species) inherit methods
> from it and also implement additional methods (genus(), species(), etc).

I think this would be the way to go. I.e.,

           |------Node
  NodeI----|
           |-|
             |----SpeciesNode
Species----|

This way the NodeI interface and its direct implementors are kept free
of legacy.

-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From cjfields at uiuc.edu Sun Jul 23 20:43:45 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Sun, 23 Jul 2006 19:43:45 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <6701FDC3-7630-48EB-9B21-5BB5C566424E@gmx.net>
References: <003201c6aa81$01db9a30$15327e82@pyrimidine> <44BD147A.9020103@sendu.me.uk> <6701FDC3-7630-48EB-9B21-5BB5C566424E@gmx.net>
Message-ID: <5F6027E0-A504-4019-8DAB-C50DF9EB6E18@uiuc.edu>

As an aside, the 'source' seqfeature in a GenBank file contains some
of the following information as tags; that's where the NCBI tax ID is
taken from in Bio::SeqIO::genbank:

FEATURES             Location/Qualifiers
     source          1..814
                     /organism="Porterinema fluviatile"
                     /organelle="plastid:chloroplast"
                     /mol_type="genomic DNA"
                     /strain="SAG 124.79"
                     /db_xref="taxon:246123"
                     /country="Germany"
...

So, variant(), organelle(), and ncbi_taxid() could all be set from the
same point in Bio::SeqIO::genbank.

I suggested an option to Sendu, but I'd like to hear your thoughts on
this since this will possibly affect bioperl-db. We could have two
Node-like Taxonomy objects using a common interface class
(Bio::Taxonomy::NodeI): Bio::Taxonomy::Node (stripped down version),
and Bio::Taxonomy::Species (the sequence-based NodeI-implementing
object, which would retain the other Bio::Species-like methods).
Bio::Taxonomy::Species would act sort of as an 'entry point' for Bio::Taxonomy from sequences; moving up or down the tax node hierarchy gets Tax::Node objects, unless you are specifically at a 'species'-ranked node (though this could be just a Tax::Node as well). BTW, I have managed to get Bio::SeqIO::genbank switched over to Bio::Taxonomy::Node (er... Bio::Taxonomy::Species); all tests pass. I was quite surprised how easy it was. It shouldn't be too hard to get a NodeI/Node/Species class hierarchy set up if everybody thinks it's worth it. Then we could deprecate Bio::Species! Chris On Jul 23, 2006, at 6:44 PM, Hilmar Lapp wrote: > > On Jul 18, 2006, at 1:03 PM, Sendu Bala wrote: > >> >> [regarding changes to Bio::Taxonomy::Node] >> >> Actually, I'm really strongly leaning toward getting rid of the >> following methods and new() options (and giving up entirely on being >> able to keep 'sapiens' somewhere): >> >> -organelle, organelle() >> -division, division() >> -sub_species, sub_species() >> -variant, variant() >> species(), validate_species_name() >> genus() >> binomial() >> >> As far as I can see none of these methods have any place in a generic >> Node class. > > I agree. Some of them are a special case for genbank files (organelle > () etc), and the rest is legacy from Bio::Species. > > -hilmar > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. 
Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign

From hlapp at gmx.net Sun Jul 23 20:58:32 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sun, 23 Jul 2006 20:58:32 -0400
Subject: [Bioperl-l] Bio::Species/Bio::Taxonomy changes
In-Reply-To: <7A7BF302-D84E-427B-ABC0-3EB4D19B33E2@uiuc.edu>
References: <7A7BF302-D84E-427B-ABC0-3EB4D19B33E2@uiuc.edu>
Message-ID:

On Jul 23, 2006, at 4:53 PM, Chris Fields wrote:

> I also propose (I'll probably get yelled at here) NOT actively
> supporting additional parsing of species, subspecies, etc directly
> from a file w/o a DB lookup. As in, leave species, subspecies, genus
> parsing from the flatfile as is (no longer support it) or remove it
> completely and leave them unset.

Note that most (as in: most used, not most taxa) cases are actually
straightforward. I don't think removing what's there is desirable;
everyone just needs to understand that it will recognize only a
limited number of syntactical variations, and beyond that, if you
want correct taxon attributes you will need a database (be it
flatfile, eutil, whatever) lookup.

> If people want to
> have reliable $species->species or $species->genus for taxonomy
> information, they will need to have the db_handle() set for the
> Bio::Taxonomy::Node object and have a Node-based method to reset
> species, genus, etc to the tax database information (maybe
> reset_taxon or something along those lines).

That's what I've been saying all along.
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From cjfields at uiuc.edu Sun Jul 23 23:30:07 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Sun, 23 Jul 2006 22:30:07 -0500 Subject: [Bioperl-l] Bio::Species/Bio::Taxonomy changes In-Reply-To: References: <7A7BF302-D84E-427B-ABC0-3EB4D19B33E2@uiuc.edu> Message-ID: <28D3470B-DA8F-4C41-96C7-F0D0DE89BAEE@uiuc.edu> On Jul 23, 2006, at 7:58 PM, Hilmar Lapp wrote: > > On Jul 23, 2006, at 4:53 PM, Chris Fields wrote: > >> I also propose (I'll probably get yelled at here) NOT actively >> supporting additional parsing of species, subspecies, etc directly >> from a file w/o a DB lookup. As in, leave species, subspecies, genus >> parsing from the flatfile as is (no longer support it) or remove it >> completely and leave them unset. > > Note that most (as in: most used, not most taxa) cases are actually > straightforward. I don't think removing what's there is desirable, > just everyone needs to understand that it will recognize only a > limited number of syntactical variations, and beyond that if you > want correct taxon attributes you will a database (be it flatfile, > eutil, whatever) lookup. Aha! We seem to agree on that... >> If people want to >> have reliable $species->species or $species-genus for taxonomy >> information, they will need to have the db_handle() set for the >> Bio::Taxonomy::Node object and have an Node-based method to reset >> species, genus, etc to the tax database information (maybe >> reset_taxon or something along those lines). > > That's what I've saying all along. 
> > -hilmar > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== I thought you had mentioned something about this a few months back on EMBL format issues with organism data. Anyway, I don't think it was from anybody disagreeing with you as much as it was one of the project priorities that sort of got lost in the shuffle. I'm sure Sendu will like having a bit of freedom with Bio::Taxonomy::Node. Anyway, I'll do what I can within reason; I have to leave next weekend for a 5-day conference. Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From bix at sendu.me.uk Mon Jul 24 04:21:55 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Mon, 24 Jul 2006 09:21:55 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <21C7763A-B8FF-4E91-BAEC-E5F79E462DD3@gmx.net> References: <001001c6ac2f$02c8c3f0$15327e82@pyrimidine> <44C00805.7090403@sendu.me.uk> <9777D0A4-5340-4367-879D-F9125393B4FD@uiuc.edu> <21C7763A-B8FF-4E91-BAEC-E5F79E462DD3@gmx.net> Message-ID: <44C48323.5060704@sendu.me.uk> Hilmar Lapp wrote: > On Jul 21, 2006, at 12:51 AM, Chris Fields wrote: > >> my $db = Bio::DB::Taxonomy->new(-source => 'entrez'); >> >> # normally not needed as this is set by default internally, but as a >> demo here... >> $species->db_handle($db); >> >> # reset the appropriate data (genus, species, etc) based on Entrez >> tax data >> $species->reset_data(); # this method, BTW, doesn't exist yet but >> should be easy to implement > > Don't call this reset_data() as it may be misleading (usually reset() > means to revert into a native or original state). Instead, you would > use fetch_from_db() or something. > > However, it seems redundant to me to begin with. 
> If we ignore for a second that the return value in the following
> isn't exactly compatible, why would you not just call
>
> $species = $db->get_Taxonomy_Node(-taxonid => $species->ncbi_taxid);

If Bio::Species was a Bio::Taxonomy, and we had FactoryI implementing
classes or similar, we would say:

$species = $factory->fetch(-taxon_id => $species->ncbi_taxid);

> Instead, there should be a separate module (in essence a Bio::Species
> factory) which can translate a Bio::Taxonomy::Node into a
> Bio::Species object - if the rank is 'species' or below.

I don't think a 'translation' module is necessary. Bio::Species can
just be a Bio::Taxonomy.

> At any rate, I think Bio::Taxonomy::Node should be stripped of legacy
> methods that are only there to achieve Bio::Species compatibility.

Yes :)

> I think this would be the way to go. I.e.,
>
>
>            |------Node
>   NodeI----|
>            |-|
>              |----SpeciesNode
> Species----|

Actually, if we're changing the name of the module that Species
interacts with, any existing code needs to be re-written. So why not
just do it properly and have Bio::Species interact with Bio::Taxonomy?

                  |----Bio::Taxonomy
Bio::TaxonomyI----|
                  |----Bio::Species

Or

Bio::TaxonomyI----|----Bio::Taxonomy----|----Bio::Species

Leaving Node completely free to be just a node. This way we don't have
a crufty SpeciesNode there simply for the sake of Bio::Species.
Bio::Species itself provides all the legacy stuff it needs for itself,
while interacting with Nodes via TaxonomyI methods in the 'correct'
way only.
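
A rough sketch of the second layout, written as plain Perl with invented package names (Sketch::TaxonomyI, Sketch::Taxonomy, Sketch::Species) rather than the real Bio:: modules. The method names add_node(), node_names() and name_for_rank(), and the rank-based genus()/species() lookups, are assumptions made for illustration only. The point it tries to show is that the species class derives its legacy-style accessors from the nodes it holds, instead of the nodes carrying them.

```perl
use strict;
use warnings;

package Sketch::TaxonomyI;   # interface: subclasses must provide these
sub add_node   { die "add_node() not implemented\n" }
sub node_names { die "node_names() not implemented\n" }

package Sketch::Taxonomy;    # generic lineage made of plain nodes
our @ISA = ('Sketch::TaxonomyI');
sub new { return bless { nodes => [] }, shift }

# Nodes are bare hashes with only a name and a rank: "just a node".
sub add_node {
    my ($self, %node) = @_;
    push @{ $self->{nodes} }, {%node};
    return $self;    # returned so calls can be chained
}
sub node_names { return map { $_->{name} } @{ $_[0]{nodes} } }

# Helper: name of the first node with the given rank, or undef.
sub name_for_rank {
    my ($self, $rank) = @_;
    my ($node) = grep { $_->{rank} eq $rank } @{ $self->{nodes} };
    return $node ? $node->{name} : undef;
}

package Sketch::Species;     # legacy-style accessors live only here
our @ISA = ('Sketch::Taxonomy');
sub genus   { return $_[0]->name_for_rank('genus') }
sub species { return $_[0]->name_for_rank('species') }

package main;

my $sp = Sketch::Species->new;
$sp->add_node(name => 'Homo',         rank => 'genus')
   ->add_node(name => 'Homo sapiens', rank => 'species');

print join(' > ', $sp->node_names), "\n";
print "genus: ", $sp->genus, "\n";
```

In this sketch species() simply returns the name stored on the species-rank node; how a real implementation would split off an epithet such as 'sapiens' is left open.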
From bix at sendu.me.uk Mon Jul 24 03:58:57 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Mon, 24 Jul 2006 08:58:57 +0100 Subject: [Bioperl-l] Bio::Species/Bio::Taxonomy changes In-Reply-To: <7A7BF302-D84E-427B-ABC0-3EB4D19B33E2@uiuc.edu> References: <7A7BF302-D84E-427B-ABC0-3EB4D19B33E2@uiuc.edu> Message-ID: <44C47DC1.8020503@sendu.me.uk> Chris Fields wrote: > Sendu, Hilmar, et al, > > I was looking through SeqIO::genbank and though I would bring up a > couple of things to think about re: GenBank Taxonomy information. [...] > SOURCE - Common name of the organism or the name most frequently used > in the literature. Mandatory keyword in all annotated entries/one or > more records/includes one subkeyword. [...] > Free-format information including an abbreviated form of the organism > name, sometimes followed by a molecule type. (See section 3.4.10 of > the GenBank release notes for more info.) > > The SOURCE can also include the organelle and also may include > additional information, such as an abbreviated name and a common name > in parentheses. More specifically: (from 3.4.10 ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt) The SOURCE field consists of two parts. The first part is found after the SOURCE keyword and contains free-format information including an abbreviated form of the organism name followed by a molecule type; multiple lines are allowed, but the last line must end with a period. The second part consists of information found after the ORGANISM subkeyword. The formal scientific name for the source organism (genus and species, where appropriate) is found on the same line as ORGANISM. The records following the ORGANISM line list the taxonomic classification levels, separated by semicolons and ending with a period. > The common_name (), though as used by Bio::SeqIO::genbank, is the > entire SOURCE line (not just the abbreviated name, but the name and > everything else). No additional parsing is performed on it. 
> write_seq() also seems to do the wrong thing when rebuilding the > SOURCE line as well as the method writes the subspecies to the line. > > I plan on using Bio::SeqIO::genbank as a guinea pig of sorts to try > using Bio::Taxonomy::Node objects instead of Bio::Species, then get > the parsing for these lines corrected and simplified. Essentially, > the way NCBI describes it, the main name on the line is actually the > free-form abbreviated name, the name in parentheses is the common > name (optionally present), and the organelle precedes all of these if > present. I want to try getting common_name() to match the common > name found for taxonomy (baker's yeast) rather than have it be a > simple container, add an abbreviated_name() method for the name > container for the SOURCE line, and have the organelle() method > actually be used if an organelle is present (it doesn't seem to be > set at the moment in SeqIO::genbank). This is not how I read the specification. Everything on the the same line as 'Source' is free-format text and therefore cannot be parsed. For the purposes of writing out it must be stored as-is, but it serves no other useful purpose. The file also provides the scientific name which can be used to do an accurate database lookup, which in turn gives you access to the common names, like "baker's yeast". On a side note, why would we care about 'organelle' when we're dealing with taxonomy? Why does the NCBI taxonomy db have a slot for organelle? From bix at sendu.me.uk Mon Jul 24 04:45:38 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Mon, 24 Jul 2006 09:45:38 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: References: <44BBBB69.6000906@sendu.me.uk> <44BF86AF.8080408@sendu.me.uk> Message-ID: <44C488B2.5070806@sendu.me.uk> Hilmar Lapp wrote: > On Jul 20, 2006, at 9:35 AM, Sendu Bala wrote: > >> Bio::DB::Taxonomy::flatfile >> --------------------------- >> [...] 
>>
>> BEHAVIOUR-CHANGE: flatfile used to store within the nodes it makes the
>> division as a three letter code, like 'PRI'. However, for consistency
>> with entrez and the scientific_name() of the node the division is
>> supposed to correspond to, it is now stored as the full name, like
>> 'Primates'.
>
> What about adding a method division_code() which would return the
> 3-letter abbreviation?
>
> The abbreviation may be needed by flat-file writers, so it may be
> handy to have in some cases.

As far as I know you can't get the 3-letter version via entrez, so no
other module can really expect to be able to get it, not knowing which
database (flatfile.pm or entrez.pm) the taxonomic information is coming
from.

But of course it would be somewhat harmless to add division_code()
anyway. It might be better done as a -code => 1 option to division()?

>> The names->id solution also stores the artificially uniqued names like
>> 'Craniata <chordata>', allowing you for the first time to retrieve the
>> correct id. Previously the search would have simply failed completely.
>>
>> The names->id solution now handles nodes with scientific names of 'xyz
>> (class)', allowing you to retrieve the id with both get_taxonids('xyz')
>> and get_taxonids('xyz (class)'). Previously only the latter would
>> work.
>
> Should angle brackets be allowed too?

Allowed in what sense? You can indeed search for both
get_taxonids('Craniata <chordata>') [returns a single id] and
get_taxonids('Craniata') [returns multiple ids, one of which is the
previous answer].

> Maybe there should also be a -names parameter which accepts a hash
> reference with keys being the kind of name (scientific, common, etc)
> and the values being array references with the set of names of that
> kind?

Not sure what you mean. name() has that data structure, though you're
not supposed to set its hash ref directly.

>> or the $node->classification() array.
>
> Bio::Taxonomy::Node shouldn't have this attribute.
It is legacy > brought over from a flawed (because flat) object model in Bio::Species. Yes, I agree. >> NOTE: entrez modules (and website) cannot cope with '' >> in the >> query, failing searches like 'Craniata '. For this >> reason, if >> get_taxonids() is given a query with '' it will immediately >> return undefined, saving a pointless website access. > > If there is a 'next-best-thing' that is still semantically compatible > with the API documentation, I would do that. > > In this case, if there is a in the query the entrez > module should strip it and automatically use the rest for searching. > If indeed multiple IDs match there should be a warning to inform the > user that entrez cannot use the notation to limit the > query results. I wouldn't like this. I actually had it working this way initially, but decided that if someone entered 'xyz ' they really didn't want multiple ids, expected to get multiple ids with just 'xyz' and don't want their query made something else and then be warned about it. > In fact, you might as well provide an option to enable an automatic > check for the correct branch for each ID if multiple ones are > returned. I.e., if this option is enabled, the module would > automatically query the parent nodes to see if is in the > lineage, and if not will remove the respective ID from the result > set. The reason you may want to make it optional is because it > potentially costs time. (but in reality I'm not sure why a client > will not want to enable the option - so maybe this should even be > default) I can certainly add that, it seems like a good idea. I don't, however, see any scope for an option at all. What would the option be called? -don't_give_me_the_answer_I_actually_want_to_save_time ? Pointless, imho. If the user queries 'xyz ' with that option, they're just going to have to do for themselves manually what the method would have done for them without that option, in order to get the correct answer. 
It'll be slower that way, if anything. So the option would actually be called -don't_give_me_the_answer_I_actually_want_so_I_can_get_it_myself_a_little_slower (!). >> Bio::Taxonomy::Node >> ------------------- >> [...] >> classification() has a proper solution to finding the classification >> when the array wasn't manually set. >> >> # Improvements >> BEHAVIOUR-CHANGE: node_name() used to be an alias to name >> ('common'). Now >> it is an alias to name('scientific'). >> NOTE: node_name is what is set when ->new(-name => $name) is set, so >> flatfile and entrez and user-created nodes now implicitly associate >> the >> name of the node they create with its scientific name. > > I'm not even sure node_name() should just be deprecated. The methods > falsely suggests that there is only a single and definitive name for > the taxon node. > > In NCBI reality, this is only true for the scientific name of the > node. In real reality, many nodes have multiple scientific names - > taxonomy isn't static and therefore the scientific naming of nodes > isn't either. For the programmer not using any database but just making up his own nodes, I think he needs a node_name() because he may not be thinking about anything fancy or realistic. He just want to give his node a single name that he invents. node_name() seems like the ideal method name to me. From jaynelvallance at hotmail.com Mon Jul 24 05:45:50 2006 From: jaynelvallance at hotmail.com (Jayne Vallance) Date: Mon, 24 Jul 2006 09:45:50 +0000 Subject: [Bioperl-l] SearchIO - Stop throwing away data Message-ID: Hi I developing someone elses work. I wondered whether anyone could identify the mistake that the previous coder made? I am not very familiar with SearchIO yet. They are trying to extract filenames from an output report. 
This is their code:

    # store the query name of the mito db blast hits into an array
    my $searchio = new Bio::SearchIO( -file => $blast_mito_output );

    # array to store the mitochondrial BLAST database hits
    my @mito_hits;

    # name of query for BLAST hit
    my $query_name;

    while ( my $result = $searchio->next_result() ) {
        # get the hits and their associated name
        # do not want to include these in the clustering step
        while( my $hit = $result->next_hit ) {
            # store the names of these hits into an array
            # these filenames will not be copied over
            $query_name = $result->query_name();
            #print "\nQuery $query_name\n";
            push(@mito_hits, $query_name);
        }
    }

I think they have based it on the code at
http://www.bioperl.org/wiki/HOWTO:SearchIO#Authors

    use Bio::SearchIO;
    use Bio::SearchIO::FastHitEventBuilder;
    my $searchio = new Bio::SearchIO(-format => $format, -file => $file);

    $searchio->attach_EventHandler(Bio::SearchIO::FastHitEventBuilder->new);
    while( my $r = $searchio->next_result ) {
        while( my $h = $r->next_hit ) {
            # Hits will NOT have HSPs
            print $h->significance,"\n";
        }
    }

which "throws away data you don't want"???

I am finding that our code is finding the last file name in the output
report, rather than each and every one. I suspect it is overwriting (or
throwing away the data).

How do I need to change the code to make sure *every* file name goes
into @mito_hits?

Thank you

Jayne

From simon.andrews at bbsrc.ac.uk Mon Jul 24 07:14:08 2006
From: simon.andrews at bbsrc.ac.uk (simon andrews (BI))
Date: Mon, 24 Jul 2006 12:14:08 +0100
Subject: [Bioperl-l] SearchIO - Stop throwing away data
In-Reply-To:
Message-ID:

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org
> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of
> Jayne Vallance
> Sent: 24 July 2006 10:46
> To: bioperl-l at lists.open-bio.org
> Subject: [Bioperl-l] SearchIO - Stop throwing away data
>
> Hi
>
> I am developing someone else's work. I wondered whether anyone could
> identify the mistake that the previous coder made?
> I am not very familiar with SearchIO yet.
>
> They are trying to extract filenames from an output report.

I'm not sure what you mean by filenames here. The value which is being
collected in your code snippet is the name of the original query
sequence.

> This is their code:
> while ( my $result = $searchio->next_result() ) {
>     # get the hits and their associated name
>     # do not want to include these in the clustering step
>     while( my $hit = $result->next_hit ) {
>         # store the names of these hits into an array
>         # these filenames will not be copied over
>         $query_name = $result->query_name();
>         #print "\nQuery $query_name\n";
>         push(@mito_hits, $query_name);

OK, this bit is odd. You're collecting the name of the query sequence
but you're doing it as you're looping through the hits. Since all the
hits come from the same result you're just going to get the same query
name put into your array multiple times (once for each hit). This
almost certainly isn't what you want.

If you just want the name of the query sequence you can miss out the
inner loop (the $result->next_hit() loop).
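
The difference can be sketched without a BLAST report at hand. The Mock::Result and Mock::Hit packages below are hypothetical stand-ins that only imitate the next_hit(), query_name() and name() calls used in the snippet above, and the sequence names are made up; with real data the same loops would run over the results returned by the SearchIO object's next_result(). The first alternative is a variant of dropping the inner loop that also skips queries without any hit, which is an assumption about what is wanted here.

```perl
use strict;
use warnings;

package Mock::Hit;       # stand-in for a hit object
sub new  { my ($class, $name) = @_; return bless { name => $name }, $class }
sub name { return $_[0]{name} }

package Mock::Result;    # stand-in for one result (one query)
sub new {
    my ($class, $query, @hit_names) = @_;
    my @hits = map { Mock::Hit->new($_) } @hit_names;
    return bless { query => $query, hits => \@hits }, $class;
}
sub query_name { return $_[0]{query} }
sub next_hit   { return shift @{ $_[0]{hits} } }   # undef when exhausted

package main;

# Made-up report: two queries, the first with two hits, the second with none.
sub fake_results {
    return ( Mock::Result->new('query_A', 'mito_1', 'mito_2'),
             Mock::Result->new('query_B') );
}

# Logic as posted: the query name is pushed once per hit, so it repeats.
my @per_hit;
for my $result (fake_results()) {
    while (my $hit = $result->next_hit) {
        push @per_hit, $result->query_name;
    }
}

# Alternative 1: each query with at least one hit, recorded once.
my @queries_with_hits;
for my $result (fake_results()) {
    push @queries_with_hits, $result->query_name if $result->next_hit;
}

# Alternative 2: the names of the sequences that were hit.
my @hit_names;
for my $result (fake_results()) {
    while (my $hit = $result->next_hit) {
        push @hit_names, $hit->name;
    }
}

print "per hit:           @per_hit\n";
print "queries with hits: @queries_with_hits\n";
print "hit names:         @hit_names\n";
```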
If you actually want to collect the names of the sequences which were
hit then you need to collect $hit->name() rather than
$result->query_name().

>     }
> }
>
> I think they have based it on the code at
> http://www.bioperl.org/wiki/HOWTO:SearchIO#Authors

> $searchio->attach_EventHandler(Bio::SearchIO::FastHitEventBuilder->new);
> while( my $r = $searchio->next_result ) {
>     while( my $h = $r->next_hit ) {
>         # Hits will NOT have HSPs
>         print $h->significance,"\n";
>     }
>
> which "throws away data you don't want"???

Indeed, but probably not in the way you're thinking. The data it throws
away is the details of each individual HSP (mostly the alignment data).
You're not using HSP data in your code so it will have no effect (other
than making it a bit quicker). It doesn't throw away whole hits or
anything like that.

> I am finding that our code is finding the last file name in
> the output report, rather than each and every one. I suspect
> it is overwriting (or throwing away the data).

I suspect then that you should be collecting the hit names rather than
the query names?

Simon.

From hlapp at gmx.net Mon Jul 24 08:20:00 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Mon, 24 Jul 2006 08:20:00 -0400
Subject: [Bioperl-l] Bio::Species/Bio::Taxonomy changes
In-Reply-To: <44C47DC1.8020503@sendu.me.uk>
References: <7A7BF302-D84E-427B-ABC0-3EB4D19B33E2@uiuc.edu> <44C47DC1.8020503@sendu.me.uk>
Message-ID: <27E70EEF-28F6-4B54-9CCB-6090E338CEFB@gmx.net>

On Jul 24, 2006, at 3:58 AM, Sendu Bala wrote:

> On a side note, why would we care about 'organelle' when we're dealing
> with taxonomy? Why does the NCBI taxonomy db have a slot for
> organelle?

Because some sequences are of the organelle DNA, and GenBank needs a
way to express this. Highly artificial, but still can't be ignored.
-- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at gmx.net Mon Jul 24 08:27:28 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Mon, 24 Jul 2006 08:27:28 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C488B2.5070806@sendu.me.uk> References: <44BBBB69.6000906@sendu.me.uk> <44BF86AF.8080408@sendu.me.uk> <44C488B2.5070806@sendu.me.uk> Message-ID: <11A2B917-C633-4806-A6F4-920F02F0BF6E@gmx.net> :-) I think we're largely in agreement. As for node_name() I fully understand the motivation, but it needs to be understood that the attribute's value will be based on a largely arbitrary choice unless it is set directly by the user. -hilmar On Jul 24, 2006, at 4:45 AM, Sendu Bala wrote: > Hilmar Lapp wrote: >> On Jul 20, 2006, at 9:35 AM, Sendu Bala wrote: >> >>> Bio::DB::Taxonomy::flatfile >>> --------------------------- >>> [...] >>> >>> BEHAVIOUR-CHANGE: flatfile used to store within the nodes it >>> makes the >>> division as a three letter code, like 'PRI'. However, for >>> consistency >>> with entrez and the scientific_name() of the node the division is >>> supposed to correspond to, it is now stored as the full name, like >>> 'Primates'. >> >> What about adding a method division_code() which would return the 3- >> letter abbreviation? >> >> The abbreviation may be needed by flat-file writers, so it may be >> handy to have in some cases. > > As far as I know you can't get the 3-letter version via entrez, so no > other module can really expect to be able to get it, not knowing which > database (flatfile.pm or entez.pm) the taxonomic information is > coming from. > > But of course it would be somewhat harmless to add division_code() > anyway. It might be better done as a -code => 1 option to division()? 
> > >>> The names->id solution also stores the artificially uniqued names >>> like >>> 'Craniata ', allowing you for the first time to >>> retrieve the >>> correct id. Previously the search would have simply failed >>> completely. >>> >>> The names->id solution now handles nodes with scientific names of >>> 'xyz >>> (class)', allowing you to retrieve the id with both get_taxonids >>> ('xyz') >>> and get_taxonids('xyz (class)'). Previously only the latter would >>> work. >> >> Should angle brackets be allowed too? > > Allowed in what sense? You can indeed search for both > get_taxonids('Craniata ') [returns a single id] and > get_taxonids('Craniata') [returns multipe ids, one of which is the > previous answer]. > > >> Maybe there should also be a -names parameter which accepts a hash >> reference with keys being the kind of name (scientific, common, etc) >> and the values being array references with the set of names of that >> kind? > > Not sure what you mean. name() has that data structure, though you're > not supposed to set its hash ref directly. > > >>> or the $node->classification() array. >> >> Bio::Taxonomy::Node shouldn't have this attribute. It is legacy >> brought over from a flawed (because flat) object model in >> Bio::Species. > > Yes, I agree. > > >>> NOTE: entrez modules (and website) cannot cope with '' >>> in the >>> query, failing searches like 'Craniata '. For this >>> reason, if >>> get_taxonids() is given a query with '' it will >>> immediately >>> return undefined, saving a pointless website access. >> >> If there is a 'next-best-thing' that is still semantically compatible >> with the API documentation, I would do that. >> >> In this case, if there is a in the query the entrez >> module should strip it and automatically use the rest for searching. >> If indeed multiple IDs match there should be a warning to inform the >> user that entrez cannot use the notation to limit the >> query results. > > I wouldn't like this. 
I actually had it working this way initially, > but > decided that if someone entered 'xyz ' they really didn't > want multiple ids, expected to get multiple ids with just 'xyz' and > don't want their query made something else and then be warned about > it. > > >> In fact, you might as well provide an option to enable an automatic >> check for the correct branch for each ID if multiple ones are >> returned. I.e., if this option is enabled, the module would >> automatically query the parent nodes to see if is in the >> lineage, and if not will remove the respective ID from the result >> set. The reason you may want to make it optional is because it >> potentially costs time. (but in reality I'm not sure why a client >> will not want to enable the option - so maybe this should even be >> default) > > I can certainly add that, it seems like a good idea. I don't, however, > see any scope for an option at all. What would the option be called? > -don't_give_me_the_answer_I_actually_want_to_save_time ? Pointless, > imho. If the user queries 'xyz ' with that option, they're > just going to have to do for themselves manually what the method would > have done for them without that option, in order to get the correct > answer. It'll be slower that way, if anything. So the option would > actually be called > - > don't_give_me_the_answer_I_actually_want_so_I_can_get_it_myself_a_litt > le_slower > (!). > > >>> Bio::Taxonomy::Node >>> ------------------- >>> [...] >>> classification() has a proper solution to finding the classification >>> when the array wasn't manually set. >>> >>> # Improvements >>> BEHAVIOUR-CHANGE: node_name() used to be an alias to name >>> ('common'). Now >>> it is an alias to name('scientific'). >>> NOTE: node_name is what is set when ->new(-name => $name) is set, so >>> flatfile and entrez and user-created nodes now implicitly associate >>> the >>> name of the node they create with its scientific name. 
>> >> I'm not even sure node_name() should just be deprecated. The methods >> falsely suggests that there is only a single and definitive name for >> the taxon node. >> >> In NCBI reality, this is only true for the scientific name of the >> node. In real reality, many nodes have multiple scientific names - >> taxonomy isn't static and therefore the scientific naming of nodes >> isn't either. > > For the programmer not using any database but just making up his own > nodes, I think he needs a node_name() because he may not be thinking > about anything fancy or realistic. He just want to give his node a > single name that he invents. node_name() seems like the ideal method > name to me. > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at gmx.net Mon Jul 24 08:31:44 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Mon, 24 Jul 2006 08:31:44 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C48323.5060704@sendu.me.uk> References: <001001c6ac2f$02c8c3f0$15327e82@pyrimidine> <44C00805.7090403@sendu.me.uk> <9777D0A4-5340-4367-879D-F9125393B4FD@uiuc.edu> <21C7763A-B8FF-4E91-BAEC-E5F79E462DD3@gmx.net> <44C48323.5060704@sendu.me.uk> Message-ID: <8B9F12C5-C520-48CF-825F-DD24975B43BB@gmx.net> Sounds good to me, except there is no Bio::TaxonomyI yet, and also Bio::Species shouldn't fully depend on an internet connection or flat file to do anything meaningful. I.e., it should take advantage of a lookup database if there is one, but in the absence of that one should also be able to statically set attribute values to whatever one thinks can be gleaned from a parsed text or whatever. 
-hilmar On Jul 24, 2006, at 4:21 AM, Sendu Bala wrote: >> I think this would be the way to go. I.e., >> >> >> |------Node >> NodeI----| >> |-| >> |----SpeciesNode >> Species----| > > Actually, if we're changing the name of the module that Species > interacts with, any existing code needs to be re-written. So why not > just do it properly and have Bio::Species interact with Bio::Taxonomy? > > |----Bio::Taxonomy > Bio::TaxonomyI----| > |----Bio::Species > > Or > > Bio::TaxonomyI----|----Bio::Taxonomy----|----Bio::Species > > Leaving Node completely free to be just a node. This way we don't > have a > crufty SpeciesNode there simply for the sake of Bio::Species. > Bio::Species itself provides all the legacy stuff it needs for itself, > while interacting with Nodes via TaxonomyI methods in the 'correct' > way > only. > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From bix at sendu.me.uk Mon Jul 24 08:34:45 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Mon, 24 Jul 2006 13:34:45 +0100 Subject: [Bioperl-l] Bio::Species/Bio::Taxonomy changes In-Reply-To: <27E70EEF-28F6-4B54-9CCB-6090E338CEFB@gmx.net> References: <7A7BF302-D84E-427B-ABC0-3EB4D19B33E2@uiuc.edu> <44C47DC1.8020503@sendu.me.uk> <27E70EEF-28F6-4B54-9CCB-6090E338CEFB@gmx.net> Message-ID: <44C4BE65.8080304@sendu.me.uk> Hilmar Lapp wrote: > > On Jul 24, 2006, at 3:58 AM, Sendu Bala wrote: > >> On a side note, why would we care about 'organelle' when we're dealing >> with taxonomy? Why does the NCBI taxonomy db have a slot for organelle? > > Because some sequences are of the organelle DNA, and Genbank needs a way > to express this. Highly artificial, but still can't be ignored. Ok, but why is it stored as part of the taxonomy? Why isn't it stored in its own field? And does /bioperl/ have to store it as part of the taxonomy? 
Maybe the file parser could have its own organelle() method and leave all taxonomic classes without such a method. Or it could stay as is, I don't know. Do different organelles in the same species get unique taxonomy ids? From hlapp at gmx.net Mon Jul 24 08:46:51 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Mon, 24 Jul 2006 08:46:51 -0400 Subject: [Bioperl-l] Bio::Species/Bio::Taxonomy changes In-Reply-To: <44C4BE65.8080304@sendu.me.uk> References: <7A7BF302-D84E-427B-ABC0-3EB4D19B33E2@uiuc.edu> <44C47DC1.8020503@sendu.me.uk> <27E70EEF-28F6-4B54-9CCB-6090E338CEFB@gmx.net> <44C4BE65.8080304@sendu.me.uk> Message-ID: <2C99E56B-84D2-4C51-BBF1-76BAF81205AB@gmx.net> On Jul 24, 2006, at 8:34 AM, Sendu Bala wrote: > Hilmar Lapp wrote: >> >> On Jul 24, 2006, at 3:58 AM, Sendu Bala wrote: >> >>> On a side note, why would we care about 'organelle' when we're >>> dealing >>> with taxonomy? Why does the NCBI taxonomy db have a slot for >>> organelle? >> Because some sequences are of the organelle DNA, and Genbank needs >> a way >> to express this. Highly artificial, but still can't be ignored. > > Ok, but why is it stored as part of the taxonomy? Why isn't it > stored in > its own field? And does /bioperl/ have to store it as part of the > taxonomy? No, but clients need to be able to obtain it. Organelles have their own genome. If we talk about the human genome, for instance, most commonly we refer to the nuclear genome only. > Maybe the file parser could have its own organelle() method > and leave all taxonomic classes without such a method. Or it could > stay > as is, I don't know. Like I said above, at the end of the day there needs to be a way to qualify a sequence by the genome it is part of. > > Do different organelles in the same species get unique taxonomy ids? I would have to confirm, but I believe so. As I said, from a genome/ sequence-centric viewpoint, the organelle and nuclear genomes are two different things. 
-hilmar

--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From simon.andrews at bbsrc.ac.uk Mon Jul 24 09:34:10 2006
From: simon.andrews at bbsrc.ac.uk (simon andrews (BI))
Date: Mon, 24 Jul 2006 14:34:10 +0100
Subject: [Bioperl-l] New EMBL format parsing/writing
Message-ID:

A few weeks ago I saw a couple of messages on this list mentioning the new ID/SV line format used in the latest EMBL release. I'm in the process of moving our database server over to the new format and was looking to update SeqIO::embl.pm.

I'm sure someone said they'd made a patch to fix up parsing of the new format, but I can't find it either in CVS or bugzilla. Rather than do this again myself, can someone point me to an updated SeqIO::embl.pm please? If there isn't one then I'll look into making the patch myself.

Since this is such a major change, are there any plans to put out a new release with this fix included? I'm sure this will start to bite more people as the new format becomes more widely adopted.

Cheers

Simon.

--
Simon Andrews PhD
Bioinformatics Group
The Babraham Institute
simon.andrews at bbsrc.ac.uk
+44 (0) 1223 496463

From cjfields at uiuc.edu Mon Jul 24 09:44:37 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Mon, 24 Jul 2006 08:44:37 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <8B9F12C5-C520-48CF-825F-DD24975B43BB@gmx.net>
References: <001001c6ac2f$02c8c3f0$15327e82@pyrimidine> <44C00805.7090403@sendu.me.uk> <9777D0A4-5340-4367-879D-F9125393B4FD@uiuc.edu> <21C7763A-B8FF-4E91-BAEC-E5F79E462DD3@gmx.net> <44C48323.5060704@sendu.me.uk> <8B9F12C5-C520-48CF-825F-DD24975B43BB@gmx.net>
Message-ID: <5CE136DF-2443-4AC0-AD12-B333D894A75B@uiuc.edu>

Hence the reason to have it be a hybrid of Bio::Species and Tax::Node.
Bio::SeqIO::genbank works very happily with the current Bio::Taxonomy::Node now; if we intend to remove most of the method we need to have a similar DB-aware module to house the flatfile data (like Bio::Species) yet be capable of working with Bio::Taxonomy (like Tax::Node). As for organelle(), that could be made into something else (Bio::Annotation::SimpleValue or similar) but as it's always been included with the tax data, that's where it has been. The TaxID in the 'source' seqfeature doesn't refer to the organelle but the organism. Chris On Jul 24, 2006, at 7:31 AM, Hilmar Lapp wrote: > Sounds good to me, except there is no Bio::TaxonomyI yet, and also > Bio::Species shouldn't fully depend on an internet connection or flat > file to do anything meaningful. > > I.e., it should take advantage of a lookup database if there is one, > but in the absence of that one should also be able to statically set > attribute values to whatever one thinks can be gleaned from a parsed > text or whatever. > > -hilmar > > On Jul 24, 2006, at 4:21 AM, Sendu Bala wrote: > >>> I think this would be the way to go. I.e., >>> >>> >>> |------Node >>> NodeI----| >>> |-| >>> |----SpeciesNode >>> Species----| >> >> Actually, if we're changing the name of the module that Species >> interacts with, any existing code needs to be re-written. So why not >> just do it properly and have Bio::Species interact with >> Bio::Taxonomy? >> >> |----Bio::Taxonomy >> Bio::TaxonomyI----| >> |----Bio::Species >> >> Or >> >> Bio::TaxonomyI----|----Bio::Taxonomy----|----Bio::Species >> >> Leaving Node completely free to be just a node. This way we don't >> have a >> crufty SpeciesNode there simply for the sake of Bio::Species. >> Bio::Species itself provides all the legacy stuff it needs for >> itself, >> while interacting with Nodes via TaxonomyI methods in the 'correct' >> way >> only. 
>> > > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From bix at sendu.me.uk Mon Jul 24 09:49:42 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Mon, 24 Jul 2006 14:49:42 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <8B9F12C5-C520-48CF-825F-DD24975B43BB@gmx.net> References: <001001c6ac2f$02c8c3f0$15327e82@pyrimidine> <44C00805.7090403@sendu.me.uk> <9777D0A4-5340-4367-879D-F9125393B4FD@uiuc.edu> <21C7763A-B8FF-4E91-BAEC-E5F79E462DD3@gmx.net> <44C48323.5060704@sendu.me.uk> <8B9F12C5-C520-48CF-825F-DD24975B43BB@gmx.net> Message-ID: <44C4CFF6.40609@sendu.me.uk> Hilmar Lapp wrote: > Sounds good to me, except there is no Bio::TaxonomyI yet, Indeed, I propose making one. > Bio::Species shouldn't fully depend on an internet connection or flat > file to do anything meaningful. > > I.e., it should take advantage of a lookup database if there is one, but > in the absence of that one should also be able to statically set > attribute values to whatever one thinks can be gleaned from a parsed > text or whatever. Yes, which is why Bio::Taxonomy is appropriate here. Assuming that Bio::Species isa Bio::TaxonomyI: ... SOURCE Saccharomyces cerevisiae (baker's yeast) ORGANISM Saccharomyces cerevisiae Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes; Saccharomycetales; Saccharomycetaceae; Saccharomyces. ... 
## the fully-manual way
my $species = new Bio::Species;
my $node = new Bio::Taxonomy::Node(-name => 'Saccharomyces cerevisiae',
                                   -rank => 'species',
                                   -object_id => 1,
                                   -parent_id => 2);
my $n2 = new Bio::Taxonomy::Node(-name => 'Saccharomyces',
                                 -object_id => 2,
                                 -parent_id => 3);
# (no assumption that 'Saccharomyces' is the genus, so rank() undefined)
my $n3 = [etc]
$species->add_node($node);
$species->add_node($n2);
[etc]

## Using a factory without db access
# assume that Bio::Taxonomy::GenbankFactory implements
# some modified Bio::Taxonomy::FactoryI
my $factory = Bio::Taxonomy::GenbankFactory->new();
my $species = $factory->generate(-classification => ['Saccharomyces cerevisiae',
                                                      'Saccharomyces',
                                                      'Saccharomycetaceae' ...]);
# the generate() method above just does the fully-manual way for you

## Using a factory with db access
# assume that Bio::Taxonomy::EntrezFactory implements some
# modified Bio::Taxonomy::FactoryI and uses Bio::DB::Taxonomy::entrez
# to get the nodes
my $factory = Bio::Taxonomy::EntrezFactory->new();
my $species = $factory->fetch(-scientific_name => 'Saccharomyces cerevisiae');
# (would probably want to come up with a more generic name for the
# fetch() and generate() methods, so that all Factories use the
# same method name)

It's very clean and flexible this way. Ultimately you always make your Bio::Species the same way - you add nodes to it. You can make those nodes yourself or use a factory.

We also solve Chris' earlier quandary:

[ in a world where Bio::Taxonomy::Node and Bio::Taxonomy::SpeciesNode exist, and given that Bio::DB::Taxonomy* currently directly make Node objects ]

> The only problem I can foresee is which class to use with
> Bio::DB::Taxonomy*? I guess one could settle on one class by default and
> have the option to use another Bio::Taxonomy::NodeI-implementing class if
> you wanted more data/methods available...
The way to do it is to have the Bio::DB::Taxonomy* modules return only the information that a Bio::Taxonomy::FactoryI would need to make a NodeI. The specific Factory that you use could generate whatever type of Node you wanted.

But actually I propose there is only one Node and the specific Factory that you use determines the kind of Bio::TaxonomyI made; GenbankFactory might make a Bio::Species, while EntrezFactory might make a Bio::Taxonomy.

Bio::Species differs from Bio::Taxonomy only in that it contains all the legacy method names that Bio::Species currently has, for backward compatibility. Setting $species->classification() would delete all nodes of self, use a GenbankFactory to make a new Bio::Species, then pull out all its Nodes and add them to self.

Unless anyone can think of a better way of doing things, I'll explore the above ideas and start writing code.

To summarise: major changes to Bio::DB::Taxonomy* (make them factory slaves), implementation of some Bio::Taxonomy::FactoryIs, tweak Bio::Taxonomy::FactoryI and make Bio::TaxonomyI, make Bio::Species a Bio::TaxonomyI.

Oh, Bio::Taxonomy might need some changes as well. It has a classify() method that does something with a Bio::Species, which would be all wrong in the new way of doing things.
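To make the "generate() just does the fully-manual way for you" idea concrete, here is a minimal sketch using plain hashes in place of node objects. Everything in it is hypothetical: nodes_from_classification is an invented name, the factories and generate() discussed in this thread are proposals rather than existing BioPerl API, and the id numbering simply mirrors the manual example (species node gets id 1, its parent id 2, and so on).

```perl
use strict;
use warnings;

# Hypothetical stand-in for the proposed generate() step: turn a
# classification list (most specific name first) into a chain of
# node records linked by object_id / parent_id.
sub nodes_from_classification {
    my @names = @_;
    my @nodes;

    for my $i ( 0 .. $#names ) {
        my %node = ( name => $names[$i], object_id => $i + 1 );

        # every node except the last (the root of the chain) points at the next one
        $node{parent_id} = $i + 2 if $i < $#names;

        # only the first entry is assumed to be a species; no rank is
        # guessed for the others, matching the manual example
        $node{rank} = 'species' if $i == 0;

        push @nodes, \%node;
    }
    return @nodes;
}

my @nodes = nodes_from_classification(
    'Saccharomyces cerevisiae', 'Saccharomyces', 'Saccharomycetaceae',
);
printf "%d: %s (parent %s)\n",
    $_->{object_id}, $_->{name}, $_->{parent_id} // 'none'
    for @nodes;
```

A real implementation would build Bio::Taxonomy::Node objects and add them to the species object instead of returning hashes; the sketch only shows how little information a flat classification array actually carries.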
From bix at sendu.me.uk Mon Jul 24 09:53:23 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Mon, 24 Jul 2006 14:53:23 +0100
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <5CE136DF-2443-4AC0-AD12-B333D894A75B@uiuc.edu>
References: <001001c6ac2f$02c8c3f0$15327e82@pyrimidine> <44C00805.7090403@sendu.me.uk> <9777D0A4-5340-4367-879D-F9125393B4FD@uiuc.edu> <21C7763A-B8FF-4E91-BAEC-E5F79E462DD3@gmx.net> <44C48323.5060704@sendu.me.uk> <8B9F12C5-C520-48CF-825F-DD24975B43BB@gmx.net> <5CE136DF-2443-4AC0-AD12-B333D894A75B@uiuc.edu>
Message-ID: <44C4D0D3.1020506@sendu.me.uk>

Chris Fields wrote:
> Bio::SeqIO::genbank works very happily with the current
> Bio::Taxonomy::Node now; if we intend to remove most of the methods we
> need to have a similar DB-aware module to house the flatfile data (like
> Bio::Species) yet be capable of working with Bio::Taxonomy (like Tax::Node).

Can you give code examples of what Bio::SeqIO::genbank is doing and what makes it 'happy'? What are the requirements? Would it be as happy working with a Bio::Taxonomy object?

From aramsey at vecna.com Mon Jul 24 10:23:46 2006
From: aramsey at vecna.com (Al Ramsey)
Date: Mon, 24 Jul 2006 10:23:46 -0400
Subject: [Bioperl-l] Making BioPerl Faster
Message-ID: <44C4D7F2.6020107@vecna.com>

I'm interested in following up on a suggestion from the bioperl.org site about making it faster (http://www.bioperl.org/wiki/Why_BioPerl_is_slow). In particular, I wanted to look a little more into how the object instantiations might be made more efficient. Is anyone else looking into this actively now? I want to ask if anyone has any additional insights that weren't previously published before I start.

Thank you,
Al Ramsey

--
Alvin Ramsey, PhD.

Vecna Technologies, Inc.
5205 Leesburg Pike
Falls Church, VA 22041
aramsey at vecna.com
t: 703.998.5333
f: 703.998.5816

From s-merchant at northwestern.edu Mon Jul 24 11:09:49 2006
From: s-merchant at northwestern.edu (Sohel Merchant)
Date: Mon, 24 Jul 2006 10:09:49 -0500
Subject: [Bioperl-l] obo_parser.t test warnings
In-Reply-To:
Message-ID: <004301c6af33$3564a8e0$c2987ca5@pc13>

Hey Chris,

I usually run perl with all warnings disabled. So I never saw these. I will put in a fix for them sometime this week.

Thanks,
Sohel.

_____

From: Chris Fields [mailto:cjfields at uiuc.edu]
Sent: Sunday, July 23, 2006 2:10 PM
To: bioperl-l List; Hilmar Lapp; s-merchant at northwestern.edu
Subject: obo_parser.t test warnings

Hilmar, Sohel,

Didn't know who to notify, so sorry in advance about cross-posting this to the list. I was running through cleaning up some bugs and found that obo_parser.t is throwing a ton of warnings:

bayou-75:~/Chris/Bioperl/bioperl-live natashacapell$ perl -I. -w t/obo_parser.t
1..40
"my" variable $val masks earlier declaration in same scope at Bio/OntologyIO/obo.pm line 592.
"my" variable $qh masks earlier declaration in same scope at Bio/OntologyIO/obo.pm line 592.
Use of uninitialized value in string eq at Bio/OntologyIO/obo.pm line 239, line 13.
...

Good news: all tests pass! Cheers!

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign

From prabubio at gmail.com Mon Jul 24 11:39:43 2006
From: prabubio at gmail.com (Prabu R)
Date: Mon, 24 Jul 2006 21:09:43 +0530
Subject: [Bioperl-l] Remote Blast Execution
Message-ID:

Dear All!

I am trying to run Remote Blast using Bio::Tools::Run::RemoteBlast.

I am not able to get the blast result. As far as I can tell, the Bio::SearchIO::blast hash object does not return any result.

Secondly, I tried 'remote_blast.pl', a program from the CPAN bioperl 1.5 release.
Command:
perl bp_remote_blast.pl -p blastn -d est_mouse -e 1e-5 -i /home/prabucn/Blast/mm_test1.fa

Error Message:

retrieving blasts..

-------------------- WARNING ---------------------
MSG: Possible error (1) while parsing BLAST report!
---------------------------------------------------

Please help.

Thanks,
R. Prabu.

Please look into my test program.
----------------------------------------------------------------------------------------------
use Bio::Tools::Run::RemoteBlast;
use strict;
use Bio::SeqIO;
use Bio::SearchIO;

my $prog = 'blastn';
my $db = 'est';
my $e_val= '1e-10';

my @params = ( '-prog' => $prog,
               '-data' => $db,
               '-expect' => $e_val,
               '-readmethod' => 'SearchIO' );

my $factory = Bio::Tools::Run::RemoteBlast->new(@params) || die "Cant do";

my $v = 1;

my $str = Bio::SeqIO->new(-file=>'mm_test2.txt' , '-format' => 'fasta' );

while (my $input = $str->next_seq()){
    my $r = $factory->submit_blast($input);

    print STDERR "waiting..." if( $v > 0 );
    while ( my @rids = $factory->each_rid ) {
        foreach my $rid ( @rids ) {
            my $rc = $factory->retrieve_blast($rid);

            if( !ref($rc) ) {
                if( $rc < 0 ) {
                    $factory->remove_rid($rid);
                }
                print STDERR "." if ( $v > 0 );
                sleep 5;
            } else {
                print "$rc\n";
                my $result = $rc->next_result();
                my $filename = $result->query_name()."\.out";
                $factory->save_output($filename);
                $factory->remove_rid($rid);
                print "\nQuery Name: ", $result->query_name(), "\n";
                while ( my $hit = $result->next_hit ) {
                    next unless ( $v > 0);
                    print "\thit name is ", $hit->name, "\n";
                    while( my $hsp = $hit->next_hsp ) {
                        print "\t\tscore is ", $hsp->score, "\n";
                    }
                }
            }
        }
    }
}
----------------------------------------------------------------------------------------------

--
"Every noble work is at first impossible."
- Thomas Carlyle

From cjfields at uiuc.edu Mon Jul 24 11:48:45 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Mon, 24 Jul 2006 10:48:45 -0500
Subject: [Bioperl-l] SearchIO - Stop throwing away data
In-Reply-To:
Message-ID: <001701c6af38$a81c1580$15327e82@pyrimidine>

> Hi
>
> I am developing someone else's work. I wondered whether anyone
> could identify the mistake that the previous coder made?
> I am not very familiar with SearchIO yet.
>
> They are trying to extract filenames from an output report.
> This is their code:
>
> # store the query name of the mito db blast hits into an array
> my $searchio = new Bio::SearchIO( -file => $blast_mito_output );
> # array to store the mitochondrial BLAST database hits
> my @mito_hits;
> # name of query for BLAST hit
> my $query_name;
>

Just as a gripe here: you should always designate the '-format' here to be 'blast' for BLAST text output.

my $searchio = new Bio::SearchIO(-file => $blast_mito_output,
                                 -format => 'blast' );

The default is still text, so the above works, but that very well may change in the future.

Each BLAST report is a Result. Each Result contains one or more hits; each hit contains one or more HSPs. SearchIO only parses the information contained in the BLAST report (i.e. no filenames). From here, it looks like you want Hit information, though.

The code below copies the query_name from the BlastResult object, $result (i.e. the name of your query sequence, the one you submitted for BLAST'ing against a database). You need the BlastHit data from $hit.
Change :

$query_name = $result->query_name();
#print "\nQuery $query_name\n";
push(@mito_hits, $query_name);

To :

$hit_name = $hit->description();
#print "\nHit $hit_name\n";
push(@mito_hits, $hit_name);

or, for the hit accession, use

$hit_name = $hit->accession();

For all accessions in the description (there may be multiples if sequences are identical), use an array and

@hit_name = $hit->get_all_accessions();

You can use a different EventHandler if you want to speed things up:

my $searchio = new Bio::SearchIO(-format => $format, -file => $file);
$searchio->attach_EventHandler(Bio::SearchIO::FastHitEventBuilder->new);

But to have this work you need to update to the latest CVS version of bioperl; this was a recent bug that was fixed.

Chris

> while ( my $result = $searchio->next_result() ) {
>     # get the hits and their associated name
>     # do not want to include these in the clustering step
>     while( my $hit = $result->next_hit ) {
>         # store the names of these hits into an array
>         # these filenames will not be copied over
>         $query_name = $result->query_name();
>         #print "\nQuery $query_name\n";
>         push(@mito_hits, $query_name);
>     }
> }
> I think they have based it on the code at
> http://www.bioperl.org/wiki/HOWTO:SearchIO#Authors
>
> use Bio::SearchIO;
> use Bio::SearchIO::FastHitEventBuilder;
> my $searchio = new Bio::SearchIO(-format => $format, -file => $file);
>
> $searchio->attach_EventHandler(Bio::SearchIO::FastHitEventBuilder->new);
> while( my $r = $searchio->next_result ) {
>     while( my $h = $r->next_hit ) {
>         # Hits will NOT have HSPs
>         print $h->significance,"\n";
>     }
>
> which "throws away data you don't want"???
>
> I am finding that our code is finding the last file name in the output
> report,
> rather than each and every one. I suspect it is overwriting (or throwing
> away the data).
>
> How do I need to change the code to make sure *every* file name goes
> into @mito_hits?
>
> Thank you
>
> Jayne

From dwaner at scitegic.com Mon Jul 24 12:03:21 2006
From: dwaner at scitegic.com (dwaner at scitegic.com)
Date: Mon, 24 Jul 2006 09:03:21 -0700
Subject: [Bioperl-l] New EMBL format parsing/writing
Message-ID:

Simon,

I have already updated SeqIO::embl.pm to support release 87. All I have left to do is generate the patch and update the /t test. I will try to get this submitted to bugzilla today (24 July).

- David

From cjfields at uiuc.edu Mon Jul 24 12:04:40 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Mon, 24 Jul 2006 11:04:40 -0500
Subject: [Bioperl-l] Making BioPerl Faster
In-Reply-To: <44C4D7F2.6020107@vecna.com>
Message-ID: <001901c6af3a$df146ea0$15327e82@pyrimidine>

Give it a look, sure! Not sure if this is the only problem though when it comes to speed; I think it's more complicated than that.

I think that (at least on WinXP) the Perl version used is also partially to blame. It's possible that something modified between v 5.6 and 5.8 slowed everything down considerably. I always wondered if it had something to do with Unicode support in perl 5.8 ...

There is a report on Bugzilla about a dramatic slowdown in sequence parsing between v. 1.4 and v. 1.5 (including the latest, v 1.5.1):

http://bugzilla.open-bio.org/show_bug.cgi?id=1875

This is unresolved at this time but may be unrelated to the possible perl versioning issue above.

I've a feeling you may find that regexes and redundant method calls also add quite a bit of overhead. I've seen several places where accessors are called over and over w/o assigning to a local variable. Or places where a tr/// would work much faster than a s///.
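Purely as an illustration of the two patterns just mentioned, and not code taken from BioPerl: the sketch below shows an accessor called inside a loop versus once into a local variable, followed by whitespace stripping with s/// versus tr///. Demo::Seq and all sub names are invented for the example, and the tr/// version only covers ASCII space, tab, newline, carriage return and form feed.

```perl
use strict;
use warnings;

# Demo::Seq is a made-up class whose accessor counts how often it is called.
package Demo::Seq;
sub new { my ( $class, $seq ) = @_; return bless { seq => $seq, calls => 0 }, $class }
sub seq { my ($self) = @_; $self->{calls}++; return $self->{seq} }

package main;

# Pattern 1a: the accessor is called again on every loop iteration.
sub count_g_repeated {
    my ($obj) = @_;
    my $count = 0;
    for my $i ( 0 .. length( $obj->seq ) - 1 ) {
        $count++ if substr( $obj->seq, $i, 1 ) eq 'G';
    }
    return $count;
}

# Pattern 1b: the accessor is called once and the loop uses a local copy.
sub count_g_cached {
    my ($obj) = @_;
    my $seq   = $obj->seq;
    my $count = 0;
    for my $i ( 0 .. length($seq) - 1 ) {
        $count++ if substr( $seq, $i, 1 ) eq 'G';
    }
    return $count;
}

# Pattern 2: deleting single characters with s/// versus tr///.
sub strip_ws_regex { my ($s) = @_; $s =~ s/\s//g;         return $s }
sub strip_ws_tr    { my ($s) = @_; $s =~ tr/ \t\n\r\f//d; return $s }
```

Both versions of each pair return the same result on plain ASCII input; how much time the second form saves depends on the data and the perl build, so timing them with the core Benchmark module on real input is the only way to know.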
There was an instance of the latter in SeqIO; switching it to tr/// sped up parsing about 2-3x on WinXP.

If you want to look at the impact of object instantiation on speed, check out Bio::SearchIO (parsing of BLAST/FASTA/HMMER reports). Lots of method calls, object creation, etc.

Chris

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of Al Ramsey
> Sent: Monday, July 24, 2006 9:24 AM
> To: bioperl-l at lists.open-bio.org
> Subject: [Bioperl-l] Making BioPerl Faster
>
> I'm interested into following up with a suggestion from the bioperl.org
> site about making it faster
> (http://www.bioperl.org/wiki/Why_BioPerl_is_slow). In particular, I
> wanted to look a little more into how the object instantiations might be
> more efficient. Is anyone else looking into this actively now? I want
> to ask if anyone had any additional insights that weren't previously
> published before I started.
>
> Thank you,
> Al Ramsey
>
>
> --
> Alvin Ramsey, PhD.
>
> Vecna Technologies, Inc.
> 5205 Leesburg Pike
> Falls Church, VA 22041
> aramsey at vecna.com
> t: 703.998.5333
> f: 703.998.5816

From cjfields at uiuc.edu Mon Jul 24 12:06:03 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Mon, 24 Jul 2006 11:06:03 -0500
Subject: [Bioperl-l] Remote Blast Execution
In-Reply-To:
Message-ID: <001a01c6af3b$10187f50$15327e82@pyrimidine>

You need to update to the latest code (bioperl-live) from CVS. BLAST parsing using RemoteBlast is broken in all the latest releases.
Chris

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org
> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Prabu R
> Sent: Monday, July 24, 2006 10:40 AM
> To: bioperl-l at lists.open-bio.org
> Subject: [Bioperl-l] Remote Blast Execution
>
> Dear All!
>
> I am trying to run Remote Blast using Bio::Tools::Run::RemoteBlast.
>
> I am not able to get the blast result.
> To my knowledge, the Bio::SearchIO::blast hash object does not return
> any result.
>
> Secondly, I tried 'remote_blast.pl', a program from the CPAN bioperl
> 1.5 release.
>
> Command:
> perl bp_remote_blast.pl -p blastn -d est_mouse -e 1e-5 -i /home/prabucn/Blast/mm_test1.fa
>
> Error Message:
>
> retrieving blasts..
>
> -------------------- WARNING ---------------------
> MSG: Possible error (1) while parsing BLAST report!
> ---------------------------------------------------
>
> Please help.
>
> Thanks,
> R. Prabu.
>
> Please look into my test program.
> --------------------------------------------------------------------------
> use Bio::Tools::Run::RemoteBlast;
> use strict;
> use Bio::SeqIO;
> use Bio::SearchIO;
>
> my $prog = 'blastn';
> my $db = 'est';
> my $e_val= '1e-10';
>
> my @params = ( '-prog' => $prog,
>                '-data' => $db,
>                '-expect' => $e_val,
>                '-readmethod' => 'SearchIO' );
>
> my $factory = Bio::Tools::Run::RemoteBlast->new(@params) || die "Cant do";
>
> my $v = 1;
>
> my $str = Bio::SeqIO->new(-file=>'mm_test2.txt' , '-format' => 'fasta' );
>
> while (my $input = $str->next_seq()){
>     my $r = $factory->submit_blast($input);
>
>     print STDERR "waiting..." if( $v > 0 );
>     while ( my @rids = $factory->each_rid ) {
>         foreach my $rid ( @rids ) {
>             my $rc = $factory->retrieve_blast($rid);
>
>             if( !ref($rc) ) {
>                 if( $rc < 0 ) {
>                     $factory->remove_rid($rid);
>                 }
>                 print STDERR "." if ( $v > 0 );
>                 sleep 5;
>             } else {
>                 print "$rc\n";
>                 my $result = $rc->next_result();
>                 my $filename = $result->query_name()."\.out";
>                 $factory->save_output($filename);
>                 $factory->remove_rid($rid);
>                 print "\nQuery Name: ", $result->query_name(), "\n";
>                 while ( my $hit = $result->next_hit ) {
>                     next unless ( $v > 0);
>                     print "\thit name is ", $hit->name, "\n";
>                     while( my $hsp = $hit->next_hsp ) {
>                         print "\t\tscore is ", $hsp->score, "\n";
>                     }
>                 }
>             }
>         }
>     }
> }
> --------------------------------------------------------------------------
>
> --
> "Every noble work is at first impossible."
> - Thomas Carlyle

From cjfields at uiuc.edu Mon Jul 24 12:21:39 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Mon, 24 Jul 2006 11:21:39 -0500
Subject: [Bioperl-l] New EMBL format parsing/writing
In-Reply-To:
Message-ID: <001c01c6af3d$3df2dc70$15327e82@pyrimidine>

The only proposed EMBL changes I can remember were for Tax data (organism lines). It shouldn't be hard to change the way these are parsed. We could leave parsing of SV for older files and run a check on the ID line format to accommodate old and new sequences, though I have no problem with only supporting the latest formats. Continual support for old deprecated sequence formats leads to lots of cruft over time; SwissProt parsing has the same issue. You would be surprised how many people out there never bother to update their sequences and use old data...

I believe you are referring to this (from the latest EMBL release notes):

...

2 CHANGES IN THIS RELEASE

2.1 Changes to the Feature Table Document: Chapter 3.5 "Location"

The use of range (.) descriptor within location spans is no longer legal.

2.2 ID line changes

ID line structure underwent the following changes

* All tokens are separated by a semicolon.
* The entry name is not displayed, in its place there is the primary accession number.
* The sequence version is indicated.
* The topology is a separate token and is indicated for both circular and linear molecules.
* Both the data class and taxonomic divisions will be displayed.

This is an example of the new ID line:

ID   CD789012; SV 4; linear; genomic DNA; HTG; MAM; 500 BP.
     (1)       (2)   (3)     (4)          (5)  (6)  (7)

The tokens represent:

1. Primary accession number.
2. 'SV' + sequence version number.
3. Topology: 'circular' or 'linear'.
4. Molecule type.
5. Data class (ANN, CON, PAT, EST, GSS, HTC, HTG, MGA, WGS, TPA, STS, STD, "normal" entries will have STD for standard).
6. Taxonomic division (HUM, MUS, ROD, PRO, MAM, VRT, FUN, PLN, ENV, INV, SYN, UNC, VRL, PHG).
7. Sequence length + 'BP.'.

The entry name is no longer displayed in the ID line. A mapping file (entryname to accession number)

ftp://ftp.ebi.ac.uk/pub/databases/embl/misc/entryname_to_acc.mapping

is provided for those entries where the entryname is not the same as the accession number.

The SV line has been dropped as sequence version information is now displayed in the ID line.

In order to facilitate the changeover to the new ID line structure, two small utilities have been released: 'new2oldID.pl' and 'old2newID.pl'. They can be used to convert EMBL flat files from the old to the new format and vice-versa. The converters can be found at

ftp://ftp.ebi.ac.uk/pub/databases/embl/tools

A new version of the Syncron tools (for maintaining synchronised copies of EMBL database updates) that became the working version with EMBL release 87 can be found in the same directory. In this version the tools were adjusted to cope with the new format of the ID line in EMBL entries and some related changes.

...
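As a rough illustration of the ID line layout quoted above (this is not the SeqIO::embl.pm patch mentioned in this thread), the seven tokens could be pulled apart with a plain split on semicolons. The sub name parse_new_embl_id and the hash keys are made up for this sketch; only core Perl is used.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): split a release-87-style
# EMBL ID line into the seven tokens listed in the release notes.
# Returns a hash ref, or nothing if the line is not in the new layout.
sub parse_new_embl_id {
    my ($line) = @_;

    # Strip the leading 'ID' tag and any trailing whitespace.
    (my $body = $line) =~ s/^ID\s+//;
    $body =~ s/\s+$//;

    # "All tokens are separated by a semicolon."
    my @tok = split /\s*;\s*/, $body;
    return unless @tok == 7;

    # Token 2 is 'SV' + version; token 7 is length + 'BP.'.
    my ($version) = $tok[1] =~ /^SV\s+(\d+)$/   or return;
    my ($length)  = $tok[6] =~ /^(\d+)\s+BP\.$/ or return;

    return {
        accession  => $tok[0],
        version    => $version,
        topology   => $tok[2],
        molecule   => $tok[3],
        data_class => $tok[4],
        division   => $tok[5],
        length     => $length,
    };
}

# The example line from the release notes.
my $id = parse_new_embl_id(
    'ID   CD789012; SV 4; linear; genomic DNA; HTG; MAM; 500 BP.');
print join("\t", map { "$_=$id->{$_}" } sort keys %$id), "\n";
```

An old-style ID line has fewer semicolon-separated tokens, so the token-count check gives a cheap way to tell the two layouts apart before deciding how to parse.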
Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of simon andrews (BI) > Sent: Monday, July 24, 2006 8:34 AM > To: bioperl-l at lists.open-bio.org > Subject: [Bioperl-l] New EMBL format parsing/writing > > I few weeks ago I saw a couple of messages on this list mentioning the > new ID/SV line format used in the latest EMBL release. I'm in the > process of moving our database server over to the new format and was > looking to update SeqIO::embl.pm. > > I'm sure someone said they'd made a patch to fix up parsing of the new > format, but I can't find it either in CVS or bugzilla. > > Rather than do this again myself can someone point me to an updated > SeqIO::embl.pm please? If there isn't one then I'll look into making > the patch myself. > > Since this is such a major change are there any plans to put out a new > release with this fix included? I'm sure this will start to bite more > people as the new format becomes more widely adopted. > > > Cheers > > Simon. > > -- > Simon Andrews PhD > Bioinformatics Group > The Babraham Institute > > simon.andrews at bbsrc.ac.uk > +44 (0) 1223 496463 > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Mon Jul 24 12:37:32 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 24 Jul 2006 11:37:32 -0500 Subject: [Bioperl-l] New EMBL format parsing/writing In-Reply-To: Message-ID: <002001c6af3f$76214490$15327e82@pyrimidine> Great work! Does it support old and new EMBL or only the newest? I don't have a problem with dumping old format support, but if we do we need to note this in POD and elsewhere (wiki, perhaps). 
Chris

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org
> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of dwaner at scitegic.com
> Sent: Monday, July 24, 2006 11:03 AM
> To: bioperl-l at lists.open-bio.org
> Subject: [Bioperl-l] New EMBL format parsing/writing
>
> Simon,
>
> I have already updated SeqIO::embl.pm to support release 87. All I have
> left to do is generate the patch and update the /t test. I will try to
> get this submitted to bugzilla today (24 July).
>
> - David

From cjfields at uiuc.edu Mon Jul 24 14:40:03 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Mon, 24 Jul 2006 13:40:03 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44C4D0D3.1020506@sendu.me.uk>
Message-ID: <002f01c6af50$97242250$15327e82@pyrimidine>

I have to do a little catching up on things here; lots of conversation this morning!

According to NCBI, the SOURCE line can hold organelle data, an abbreviated version of the scientific name, and the GenBank common name in parentheses. No other information is present. The ORGANISM lines contain the scientific name (NCBI definition) and the lineage, generally only ranked nodes but not always. I believe it was Nadeem Faruque who indicated that there is some way that NCBI marks the ranks which determines whether or not they appear in the lineage.

Here's what Bio::SeqIO::genbank does to get data into and out of GenBank files:

------------------------------------------------------
Bio::SeqIO::genbank in methods next_seq() and _read_GenBank_Species():

1) Bio::Species acts as a container object
2) The SOURCE data is dumped entirely into common_name() (ughhhh). There is some additional work done as well before instantiating a Bio::Species; if it is considered an unknown organism there is no Bio::Species object returned. We should get rid of that bit; every GenBank SOURCE has a TaxID and therefore has a node, including plasmids and unknowns. There will be no genus/species or anything else set for that group.
3) The ORGANISM name is divided up into genus(), species(), and subspecies(), based on the classification array (again, ughhh).
4) The classification array is split into an array and dumped into classification()
5) No parsing of potential organelle information occurs. None. Zero. Squat.
6) TaxID is grabbed from the 'source' seqfeature and assigned via ncbi_taxid(). We could use this to also grab the organelle, etc.

------------------------------------------------------
Bio::SeqIO::genbank in method write_seq():

1) SOURCE line : use the common_name data for output, but tag on the subspecies information (?!?!?!).
2) ORGANISM lines : the name is rebuilt from the organelle() (which should be on the SOURCE line) and genus and species, which comes from the classification array (?!?!?!). The classification array is rebuilt from classification()

------------------------------------------------------

Much of this may be cruft from changes in the official GenBank format that we neglected to update. However, I think there's WAY too much hand-wringing about trying to get everything into genus() species() etc without anything more than the (very scant) information in the flatfile, esp. when using the classification array as a basis. The only places where reliable tax information is present in the flatfile are:

1) SOURCE line (organelle, common name, abbreviated name)
2) ORGANISM lines (scientific name, classification array)
3) 'source' seqfeature (strain/variant (!), organelle, TaxID, etc found here).

We should assign those accordingly; we could even use the 'source' seqfeature to grab strain, organelle, etc. just like we now do for the TaxID. Beyond that we're really just guessing the ranks and the genus-species names. Makes no sense, especially when that is easily available in Bio::Taxonomy using entrez/flatfile.

We could have Bio::Taxonomy::Species act as a container for IO purposes, ONLY using the methods in the 'reliable information' list above in Bio::SeqIO::genbank and other SeqIO RichSeqs. Then hold the additional data with warnings attached if a lookup hasn't been run, or not set them at all. Or, use Hilmar's suggestion and force the user to use the db handle and ncbi_taxid() to grab a new Bio::Taxonomy::Node/Species object (based on the rank) which has the correct information.

As for the other container get/sets: species(), genus() etc. These methods should be present, but only for species or below (hence Bio::Taxonomy::Species). In a way Bio::Taxonomy::Species is not entirely correct, as many times the sequence in the file is from an organism at the genus level (unassigned species) or subspecies/strain levels, or is unranked (environmental samples, for instance). All of these seem to have TaxIDs though. Don't think it really matters...

We could convert Bio::Species into an abstract interface class (Bio::SpeciesI), moving the implemented methods over to Bio::Taxonomy::Species, and have Bio::Taxonomy::Species implement Bio::Taxonomy::NodeI or Bio::TaxonomyI as well. Bio::Taxonomy::Species could be checked with

    $obj->isa('Bio::TaxonomyI') && $obj->isa('Bio::SpeciesI')

Or, modifying Hilmar's suggestion:

                  |-----Tax::Node
    NodeI/TaxI ---|
                  |-----Tax::Species
                            |
    SpeciesI ---------------|

So Species doesn't 'contaminate' Node.
This will allow you to proceed with doing what you want to Bio::Taxonomy::Node; both Node and Species could be checked simultaneously though they need to be changed at some point to implement the same base class, so you could check using : if ($obj->isa('Bio::Taxonomy::NodeI')) { As for getting Bio::SeqIO::genbank to play well with Bio::Taxonomy::Species, all I did was 'clone' the Bio::Taxonomy::Node module into Bio::Taxonomy::Species, removed the warnings in species() and other methods for the time being, and changed the method call for classification() in Bio::SeqIO::genbank to send an array instead of an array_ref. Then I modified the parsing to retain the scientific_name and abbreviated_name (though the latter should go into common_names()). Passed all but one test, where common_name was called and returned the entire SOURCE line (not correct!). Pretty simple, really... BTW, I checked EMBL format, and it is very similar in format to the way GenBank is with the interesting addition of the OG line (for organelle). Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Sendu Bala > Sent: Monday, July 24, 2006 8:53 AM > To: bioperl-l at lists.open-bio.org > Subject: Re: [Bioperl-l] Bio::*Taxonomy* changes > > Chris Fields wrote: > > Bio::SeqIO::genbank works very happily with the current > > Bio::Taxonomy::Node now; if we intend to remove most of the method we > > need to have a similar DB-aware module to house the flatfile data (like > > Bio::Species) yet be capable of working with Bio::Taxonomy (like > Tax::Node). > > Can you give code examples of what Bio::SeqIO::genbank is doing and what > makes it 'happy'? What are the requirements? Would it be as happy > working with a Bio::Taxonomy object? 
> _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Mon Jul 24 15:24:23 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 24 Jul 2006 14:24:23 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C4CFF6.40609@sendu.me.uk> Message-ID: <003c01c6af56$c5fd2df0$15327e82@pyrimidine> > Hilmar Lapp wrote: > > Sounds good to me, except there is no Bio::TaxonomyI yet, > > Indeed, I propose making one. So, Node would implement this, correct? Naming it Bio::TaxonomyI makes me think that Bio::Taxonomy implements TaxonomyI, not that Bio::Taxonomy::Node implements it. ... > Yes, which is why Bio::Taxonomy is appropriate here. Assuming that > Bio::Species isa Bio::TaxonomyI: > > ... > SOURCE Saccharomyces cerevisiae (baker's yeast) > ORGANISM Saccharomyces cerevisiae > Eukaryota; Fungi; Ascomycota; Saccharomycotina; > Saccharomycetes; > Saccharomycetales; Saccharomycetaceae; Saccharomyces. > > ... > > ## the fully-manual way > my $species = new Bio::Species; > my $node = new Bio::Taxonomy::Node(-name => 'Saccharomyces cerevisiae', > -rank => 'species', -object_id => 1, > -parent_id => 2); > my $n2 = new Bio::Taxonomy::Node(-name => 'Saccharomyces', > -object_id => 2, -parent_id => 3); > # (no assumption that 'Saccharomyces' is the genus, so rank() undefined) > my $n3 = [etc] > $species->add_node($node); > $species->add_node($n2); > [etc] Hrmm... why would you add multiple nodes to a species object? A Species is-a Node, not a full Bio::Taxonomy. Taxonomy has-a Node (hence the add_node() method). So, you should be able to add a NodeI-implementing object to a Taxonomy object (either a Node or a Species). Not sure I agree with what you propose here; doesn't seem right... ... 
> We also solve Chris' earlier quandary: > > [ in a world where Bio::Taxonomy::Node and Bio::Taxonomy::SpeciesNode > exist, and given that Bio::DB::Taxonomy* currently directly make Node > objects ] > > The only problem I can foresee is which class to use with > > Bio::DB::Taxonomy*? I guess one could settle on one class by default > and > > have the option to use another Bio::Taxonomy::NodeI-implementing class > if > > you wanted more data/methods available... > > The way to do it is to have the Bio::DB::Taxonomy* modules return only > the information that a Bio::Taxonomy::FactoryI would need to make a > NodeI. The specific Factory that you use could generate whatever type of > Node you wanted. Yes, using an object factory here makes a lot of sense, returning the correct object type based on the rank. ... > Bio::Species differs from Bio::Taxonomy only so it contains all the > legacy methods names that Bio::Species currently has, for backward > compatibility. Setting $species->classification() would delete all nodes > of self, use a GenbankFactory to make a new Bio::Species, then pull out > all its Nodes and add them to self. The idea is to replace Bio::Species with something that works well, so having it implement a Node-like interface works since it is-a Node. Having it implement a Taxonomy-like interface, though, doesn't make a lot of sense as a species is-not-a Taxonomy. It should act just like a fancier node object. Using a factory in Bio::DB::Taxonomy should solve any issues about what object type is returned, since that could simply be made based on the rank itself (species rank or below == Bio::Taxonomy::Species, genus and above == Bio::Taxonomy::Node). > Unless anyone can think of a better way of doing things, I'll explore > the above ideas and start writing code. 
To summarise: major changes to > Bio::DB::Taxonomy* (make them factory slaves), implementation of some > Bio::Taxonomy::FactoryIs, tweak Bio::Taxonomy::FactoryI and make > Bio::TaxonomyI, make Bio::Species a Bio::TaxonomyI. Nope. Don't agree. Sorry. I can't see why you would force a Species to be a Taxonomy when it isn't. The object hierarchy doesn't make sense to me. I would just have a simple interface for Node (NodeI), and either convert Bio::Species to an abstract interface or place its methods in Bio::Taxonomy::Species/SpeciesNode. I like the interface idea as Bio::Taxonomy::Node is-a NodeI only, while Bio::Taxonomy::Species is-a NodeI and SpeciesI; these checks can be run using the UNIVERSAL object method 'isa' when using a Factory. I'll repeat: a Node and a Species is-not-a Taxonomy. A Taxonomy object has-a Node or Species or combinations thereof ; all would be NodeI-implementing. That's the reason that add_node() is there, which could be modified to allow only objects that isa->('Bio::Taxonomy::NodeI') (i.e. a Node or a Species). > Oh, Bio::Taxonomy might need some changes as well. It has a classify() > method does something with a Bio::Species, which would be all wrong in > the new way of doing things. We'll have to make eventual changes to anything referencing Bio::Species to get them to work correctly. Getting the object hierarchy finalized and worked out is priority one. Getting Bio::SeqIO modules switched over to Bio::Taxonomy::Species (pretty commonly used) and making sure that Bio::DB::Taxonomy returns the correct objects from the factory is a close second. Any small issues that pop up along the way can be taken care of when they reveal themselves. 
Chris

From cjfields at uiuc.edu Mon Jul 24 15:34:55 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Mon, 24 Jul 2006 14:34:55 -0500
Subject: [Bioperl-l] Bio::Species/Bio::Taxonomy changes
In-Reply-To: <2C99E56B-84D2-4C51-BBF1-76BAF81205AB@gmx.net>
Message-ID: <003d01c6af58$3dc4ac40$15327e82@pyrimidine>

> > Maybe the file parser could have its own organelle() method
> > and leave all taxonomic classes without such a method. Or it could
> > stay as is, I don't know.
>
> Like I said above, at the end of the day there needs to be a way to
> qualify a sequence by the genome it is part of.

Agreed. I think Sendu's right in one regard, it doesn't seem to have anything to do with the taxonomy itself. See below... There should be a way of containing this somehow, maybe using a Bio::Annotation::SimpleValue object or having a get/set somehow.

> > Do different organelles in the same species get unique taxonomy ids?
>
> I would have to confirm, but I believe so. As I said, from a
> genome/sequence-centric viewpoint, the organelle and nuclear genomes
> are two different things.

Looks like the organelle sequence data uses the organism TaxID. I couldn't find organelle-specific taxon information using the TaxBrowser for mitochondrion, chloroplast, or plastid.

     source          1..426
                     /organism="Reticulitermes tibialis"
                     /organelle="mitochondrion"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:186107"
                     /haplotype="T9"

TaxID refers to the organism ("Reticulitermes tibialis"), not the mitochondrion.

     source          1..814
                     /organism="Porterinema fluviatile"
                     /organelle="plastid:chloroplast"
                     /mol_type="genomic DNA"
                     /strain="SAG 124.79"
                     /db_xref="taxon:246123"
                     /country="Germany"

TaxID refers to the organism ("Porterinema fluviatile"), not the chloroplast.
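To make the point concrete: in the records above, the organelle and the TaxID sit on the 'source' feature as separate qualifiers, so each can be read without touching the other. A minimal plain-Perl sketch using the first record (no BioPerl calls are involved; the sub name source_qualifiers is hypothetical, and this is not how Bio::SeqIO::genbank does it):

```perl
use strict;
use warnings;

# Hypothetical helper: collect /key="value" qualifiers from the text of
# a GenBank 'source' feature into a hash ref. A repeated key (e.g. a
# second db_xref) would overwrite the first; fine for a sketch.
sub source_qualifiers {
    my ($text) = @_;
    my %q;
    while ($text =~ m{/(\w+)="([^"]*)"}g) {
        $q{$1} = $2;
    }
    return \%q;
}

my $feature = <<'END';
     source          1..426
                     /organism="Reticulitermes tibialis"
                     /organelle="mitochondrion"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:186107"
                     /haplotype="T9"
END

my $q = source_qualifiers($feature);

# The TaxID comes from the db_xref qualifier, independently of /organelle.
my ($taxid) = $q->{db_xref} =~ /^taxon:(\d+)$/;

print "organism:  $q->{organism}\n";
print "organelle: $q->{organelle}\n";
print "TaxID:     $taxid\n";
```

Keeping the organelle as its own value, rather than folding it into a species or node name, matches the observation that the TaxID identifies the organism and not the organelle.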
Chris

From bix at sendu.me.uk Mon Jul 24 15:45:09 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Mon, 24 Jul 2006 20:45:09 +0100
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <003c01c6af56$c5fd2df0$15327e82@pyrimidine>
References: <003c01c6af56$c5fd2df0$15327e82@pyrimidine>
Message-ID: <44C52345.5060903@sendu.me.uk>

Chris Fields wrote:
>> Hilmar Lapp wrote:
>>> Sounds good to me, except there is no Bio::TaxonomyI yet,
>> Indeed, I propose making one.
>
> So, Node would implement this, correct? Naming it Bio::TaxonomyI makes me
> think that Bio::Taxonomy implements TaxonomyI, not that Bio::Taxonomy::Node
> implements it.

No no, I guess the whole rest of your reply was confused by this one point. Bio::TaxonomyI would be the interface for Bio::Taxonomy. Definitely not a Node.

>> Yes, which is why Bio::Taxonomy is appropriate here. Assuming that
>> Bio::Species isa Bio::TaxonomyI:
>>
>> ...
>> SOURCE      Saccharomyces cerevisiae (baker's yeast)
>>   ORGANISM  Saccharomyces cerevisiae
>>             Eukaryota; Fungi; Ascomycota; Saccharomycotina;
>>             Saccharomycetes;
>>             Saccharomycetales; Saccharomycetaceae; Saccharomyces.
>>
>> ...
>>
>> ## the fully-manual way
>> my $species = new Bio::Species;
>> my $node = new Bio::Taxonomy::Node(-name => 'Saccharomyces cerevisiae',
>>                                    -rank => 'species', -object_id => 1,
>>                                    -parent_id => 2);
>> my $n2 = new Bio::Taxonomy::Node(-name => 'Saccharomyces',
>>                                  -object_id => 2, -parent_id => 3);
>> # (no assumption that 'Saccharomyces' is the genus, so rank() undefined)
>> my $n3 = [etc]
>> $species->add_node($node);
>> $species->add_node($n2);
>> [etc]
>
> Hrmm... why would you add multiple nodes to a species object? A Species
> is-a Node, not a full Bio::Taxonomy.

In my proposal, a Bio::Species certainly is a full Bio::Taxonomy.

>> Bio::Species differs from Bio::Taxonomy only so it contains all the
>> legacy method names that Bio::Species currently has, for backward
>> compatibility. Setting $species->classification() would delete all nodes
>> of self, use a GenbankFactory to make a new Bio::Species, then pull out
>> all its Nodes and add them to self.
>
> The idea is to replace Bio::Species with something that works well, so
> having it implement a Node-like interface works since it is-a Node. Having
> it implement a Taxonomy-like interface, though, doesn't make a lot of sense
> as a species is-not-a Taxonomy.

Right. So this is why we've been 'butting heads'. Up till now I had no idea why you were so adamant about keeping things the old Bio::Taxonomy::Node way.

Bio::Species very definitely has never been, nor do we want it to become, a single node of a taxonomy. It has always been a complete taxonomy. You can tell that by the fact it has a classification, and you could ask what its genus is. This is why I'm proposing that Bio::Species become a Bio::Taxonomy. Because that's the correct object model for the kinds of things Bio::Species wants to do.

> Using a factory in Bio::DB::Taxonomy should solve any issues about what
> object type is returned, since that could simply be made based on the rank
> itself (species rank or below == Bio::Taxonomy::Species, genus and above ==
> Bio::Taxonomy::Node).

Frankly, that idea makes me ill. A Node, at the fundamental level, is just a very simple object that needs to associate a taxonomic rank with a scientific name. If you start making different objects for different ranks, you've departed from any semblance of meaning in the object model.

> Nope. Don't agree. Sorry. I can't see why you would force a Species to be
> a Taxonomy when it isn't. The object hierarchy doesn't make sense to me.

Does it make sense now?

> I'll repeat: a Node and a Species is-not-a Taxonomy.

I'll repeat: A Node is a Node and a Bio::Species is a Taxonomy ;)

> A Taxonomy object has-a Node or Species or combinations thereof ;

No, a Taxonomy contains Nodes. One of those Nodes might have a rank() of 'species'.
A Bio::Species contains Nodes. One of those Nodes definitely has a rank() of 'species'. It /must/ have other nodes, because the job of Bio::Species has in the past been, and will in the future be, to store all the other taxonomic levels in a Genbank file. For the same reason Bio::Species can't be a Node itself, because you can't store other Nodes inside a Node.

From cjfields at uiuc.edu Mon Jul 24 15:49:06 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Mon, 24 Jul 2006 14:49:06 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <11A2B917-C633-4806-A6F4-920F02F0BF6E@gmx.net>
Message-ID: <003e01c6af5a$390cdea0$15327e82@pyrimidine>

Yes, 'largely' being the key word. I don't really agree with Sendu's hierarchy scheme (making Species implement Taxonomy and not Node doesn't make sense), but, besides that, everything else seems fine.

I like the following setup (which is similar to what you proposed, I believe), which I already posted.

                |-----Tax::Node
    NodeI-------|
                |-----Tax::SpeciesNode
                          |
    SpeciesI -------------|

Taxonomy::Node is-a NodeI
Taxonomy::SpeciesNode is-a NodeI and-a SpeciesI
Bio::Taxonomy 'has-a' NodeI-implementing module
SeqIO has-a SpeciesI-implementing module
Bio::DB::Taxonomy uses a factory to return NodeI-implementing modules; specifically, a SpeciesNode for species ranks or below, and a Node for anything else.

It would be nice to get this hammered out soon. I think we can actually start work on the Bio::Taxonomy::Node/SpeciesNode split; the interface classes would be easy to add. I could work on getting SeqIO to work with Bio::Taxonomy::SpeciesNode when I can (sometime in the next few weeks). Like I mentioned before, I got Bio::SeqIO::genbank already using it but I'm holding off on committing it to CVS until we've sorted out the class hierarchy and interface-implementation issues.

I won't be able to add too much more to this for a few weeks, unfortunately. I need to prepare for a conference as well as finish up a ton of bench research. I'll try keeping up though...
Chris > :-) I think we're largely in agreement. As for node_name() I fully > understand the motivation, but it needs to be understood that the > attribute's value will be based on a largely arbitrary choice unless > it is set directly by the user. > > -hilmar > > On Jul 24, 2006, at 4:45 AM, Sendu Bala wrote: > > > Hilmar Lapp wrote: > >> On Jul 20, 2006, at 9:35 AM, Sendu Bala wrote: > >> > >>> Bio::DB::Taxonomy::flatfile > >>> --------------------------- > >>> [...] > >>> > >>> BEHAVIOUR-CHANGE: flatfile used to store within the nodes it > >>> makes the > >>> division as a three letter code, like 'PRI'. However, for > >>> consistency > >>> with entrez and the scientific_name() of the node the division is > >>> supposed to correspond to, it is now stored as the full name, like > >>> 'Primates'. > >> > >> What about adding a method division_code() which would return the 3- > >> letter abbreviation? > >> > >> The abbreviation may be needed by flat-file writers, so it may be > >> handy to have in some cases. > > > > As far as I know you can't get the 3-letter version via entrez, so no > > other module can really expect to be able to get it, not knowing which > > database (flatfile.pm or entez.pm) the taxonomic information is > > coming from. > > > > But of course it would be somewhat harmless to add division_code() > > anyway. It might be better done as a -code => 1 option to division()? > > > > > >>> The names->id solution also stores the artificially uniqued names > >>> like > >>> 'Craniata ', allowing you for the first time to > >>> retrieve the > >>> correct id. Previously the search would have simply failed > >>> completely. > >>> > >>> The names->id solution now handles nodes with scientific names of > >>> 'xyz > >>> (class)', allowing you to retrieve the id with both get_taxonids > >>> ('xyz') > >>> and get_taxonids('xyz (class)'). Previously only the latter would > >>> work. > >> > >> Should angle brackets be allowed too? > > > > Allowed in what sense? 
You can indeed search for both > > get_taxonids('Craniata ') [returns a single id] and > > get_taxonids('Craniata') [returns multipe ids, one of which is the > > previous answer]. > > > > > >> Maybe there should also be a -names parameter which accepts a hash > >> reference with keys being the kind of name (scientific, common, etc) > >> and the values being array references with the set of names of that > >> kind? > > > > Not sure what you mean. name() has that data structure, though you're > > not supposed to set its hash ref directly. > > > > > >>> or the $node->classification() array. > >> > >> Bio::Taxonomy::Node shouldn't have this attribute. It is legacy > >> brought over from a flawed (because flat) object model in > >> Bio::Species. > > > > Yes, I agree. > > > > > >>> NOTE: entrez modules (and website) cannot cope with '' > >>> in the > >>> query, failing searches like 'Craniata '. For this > >>> reason, if > >>> get_taxonids() is given a query with '' it will > >>> immediately > >>> return undefined, saving a pointless website access. > >> > >> If there is a 'next-best-thing' that is still semantically compatible > >> with the API documentation, I would do that. > >> > >> In this case, if there is a in the query the entrez > >> module should strip it and automatically use the rest for searching. > >> If indeed multiple IDs match there should be a warning to inform the > >> user that entrez cannot use the notation to limit the > >> query results. > > > > I wouldn't like this. I actually had it working this way initially, > > but > > decided that if someone entered 'xyz ' they really didn't > > want multiple ids, expected to get multiple ids with just 'xyz' and > > don't want their query made something else and then be warned about > > it. > > > > > >> In fact, you might as well provide an option to enable an automatic > >> check for the correct branch for each ID if multiple ones are > >> returned. 
I.e., if this option is enabled, the module would > >> automatically query the parent nodes to see if is in the > >> lineage, and if not will remove the respective ID from the result > >> set. The reason you may want to make it optional is because it > >> potentially costs time. (but in reality I'm not sure why a client > >> will not want to enable the option - so maybe this should even be > >> default) > > > > I can certainly add that, it seems like a good idea. I don't, however, > > see any scope for an option at all. What would the option be called? > > -don't_give_me_the_answer_I_actually_want_to_save_time ? Pointless, > > imho. If the user queries 'xyz ' with that option, they're > > just going to have to do for themselves manually what the method would > > have done for them without that option, in order to get the correct > > answer. It'll be slower that way, if anything. So the option would > > actually be called > > - > > don't_give_me_the_answer_I_actually_want_so_I_can_get_it_myself_a_litt > > le_slower > > (!). > > > > > >>> Bio::Taxonomy::Node > >>> ------------------- > >>> [...] > >>> classification() has a proper solution to finding the classification > >>> when the array wasn't manually set. > >>> > >>> # Improvements > >>> BEHAVIOUR-CHANGE: node_name() used to be an alias to name > >>> ('common'). Now > >>> it is an alias to name('scientific'). > >>> NOTE: node_name is what is set when ->new(-name => $name) is set, so > >>> flatfile and entrez and user-created nodes now implicitly associate > >>> the > >>> name of the node they create with its scientific name. > >> > >> I'm not even sure node_name() should just be deprecated. The methods > >> falsely suggests that there is only a single and definitive name for > >> the taxon node. > >> > >> In NCBI reality, this is only true for the scientific name of the > >> node. 
In real reality, many nodes have multiple scientific names - > >> taxonomy isn't static and therefore the scientific naming of nodes > >> isn't either. > > > > For the programmer not using any database but just making up his own > > nodes, I think he needs a node_name() because he may not be thinking > > about anything fancy or realistic. He just want to give his node a > > single name that he invents. node_name() seems like the ideal method > > name to me. > > > > > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From hlapp at gmx.net Mon Jul 24 15:56:02 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Mon, 24 Jul 2006 15:56:02 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <003c01c6af56$c5fd2df0$15327e82@pyrimidine> References: <003c01c6af56$c5fd2df0$15327e82@pyrimidine> Message-ID: <88700A84-B426-4BC7-88F2-D5E793870ADF@gmx.net> On Jul 24, 2006, at 3:24 PM, Chris Fields wrote: > >> Hilmar Lapp wrote: >>> Sounds good to me, except there is no Bio::TaxonomyI yet, >> >> Indeed, I propose making one. > > So, Node would implement this, correct? No - > Naming it Bio::TaxonomyI makes me > think that Bio::Taxonomy implements TaxonomyI, not that > Bio::Taxonomy::Node > implements it. I'd suppose so. >> Yes, which is why Bio::Taxonomy is appropriate here. Assuming that >> Bio::Species isa Bio::TaxonomyI: >> >> ... 
>> SOURCE Saccharomyces cerevisiae (baker's yeast) >> ORGANISM Saccharomyces cerevisiae >> Eukaryota; Fungi; Ascomycota; Saccharomycotina; >> Saccharomycetes; >> Saccharomycetales; Saccharomycetaceae; Saccharomyces. >> >> ... >> >> ## the fully-manual way >> my $species = new Bio::Species; >> my $node = new Bio::Taxonomy::Node(-name => 'Saccharomyces >> cerevisiae', >> -rank => 'species', -object_id >> => 1, >> -parent_id => 2); >> my $n2 = new Bio::Taxonomy::Node(-name => 'Saccharomyces', >> -object_id => 2, -parent_id => 3); >> # (no assumption that 'Saccharomyces' is the genus, so rank() >> undefined) >> my $n3 = [etc] >> $species->add_node($node); >> $species->add_node($n2); >> [etc] > > > Hrmm... why would you add multiple nodes to a species object? A > Species > is-a Node, not a full Bio::Taxonomy. No. See above: Bio::Species is-a Bio::Taxonomy. > Taxonomy has-a Node (hence the > add_node() method). So, you should be able to add a NodeI- > implementing > object to a Taxonomy object (either a Node or a Species). Let's keep Bio::Species and Taxonomy::Node separate. They look like representing something similar but once you look at the Bio::Species API (and a Genbank record) you realize they do not. Bio::Species is more like an entire lineage and the species node all flattened out into one. I'm not sure Bio::Species would need to implement a Bio::TaxonomyI interface; it may as well just use an implementation of it internally. I'm not sure how Sendu wants to design this, but for sure Bio::Taxonomy::Node should not be a Bio::Species, and the reverse should rather be avoided too. >> [..] >> The way to do it is to have the Bio::DB::Taxonomy* modules return >> only >> the information that a Bio::Taxonomy::FactoryI would need to make a >> NodeI. The specific Factory that you use could generate whatever >> type of >> Node you wanted. > > Yes, using an object factory here makes a lot of sense, returning the > correct object type based on the rank. 
Well, I don't think you'd want to create instances of different node classes depending on the rank of the node. However, a particular factory implementation may of course be free to do exactly that. > ... >> Bio::Species differs from Bio::Taxonomy only so it contains all the >> legacy methods names that Bio::Species currently has, for backward >> compatibility. Setting $species->classification() would delete all >> nodes >> of self, use a GenbankFactory to make a new Bio::Species, then >> pull out >> all its Nodes and add them to self. > > The idea is to replace Bio::Species with something that works well, so > having it implement a Node-like interface works since it is-a > Node. Having > it implement a Taxonomy-like interface, though, doesn't make a lot > of sense > as a species is-not-a Taxonomy. It should act just like a fancier > node > object. No, I'd really recommend against muddling up a taxonomy node model with the Bio::Species legacy model. Bio::Species is not a node at all. You may argue it's not a taxonomy either. This is just one more reason for containing the Bio::Species contagious disease of conflating disjoint concepts into one. > > Using a factory in Bio::DB::Taxonomy should solve any issues about > what > object type is returned, since that could simply be made based on > the rank > itself (species rank or below == Bio::Taxonomy::Species, genus and > above == > Bio::Taxonomy::Node). Bio::Taxonomy::Species was an invention of mine and - if created - should not be used for anything else other than representing a taxonomy node as a Bio::Species object iff necessary (i.e., if the client really wants a Bio::Species object). I'd actually like to see what Sendu would come up with. It sounds at the very minimum like an excellent start. 
	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================

From hlapp at gmx.net  Mon Jul 24 15:59:10 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Mon, 24 Jul 2006 15:59:10 -0400
Subject: [Bioperl-l] Bio::Species/Bio::Taxonomy changes
In-Reply-To: <003d01c6af58$3dc4ac40$15327e82@pyrimidine>
References: <003d01c6af58$3dc4ac40$15327e82@pyrimidine>
Message-ID: <3C520B8C-8755-4A7E-80CF-8B94FEAB867E@gmx.net>

On Jul 24, 2006, at 3:34 PM, Chris Fields wrote:

> Looks like the organelle sequence data uses the organism TaxID.

Then you might as well store it as annotation. Really the only thing
that matters is that the flat file writers can get from an expected
location.

In fact storing as annotation is better e.g. for Biosql since right
now the taxonomy model is the NCBI model and so organelle will not be
stored (and hence neither be round-tripped).

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================

From cjfields at uiuc.edu  Mon Jul 24 16:10:20 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Mon, 24 Jul 2006 15:10:20 -0500
Subject: [Bioperl-l] Bio::Species/Bio::Taxonomy changes
In-Reply-To: <3C520B8C-8755-4A7E-80CF-8B94FEAB867E@gmx.net>
Message-ID: <000001c6af5d$3094b830$15327e82@pyrimidine>

Sounds good. Will be easy to change this over.

Chris

> -----Original Message-----
> From: Hilmar Lapp [mailto:hlapp at gmx.net]
> Sent: Monday, July 24, 2006 2:59 PM
> To: Chris Fields
> Cc: 'Sendu Bala'; bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] Bio::Species/Bio::Taxonomy changes
>
>
> On Jul 24, 2006, at 3:34 PM, Chris Fields wrote:
>
> > Looks like the organelle sequence data uses the organism TaxID.
>
> Then you might as well store it as annotation.
Really the only thing > that matters is that the flat file writers can get from an expected > location. > > In fact storing as annotation is better e.g. for Biosql since right > now the taxonomy model is the NCBI model and so organelle will not be > stored (and hence neither be round-tripped). > > -hilmar > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > From hlapp at gmx.net Mon Jul 24 16:12:39 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Mon, 24 Jul 2006 16:12:39 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <003e01c6af5a$390cdea0$15327e82@pyrimidine> References: <003e01c6af5a$390cdea0$15327e82@pyrimidine> Message-ID: <5FB07071-42D7-4F43-B2A1-3AF5F1FC5193@gmx.net> On Jul 24, 2006, at 3:49 PM, Chris Fields wrote: > Yes, 'largely' the key word. I don't really agree with Sendu's > hierarchy > scheme (making Species implement Taxonomy and not Node doesn't make > sense), > but, besides that, everything else seems fine. I like the > following setup > (which is similar to what you proposed, I believe), which I already > posted. > > |-----Tax::Node > NodeI-------| > |-----Tax::SpeciesNode > | > SpeciesI -------| > > Taxonomy::Node is-a NodeI > Taxonomy::SpeciesNode is-a NodeI and-a SpeciesI I don't even think we would need SpeciesI - why would a species- ranked taxonomy node be so different from any other node such that it would need its own interface. Chris - just one suggestion: take a step back and imagine a Bioperl in which Bio::Species had never existed. Instead, only taxonomy nodes existed, and code that can effectively deal with them, including filtering by rank. In this picture, what would you make to want to introduce SpeciesI and Bio::Species? Frankly, I don't see anything. 
I.e., the only reason is backward compatibility (which is a valid reason), but let's not glorify Bio::Species by adding ill-conceived interfaces. > > Bio::DB::Taxonomy uses a factory to return NodeI-implementing modules; > specifically, a SpeciesNode for species ranks or below, and a Node for > anything else. Like I said before, SpeciesNode or whatever it's called would draw its right of existence solely from backward compatibility - don't use it for anything else. And if you can achieve backward compatibility by other means, don't even create a SpeciesNode. My $0.02 ... -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From cjfields at uiuc.edu Mon Jul 24 17:34:29 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 24 Jul 2006 16:34:29 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <5FB07071-42D7-4F43-B2A1-3AF5F1FC5193@gmx.net> Message-ID: <000101c6af68$f27521a0$15327e82@pyrimidine> > I don't even think we would need SpeciesI - why would a species- > ranked taxonomy node be so different from any other node such that it > would need its own interface. > > Chris - just one suggestion: take a step back and imagine a Bioperl > in which Bio::Species had never existed. Instead, only taxonomy nodes > existed, and code that can effectively deal with them, including > filtering by rank. In this picture, what would you make to want to > introduce SpeciesI and Bio::Species? Argh!!! Just when I thought I could pull away... Okay. I thought it would be nice to have a class that could accomplish two things: 1) Act as a container for GenBank taxonomy information; Bio::Taxonomy::Node, as written by Jason, was meant to be a replacement for Bio::Species. 2) Also act as a bridge, so you had the option to retrieve the Species object from a sequence object and have it act like a Node (be db-aware out-of-the-box, so to speak). 
Also, I'm trying to follow the original idea as proposed by Jason (this is from perldoc Bio::Taxonomy::Node): DESCRIPTION This is the next generation (for Bioperl) of representing Taxonomy information. Previously all information was managed by a single object called Bio::Species. This new implementation allows representation of the intermediate nodes not just the species nodes and can relate their connections. Which, to me, indicated that this would eventually replace Bio::Species (so, in effect, must at least contain the relevant data for sequence objects w/o being completely reliant on DB, yet still be DB-aware). Everything about Bio::Species on the wiki also leads me to believe that this was the original intent for Bio::Taxonomy::Node. http://www.bioperl.org/wiki/Module:Bio::Species http://www.bioperl.org/wiki/Project_priority_list#Taxonomy_.2F_Species_data And all the original methods (genus(), species(), etc.) also seem to indicate this. That's really it. I could give a toss about getting taxonomy information directly from Bio::Species. And you're right: in hindsight Bio::Species is flawed. However, it seemed from the beginning of this discussion with Sendu and the proposed changes, that Bio::Species should stick around in some capacity but should also be involved with Bio::Taxonomy (contrary to Jason's idea above). Now I'm hearing something completely different (Sendu still argues that it should be involved). I had originally wanted to start delegating everything over to Taxonomy::Node about a month ago, when I found that it was remarkably easy to do so. However, when Sendu proposed making changes to remove methods in Bio::Taxonomy::Node and make sweeping changes to Taxonomy which would prevent an easy transition over to Node, I felt that it would be harder to effectively have it take over for Bio::Species when parsing SeqIO objects (all the calls to genus/species/subspecies etc methods would have to be removed from all the classes which use Bio::Species). 
Hence Bio::Taxonomy::Species as a compromise. Now it turns out no one wants to have either Bio::Species (your 'contagion' references clues me in there) or Bio::Taxonomy::Species. If we think it would be better to completely toss all this out the window and use only a bare-bones Node, then I'm fine with that. But if we go that route we should just get rid of the Bio::Species 'disease' completely and have things be much simpler. Simple is good! I think Node can still act as a viable container class for the tax data from a GenBank file (it's original purpose) as long as it has the very basic methods for doing so. That would require: scientific_name() - ORGANISM line data common_names() - which could hold common names (in parentheses on the SOURCE line) and the abbreviated name (from the SOURCE line) ncbi_taxid() - from the 'source' seqfeature (already there). The lineage information and organelle information could be stored in Node or in SimpleValue objects. My vote is for the latter as there's no need for a classification() container for Node, which you have repeatedly pointed out. > Frankly, I don't see anything. I.e., the only reason is backward > compatibility (which is a valid reason), but let's not glorify > Bio::Species by adding ill-conceived interfaces. I think we should just get rid of Bio::Species completely. We would need to go in and rework species parsing in the SeqIO modules that use Bio::Species, but that would only make things simpler, not more complex. Get rid of trying to figure out what is a genus or species based on the GenBank information only, and have the bridge between the sequences be stored in a Taxonomy::Node object (which should contain the NCBI TaxID, so then it can use the associated DB object to traverse up and down other nodes). The interface idea was a proposed compromise i.e. my 'bridge' between GenBank taxonomy hell and Bio::Taxonomy bliss, and intended to follow what I thought was Jason's original intent for Bio::Taxonomy::Node. 
Nothing more.

> > Bio::DB::Taxonomy uses a factory to return NodeI-implementing modules;
> > specifically, a SpeciesNode for species ranks or below, and a Node for
> > anything else.
>
> Like I said before, SpeciesNode or whatever it's called would draw
> its right of existence solely from backward compatibility - don't use
> it for anything else. And if you can achieve backward compatibility
> by other means, don't even create a SpeciesNode.

Agreed. But, if there is such venom towards Bio::Species, why not put it
out of its misery as well? Seems like it has outlived its usefulness.

Chris

From cjfields at uiuc.edu  Mon Jul 24 17:53:46 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Mon, 24 Jul 2006 16:53:46 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44C52345.5060903@sendu.me.uk>
Message-ID: <000201c6af6b$a4534580$15327e82@pyrimidine>

> > I'll repeat: a Node and a Species is-not-a Taxonomy.
>
> I'll repeat: A Node is a Node and a Bio::Species is a Taxonomy ;)

Nope. I think this is incorrect. Here's why.

Let's look at the reasons Bio::Taxonomy was started, shall we?

From perldoc Bio::Taxonomy:

DESCRIPTION
    Bio::Taxonomy object represents any rank-level in taxonomy system,
    rather than Bio::Species which is able to represent only species-level.
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

From perldoc Bio::Taxonomy::Node:

DESCRIPTION
    This is the next generation (for Bioperl) of representing Taxonomy
    information. Previously all information was managed by a single object
    called Bio::Species. This new implementation allows representation of the
    intermediate nodes not just the species nodes and can relate their
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    connections.

Bioperl wiki:

http://www.bioperl.org/wiki/Project_priority_list#Taxonomy_.2F_Species_data
http://www.bioperl.org/wiki/Module:Bio::Species

Both talk about delegating or replacing Bio::Species with Bio::Taxonomy::Node.
Every one of those indicates what the original idea for Bio::Taxonomy::Node
was (eventual replacement for Bio::Species). Even the original methods for
Bio::Taxonomy::Node are the same. So, according to this alone, Bio::Species
would eventually be replaced by Bio::Taxonomy::Node.

I wanted an easier transition to Node from Bio::Species (hell, just a few
changes and using Bio::Taxonomy::Node worked fine!), but your proposals made
sense. I saw having a Species-based Tax object as a nice compromise, but
Hilmar has made a few good points: would we have a Bio::Species object around
knowing what we know now? When Bio::Species was originally designed, it was
probably before the NCBI Tax database existed. I think it has outlasted its
current use.

I have posted a response to Hilmar. I think we should just get rid of
Bio::Species altogether and have a Taxonomy::Node contain the basic data
(scientific_name(), common_names(), etc). And remove any SeqIO parsing of
genus/species to simplify everything. All this extra parsing and hand-wringing
over trying to get species/genus information from a GenBank file just mucks up
ORGANISM and SOURCE line parsing anyway. Simplify it. Simple is good.

Radical? Yes, but I agree with him that Bio::Species has outlasted its use.

As for organelle and lineage information, they could be placed in SimpleValue
objects. If anyone wants to grab tax information, they can use the Node object
to get it but they'll need a local flatfile database or network connection to
do so.

This also means there is no need for a Bio::DB::Taxonomy factory: just return
Node objects directly. Each format (flatfile and entrez) currently works this
way anyway, correct? Simplifies that. Simple is better.
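
To make the proposal concrete, here is a minimal sketch of a bare-bones node
that only carries scientific_name(), common_names() and ncbi_taxid(), with the
lineage kept outside the node as a plain annotation value. The package
Sketch::TaxNode, the constructor arguments and the organism_lines() helper are
all made up for illustration; this is not the existing Bio::Taxonomy::Node API,
and the plain hash only stands in for SimpleValue-style annotation.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a bare-bones taxonomy node. NOT the real
# Bio::Taxonomy::Node; names and arguments are assumptions for this sketch.
package Sketch::TaxNode;

sub new {
    my ($class, %args) = @_;
    return bless {
        scientific_name => $args{-scientific_name},
        common_names    => [ @{ $args{-common_names} || [] } ],
        ncbi_taxid      => $args{-ncbi_taxid},
    }, $class;
}

sub scientific_name { return $_[0]->{scientific_name} }
sub common_names    { return @{ $_[0]->{common_names} } }
sub ncbi_taxid      { return $_[0]->{ncbi_taxid} }

package main;

# Values a parser could take straight from the SOURCE/ORGANISM lines and
# the 'source' feature, without guessing genus or species.
my $node = Sketch::TaxNode->new(
    -scientific_name => 'Saccharomyces cerevisiae',
    -common_names    => ["baker's yeast"],
    -ncbi_taxid      => 4932,    # NCBI TaxID of S. cerevisiae
);

# Lineage stored as-is next to the node (a plain hash here, standing in
# for annotation objects), so writing it back needs no database lookup.
my %annotation = (
    lineage => [
        'Eukaryota', 'Fungi', 'Ascomycota', 'Saccharomycotina',
        'Saccharomycetes', 'Saccharomycetales', 'Saccharomycetaceae',
        'Saccharomyces',
    ],
);

# Hypothetical writer helper: rebuilds the ORGANISM block from stored data.
sub organism_lines {
    my ($tax_node, $lineage) = @_;
    return sprintf("  ORGANISM  %s\n            %s.\n",
        $tax_node->scientific_name, join('; ', @{$lineage}));
}

print organism_lines($node, $annotation{lineage});
```

Anything beyond these stored values (parents, ranks, children) would then come
from a taxonomy database via the TaxID, only when somebody asks for it.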
Of course, we couldn't get rid of Bio::Species until all the following were
shifted over to Node somehow:

Instances: 2
BP Module : Bio::Cluster::SequenceFamily
Instances: 4
BP Module : Bio::Cluster::UniGene
Instances: 1
BP Module : Bio::Cluster::UniGeneI
Instances: 1
BP Module : Bio::DB::FileCache
Instances: 3
BP Module : Bio::DB::GFF::Segment
Instances: 1
BP Module : Bio::DB::Taxonomy::flatfile
Instances: 2
BP Module : Bio::Graph::IO::psi_xml
Instances: 1
BP Module : Bio::Map::CytoMap
Instances: 1
BP Module : Bio::Map::LinkageMap
Instances: 3
BP Module : Bio::Map::MapI
Instances: 3
BP Module : Bio::Map::SimpleMap
Instances: 3
BP Module : Bio::Matrix::PSM::InstanceSite
Instances: 6
BP Module : Bio::Phenotype::Correlate
Instances: 1
BP Module : Bio::Phenotype::OMIM::OMIMentry
Instances: 3
BP Module : Bio::Phenotype::OMIM::OMIMparser
Instances: 5
BP Module : Bio::Phenotype::Phenotype
Instances: 2
BP Module : Bio::Phenotype::PhenotypeI
Instances: 4
BP Module : Bio::Seq
Instances: 3
BP Module : Bio::SeqI
Instances: 2
BP Module : Bio::SeqIO::agave
Instances: 4
BP Module : Bio::SeqIO::bsml
Instances: 2
BP Module : Bio::SeqIO::bsml_sax
Instances: 1
BP Module : Bio::SeqIO::chadoxml
Instances: 1
BP Module : Bio::SeqIO::chaos
Instances: 4
BP Module : Bio::SeqIO::embl
Instances: 2
BP Module : Bio::SeqIO::entrezgene
Instances: 3
BP Module : Bio::SeqIO::game::seqHandler
Instances: 4
BP Module : Bio::SeqIO::genbank
Instances: 2
BP Module : Bio::SeqIO::kegg
Instances: 2
BP Module : Bio::SeqIO::locuslink
Instances: 4
BP Module : Bio::SeqIO::swiss
Instances: 2
BP Module : Bio::SeqIO::table
Instances: 2
BP Module : Bio::SeqIO::tigr
Instances: 2
BP Module : Bio::SeqIO::tigrxml
Instances: 7
BP Module : Bio::SeqIO::tinyseq
Instances: 4
BP Module : Bio::Taxonomy
Instances: 1
BP Module : Bio::Taxonomy::Node
Instances: 6
BP Module : Bio::Taxonomy::Taxon
Instances: 9
BP Module : Bio::Taxonomy::Tree
Instances: 5
BP Module : Bio::Tools::Analysis::Protein::ELM

Chris

From bix at
sendu.me.uk Mon Jul 24 18:15:31 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Mon, 24 Jul 2006 23:15:31 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <000101c6af68$f27521a0$15327e82@pyrimidine> References: <000101c6af68$f27521a0$15327e82@pyrimidine> Message-ID: <44C54683.70707@sendu.me.uk> Chris Fields wrote: > > Also, I'm trying to follow the original idea as proposed by Jason (this is > from perldoc Bio::Taxonomy::Node): > > Which, to me, indicated that this would eventually replace Bio::Species Well, we don't really know that Jason didn't later change his mind, but in any case it doesn't make sense (anymore, given that we have Bio::Taxonomy). In a direct reply to me you point out specific passages in the current docs that explain why you have thought we should delegate or replace Bio::Species with Bio::Taxonomy::Node. With respect, the old plans are not something we are forced to blindly follow. We decide for ourselves if they make sense, we decide for ourselves if there is a better way of doing it, and then we do it the best way. So if you ignore what those old bits of documentation say, just pretend you never ever read them, would my proposals make sense or not? Since those old proposals were never implemented we have no reason to try and stick with them if there is a better proposal. And for the record, '...Bio::Species which is able to represent only species-level' can (correctly) be interpreted as 'Bio::Species is only supposed to be used for representing a taxonomy that includes the species-level'. You can't interpret it literally because Bio::Species is used for levels below species, and also represents all the levels above species-level as well. Either Jason got it wrong when he wrote that, or you have misinterpreted it. Likewise, let's play the interpretation game again: 'Previously all information was managed by a single object called Bio::Species. 
[the Bio::Taxonomy::Node] implementation allows representation of the intermediate nodes not just the species nodes'. Note the apposition of 'single object' vs implication of multiple Node objects to do the same job. I imagine at the time Jason wrote that there was no Bio::Taxonomy, no holder for multiple Nodes. > I had originally wanted to start delegating everything over to > Taxonomy::Node about a month ago, when I found that it was remarkably easy > to do so. However, when Sendu proposed making changes to remove methods in > Bio::Taxonomy::Node and make sweeping changes to Taxonomy which would > prevent an easy transition over to Node, But an equally easy transition to Bio::Taxonomy instead. I don't know why you would care about the name of the class we switch to. My concern is that when the switch is made it makes sense. > If we think it would be better to completely toss all this out the window > and use only a bare-bones Node, then I'm fine with that. But if we go that > route we should just get rid of the Bio::Species 'disease' completely and > have things be much simpler. Simple is good! > > I think Node can still act as a viable container class for the tax data from > a GenBank file (it's original purpose) as long as it has the very basic > methods for doing so. That would require: > > scientific_name() - ORGANISM line data > common_names() - which could hold common names (in parentheses on the SOURCE > line) and the abbreviated name (from the SOURCE line) > ncbi_taxid() - from the 'source' seqfeature (already there). > > The lineage information and organelle information could be stored in Node or > in SimpleValue objects. My vote is for the latter as there's no need for a > classification() container for Node, which you have repeatedly pointed out. No, this is the whole point. 
The lineage information can NOT be stored in a Node (unless you abuse Node by
having all those crufty methods like genus() and classification()), and why
would we store it in SimpleValue objects when we have Bio::Taxonomy?

Bio::Taxonomy is completely perfect for storing the taxonomic information
from a GenBank file. That's all you need to worry about. Can we represent the
data correctly? Yes. Do we gain all the good things about a pure
Bio::Taxonomy? Yes. Can we still do everything we used to be able to do? Yes.

> I think we should just get rid of Bio::Species completely.

There's no need to get rid of Bio::Species. It can be a Bio::Taxonomy with
backward-compatible methods. No harm done, all good.

I'll tell you what. This will be easier if I just write the code for my
proposals, including whatever changes would be needed in
Bio::SeqIO::genbank et al. You'll see how easy and appropriate it is, and
hopefully everyone will be happy. Perhaps you could just hold off doing any
similar-but-contradictory work until then.

From hlapp at gmx.net  Mon Jul 24 19:47:10 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Mon, 24 Jul 2006 19:47:10 -0400
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44C54683.70707@sendu.me.uk>
References: <000101c6af68$f27521a0$15327e82@pyrimidine>
	<44C54683.70707@sendu.me.uk>
Message-ID: <86C97545-8DDF-4FE3-9CE9-1799D8BD2EB5@gmx.net>

On Jul 24, 2006, at 6:15 PM, Sendu Bala wrote:

> I'll tell you what. This will be easier if I just write the code for my
> proposals, including whatever changes would be needed in
> Bio::SeqIO::genbank et al.

Never get in the way of somebody who threatens to code :-) so I
certainly won't. I think you're on the right track.

My suggestion is, if you have a good picture in front of you of what
it's going to look like when done, just pretend for a second it is done
already and give us some code examples that use the new (to be done) API.
As a start, some of the situations it's currently used in: - genbank.pm parsing and setting species information for the sequence - user asking for the scientific name of the species of the sequence (obviously, the call would remain unchanged: $seq->species->binomial (). But what happens behind the scene?) - genbank.pm writing the SOURCE information for a sequence Replace genbank.pm with your rich annotation source parser of choice. Then maybe some advanced uses: - from a sequence stream, retain only those of primates - like above, but only mitochondrial sequences - for an organism, query entrez for all sequences of strains, varieties, or subspecies sequences for that organism Add your own if these sound stupid ... Just an idea. -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From cjfields at uiuc.edu Mon Jul 24 22:06:16 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 24 Jul 2006 21:06:16 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <86C97545-8DDF-4FE3-9CE9-1799D8BD2EB5@gmx.net> References: <000101c6af68$f27521a0$15327e82@pyrimidine> <44C54683.70707@sendu.me.uk> <86C97545-8DDF-4FE3-9CE9-1799D8BD2EB5@gmx.net> Message-ID: <4678548F-ABEC-4E14-AD7F-D282D2DC2730@uiuc.edu> > >> I'll tell you what. This will be easier if I just write the code >> for my >> proposals, including whatever changes would be needed in >> Bio::SeqIO::genbank et al. > > Never get in the way of somebody who threatens to code :-) so I > certainly won't. I think you're on the right track. Fine by me. My only request: I don't want every sequence passing through SeqIO having an automatic DB lookup performed on it. SeqIO parsing of GenBank files is slow enough as it is w/o enforcing lookups, even if they are cached. If you want lookups, have it as an option and not as default behavior. 
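
A minimal sketch of what "lookups as an option, not the default" could look
like. Everything here is hypothetical: Sketch::Parser, Sketch::TaxDB, the
-taxonomy_lookup and -taxonomy_db arguments and the lookup() method are
invented for illustration and are not existing Bio::SeqIO or Bio::DB::Taxonomy
API. The stub database only counts how often it gets queried.

```perl
use strict;
use warnings;

# Stub taxonomy database that counts queries; a stand-in for whatever
# database object would really be used (hypothetical, not Bio::DB::Taxonomy).
package Sketch::TaxDB;

sub new {
    my ($class) = @_;
    return bless { calls => 0 }, $class;
}

sub lookup {
    my ($self, $taxid) = @_;
    $self->{calls}++;
    return { taxid => $taxid, resolved => 1 };
}

sub calls { return $_[0]->{calls} }

# Hypothetical parser with an opt-in lookup flag (argument names invented).
package Sketch::Parser;

sub new {
    my ($class, %args) = @_;
    return bless {
        lookup => $args{-taxonomy_lookup} ? 1 : 0,    # default: no lookups
        db     => $args{-taxonomy_db},
    }, $class;
}

sub species_for {
    my ($self, $taxid, $name) = @_;

    # Always keep what the flat file itself says.
    my $species = { taxid => $taxid, scientific_name => $name, resolved => 0 };

    # Only touch the database when the caller explicitly asked for it.
    if ($self->{lookup} && $self->{db}) {
        my $node = $self->{db}->lookup($taxid);
        $species->{resolved} = $node->{resolved};
    }
    return $species;
}

package main;

my $db   = Sketch::TaxDB->new;
my $fast = Sketch::Parser->new(-taxonomy_db => $db);
my $slow = Sketch::Parser->new(-taxonomy_db => $db, -taxonomy_lookup => 1);

my $plain            = $fast->species_for(4932, 'Saccharomyces cerevisiae');
my $calls_after_fast = $db->calls;
my $full             = $slow->species_for(4932, 'Saccharomyces cerevisiae');
my $calls_after_slow = $db->calls;

printf "lookups without flag: %d, after enabling flag: %d\n",
    $calls_after_fast, $calls_after_slow;
```

The point of the design is that plain parsing never pays for a database round
trip; only callers who set the flag do.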
We could have the option for a lookup added pretty easily in genbank.pm _initialize or the main SeqIO constructor as a simple Boolean flag. That might be pretty nice. ... > (). But what happens behind the scene?) > - genbank.pm writing the SOURCE information for a sequence You know, the only really divisive point here is the lineage data and how to store it in _read_GenBank_Species or reproduce it in write_seq (). Again, I don't think we should have a forced lookup for this; it should just be stored as is, either in Node or SimpleValue. Again, I think the latter as everyone seems averse to containing this in Node. > Then maybe some advanced uses: > > - from a sequence stream, retain only those of primates > - like above, but only mitochondrial sequences > - for an organism, query entrez for all sequences of strains, > varieties, or subspecies sequences for that organism For the primate example, would you screen those out via the in-file lineage or using lookups? Something like '$seqout->write_seq($seq) if ($seq->species->organelle eq 'mitochondrion');' for the mitochondria example, which would mean leaving organelle() in Species/Node or whatever is used. The last one, I think, can be done w/o using the sequence directly using NCBI's ELink and the TaxID to cross-reference the nucleotide database. You would probably have to walk through all child nodes, but it's feasible that way. > Add your own if these sound stupid ... > > Just an idea. > > -hilmar > > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. 
Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From cjfields at uiuc.edu Mon Jul 24 22:29:57 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 24 Jul 2006 21:29:57 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C54683.70707@sendu.me.uk> References: <000101c6af68$f27521a0$15327e82@pyrimidine> <44C54683.70707@sendu.me.uk> Message-ID: <6E129BAF-08E3-4BDA-8D1C-F853BE7231C9@uiuc.edu> Look, we're just going back and forth on this stupid little thing, when the only point we really are divided on is what object type we should store certain items in a GenBank file (Bio::Species/ Bio::Tax::Node/Bio::Whatever). In particular, the main sticking point is the lineage. We could go back and forth on what Jason really intended. Personally, I think his past statements are quite clear on what his intent was (he's very clear in the wiki on what Bio::Taxonomy::Node was built to replace, in two separate posts and within the last four months). The reality is he's not here and you're willing to do the job. There is one thing I will make perfectly clear here: there should never, ever be enforced lookups for SeqIO (even using caches), though I have no problem having optional ones. This is something I have stated before and what you propose below steers dangerously in that direction. Where, for instance, do you store the lineage from a GenBank file? Do you want to do a series of Tax lookups to restore that data? I think that the number one complaint for sequence parsing is speed, which would only get slower with lookups (even cached). What I propose is we make it as simple as possible. 
Remove the unnecessary genus/species/subspecies parsing in genbank.pm; store the scientific name, common names, and lineage in some easily accessible way that everyday users can work with; and have it tied to Bio::Taxonomy in some way (I propose Node, as it contains almost all the methods needed) so that you could get more information by moving up and down nodes. I, personally, don't see the point in having Bio::Species around after this discussion as Node seems to do the job adequately.

My last word (I will be exiting this discussion and the group for two weeks): This would have been MUCH easier if all three of us could have gone to the local bar for a beer and discussed it. We should just take the time out to videoconference next time.

Chris

> Chris Fields wrote:
>>
>> Also, I'm trying to follow the original idea as proposed by Jason
>> (this is from perldoc Bio::Taxonomy::Node):
>>
>> Which, to me, indicated that this would eventually replace
>> Bio::Species
>
> Well, we don't really know that Jason didn't later change his mind, but
> in any case it doesn't make sense (anymore, given that we have
> Bio::Taxonomy).
>
> In a direct reply to me you point out specific passages in the current
> docs that explain why you have thought we should delegate or replace
> Bio::Species with Bio::Taxonomy::Node. With respect, the old plans are
> not something we are forced to blindly follow. We decide for ourselves
> if they make sense, we decide for ourselves if there is a better way of
> doing it, and then we do it the best way.
>
> So if you ignore what those old bits of documentation say, just pretend
> you never ever read them, would my proposals make sense or not? Since
> those old proposals were never implemented we have no reason to try and
> stick with them if there is a better proposal.
> > And for the record, '...Bio::Species which is able to represent only > species-level' can (correctly) be interpreted as 'Bio::Species is only > supposed to be used for representing a taxonomy that includes the > species-level'. You can't interpret it literally because > Bio::Species is > used for levels below species, and also represents all the levels > above > species-level as well. Either Jason got it wrong when he wrote > that, or > you have misinterpreted it. > > Likewise, let's play the interpretation game again: 'Previously all > information was managed by a single object called Bio::Species. [the > Bio::Taxonomy::Node] implementation allows representation of the > intermediate nodes not just the species nodes'. Note the apposition of > 'single object' vs implication of multiple Node objects to do the same > job. I imagine at the time Jason wrote that there was no > Bio::Taxonomy, > no holder for multiple Nodes. > > >> I had originally wanted to start delegating everything over to >> Taxonomy::Node about a month ago, when I found that it was >> remarkably easy >> to do so. However, when Sendu proposed making changes to remove >> methods in >> Bio::Taxonomy::Node and make sweeping changes to Taxonomy which would >> prevent an easy transition over to Node, > > But an equally easy transition to Bio::Taxonomy instead. I don't know > why you would care about the name of the class we switch to. My > concern > is that when the switch is made it makes sense. > > >> If we think it would be better to completely toss all this out the >> window >> and use only a bare-bones Node, then I'm fine with that. But if >> we go that >> route we should just get rid of the Bio::Species 'disease' >> completely and >> have things be much simpler. Simple is good! >> >> I think Node can still act as a viable container class for the tax >> data from >> a GenBank file (it's original purpose) as long as it has the very >> basic >> methods for doing so. 
That would require: >> >> scientific_name() - ORGANISM line data >> common_names() - which could hold common names (in parentheses on >> the SOURCE >> line) and the abbreviated name (from the SOURCE line) >> ncbi_taxid() - from the 'source' seqfeature (already there). >> >> The lineage information and organelle information could be stored >> in Node or >> in SimpleValue objects. My vote is for the latter as there's no >> need for a >> classification() container for Node, which you have repeatedly >> pointed out. > > No, this is the whole point. The lineage information can NOT be stored > in a Node (unless you absuse Node by having all those crufty methods > like genus() and classification()), and why would we store it in > SimpleValue objects when we have Bio::Taxonomy? > > Bio::Taxonomy is completely perfect for storing the taxonomic > information from a GenBank file. That's all you need to worry > about. Can > we represent the data correctly? Yes. Do we gain all the good things > about a pure Bio::Taxonomy? Yes. Can we still do everything we used to > be able to do? Yes. > > >> I think we should just get rid of Bio::Species completely. > > There's no need to get rid of Bio::Species. It can be a Bio::Taxonomy > with backward-compatible methods. No harm done, all good. > > > I'll tell you what. This will be easier if I just write the code > for my > proposals, including whatever changes would be needed in > Bio::SeqIO::genbank et al. You'll see how easy and appropriate it is, > and hopefully everyone will be happy. > > Perhaps you could just hold off doing any similar-but-contradictory > work > until then. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. 
Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From hlapp at gmx.net Mon Jul 24 23:31:41 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Mon, 24 Jul 2006 23:31:41 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <6E129BAF-08E3-4BDA-8D1C-F853BE7231C9@uiuc.edu> References: <000101c6af68$f27521a0$15327e82@pyrimidine> <44C54683.70707@sendu.me.uk> <6E129BAF-08E3-4BDA-8D1C-F853BE7231C9@uiuc.edu> Message-ID: <49AA1351-5741-4E19-AFCE-EEA5532531B6@gmx.net> On Jul 24, 2006, at 10:29 PM, Chris Fields wrote: > [...] > We could go back and forth on what Jason really intended. [...] The > reality is he's not here and you're willing to do the job. Right. And, knowing Jason, I think he'd be perfectly fine with seeing his original idea develop in a possibly different direction, provided it will all work nicely in the end. I'm willing to take the beating on me if that doesn't turn out to be true ... > > There is one thing I will make perfectly clear here: there should > never, ever be enforced lookups for SeqIO (even using caches), You certainly don't want taxonomy lookups during the parsing stage, and also not for the client requesting properties of the species that have been parsed with high confidence, i.e., genus and species for a straightforward binomial like 'Homo sapiens'. Writing sequences, IMHO, doesn't have to be as fast. It may be better to emit strict format a bit slower rather than sloppy format a bit faster. Upon parsing, one idea could be for the flat file parser to set a dirty bit in the parsed out species if the parsed text didn't follow strict binomial conventions, hence the parser may have made a mistake and if a client requests the information it is better to lookup the correct values from a taxonomy database. I.e., you could try with a strict regex first that would imply a high-confidence result. If that fails you don't give up but mark the result as untrustworthy. > [...] 
> This would have been MUCH easier if all three of us could have gone > to the local bar for a beer and discussed it. We should just take > the time out to videoconference next time. You're not honestly suggesting that a videoconference is better than having beer together? Enjoy your trip, and thanks for hanging in there in the discussion, I appreciate it. -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From cjfields at uiuc.edu Tue Jul 25 01:53:33 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 25 Jul 2006 00:53:33 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <49AA1351-5741-4E19-AFCE-EEA5532531B6@gmx.net> References: <000101c6af68$f27521a0$15327e82@pyrimidine> <44C54683.70707@sendu.me.uk> <6E129BAF-08E3-4BDA-8D1C-F853BE7231C9@uiuc.edu> <49AA1351-5741-4E19-AFCE-EEA5532531B6@gmx.net> Message-ID: <190F1365-1F8B-401F-8FBA-8DE39D851111@uiuc.edu> So do we intend on having everyone who installs bioperl have a local copy of the taxonomy dumpfile? Or perform a remote lookup via Entrez? Seems a bit extreme. I would like the option of not having the lookup run; as I mentioned to Sendu, one of the biggest complaints about bioperl is speed. Additional lookups won't help on that end. Chris On Jul 24, 2006, at 10:31 PM, Hilmar Lapp wrote: > > On Jul 24, 2006, at 10:29 PM, Chris Fields wrote: > >> [...] >> We could go back and forth on what Jason really intended. [...] The >> reality is he's not here and you're willing to do the job. > > Right. And, knowing Jason, I think he'd be perfectly fine with seeing > his original idea develop in a possibly different direction, provided > it will all work nicely in the end. I'm willing to take the beating > on me if that doesn't turn out to be true ... 
> >> >> There is one thing I will make perfectly clear here: there should >> never, ever be enforced lookups for SeqIO (even using caches), > > You certainly don't want taxonomy lookups during the parsing stage, > and also not for the client requesting properties of the species that > have been parsed with high confidence, i.e., genus and species for a > straightforward binomial like 'Homo sapiens'. > > Writing sequences, IMHO, doesn't have to be as fast. It may be better > to emit strict format a bit slower rather than sloppy format a bit > faster. > > Upon parsing, one idea could be for the flat file parser to set a > dirty bit in the parsed out species if the parsed text didn't follow > strict binomial conventions, hence the parser may have made a mistake > and if a client requests the information it is better to lookup the > correct values from a taxonomy database. I.e., you could try with a > strict regex first that would imply a high-confidence result. If that > fails you don't give up but mark the result as untrustworthy. > > >> [...] >> This would have been MUCH easier if all three of us could have gone >> to the local bar for a beer and discussed it. We should just take >> the time out to videoconference next time. > > You're not honestly suggesting that a videoconference is better than > having beer together? > > Enjoy your trip, and thanks for hanging in there in the discussion, I > appreciate it. > > -hilmar > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. 
Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From bix at sendu.me.uk Tue Jul 25 03:05:23 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Tue, 25 Jul 2006 08:05:23 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <6E129BAF-08E3-4BDA-8D1C-F853BE7231C9@uiuc.edu> References: <000101c6af68$f27521a0$15327e82@pyrimidine> <44C54683.70707@sendu.me.uk> <6E129BAF-08E3-4BDA-8D1C-F853BE7231C9@uiuc.edu> Message-ID: <44C5C2B3.1020304@sendu.me.uk> Chris Fields wrote: > > There is one thing I will make perfectly clear here: there should > never, ever be enforced lookups for SeqIO (even using caches), though > I have no problem having optional ones. This is something I have > stated before and what you propose below steers dangerously in that > direction. Where, for instance, do you store the lineage from a > GenBank file? Do you want to do a series of Tax lookups to restore > that data? I think that the number one complaint for sequence > parsing is speed, which would only get slower with lookups (even > cached). I already gave a code example of exactly how Bio::Taxonomy is perfect for storing the lineage data in a GenBank file with or without a database lookup. I think perhaps at the time you first read this you basically ignored it because you had trouble with the idea of adding nodes to a species. If you have been glossing over my argument, it may be instructive to go over what I've been saying with a clear eye. 
Anyway, here it is again, and remember in this example, Bio::Species isa Bio::Taxonomy:

## the fully-manual way
my $species = new Bio::Species;
my $node = new Bio::Taxonomy::Node(-name => 'Saccharomyces cerevisiae',
                                   -rank => 'species', -object_id => 1,
                                   -parent_id => 2);
my $n2 = new Bio::Taxonomy::Node(-name => 'Saccharomyces',
                                 -object_id => 2, -parent_id => 3);
# (no assumption that 'Saccharomyces' is the genus, so rank() undefined)
my $n3 = [etc]
$species->add_node($node);
$species->add_node($n2);
[etc]

## Using a factory without db access
# assume that Bio::Taxonomy::GenbankFactory implements
# some modified Bio::Taxonomy::FactoryI
my $factory = Bio::Taxonomy::GenbankFactory->new();
my $species = $factory->generate(-classification => ['Saccharomyces cerevisiae',
                                 'Saccharomyces', 'Saccharomycetaceae' ...]);
# the generate() method above just does the fully-manual way for you

## Using a factory with db access
# assume that Bio::Taxonomy::EntrezFactory implements some
# modified Bio::Taxonomy::FactoryI and uses Bio::DB::Taxonomy::entrez
# to get the nodes
my $factory = Bio::Taxonomy::EntrezFactory->new();
my $species = $factory->fetch(-scientific_name => 'Saccharomyces cerevisiae');

So now do you see how we're able to do the Genbank no-db way and the db-using way with the same object model? We're able to do it the same, sane way because a Node is just a node; you can make them yourself manually, or retrieve them from a database. Once you stick them in a Taxonomy you can then (potentially) ask all the questions of the data that you can with existing Bio::Species. No cruft is required anywhere at all. All the Taxonomy classes can be 'pure', while only Bio::Species has to have backward-compatibility methods.
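
The "a Node is just a node" idea can be tried out without any BioPerl classes at all. What follows is only a sketch under stated assumptions: plain hashes stand in for Bio::Taxonomy::Node objects, the third node's missing parent is a simplification, and lineage() and name_at_rank() are made-up helper names, not methods of any existing module.

```perl
use strict;
use warnings;

# Plain hashes stand in for Bio::Taxonomy::Node objects. The keys mirror
# the -name/-rank/-object_id/-parent_id arguments used in the proposal.
my @nodes = (
    { id => 1, parent_id => 2,     name => 'Saccharomyces cerevisiae', rank => 'species' },
    { id => 2, parent_id => 3,     name => 'Saccharomyces' },       # rank deliberately unset
    { id => 3, parent_id => undef, name => 'Saccharomycetaceae' },  # lineage cut short here
);

# Index the nodes by id, roughly what a taxonomy container has to do.
my %by_id = map { ( $_->{id} => $_ ) } @nodes;

# Hypothetical helper: follow parent links from a node up to the root.
sub lineage {
    my ($index, $start_id) = @_;
    my (@names, %seen);
    my $id = $start_id;
    while (defined $id && exists $index->{$id} && !$seen{$id}++) {
        push @names, $index->{$id}{name};
        $id = $index->{$id}{parent_id};
    }
    return @names;
}

# Hypothetical helper: name of the first node in the lineage that
# carries the requested rank, or nothing if no node records it.
sub name_at_rank {
    my ($index, $start_id, $rank) = @_;
    my %seen;
    my $id = $start_id;
    while (defined $id && exists $index->{$id} && !$seen{$id}++) {
        my $node = $index->{$id};
        return $node->{name}
            if defined $node->{rank} && $node->{rank} eq $rank;
        $id = $node->{parent_id};
    }
    return;
}

print join('; ', lineage(\%by_id, 1)), "\n";
# prints "Saccharomyces cerevisiae; Saccharomyces; Saccharomycetaceae"

my $genus = name_at_rank(\%by_id, 1, 'genus');
print defined $genus ? $genus : '(rank not recorded)', "\n";
# prints "(rank not recorded)"
```

The second call comes back empty because no node was given the rank 'genus', which matches the "no assumption that 'Saccharomyces' is the genus" comment in the proposal.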
From bernd.web at gmail.com Tue Jul 25 06:47:50 2006
From: bernd.web at gmail.com (Bernd Web)
Date: Tue, 25 Jul 2006 12:47:50 +0200
Subject: [Bioperl-l] Structure::IO
Message-ID: <716af09c0607250347w24e0d8dbj674758049bc0a4e2@mail.gmail.com>

Hi,

Does someone have experience with Bio::Structure::IO? Example III.9.1 from bptutorial.pl covers most of it, but what, for example, is the chain() method of Bio::Structure::Entry doing? The POD states:

 Title   : chain
 Usage   : @chains = $structure->chain($chain);
 Function: Connects a (or a list of) Chain objects to a Bio::Structure::Entry.
 Returns : list of Bio::Structure::Residue objects
 Args    : One Residue or a reference to an array of Residue objects

But in e.g.

 my $stream = Bio::Structure::IO->new(-file => $filename,
                                      -format => 'pdb');
 while ( my $struc = $stream->next_structure() ) {
     for my $chain ($struc->get_chains) {
         my $chainid = $chain->id;
         my @chains = $struc->chain($chain);
     }
 }

I get Bio::Structure::Chain=HASH(0x9f1ab50). What is the function of the chain method and how should it be used?

Best regards,
bernd

From bernd.web at gmail.com Tue Jul 25 07:44:28 2006
From: bernd.web at gmail.com (Bernd Web)
Date: Tue, 25 Jul 2006 13:44:28 +0200
Subject: [Bioperl-l] SeqUtils
Message-ID: <716af09c0607250444y3e005fb1t4e20094fd8db993d@mail.gmail.com>

Hi,

With Bio::SeqUtils it may be nice to support three-letter codes written in capitals only, too. Now

 my $string = Bio::SeqUtils->seq3in($seqobj, 'METGLYTER');

will give XXX in $string->seq. Possibly the capitals in MetGlyTer are used to find the amino acid codes? If not, maybe it's easy to implement case-insensitive, or all-capitals, matching for AA codes in SeqUtils?

In addition, about the POD: maybe it's better not to use $string, since Bio::SeqUtils->seq3in does not return a string but a Bio::PrimarySeq object.

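
Until something like that exists in SeqUtils itself, the casing can be fixed on the caller's side. This is a minimal sketch, assuming (as the MetGlyTer remark suggests, but not verified here) that seq3in() matches the mixed-case form. normalize_three_letter_codes() is a made-up helper, not part of Bio::SeqUtils.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Bio::SeqUtils: re-case a string of
# three-letter amino acid codes, e.g. 'METGLYTER' -> 'MetGlyTer'.
sub normalize_three_letter_codes {
    my ($codes) = @_;
    die "length of '$codes' is not a multiple of three\n"
        if length($codes) % 3;
    # Split into three-character chunks, lower-case each chunk, then
    # upper-case its first letter.
    return join '', map { ucfirst(lc($_)) } $codes =~ /(.{3})/g;
}

print normalize_three_letter_codes('METGLYTER'), "\n";   # prints "MetGlyTer"
print normalize_three_letter_codes('metglyter'), "\n";   # prints "MetGlyTer"

# The result would then be handed to the call from the message, e.g.
#   Bio::SeqUtils->seq3in($seqobj, normalize_three_letter_codes('METGLYTER'));
```

The helper only changes letter case. It does not check that each chunk is a real amino acid code, so unknown codes would still be left for seq3in() to deal with.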
Regards,
Bernd

From cjfields at uiuc.edu Tue Jul 25 08:28:01 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Tue, 25 Jul 2006 07:28:01 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44C5C2B3.1020304@sendu.me.uk>
References: <000101c6af68$f27521a0$15327e82@pyrimidine> <44C54683.70707@sendu.me.uk> <6E129BAF-08E3-4BDA-8D1C-F853BE7231C9@uiuc.edu> <44C5C2B3.1020304@sendu.me.uk>
Message-ID: <05718B29-6432-4ACA-A10F-2B384CB68191@uiuc.edu>

Look, you explaining this to me, as you see it, does not convince me that it's the correct or right way to do it. Okay? Can we agree on that? I do not think that Species and Taxonomy are the same thing.

A species should not hold more than one node. A species, by definition, is a rank in Taxonomy, and is a node, not a full Taxonomy, so Bio::Species should be a Node, not a Taxonomy. I don't see how I can be any clearer...

The fact that it may work is beyond the point. That's like putting duct tape on a leak to me. Why not just simplify Bio::Species into a Node? Or make it into a Node and get rid of it altogether.

You are going to do what you want to do, regardless of what I say. Seems to be par for the course here. I'm REALLY tired of arguing the point. Okay? Just drop it. I have other priorities in life besides goddamned bioperl right now...

Chris

On Jul 25, 2006, at 2:05 AM, Sendu Bala wrote:

> Chris Fields wrote:
>>
>> There is one thing I will make perfectly clear here: there should
>> never, ever be enforced lookups for SeqIO (even using caches), though
>> I have no problem having optional ones. This is something I have
>> stated before and what you propose below steers dangerously in that
>> direction. Where, for instance, do you store the lineage from a
>> GenBank file? Do you want to do a series of Tax lookups to restore
>> that data? I think that the number one complaint for sequence
>> parsing is speed, which would only get slower with lookups (even
>> cached).
> > I already gave a code example of exactly how Bio::Taxonomy is perfect > for storing the lineage data in a GenBank file with or without a > database lookup. I think perhaps at the time you first read this you > basically ignored it because you had trouble with the idea of adding > nodes to a species. If you have been glossing over my argument, it may > be instructive to go over what I've been saying with a clear eye. > Anyway, here it is again, and remember in this example, > Bio::Species isa > Bio::Taxonomy: > > > ## the fully-manual way > my $species = new Bio::Species; > my $node = new Bio::Taxonomy::Node(-name => 'Saccharomyces > cerevisiae', > -rank => 'species', -object_id > => 1, > -parent_id => 2); > my $n2 = new Bio::Taxonomy::Node(-name => 'Saccharomyces', > -object_id => 2, -parent_id => 3); > # (no assumption that 'Saccharomyces' is the genus, so rank() > undefined) > my $n3 = [etc] > $species->add_node($node); > $species->add_node($n2); > [etc] > > ## Using a factory without db access > # assume that Bio::Taxonomy::GenbankFactory implements > # some modified Bio::Taxonomy::FactoryI > my $factory = Bio::Taxonomy::GenbankFactory->new(); > my $species = $factory->generate(-classification => ['Saccharomyces > cerevisiae', 'Saccharomyces', > 'Saccharomycetaceae' ...]); > # the generate() method above just does the fully-manual way for you > > ## Using a factory with db access > # assume that Bio::Taxonomy::EntrezFactory implements some > # modified Bio::Taxonomy::FactoryI and uses Bio::DB::Taxonomy::entrez > # to get the nodes > my $factory = Bio::Taxonomy::EntrezFactory->new(); > my $species = $factory->fetch(-scientifc_name => 'Saccharomyces > cerevisiae'); > > > So now do you see how we're able to do the Genbank no-db way and the > db-using way with the same object model? We're able to do it the same, > sane way because a Node is just a node; you can make them yourself > manually, or retrieve them from a database. 
Once you stick them in a > Taxonomy you can then (potentially) ask all the questions of the data > that you can with existing Bio::Species. No cruft is required anywhere > at all. All the Taxonomy classes can be 'pure', while only > Bio::Species > has to have backward-compatibility methods. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From bix at sendu.me.uk Tue Jul 25 08:52:03 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Tue, 25 Jul 2006 13:52:03 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <05718B29-6432-4ACA-A10F-2B384CB68191@uiuc.edu> References: <000101c6af68$f27521a0$15327e82@pyrimidine> <44C54683.70707@sendu.me.uk> <6E129BAF-08E3-4BDA-8D1C-F853BE7231C9@uiuc.edu> <44C5C2B3.1020304@sendu.me.uk> <05718B29-6432-4ACA-A10F-2B384CB68191@uiuc.edu> Message-ID: <44C613F3.7070903@sendu.me.uk> Chris Fields wrote: > A species should not hold more than one node. A species, by > definition, is a rank in Taxonomy, and is a node, not a full > Taxonomy, so Bio::Species should be a Node, not a Taxonomy. I don't > see how I can be any clearer... Right, we have differing viewpoints because you're concerned with what Bio::Species /should/ be, based on the name of the file and perhaps its original intent, whilst I am treating it as what it actually /is/, which is an object that is used to contain information about multiple taxonomic nodes. > The fact that it may work is beyond the point. That's like putting > duct tape on a leak to me. Why not just simplify Bio::Species into a > Node? Or make it into a Node and get rid of it altogether. Bio::Species, again ignore the name, is just a thing that lets us store and retrieve a certain set of data. If we simplified it into a pure Node, it could no longer do that job. 
If we just get rid of it altogether it can no longer do its job. By making it a Bio::Taxonomy it can continue to do its job without having to have Node objects with cruft. It would also gain the useful methods of Bio::Taxonomy at the same time.

I really don't mean to upset you, and I apologise for having done so. I've been presenting what I thought was a logical argument in favour of Bio::Species as Bio::Taxonomy, and waiting to see if anyone would come up with a logical argument why that would be inappropriate, or why something else would be better. I'm not saying you're wrong and I'm certainly listening and would change my choice based on what you have to say. I don't think it's fair to say that disregarding what you have to say is 'par for the course' - I already /have/ regarded what you had to say in this thread and ended up doing scientific_name() as purely what we get from the database.

From hlapp at gmx.net Tue Jul 25 09:47:47 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Tue, 25 Jul 2006 09:47:47 -0400
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44C5C2B3.1020304@sendu.me.uk>
References: <000101c6af68$f27521a0$15327e82@pyrimidine> <44C54683.70707@sendu.me.uk> <6E129BAF-08E3-4BDA-8D1C-F853BE7231C9@uiuc.edu> <44C5C2B3.1020304@sendu.me.uk>
Message-ID:

On Jul 25, 2006, at 3:05 AM, Sendu Bala wrote:

> [...]
> ## the fully-manual way
> my $species = new Bio::Species;
> my $node = new Bio::Taxonomy::Node(-name => 'Saccharomyces cerevisiae',
>                                    -rank => 'species', -object_id => 1,
>                                    -parent_id => 2);

If this is meant as an example for the use cases I enumerated, then you wouldn't have the parent_id from a Genbank file. However, you didn't have that before either, so no problem.

> my $n2 = new Bio::Taxonomy::Node(-name => 'Saccharomyces',
>                                  -object_id => 2, -parent_id => 3);
> # (no assumption that 'Saccharomyces' is the genus, so rank() undefined)

I think in a confident parse you want to assign 'genus' if there's little doubt, for example 'Saccharomyces cerevisiae'. Not sure whether there are weird viruses whose names look innocuous but in reality don't follow binomial convention.

> my $n3 = [etc]
> $species->add_node($node);
> $species->add_node($n2);

I know why you are doing this, but seeing this people will hit a mental snag. You should listen to Chris' refusal to see the sense in this as an indication that many people down the road won't see the sense either. So instead, make the logical model in your design more obvious, which I think ultimately will help maintainability as well. For example:

my $taxonomy = Bio::Taxonomy->new();
my $node = new Bio::Taxonomy::Node(-name => 'Saccharomyces cerevisiae',
                                   -rank => 'species', -object_id => 1,
                                   -parent_id => 2);
my $n2 = new Bio::Taxonomy::Node(-name => 'Saccharomyces',
                                 -object_id => 2, -parent_id => 3);
$taxonomy->add_node($node);
$taxonomy->add_node($n2);
my $species = Bio::Species->new(-lineage => $taxonomy);
print $species->binomial();
print $species->genus();
# this may trigger a lookup if a taxonomy db handle has been set, e.g.:
# $taxonomy->db_handle(Bio::DB::Taxonomy->new(-source => 'entrez'));
print $species->classification();

> [etc]
>
> ## Using a factory without db access
> # assume that Bio::Taxonomy::GenbankFactory implements
> # some modified Bio::Taxonomy::FactoryI
> my $factory = Bio::Taxonomy::GenbankFactory->new();
> my $species = $factory->generate(-classification => ['Saccharomyces cerevisiae',
>                                  'Saccharomyces', 'Saccharomycetaceae' ...]);
> # the generate() method above just does the fully-manual way for you

Except the method name would be create_object(), the parameter would be a hash ref, and the return value would be a Bio::TaxonomyI compliant object:

my $taxonomy = $factory->create_object({-classification =>
                   ['Saccharomyces cerevisiae', 'Saccharomyces',
                    'Saccharomycetaceae' ...]});
my $species = Bio::Species->new(-lineage => $taxonomy);

>
> ## Using a factory with db access
> # assume that Bio::Taxonomy::EntrezFactory implements some
> # modified Bio::Taxonomy::FactoryI and uses Bio::DB::Taxonomy::entrez
> # to get the nodes
> my $factory = Bio::Taxonomy::EntrezFactory->new();

The logic for where to do a lookup should not be duplicated here. It only belongs under Bio::DB::Taxonomy::*.

> my $species = $factory->fetch(-scientific_name => 'Saccharomyces cerevisiae');

Likewise, use the methods defined in Bio::DB::Taxonomy, and again, the return type is Bio::Taxonomy, which you would pass to Bio::Species->new().

-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From hlapp at gmx.net Tue Jul 25 09:54:14 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Tue, 25 Jul 2006 09:54:14 -0400
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <190F1365-1F8B-401F-8FBA-8DE39D851111@uiuc.edu>
References: <000101c6af68$f27521a0$15327e82@pyrimidine> <44C54683.70707@sendu.me.uk> <6E129BAF-08E3-4BDA-8D1C-F853BE7231C9@uiuc.edu> <49AA1351-5741-4E19-AFCE-EEA5532531B6@gmx.net> <190F1365-1F8B-401F-8FBA-8DE39D851111@uiuc.edu>
Message-ID: <793AFD5C-D220-493F-BE11-B9023DC9F569@gmx.net>

We intend on having everyone who wants correct taxonomy parsing results for the entire kingdom of life to define his/her authoritative taxonomy database, be it local or not, be it HTTP or SQL queried.

If you don't care about the correctness of the taxonomy parse, or if the taxonomy information in the flat file is trivially parseable because it conforms to standard binomial convention, then whatever is to be put in place needs to work fine regardless of whether a taxonomy database is defined or not.

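
The behaviour described here (trust a strict binomial parse, flag anything else, and consult a database only if one has been defined) could be sketched roughly as follows. Everything in it is an assumption made for illustration: parse_organism(), genus_of(), the 'dirty' key and the stub lookup are hypothetical names, not existing BioPerl API, and the regex is deliberately naive.

```perl
use strict;
use warnings;

# Hypothetical parser step: a strict binomial ("Genus species") gives a
# trusted genus/species split; anything else is kept verbatim and flagged
# as dirty so a client can decide whether to consult a taxonomy database.
# The pattern is naive on purpose: a two-word name that merely looks
# binomial would still be trusted.
sub parse_organism {
    my ($name) = @_;
    if ($name =~ /^([A-Z][a-z]+) ([a-z]+)$/) {
        return { scientific_name => $name, genus => $1, species => $2, dirty => 0 };
    }
    return { scientific_name => $name, dirty => 1 };
}

# Hypothetical accessor: clean records never trigger a lookup; dirty
# records are resolved only when the caller supplied a lookup code ref
# (a stand-in for a real taxonomy database handle).
sub genus_of {
    my ($rec, $lookup) = @_;
    return $rec->{genus} unless $rec->{dirty};
    return unless $lookup;
    my $fixed = $lookup->($rec->{scientific_name});
    return $fixed ? $fixed->{genus} : ();
}

# Stub standing in for a taxonomy database; it knows a single name.
sub stub_lookup {
    my ($name) = @_;
    return { genus => 'Lentivirus' } if $name eq 'Human immunodeficiency virus 1';
    return;
}

for my $name ('Homo sapiens', 'Human immunodeficiency virus 1') {
    my $rec   = parse_organism($name);
    my $genus = genus_of($rec, \&stub_lookup);
    printf "%-32s dirty=%d genus=%s\n",
        $name, $rec->{dirty}, defined $genus ? $genus : '?';
}
```

Calling genus_of() without a lookup still works for clean records and simply returns nothing for dirty ones, which keeps the lookup optional rather than enforced.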
-hilmar On Jul 25, 2006, at 1:53 AM, Chris Fields wrote: > So do we intend on having everyone who installs bioperl have a local > copy of the taxonomy dumpfile? Or perform a remote lookup via > Entrez? Seems a bit extreme. > > I would like the option of not having the lookup run; as I mentioned > to Sendu, one of the biggest complaints about bioperl is speed. > Additional lookups won't help on that end. > > Chris > > On Jul 24, 2006, at 10:31 PM, Hilmar Lapp wrote: > >> >> On Jul 24, 2006, at 10:29 PM, Chris Fields wrote: >> >>> [...] >>> We could go back and forth on what Jason really intended. [...] The >>> reality is he's not here and you're willing to do the job. >> >> Right. And, knowing Jason, I think he'd be perfectly fine with seeing >> his original idea develop in a possibly different direction, provided >> it will all work nicely in the end. I'm willing to take the beating >> on me if that doesn't turn out to be true ... >> >>> >>> There is one thing I will make perfectly clear here: there should >>> never, ever be enforced lookups for SeqIO (even using caches), >> >> You certainly don't want taxonomy lookups during the parsing stage, >> and also not for the client requesting properties of the species that >> have been parsed with high confidence, i.e., genus and species for a >> straightforward binomial like 'Homo sapiens'. >> >> Writing sequences, IMHO, doesn't have to be as fast. It may be better >> to emit strict format a bit slower rather than sloppy format a bit >> faster. >> >> Upon parsing, one idea could be for the flat file parser to set a >> dirty bit in the parsed out species if the parsed text didn't follow >> strict binomial conventions, hence the parser may have made a mistake >> and if a client requests the information it is better to lookup the >> correct values from a taxonomy database. I.e., you could try with a >> strict regex first that would imply a high-confidence result. 
If that
>> fails you don't give up but mark the result as untrustworthy.
>>
>>
>>> [...]
>>> This would have been MUCH easier if all three of us could have gone
>>> to the local bar for a beer and discussed it. We should just take
>>> the time out to videoconference next time.
>>
>> You're not honestly suggesting that a videoconference is better than
>> having beer together?
>>
>> Enjoy your trip, and thanks for hanging in there in the discussion, I
>> appreciate it.
>>
>> -hilmar
>> --
>> ===========================================================
>> : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
>> ===========================================================
>>
>>
>>
>>
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
> Christopher Fields
> Postdoctoral Researcher
> Lab of Dr. Robert Switzer
> Dept of Biochemistry
> University of Illinois Urbana-Champaign
>
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>

--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From cjfields at uiuc.edu Tue Jul 25 10:58:29 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Tue, 25 Jul 2006 09:58:29 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <793AFD5C-D220-493F-BE11-B9023DC9F569@gmx.net>
Message-ID: <002601c6affa$ca4433f0$15327e82@pyrimidine>

Agreed. I fully support the addition of an optional lookup; it gives much more flexibility to SeqIO re: your previous examples of screening sequence streams for sequences that are primate, mitochondrial, etc. The key word I want to emphasize is 'optional', not 'enforced'.

I appreciate what Sendu is trying to do; I really do.
I think carrying over an object named 'Bio::Species' into Taxonomy is too confusing (your 'contagion' analogy, as it were). The 'species' concept (biologically speaking here, not talking about the Bioperl class) is a taxonomic rank (i.e. part of a taxonomy). I'm trying to take a biologist's point of view here. What is a 'species'? Or, if we were to stick strictly with using NCBI definitions, what is a 'species'? The NCBI definition of 'species' is simply a rank in a lineage, so it is (in Bioperl terms) a Node. If we were to follow that line of reasoning, why also have a Species object represent a Taxonomy as well? It's way too confusing.

Sendu's repeatedly stating "a Species is a Taxonomy" makes some sense in a BioPerl world only, as we're speaking about a class that has been around for a long time, one that acted as a container of sorts for sequence data. And I understand what he intends to do. Conceptually speaking here, though, the way it is laid out, a Bio::Species object can hold a Node that represents a 'species' rank, as well as a 'genus' Node, and a 'family' node, and on and on. That's not a 'species', that's a taxonomy. So just call it a Taxonomy.

The object itself (Bio::Species) never truly represented a 'species' anyway, biologically speaking, every time it held sequence data. It could be a subspecies, strain, plasmid, unknown, or an unclassified rank ('no rank') or environmental sample. It really held a fancier representation of a node, as based on the TaxID.

My final point is, saying "a species is a taxonomy" to the rest of the biological world doesn't make sense. Maybe it makes sense to you and me and Sendu, in our little Bioperl world. But to the thousands of users out there who don't completely grok the Bioperl class structure, it's just confusing. If I were to get an object back that was labeled Bio::Species, as a biologist I would expect it to be part of a taxonomy, not the actual Taxonomy itself.

So, why not cut to the chase: if we are to fundamentally change the concept of what Bio::Species is by making it a Taxonomy/TaxonomyI or whatever, why not just use a Taxonomy object altogether and not bother with Bio::Species at all? Deprecate it. BTW, I'll be in Connecticut for five days at UConn. So I hope to escape the heat for a bit. Thanks for listening to my side of things. Chris > -----Original Message----- > From: Hilmar Lapp [mailto:hlapp at gmx.net] > Sent: Tuesday, July 25, 2006 8:54 AM > To: Chris Fields > Cc: Sendu Bala; bioperl-l at bioperl.org > Subject: Re: [Bioperl-l] Bio::*Taxonomy* changes > > We intend on having everyone who wants correct taxonomy parsing > results for the entire kingdom of life to define his/her > authoritative taxonomy database, be it local or not, be it HTTP or > SQL queried. > > If you don't care about the correctness of the taxonomy parse, or if > the taxonomy information in the flat file is trivially parseable > because it conforms to standard binomial convention, then whatever is > to be put in place needs to work fine regardless of whether a > taxonomy database is defined or not. > > -hilmar > > On Jul 25, 2006, at 1:53 AM, Chris Fields wrote: > > > So do we intend on having everyone who installs bioperl have a local > > copy of the taxonomy dumpfile? Or perform a remote lookup via > > Entrez? Seems a bit extreme. > > > > I would like the option of not having the lookup run; as I mentioned > > to Sendu, one of the biggest complaints about bioperl is speed. > > Additional lookups won't help on that end. > > > > Chris > > > > On Jul 24, 2006, at 10:31 PM, Hilmar Lapp wrote: > > > >> > >> On Jul 24, 2006, at 10:29 PM, Chris Fields wrote: > >> > >>> [...] > >>> We could go back and forth on what Jason really intended. [...] The > >>> reality is he's not here and you're willing to do the job. > >> > >> Right. 
And, knowing Jason, I think he'd be perfectly fine with seeing > >> his original idea develop in a possibly different direction, provided > >> it will all work nicely in the end. I'm willing to take the beating > >> on me if that doesn't turn out to be true ... > >> > >>> > >>> There is one thing I will make perfectly clear here: there should > >>> never, ever be enforced lookups for SeqIO (even using caches), > >> > >> You certainly don't want taxonomy lookups during the parsing stage, > >> and also not for the client requesting properties of the species that > >> have been parsed with high confidence, i.e., genus and species for a > >> straightforward binomial like 'Homo sapiens'. > >> > >> Writing sequences, IMHO, doesn't have to be as fast. It may be better > >> to emit strict format a bit slower rather than sloppy format a bit > >> faster. > >> > >> Upon parsing, one idea could be for the flat file parser to set a > >> dirty bit in the parsed out species if the parsed text didn't follow > >> strict binomial conventions, hence the parser may have made a mistake > >> and if a client requests the information it is better to lookup the > >> correct values from a taxonomy database. I.e., you could try with a > >> strict regex first that would imply a high-confidence result. If that > >> fails you don't give up but mark the result as untrustworthy. > >> > >> > >>> [...] > >>> This would have been MUCH easier if all three of us could have gone > >>> to the local bar for a beer and discussed it. We should just take > >>> the time out to videoconference next time. > >> > >> You're not honestly suggesting that a videoconference is better than > >> having beer together? > >> > >> Enjoy your trip, and thanks for hanging in there in the discussion, I > >> appreciate it. 
> >> > >> -hilmar > >> -- > >> =========================================================== > >> : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > >> =========================================================== > >> > >> > >> > >> > >> > >> _______________________________________________ > >> Bioperl-l mailing list > >> Bioperl-l at lists.open-bio.org > >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > Christopher Fields > > Postdoctoral Researcher > > Lab of Dr. Robert Switzer > > Dept of Biochemistry > > University of Illinois Urbana-Champaign > > > > > > > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > From cjfields at uiuc.edu Tue Jul 25 11:36:40 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 25 Jul 2006 10:36:40 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: Message-ID: <003301c6b000$203cc560$15327e82@pyrimidine> > On Jul 25, 2006, at 3:05 AM, Sendu Bala wrote: > > > [...] > > ## the fully-manual way > > my $species = new Bio::Species; > > my $node = new Bio::Taxonomy::Node(-name => 'Saccharomyces > > cerevisiae', > > -rank => 'species', -object_id > > => 1, > > -parent_id => 2); > > If this is meant as an example for the use cases I enumerated, then > you wouldn't have the parent_id from a Genbank file. However, you > didn't have that before either, so no problem. > > > my $n2 = new Bio::Taxonomy::Node(-name => 'Saccharomyces', > > -object_id => 2, -parent_id => 3); > > # (no assumption that 'Saccharomyces' is the genus, so rank() > > undefined) > > I think in a confident parse you want to assign 'genus' if there's > little doubt, for example 'Saccharomyces cerevisiae'. 
> Not sure whether there are weird viri whose names look innocuous but in
> reality the name doesn't follow binomial convention.
>
> > my $n3 = [etc]
> > $species->add_node($node);
> > $species->add_node($n2);
>
> I know why you are doing this, but seeing this people will hit a
> mental snag. You should listen to Chris' refusal to see the sense in
> this as an indication that many people down the road won't see the
> sense either.

Thanks for pointing that out. I think there is only a small but
fundamental difference in our views here. I'm trying to view this as an
outsider would, a biologist not familiar with the Bioperl class
structure. I understand what Sendu's trying to accomplish but it's
really confusing to someone not familiar with what Bio::Species is.

Hilmar, you had pointed out several times that Bio::Species and
Bio::Taxonomy shouldn't directly intermingle. My original thought for
genbank.pm _read_GenBank_Species() was this, copied and pasted from my
local genbank.pm. It's sort of extreme, but it passes tests just fine.

sub _read_GenBank_Species {
    my( $self,$buffer) = @_;
    $_ = $$buffer;

    my @organelles = qw(plastid chloroplast mitochondrion);
    my( $source_data, $common_name, @class, $ns_name, $organelle,
        $source_flag, $sci_name, $abbr );

    while (defined($_) || defined($_ = $self->_readline())) {
        # de-HTMLify (links that may be encountered here don't contain
        # escaped '>', so a simple-minded approach suffices)
        s/<[^>]+>//g;
        if ( /^SOURCE\s+(.*)/o ) {
            $source_data = $1;
            $source_data =~ s/\.$//; # remove trailing dot
            # does it have a GenBank common name in parentheses?
            # (list assignment, so we keep the captured text rather
            # than the true/false result of the match)
            ($common_name) = $source_data =~ m{\((.*)\)}xms;
            # organelle? If we find additional odd ones,
            # add to @organelles
            # (match each organelle name against the SOURCE text and
            # keep the first name found, not the number of hits)
            ($organelle) = grep { $source_data =~ /\b\Q$_\E\b/ } @organelles;
            $source_flag = 1;
        } elsif ( /^\s{2}ORGANISM\s+(.*)/o ) {
            $sci_name = $1;
            $source_flag = 0;
        } elsif ($source_flag) { # no ORGANISM
            $common_name .= $source_data;
            $common_name =~ s/\n//g;
            $common_name =~ s/\s+/ /g;
            $source_flag = 0;
        } elsif ( /^\s+(.+)/o ) {
            # lineage information
            my $line = $1;
            # only split on ';' or '.' so that classification
            # that is 2 words will still get matched, use
            # map() to remove trailing/leading spaces
            push(@class, map { s/^\s+//; s/\s+$//; $_; } split /[;\.]+/, $line)
                if ( $line =~ /(;|\.)/ );
        } else {
            # reach end of GenBank tax info
            last;
        }
        $_ = undef; # Empty $_ to trigger read of next line
    }
    $$buffer = $_;

    @class = reverse @class;

    my $make = Bio::Taxonomy::Node->new();
    $make->common_name( $common_name ) if $common_name;
    $make->scientific_name($sci_name) if $sci_name;
    # could use SimpleValue objs here instead
    $make->classification( @class ) if @class;
    $make->organelle($organelle) if $organelle;
    return $make;
}

# back in next_seq...grab the TaxID from 'source'
# seqfeature
# could check organelle() here as well

# add taxon_id from source if available
if($species && ($feat->primary_tag eq 'source')
   && $feat->has_tag('db_xref') && (! $species->ncbi_taxid())) {
    foreach my $tagval ($feat->get_tag_values('db_xref')) {
        if(index($tagval,"taxon:") == 0) {
            $species->ncbi_taxid(substr($tagval,6));
            last;
        }
    }
}

In other words, remove the extra parsing of genus(), species(),
subspecies etc. All GenBank sequences have a node represented in NCBI's
tax database (I checked it out). Even plasmids, unknowns, environmental
samples.

Chris

> So instead, make the logical model in your design more obvious, which
> I think ultimately will help maintainability as well.
For example: > > my $taxonomy = Bio::Taxonomy->new(); > my $node = new Bio::Taxonomy::Node(-name => 'Saccharomyces cerevisiae', > -rank => 'species', -object_id > => 1, > -parent_id => 2); > my $n2 = new Bio::Taxonomy::Node(-name => 'Saccharomyces', > -object_id => 2, -parent_id => 3); > $taxonomy->add_node($node); > $taxonomy->add_node($n2); > > my $species = Bio::Species->new(-lineage => $taxonomy); > print $species->binomial(); > print $species->genus(); > # this may trigger a lookup if a taxonomy db handle has been set, e.g.: > # $taxonomy->db_handle(Bio::DB::Taxonomy->new(-source => 'entrez')); > print $species->classification(); > > > > [etc] > > > > ## Using a factory without db access > > # assume that Bio::Taxonomy::GenbankFactory implements > > # some modified Bio::Taxonomy::FactoryI > > my $factory = Bio::Taxonomy::GenbankFactory->new(); > > my $species = $factory->generate(-classification => ['Saccharomyces > > cerevisiae', 'Saccharomyces', > > 'Saccharomycetaceae' ...]); > > # the generate() method above just does the fully-manual way for you > > Except the method name would be create_object(), the parameter would > be a hash ref, and the return value would be a Bio::TaxonomyI > compliant object: > > my $taxonomy = $factory->create_object({-classification => > ['Saccharomyces > cerevisiae', 'Saccharomyces', > 'Saccharomycetaceae' ...]}); > my $species = Bio::Species->new(-lineage => $taxonomy); > > > > > > ## Using a factory with db access > > # assume that Bio::Taxonomy::EntrezFactory implements some > > # modified Bio::Taxonomy::FactoryI and uses Bio::DB::Taxonomy::entrez > > # to get the nodes > > my $factory = Bio::Taxonomy::EntrezFactory->new(); > > The logic where to do a lookup on should not be duplicated here. It > only belongs under Bio::DB::Taxonomy::*. 
> > > my $species = $factory->fetch(-scientifc_name => 'Saccharomyces > > cerevisiae'); > > Likewise, use the methods defined in Bio::DB::Taxonomy, and again, > the return type is Bio::Taxonomy, which you would pass to > Bio::Species->new(). > > -hilmar > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From bix at sendu.me.uk Tue Jul 25 13:49:04 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Tue, 25 Jul 2006 18:49:04 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <003301c6b000$203cc560$15327e82@pyrimidine> References: <003301c6b000$203cc560$15327e82@pyrimidine> Message-ID: <44C65990.4080500@sendu.me.uk> Chris Fields wrote: > If I were to get an object back that was labeled Bio::Species, as a > biologist I would expect it to be part of a taxonomy, not the actual > Taxonomy itself. I think this is the most important sentence in the discussion. Ok, so it's clear to me that a better solution is needed than my Bio::Taxonomy-related proposal. Sorry for being so slow on the uptake. I also needed to start trying to code my Taxonomy proposal to see some issues with it. [... in another email...] > I'm trying to view this as an outsider would, > a biologist not familiar with the Bioperl class structure. Ok, let's come up with a proposal that makes sense to the biologist and better matches Jason's original idea. ---- long post follows; there's a summary at the end As a biologist when I consider a species I have the following primary questions. 
Let's see how we would answer them using a) Bio::Species and genbank.pm
as they are now, b) Bio::Species if it was a 'pure' Bio::Taxonomy::Node
with no cruft (or if we just dropped Bio::Species and used Node
directly), and Chris' updated genbank.pm. Let's say we got our species
information from a genbank file where the scientific name and tax id are
available to be parsed out.

# What is the species' name?
a) Not guaranteed to be correct.
b) Correct thanks to recent changes to Node, just use scientific_name()

# What is the lineage of this species?
a) I can get a classification array with classification(). It's a bit
rubbish though, I can't tell what any of the array elements are supposed
to be.
b) A pure Node wouldn't store the lineage on itself. There are two
obvious solutions: 1) add cruft to Node by giving it a classification()
method - works as well/bad as a). 2) call get_Lineage_Nodes(), which has
the benefit of telling me what rank each ancestor was, if that
information had been in the file (more likely, if Node was generated
from database). Problem: get_Lineage_Nodes() only works if it can
  $self->db_handle->get_Taxonomy_Node(-taxonid => $self->parent_id);
which obviously doesn't work if the nodes in our lineage didn't come
from a database, but from the parsing of a genbank flat file. As we
parse the genbank file we can certainly make nodes for each word in the
list:
  inside genbank.pm...
  @class = reverse @class;
  my @nodes;
  my $fake_id = 1;
  foreach my $sci_name (@class) {
      push(@nodes, new Bio::Taxonomy::Node(-name => $sci_name,
                                           -object_id => $fake_id++,
                                           -parent_id => $fake_id));
  }
But how do we keep these nodes and make them returnable later by
get_Lineage_Nodes? Perhaps:
  my $taxonomy = new Bio::Taxonomy;
  foreach my $node (@nodes) {
      $taxonomy->add_node($node);
  }
  ...
  my $make = Bio::Taxonomy::Node->new();
  ...
  $make->db_handle($taxonomy);
Bio::Taxonomy would have to implement get_Taxonomy_Node (it has get_node
which only accepts a rank).
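[A minimal, BioPerl-free sketch of the fake-id idea just described, for illustration only. Plain hashes stand in for Bio::Taxonomy::Node objects, and build_fake_taxonomy() and lineage_names() are hypothetical helper names, not part of any BioPerl API. It only shows that once every parsed name has an id and a parent id, a lineage can be recovered by following parent links, which is roughly what get_Lineage_Nodes() would need from whatever sits behind db_handle.]

```perl
use strict;
use warnings;

# Hypothetical stand-in for a taxonomy "database" built from a single
# classification array (most specific name first, i.e. the order you
# have after "@class = reverse @class"). Nodes are plain hashes.
sub build_fake_taxonomy {
    my @class = @_;
    my %node_by_id;
    for my $i (0 .. $#class) {
        my $id = $i + 1;    # fake ids: 1, 2, 3, ...
        $node_by_id{$id} = {
            name      => $class[$i],
            id        => $id,
            # the last (most general) entry has no parent
            parent_id => $i < $#class ? $id + 1 : undef,
        };
    }
    return \%node_by_id;
}

# Follow parent_id links upwards and collect the names on the way.
sub lineage_names {
    my ($db, $id) = @_;
    my @names;
    while (defined $id) {
        my $node = $db->{$id} or last;    # unknown id: stop walking
        push @names, $node->{name};
        $id = $node->{parent_id};
    }
    return @names;
}

my $db = build_fake_taxonomy(
    'Saccharomyces cerevisiae', 'Saccharomyces', 'Saccharomycetaceae'
);
print join(' < ', lineage_names($db, 1)), "\n";
```

[The ids here are only unique within one parsed record, so this sketch says nothing about comparing nodes across files; that still needs real ids from a taxonomy database.]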
Of course this is ugly, storing a Taxonomy in our database handle. We could have a new Bio::DB::Taxonomy:: class instead, that treated a classification array like a database? It could have the added bonus of building up an entire database internally as more input arrays are given to it, able to therefore give each node a unique but consistent id. It would break if one time you gave it qw(Homo Primates) and another time qw(Homo Hominidae Primates), however. Ideas? # What if I don't want the whole lineage, just to know what a specific rank like genus is for my species? a) use genus(), but not guaranteed to be correct. b) two solutions: 1) add cruft to Node by adding a genus() method: as good/bad as a). 2) use get_Lineage_Nodes() or get_Parent_Node() until you find a node with your rank() of interest. Same problems as for lineage question, but also it would be nicer to have a get_node('rank_name') style method. But such a method belongs in something like Bio::Taxonomy, not Node. At the very least a method like genus() would be implemented using pure Node methods like get_Parent_Node(), returning undefined if no parent had a rank() of 'genus', never guessing it. # Is this species the same as another species? a) Not guaranteed to be correct. (no unique id so forced to compare names) b) Correct answer by using object_id() method, along with Chris' change to genbank.pm. # What is the most recent common ancestor of this species and another? a) Can't be answered. b) Use get_LCA_Node(), but same issues as the lineage question, since get_LCA_Node requires a working get_Lineage_Nodes(). It also requires correct (unique) ids for all nodes in all lineages to give the guaranteed correct answer. But at least you /might/ get the correct answer even using only the data in genbank files and no db lookup. ---- summary: It seems like the main problem with Node right now is that it has classification() and things like genus(). 
I propose pure Node method solutions to answer the questions
classification() and genus() were implemented to answer, but in a
better, cruft-free way.

Bio::DB::Taxonomy::genbank anyone?

Then if you started with a Species/Node generated by a genbank parse,
and wanted certain questions answered correctly, you only have to set a
different db_handle(). The Node only stores the static and hopefully
correct information about itself, whilst all other questions go via
db_handle, so you can dynamically swap back and forth between databases
depending on if you need speed or accuracy.

From cjfields at uiuc.edu Tue Jul 25 14:24:12 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Tue, 25 Jul 2006 13:24:12 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44C65990.4080500@sendu.me.uk>
Message-ID: <000001c6b017$873176a0$15327e82@pyrimidine>

Sendu, you'll have to make the changes how you see fit. You see my
point now, which is great. From my perspective, all the object type
(used to contain taxonomy file information) needs to contain is the
scientific name and common names like the SOURCE line abbreviated name
and the actual GenBank common name, if present. All the other cruft
(i.e. genus/species/subspecies) can be excised, and the proper taxonomic
information, if wanted, could be accessed via the object and its TaxID.
Organelle and lineage information needs to be retained (for the
non-taxonomists) and could be stored in that object, bumped to
SimpleValue objects, or just set (alternative, since the data is small)
using a get/set value within the sequence object itself. This would be
the bare-bones approach, which Node can fulfill.

I also like Hilmar's proposal about including optional lookups, which
greatly increases the flexibility when screening sequences. This will
likely require a more complicated object structure (i.e. taxonomy with
nodes). You suggested a Taxonomy-like object which would work; but don't
force Bio::Species into the mix.
Why not just use a simple Bio::Taxonomy object for that (Hilmar's
point)? When one asks for $species->species, they'll get a Node or
Taxonomy, whichever is used (that's up to you). The Node represents a
more-barebones variation, while the Taxonomy object scheme would be more
fully-realized. Either way will work for me. Just don't call it
'species'. ; >

Once this is all done, will we really have a need for Bio::Species?
That's my other point. The only real use for it was as a container
object for sequence data. That job is now done via a Taxonomy/Node
object. The only real use it would have is as a container for taxonomic
information for species ranks or below. I think Node/Taxonomy can handle
even that though, so now it's also redundant. If a class is not useful
and is redundant, maybe it should be deprecated.

Anyway, I can't get involved anymore at this point; I'm too busy with
getting ready for the Kadner Institute next week. Good luck!

Chris

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org
> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Sendu Bala
> Sent: Tuesday, July 25, 2006 12:49 PM
> To: bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] Bio::*Taxonomy* changes
>
> Chris Fields wrote:
> > If I were to get an object back that was labeled Bio::Species, as a
> > biologist I would expect it to be part of a taxonomy, not the actual
> > Taxonomy itself.
>
> I think this is the most important sentence in the discussion. Ok, so
> it's clear to me that a better solution is needed than my
> Bio::Taxonomy-related proposal. Sorry for being so slow on the uptake. I
> also needed to start trying to code my Taxonomy proposal to see some
> issues with it.
>
> > [... in another email...]
> > I'm trying to view this as an outsider would,
> > a biologist not familiar with the Bioperl class structure.
>
> Ok, let's come up with a proposal that makes sense to the biologist and
> better matches Jason's original idea.
> > ---- long post follows; there's a summary at the end > > As a biologist when I consider a species I have the following primary > questions. Let's see how we would answer them using a) Bio::Species and > genbank.pm as they are now, b) Bio::Species if it was a 'pure' > Bio::Taxonomy::Node with no cruft (or if we just dropped Bio::Species > and used Node directly), and Chris' updated genbank.pm. Let's say we got > our species information from a genbank file where the scientific name > and tax id are available to be parsed out. > > # What is the species' name? > a) Not guaranteed to be correct. > b) Correct thanks to recent changes to Node, just use scientific_name() > > > # What is the lineage of this species? > a) I can get a classification array with classification(). It's a bit > rubbish though, I can't tell what any of the array elements are supposed > to be. > b) A pure Node wouldn't store the lineage on itself. There are two > obvious solutions: 1) add cruft to Node by giving it a classification() > method - works as well/bad as a). 2) call get_Lineage_Nodes(), which has > the benefit of telling me what rank each ancestor was, if that > information had been in the file (more likely, if Node was generated > from database). Problem: get_Lineage_Nodes() only works if it can > $self->db_handle->get_Taxonomy_Node(-taxonid => $self->parent_id); > which obviously doesn't work if the nodes in our lineage didn't come > from a database, but from the parsing of a genbank flat file. As we > parse the genbank file we can certainly make nodes for each word in the > list: > inside genbank.pm... @class = reverse @class; > my @nodes; my $fake_id = 1; > foreach my $sci_name (@class) { > push(@nodes, new Bio::Taxonomy::Node(-name => $sci_name, object_id => > $fake_id++, parent_id => $fake_id); > } > But how do we keep these nodes and make them returnable later by > get_Lineage_Nodes? 
Perhaps: > my $taxonomy = new Bio::Taxonomy; > foreach my $node (@nodes) { > $taxonomy->add_node($node); > } > ... > my $make = Bio::Taxonomy::Node->new(); > ... > $make->db_handle($taxonomy); > Bio::Taxonomy would have to implement get_Taxonomy_Node (it has get_node > which only accepts a rank). Of course this is ugly, storing a Taxonomy > in our database handle. We could have a new Bio::DB::Taxonomy:: class > instead, that treated a classification array like a database? It could > have the added bonus of building up an entire database internally as > more input arrays are given to it, able to therefore give each node a > unique but consistent id. It would break if one time you gave it qw(Homo > Primates) and another time qw(Homo Hominidae Primates), however. Ideas? > > > # What if I don't want the whole lineage, just to know what a specific > rank like genus is for my species? > a) use genus(), but not guaranteed to be correct. > b) two solutions: 1) add cruft to Node by adding a genus() method: as > good/bad as a). 2) use get_Lineage_Nodes() or get_Parent_Node() until > you find a node with your rank() of interest. Same problems as for > lineage question, but also it would be nicer to have a > get_node('rank_name') style method. But such a method belongs in > something like Bio::Taxonomy, not Node. At the very least a method like > genus() would be implemented using pure Node methods like > get_Parent_Node(), returning undefined if no parent had a rank() of > 'genus', never guessing it. > > > # Is this species the same as another species? > a) Not guaranteed to be correct. (no unique id so forced to compare names) > b) Correct answer by using object_id() method, along with Chris' change > to genbank.pm. > > > # What is the most recent common ancestor of this species and another? > a) Can't be answered. > b) Use get_LCA_Node(), but same issues as the lineage question, since > get_LCA_Node requires a working get_Lineage_Nodes(). 
It also requires > correct (unique) ids for all nodes in all lineages to give the > guaranteed correct answer. But at least you /might/ get the correct > answer even using only the data in genbank files and no db lookup. > > > ---- summary: > > It seems like the main problem with Node right now is that it has > classification() and things like genus(). I propose pure Node method > solutions to answer the questions classification() and genus() were > implemented to answer, but in a better, cruft-free way. > > Bio::DB::Taxonomy::genbank anyone? > > Then if you started with a Species/Node generated by a genbank parse, > and wanted certain questions answered correctly, you only have to set a > different db_handle(). The Node only stores the static and hopefully > correct information about itself, whilst all other questions go via > db_handle, so you can dynamically swap back and forth between databases > depending on if you need speed or accuracy. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From hlapp at gmx.net Tue Jul 25 15:18:00 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Tue, 25 Jul 2006 15:18:00 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <000001c6b017$873176a0$15327e82@pyrimidine> References: <000001c6b017$873176a0$15327e82@pyrimidine> Message-ID: <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net> On Jul 25, 2006, at 2:24 PM, Chris Fields wrote: > Once this is all done, will we really have a need for Bio::Species? No, except for backwards compatibility. Phasing it out will go over a couple of releases. E.g., v1.6.x could have deprecation warning in the documentation. v1.7+ would have deprecation warnings in the code written to stderr. Just as an aside, we can't just drastically change the return type of a method. 
Instead, if at all possible, there should be a new method so that the
old can be phased out over time but otherwise not changed. I.e., don't
change $seq->species() to now all of a sudden return a node or taxonomic
lineage, even if initially Bio::Species is returned with some magic
under the hood. Instead, create something like

    # return a Bio::Taxonomy::Node:
    my $taxon = $seq->taxon();

    # alternative approach: return a lineage (taxonomy)
    # this would be Bio::TaxonomyI compliant
    my $lineage = $seq->lineage();

The former would require the lineage (and organelle for completeness)
information to be either easily (though not necessarily directly)
accessible through the node, or added as annotation.

	-hilmar
--
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================

From cjfields at uiuc.edu Tue Jul 25 15:30:40 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Tue, 25 Jul 2006 14:30:40 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net>
Message-ID: <000101c6b020$d09bc7b0$15327e82@pyrimidine>

Sounds good to me. I'm fine with any way that it's worked out, either
Taxonomy or Node-based, as long as there is no Bio::Species-based
confusion re: Taxonomy, and that this eventually leads to getting rid of
Bio::Species altogether.

Have fun, guys! (hey, probably the shortest response I have written)...

Chris

> -----Original Message-----
> From: Hilmar Lapp [mailto:hlapp at gmx.net]
> Sent: Tuesday, July 25, 2006 2:18 PM
> To: Chris Fields
> Cc: 'Sendu Bala'; bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] Bio::*Taxonomy* changes
>
>
> On Jul 25, 2006, at 2:24 PM, Chris Fields wrote:
>
> > Once this is all done, will we really have a need for Bio::Species?
>
> No, except for backwards compatibility. Phasing it out will go over a
> couple of releases.
E.g., v1.6.x could have deprecation warning in > the documentation. v1.7+ would have deprecation warnings in the code > written to stderr. > > Just as an aside, we can't just drastically change the return type of > a method. Instead, if at all possible, there should be a new method > so that the old can be phased out over time but otherwise not > changed. I.e., don't change $seq->species() to now all of a sudden > return a node or taxonomic lineage, even if initially Bio::Species is > returned with some magic under the hood. Instead, create something like > > # return a Bio::Taxonomy::Node: > my $taxon = $seq->taxon(); > > # alternative approach: return a lineage (taxonomy) > # this would be Bio::TaxonomyI compliant > my $lineage = $seq->lineage(); > > The former would require the lineage (and organelle for completeness) > information to be either easily (though not necessarily directly) > accessible through the node, or added as annotation. > > -hilmar > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > From cjfields at uiuc.edu Tue Jul 25 22:16:36 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 25 Jul 2006 21:16:36 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C65990.4080500@sendu.me.uk> References: <003301c6b000$203cc560$15327e82@pyrimidine> <44C65990.4080500@sendu.me.uk> Message-ID: <1748BD60-EFBE-4872-94A4-7CA3D58BE458@uiuc.edu> One last thing before I shut off bioperl for a week and concentrate on Connecticut; On Jul 25, 2006, at 12:49 PM, Sendu Bala wrote: > Chris Fields wrote: >> If I were to get an object back that was labeled Bio::Species, as a >> biologist I would expect it to be part of a taxonomy, not the actual >> Taxonomy itself. > > I think this is the most important sentence in the discussion. 
> Ok, so it's clear to me that a better solution is needed than my
> Bio::Taxonomy-related proposal. Sorry for being so slow on the
> uptake. I also needed to start trying to code my Taxonomy proposal
> to see some issues with it.

...

Again, thanks for noticing that.

> ---- summary:
>
> It seems like the main problem with Node right now is that it has
> classification() and things like genus(). I propose pure Node method
> solutions to answer the questions classification() and genus() were
> implemented to answer, but in a better, cruft-free way.
>
> Bio::DB::Taxonomy::genbank anyone?

Ach... You're compromising here; that's not like you.

I think you're making this too complicated by trying too many things at
once. Don't think sudden dramatic changes in the API. Sneak changes in,
in a way that doesn't scare users away, then let them get used to the
new way of grabbing Tax data. Make your point that it's more accurate to
do it this way (you'll have defenders in Hilmar and me, BTW).

Do this (start with genbank.pm):

1) Switch out Bio::Species with Node or Taxonomy; relocate other
   information temporarily (Bio::Species, get/sets in Seq object,
   SimpleValue). Leave Bio::Species in for the time being, but don't
   bother making any additional changes to it.
2) Make sure next_seq() and write_seq() work and pass tests. Add
   additional tests for the Tax/Node object (you could even use the tax
   dump data you recently added for more complicated tests).
3) Add in additional stuff bit by bit until it is where you would like
   it.
4) Make sure parsing is kosher with the latest release notes. Probably
   should make sure write_seq follows what the release notes state to
   some degree.

And, really, you won't break anything with genbank.pm organelle()
parsing. If you look at the module the organelle isn't even touched in
next_seq() or _read_GenBank_Species(), so it was broken to begin with!

My proposal, though extreme, was to remove genus() etc (which you wanted
as well with Node).
You could leave this cruft for the time being in Bio::Species, which could still act as a sequence tax info holder object. It just won't be the >default< Seq tax information object, which would be Bio::Taxonomy or Node.

Hence Hilmar's suggestion to use a $seq->taxon() method to return a Node/Taxonomy, while $seq->species() would still return a Bio::Species object. It's redundant, but only for the time being, and the redundant information wouldn't have a major memory footprint anyway (not like the feature table or the full sequence might). Any information that isn't stored in whatever Tax object you use (e.g. lineage or organelle) could be stored temporarily in another fashion, such as a get/set in Seq or a SimpleValue object, to make next_seq/write_seq work (such as $seq->organelle() or $seq->classification(), instead of $seq->species->organelle and so on).

Hilmar then suggests that, around the 1.6-ish release, we note the changes made to SeqIO towards Bio::Taxonomy-based objects, and indicate that Bio::Species via species() and its associated methods will be deprecated around 1.7 (this gives everybody notice on API issues). Then add warnings to Bio::Species in 1.7 noting the deprecation, then remove it from core completely in 1.8 - 2.0.

One last thing, which is minor really: I remember seeing something about having Nodes with 'no rank' ignored unless a flag is used. That may be bad news for some organisms in sequence files where the TaxID is for a 'no rank' rank, such as environmental samples. May want to think about that here.

I'm hoping the releases will start popping out a bit more regularly than they have been. There have been volunteers to release periodic updates for bug fixes etc. If I get a chance I'll try keeping up. Don't count on it though. The conference is 7am-9pm most days, for five days straight!
Chris > > Then if you started with a Species/Node generated by a genbank parse, > and wanted certain questions answered correctly, you only have to > set a > different db_handle(). The Node only stores the static and hopefully > correct information about itself, whilst all other questions go via > db_handle, so you can dynamically swap back and forth between > databases > depending on if you need speed or accuracy. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From vrramnar at student.cs.uwaterloo.ca Tue Jul 25 22:44:17 2006 From: vrramnar at student.cs.uwaterloo.ca (vrramnar at student.cs.uwaterloo.ca) Date: Tue, 25 Jul 2006 22:44:17 -0400 Subject: [Bioperl-l] SNP reference file download In-Reply-To: <775C8983-D81E-474C-B4F5-C765ABEF4A2C@uiuc.edu> References: <000001c6b01f$bfd54e20$15327e82@pyrimidine> <1153868024.44c6a0f83fce6@www.nexusmail.uwaterloo.ca> <775C8983-D81E-474C-B4F5-C765ABEF4A2C@uiuc.edu> Message-ID: <1153881857.44c6d7016775f@www.nexusmail.uwaterloo.ca> Hey Chris, I believe I updated all those modules already as I downloaded the entire DB.tar from Bioperl live. 
Here is my code:

#!/usr/bin/perl -w

use Bio::Perl;
use Bio::DB::EUtilities;

my @ids = qw(rs4986950);
# With the "rs" before the number the warning says: "no returned links"
# Without the "rs" before the number the warning says: "No databases returned; empty linkset"

my $elink = Bio::DB::EUtilities->new( -eutil  => 'elink',
                                      -id     => \@ids,
                                      -db     => 'omim',
                                      -dbfrom => 'snp');
$elink->get_response;
print "IDs: ", join q(,), $elink->get_ids;

Which gives the following error:

-------------------- WARNING ---------------------
MSG: No databases returned; empty linkset
---------------------------------------------------

------------- EXCEPTION -------------
MSG: Must use database to access IDs
STACK Bio::DB::EUtilities::ElinkData::get_LinkIds_by_db /Library/Perl/5.8.6/Bio/DB/EUtilities/ElinkData.pm:201
STACK Bio::DB::EUtilities::get_ids /Library/Perl/5.8.6/Bio/DB/EUtilities.pm:482
STACK toplevel getOmimNum:13
--------------------------------------

All I really want is the OMIM id number under the section: NCBI Resource Links
from the page:
http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?rs=1800562

Any idea why this still isn't working??

Rohan

Quoting Chris Fields :

> Odd, I thought XML::Simple was part of the 5.8 core. Guess I was
> wrong. I plan on changing this to a more robust parser soon (likely
> XML::SAX or XML::Twig, which will also require a download).
>
> That warning occurs when if you don't have a link to OMIM present (No
> databases returned; empty linkset). The way Elink works is it stores
> internal data in a separate object (ELinkData) contained in an
> internal cache. The method get_ids() works for all EUtilities to
> retrieve IDs, even from ELink objects. The unique problem with ELink
> is, since you can search multiple databases. you can retrieve
> multiple sets of IDs.
>
> If you haven't done it, update your EUtilities; the problem is
> similar to one I fixed today (I stated something about updating in my
> last post).
Also, update the main Bio::DB::EUtilities and > Bio::GenericWebDBI as well (the last is the base class from which > EUtilities is based). The 'Count:1' was a debugging statement I > forgot to remove a while ago which I changed in CVS yesterday. It's > possible that commit had other changes which I forgot about. > > Sorry about that, but it is still experimental (emphasis on the > 'mental'). > > Chris > > On Jul 25, 2006, at 5:53 PM, vrramnar at student.cs.uwaterloo.ca wrote: > > > > > Hey Chris, > > > > Ignore the last email, I fixed that problem and downloaded/ > > installed the > > required XML modules. > > > > However, I am now getting this error message: > > > > -------------------- WARNING --------------------- > > MSG: No databases returned; empty linkset > > --------------------------------------------------- > > Count: 1 > > > > ------------- EXCEPTION ------------- > > MSG: Must use database to access IDs > > STACK Bio::DB::EUtilities::ElinkData::get_LinkIds_by_db /Library/ > > Perl/5.8.6/Bio/ > > DB/EUtilities/ElinkData.pm:201 > > STACK Bio::DB::EUtilities::get_ids /Library/Perl/5.8.6/Bio/DB/ > > EUtilities.pm:483 > > STACK toplevel getOmimNum:15 > > > > -------------------------------------- > > > > What does this mean?? > > > > Rohan > > > > Quoting Chris Fields : > > > >> Okay, had to fix an odd bug from ELink due to the way NCBI returns > >> data. > >> > >> You'll need to update the EUtilities modules in bioperl from CVS > >> to make > >> sure this works. 
> >> > >> This is how it's done: > ---------------------------------------- This mail sent through www.mywaterloo.ca From cjfields at uiuc.edu Wed Jul 26 01:01:41 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Wed, 26 Jul 2006 00:01:41 -0500 Subject: [Bioperl-l] SNP reference file download In-Reply-To: <1153881857.44c6d7016775f@www.nexusmail.uwaterloo.ca> References: <000001c6b01f$bfd54e20$15327e82@pyrimidine> <1153868024.44c6a0f83fce6@www.nexusmail.uwaterloo.ca> <775C8983-D81E-474C-B4F5-C765ABEF4A2C@uiuc.edu> <1153881857.44c6d7016775f@www.nexusmail.uwaterloo.ca> Message-ID: The below ID doesn't have any OMIM linked data, hence the warning. The problem is that NCBI, when it doesn't find a link, doesn't send something constructive to tell you that. It sends the original ID encoded in XML, but no actual DB's or ID data links. That's what the warning means. I'll make the original warning a bit more direct: No databases returned; no IDs found. The thrown error is from a logic problem; I have fixed it and committed to CVS. Here's the web page: no OMIM data there either... http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?rs=4986950 Try changing your ID list to this: my @ids = qw(4986950 1800562); You should get back only one ID (only one has an OMIM number). By the way, the SNP data ID is only the digits (don't include the 'rs' designation). Chris On Jul 25, 2006, at 9:44 PM, vrramnar at student.cs.uwaterloo.ca wrote: > > Hey Chris, > > I believe I updated all those modules already as I downloaded the > entire DB.tar > from Bioperl live. 
> [...]

Christopher Fields Postdoctoral Researcher Lab of Dr.
Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From bix at sendu.me.uk Wed Jul 26 05:19:29 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Wed, 26 Jul 2006 10:19:29 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <1748BD60-EFBE-4872-94A4-7CA3D58BE458@uiuc.edu> References: <003301c6b000$203cc560$15327e82@pyrimidine> <44C65990.4080500@sendu.me.uk> <1748BD60-EFBE-4872-94A4-7CA3D58BE458@uiuc.edu> Message-ID: <44C733A1.9070201@sendu.me.uk> Chris Fields wrote: > >> It seems like the main problem with Node right now is that it has >> classification() and things like genus(). I propose pure Node method >> solutions to answer the questions classification() and genus() were >> implemented to answer, but in a better, cruft-free way. >> >> Bio::DB::Taxonomy::genbank anyone? > > Ach... You're compromising here; No, I don't think so. Let me explain... (another very long email, but with the same conclusion as above) > 1) Switch out Bio::Species with Node or Taxonomy; relocate other > information temporarily (Bio::Species, get/sets in Seq object, > SimpleValue). Leave Bio::Species in for the time being, but don't > bother making any additional changes to it. [...] > Hence Hilmar's suggestion to use a $seq->taxon() method to return a > Node/Taxonomy, and a $seq->species() would still return a > Bio::Species object. It's redundant, As I see it, the problem to be solved is this: a) A node should just be a node, holding only information about itself (but this can include information on who its parent is, and methods relating to getting its parents/children as new objects - but the data of its parents/children must never be stored on itself). b) Bio::Species isn't very good at its job; you can't ask reasonable taxonomic questions of it and get correct answers. c) We need to transition Bio::Species to something better - something that lets us do the same job as Bio::Species, but do it better. 
An important aspect of 'better' is that we can switch from the taxonomic information in a genbank file or similar to the information in a taxonomic database if we want certain taxonomic questions answered correctly. But also, we should be able to answer all questions with a good chance of a correct answer even without database access/installation. There are a variety of possible solutions. How can we decide which is best? What would a good solution be? The 'something better' we transition Bio::Species to will become the preferred (or at least de facto standard) way of dealing with taxonomic information in bioperl. This taxonomic module (or set of modules) must be able to model taxonomic information anywhere it is found - databases or genbank files or anything else. If it can't, it would be fundamentally flawed. d) We can immediately discount any solution that involves storing some taxonomic information outside of the tax module. If we find ourselves putting lineage data in a genbank file in SimpleValue objects or similar, we can be pretty sure we've used a poor solution to the problem. That would be a compromise. e) If the thing we transition Bio::Species to can't do everything Bio::Species did (doing it in a different and better way is fine of course), it's not suitable for transitioning to (this is why Node needed all the cruft added to it before it was a suitable candidate). If it /can/ do everything Bio::Species did, there would be no harm immediately making Bio::Species inherit from the new tax module, reimplementing Bio::Species as necessary but making no API change. So any solution that would /require/ $seq->taxon() and $seq->species() wouldn't be a good one, and would be a compromise. 
But we do want to get rid of Bio::Species eventually, so I'm not saying we shouldn't have a $seq->taxon() or similar, only that either method would give you the same type of object with the same methods available ($seq->taxon->isa('tax module') && ($seq->species->isa('Bio::Species') && $seq->species->isa('tax module')). I see 2 possible solutions to the problem. What should 'tax module' be?: 1) Bio::Taxonomy or other similar class that is a container of multiple nodes. Naively this makes logical sense since one of the jobs Bio::Species has is to store a lineage, and a lineage is best represented as a set of Nodes. So let's have a single object with all our Nodes in it. Problems: Bio::Taxonomy itself, as currently written, is fundamentally flawed. It requires that you know the ranks and order of ranks of all your input nodes before you input them. It requires that all ranks have unique names. It doesn't handle ranks of 'no rank'. You can't have more than one lineage in an instance because you can't have two nodes with the same rank. If you don't know the ranks of your nodes (ie. genbank) there is no way to maintain the order of your lineage because there is no modelling of parent/child. I had planned to re-write it such that the rank-centric implementation was removed and we had parent/child implementation instead. But then there is nothing to stop you adding nodes that are disconnected from the others, creating a broken mess. Bio::Taxonomy::Tree might have been a little more suitable because it implements Bio::Tree::TreeI, but sadly it is also rank-centric and actually requires input of both Bio::Species and Bio::Taxonomy objects to its most useful methods. More important than issues with current implementations of node-container classes, such classes are unable to let us solve problem c) in a good way, and also leave us potentially storing in memory Node objects representing the same taxonomic node multiple times in different instances of the node-container. 
For problem c) if we were to switch from genbank nodes to database the solution is to delete all the nodes in the container and then get them all again from the database. What if you didn't even have a lineage-related question? You've just retrieved 10s of nodes from the database for no reason (and then store them), when all you wanted was accurate information on the node you were interested in. All in all, it's pretty horrible. Unsuitable implementations plus excess database retrieval plus massive waste of memory with duplicated nodes does not equal a good solution. 2) Bio::Taxonomy::Node. First, solve problem a) by getting rid of methods binomial(), species(), genus(), sub_species(), variant(), organelle(), classification() and show_all(). Except for organelle() which doesn't belong in taxonomy, all of these Bio::Species 'questions' can still be answered by Node - just not in a single method call. I outlined how to answer them in the previous post. For backward compatibility make Bio::Species a Node and implement the suggested way of answering the questions the proper 'Node' way under those methods. Problems: Well, those questions can't actually be answered by Node if the starting point was genbank data or manually created Nodes. The solution is clean and simple: Bio::DB::Taxonomy::genbank or perhaps better named Bio::DB::Taxonomy::list (because it makes a taxonomy database from an ordered list of names - I don't see anything inherently wrong or ugly with that). Then everything magically just works. We get all the power to ask all our questions that Node has already when working with the ncbi database, but we get it when working with genbank data. We suffer none of the problems of a node-container class. We can easily switch databases on the fly. What's not to like? 
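
The list-backed database idea in the message above can be sketched in plain Perl. This is only an illustration under stated assumptions: the packages My::TaxDB::List and My::TaxNode and the methods parent_name() and lineage_names() are made up for the example and are not BioPerl classes, nor the implementation being proposed; only db_handle() echoes a name used in the thread. The sketch assumes that names within one lineage are unique, since parents are keyed by name. It shows a node that stores nothing but its own name, builds its lineage on demand through whatever database it currently points at, and can be pointed at a different database without being recreated.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for a "list" taxonomy database:
# a tiny in-memory lookup built from an ordered list of names.
package My::TaxDB::List;

sub new {
    my ($class, @names) = @_;    # names ordered root -> leaf
    my %parent_of;
    for my $i (1 .. $#names) {
        $parent_of{ $names[$i] } = $names[ $i - 1 ];
    }
    return bless { parent_of => \%parent_of }, $class;
}

# Return the parent's name, or undef for the root and unknown names.
sub parent_name {
    my ($self, $name) = @_;
    return $self->{parent_of}{$name};
}

# Hypothetical node: holds only its own name plus a database handle.
package My::TaxNode;

sub new {
    my ($class, %args) = @_;
    return bless { name => $args{name}, db => $args{db_handle} }, $class;
}

sub name { $_[0]{name} }

# Get/set the database used to answer questions about other nodes.
sub db_handle {
    my ($self, $db) = @_;
    $self->{db} = $db if defined $db;
    return $self->{db};
}

# The parent is looked up on demand and returned as a new object;
# nothing about it is stored on this node.
sub parent {
    my ($self) = @_;
    my $pname = $self->{db}->parent_name( $self->{name} );
    return defined $pname
        ? My::TaxNode->new( name => $pname, db_handle => $self->{db} )
        : undef;
}

# Walk up to the root, returning names ordered root -> leaf.
sub lineage_names {
    my ($self) = @_;
    my @names = ( $self->name );
    my $node  = $self;
    while ( my $p = $node->parent ) {
        unshift @names, $p->name;
        $node = $p;
    }
    return @names;
}

package main;

# Lineage as it might appear, unranked, in a sequence file.
my $file_db = My::TaxDB::List->new(qw(Eukaryota Metazoa Chordata Homo));
my $node    = My::TaxNode->new( name => 'Homo', db_handle => $file_db );
print join( '; ', $node->lineage_names ), "\n";

# Swap in a different (equally made-up) source of lineage data.
my $other_db = My::TaxDB::List->new(
    qw(Eukaryota Opisthokonta Metazoa Chordata Mammalia Homo) );
$node->db_handle($other_db);
print join( '; ', $node->lineage_names ), "\n";
```

The node object is the same before and after the swap; only the answers to lineage questions change, which is the property argued for in the message. A real implementation would key nodes by ID rather than by name.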
From bix at sendu.me.uk Wed Jul 26 06:00:01 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Wed, 26 Jul 2006 11:00:01 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net> References: <000001c6b017$873176a0$15327e82@pyrimidine> <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net> Message-ID: <44C73D21.3010301@sendu.me.uk> Hilmar Lapp wrote: > Instead, create something like > > # return a Bio::Taxonomy::Node: > my $taxon = $seq->taxon(); Yes, but $seq->species() would also > # alternative approach: return a lineage (taxonomy) > # this would be Bio::TaxonomyI compliant > my $lineage = $seq->lineage(); I've since come to the conclusion that anything Taxonomy-ish would be inappropriate - see recent post. > The former would require the lineage (and organelle for completeness) > information to be either easily (though not necessarily directly) > accessible through the node, or added as annotation. That specifically is the main problem with Node as it is now. You shouldn't store information about the lineage (essentially information about other nodes) on the node object itself. Storing it as annotation on the Node or elsewhere is terrible: you lose all the power of Node and can no longer ask any lineage-related questions. There is no need for this split in functionality - when you don't have database access and just some genbank files, you can't answer any taxonomic questions involving lineage, vs. when you do have database access suddenly you can start doing useful things. My proposed solution is that bioperl's taxonomy model always lets you answer the same questions regardless of your source for taxonomic information - see recent post. 
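
The transition discussed in this exchange, where species() and taxon() hand back the same kind of object while the old accessor is phased out, can be sketched as follows. This is a minimal illustration, not BioPerl code: My::Node, My::Species and My::Seq are hypothetical stand-ins, and the wording and timing of the deprecation message are assumptions (the thread suggests documentation notes first and warnings on stderr in a later release).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical new-style taxonomy object.
package My::Node;

sub new {
    my ($class, %args) = @_;
    return bless { name => $args{name} }, $class;
}

sub name { $_[0]{name} }

# Legacy class kept as a thin subclass, so old isa() checks still pass
# while every method of the new class is available.
package My::Species;
our @ISA = ('My::Node');

# Hypothetical sequence object holding one taxonomy object.
package My::Seq;

sub new {
    my ($class, %args) = @_;
    return bless { taxon => $args{taxon} }, $class;
}

# New accessor.
sub taxon { $_[0]{taxon} }

# Old accessor: returns the very same object, with a notice on STDERR.
sub species {
    my ($self) = @_;
    warn "species() is deprecated; use taxon() instead\n";
    return $self->{taxon};
}

package main;

my $seq = My::Seq->new(
    taxon => My::Species->new( name => 'Homo sapiens' ) );

print $seq->taxon->name, "\n";
print ref( $seq->species ), "\n";
```

Because the legacy class inherits from the new one, existing callers of species() keep working unchanged and only see the warning, while new code can use taxon() from the start.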
From cjfields at uiuc.edu Wed Jul 26 08:16:29 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Wed, 26 Jul 2006 07:16:29 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C733A1.9070201@sendu.me.uk> References: <003301c6b000$203cc560$15327e82@pyrimidine> <44C65990.4080500@sendu.me.uk> <1748BD60-EFBE-4872-94A4-7CA3D58BE458@uiuc.edu> <44C733A1.9070201@sendu.me.uk> Message-ID: <0155C616-8E07-49C8-B4A1-2B77B4DE09F3@uiuc.edu> > ... > > I see 2 possible solutions to the problem. What should 'tax module' > be?: > > 1) Bio::Taxonomy or other similar class that is a container of > multiple > nodes. Naively this makes logical sense since one of the jobs > Bio::Species has is to store a lineage, and a lineage is best > represented as a set of Nodes. So let's have a single object with all > our Nodes in it. Problems: > > Bio::Taxonomy itself, as currently written, is fundamentally > flawed. It > requires that you know the ranks and order of ranks of all your input > nodes before you input them. It requires that all ranks have unique > names. It doesn't handle ranks of 'no rank'. You can't have more than > one lineage in an instance because you can't have two nodes with the > same rank. If you don't know the ranks of your nodes (ie. genbank) > there > is no way to maintain the order of your lineage because there is no > modelling of parent/child. > I had planned to re-write it such that the rank-centric implementation > was removed and we had parent/child implementation instead. But then > there is nothing to stop you adding nodes that are disconnected > from the > others, creating a broken mess. > > Bio::Taxonomy::Tree might have been a little more suitable because it > implements Bio::Tree::TreeI, but sadly it is also rank-centric and > actually requires input of both Bio::Species and Bio::Taxonomy objects > to its most useful methods. 
> > More important than issues with current implementations of > node-container classes, such classes are unable to let us solve > problem > c) in a good way, and also leave us potentially storing in memory Node > objects representing the same taxonomic node multiple times in > different > instances of the node-container. For problem c) if we were to switch > from genbank nodes to database the solution is to delete all the nodes > in the container and then get them all again from the database. > What if > you didn't even have a lineage-related question? You've just retrieved > 10s of nodes from the database for no reason (and then store them), > when > all you wanted was accurate information on the node you were > interested in. > > All in all, it's pretty horrible. Unsuitable implementations plus > excess > database retrieval plus massive waste of memory with duplicated nodes > does not equal a good solution. > > > 2) Bio::Taxonomy::Node. First, solve problem a) by getting rid of > methods binomial(), species(), genus(), sub_species(), > variant(), organelle(), classification() and show_all(). Except for > organelle() which doesn't belong in taxonomy, all of these > Bio::Species > 'questions' can still be answered by Node - just not in a single > method > call. I outlined how to answer them in the previous post. For backward > compatibility make Bio::Species a Node and implement the suggested way > of answering the questions the proper 'Node' way under those methods. > Problems: > > Well, those questions can't actually be answered by Node if the > starting > point was genbank data or manually created Nodes. The solution is > clean > and simple: Bio::DB::Taxonomy::genbank or perhaps better named > Bio::DB::Taxonomy::list (because it makes a taxonomy database from an > ordered list of names - I don't see anything inherently wrong or ugly > with that). Then everything magically just works. 
We get all the power > to ask all our questions that Node has already when working with the > ncbi database, but we get it when working with genbank data. We suffer > none of the problems of a node-container class. We can easily switch > databases on the fly.

That 'broken mess' (referring to Bio::Taxonomy) is up to the user. You could make it more stringent (i.e. only allow connected nodes, starting with a single initiating node then build from there), though I don't think that's necessary as most people would probably use some sort of factory to generate a taxonomy (a warning might be appropriate). You would have to watch out for potential circular structures. Have it do what you want. I believe the original intent of Taxonomy was to allow building a full-fledged taxonomic structure, so it should stay that way.

Sendu, you have to realize this is up to how you want to implement it. We're giving you the freedom to do what you want to Bio::Taxonomy. Of course, if we think you're off we'll reel you back in, but you seem to be on the right track. Realize that the only contentious issue here is that horrible lineage line in the GenBank file. We should have a way to rebuild it as it was from the original file (i.e. not rebuild it from scratch with DB lookups by default). However, you should also have the option to rebuild it from lookups (i.e. correctly), which you could do with a Taxonomy.

Note this Bio::Taxonomy method:

  classify

   Title   : classify
   Usage   : @obj[][0-1] = taxonomy->classify($species);
   Function: return a ranked classification
   Returns : @obj of taxa and ranks as word pairs separated by "@"
   Args    : Bio::Species object

As Bio::Species will be deprecated, you can use that method in a dual, sneaky way: 1) directly store the lineage information, 2) return the real one (DB lookups) if needed (i.e. if some flag is set, for instance). And, if a Bio::Species argument is used, do what the docs state (catch it early on with an if block and return within it).
Bio::Species, as used within genbank.pm, doesn't use Bio::Taxonomy in any way. I don't know if you even need to retain its original purpose here; you might be able to get away with changing the fundamental way this method works altogether. That's up to you. my 2c Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From bix at sendu.me.uk Wed Jul 26 08:49:05 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Wed, 26 Jul 2006 13:49:05 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <0155C616-8E07-49C8-B4A1-2B77B4DE09F3@uiuc.edu> References: <003301c6b000$203cc560$15327e82@pyrimidine> <44C65990.4080500@sendu.me.uk> <1748BD60-EFBE-4872-94A4-7CA3D58BE458@uiuc.edu> <44C733A1.9070201@sendu.me.uk> <0155C616-8E07-49C8-B4A1-2B77B4DE09F3@uiuc.edu> Message-ID: <44C764C1.9010804@sendu.me.uk> Chris Fields wrote: > We're giving you the freedom to do what you want to Bio::Taxonomy. I don't want to do anything with Bio::Taxonomy any more. I've already shown that it isn't suitable for the job. Regardless of how it is implemented, the entire idea of a class that contains Nodes isn't appropriate, for reasons already stated. > Realize that the only contentious issue here is > that horrible lineage line in the GenBank file. We should have a way to > rebuild it as it was from the original file (i.e. not rebuild it from > scratch with DB lookups by default). However, you should also have the > option to rebuild it from lookups (i.e. correctly), which you could do > with a Taxonomy. And I've already shown how rebuilding with a Taxonomy is very far from ideal, while switching db_handle on a Node would be perfect. Why are you now advocating Taxonomy when there is no reason to? 
> Note this Bio::Taxonomy method: > > classify > > Title : classify > Usage : @obj[][0-1] = taxonomy->classify($species); > Function: return a ranked classification > Returns : @obj of taxa and ranks as word pairs separated by "@" > Args : Bio::Species object Note that all this method does is let you combine a list of rank names with the classification array in a Bio::Species, spitting out some weird data structure. It is only of interest to Bio::Taxonomy::Tree. We're in the situation where we don't know the rank names corresponding to the classification array in a Bio::Species generated by genbank et al. So classify() is of zero value. > As Bio::Species will be deprecated, you can use that method in a dual, > sneaky way: 1) directly store the lineage information, No. Lineage information must be in the form of Nodes or you can't answer lineage-related taxonomic questions. > 2) return the real one (DB lookups) if needed Messy. Doing it with Node would be far superior. Again, Node works all the time, while Taxonomy would work badly or not at all some of the time. Rather than suggest ways of using Taxonomy, tell me what is wrong with my current Node plan. From cjfields at uiuc.edu Wed Jul 26 11:15:28 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Wed, 26 Jul 2006 10:15:28 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C764C1.9010804@sendu.me.uk> Message-ID: <002801c6b0c6$59279fa0$15327e82@pyrimidine> I advocate anything but Bio::Species that allows you the option to use lookups for correct taxonomic information and not guesswork (current Bio::Species). So, you could pretty much replace Species immediately with a DB-aware container object with simple get/sets. As of now, that would be that Node or Taxonomy. I have done this already, just haven't committed it yet. And, when I mentioned having freedom to do what you want with Bio::Taxonomy, that includes all of it (including Node, Tree, etc). 
We just want it to be reasonable and not 'duct tape' for the various Bio::Species mistakes of the past. I don't think the problem here is really that complicated (still, the only thing is the lineage stuff in a sequence file, right?). > > As Bio::Species will be deprecated, you can use that method in a dual, > > sneaky way: 1) directly store the lineage information, > > No. Lineage information must be in the form of Nodes or you can't answer > lineage-related taxonomic questions. You must have a way to store the 'horrible lineage information' data, as is, for those users who do not care about taxonomy and just want to convert seq streams. You shouldn't burden the everyday user with something that is pretty specialized, this being finding correct taxonomic information based on DB lookups for a particular reason (screening sequences, as Hilmar pointed out, was one possibility). I don't care how, but store lineage information as it appears in the file (scalar string) or in a simple data structure (array, maybe?) capable of retaining the information in some way. There are many many ways of doing this which I have previously pointed out; take your pick. Hilmar, in a previous post, told me to take a step back and contemplate a world w/o Bio::Species, where you would design a system capable of dealing with sequence file taxonomic data in a way that allows you to get correct tax information when needed via NCBI Taxonomy data, yet not sacrifice speed if you're just interested in converting sequences via SeqIO. Would you design a Bio::Species class, then? Would you attempt to spend time parsing out species and genus information, when the correct data is sitting on the NCBI server or in a local flatfile? No. You would retain the minimal data necessary in an object for reading and writing data, but have the >option< available to run a lookup. Therefore, Bio::Taxonomy::Node was born. A little prematurely, yes. Probably needed to bake a bit more... 
Anyway, we must eventually sever our reliance on Bio::Species in order to deprecate it, so the lineage information must be contained, as it appears in the file, somewhere else. And my point with the classify() Bio::Taxonomy method is not to use it as is; you could sneak in your own data if needed. It was an example of a possible way of containing the lineage data, but not meant to be an absolute way. It's up to you how you want to implement it. I think the classes that are currently in place are more than capable of handling the job. Hence my statement before that you are trying to get too many things going right out the starting gate. Start simply by replacing Bio::Species, then worry about other issues. If you think that a specialized class would work, fine, but IMHO I don't think it's absolutely necessary. I had proposed such a class before (more like a Bio::Species-like Tax object) but was shut down, and rightly so; it's unnecessarily complicated and 'contaminates' Bio::Taxonomy with extra unnecessary methods (classification(), genus(), and so on). My last proposal was to eventually strip out the unreliable taxonomic parsing in the various SeqIO modules and replace it with something simple, which seemed to be a consensus among us all. This has to do with Hilmar's post-apocalyptic vision of a Bio::Species-free world. That will eventually happen, and Bioperl will eventually switch over completely to Bio::Taxonomy::Whatever. And Bio::Species can join BPLite and other deprecated modules in the BioPerl Boot Hill. But, for now that can't happen. We all strive for the best information possible. However, you can't sacrifice the needs of other users, a majority whom probably care squat about taxonomy, with your (our) own needs. As I have repeatedly stated, simple is good. We can't just usurp the API for our own wishes w/o warning, so the change has to be gradual and Bio::Species must stick around for the time being. 
And we must make it optional to have DB lookups or the villagers will be storming the castle. Listen, Sendu. If you can wait a couple of weeks for further discussion then we can slog on with this. But right now I just don't have any more time for this, sorry. You can have the last word and I'll respond when I get back. Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Sendu Bala > Sent: Wednesday, July 26, 2006 7:49 AM > To: bioperl-l at lists.open-bio.org > Subject: Re: [Bioperl-l] Bio::*Taxonomy* changes > > Chris Fields wrote: > > We're giving you the freedom to do what you want to Bio::Taxonomy. > > I don't want to do anything with Bio::Taxonomy any more. I've already > shown that it isn't suitable for the job. Regardless of how it is > implemented, the entire idea of a class that contains Nodes isn't > appropriate, for reasons already stated. > > > > Realize that the only contentious issue here is > > that horrible lineage line in the GenBank file. We should have a way to > > rebuild it as it was from the original file (i.e. not rebuild it from > > scratch with DB lookups by default). However, you should also have the > > option to rebuild it from lookups (i.e. correctly), which you could do > > with a Taxonomy. > > And I've already shown how rebuilding with a Taxonomy is very far from > ideal, while switching db_handle on a Node would be perfect. Why are you > now advocating Taxonomy when there is no reason to? > > > > Note this Bio::Taxonomy method: > > > > classify > > > > Title : classify > > Usage : @obj[][0-1] = taxonomy->classify($species); > > Function: return a ranked classification > > Returns : @obj of taxa and ranks as word pairs separated by "@" > > Args : Bio::Species object > > Note that all this method does is let you combine a list of rank names > with the classification array in a Bio::Species, spitting out some weird > data structure. 
It is only of interest to Bio::Taxonomy::Tree. > We're in the situation where we don't know the rank names corresponding > to the classification array in a Bio::Species generated by genbank et > al. So classify() is of zero value. > > > > As Bio::Species will be deprecated, you can use that method in a dual, > > sneaky way: 1) directly store the lineage information, > > No. Lineage information must be in the form of Nodes or you can't answer > lineage-related taxonomic questions. > > > > 2) return the real one (DB lookups) if needed > > Messy. Doing it with Node would be far superior. > > > Again, Node works all the time, while Taxonomy would work badly or not > at all some of the time. Rather than suggest ways of using Taxonomy, > tell me what is wrong with my current Node plan. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From morissardj at gmail.com Wed Jul 26 10:59:54 2006 From: morissardj at gmail.com (Morissard =?utf-8?b?asOpcm9tZQ==?=) Date: Wed, 26 Jul 2006 14:59:54 +0000 (UTC) Subject: [Bioperl-l] Accessing TRANSFAC matrices References: <20060717145000.0aig7ymcmurk4wsk@webmail.embl.de> <44BEA9FB.1070009@utk.edu> Message-ID: Hi that may help you ? 
http://morissardjerome.free.fr/Data/files/matrices.zip From hlapp at gmx.net Wed Jul 26 11:36:32 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Wed, 26 Jul 2006 11:36:32 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C73D21.3010301@sendu.me.uk> References: <000001c6b017$873176a0$15327e82@pyrimidine> <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net> <44C73D21.3010301@sendu.me.uk> Message-ID: On Jul 26, 2006, at 6:00 AM, Sendu Bala wrote: > Hilmar Lapp wrote: >> Instead, create something like >> >> # return a Bio::Taxonomy::Node: >> my $taxon = $seq->taxon(); > > Yes, but $seq->species() would also $seq->species() would return a Bio::Species object which may not be more than a thin shell anymore around an implementation that delegates almost everything to a lineage object (Bio::Taxonomy). $seq->taxon() in contrast need not return such a backwards-compatible construct. > >> # alternative approach: return a lineage (taxonomy) >> # this would be Bio::TaxonomyI compliant >> my $lineage = $seq->lineage(); > > I've since come to the conclusion that anything Taxonomy-ish would be > inappropriate - see recent post. Not sure which one you mean, and please don't reference really long emails, you're asking a lot of other people to organize your thoughts for them. At any rate, my point is that if you only name it appropriately you can avoid misconceptions about what is being returned. The fact that it's confusing to return a taxonomy from a method called species() doesn't mean it's equally bad to return a lineage (which is a limited taxonomy) from a method called lineage(). > [...] > > My proposed solution is that bioperl's taxonomy model always lets you > answer the same questions regardless of your source for taxonomic > information - see recent post. See above ... And I'd rather see some code or API examples than extensive elaborations. 
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at gmx.net Wed Jul 26 11:38:50 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Wed, 26 Jul 2006 11:38:50 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C733A1.9070201@sendu.me.uk> References: <003301c6b000$203cc560$15327e82@pyrimidine> <44C65990.4080500@sendu.me.uk> <1748BD60-EFBE-4872-94A4-7CA3D58BE458@uiuc.edu> <44C733A1.9070201@sendu.me.uk> Message-ID: <3E7DE5A0-C094-4894-BAB9-03F62EA98A3D@gmx.net> On Jul 26, 2006, at 5:19 AM, Sendu Bala wrote: > Chris Fields wrote: >> >>> It seems like the main problem with Node right now is that it has >>> classification() and things like genus(). I propose pure Node method >>> solutions to answer the questions classification() and genus() were >>> implemented to answer, but in a better, cruft-free way. >>> >>> Bio::DB::Taxonomy::genbank anyone? >> >> Ach... You're compromising here; > > No, I don't think so. Let me explain... > (another very long email, but with the same conclusion as above) Sorry, can you summarize this in a few sentences? If you do want feedback from me you really need to be more concise. -hilmar > > >> 1) Switch out Bio::Species with Node or Taxonomy; relocate other >> information temporarily (Bio::Species, get/sets in Seq object, >> SimpleValue). Leave Bio::Species in for the time being, but don't >> bother making any additional changes to it. > [...] >> Hence Hilmar's suggestion to use a $seq->taxon() method to return a >> Node/Taxonomy, and a $seq->species() would still return a >> Bio::Species object. 
It's redundant, > > As I see it, the problem to be solved is this: > > a) A node should just be a node, holding only information about itself > (but this can include information on who its parent is, and methods > relating to getting its parents/children as new objects - but the data > of its parents/children must never be stored on itself). > > b) Bio::Species isn't very good at its job; you can't ask reasonable > taxonomic questions of it and get correct answers. > > c) We need to transition Bio::Species to something better - something > that lets us do the same job as Bio::Species, but do it better. An > important aspect of 'better' is that we can switch from the taxonomic > information in a genbank file or similar to the information in a > taxonomic database if we want certain taxonomic questions answered > correctly. But also, we should be able to answer all questions with a > good chance of a correct answer even without database access/ > installation. > > There are a variety of possible solutions. How can we decide which is > best? What would a good solution be? > > The 'something better' we transition Bio::Species to will become the > preferred (or at least de facto standard) way of dealing with > taxonomic > information in bioperl. This taxonomic module (or set of modules) must > be able to model taxonomic information anywhere it is found - > databases > or genbank files or anything else. If it can't, it would be > fundamentally flawed. > > d) We can immediately discount any solution that involves storing some > taxonomic information outside of the tax module. If we find ourselves > putting lineage data in a genbank file in SimpleValue objects or > similar, we can be pretty sure we've used a poor solution to the > problem. That would be a compromise. 
> > e) If the thing we transition Bio::Species to can't do everything > Bio::Species did (doing it in a different and better way is fine of > course), it's not suitable for transitioning to (this is why Node > needed > all the cruft added to it before it was a suitable candidate). If it > /can/ do everything Bio::Species did, there would be no harm > immediately > making Bio::Species inherit from the new tax module, reimplementing > Bio::Species as necessary but making no API change. So any solution > that > would /require/ $seq->taxon() and $seq->species() wouldn't be a good > one, and would be a compromise. But we do want to get rid of > Bio::Species eventually, so I'm not saying we shouldn't have a > $seq->taxon() or similar, only that either method would give you the > same type of object with the same methods available > ($seq->taxon->isa('tax module') && ($seq->species->isa('Bio::Species') > && $seq->species->isa('tax module')). > > > I see 2 possible solutions to the problem. What should 'tax module' > be?: > > 1) Bio::Taxonomy or other similar class that is a container of > multiple > nodes. Naively this makes logical sense since one of the jobs > Bio::Species has is to store a lineage, and a lineage is best > represented as a set of Nodes. So let's have a single object with all > our Nodes in it. Problems: > > Bio::Taxonomy itself, as currently written, is fundamentally > flawed. It > requires that you know the ranks and order of ranks of all your input > nodes before you input them. It requires that all ranks have unique > names. It doesn't handle ranks of 'no rank'. You can't have more than > one lineage in an instance because you can't have two nodes with the > same rank. If you don't know the ranks of your nodes (ie. genbank) > there > is no way to maintain the order of your lineage because there is no > modelling of parent/child. 
> I had planned to re-write it such that the rank-centric implementation > was removed and we had parent/child implementation instead. But then > there is nothing to stop you adding nodes that are disconnected > from the > others, creating a broken mess. > > Bio::Taxonomy::Tree might have been a little more suitable because it > implements Bio::Tree::TreeI, but sadly it is also rank-centric and > actually requires input of both Bio::Species and Bio::Taxonomy objects > to its most useful methods. > > More important than issues with current implementations of > node-container classes, such classes are unable to let us solve > problem > c) in a good way, and also leave us potentially storing in memory Node > objects representing the same taxonomic node multiple times in > different > instances of the node-container. For problem c) if we were to switch > from genbank nodes to database the solution is to delete all the nodes > in the container and then get them all again from the database. > What if > you didn't even have a lineage-related question? You've just retrieved > 10s of nodes from the database for no reason (and then store them), > when > all you wanted was accurate information on the node you were > interested in. > > All in all, it's pretty horrible. Unsuitable implementations plus > excess > database retrieval plus massive waste of memory with duplicated nodes > does not equal a good solution. > > > 2) Bio::Taxonomy::Node. First, solve problem a) by getting rid of > methods binomial(), species(), genus(), sub_species(), > variant(), organelle(), classification() and show_all(). Except for > organelle() which doesn't belong in taxonomy, all of these > Bio::Species > 'questions' can still be answered by Node - just not in a single > method > call. I outlined how to answer them in the previous post. For backward > compatibility make Bio::Species a Node and implement the suggested way > of answering the questions the proper 'Node' way under those methods. 
> Problems: > > Well, those questions can't actually be answered by Node if the > starting > point was genbank data or manually created Nodes. The solution is > clean > and simple: Bio::DB::Taxonomy::genbank or perhaps better named > Bio::DB::Taxonomy::list (because it makes a taxonomy database from an > ordered list of names - I don't see anything inherently wrong or ugly > with that). Then everything magically just works. We get all the power > to ask all our questions that Node has already when working with the > ncbi database, but we get it when working with genbank data. We suffer > none of the problems of a node-container class. We can easily switch > databases on the fly. > > What's not to like? > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From jay at jays.net Wed Jul 26 11:32:53 2006 From: jay at jays.net (Jay Hannah) Date: Wed, 26 Jul 2006 08:32:53 -0700 Subject: [Bioperl-l] Anyone else at OSCON right now? Message-ID: <44C78B25.80503@jays.net> Any other BioPerl'ers here in Portland for OSCON? I'd love to chat about your life w/ BioPerl. I'm here until Saturday morning. j http://oscon.kwiki.org/index.cgi?JayHannah From adamnkraut at gmail.com Wed Jul 26 10:32:42 2006 From: adamnkraut at gmail.com (Adam Kraut) Date: Wed, 26 Jul 2006 10:32:42 -0400 Subject: [Bioperl-l] Structure::IO In-Reply-To: <716af09c0607250347w24e0d8dbj674758049bc0a4e2@mail.gmail.com> References: <716af09c0607250347w24e0d8dbj674758049bc0a4e2@mail.gmail.com> Message-ID: <134ede0b0607260732u79f0dea2if8f4ea98a5e03524@mail.gmail.com> Hi bernd, Can you better explain what it is you want to do with pdb files? 
From your example it looks like you want to do something with each chain, but it is unclear what you want to do here: my @chains = $struc->chain($chain); With that said, I was never able to use Bio::Structure in the way that I wanted. I now use the MMTSB Perl libraries instead: http://mmtsb.scripps.edu/cgi-bin/tooldoc?perlpackages Specifically the Molecule module may be useful here. Regards, Adam On 7/25/06, Bernd Web wrote: > > Hi, > > Does someone have experience with Bio::Structure::IO? > The example III.9.1 from the bptutorial.pl covers most, but what is e.g. > the > chain() method of Bio::Structure::Entry doing? The POD states: > > Title : chain > Usage : @chains = $structure->chain($chain); > Function: Connects a (or a list of) Chain objects to a > Bio::Structure::Entry. > Returns : list of Bio::Structure::Residue objects > Args : One Residue or a reference to an array of Residue objects > > But in e.g > my $stream = Bio::Structure::IO->new(-file => $filename, > -format => 'pdb'); > while ( my $struc = $stream->next_structure() ) { > for my $chain ($struc->get_chains) { > my $chainid = $chain->id; > my @chains = $struc->chain($chain); > } > } > > I get Bio::Structure::Chain=HASH(0x9f1ab50). > > What is the function of the chain method and how to use it? > > Best regards, > bernd > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- Adam N. Kraut National Resource for Biomedical Supercomputing http://www.nrbsc.org/sb/ From bix at sendu.me.uk Wed Jul 26 12:11:25 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Wed, 26 Jul 2006 17:11:25 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <002801c6b0c6$59279fa0$15327e82@pyrimidine> References: <002801c6b0c6$59279fa0$15327e82@pyrimidine> Message-ID: <44C7942D.6050603@sendu.me.uk> Chris Fields wrote: >> No. 
>> Lineage information must be in the form of Nodes or you can't answer
>> lineage-related taxonomic questions.
>
> You must have a way to store the 'horrible lineage information' data, as is,
> for those users who do not care about taxonomy and just want to convert seq
> streams. You shouldn't burden the everyday user with something that is
> pretty specialized, this being finding correct taxonomic information based
> on DB lookups for a particular reason (screening sequences, as Hilmar
> pointed out, was one possibility).

I am certainly not requiring that anyone find 'correct taxonomic
information'. The whole reason I am backing my current proposal is that
it works equally well with or without access to NCBI's taxonomy
database. Your proposals work poorly without access to such.

> I don't care how, but store lineage information as it appears in the file
> (scalar string) or in a simple data structure (array, maybe?) capable of
> retaining the information in some way. There are many many ways of doing
> this which I have previously pointed out; take your pick.

I've taken my pick.

To set:
  my $db = new Bio::DB::Taxonomy(-source => 'list', -lineage => \@lineage);
  $node->db_handle($db);

To get:
  @lineage = map { $_->scientific_name } $node->get_Lineage_Nodes;

That is as simple as it is going to get in a world where we have 'pure'
Nodes or any other kind of pure taxonomic class.

If you want to hide the taxonomic complexity from end-users who want to
make and store their own lineage of their species without having to know
the details of how bioperl's taxonomy modules are supposed to work, tell
them to use Bio::Species:

To set:
  $species->classification(@lineage);

To get:
  @lineage = $species->classification;

Of course in this example I propose that behind the scenes Bio::Species
is a Bio::Taxonomy::Node and just implements classification() the pure
Node way, given above.
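
The idea of a database built from an ordered list of names can be sketched in plain Perl, independent of BioPerl. This is a minimal illustration under assumptions, not the proposed Bio::DB::Taxonomy implementation: the names add_lineage, lineage_of and lca_of are hypothetical, lineages are given most specific name first, and each taxon name is assumed to be unique across all lineages.

```perl
use strict;
use warnings;

# Hypothetical in-memory "list" taxonomy: each name maps to its parent name.
my %parent_of;

# Register an ordered lineage, most specific name first.
sub add_lineage {
    my @names = @_;
    return unless @names;
    for my $i (0 .. $#names - 1) {
        $parent_of{ $names[$i] } = $names[ $i + 1 ];
    }
    # The last name is the root; record it with no parent if it is new.
    $parent_of{ $names[-1] } = undef unless exists $parent_of{ $names[-1] };
    return;
}

# Walk the parent links to rebuild a lineage, most specific name first.
sub lineage_of {
    my ($name) = @_;
    my @lineage;
    while (defined $name) {
        push @lineage, $name;
        $name = $parent_of{$name};
    }
    return @lineage;
}

# Most recent common ancestor of two names, or undef if they share none.
sub lca_of {
    my ($x, $y) = @_;
    my %seen = map { $_ => 1 } lineage_of($x);
    for my $name (lineage_of($y)) {
        return $name if $seen{$name};
    }
    return undef;
}

add_lineage('Homo sapiens', 'Homo', 'Mammalia', 'Eukaryota');
add_lineage('Mus musculus', 'Mus',  'Mammalia', 'Eukaryota');

print join(', ', lineage_of('Homo sapiens')), "\n";   # Homo sapiens, Homo, Mammalia, Eukaryota
print lca_of('Homo sapiens', 'Mus musculus'), "\n";   # Mammalia
```

Because the lineage is held as parent links rather than as a flat array, the same walk answers both "what is the lineage?" and "what is the common ancestor?", whatever source the names originally came from.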
Let me make my requirement very clear: the solution must allow you to find the most recent common ancestor of two solution-objects without access to the NCBI taxonomy database, using exactly the same method call you would use if you /did/ have access to the NCBI taxonomy database. The method in question shouldn't need any special-case code depending on the presence or absence of NCBI taxonomy database. That's the litmus test. I'll tend to reject any solution that fails. From bix at sendu.me.uk Wed Jul 26 12:25:41 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Wed, 26 Jul 2006 17:25:41 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <3E7DE5A0-C094-4894-BAB9-03F62EA98A3D@gmx.net> References: <003301c6b000$203cc560$15327e82@pyrimidine> <44C65990.4080500@sendu.me.uk> <1748BD60-EFBE-4872-94A4-7CA3D58BE458@uiuc.edu> <44C733A1.9070201@sendu.me.uk> <3E7DE5A0-C094-4894-BAB9-03F62EA98A3D@gmx.net> Message-ID: <44C79785.6050705@sendu.me.uk> Hilmar Lapp wrote: > > On Jul 26, 2006, at 5:19 AM, Sendu Bala wrote: > >>>> It seems like the main problem with Node right now is that it has >>>> classification() and things like genus(). I propose pure Node method >>>> solutions to answer the questions classification() and genus() were >>>> implemented to answer, but in a better, cruft-free way. >>>> >>>> Bio::DB::Taxonomy::genbank anyone? > > Sorry, can you summarize this in a few sentences? If you do want > feedback from me you really need to be more concise. A bad solution-module stores any kind of taxonomic information outside of the solution-module or in an inconsistent form. By 'inconsistent' I mean, sometimes you store the name of a taxonomic rank with $node->node_name, other times you store it in an array or scalar held directly on the solution-module or elsewhere. Bio::Taxonomy specifically is not usable. 
Generally speaking, classes that are containers of multiple nodes are also inappropriate, because they result in excess database retrieval and excess storage of duplicated information amongst instances of such classes. Bio::Taxonomy::Node combined with Bio::DB::Taxonomy::list would probably be ideal. From cjfields at uiuc.edu Wed Jul 26 12:49:40 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Wed, 26 Jul 2006 11:49:40 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: Message-ID: <000001c6b0d3$7d936ec0$15327e82@pyrimidine> Hilmar, apologies ahead of time for not being too concise! It's my last hurrah on this thread. No, really! ... > > Yes, but $seq->species() would also > > $seq->species() would return a Bio::Species object which may not be > more than a thin shell anymore around an implementation that > delegates almost everything to a lineage object (Bio::Taxonomy). > > $seq->taxon() in contrast need not return such a backwards-compatible > construct. In genbank.pm _read_GenBank_Species (initial implementation, to switch out Bio::Species with Taxonomy/Node object): 1) Assign data to both Bio::Species (as currently implemented) and Bio::Taxonomy::Node (new way). 2) Assign organelle to Bio::Species and the Seq object get/set organelle(). 3) Assign lineage information to Bio::Species and as an array to the Seq object get/set lineage(). Replace the get/set above with your method of choice, just no Bio::Species. In genbank.pm write_seq() 1) if DB_lookup flag is defined, use $seq->taxon() to build lineage 2) If not, use $seq->lineage(). The fine details (how do you build the lineage?!?) can be worked out along the way. The wonders of CVS! The Taxonomy class used here could be returned using Hilmar's $seq->taxon() and Bio::Species can be returned via $seq->species(). Makes perfect sense! Separated! Nothing complicated about it. Nice and clean. And Bio::Species can eventually be shown the exit door. Elvis has left the building... 
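
The write_seq() branch described above might look roughly like the following. This is a sketch only: lineage_for_output, the layout of the record hash and the taxon_lookup callback are hypothetical stand-ins for illustration, not existing genbank.pm code, and the lookup here returns made-up data instead of querying a database.

```perl
use strict;
use warnings;

# Hypothetical helper: choose which lineage to write out for a record.
# $record is a plain hash standing in for a sequence object:
#   lineage      => array ref, names exactly as read from the file
#   taxid        => taxon id from the file, if any
#   taxon_lookup => optional code ref standing in for a database query
sub lineage_for_output {
    my ($record, %opt) = @_;

    if ($opt{db_lookup} && $record->{taxon_lookup}) {
        my @looked_up = $record->{taxon_lookup}->($record->{taxid});
        # Only use the lookup result if it actually returned something.
        return @looked_up if @looked_up;
    }

    # Default: reproduce the file's own lineage, no database needed.
    return @{ $record->{lineage} };
}

my $record = {
    taxid        => 9606,
    lineage      => ['Eukaryota', 'Mammalia', 'Homo'],
    # Fake lookup with fixed data, for illustration only.
    taxon_lookup => sub { return ('Eukaryota', 'Metazoa', 'Mammalia', 'Homo') },
};

print join('; ', lineage_for_output($record)), "\n";
print join('; ', lineage_for_output($record, db_lookup => 1)), "\n";
```

The point of the design is that the default path never touches a database, so plain format conversion stays fast, while the lookup remains an explicit opt-in.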
Organelle-specific sequence TaxIDs, as they refer to the organism and not the organelle, could be placed elsewhere, preferably somewhere more accessible such as $seq->organelle(). And lineage, similarly, could be placed in $seq->lineage(), which would store it as a raw string or as an array. There are many other ways I had pointed out (SimpleValue, Node, etc); I don't care, as long as we eventually sever the Bio::Species tumor from SeqIO. ... > ...And I'd rather see some code or API examples than > extensive elaborations. > > -hilmar Hilmar's right; working code does speaks louder than words. The energy spent in writing up full expositions is better spent elsewhere, hence: I need to get back to work! Wish I could contribute more. Chris From bix at sendu.me.uk Wed Jul 26 13:13:43 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Wed, 26 Jul 2006 18:13:43 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: References: <000001c6b017$873176a0$15327e82@pyrimidine> <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net> <44C73D21.3010301@sendu.me.uk> Message-ID: <44C7A2C7.2070100@sendu.me.uk> Hilmar Lapp wrote: > On Jul 26, 2006, at 6:00 AM, Sendu Bala wrote: > >> Hilmar Lapp wrote: >>> Instead, create something like >>> >>> # return a Bio::Taxonomy::Node: >>> my $taxon = $seq->taxon(); >> Yes, but $seq->species() would also > > $seq->species() would return a Bio::Species object which may not be > more than a thin shell anymore around an implementation that > delegates almost everything to a lineage object (Bio::Taxonomy). I actually forgot to finish that sentence. I was going to suggest Bio::Species isa Bio::Taxonomy::Node and would indeed delegate most of its implementation to Node. >>> # alternative approach: return a lineage (taxonomy) >>> # this would be Bio::TaxonomyI compliant >>> my $lineage = $seq->lineage(); >> I've since come to the conclusion that anything Taxonomy-ish would be >> inappropriate - see recent post. 
>
> The fact that it's confusing to return a taxonomy from a method called species()
> doesn't mean it's equally bad to return a lineage (which is a limited
> taxonomy) from a method called lineage().

You wouldn't need to though. If you want a lineage you could ask your
node for its lineage. There's no point in having a whole other class
that contains a node and all its ancestor nodes, when to get the
ancestors of a node all you have to do is $node->get_Lineage_Nodes().

>> My proposed solution is that bioperl's taxonomy model always lets you
>> answer the same questions regardless of your source for taxonomic
>> information - see recent post.
>
> See above ... And I'd rather see some code or API examples

The fine details of the following may be slightly off, but it's just to
provide an example. I'll use Test.pm syntax.

my @human = ('Homo sapiens', 'Homo', 'Mammalia', 'Eukaryota');
my @mouse = ('Mus musculus', 'Mus', 'Mammalia', 'Eukaryota');

Old way with Node
-----------------
my $h_node = new Bio::Taxonomy::Node(-classification => \@human);
my $m_node = new Bio::Taxonomy::Node(-classification => \@mouse);

@human = map { $_->scientific_name } $h_node->get_Lineage_Nodes;
ok @human, 0; # failure to work as expected
@human = $h_node->classification;
ok join(", ", @human), "Homo sapiens, Homo, Mammalia, Eukaryota";

my $lca = $h_node->get_LCA_Node($m_node);
ok $lca, undef; # failure to do anything useful because our lineage data
                # is in an array, not in nodes

# try again with entrez - must make brand new objects
my $db = new Bio::DB::Taxonomy(-source => 'entrez');
$h_node = $db->get_Taxonomy_Node(-name => 'Homo sapiens');
$m_node = $db->get_Taxonomy_Node(-name => 'Mus musculus');

@human = map { $_->scientific_name } $h_node->get_Lineage_Nodes;
ok join(", ", @human) eq "Homo sapiens, Homo, Homo/Pan/Gorilla group, Hominidae, ..."; # now it works!

$lca = $h_node->get_LCA_Node($m_node);
ok $lca->scientific_name, 'Mammalia'; # and now this works!

Old way with Bio::Species
-------------------------
# forget about it, Species has nothing like a get_LCA_Node()

Proposed way with Node
----------------------
my $db = new Bio::DB::Taxonomy(-source => 'list', -lineage => \@human);
my $h_node = $db->get_Taxonomy_Node(-name => 'Homo sapiens');
$db->add_lineage(@mouse); # or make a new db
my $m_node = $db->get_Taxonomy_Node(-name => 'Mus musculus');

@human = map { $_->scientific_name } $h_node->get_Lineage_Nodes;
ok join(", ", @human), "Homo sapiens, Homo, Mammalia, Eukaryota"; # works as expected

my $lca = $h_node->get_LCA_Node($m_node);
ok $lca->scientific_name, 'Mammalia'; # works first time

# try again with entrez - just change the db_handle
$h_node->db_handle(new Bio::DB::Taxonomy(-source => 'entrez'));

@human = map { $_->scientific_name } $h_node->get_Lineage_Nodes;
ok join(", ", @human) eq "Homo sapiens, Homo, Homo/Pan/Gorilla group, Hominidae, ...";

$lca = $h_node->get_LCA_Node($m_node);
ok $lca->scientific_name, 'Mammalia';

Proposed way with Bio::Species
------------------------------
# (Bio::Species isa Bio::Taxonomy::Node, implements its methods like
# above)
my $h_species = new Bio::Species(-classification => \@human);
my $m_species = new Bio::Species(-classification => \@mouse);

@human = map { $_->scientific_name } $h_species->get_Lineage_Nodes;
ok join(", ", @human), "Homo sapiens, Homo, Mammalia, Eukaryota";
@human = $h_species->classification;
ok join(", ", @human), "Homo sapiens, Homo, Mammalia, Eukaryota";

my $lca = $h_species->get_LCA_Node($m_species);
ok $lca->scientific_name, 'Mammalia';

# trying again with entrez behaves as per proposed Node, above

From angshu96 at gmail.com Wed Jul 26 13:15:35 2006
From: angshu96 at gmail.com (Angshu Kar)
Date: Wed, 26 Jul 2006 12:15:35 -0500
Subject: [Bioperl-l] WUBLASTP parsing problem
Message-ID:

Hi,

Does WU-BLASTP have something to do with the length of the sequence names
(or the sequence names)?
What is happening here is that I use FASTA-format proteins to build the
BLAST report (I do a distributed blastp). But when I parse the report
(using bioperl), the query column remains empty for some results, as in:

          328857  6.6e-135
          325331  6.3e-114
          325329  1.0e-113
          325332  1.7e-113
          325330  2.7e-113
          . . .

while for some it's perfect, as in:

  267750  280003  7.5e-301
  267750  348279  7.5e-301
  267750  345867  2.0e-300
  267750  251915  2.0e-300
  267750  346539  6.7e-300
  . . .

Some of my sequences are as:

IMGA|AC159872_38.1 hypothetical protein AC159872.12 35121-35051 H EGN_Mt050401 20060209 TIGR 1671.m00013
mrsciilhnmivederdtyaqrwtefeqpggngsstpqpystelrdpdvhhklqtdlvkh
iwikfgmyrd*

And part of the blastp result (the one where I'm facing the issue) is as:

                                                                     Smallest
                                                                       Sum
                                                              High  Probability
Sequences producing High-scoring Segment Pairs:              Score  P(N)      N

gi|33333045|gb|AAQ11687.1| MADS box protein [Triticum aes...  1318  6.6e-135  1
gi|47681327|gb|AAT37484.1| MADS5 protein [Dendrocalamus l...  1120  6.3e-114  1
gi|47681331|gb|AAT37486.1| MADS7 protein [Dendrocalamus l...  1118  1.0e-113  1
gi|47681325|gb|AAT37483.1| MADS4 protein [Dendrocalamus l...  1116  1.7e-113  1
gi|47681329|gb|AAT37485.1| MADS6 protein [Dendrocalamus l...  1114  2.7e-113  1
gi|47681323|gb|AAT37482.1| MADS3 protein [Dendrocalamus l...  1114  2.7e-113  1
11674.m04224|LOC_Os08g41950|protein K-box region, putative     976  1.1e-98   1
gi|28630961|gb|AAO45877.1| MADS5 [Lolium perenne]              967  1.0e-97   1
gi|44888605|gb|AAS48129.1| AGAMOUS LIKE9-like protein [Ho...   964  2.1e-97   1
11674.m04223|LOC_Os08g41950|protein K-box region, putative     899  1.6e-90   1
gi|34979580|gb|AAQ83834.1| MADS box protein [Asparagus of...   875  5.8e-88   1

Could you please let me know if I'm missing something? Does the gi have
anything to do with this?

Thanking you,
Angshu

From cain.cshl at gmail.com Wed Jul 26 12:19:26 2006
From: cain.cshl at gmail.com (Scott Cain)
Date: Wed, 26 Jul 2006 12:19:26 -0400
Subject: [Bioperl-l] Installing staden io_lib on windows?
Message-ID: <1153930767.2632.5.camel@localhost.localdomain>

Hi all,

I'm wondering if anyone has tried to install Staden's io_lib on Windows,
and if so, how did it go? I am not much of a Windows person, but I've
tried to make it under cygwin only to get this message:

make all-recursive
make[1]: Entering directory `/home/scott/io_lib-1.9.2'
Making all in read
make[2]: Entering directory `/home/scott/io_lib-1.9.2/read'
if gcc -DHAVE_CONFIG_H -I. -I. -I.. -I.. -I../include -I../read -I../alf -I../abi -I../ctf -I../ztr -I../plain -I../scf -I../sff -I../exp_file -I../utils -I/usr/local/include -g -O2 -MT Read.o -MD -MP -MF ".deps/Read.Tpo" -c -o Read.o Read.c; \
then mv -f ".deps/Read.Tpo" ".deps/Read.Po"; else rm -f ".deps/Read.Tpo"; exit 1; fi
In file included from Read.h:43,
                 from Read.c:40:
../utils/os.h:346:2: #error Must define SP_BIG_ENDIAN or SP_LITTLE_ENDIAN in Makefile
make[2]: *** [Read.o] Error 1
make[2]: Leaving directory `/home/scott/io_lib-1.9.2/read'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/home/scott/io_lib-1.9.2'
make: *** [all] Error 2

I'm guessing there is a flag I can pass to the configure script to get
the endian-ness right, but I don't know (and I don't know if this is
just the beginning of a long, fruitless road :-)

I would like to use Bio::SCF (from CPAN) in conjunction with the trace
glyph in BioGraphics to view traces in GBrowse.

Thanks for any advice,
Scott

-- 
------------------------------------------------------------------------
Scott Cain, Ph. D.                                   cain.cshl at gmail.com
GMOD Coordinator (http://www.gmod.org/)                     216-392-3087
Cold Spring Harbor Laboratory

From morissardj at gmail.com Wed Jul 26 16:49:58 2006
From: morissardj at gmail.com (leverdeterre)
Date: Wed, 26 Jul 2006 13:49:58 -0700 (PDT)
Subject: [Bioperl-l] Accessing TRANSFAC matrices
In-Reply-To:
References: <44BEA9FB.1070009@utk.edu>
Message-ID: <5510746.post@talk.nabble.com>

I'm happy to help. I've made a page which may interest you:
http://morissardjerome.free.fr/Data/index.html
There is more information about the 397 matrix file (in the first 3
lines), and I've done all the logo files.
++
-- 
View this message in context: http://www.nabble.com/Re%3A-Accessing-TRANSFAC-matrices-tf1969467.html#a5510746
Sent from the Perl - Bioperl-L forum at Nabble.com.

From morissardj at gmail.com Wed Jul 26 17:15:19 2006
From: morissardj at gmail.com (leverdeterre)
Date: Wed, 26 Jul 2006 14:15:19 -0700 (PDT)
Subject: [Bioperl-l] Blast Output Parsing
In-Reply-To:
References:
Message-ID: <5511136.post@talk.nabble.com>

And without BioPerl, I think this may help you:
http://morissardjerome.free.fr/perl/blastparser.html
-- 
View this message in context: http://www.nabble.com/Blast-Output-Parsing-tf1974691.html#a5511136
Sent from the Perl - Bioperl-L forum at Nabble.com.

From osborne1 at optonline.net Wed Jul 26 17:00:50 2006
From: osborne1 at optonline.net (Brian Osborne)
Date: Wed, 26 Jul 2006 17:00:50 -0400
Subject: [Bioperl-l] SeqUtils
In-Reply-To: <716af09c0607250444y3e005fb1t4e20094fd8db993d@mail.gmail.com>
Message-ID:

Bernd,

That's easily done, changed both POD and code.

Brian O.

On 7/25/06 7:44 AM, "Bernd Web" wrote:

> Hi,
>
> With Bio::SeqUtils it may be nice to support 3 letter codes with
> capitals only, too.
> Now
>
> my $string = Bio::SeqUtils->seq3in($seqobj, 'METGLYTER');
>
> will give in $string->seq: XXX.
>
> Possibly the capitals in MetGlyTer are used to find the amino acids codes?
> If not maybe it's easy to implement case-insensitive, or all-capitals > for AA codes in SeqUtils? > > In addition about the POD: maybe it's better not use use $string since > Bio::SeqUtils->seq3in does not return a string but a Bio::PrimarySeq > object. > > Regards, > Bernd > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From osborne1 at optonline.net Wed Jul 26 17:24:34 2006 From: osborne1 at optonline.net (Brian Osborne) Date: Wed, 26 Jul 2006 17:24:34 -0400 Subject: [Bioperl-l] Structure::IO In-Reply-To: <716af09c0607250347w24e0d8dbj674758049bc0a4e2@mail.gmail.com> Message-ID: Bernd, I'm not following your question. The POD in the latest Bio::Structure::Entry shows: =head2 chain() Title : chain Usage : @chains = $structure->chain($chain); Function: Connects a Chain or a list of Chain objects to a Bio::Structure::Entry. Returns : List of Bio::Structure::Chain objects Args : A Chain or a reference to an array of Chain objects =cut Which is not what you've copied and pasted. What version of Bioperl do you use? Brian O. On 7/25/06 6:47 AM, "Bernd Web" wrote: > Hi, > > Does someone have experience with Bio::Structure::IO? > The example III.9.1 from the bptutorial.pl covers most, but what is e.g. the > chain() method of Bio::Structure::Entry doing? The POD states: > > Title : chain > Usage : @chains = $structure->chain($chain); > Function: Connects a (or a list of) Chain objects to a Bio::Structure::Entry. > Returns : list of Bio::Structure::Residue objects > Args : One Residue or a reference to an array of Residue objects > > But in e.g > my $stream = Bio::Structure::IO->new(-file => $filename, > -format => 'pdb'); > while ( my $struc = $stream->next_structure() ) { > for my $chain ($struc->get_chains) { > my $chainid = $chain->id; > my @chains = $struc->chain($chain); > } > } > > I get Bio::Structure::Chain=HASH(0x9f1ab50). 
> > What is the function of the chain method and how to use it? > > Best regards, > bernd > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From hlapp at gmx.net Thu Jul 27 01:06:52 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 27 Jul 2006 01:06:52 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C7A2C7.2070100@sendu.me.uk> References: <000001c6b017$873176a0$15327e82@pyrimidine> <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net> <44C73D21.3010301@sendu.me.uk> <44C7A2C7.2070100@sendu.me.uk> Message-ID: <86D32350-0412-40AC-8C34-51DCD3064711@gmx.net> I think this looks like a great solution. You could also name Bio::DB::Taxonomy::list as Bio::DB::Taxonomy::inmemory because it really isn't much else than an in-memory database (of limited content if you populate it from flat-file sequence annotation). The only reservation I have is that you'd have methods on Node that don't really operate on the node instance but rather operate on the taxonomy (database) behind the scenes. That's what I would have used Bio::Taxonomy for, not so much as a container than as a class with (conceptually) 'static' methods corresponding to those that are now in Node, like get_Lineage_Nodes(). They would optionally accept a db_handle too, or use a default one set as an attribute. However, leaving/having these methods on Node really isn't such a big deal and I'm sure would even be preferred by many people for the sake of simplicity. So overall I think you should just go ahead. -hilmar On Jul 26, 2006, at 1:13 PM, Sendu Bala wrote: > > The fine details of the following may be slightly off, but it's > just to > provide an example. I'll use Test.pm syntax. > > my @human = qw('Homo sapiens' Homo Mammalia Eukaryota); > my @mouse = qw('Mus musculus' Mus Mammalia Eukaryota); > > > [...] 
> Proposed way with Node > ---------------------- > > my $db = new Bio::DB::Taxonomy(-source => 'list', -lineage => @human); > my $h_node = $db->get_Taxonomy_Node(-name => 'Homo sapiens'); > $db->add_lineage(@mouse); # or make a new db > my $m_node = $db->get_Taxonomy_Node(-name => 'Mus musculus'); > > @human = map { $_->scientific_name } $h_node->get_Lineage_Nodes; > ok join(", ", @human), "Homo sapiens, Homo, Mammalia, Eukaryota"; > # works as expected > > my $lca = $h_node->get_LCA_Node($m_node); > ok $lca->scientific_name, 'Mammalia'; # works first time > > # try again with entrez - just change the db_handle > $h_node->db_handle(new Bio:DB::Taxonomy(-source => 'entrez'); > > @human = map { $_->scientific_name } $h_node->get_Lineage_Nodes; > ok join(", ", @human) eq "Homo sapiens, Homo, Homo/Pan/Gorilla group, > Hominidae, ..."; > > $lca = $h_node->get_LCA_Node($m_node); > ok $lca->scientific_name, 'Mammalia'; > > [...] -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From bix at sendu.me.uk Thu Jul 27 03:07:22 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 27 Jul 2006 08:07:22 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <86D32350-0412-40AC-8C34-51DCD3064711@gmx.net> References: <000001c6b017$873176a0$15327e82@pyrimidine> <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net> <44C73D21.3010301@sendu.me.uk> <44C7A2C7.2070100@sendu.me.uk> <86D32350-0412-40AC-8C34-51DCD3064711@gmx.net> Message-ID: <44C8662A.3080904@sendu.me.uk> Hilmar Lapp wrote: > The only reservation I have is that you'd have methods on Node that > don't really operate on the node instance but rather operate on the > taxonomy (database) behind the scenes. 
That's what I would have used > Bio::Taxonomy for, not so much as a container than as a class with > (conceptually) 'static' methods corresponding to those that are now > in Node, like get_Lineage_Nodes(). Yes, I had the same reservation. But it somehow seemed reasonable for me to ask a node for its lineage, though I draw the line at having a method like get_node('rank_name'). That's the only thing Bio::Taxonomy would have been good for, so it's a trade off between some nice methods and the problems inherent in a node-container class. Though, perhaps we almost have the best of both worlds, since the database is effectively a container without the problems: $node->db_handle->get_Taxonomy_Node(-rank 'rank_name', -lineage_of => $node); ? > So overall I think you should just go ahead. Great, will do. From maximilianh at gmail.com Thu Jul 27 04:56:44 2006 From: maximilianh at gmail.com (Maximilian Haeussler) Date: Thu, 27 Jul 2006 10:56:44 +0200 Subject: [Bioperl-l] TRANSFAC matrices, open acces Message-ID: <76f031ae0607270156lde64ff9m590257d98899fe64@mail.gmail.com> Actually, the fact that the transfac matrices are belonging to a company is quite inconvenient for biologists and bioinformatics analyses working in this field. There are some projects to annotate cis-sequences in regular intervals by volunteers and put the data into the public domain, one of them is the oreganno database http://www.oreganno.org/. Its first annotation jamboree will be held in Gent at the end of this year. If you're interested in cis-sequences, want to meet others that are and are willing to contribute some annotation efforts, don't hestitate to come to gent, it's conveniently placed in the middle of europe and registration costs almost nothing. 
http://www.dmbr.ugent.be/bioit/contents/regcreative/

One day, hopefully, journals will oblige authors to put their sequences into GenBank in a common format, but as long as regulation is not seen as an important part of genome annotation, a lot of manual annotation will have to be done.

cheers
max

> On 26/07/06, leverdeterre wrote:
> >
> > I'm happy to help. I've made a page which may interest you:
> > http://morissardjerome.free.fr/Data/index.html
> >
> > There is more information about the 397 matrix files (in the first 3 lines),
> > and I've made all the logo files.
> >
> > ++
> > --
> > View this message in context: http://www.nabble.com/Re%3A-Accessing-TRANSFAC-matrices-tf1969467.html#a5510746
> > Sent from the Perl - Bioperl-L forum at Nabble.com.
> >

--
Maximilian Haeussler, CNRS/INRA Gif-sur-Yvette, France
tel: +33 6 12 82 76 16
skype: maximilianhaeussler

From morissardj at gmail.com Thu Jul 27 05:10:19 2006 From: morissardj at gmail.com (leverdeterre) Date: Thu, 27 Jul 2006 02:10:19 -0700 (PDT) Subject: [Bioperl-l] Accessing TRANSFAC matrices In-Reply-To: <5510746.post@talk.nabble.com> References: <44BEA9FB.1070009@utk.edu> <5510746.post@talk.nabble.com> Message-ID: <5517747.post@talk.nabble.com>

Sorry, I removed all this data because it is the property of TRANSFAC.

http://www.gene-regulation.com/pub/databases/transfac/doc/misc.html

The TRANSFAC® database is free for users from non-profit organizations only. Users from commercial enterprises have to license the TRANSFAC® database and accompanying programs.

--
View this message in context: http://www.nabble.com/Re%3A-Accessing-TRANSFAC-matrices-tf1969467.html#a5517747
Sent from the Perl - Bioperl-L forum at Nabble.com.
From maximilianh at gmail.com Thu Jul 27 04:44:47 2006 From: maximilianh at gmail.com (Maximilian Haeussler) Date: Thu, 27 Jul 2006 10:44:47 +0200 Subject: [Bioperl-l] Accessing TRANSFAC matrices In-Reply-To: <5510746.post@talk.nabble.com> References: <44BEA9FB.1070009@utk.edu> <5510746.post@talk.nabble.com> Message-ID: <76f031ae0607270144of6ff9cbtbd9f3045bbc4e6e1@mail.gmail.com> I'm pretty sure that you are not allowed to distribute these matrices: http://www.gene-regulation.com/pub/databases/transfac/doc/misc.html [well...but if you don't care and biobase doesn't complain... actually anyone can scrape the matrices from the website with wget.] max On 26/07/06, leverdeterre wrote: > > i'm happy for helping you > i'have done a page whitch can interrest you > http://morissardjerome.free.fr/Data/index.html > > there is more information about the 397 matrix file ( in the 3 first line) > and i'm done all the logo file . > > ++ > -- > View this message in context: http://www.nabble.com/Re%3A-Accessing-TRANSFAC-matrices-tf1969467.html#a5510746 > Sent from the Perl - Bioperl-L forum at Nabble.com. > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > From bix at sendu.me.uk Thu Jul 27 05:55:01 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 27 Jul 2006 10:55:01 +0100 Subject: [Bioperl-l] TRANSFAC matrices, open acces In-Reply-To: <76f031ae0607270156lde64ff9m590257d98899fe64@mail.gmail.com> References: <76f031ae0607270156lde64ff9m590257d98899fe64@mail.gmail.com> Message-ID: <44C88D75.7040102@sendu.me.uk> Maximilian Haeussler wrote: > Actually, the fact that the transfac matrices are belonging to a > company is quite inconvenient for biologists and bioinformatics > analyses working in this field. The public version is adequate though. It would certainly be useful to have Bioperl access to transfac and other regulation databases. 
I'll probably write some suitable modules if no one beats me to it. From sdavis2 at mail.nih.gov Thu Jul 27 07:43:09 2006 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Thu, 27 Jul 2006 07:43:09 -0400 Subject: [Bioperl-l] TRANSFAC matrices, open acces In-Reply-To: <44C88D75.7040102@sendu.me.uk> Message-ID: On 7/27/06 5:55 AM, "Sendu Bala" wrote: > Maximilian Haeussler wrote: >> Actually, the fact that the transfac matrices are belonging to a >> company is quite inconvenient for biologists and bioinformatics >> analyses working in this field. > > The public version is adequate though. It would certainly be useful to > have Bioperl access to transfac and other regulation databases. I'll > probably write some suitable modules if no one beats me to it. I haven't used it in a while, but the TFBS family of modules are, if I recall correctly, bioperl-compatible, though not part of bioperl. In any case, for those who aren't aware, it might be worth a quick look: http://forkhead.cgb.ki.se/TFBS/ Sean From bix at sendu.me.uk Thu Jul 27 08:01:03 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 27 Jul 2006 13:01:03 +0100 Subject: [Bioperl-l] TRANSFAC matrices, open acces In-Reply-To: References: Message-ID: <44C8AAFF.6060100@sendu.me.uk> Sean Davis wrote: > > On 7/27/06 5:55 AM, "Sendu Bala" wrote: > >> Maximilian Haeussler wrote: >>> Actually, the fact that the transfac matrices are belonging to a >>> company is quite inconvenient for biologists and bioinformatics >>> analyses working in this field. > >> The public version is adequate though. It would certainly be useful to >> have Bioperl access to transfac and other regulation databases. I'll >> probably write some suitable modules if no one beats me to it. > > I haven't used it in a while, but the TFBS family of modules are, if I > recall correctly, bioperl-compatible, though not part of bioperl. In any > case, for those who aren't aware, it might be worth a quick look: Yes. 
It only has online access to Transfac though, and the inheritance and returned objects are TFBS specific so you miss out on whatever goodness there may be in the rest of bioperl. Still, recommended to use if you want programmatic access to Transfac matrices right now. From bernd.web at gmail.com Thu Jul 27 06:14:13 2006 From: bernd.web at gmail.com (Bernd Web) Date: Thu, 27 Jul 2006 12:14:13 +0200 Subject: [Bioperl-l] Structure::IO In-Reply-To: References: <716af09c0607250347w24e0d8dbj674758049bc0a4e2@mail.gmail.com> Message-ID: <716af09c0607270314u4e2b1eb8y6c1b87f5b3abd8e1@mail.gmail.com> Hi Thanks for your notes. The text I pasted comes from http://doc.bioperl.org/releases/bioperl-1.5.1/ but indeed Entry.pm (v1.25 2006/07/04) shows a different POD. I am trying to get annotation out of PDB. ID is not a problem, but I would like to have the HEADER and possibly comment fields to a (FastA) description line, but how? Bio::Structure::Entry v.1.25 does not list the annotation method in the POD anymore (due to a missing empty line before =head). $struc->annotation still exists; I can get the keys but not the values with $struc->annotation($struc->seqres) (Can't locate object method "get_Annotations" via package "Bio::PrimarySeq"). (Example script attached). The POD states: annotation: $obj->annotation($seq_obj). So I thought of a PrimarySeq object to pass to annotation. The PrimarySeq object ($struc->seqres) does not contain a description: $struc->seqres->desc is uninitialized. Is it possible to get annotation from header/comments fields with Bio::Structure? Best regards, Bernd On 7/26/06, Brian Osborne wrote: > Bernd, > > I'm not following your question. The POD in the latest Bio::Structure::Entry > shows: > > =head2 chain() > > Title : chain > Usage : @chains = $structure->chain($chain); > Function: Connects a Chain or a list of Chain objects to a > Bio::Structure::Entry. 
> Returns : List of Bio::Structure::Chain objects > Args : A Chain or a reference to an array of Chain objects > > =cut > > Which is not what you've copied and pasted. What version of Bioperl do you > use? > > Brian O. > > > > On 7/25/06 6:47 AM, "Bernd Web" wrote: > > > Hi, > > > > Does someone have experience with Bio::Structure::IO? > > The example III.9.1 from the bptutorial.pl covers most, but what is e.g. the > > chain() method of Bio::Structure::Entry doing? The POD states: > > > > Title : chain > > Usage : @chains = $structure->chain($chain); > > Function: Connects a (or a list of) Chain objects to a Bio::Structure::Entry. > > Returns : list of Bio::Structure::Residue objects > > Args : One Residue or a reference to an array of Residue objects > > > > But in e.g > > my $stream = Bio::Structure::IO->new(-file => $filename, > > -format => 'pdb'); > > while ( my $struc = $stream->next_structure() ) { > > for my $chain ($struc->get_chains) { > > my $chainid = $chain->id; > > my @chains = $struc->chain($chain); > > } > > } > > > > I get Bio::Structure::Chain=HASH(0x9f1ab50). > > > > What is the function of the chain method and how to use it? 
> >
> > Best regards,
> > bernd
>
-------------- next part --------------
#!/usr/bin/perl -w
use warnings;
use strict;
use Bio::Structure::IO;

my $filename = $ARGV[0];
my $stream = Bio::Structure::IO->new( -file   => $filename,
                                      -format => 'pdb');

while ( my $struc = $stream->next_structure() ) {
    print "SEQRES DESC: ", $struc->seqres->desc, "\n";
    print join(" ", keys %{$struc->annotation($struc->seqres)}), "\n";
    print join(" ", keys %{$struc->annotation()}), "\n";
    print join(" ", values %{$struc->annotation()}), "\n";                 #(partly) works
    print join(" ", values %{$struc->annotation($struc->seqres)}), "\n";   #does not work
}

From bix at sendu.me.uk Thu Jul 27 09:31:54 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 27 Jul 2006 14:31:54 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <86D32350-0412-40AC-8C34-51DCD3064711@gmx.net> References: <000001c6b017$873176a0$15327e82@pyrimidine> <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net> <44C73D21.3010301@sendu.me.uk> <44C7A2C7.2070100@sendu.me.uk> <86D32350-0412-40AC-8C34-51DCD3064711@gmx.net> Message-ID: <44C8C04A.8070504@sendu.me.uk>

Hilmar Lapp wrote:
>
> So overall I think you should just go ahead.

One last suggestion for discussion:

It may be appropriate to rename Bio::Taxonomy::Node to clarify that Node has no particular reliance on or association with Bio::Taxonomy or the other modules in Bio/Taxonomy/.

How about calling it Bio::Taxon?

It is more obvious what to expect from something called 'Bio::Taxon' when you know that it is the new 'Bio::Species': like Bio::Species but for any taxon. It also makes the class 'top-level' which I think most people are happier using; seems like things in sub-directories are more for advanced users.
From hlapp at gmx.net Thu Jul 27 09:44:25 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 27 Jul 2006 09:44:25 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C8C04A.8070504@sendu.me.uk> References: <000001c6b017$873176a0$15327e82@pyrimidine> <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net> <44C73D21.3010301@sendu.me.uk> <44C7A2C7.2070100@sendu.me.uk> <86D32350-0412-40AC-8C34-51DCD3064711@gmx.net> <44C8C04A.8070504@sendu.me.uk> Message-ID: <077A86AB-174D-426E-B65F-0954A72A6019@gmx.net>

I don't think the top-level or sub-directory matters at all and I don't want anybody to get used to the notion that that may imply anything (except possibly better thought-out structure for the sub-directory level). For instance RichSeq is what all rich annotation sequence format parsers return, yet it is in a sub-directory.

I don't have any real objection to Bio::Taxon though if that's what you'd like to name it - although, what will happen to the Bio::Taxonomy hierarchy then? Phased out?

-hilmar

On Jul 27, 2006, at 9:31 AM, Sendu Bala wrote:
> Hilmar Lapp wrote:
>>
>> So overall I think you should just go ahead.
>
> One last suggestion for discussion:
>
> It may be appropriate to rename Bio::Taxonomy::Node to clarify that
> Node has no particular reliance on or association with Bio::Taxonomy or
> the other modules in Bio/Taxonomy/.
>
> How about calling it Bio::Taxon?
>
> It is more obvious what to expect from something called 'Bio::Taxon'
> when you know that it is the new 'Bio::Species': like Bio::Species but
> for any taxon. It also makes the class 'top-level' which I think most
> people are happier using; seems like things in sub-directories are more
> for advanced users.
> _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From cjfields at uiuc.edu Thu Jul 27 09:48:32 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 27 Jul 2006 08:48:32 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C8662A.3080904@sendu.me.uk> Message-ID: <002a01c6b183$59779880$15327e82@pyrimidine> Sounds good to me; agree with Hilmar's suggestion of 'in_memory' as well, but it's your choice. Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Sendu Bala > Sent: Thursday, July 27, 2006 2:07 AM > To: bioperl-l at bioperl.org > Subject: Re: [Bioperl-l] Bio::*Taxonomy* changes > > Hilmar Lapp wrote: > > The only reservation I have is that you'd have methods on Node that > > don't really operate on the node instance but rather operate on the > > taxonomy (database) behind the scenes. That's what I would have used > > Bio::Taxonomy for, not so much as a container than as a class with > > (conceptually) 'static' methods corresponding to those that are now > > in Node, like get_Lineage_Nodes(). > > Yes, I had the same reservation. But it somehow seemed reasonable for me > to ask a node for its lineage, though I draw the line at having a method > like get_node('rank_name'). That's the only thing Bio::Taxonomy would > have been good for, so it's a trade off between some nice methods and > the problems inherent in a node-container class. > > Though, perhaps we almost have the best of both worlds, since the > database is effectively a container without the problems: > $node->db_handle->get_Taxonomy_Node(-rank 'rank_name', > -lineage_of => $node); ? 
> > > > So overall I think you should just go ahead. > > Great, will do. > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From osborne1 at optonline.net Thu Jul 27 09:44:33 2006 From: osborne1 at optonline.net (Brian Osborne) Date: Thu, 27 Jul 2006 09:44:33 -0400 Subject: [Bioperl-l] Structure::IO In-Reply-To: <716af09c0607270314u4e2b1eb8y6c1b87f5b3abd8e1@mail.gmail.com> Message-ID: Bernd, I'll need to take a look a closer look at the POD but from your description it seems it's wrong, or certainly incomplete. To get the HEADER line you'll do something like: my $stream = Bio::Structure::IO->new(-file => $filename, -format => 'pdb'); my $struc = $stream->next_structure(); my $anncoll = $struc->annotation; my @headers = $anncoll->get_Annotations('header'); This implies that all these top-level annotations are associated with the entry, not with the chains. I don't use Bio::Structure so don't assume this is true for all annotations. There are 2 ways to explore this further. One is to look at t/StructIO.t or other tests, useful examples are frequently found in the tests. The other is to run your script in the debugger: >perl -d pdb.pl 1CAM.pdb By examining the variables your script creates using the "x" command you get to see exactly where strings are stored and what the names of the keys are, this is how I found the HEADER line. Type "h" for the debugger's Help. Brian O. On 7/27/06 6:14 AM, "Bernd Web" wrote: > Hi > > Thanks for your notes. The text I pasted comes from > http://doc.bioperl.org/releases/bioperl-1.5.1/ but indeed Entry.pm > (v1.25 2006/07/04) shows a different POD. > > I am trying to get annotation out of PDB. ID is not a problem, but I > would like to have the HEADER and possibly comment fields to a (FastA) > description line, but how? 
> > Bio::Structure::Entry v.1.25 does not list the annotation method in > the POD anymore (due to a missing empty line before =head). > $struc->annotation still exists; I can get the keys but not the values > with $struc->annotation($struc->seqres) (Can't locate object method > "get_Annotations" via package "Bio::PrimarySeq"). > (Example script attached). > > The POD states: annotation: $obj->annotation($seq_obj). So I thought > of a PrimarySeq object to pass to annotation. > > The PrimarySeq object ($struc->seqres) does not contain a description: > $struc->seqres->desc is uninitialized. > > Is it possible to get annotation from header/comments fields with > Bio::Structure? > > Best regards, > Bernd > > > On 7/26/06, Brian Osborne wrote: >> Bernd, >> >> I'm not following your question. The POD in the latest Bio::Structure::Entry >> shows: >> >> =head2 chain() >> >> Title : chain >> Usage : @chains = $structure->chain($chain); >> Function: Connects a Chain or a list of Chain objects to a >> Bio::Structure::Entry. >> Returns : List of Bio::Structure::Chain objects >> Args : A Chain or a reference to an array of Chain objects >> >> =cut >> >> Which is not what you've copied and pasted. What version of Bioperl do you >> use? >> >> Brian O. >> >> >> >> On 7/25/06 6:47 AM, "Bernd Web" wrote: >> >>> Hi, >>> >>> Does someone have experience with Bio::Structure::IO? >>> The example III.9.1 from the bptutorial.pl covers most, but what is e.g. the >>> chain() method of Bio::Structure::Entry doing? The POD states: >>> >>> Title : chain >>> Usage : @chains = $structure->chain($chain); >>> Function: Connects a (or a list of) Chain objects to a >>> Bio::Structure::Entry. 
>>> Returns : list of Bio::Structure::Residue objects >>> Args : One Residue or a reference to an array of Residue objects >>> >>> But in e.g >>> my $stream = Bio::Structure::IO->new(-file => $filename, >>> -format => 'pdb'); >>> while ( my $struc = $stream->next_structure() ) { >>> for my $chain ($struc->get_chains) { >>> my $chainid = $chain->id; >>> my @chains = $struc->chain($chain); >>> } >>> } >>> >>> I get Bio::Structure::Chain=HASH(0x9f1ab50). >>> >>> What is the function of the chain method and how to use it? >>> >>> Best regards, >>> bernd >>> _______________________________________________ >>> Bioperl-l mailing list >>> Bioperl-l at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> >> >> From aaron.j.mackey at gsk.com Thu Jul 27 08:54:05 2006 From: aaron.j.mackey at gsk.com (aaron.j.mackey at gsk.com) Date: Thu, 27 Jul 2006 08:54:05 -0400 Subject: [Bioperl-l] Installing staden io_lib on windows? In-Reply-To: <1153930767.2632.5.camel@localhost.localdomain> Message-ID: Hi Scott, > In file included from Read.h:43, > from Read.c:40: > ../utils/os.h:346:2: #error Must define SP_BIG_ENDIAN or > SP_LITTLE_ENDIAN in Makefile os.h has a bunch of #ifdef statements that check for platforms, and there isn't one for cygwin (but there is for MinGW) Try running configure with "--CFLAGS=-DSP_LITTLE_ENDIAN" or somesuch Also take a look at the MinGW section of os.h to see if there are others you will likely need (e.g. NOPIPE, NOLOCKF, etc) Alternatively, you may want to just edit the original os.h to duplicate the MinGW section with the appropriate compiler constant for CYGWIN (__CYGWIN__ I'm guessing, but don't really know for sure). 
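If it is unclear which of the two macros applies, Perl itself can report the byte order of the machine. The snippet below is only a minimal sketch and not part of io_lib: the helper name `endian_flag` is made up for illustration, and the printed command assumes that io_lib's configure script is autoconf-generated, in which case compiler flags are normally passed as a `CFLAGS=...` variable assignment rather than as a `--CFLAGS` option.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of io_lib or BioPerl): return the name of
# the macro that os.h asks for, based on the byte order of this machine.
sub endian_flag {
    # pack 'S' writes a 16-bit integer in native byte order; on a
    # little-endian machine the low-order byte (value 1) comes first.
    my $first_byte = unpack('C', pack('S', 1));
    return $first_byte == 1 ? 'SP_LITTLE_ENDIAN' : 'SP_BIG_ENDIAN';
}

my $flag = endian_flag();

# Assumption: configure accepts CFLAGS as a variable assignment.
# Note that this replaces the default flags (such as -g -O2).
print qq{./configure CFLAGS="-D$flag"\n};
```

x86 hardware is little-endian, so under Cygwin on a typical PC this should suggest the SP_LITTLE_ENDIAN variant. Other macros from the MinGW section of os.h would still have to be added by hand if they turn out to be needed.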
Good luck, -Aaron From bix at sendu.me.uk Thu Jul 27 10:06:23 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 27 Jul 2006 15:06:23 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <077A86AB-174D-426E-B65F-0954A72A6019@gmx.net> References: <000001c6b017$873176a0$15327e82@pyrimidine> <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net> <44C73D21.3010301@sendu.me.uk> <44C7A2C7.2070100@sendu.me.uk> <86D32350-0412-40AC-8C34-51DCD3064711@gmx.net> <44C8C04A.8070504@sendu.me.uk> <077A86AB-174D-426E-B65F-0954A72A6019@gmx.net> Message-ID: <44C8C85F.2010104@sendu.me.uk> Hilmar Lapp wrote: > I don't think the top-level or sub-directory matters at all and I don't > want anybody to get used to the notion that that may imply anything > (except possibly better thought-out structure for the sub-directory > level). For instance RichSeq is what all rich annotation sequence format > parsers return, yet it is in a sub-directory. Well, I'm not aware that I've ever used a RichSeq ;). But your point is taken. > I don't any real objection to Bio::Taxon though if that's what you'd > like to name it - although, what will happen to the Bio::Taxonomy > hierarchy then? Phased out? At the moment it seems to me that the Bio::Taxonomy modules (excluding Node) aren't really usable. Jason wrote a comment in t/TaxonTree.t which tests Taxon and Tree: ## I am pretty sure this module is going the way of the dodo bird so ## I am not sure how much work to put into fixing the tests/module FactoryI is strange (it isn't intended to work like any other Bioperl factory) and there are no implementers of it, while Taxonomy.pm itself would be redundant after my Node changes and has implementation issues, though it may make more sense to some people. My vote is phase out. What is the actual process involved in renaming a module in Bioperl? 
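As a rough illustration of what such a rename can look like in plain Perl (a generic pattern, not an established BioPerl procedure): the class moves to its new name, and the old name stays behind as a thin subclass that warns once and otherwise behaves identically. The package names `My::Taxon` and `My::Taxonomy::Node` and the toy `scientific_name` accessor are stand-ins invented for this sketch; they are not the real BioPerl classes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# New home of the class (hypothetical name).
package My::Taxon;

sub new {
    my ($class, %args) = @_;
    my $self = { name => $args{-name} };
    return bless $self, $class;
}

sub scientific_name {
    my ($self) = @_;
    return $self->{name};
}

# Old name, kept as a thin subclass so that existing scripts keep working.
package My::Taxonomy::Node;

our @ISA = ('My::Taxon');
my $warned = 0;

sub new {
    my ($class, @args) = @_;
    # Warn only once per run to avoid flooding the user's terminal.
    warn "My::Taxonomy::Node is deprecated; use My::Taxon instead\n"
        unless $warned++;
    return $class->SUPER::new(@args);
}

package main;

my $old_style = My::Taxonomy::Node->new(-name => 'Homo sapiens');
my $new_style = My::Taxon->new(-name => 'Mus musculus');

print $old_style->scientific_name, "\n";
print $new_style->scientific_name, "\n";
print "old-style object is also a My::Taxon\n" if $old_style->isa('My::Taxon');
```

Because the old class inherits everything from the new one, code that checks `isa` against the new name also accepts objects created through the old name, which makes a gradual migration easier.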
From hlapp at gmx.net Thu Jul 27 10:29:09 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 27 Jul 2006 10:29:09 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C8C85F.2010104@sendu.me.uk> References: <000001c6b017$873176a0$15327e82@pyrimidine> <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net> <44C73D21.3010301@sendu.me.uk> <44C7A2C7.2070100@sendu.me.uk> <86D32350-0412-40AC-8C34-51DCD3064711@gmx.net> <44C8C04A.8070504@sendu.me.uk> <077A86AB-174D-426E-B65F-0954A72A6019@gmx.net> <44C8C85F.2010104@sendu.me.uk> Message-ID: How do you mean 'process'? You create a new module, and then you deprecate the ones you're phasing out. If possible you rewrite the implementation to use the new module. Not sure this answers your question? -hilmar On Jul 27, 2006, at 10:06 AM, Sendu Bala wrote: > Hilmar Lapp wrote: >> I don't think the top-level or sub-directory matters at all and I >> don't >> want anybody to get used to the notion that that may imply anything >> (except possibly better thought-out structure for the sub-directory >> level). For instance RichSeq is what all rich annotation sequence >> format >> parsers return, yet it is in a sub-directory. > > Well, I'm not aware that I've ever used a RichSeq ;). But your > point is > taken. > > >> I don't any real objection to Bio::Taxon though if that's what you'd >> like to name it - although, what will happen to the Bio::Taxonomy >> hierarchy then? Phased out? > > At the moment it seems to me that the Bio::Taxonomy modules (excluding > Node) aren't really usable. 
Jason wrote a comment in t/TaxonTree.t > which > tests Taxon and Tree: > > ## I am pretty sure this module is going the way of the dodo bird so > ## I am not sure how much work to put into fixing the tests/module > > FactoryI is strange (it isn't intended to work like any other Bioperl > factory) and there are no implementers of it, while Taxonomy.pm itself > would be redundant after my Node changes and has implementation > issues, > though it may make more sense to some people. > > My vote is phase out. > > > What is the actual process involved in renaming a module in Bioperl? > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From cjfields at uiuc.edu Thu Jul 27 10:29:39 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 27 Jul 2006 09:29:39 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <077A86AB-174D-426E-B65F-0954A72A6019@gmx.net> Message-ID: <003101c6b189$17f5d2e0$15327e82@pyrimidine> I'll respond to both here: > Sendu Bala wrote: > > One last suggestion for discussion: > > It may be appropriate is to rename Bio::Taxonomy::Node to clarify that > Node has no particular reliance on or association with Bio::Taxonomy or > the other modules in Bio/Taxonomy/. > > How about calling it Bio::Taxon? > > It is more obvious what to expect from something called 'Bio::Taxon' > when you know that it is the new 'Bio::Species': like Bio::Species but > for any taxon. It also makes the class 'top-level' which I think most > people are happier using; seems like things in sub-directories are more > for advanced users. Hilmar explains the namespace issue with Bioperl more concisely below. 
You should still be able to use a Node in a Taxonomy, but then again you should also be able to use a Taxon in a Taxonomy as well (by definition, a Taxon is part of a Taxonomy as it is a taxonomic unit). The whole "looking at this from a biologist's perspective" thing again... http://en.wikipedia.org/wiki/Taxon BTW, what exactly is Bio::Taxonomy::Taxon used for? Looks like it is used more for building taxonomic trees than anything, so shouldn't it be moved to Bio::Tree::Taxon (that name isn't used)? Then you could use Bio::Taxonomy::Taxon for your purposes. See, the only concern I have with using the name Bio::Taxon is people confusing it with Bio::Taxonomy itself or with Bio::Taxonomy::Taxon. Though I agree that the name makes sense for what you want. > Hilmar Lapp wrote: > > I don't think the top-level or sub-directory matters at all and I > don't want anybody to get used to the notion that that may imply > anything (except possibly better thought-out structure for the sub- > directory level). For instance RichSeq is what all rich annotation > sequence format parsers return, yet it is in a sub-directory. > > I don't any real objection to Bio::Taxon though if that's what you'd > like to name it - although, what will happen to the Bio::Taxonomy > hierarchy then? Phased out? > > -hilmar I'm not sure how many people out there use Bio::Taxonomy. I think they use the tree-building modules in Bio::Tree more than anything. And there haven't been any panicked users protesting at the gates yet about the many posts for Bio::Taxonomy changes (well, except me, and 'I got better').
Chris From cjfields at uiuc.edu Thu Jul 27 10:54:06 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 27 Jul 2006 09:54:06 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C8C85F.2010104@sendu.me.uk> Message-ID: <003201c6b18c$829330e0$15327e82@pyrimidine> > > I don't any real objection to Bio::Taxon though if that's what you'd > > like to name it - although, what will happen to the Bio::Taxonomy > > hierarchy then? Phased out? > > At the moment it seems to me that the Bio::Taxonomy modules (excluding > Node) aren't really usable. Jason wrote a comment in t/TaxonTree.t which > tests Taxon and Tree: > > ## I am pretty sure this module is going the way of the dodo bird so > ## I am not sure how much work to put into fixing the tests/module > > FactoryI is strange (it isn't intended to work like any other Bioperl > factory) and there are no implementers of it, while Taxonomy.pm itself > would be redundant after my Node changes and has implementation issues, > though it may make more sense to some people. > > My vote is phase out. > > > What is the actual process involved in renaming a module in Bioperl? This is how many times the phrase "Bio::Taxonomy" is used in Bioperl in directory Bio (which should catch any namespace usage like Node, etc.):

Instances:  2   BP Module : Bio::DB::Taxonomy
Instances:  4   BP Module : Bio::DB::Taxonomy::entrez
Instances:  7   BP Module : Bio::DB::Taxonomy::flatfile
Instances:  1   BP Module : Bio::Expression::Platform
Instances:  1   BP Module : Bio::SeqIO::genbank
Instances: 22   BP Module : Bio::Taxonomy
Instances:  8   BP Module : Bio::Taxonomy::FactoryI
Instances: 17   BP Module : Bio::Taxonomy::Node
Instances: 20   BP Module : Bio::Taxonomy::Taxon
Instances: 39   BP Module : Bio::Taxonomy::Tree

Hmm, not much. Almost all hits are within Bio::DB::Taxonomy or Bio::Taxonomy. The SeqIO::genbank was my change BTW; just haven't tossed the code yet.
Therefore, the only one left that would be affected (outside of Bio::Taxonomy and Bio::DB::Taxonomy) is Allen Day's Bio::Expression::Platform class, which uses Bio::DB::Taxonomy::entrez to grab Nodes; that would just be changed over to whatever class you plan on using. And that class hasn't been documented at all outside the methods. Furthermore, judging by the mail list archives the Bio::Taxonomy modules had very little usage outside of Node. Jason mentioned on an old post that he could never get Bio::Taxonomy::Taxon/Tree to work and that Dan Kortschak had moved (Dan's last post was in 2003). Hence the test file comments. And you make a good point with Bio::Taxonomy::FactoryI. I agree, if the modules haven't served a useful purpose they should be phased out. Chris From cjfields at uiuc.edu Thu Jul 27 11:15:25 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 27 Jul 2006 10:15:25 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: Message-ID: <003301c6b18f$7d114000$15327e82@pyrimidine> Wow, we're doing a little bioperl spring cleaning here! I agree with Hilmar: create a new module (Bio::Taxon), which claims the namespace, and deprecate the old ones. How 'broken', exactly, is Bio::Taxonomy? The idea behind it seems just (container for Nodes) but maybe it should just be reconfigured, and all the classes in directory Bio/Taxonomy deprecated. Or should we start from scratch completely? Don't know if it has been attempted but it would be nice to have a way for building taxonomic trees from Node/Taxon information using a Taxonomy-like container object. I like the way NCBI does something along these lines with BLAST output now. BTW, thanks guys for a rousing discussion! 
Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Hilmar Lapp > Sent: Thursday, July 27, 2006 9:29 AM > To: Sendu Bala > Cc: bioperl-l at lists.open-bio.org > Subject: Re: [Bioperl-l] Bio::*Taxonomy* changes > > How do you mean 'process'? You create a new module, and then you > deprecate the ones you're phasing out. If possible you rewrite the > implementation to use the new module. > > Not sure this answers your question? > > -hilmar > > On Jul 27, 2006, at 10:06 AM, Sendu Bala wrote: > > > Hilmar Lapp wrote: > >> I don't think the top-level or sub-directory matters at all and I > >> don't > >> want anybody to get used to the notion that that may imply anything > >> (except possibly better thought-out structure for the sub-directory > >> level). For instance RichSeq is what all rich annotation sequence > >> format > >> parsers return, yet it is in a sub-directory. > > > > Well, I'm not aware that I've ever used a RichSeq ;). But your > > point is > > taken. > > > > > >> I don't any real objection to Bio::Taxon though if that's what you'd > >> like to name it - although, what will happen to the Bio::Taxonomy > >> hierarchy then? Phased out? > > > > At the moment it seems to me that the Bio::Taxonomy modules (excluding > > Node) aren't really usable. Jason wrote a comment in t/TaxonTree.t > > which > > tests Taxon and Tree: > > > > ## I am pretty sure this module is going the way of the dodo bird so > > ## I am not sure how much work to put into fixing the tests/module > > > > FactoryI is strange (it isn't intended to work like any other Bioperl > > factory) and there are no implementers of it, while Taxonomy.pm itself > > would be redundant after my Node changes and has implementation > > issues, > > though it may make more sense to some people. > > > > My vote is phase out. > > > > > > What is the actual process involved in renaming a module in Bioperl? 
From hlapp at gmx.net Thu Jul 27 11:29:04 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 27 Jul 2006 11:29:04 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <003101c6b189$17f5d2e0$15327e82@pyrimidine> References: <003101c6b189$17f5d2e0$15327e82@pyrimidine> Message-ID: On Jul 27, 2006, at 10:29 AM, Chris Fields wrote: > See, the only concern I have with using the name Bio::Taxon is people > confusing it with Bio::Taxonomy itself or with > Bio::Taxonomy::Taxon. Though > I agree that the name makes sense for what you want. I don't think Bio::Taxonomy is used a lot in earnest, if at all, so you could even test the waters by deprecating the modules right away: put warning statements there and see whether anybody complains about the cluttered terminal screens. If this goes into snapshot releases and release candidates leading up to 1.6 then they may be phased out right away. Unless anybody on the list has strong objections? Anybody using Bio::Taxonomy?
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From skirov at utk.edu Thu Jul 27 09:57:19 2006 From: skirov at utk.edu (skirov) Date: Thu, 27 Jul 2006 09:57:19 -0400 Subject: [Bioperl-l] TRANSFAC matrices, open acces Message-ID: <44E2E794@webmail.utk.edu> Just a quick note- Bio::Matrix::PSM::IO can parse matrix.dat if you can get it- and as far as I can tell this is not easy- you have to contact the company to get access and it is not clear what their conditions are. This is the reason I have decided not to maintain the transfac parser. Stefan >===== Original Message From Sendu Bala ===== >Sean Davis wrote: >> >> On 7/27/06 5:55 AM, "Sendu Bala" wrote: >> >>> Maximilian Haeussler wrote: >>> Actually, the fact that the transfac matrices are belonging to a >>> company is quite inconvenient for biologists and bioinformatics >>> analyses working in this field. > > >>> The public version is adequate though. It would certainly be useful to >>> have Bioperl access to transfac and other regulation databases. I'll >>> probably write some suitable modules if no one beats me to it. >> >> I haven't used it in a while, but the TFBS family of modules are, if I >> recall correctly, bioperl-compatible, though not part of bioperl. In any >> case, for those who aren't aware, it might be worth a quick look: > >Yes. It only has online access to Transfac though, and the inheritance >and returned objects are TFBS specific so you miss out on whatever >goodness there may be in the rest of bioperl. > >Still, recommended to use if you want programmatic access to Transfac >matrices right now. 
From bix at sendu.me.uk Thu Jul 27 12:30:38 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 27 Jul 2006 17:30:38 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: References: <000001c6b017$873176a0$15327e82@pyrimidine> <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net> <44C73D21.3010301@sendu.me.uk> <44C7A2C7.2070100@sendu.me.uk> <86D32350-0412-40AC-8C34-51DCD3064711@gmx.net> <44C8C04A.8070504@sendu.me.uk> <077A86AB-174D-426E-B65F-0954A72A6019@gmx.net> <44C8C85F.2010104@sendu.me.uk> Message-ID: <44C8EA2E.8030000@sendu.me.uk> Hilmar Lapp wrote: > How do you mean 'process'? You create a new module, and then you > deprecate the ones you're phasing out. If possible you rewrite the > implementation to use the new module. > > Not sure this answers your question? I guess. I was thinking of just making Bio::Taxonomy::Node isa Bio::Taxon and then simply removing all the code from Node, leaving just some perldoc that said it had been renamed? Or should there be some methods that issue a warning and then call SUPER? From hlapp at gmx.net Thu Jul 27 12:38:30 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 27 Jul 2006 12:38:30 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C8EA2E.8030000@sendu.me.uk> References: <000001c6b017$873176a0$15327e82@pyrimidine> <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net> <44C73D21.3010301@sendu.me.uk> <44C7A2C7.2070100@sendu.me.uk> <86D32350-0412-40AC-8C34-51DCD3064711@gmx.net> <44C8C04A.8070504@sendu.me.uk> <077A86AB-174D-426E-B65F-0954A72A6019@gmx.net> <44C8C85F.2010104@sendu.me.uk> <44C8EA2E.8030000@sendu.me.uk> Message-ID: <881BB312-DF1B-43F6-A38D-8B543738244F@gmx.net> That's what I said could be possible here, on much shorter notice than we'd usually do, due to the low usage.
Eventually deprecated modules should also be physically removed, so you want to prepare for that. (removing a module breaks scripts that used it; issuing a warning alerts to this being forthcoming.) -hilmar On Jul 27, 2006, at 12:30 PM, Sendu Bala wrote: > Hilmar Lapp wrote: >> How do you mean 'process'? You create a new module, and then you >> deprecate the ones you're phasing out. If possible you rewrite the >> implementation to use the new module. >> >> Not sure this answers your question? > > I guess. I was thinking of just making Bio::Taxonomy::Node isa > Bio::Taxon and then simply removing all the code from Node, leaving > just > some perldoc that said it had been renamed? > > Or should there be some methods that issue a warning and then call > SUPER? > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From sanges at biogem.it Thu Jul 27 12:37:05 2006 From: sanges at biogem.it (Remo Sanges) Date: Thu, 27 Jul 2006 18:37:05 +0200 Subject: [Bioperl-l] TRANSFAC matrices, open acces In-Reply-To: <44E2E794@webmail.utk.edu> References: <44E2E794@webmail.utk.edu> Message-ID: <44C8EBB1.5070709@biogem.it> Here is also my 2c on TFBS: skirov wrote: >Just a quick note- Bio::Matrix::PSM::IO can parse matrix.dat if you can get >it- and as far as I can tell this is not easy- you have to contact the company >to get access and it is not clear what their conditions are. This is the >reason I have decided not to maintain the transfac parser. 
>Stefan > > >>===== Original Message From Sendu Bala ===== >>Sean Davis wrote: >> >> >>>On 7/27/06 5:55 AM, "Sendu Bala" wrote: >>> >>> >>> >>>>Maximilian Haeussler wrote: >>>>Actually, the fact that the transfac matrices are belonging to a >>>>company is quite inconvenient for biologists and bioinformatics >>>>analyses working in this field. >>>> >>>> >>>>The public version is adequate though. It would certainly be useful to >>>>have Bioperl access to transfac and other regulation databases. I'll >>>>probably write some suitable modules if no one beats me to it. >>>> >>>> >>>I haven't used it in a while, but the TFBS family of modules are, if I >>>recall correctly, bioperl-compatible, though not part of bioperl. In any >>>case, for those who aren't aware, it might be worth a quick look: >>> >>> >>Yes. It only has online access to Transfac though >> TFBS::DB::LocalTRANSFAC - can parse local transfac matrices (matrix.dat) >>, and the inheritance >>and returned objects are TFBS specific so you miss out on whatever >>goodness there may be in the rest of bioperl. >> >> >> In TFBS there are modules which inherit from Bio::SeqFeature::Generic and Bio::Root::Root. See for example TFBS::Site. So probably it is not so bad.... Here is the link cut from Sean's e-mail: http://forkhead.cgb.ki.se/TFBS/ HTH Remo From osborne1 at optonline.net Thu Jul 27 12:49:26 2006 From: osborne1 at optonline.net (Brian Osborne) Date: Thu, 27 Jul 2006 12:49:26 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <881BB312-DF1B-43F6-A38D-8B543738244F@gmx.net> Message-ID: Sendu, And add the module name or names to the DEPRECATED file. Brian O. On 7/27/06 12:38 PM, "Hilmar Lapp" wrote: > Eventually deprecated modules should also be physically removed From MEC at stowers-institute.org Thu Jul 27 13:28:03 2006 From: MEC at stowers-institute.org (Cook, Malcolm) Date: Thu, 27 Jul 2006 12:28:03 -0500 Subject: [Bioperl-l] TRANSFAC matrices, open acces Message-ID: re: >Yes.
It only has online access to Transfac though, not quite true. It does support access to local transfac data files if you have them. --Malcolm From cjfields at uiuc.edu Thu Jul 27 13:45:28 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 27 Jul 2006 12:45:28 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <881BB312-DF1B-43F6-A38D-8B543738244F@gmx.net> Message-ID: <000301c6b1a4$73ef3fd0$15327e82@pyrimidine> Makes sense to me. From my previous post the only bioperl class that used it was Bio::Expression::Platform, and that only for grabbing Node objects from Bio::DB::Taxonomy::entrez (so, change it to use whatever object Bio::DB::Taxonomy returns). I couldn't find anything else in the core outside of the Bio::DB::Taxonomy and Bio::Taxonomy classes and tests that use them. There aren't even any scripts or examples. If you implement Bio::Root::RootI (and pretty much everything does), you could use warn() or deprecated() for these easily:

...
 Title   : warn
 Usage   : $object->warn("Warning message");
 Function: Places a warning. What happens now is down to the
           verbosity of the object (value of $obj->verbose)
             verbosity 0 or not set => small warning
             verbosity -1 => no warning
             verbosity 1 => warning with stack trace
             verbosity 2 => converts warnings into throw
...
 Title   : deprecated
 Usage   : $obj->deprecated("Method X is deprecated");
 Function: Prints a message about deprecation
           unless verbose is < 0 (which means be quiet)
 Returns : none
 Args    : Message string to print to STDERR
...

Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Hilmar Lapp > Sent: Thursday, July 27, 2006 11:39 AM > To: Sendu Bala > Cc: bioperl-l at bioperl.org > Subject: Re: [Bioperl-l] Bio::*Taxonomy* changes > > That's what I said could be possible here on much shorter notice that > we'd do usually due to the low usage.
> > Eventually deprecated modules should also be physically removed, so > you want to prepare for that. (removing a module breaks scripts that > used it; issuing a warning alerts to this being forthcoming.) > > -hilmar > > On Jul 27, 2006, at 12:30 PM, Sendu Bala wrote: > > > Hilmar Lapp wrote: > >> How do you mean 'process'? You create a new module, and then you > >> deprecate the ones you're phasing out. If possible you rewrite the > >> implementation to use the new module. > >> > >> Not sure this answers your question? > > > > I guess. I was thinking of just making Bio::Taxonomy::Node isa > > Bio::Taxon and then simply removing all the code from Node, leaving > > just > > some perldoc that said it had been renamed? > > > > Or should there be some methods that issue a warning and then call > > SUPER? > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From bix at sendu.me.uk Thu Jul 27 15:30:47 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 27 Jul 2006 20:30:47 +0100 Subject: [Bioperl-l] TRANSFAC matrices, open acces In-Reply-To: References: Message-ID: <44C91467.5050001@sendu.me.uk> Cook, Malcolm wrote: > re: > >> Yes. It only has online access to Transfac though, > > not quite true. It does support access to local transfac data files if > you have them. And to local Jaspar files. I wasn't clear, but I meant for the 'only' to modify 'online'. Ie. it doesn't give you access to any other online databases. 
From bix at sendu.me.uk Thu Jul 27 15:55:32 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 27 Jul 2006 20:55:32 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <003101c6b189$17f5d2e0$15327e82@pyrimidine> References: <003101c6b189$17f5d2e0$15327e82@pyrimidine> Message-ID: <44C91A34.1040406@sendu.me.uk> Chris Fields wrote: > BTW, what exactly is Bio::Taxonomy::Taxon used for? Looks like it is used > more for building taxonomic trees that anything, so shouldn't it be moved to > Bio::Tree:Taxon (that name isn't used)? Then you could use > Bio::Taxonomy::Taxon for your purposes. It actually seemed more like a possible replacement for Bio::Taxonomy::Node. Thanks to its Tree::NodeI implementation it has the big advantage over Bio::Taxonomy::Node that you access the lineage without a database. From the programmer's point of view it seemed more natural, being able to create nodes and add descendants. I decided against it because I felt the added complexity wasn't really worth it, and Bio::Taxonomy::Node had some of its own advantages. If this turns out to be the wrong choice, my Bio::Taxon can always be reimplemented to also implement Tree::NodeI in the future. > See, the only concern I have with using the name Bio::Taxon is people > confusing it with Bio::Taxonomy itself or with Bio::Taxonomy::Taxon. Though > I agree that the name makes sense for what you want. I don't think you'd confuse it directly with Bio::Taxonomy, but you could certainly waste some time thinking it was appropriate to stick Bio::Taxon objects in Bio::Taxonomy objects - theoretically it might work but ultimately you'd just be wasting your time. I'll make sure the docs in the Taxonomy modules steer people in the right direction. 
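The deprecation route floated in this thread (the old class becomes an empty subclass of the new one, warns, then calls SUPER) could look roughly like the sketch below. It is only an illustration under stated assumptions: My::Taxon and My::Taxonomy::Node are hypothetical stand-ins for Bio::Taxon and Bio::Taxonomy::Node, the name() accessor is invented for the example, and Perl's built-in warn() is used where BioPerl code would presumably go through the object's own warn() or deprecated() methods.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the new class (Bio::Taxon in the thread).
package My::Taxon;

sub new {
    my ($class, %args) = @_;
    return bless { name => $args{-name} }, $class;
}

# Illustrative accessor, not taken from any real BioPerl module.
sub name { return $_[0]->{name} }

# Hypothetical stand-in for the old class (Bio::Taxonomy::Node in the thread).
# It keeps no logic of its own: everything is inherited, and the
# constructor only adds a warning before delegating to the parent.
package My::Taxonomy::Node;

our @ISA = ('My::Taxon');

sub new {
    my ($class, @args) = @_;
    # Built-in warn(); an assumption standing in for BioPerl's own
    # warning/deprecation methods.
    warn "My::Taxonomy::Node is deprecated, use My::Taxon instead\n";
    return $class->SUPER::new(@args);
}

package main;

# Old-style calling code keeps working, but now prints a warning to STDERR.
my $node = My::Taxonomy::Node->new(-name => 'Homo sapiens');
print ref($node), " isa My::Taxon: ",
      ($node->isa('My::Taxon') ? "yes" : "no"), "\n";
print $node->name, "\n";
```

Because the old class still exists and still returns a working object, scripts that use it keep running until the module is physically removed; the warning is what alerts their authors that removal is coming.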
From bix at sendu.me.uk Thu Jul 27 16:18:06 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 27 Jul 2006 21:18:06 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <003301c6b18f$7d114000$15327e82@pyrimidine> References: <003301c6b18f$7d114000$15327e82@pyrimidine> Message-ID: <44C91F7E.2040000@sendu.me.uk> Chris Fields wrote: > How 'broken', exactly, is Bio::Taxonomy? It's certainly usable as-is, but there are some gotchas.

# It has an acknowledged weakness of not coping with multiple ranks of the same name (notably 'no rank').
# You can't have 2 nodes with the same rank (so can only build a single lineage, not a whole menagerie).
# You must supply a list of all your rank names correctly ordered before you can add any nodes (or trust that the default list is satisfactory - it won't be if you have just a single 'no rank').
# You simply don't need it if you have Bio::Taxonomy::Nodes with db_handle set, or Bio::Taxonomy::Taxons.

In my opinion, the burden is just too great for this ever to have been a 'fun' module to use. It was only required so that people could manually create their own Bio::Taxonomy::Nodes and form a lineage without a database. > Don't know if it has been attempted but it would be nice to have a way for > building taxonomic trees from Node/Taxon information using a Taxonomy-like > container object. I like the way NCBI does something along these lines with > BLAST output now. Not really sure what you mean. I don't think you'd require a container object to do any particular task. Can you clarify? From clarsen at vecna.com Thu Jul 27 15:59:50 2006 From: clarsen at vecna.com (Chris Larsen) Date: Thu, 27 Jul 2006 15:59:50 -0400 (EDT) Subject: [Bioperl-l] Working code Message-ID: <7263.70.106.6.26.1154030390.squirrel@mail.vecna.com> Hey gang, You said you wanted to see working code: ------------------------------------------- > ...And I'd rather see some code or API examples than > extensive elaborations.
> > -hilmar Hilmar's right; working code does speak louder than words. -Chris -------------------------------------------- So here's some: http://www.biohealthbase.org/GSearch/ We've just released v2 of the Bioinformatic Resource Center's website "Biohealthbase". Earlier I pointed out BHB v1 to the list; then we had implemented GBrowse on top of GUS 3. There was some data processing using BioPerl packages to generate well-formatted data for the Oracle instance. But new micro-organisms are added now, so we have Francisella, Mycobacterium, Microsporidia, Giardia, and Influenza. They are under GUS 3.5. We also now have some web-capable BLASTing under there (Please no spam!) And multiple sequence alignments and dendrograms are to come, using MUSCLE instead of ClustalW. Currently, a Bioperl I/O module accepts the output from BLAST and writes up some HTML, then our web app on another server displays the URL content. But we will improve on this model in v3 for MSA et al. since the requirements are different for multiple vs single alignments. Thanks again for the open source! Chris ---------------------------- Christopher Larsen, Ph.D. Senior Scientist Vecna Technologies, Inc. 5004 Lehigh Rd College Park, MD 20740-3821 e: clarsen at vecna.com ph: (240) 737-1625 f: (301) 699-3180 From skirov at utk.edu Thu Jul 27 09:56:45 2006 From: skirov at utk.edu (skirov) Date: Thu, 27 Jul 2006 09:56:45 -0400 Subject: [Bioperl-l] TRANSFAC matrices, open acces Message-ID: <44E2E5B9@webmail.utk.edu> Just a quick note- Bio::Matrix::PSM::IO can parse matrix.dat if you can get it- and as far as I can tell this is not easy- you have to contact the company to get access and it is not clear what their conditions are.
Stefan >===== Original Message From Sendu Bala ===== >Sean Davis wrote: >> >> On 7/27/06 5:55 AM, "Sendu Bala" wrote: >> >>> Maximilian Haeussler wrote: >>> Actually, the fact that the transfac matrices are belonging to a >>> company is quite inconvenient for biologists and bioinformatics >>> analyses working in this field. > > >>> The public version is adequate though. It would certainly be useful to >>> have Bioperl access to transfac and other regulation databases. I'll >>> probably write some suitable modules if no one beats me to it. >> >> I haven't used it in a while, but the TFBS family of modules are, if I >> recall correctly, bioperl-compatible, though not part of bioperl. In any >> case, for those who aren't aware, it might be worth a quick look: > >Yes. It only has online access to Transfac though, and the inheritance >and returned objects are TFBS specific so you miss out on whatever >goodness there may be in the rest of bioperl. > >Still, recommended to use if you want programmatic access to Transfac >matrices right now. >_______________________________________________ >Bioperl-l mailing list >Bioperl-l at lists.open-bio.org >http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Thu Jul 27 21:19:51 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 27 Jul 2006 20:19:51 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C91F7E.2040000@sendu.me.uk> References: <003301c6b18f$7d114000$15327e82@pyrimidine> <44C91F7E.2040000@sendu.me.uk> Message-ID: <3DAB9065-3633-4D50-B97E-41F2BB58C6EB@uiuc.edu> ... >> Don't know if it has been attempted but it would be nice to have a >> way for >> building taxonomic trees from Node/Taxon information using a >> Taxonomy-like >> container object. I like the way NCBI does something along these >> lines with >> BLAST output now. > > Not really sure what you mean. I don't think you'd require a container > object to do any particular task. Can you clarify? 
Let's say you start with a list of sequence IDs from a BLAST report and wanted to find the taxonomic relationship between the BLAST hits. NCBI does something similar to this in their last few BLAST output revisions from the CGI interface; they have a link which contains the organisms ranked taxonomically in various ways. There is probably a Bioperl-specific way of doing this but I haven't spent the effort yet working out how. No big deal, really. I have PLENTY else to work on. Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From R.Birnie at leeds.ac.uk Fri Jul 28 05:39:34 2006 From: R.Birnie at leeds.ac.uk (Richard Birnie) Date: Fri, 28 Jul 2006 10:39:34 +0100 Subject: [Bioperl-l] whole genome annotation Message-ID: Hello all, I'm just trying to familiarise myself with BioPerl and I'm a little overwhelmed by the sheer volume of information available on the wiki. I'm hoping someone can point in the right direction through the labyrinth. This may become a little longwinded but I'll try and get all the annoying newbie questions out of the way in one go. Let me try and explain what I'm aiming for. I have some CGH data downloaded from the Progenetix database (http://www.progenetix.de/~pgscripts/progenetix/Aboutprogenetix.html), this data is simplified to record simply gain/loss/amplification of whole chromosome bands at 862 band resolution to facilitate the combination of data from multiple different studies. What I'd like to be able to do is download a copy of the human genome sequence with annotation describing the locations of chromosome bands and preferably of known genes. I then want to be able to manipulate the genome data based on the CGH data to mimic deletions. 
The ultimate goal of this is to be able to feed the manipulated genome data into a program (metashark) that predicts the structure of metabolic networks based on genome annotation compared to a reference genome, in this case a complete 'normal' human genome, and see what effect that has on the metabolic pathways. I appreciate that is a bit vague but that's sort of my problem: I'm not a bioinformatician really, so I'm not sure of the details of what I want. I just happen to have a question to answer and bioperl seems the way to go (for this project and more generally). I've started looking at the HOWTOs and read the main bioperl tutorial. I also looked at the CGL comparative genomics library but I haven't penetrated far into that yet. I'm ok with basic perl although not much object oriented stuff. I don't really have much experience with handling sequence data on a whole genome scale either; a few genbank files for my favourite genes are fine, but I need some guidance to work on this scale. What I'm looking for is someone to give me a start. I'd greatly appreciate it if someone could spell out the general steps for downloading a complete copy of the human genome and its annotations (if this is even a feasible approach) and how to put it all together. Not actual code, just the general concept for each step and which tools from the bioperl set would be most appropriate for each step, so that I can focus on what I need to read about; even a little pseudo-code if I'm lucky. If I can get the genome data downloaded and set up properly I'll work out how to apply the CGH data to it myself. If example code for what I'm trying to describe is included somewhere, great; could someone point me to where? Thanks for your patience.
best regards, Richard Dr Richard Birnie Scientific Officer Section of Pathology and Tumour Biology Welcome Brenner Building, LIMM St James University Hospital Beckett St, Leeds, LS9 7TF Tel:0113 3438624 e-mail: r.birnie at leeds.ac.uk From sdavis2 at mail.nih.gov Fri Jul 28 07:59:17 2006 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Fri, 28 Jul 2006 07:59:17 -0400 Subject: [Bioperl-l] whole genome annotation In-Reply-To: References: Message-ID: <44C9FC15.3040503@mail.nih.gov> Richard Birnie wrote: > Hello all, > > I'm just trying to familiarise myself with BioPerl and I'm a little overwhelmed by the sheer volume of information available on the wiki. I'm hoping someone can point in the right direction through the labyrinth. This may become a little longwinded but I'll try and get all the annoying newbie questions out of the way in one go. > > Let me try and explain what I'm aiming for. I have some CGH data downloaded from the Progenetix database (http://www.progenetix.de/~pgscripts/progenetix/Aboutprogenetix.html), this data is simplified to record simply gain/loss/amplification of whole chromosome bands at 862 band resolution to facilitate the combination of data from multiple different studies. > > What I'd like to be able to do is download a copy of the human genome sequence with annotation describing the locations of chromosome bands and preferably of known genes. I then want to be able to manipulate the genome data based on the CGH data to mimic deletions. The ultimate goal of this is to be able to feed the manipulated genome data into a program (metashark) that predicts the structure of metabolic networks based on genome annotation compared to a reference genome, in this case a complete 'normal' human genome and see what effect that has on the metabolic pathways. > > I appreciate that is a bit vague but thats sort of my problem, I'm not a bioinformatician really so I'm not sue of the details of what I want. 
I just happen to have an question to answer and bioperl seems the way to go (for this project and more generally). I've started looking at the HOWTOs and read the main bioperl tutorial. I also looked at the CGL comparative genomics library but I haven't penetrated far into that yet. I'm ok with basic perl although not much object oriented stuff. I don't really have much experience with handling sequence data on a whole genome scale either, a few genbank files for my favourite genes is fine but I need some guidance to work on this scale. > > What I'm looking for is someone to give me a start. I'd greatly appreciate it if someone could spell out the general steps for downloading a complete copy of the human genome and its annotations (if this is even a feasible approach) and how to put it all together. Not actual code just the general concept for each step and which tools from the bioperl set would be most appropriate for each step so that I can focus what I need to read about, even a little pseudo-code if I'm lucky. If I can get the genome data downloaded and setup properly I'll work out how to apply the CGH data to it myself. > > If example code for what I'm trying to describe is included somewhere, great could someone point to where. Hi, Richard. Bioperl is good for many things, but for simply grabbing all the locations of human genes in the genome and chromosome band locations, I wouldn't use bioperl. It sounds to me like you are interested in getting the genes associated with each chromosomal band? If so, just download the cytoband.txt and refFlat.txt files from the UCSC genome browser site. cytoband.txt contains the base pair locations for each of the cytobands. refFlat.txt contains the base pair locations of "refseq" genes. It is then simply a matter of finding overlapping regions (genes with cytobands) to determine which genes are in which cytobands. Since the files are tab-delimited text, they are very easy to work with (in perl, excel, python, ...). 
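The overlap step described above can be sketched in plain Perl. This is a minimal sketch under stated assumptions, not code from the thread: it assumes cytoband.txt has the columns chrom, start, end, band name, stain; that the first six columns of refFlat.txt are gene name, accession, chrom, strand, txStart, txEnd; and that coordinates are 0-based and half-open. The sub names and the sample records are made up for illustration, so check the column order against the files you actually download.

```perl
use strict;
use warnings;

# Parse cytoband lines into { chrom => [ [start, end, band_name], ... ] }.
# Assumed columns: chrom, start, end, band name, stain.
sub parse_cytobands {
    my @lines = @_;
    my %bands;
    for my $line (@lines) {
        chomp $line;
        next if $line =~ /^\s*(?:#|$)/;    # skip comments and blank lines
        my ($chrom, $start, $end, $name) = split /\t/, $line;
        push @{ $bands{$chrom} }, [ $start, $end, $name ];
    }
    return \%bands;
}

# Names of the bands on $chrom that overlap the half-open interval
# [$start, $end). Two such intervals overlap when each one starts
# before the other ends.
sub bands_for_interval {
    my ($bands, $chrom, $start, $end) = @_;
    return map  { $_->[2] }
           grep { $_->[0] < $end && $start < $_->[1] }
           @{ $bands->{$chrom} || [] };
}

# Map each gene to the sorted list of bands it overlaps, e.g. "8q13.1".
# Assumed first six columns: gene, accession, chrom, strand, txStart, txEnd.
sub genes_to_bands {
    my ($bands, @lines) = @_;
    my %seen;
    for my $line (@lines) {
        chomp $line;
        next if $line =~ /^\s*(?:#|$)/;
        my ($gene, undef, $chrom, undef, $tx_start, $tx_end) = split /\t/, $line;
        (my $chr_label = $chrom) =~ s/^chr//;    # "chr8" -> "8"
        $seen{$gene} ||= {};                     # keep genes with no band hit
        $seen{$gene}{ $chr_label . $_ } = 1
            for bands_for_interval($bands, $chrom, $tx_start, $tx_end);
    }
    my %result;
    $result{$_} = [ sort keys %{ $seen{$_} } ] for keys %seen;
    return \%result;
}

# Made-up records for illustration only; these are not real coordinates.
my @cyto_lines = (
    "chr8\t0\t1000\tp11.1\tgneg",
    "chr8\t1000\t2500\tq13.1\tgpos50",
);
my @gene_lines = (
    "GENE_A\tNM_0001\tchr8\t+\t100\t400",
    "GENE_B\tNM_0002\tchr8\t-\t900\t1200",
);

my $bands = parse_cytobands(@cyto_lines);
my $genes = genes_to_bands($bands, @gene_lines);
print "$_\t@{ $genes->{$_} }\n" for sort keys %$genes;
```

With real files, pass the lines read from an opened filehandle (for example `parse_cytobands(<$fh>)`) instead of the sample arrays.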
Don't get me wrong--I really appreciate the power of bioperl, but in this case, your task lends itself to a simpler (and MUCH) faster approach. Sean From R.Birnie at leeds.ac.uk Fri Jul 28 08:21:46 2006 From: R.Birnie at leeds.ac.uk (Richard Birnie) Date: Fri, 28 Jul 2006 13:21:46 +0100 Subject: [Bioperl-l] whole genome annotation References: <44C9FC15.3040503@mail.nih.gov> Message-ID: -----Original Message----- From: Sean Davis [mailto:sdavis2 at mail.nih.gov] Sent: Fri 7/28/2006 12:59 To: Richard Birnie Cc: bioperl-l at lists.open-bio.org Subject: Re: [Bioperl-l] whole genome annotation Richard Birnie wrote: > Hello all, > > I'm just trying to familiarise myself with BioPerl and I'm a little overwhelmed by the sheer volume of information available on the wiki. I'm hoping someone can point in the right direction through the labyrinth. This may become a little longwinded but I'll try and get all the annoying newbie questions out of the way in one go. > > Let me try and explain what I'm aiming for. I have some CGH data downloaded from the Progenetix database (http://www.progenetix.de/~pgscripts/progenetix/Aboutprogenetix.html), this data is simplified to record simply gain/loss/amplification of whole chromosome bands at 862 band resolution to facilitate the combination of data from multiple different studies. > > What I'd like to be able to do is download a copy of the human genome sequence with annotation describing the locations of chromosome bands and preferably of known genes. I then want to be able to manipulate the genome data based on the CGH data to mimic deletions. The ultimate goal of this is to be able to feed the manipulated genome data into a program (metashark) that predicts the structure of metabolic networks based on genome annotation compared to a reference genome, in this case a complete 'normal' human genome and see what effect that has on the metabolic pathways. 
> > I appreciate that is a bit vague but thats sort of my problem, I'm not a bioinformatician really so I'm not sue of the details of what I want. I just happen to have an question to answer and bioperl seems the way to go (for this project and more generally). I've started looking at the HOWTOs and read the main bioperl tutorial. I also looked at the CGL comparative genomics library but I haven't penetrated far into that yet. I'm ok with basic perl although not much object oriented stuff. I don't really have much experience with handling sequence data on a whole genome scale either, a few genbank files for my favourite genes is fine but I need some guidance to work on this scale. > > What I'm looking for is someone to give me a start. I'd greatly appreciate it if someone could spell out the general steps for downloading a complete copy of the human genome and its annotations (if this is even a feasible approach) and how to put it all together. Not actual code just the general concept for each step and which tools from the bioperl set would be most appropriate for each step so that I can focus what I need to read about, even a little pseudo-code if I'm lucky. If I can get the genome data downloaded and setup properly I'll work out how to apply the CGH data to it myself. > > If example code for what I'm trying to describe is included somewhere, great could someone point to where. Hi, Richard. Bioperl is good for many things, but for simply grabbing all the locations of human genes in the genome and chromosome band locations, I wouldn't use bioperl. It sounds to me like you are interested in getting the genes associated with each chromosomal band? If so, just download the cytoband.txt and refFlat.txt files from the UCSC genome browser site. cytoband.txt contains the base pair locations for each of the cytobands. refFlat.txt contains the base pair locations of "refseq" genes. 
It is then simply a matter of finding overlapping regions (genes with cytobands) to determine which genes are in which cytobands. Since the files are tab-delimited text, they are very easy to work with (in perl, excel, python, ...). Don't get me wrong--I really appreciate the power of bioperl, but in this case, your task lends itself to a simpler (and MUCH) faster approach.

Sean

Thanks for the response Sean,

getting the genes associated with each band is certainly part of what I need and your suggestion will help with that. I did look at the UCSC site but as you know there is such a volume of info on there I didn't really know which files I needed.

However my main goal requires slightly more. What I want to be able to do is take the chromosomal band annotation info and compare that against the CGH data I have. From this I'd like to then be able to say "OK band 8q13.1 (or whatever) is deleted, so make a copy of the genome with the actual sequence associated with that band removed." Then I could feed both sequences into metashark, which predicts the structure of metabolic pathways based on genome annotation, and see what effect deleting that region of DNA has on the structure of the metabolic network. Knowing which genes are involved is necessary for identifying the important components within the region. Are there tools in Bioperl for making this comparison? It can probably be reduced to a straight comparison of data structures so I may just use regular perl for this part unless there is anything designed for purpose.

The thing I was struggling with was how to store and manipulate genomic sequence data in such quantities. Since this morning I've had a better look at the CGL library and associated datastore module; I think I can do it using these tools but I'm having a few dependency issues getting it installed right now. So I'll go back to wrestling with that.
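The band-removal step described in this message can be illustrated on a plain Perl string. This is only a sketch under stated assumptions: the sub name `delete_regions`, the band table and the toy sequence are hypothetical, regions are taken to be 0-based, half-open and non-overlapping, and a real chromosome would be read from a file rather than written as a literal.

```perl
use strict;
use warnings;

# Return a copy of $seq with each [start, end) region cut out.
# Regions are array refs of 0-based, half-open coordinates and are
# assumed not to overlap one another.
sub delete_regions {
    my ($seq, @regions) = @_;
    # Cut from the right-hand end first so that the coordinates of the
    # regions still to be removed are not shifted by earlier cuts.
    for my $region (sort { $b->[0] <=> $a->[0] } @regions) {
        my ($start, $end) = @$region;
        substr($seq, $start, $end - $start, '');
    }
    return $seq;
}

# Hypothetical band table and toy sequence, for illustration only.
my %band_coords = ('8q13.1' => [ 4, 8 ]);
my $chr8        = 'AAAACCCCGGGG';
my @deleted     = ('8q13.1');

my $mutant = delete_regions($chr8, map { $band_coords{$_} } @deleted);
print "normal: $chr8\nmutant: $mutant\n";
```

Because `$seq` is copied on entry, the original sequence is left untouched, so both the 'normal' and the modified version remain available for comparison.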
regards,
Richard

From valiente at lsi.upc.edu Fri Jul 28 08:10:19 2006
From: valiente at lsi.upc.edu (Gabriel Valiente)
Date: Fri, 28 Jul 2006 15:10:19 +0300
Subject: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields)
In-Reply-To:
References:
Message-ID:

>>> At the moment it seems to me that the Bio::Taxonomy modules
>>> (excluding
>>> Node) aren't really usable.

I've been using Bio::Taxonomy and Bio::DB::Taxonomy a lot, they are very useful modules. Whatever Bio::Taxon or Bio::Taxonomy::Taxon turns out to be, please do keep the Bio::DB::Taxonomy functionality.

BTW, does anybody know how to include branch lengths in Bio::DB::Taxonomy?

Thanks a lot,

Gabriel

From y.itan at ucl.ac.uk Fri Jul 28 08:07:32 2006
From: y.itan at ucl.ac.uk (Yuval Itan)
Date: Fri, 28 Jul 2006 13:07:32 +0100
Subject: [Bioperl-l] Getting sequences by base pair locations
Message-ID: <05a8f85bdc36dc7e370d0935b6394f9d@ucl.ac.uk>

Hello all,

I was BLATing a few hundred human genes against the chimp genome, and kept the best chimp hits for every human gene. I have the base pair start and end location for every chimp hit, and I need to get the sequence for each of these chimp hits. Here is an example for a few chimp hits bp locations:

Start     End
142854    144504
154479    155198
153066    167370
163146    163559

I have one chimp genome file (about 3GB) including all chromosomes, but I could also get one file per chromosome if that would make things easier. Does anyone have a script or a link for an interface that can do the job?

Thank you very much.

From hlapp at gmx.net Fri Jul 28 08:59:43 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Fri, 28 Jul 2006 08:59:43 -0400
Subject: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields)
In-Reply-To:
References:
Message-ID: <233D3060-5CF7-4DF7-8EF6-6762CF45B94D@gmx.net>

If I understand Sendu's proposal correctly then the existing methods in Bio::DB::Taxonomy will remain largely unchanged (methods may be added though).

Can you describe briefly what you use Bio::Taxonomy for, e.g., which methods you use primarily and the context?

-hilmar

On Jul 28, 2006, at 8:10 AM, Gabriel Valiente wrote:

>>>> At the moment it seems to me that the Bio::Taxonomy modules
>>>> (excluding
>>>> Node) aren't really usable.
>
> I've been using Bio::Taxonomy and Bio::DB::Taxonomy a lot, they are
> very useful modules. Whatever Bio::Taxon or Bio::Taxonomy::Taxon
> turns out to be, please do keep the Bio::DB::Taxonomy functionality.
>
> BTW, does anybody know how to include branch lengths in
> Bio::DB::Taxonomy?
>
> Thanks a lot,
>
> Gabriel
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>

--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From bix at sendu.me.uk Fri Jul 28 09:01:44 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Fri, 28 Jul 2006 14:01:44 +0100
Subject: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields)
In-Reply-To:
References:
Message-ID: <44CA0AB8.7040205@sendu.me.uk>

Gabriel Valiente wrote:
>>>> At the moment it seems to me that the Bio::Taxonomy modules
>>>> (excluding
>>>> Node) aren't really usable.
>
> I've been using Bio::Taxonomy

Can I ask how you've been using it?

> and Bio::DB::Taxonomy a lot, they are
> very useful modules.
Whatever Bio::Taxon or Bio::Taxonomy::Taxon > turns out to be, please do keep the Bio::DB::Taxonomy functionality. Bio::DB::Taxonomy is staying virtually unaltered. > BTW, does anybody know how to include branch lengths in > Bio::DB::Taxonomy? At the moment, you don't 'include' anything at all in the DB modules yourself, since they are read-only. They give you Nodes which you can alter afterwards. I plan to add something like a 'distance to parent' in Node (Bio::Taxon) so you can work out branch lengths; you can't do that yet. From bix at sendu.me.uk Fri Jul 28 09:13:44 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Fri, 28 Jul 2006 14:13:44 +0100 Subject: [Bioperl-l] Getting sequences by base pair locations In-Reply-To: <05a8f85bdc36dc7e370d0935b6394f9d@ucl.ac.uk> References: <05a8f85bdc36dc7e370d0935b6394f9d@ucl.ac.uk> Message-ID: <44CA0D88.3000404@sendu.me.uk> Yuval Itan wrote: > Hello all, > > I was BLATing a few hundred human genes against the chimp genome, and > kept the best chimp hits for every human gene. > I have the base pair start and end location for every chimp hit, and I > need to get the sequence for each of these chimp hits. Here is an > example for a few chimp hits bp locations: > > Start End* > *142854 144504 > 154479 155198 > 153066 167370 > 163146 163559 > > I have one chimp genome file (about 3GB) including all chromosomes, but > I could also get one file per chromosome if that would make things > easier. Does anyone have a script or a link for an interface that can do > the job? If your genome file is in some standard format, use SeqIO. http://www.bioperl.org/wiki/HOWTO:SeqIO And then get the sequence corresponding to the correct chromosome and get the desired chunk with subseq(); http://www.bioperl.org/wiki/HOWTO:Beginners#The_Sequence_Object You'd also have to make sure that the data used during the blat is exactly the same data you have in your big file. 
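The SeqIO plus subseq() approach suggested above might look roughly like the following. It is a sketch under assumptions rather than a tested pipeline: the original post lists only start and end positions, so the chromosome ID `chr1`, the sub name `extract_hits` and the file name in the usage comment are placeholders. subseq() takes 1-based, inclusive coordinates, so if your BLAT output reports 0-based starts, add 1 to each start first.

```perl
use strict;
use warnings;
use Bio::SeqIO;

# Pull hit sequences out of a (multi-)FASTA genome file.
# $hits maps a sequence ID, e.g. a chromosome name, to a list of
# [start, end] pairs in 1-based, inclusive coordinates.
# Returns a list of [id, start, end, sequence_string] records.
sub extract_hits {
    my ($fasta_path, $hits) = @_;
    my $in = Bio::SeqIO->new(-file => $fasta_path, -format => 'fasta');
    my @records;
    while (my $seq = $in->next_seq) {    # one chromosome in memory at a time
        my $id = $seq->display_id;
        for my $hit (@{ $hits->{$id} || [] }) {
            my ($start, $end) = @$hit;
            push @records, [ $id, $start, $end, $seq->subseq($start, $end) ];
        }
    }
    return @records;
}

# Usage: perl extract_hits.pl chimp_genome.fa   (file name is a placeholder)
if (@ARGV) {
    # Coordinates come from the post; assigning them to chr1 is hypothetical.
    my %hits = (chr1 => [ [ 142854, 144504 ], [ 154479, 155198 ] ]);
    for my $rec (extract_hits($ARGV[0], \%hits)) {
        printf ">%s:%d-%d\n%s\n", @$rec;
    }
}
```

Streaming the file this way holds one whole chromosome in memory at a time, which is acceptable for a single pass over the genome but slow if you need to revisit the file for many separate queries.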
From sdavis2 at mail.nih.gov Fri Jul 28 09:28:02 2006 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Fri, 28 Jul 2006 09:28:02 -0400 Subject: [Bioperl-l] whole genome annotation In-Reply-To: References: <44C9FC15.3040503@mail.nih.gov> Message-ID: <44CA10E2.8010205@mail.nih.gov> Richard Birnie wrote: > > -----Original Message----- > From: Sean Davis [mailto:sdavis2 at mail.nih.gov] > Sent: Fri 7/28/2006 12:59 > To: Richard Birnie > Cc: bioperl-l at lists.open-bio.org > Subject: Re: [Bioperl-l] whole genome annotation > > Richard Birnie wrote: > >>Hello all, >> >>I'm just trying to familiarise myself with BioPerl and I'm a little overwhelmed by the sheer volume of information available on the wiki. I'm hoping someone can point in the right direction through the labyrinth. This may become a little longwinded but I'll try and get all the annoying newbie questions out of the way in one go. >> >>Let me try and explain what I'm aiming for. I have some CGH data downloaded from the Progenetix database (http://www.progenetix.de/~pgscripts/progenetix/Aboutprogenetix.html), this data is simplified to record simply gain/loss/amplification of whole chromosome bands at 862 band resolution to facilitate the combination of data from multiple different studies. >> >>What I'd like to be able to do is download a copy of the human genome sequence with annotation describing the locations of chromosome bands and preferably of known genes. I then want to be able to manipulate the genome data based on the CGH data to mimic deletions. The ultimate goal of this is to be able to feed the manipulated genome data into a program (metashark) that predicts the structure of metabolic networks based on genome annotation compared to a reference genome, in this case a complete 'normal' human genome and see what effect that has on the metabolic pathways. >> >>I appreciate that is a bit vague but thats sort of my problem, I'm not a bioinformatician really so I'm not sue of the details of what I want. 
I just happen to have an question to answer and bioperl seems the way to go (for this project and more generally). I've started looking at the HOWTOs and read the main bioperl tutorial. I also looked at the CGL comparative genomics library but I haven't penetrated far into that yet. I'm ok with basic perl although not much object oriented stuff. I don't really have much experience with handling sequence data on a whole genome scale either, a few genbank files for my favourite genes is fine but I need some guidance to work on this scale. >> >>What I'm looking for is someone to give me a start. I'd greatly appreciate it if someone could spell out the general steps for downloading a complete copy of the human genome and its annotations (if this is even a feasible approach) and how to put it all together. Not actual code just the general concept for each step and which tools from the bioperl set would be most appropriate for each step so that I can focus what I need to read about, even a little pseudo-code if I'm lucky. If I can get the genome data downloaded and setup properly I'll work out how to apply the CGH data to it myself. >> >>If example code for what I'm trying to describe is included somewhere, great could someone point to where. > > > Hi, Richard. > > Bioperl is good for many things, but for simply grabbing all the > locations of human genes in the genome and chromosome band locations, I > wouldn't use bioperl. It sounds to me like you are interested in > getting the genes associated with each chromosomal band? If so, just > download the cytoband.txt and refFlat.txt files from the UCSC genome > browser site. cytoband.txt contains the base pair locations for each of > the cytobands. refFlat.txt contains the base pair locations of "refseq" > genes. It is then simply a matter of finding overlapping regions (genes > with cytobands) to determine which genes are in which cytobands. 
Since > the files are tab-delimited text, they are very easy to work with (in > perl, excel, python, ...). Don't get me wrong--I really appreciate the > power of bioperl, but in this case, your task lends itself to a simpler > (and MUCH) faster approach. > > Sean > > Thanks for the response Sean, > > getting the genes associated with each band is certainly part of what I need and your suggestion will help with that. I did look at the UCSC site but as you know there is such a volume of info on there I didn't really know which files I needed. > > However my main goal requires slightly more. What I want to be able to do is take the chromosomal band annotation info and compare that against the CGH data I have. From this I'd like to then be able say "OK band 8q13.1 (or whatever) is deleted, so make a copy of the genome with the actual sequence associated with that band removed." Then I could feed both sequences into metashark which predicts the structure of metabolic pathways based on genome annotation and see what effect deleting that region of DNA has on the structure of the metabolic network. Knowing which genes are involved is necessary for identifying what are the important components within the region. Are there tools in Bioperl for making this comparison? It can probably be reduced to a straight comparison of data structures so I may just use regular perl for this part unless there is anything designed for purpose. > > The thing I was struggling with was how to store and manipulate genomic sequence data in such quantities. Since this morning I've had a better look at the CGL library and associated datastore module, I think I can do it using these tools but I'm having a few dependency issues getting it installed right now. So I'll go back to wrestling with that. Ahh. I see. Metashark actually searches the remaining sequence in the human genome? 
If that is the case, then you need the start and end positions of the chromosomal bands, which you can download from the UCSC genome browser. Follow the links to download and then to the genome of your choice and finally get the cytoband.txt file. The other piece of the puzzle is the Bio::DB::Fasta module. It allows extremely fast access to a set of fasta files, which it first indexes. Here is the documentation for it:

http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/DB/Fasta.html

You could imagine making a hash, keyed by chromosome band, whose values hold the start and end of each band. For each CGH experiment, find those regions that are deleted. Exclude those when looping through all the chromosome bands, pulling the sequence using Bio::DB::Fasta for each band and writing that to a file for metashark.

Sean

From sdavis2 at mail.nih.gov Fri Jul 28 09:30:52 2006
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Fri, 28 Jul 2006 09:30:52 -0400
Subject: [Bioperl-l] Getting sequences by base pair locations
In-Reply-To: <05a8f85bdc36dc7e370d0935b6394f9d@ucl.ac.uk>
References: <05a8f85bdc36dc7e370d0935b6394f9d@ucl.ac.uk>
Message-ID: <44CA118C.7010401@mail.nih.gov>

Yuval Itan wrote:
> Hello all,
>
> I was BLATing a few hundred human genes against the chimp genome, and
> kept the best chimp hits for every human gene.
> I have the base pair start and end location for every chimp hit, and I
> need to get the sequence for each of these chimp hits. Here is an
> example for a few chimp hits bp locations:
>
> Start End*
> *142854 144504
> 154479 155198
> 153066 167370
> 163146 163559
>
> I have one chimp genome file (about 3GB) including all chromosomes, but
> I could also get one file per chromosome if that would make things
> easier. Does anyone have a script or a link for an interface that can do
> the job?
See this module:

http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/DB/Fasta.html

Sean

From osborne1 at optonline.net Fri Jul 28 09:35:02 2006
From: osborne1 at optonline.net (Brian Osborne)
Date: Fri, 28 Jul 2006 09:35:02 -0400
Subject: [Bioperl-l] whole genome annotation
In-Reply-To:
Message-ID:

Richard,

A good starting point is a FAQ page we created that describes various ways of extracting genomic sequence:

http://www.bioperl.org/wiki/Getting_Genomic_Sequences

Check that out, and Sean's suggestion, and write back to bioperl-l if you have questions. One thing that this page doesn't really address is the special challenge that comes with working with very large sequences; this is something you might have to consider as well.

You also asked about downloading the human genome and its annotations. There's more than one way to do this as well. You'd have access to this data if you used the ENSEMBL API, but you can also get the Genbank files at ftp://ftp.ncbi.nih.gov/genomes/. Having said that, I should add that one of the advantages of the ENSEMBL API approach is that you don't have to download the entire genome. Don't know what machine you're working on but, again, trying to manipulate very large sequences may tax your computer as well as your patience.

Brian O.

On 7/28/06 5:39 AM, "Richard Birnie" wrote:

> Hello all,
>
> I'm just trying to familiarise myself with BioPerl and I'm a little
> overwhelmed by the sheer volume of information available on the wiki. I'm
> hoping someone can point in the right direction through the labyrinth. This
> may become a little longwinded but I'll try and get all the annoying newbie
> questions out of the way in one go.
>
> Let me try and explain what I'm aiming for.
I have some CGH data downloaded > from the Progenetix database > (http://www.progenetix.de/~pgscripts/progenetix/Aboutprogenetix.html), this > data is simplified to record simply gain/loss/amplification of whole > chromosome bands at 862 band resolution to facilitate the combination of data > from multiple different studies. > > What I'd like to be able to do is download a copy of the human genome sequence > with annotation describing the locations of chromosome bands and preferably of > known genes. I then want to be able to manipulate the genome data based on the > CGH data to mimic deletions. The ultimate goal of this is to be able to feed > the manipulated genome data into a program (metashark) that predicts the > structure of metabolic networks based on genome annotation compared to a > reference genome, in this case a complete 'normal' human genome and see what > effect that has on the metabolic pathways. > > I appreciate that is a bit vague but thats sort of my problem, I'm not a > bioinformatician really so I'm not sue of the details of what I want. I just > happen to have an question to answer and bioperl seems the way to go (for this > project and more generally). I've started looking at the HOWTOs and read the > main bioperl tutorial. I also looked at the CGL comparative genomics library > but I haven't penetrated far into that yet. I'm ok with basic perl although > not much object oriented stuff. I don't really have much experience with > handling sequence data on a whole genome scale either, a few genbank files for > my favourite genes is fine but I need some guidance to work on this scale. > > What I'm looking for is someone to give me a start. I'd greatly appreciate it > if someone could spell out the general steps for downloading a complete copy > of the human genome and its annotations (if this is even a feasible approach) > and how to put it all together. 
Not actual code just the general concept for > each step and which tools from the bioperl set would be most appropriate for > each step so that I can focus what I need to read about, even a little > pseudo-code if I'm lucky. If I can get the genome data downloaded and setup > properly I'll work out how to apply the CGH data to it myself. > > If example code for what I'm trying to describe is included somewhere, great > could someone point to where. > > Thanks for your patience. > best regards, > Richard > > > > Dr Richard Birnie > Scientific Officer > Section of Pathology and Tumour Biology > Welcome Brenner Building, LIMM > St James University Hospital > Beckett St, Leeds, LS9 7TF > Tel:0113 3438624 > e-mail: r.birnie at leeds.ac.uk > > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From sdavis2 at mail.nih.gov Fri Jul 28 09:41:45 2006 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Fri, 28 Jul 2006 09:41:45 -0400 Subject: [Bioperl-l] Getting sequences by base pair locations In-Reply-To: <44CA0D88.3000404@sendu.me.uk> References: <05a8f85bdc36dc7e370d0935b6394f9d@ucl.ac.uk> <44CA0D88.3000404@sendu.me.uk> Message-ID: <44CA1419.3030100@mail.nih.gov> Sendu Bala wrote: > Yuval Itan wrote: > >>Hello all, >> >>I was BLATing a few hundred human genes against the chimp genome, and >>kept the best chimp hits for every human gene. >>I have the base pair start and end location for every chimp hit, and I >>need to get the sequence for each of these chimp hits. Here is an >>example for a few chimp hits bp locations: >> >>Start End* >>*142854 144504 >>154479 155198 >>153066 167370 >>163146 163559 >> >>I have one chimp genome file (about 3GB) including all chromosomes, but >>I could also get one file per chromosome if that would make things >>easier. Does anyone have a script or a link for an interface that can do >>the job? 
>
>
> If your genome file is in some standard format, use SeqIO.
> http://www.bioperl.org/wiki/HOWTO:SeqIO
>
> And then get the sequence corresponding to the correct chromosome and
> get the desired chunk with subseq();
> http://www.bioperl.org/wiki/HOWTO:Beginners#The_Sequence_Object

My guess is that Yuval will need random access to the sequences. With SeqIO, this is possible with a relatively large amount of memory, but Bio::DB::Fasta might be the better bet. Alternatively, make a custom track (see the documentation for doing so at the UCSC genome browser site), upload it, and then getting the DNA is trivial with just a couple of mouse clicks. This method also has the advantage of letting you view the data in genome coordinates, and it makes it possible to intersect your hits with known chimp genes, so you could find hits that don't overlap known chimp genes, for example.

Sean

From valiente at lsi.upc.edu Fri Jul 28 09:53:10 2006
From: valiente at lsi.upc.edu (Gabriel Valiente)
Date: Fri, 28 Jul 2006 16:53:10 +0300
Subject: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields)
In-Reply-To: <001301c6b24b$da38ba80$15327e82@pyrimidine>
References: <001301c6b24b$da38ba80$15327e82@pyrimidine>
Message-ID: <1D2EF357-E8DE-4F15-824A-9C9359389685@lsi.upc.edu>

> Would be nice to know how you use Bio::Taxonomy. You are the first
> here who
> seems to have a use for it.

I'm using it to obtain a reference taxonomy for a set of organisms, against which to assess a phylogeny obtained by the usual protocol (fetch rRNA sequences, align them, obtain a distance matrix, cluster).
Roughly:

    use Bio::DB::Taxonomy;

    my $nodesfile = "nodes.dmp";
    my $namesfile = "names.dmp";
    my $db = Bio::DB::Taxonomy->new(-source    => 'flatfile',
                                    -directory => "./db/",
                                    -nodesfile => $nodesfile,
                                    -namesfile => $namesfile);

    my @species = (...);
    for my $ncbi_name (@species) {
        my $ncbi_id = $db->get_taxonid($ncbi_name);
        my $node    = $db->get_Taxonomy_Node(-taxonid => $ncbi_id);
        my @lineage = get_lineage_nodes($node);
        # ...
    }

Here, get_lineage_nodes could be added as a method to Bio::Taxonomy::Node or equivalent:

    sub get_lineage_nodes {
        my $node = shift;
        my @lineage;
        # Walk up the parents until the root is reached, building the
        # lineage from the root downwards.
        while ($node->node_name ne "root") {
            $node = $node->get_Parent_Node;
            unshift @lineage, $node;
        }
        return @lineage;
    }

I've also written a method to merge the full lineages of a set of Bio::Taxonomy::Node objects into a Bio::Tree::Tree object. I'd be glad to contribute it as well, but I'm not sure where it would fit.

> As for branch lengths, I think you're confusing
> 'taxonomy' (classification
> of organisms based on just about anything) with
> 'phylogeny' (evolutionary
> relatedness). Note in the Wikipedia article below the use of the term
> 'phylogenetic taxonomy', which is the classification of organisms
> based on
> evolutionary relationships.
>
> http://en.wikipedia.org/wiki/Taxonomy
>
> http://en.wikipedia.org/wiki/Phylogeny
>
> NCBI has a disclaimer about the Taxonomy database that is related
> to this:
>
> http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=howcite
>
> There are HOWTOs on tree manipulation, population genetics, and
> PAML on the
> wiki which might be a good start for Bioperl phylogenetic methods:
>
> http://www.bioperl.org/wiki/HOWTO:Trees
>
> http://www.bioperl.org/wiki/HOWTO:PAML
>
> http://www.bioperl.org/wiki/HOWTO:PopGen

Thanks a lot. Let me check it and get back to the discussion later on.
Gabriel > Chris > >> -----Original Message----- >> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- >> bounces at lists.open-bio.org] On Behalf Of Gabriel Valiente >> Sent: Friday, July 28, 2006 7:10 AM >> To: bioperl-l at lists.open-bio.org >> Subject: Re: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields) >> >>>>> At the moment it seems to me that the Bio::Taxonomy modules >>>>> (excluding >>>>> Node) aren't really usable. >> >> I've been using Bio::Taxonomy and Bio::DB::Taxonomy a lot, they are >> very useful modules. Whatever Bio::Taxon or Bio::Taxonomy::Taxon >> turns out to be, please do keep the Bio::DB::Taxonomy functionality. >> >> BTW, does anybody know how to include branch lengths in >> Bio::DB::Taxonomy? >> >> Thanks a lot, >> >> Gabriel >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l From R.Birnie at leeds.ac.uk Fri Jul 28 09:56:15 2006 From: R.Birnie at leeds.ac.uk (Richard Birnie) Date: Fri, 28 Jul 2006 14:56:15 +0100 Subject: [Bioperl-l] whole genome annotation References: Message-ID: Thanks folks, That should be enough to get me going. At least I can see the wood for the trees now. Richard Dr Richard Birnie Scientific Officer Section of Pathology and Tumour Biology Welcome Brenner Building, LIMM St James University Hospital Beckett St, Leeds, LS9 7TF Tel:0113 3438624 e-mail: r.birnie at leeds.ac.uk -----Original Message----- From: Brian Osborne [mailto:osborne1 at optonline.net] Sent: Fri 7/28/2006 14:35 To: Richard Birnie; bioperl-l at lists.open-bio.org Subject: Re: [Bioperl-l] whole genome annotation Richard, A good starting point is a FAQ page we created that describes various ways of extracting genomic sequence: http://www.bioperl.org/wiki/Getting_Genomic_Sequences Check that out, and Sean's suggestion, and write back to bioperl-l if you have questions. 
One thing that this page doesn't really address is the special challenge that comes with working with very large sequences, this is something you might have to consider as well. You also asked about downloading the human genome and its annotations. There's also more than one way to do this as well. You'd have access to this data if you used the ENSEMBL API but you can get the Genbank files at ftp://ftp.ncbi.nih.gov/genomes/. Having said that I should add that one of the advantages of the ENSEMBL API approach is that you don't have to download the entire genome. Don't know what machine you're working on but, again, trying to manipulate very large sequences may tax your computer as well as your patience. Brian O. On 7/28/06 5:39 AM, "Richard Birnie" wrote: > Hello all, > > I'm just trying to familiarise myself with BioPerl and I'm a little > overwhelmed by the sheer volume of information available on the wiki. I'm > hoping someone can point in the right direction through the labyrinth. This > may become a little longwinded but I'll try and get all the annoying newbie > questions out of the way in one go. > > Let me try and explain what I'm aiming for. I have some CGH data downloaded > from the Progenetix database > (http://www.progenetix.de/~pgscripts/progenetix/Aboutprogenetix.html), this > data is simplified to record simply gain/loss/amplification of whole > chromosome bands at 862 band resolution to facilitate the combination of data > from multiple different studies. > > What I'd like to be able to do is download a copy of the human genome sequence > with annotation describing the locations of chromosome bands and preferably of > known genes. I then want to be able to manipulate the genome data based on the > CGH data to mimic deletions. 
The ultimate goal of this is to be able to feed > the manipulated genome data into a program (metashark) that predicts the > structure of metabolic networks based on genome annotation compared to a > reference genome, in this case a complete 'normal' human genome and see what > effect that has on the metabolic pathways. > > I appreciate that is a bit vague but thats sort of my problem, I'm not a > bioinformatician really so I'm not sue of the details of what I want. I just > happen to have an question to answer and bioperl seems the way to go (for this > project and more generally). I've started looking at the HOWTOs and read the > main bioperl tutorial. I also looked at the CGL comparative genomics library > but I haven't penetrated far into that yet. I'm ok with basic perl although > not much object oriented stuff. I don't really have much experience with > handling sequence data on a whole genome scale either, a few genbank files for > my favourite genes is fine but I need some guidance to work on this scale. > > What I'm looking for is someone to give me a start. I'd greatly appreciate it > if someone could spell out the general steps for downloading a complete copy > of the human genome and its annotations (if this is even a feasible approach) > and how to put it all together. Not actual code just the general concept for > each step and which tools from the bioperl set would be most appropriate for > each step so that I can focus what I need to read about, even a little > pseudo-code if I'm lucky. If I can get the genome data downloaded and setup > properly I'll work out how to apply the CGH data to it myself. > > If example code for what I'm trying to describe is included somewhere, great > could someone point to where. > > Thanks for your patience. 
> best regards, > Richard > > > > Dr Richard Birnie > Scientific Officer > Section of Pathology and Tumour Biology > Welcome Brenner Building, LIMM > St James University Hospital > Beckett St, Leeds, LS9 7TF > Tel:0113 3438624 > e-mail: r.birnie at leeds.ac.uk > > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Fri Jul 28 09:43:47 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Fri, 28 Jul 2006 08:43:47 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields) In-Reply-To: Message-ID: <001301c6b24b$da38ba80$15327e82@pyrimidine> Now I get personal email? Yikes! Sendu has indicated that Bio::DB::Taxonomy will stay essentially unchanged. If anything changes, it >may< be the class used to hold the Node information. Would be nice to know how you use Bio::Taxonomy. You are the first here who seems to have a use for it. As for branch lengths, I think you're confusing 'taxonomy' (classification of organisms based on just about anything) with 'phylogeny' (evolutionary relatedness). Note in the Wikipedia article below the use of the term 'phylogenetic taxonomy', which is the classification of organisms based on evolutionary relationships. 
http://en.wikipedia.org/wiki/Taxonomy http://en.wikipedia.org/wiki/Phylogeny NCBI has a disclaimer about the Taxonomy database that is related to this: http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=how cite There are HOWTOs on tree manipulation, population genetics, and PAML on the wiki which might be a good start for Bioperl phylogenetic methods: http://www.bioperl.org/wiki/HOWTO:Trees http://www.bioperl.org/wiki/HOWTO:PAML http://www.bioperl.org/wiki/HOWTO:PopGen Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Gabriel Valiente > Sent: Friday, July 28, 2006 7:10 AM > To: bioperl-l at lists.open-bio.org > Subject: Re: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields) > > >>> At the moment it seems to me that the Bio::Taxonomy modules > >>> (excluding > >>> Node) aren't really usable. > > I've been using Bio::Taxonomy and Bio::DB::Taxonomy a lot, they are > very useful modules. Whatever Bio::Taxon or Bio::Taxonomy::Taxon > turns out to be, please do keep the Bio::DB::Taxonomy functionality. > > BTW, does anybody know how to include branch lengths in > Bio::DB::Taxonomy? > > Thanks a lot, > > Gabriel > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Fri Jul 28 10:15:38 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Fri, 28 Jul 2006 09:15:38 -0500 Subject: [Bioperl-l] Getting sequences by base pair locations In-Reply-To: <44CA118C.7010401@mail.nih.gov> Message-ID: <001401c6b250$4e3c2490$15327e82@pyrimidine> Yutal, You can also do this remotely if the file you want is in GenBank (and you don't want to store the data locally). The nice thing about using this is any seqfeatures in the GenBank file within the region requested is also returned. 
Note that if data is stored in a RefSeq file you'll need to add the parameter '-no_redirect => 1,' to the Bio::DB::GenBank object. I would NOT recommend this for huge numbers of sequences (>2000) as you would be spamming NCBI with thousands of repeated requests; if you did have a relatively large number you could run this overnight, which is what I do. Bio::DB::Fasta would be better if you have tons of hits.

Use this in a loop to grab the sequences one at a time based on the start and stop positions (and strand, if you need it):

# Below is from Bio::DB::GenBank POD, with some modifications
my $factory = Bio::DB::GenBank->new(
    -seq_start => $start,
    -seq_stop  => $end,
    -strand    => $strand    # 1=plus, 2=minus
);
my $seq_obj;
eval {
    $seq_obj = $factory->get_Seq_by_acc($sf->seq_id);
};
if( $@ ) {
    print STDERR "Unable to retrieve from $start to $end.\n";
    print STDERR "Error!\n$@";
    print STDERR "Attempting to move on...\n";
    next;
}
print STDERR "Got sequence: ",$seq_obj->description,"\n";
print STDERR "\tLength: ",$seq_obj->length,"\n";
my $sf_len = $sf->length;

The eval{} block is needed so that a failed retrieval over the network does not end the run: the object throws an error, eval catches it, the error is logged to STDERR, and the loop continues.

Chris

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of Sean Davis
> Sent: Friday, July 28, 2006 8:31 AM
> To: Yuval Itan
> Cc: bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] Getting sequences by base pair locations
>
> Yuval Itan wrote:
> > Hello all,
> >
> > I was BLATing a few hundred human genes against the chimp genome, and
> > kept the best chimp hits for every human gene.
> > I have the base pair start and end location for every chimp hit, and I
> > need to get the sequence for each of these chimp hits.
Here is an > > example for a few chimp hits bp locations: > > > > Start End* > > *142854 144504 > > 154479 155198 > > 153066 167370 > > 163146 163559 > > > > I have one chimp genome file (about 3GB) including all chromosomes, but > > I could also get one file per chromosome if that would make things > > easier. Does anyone have a script or a link for an interface that can do > > the job? > > See this module: > > http://doc.bioperl.org/releases/bioperl-current/bioperl- > live/Bio/DB/Fasta.html > > Sean > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Fri Jul 28 10:35:21 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Fri, 28 Jul 2006 09:35:21 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields) In-Reply-To: <1D2EF357-E8DE-4F15-824A-9C9359389685@lsi.upc.edu> Message-ID: <001501c6b253$0fed08a0$15327e82@pyrimidine> > use Bio::DB::Taxonomy; > I've also written a method to merge the full lineages of a set of > Bio::Taxonomy::Node object into a Bio::Tree::Tree object. I'd be glad > to contribute it as well, but I'm not sure where it would fit. Ah, that would be great (I had mentioned something along these lines to do with BLAST reports). But does this actually use Bio::Taxonomy directly? Taxonomy::Node does not inherit methods from Bio::Taxonomy AFAIK. So, anything that Sendu does may not dramatically impact your code. Sendu? You might need to address some of this to Sendu. Big changes are afoot for Bio::Taxonomy and Bio::Taxonomy::Node. He's heading that up. Chris > ... > Thanks a lot. Let me check it and get back to the discussion later on. > > Gabriel > > > Chris > > ... 
From cjfields at uiuc.edu Fri Jul 28 10:37:09 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Fri, 28 Jul 2006 09:37:09 -0500 Subject: [Bioperl-l] Getting sequences by base pair locations In-Reply-To: <44CA1419.3030100@mail.nih.gov> Message-ID: <001601c6b253$4ec57170$15327e82@pyrimidine> ... > > If your genome file is in some standard format, use SeqIO. > > http://www.bioperl.org/wiki/HOWTO:SeqIO > > > > And then get the sequence corresponding to the correct chromosome and > > get the desired chunk with subseq(); > > http://www.bioperl.org/wiki/HOWTO:Beginners#The_Sequence_Object > > My guess is that Yuval will need random access to the sequences. With > seqIO, this is possible with a relatively large amount of memory, but > Bio::DB::Fasta might be the better bet. Agreed. This is one of the bioperl 'speed' issue areas: http://www.bioperl.org/wiki/Project_priority_list Bio::DB::Fasta returns a specialized PrimarySeq object which gets around the current speed issues with SeqIO. > Alternatively, make a custom track (see the documentation for doing so > at the UCSC genome browser site), upload it, and then getting the DNA is > trivial with just a couple of mouseclicks. This method also has the > advantage of being able to do things like viewing the data in genome > coordinates and allows the possibility of doing interections with known > chimp genes so you could find hits that don't overlap known chimp genes, > for example. > > Sean Would be nice to have a more automated and direct way of doing something along these lines within bioperl (with the obvious caveat of not spamming the server). You can currently retrieve chunks of sequence based on start, stop, strand from GenBank. Ah, one can dream... 
Chris From bix at sendu.me.uk Fri Jul 28 10:38:20 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Fri, 28 Jul 2006 15:38:20 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields) In-Reply-To: <1D2EF357-E8DE-4F15-824A-9C9359389685@lsi.upc.edu> References: <001301c6b24b$da38ba80$15327e82@pyrimidine> <1D2EF357-E8DE-4F15-824A-9C9359389685@lsi.upc.edu> Message-ID: <44CA215C.2070607@sendu.me.uk> Gabriel Valiente wrote: >> Would be nice to know how you use Bio::Taxonomy. You are the first >> here who >> seems to have a use for it. > > I'm using it to obtain a reference taxonomy for a set of organisms, > against which to assess a phylogeny obtained by the usual protocol > (fetch rRNA sequences, align them, obtain a distance matrix, > cluster). Roughly: > > use Bio::DB::Taxonomy; Ah, we were specifically wondering if you had used Bio/Taxonomy.pm, not Taxonomy modules in general. Again, DB::Taxonomy usage will be unaffected. > Here, get_lineage_nodes could be added as a method to > Bio::Taxonomy::Node or equivalent: > > sub get_lineage_nodes{ > my $node = shift; > my @lineage; > while ($node->node_name ne "root") { > $node = $node->get_Parent_Node; > unshift @lineage, $node; > } > return @lineage; > } I think you must have an older version of bioperl. Bio::Taxonomy::Node has a method get_Lineage_Nodes() which more or less does exactly that. > I've also written a method to merge the full lineages of a set of > Bio::Taxonomy::Node object into a Bio::Tree::Tree object. I'd be glad > to contribute it as well, but I'm not sure where it would fit. 
Post it and I'll see if it will fit anywhere :) From cuiw at ncbi.nlm.nih.gov Fri Jul 28 09:46:50 2006 From: cuiw at ncbi.nlm.nih.gov (Cui, Wenwu (NIH/NLM/NCBI) [C]) Date: Fri, 28 Jul 2006 09:46:50 -0400 Subject: [Bioperl-l] Getting sequences by base pair locations In-Reply-To: <05a8f85bdc36dc7e370d0935b6394f9d@ucl.ac.uk> Message-ID: <18C407FD4FFB424292D769FBD68C1987C7C254@NIHCESMLBX8.nih.gov> Maybe the easiest way is to use LWP to get the webpage. Here is an example for CHIMP1A:10:12345678:12348888: http://www.ensembl.org/Pan_troglodytes/exportview?format=fasta&l=10%3A12 345678-12348888&action=export&_format=Text&output=txt&submit=Continue+%3 E%3E Wenwu Cui ________________________________ From: Yuval Itan [mailto:y.itan at ucl.ac.uk] Sent: Friday, July 28, 2006 8:08 AM To: bioperl-l at lists.open-bio.org Subject: [Bioperl-l] Getting sequences by base pair locations Hello all, I was BLATing a few hundred human genes against the chimp genome, and kept the best chimp hits for every human gene. I have the base pair start and end location for every chimp hit, and I need to get the sequence for each of these chimp hits. Here is an example for a few chimp hits bp locations: Start End 142854 144504 154479 155198 153066 167370 163146 163559 I have one chimp genome file (about 3GB) including all chromosomes, but I could also get one file per chromosome if that would make things easier. Does anyone have a script or a link for an interface that can do the job? Thank you very much. 
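The coordinate lookup asked about in this thread can be sketched in plain Perl. This is only a minimal illustration of the arithmetic, assuming each hit carries a sequence name, a 1-based inclusive start and end, and a strand; real BLAT output may use different conventions and would need adjusting. The %genome hash, the fetch_region helper and the hit list are made up for the example. For a 3 GB genome file an indexed reader such as Bio::DB::Fasta, as Sean suggests, is the practical route.

```perl
use strict;
use warnings;

# Toy "genome": sequence name => sequence string (hypothetical data).
my %genome = (
    chr1 => 'ACGTACGTAAGGCCTT',
    chr2 => 'TTTTGGGGCCCCAAAA',
);

# Extract a 1-based, inclusive [start, end] range.
# A negative strand returns the reverse complement.
sub fetch_region {
    my ($genome, $name, $start, $end, $strand) = @_;
    my $seq = $genome->{$name};
    die "unknown sequence '$name'\n" unless defined $seq;
    die "bad range $start-$end on $name\n"
        if $start < 1 || $end > length($seq) || $start > $end;
    my $sub = substr($seq, $start - 1, $end - $start + 1);
    if (defined $strand && $strand < 0) {
        $sub = scalar reverse $sub;
        $sub =~ tr/ACGTacgt/TGCAtgca/;
    }
    return $sub;
}

# Hypothetical hits: [name, start, end, strand].
my @hits = (
    ['chr1', 5, 8,  1],
    ['chr2', 3, 6, -1],
);

for my $hit (@hits) {
    my ($name, $start, $end, $strand) = @$hit;
    # eval keeps one bad hit from aborting the whole run.
    my $sub = eval { fetch_region(\%genome, $name, $start, $end, $strand) };
    if ($@) {
        warn "skipping $name:$start-$end: $@";
        next;
    }
    print ">$name:$start-$end($strand)\n$sub\n";
}
```

Note that the start and end alone are not enough: each hit also needs the name of the chromosome or contig it landed on.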
From valiente at lsi.upc.edu Fri Jul 28 10:49:28 2006 From: valiente at lsi.upc.edu (Gabriel Valiente) Date: Fri, 28 Jul 2006 17:49:28 +0300 Subject: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields) In-Reply-To: <001501c6b253$0fed08a0$15327e82@pyrimidine> References: <001501c6b253$0fed08a0$15327e82@pyrimidine> Message-ID: <5563CD94-DC99-46A3-A56A-485D4A4D3031@lsi.upc.edu> >> use Bio::DB::Taxonomy; > > > >> I've also written a method to merge the full lineages of a set of >> Bio::Taxonomy::Node object into a Bio::Tree::Tree object. I'd be glad >> to contribute it as well, but I'm not sure where it would fit. > > Ah, that would be great (I had mentioned something along these > lines to do > with BLAST reports). But does this actually use Bio::Taxonomy > directly? > Taxonomy::Node does not inherit methods from Bio::Taxonomy AFAIK. So, > anything that Sendu does may not dramatically impact your code. > Sendu? It is a general algorithm I devised that takes a set of paths and builds up a tree. The input paths are full lineages coming from Bio::DB::Taxonomy, the output tree is a Bio::Tree::Tree. This is why I said I don't know exactly where it would belong, it looks to me more like a standalone script than a Bio::Taxonomy or Bio::Tree method. Gabriel > You might need to address some of this to Sendu. Big changes are > afoot for > Bio::Taxonomy and Bio::Taxonomy::Node. He's heading that up. > > Chris > >> ... >> Thanks a lot. Let me check it and get back to the discussion later >> on. >> >> Gabriel >> >>> Chris >>> > ... 
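The path-merging idea described above (a set of root-to-leaf lineages in, one tree out) can be sketched without any BioPerl objects. This is a rough illustration, not the script under discussion: it assumes plain hash nodes rather than Bio::Tree::Node, and the lineages are shortened and hand-written rather than fetched from Bio::DB::Taxonomy. The helper names new_node, merge_path and to_newick are chosen for the example.

```perl
use strict;
use warnings;

# Each lineage is a root-to-leaf list of names, with the root itself omitted.
# Hypothetical, shortened lineages for illustration only.
my @lineages = (
    ['Hominidae', 'Pongo', 'Pongo pygmaeus'],
    ['Hominidae', 'Homo/Pan/Gorilla group', 'Gorilla', 'Gorilla gorilla'],
    ['Hominidae', 'Homo/Pan/Gorilla group', 'Pan',     'Pan troglodytes'],
    ['Hominidae', 'Homo/Pan/Gorilla group', 'Homo',    'Homo sapiens'],
);

# A node is a hash with a name and an ordered list of children.
sub new_node { return { name => $_[0], children => [] } }

# Walk one lineage down from $node, reusing a child whose name matches
# and creating a new child otherwise.
sub merge_path {
    my ($node, @path) = @_;
    for my $name (@path) {
        my ($child) = grep { $_->{name} eq $name } @{ $node->{children} };
        unless ($child) {
            $child = new_node($name);
            push @{ $node->{children} }, $child;
        }
        $node = $child;
    }
    return;
}

# Serialize to Newick with internal labels kept; labels containing
# anything other than word characters are double-quoted.
sub to_newick {
    my ($node) = @_;
    my $label = $node->{name};
    $label = qq{"$label"} if $label =~ /\W/;
    my @kids = map { to_newick($_) } @{ $node->{children} };
    return @kids ? '(' . join(',', @kids) . ')' . $label : $label;
}

my $root = new_node('root');
merge_path($root, @$_) for @lineages;
print to_newick($root), ";\n";
```

Shared prefixes are stored once, so the four lineages end up as a single Hominidae subtree with two branches.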
From sdavis2 at mail.nih.gov Fri Jul 28 11:21:09 2006 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Fri, 28 Jul 2006 11:21:09 -0400 Subject: [Bioperl-l] Getting sequences by base pair locations In-Reply-To: <001601c6b253$4ec57170$15327e82@pyrimidine> References: <001601c6b253$4ec57170$15327e82@pyrimidine> Message-ID: <44CA2B65.8070906@mail.nih.gov> Chris Fields wrote: > Would be nice to have a more automated and direct way of doing something > along these lines within bioperl (with the obvious caveat of not spamming > the server). You can currently retrieve chunks of sequence based on start, > stop, strand from GenBank. The ENSembl API has some features that can be useful for these types of things. I, personally, have a mirror of the UCSC mysql database (very easy to do with just rsync and mysql) and try to turn questions like these into SQL queries. That, combined with Bio::DB::Fasta, can make a useful automated pipeline for getting arbitrary sequences associated with genomic locations meeting specific criteria. It is much faster than anything one can do over the web and doesn't have access limitations. Sean From cjfields at uiuc.edu Fri Jul 28 11:27:17 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Fri, 28 Jul 2006 10:27:17 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields) In-Reply-To: <5563CD94-DC99-46A3-A56A-485D4A4D3031@lsi.upc.edu> Message-ID: <000001c6b25a$4f9392b0$15327e82@pyrimidine> > It is a general algorithm I devised that takes a set of paths and > builds up a tree. The input paths are full lineages coming from > Bio::DB::Taxonomy, the output tree is a Bio::Tree::Tree. This is why > I said I don't know exactly where it would belong, it looks to me > more like a standalone script than a Bio::Taxonomy or Bio::Tree method. > > Gabriel Agreed. 
You could submit the script as an example here if it is short, or via Bugzilla as an enhancement request: http://bugzilla.open-bio.org/ It could be added to the scripts\ or examples\ directory in bioperl-core. Chris From valiente at lsi.upc.edu Fri Jul 28 12:35:20 2006 From: valiente at lsi.upc.edu (Gabriel Valiente) Date: Fri, 28 Jul 2006 19:35:20 +0300 Subject: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields) In-Reply-To: <000001c6b25a$4f9392b0$15327e82@pyrimidine> References: <000001c6b25a$4f9392b0$15327e82@pyrimidine> Message-ID: <3DB992C6-DF16-42B9-8C36-F3B5C8CCBDE7@lsi.upc.edu> >> It is a general algorithm I devised that takes a set of paths and >> builds up a tree. The input paths are full lineages coming from >> Bio::DB::Taxonomy, the output tree is a Bio::Tree::Tree. This is why >> I said I don't know exactly where it would belong, it looks to me >> more like a standalone script than a Bio::Taxonomy or Bio::Tree >> method. >> >> Gabriel > > Agreed. You could submit the script as an example here if it is > short, or > via Bugzilla as an enhancement request: > > http://bugzilla.open-bio.org/ > > It could be added to the scripts\ or examples\ directory in bioperl- > core. Here it is. Please check it and include for instance as taxonomy2tree.PLS in the scripts/tree or scripts/taxonomy directory. Disclaimer: I'm also publishing part of this code in a conference paper. The script is already fully functional but anyway, I have a couple of improvements in mind. The minor one is provision for cmdline input. How would you like to input an array of names? 
The other one is to remove internal node labels and contract elementary paths, for instance reducing the tree:

(((((((((((((((((((((((((((("Pongo pygmaeus")Pongo,(("Gorilla gorilla")Gorilla,("Pan troglodytes")Pan,("Homo sapiens")Homo)"Homo/Pan/Gorilla group")Hominidae)Hominoidea)Catarrhini)Simiiformes)Primates)Euarchontoglires)Eutheria)Theria)Mammalia)Amniota)Tetrapoda)Sarcopterygii)Euteleostomi)Teleostomi)"Gnathostomata ")Vertebrata)"Craniata ")Chordata)Deuterostomia)Coelomata)Bilateria)Eumetazoa)Metazoa)"Fungi/Metazoa group")Eukaryota)"cellular organisms")root;

to the tree:

("Pongo pygmaeus",("Gorilla gorilla","Pan troglodytes","Homo sapiens"));

It is certainly easy to remove all internal node labels. On the other hand, I've been working on contraction of elementary paths for quite a while, but always got stuck with internals of the Bio::Tree methods to remove nodes. Thus, so far the only working code I have consists of removing elementary branches while making a deep copy of the tree, which certainly is not quite elegant...
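The reduction described above can be sketched on a plain-Perl tree. This is a hedged illustration of the copy-based approach only: it assumes hash nodes of the form { name, children } rather than Bio::Tree::Node objects, does not touch the Bio::Tree node-removal methods, and uses a shortened, hand-built tree. The helper names node, contract and leaves_newick are made up for the example.

```perl
use strict;
use warnings;

# Node layout (an assumption of this sketch, not Bio::Tree::Node):
# { name => 'label', children => [ node, ... ] }
sub node {
    my ($name, @kids) = @_;
    return { name => $name, children => [@kids] };
}

# Return a contracted copy: any node with exactly one child is replaced
# by that (already contracted) child, so a chain A -> B -> C collapses to C.
sub contract {
    my ($node) = @_;
    my @kids = map { contract($_) } @{ $node->{children} };
    return $kids[0] if @kids == 1;
    return { name => $node->{name}, children => \@kids };
}

# Newick with leaf labels only; internal node labels are dropped.
sub leaves_newick {
    my ($node) = @_;
    my @kids = map { leaves_newick($_) } @{ $node->{children} };
    return '(' . join(',', @kids) . ')' if @kids;
    return qq{"$node->{name}"};
}

# Shortened, hand-built version of the primate tree.
my $tree =
    node('root',
        node('Hominidae',
            node('Pongo', node('Pongo pygmaeus')),
            node('Homo/Pan/Gorilla group',
                node('Gorilla', node('Gorilla gorilla')),
                node('Pan',     node('Pan troglodytes')),
                node('Homo',    node('Homo sapiens')))));

# prints: ("Pongo pygmaeus",("Gorilla gorilla","Pan troglodytes","Homo sapiens"));
print leaves_newick(contract($tree)), ";\n";
```

Because contract builds a new tree bottom-up, the input is left unchanged; an in-place version would instead have to re-parent each surviving child.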
Thanks a lot,

Gabriel

#!/usr/bin/perl -w

# Author:  Gabriel Valiente
# Purpose: Bio::DB::Taxonomy -> Bio::Tree::Tree

use strict;
use Bio::DB::Taxonomy;
use Bio::Tree::Tree;
use Bio::Tree::Node;
use Bio::TreeIO;

my $nodesfile = "nodes.dmp";
my $namesfile = "names.dmp";

my $db = Bio::DB::Taxonomy->new(-source    => 'flatfile',
                                -directory => "./db/",
                                -nodesfile => $nodesfile,
                                -namesfile => $namesfile);

# the input to the script is an array of species names
my @species = ('Orangutan', 'Gorilla', 'Chimpanzee', 'Human');

my $root = Bio::Tree::Node->new(-id => "root");
my $tree = Bio::Tree::Tree->new(-root => $root);

# the full lineages of the species are merged into a tree
for my $name (@species) {
    my $ncbi_id = $db->get_taxonid($name);
    if ($ncbi_id) {
        my $node = $db->get_Taxonomy_Node(-taxonid => $ncbi_id);
        my @lineage = get_lineage_nodes($node);
        shift @lineage;    # discard root
        push @lineage, $node;
        merge_path($root, \@lineage);
    }
    else {
        warn "no NCBI Taxonomy node for species ", $name, "\n";
    }
}

# the tree is output in Newick format
my $output = Bio::TreeIO->new(-format => 'newick');
$output->write_tree($tree);

# the actual merging of full lineages is performed by a recursive method
sub merge_path {
    my $root = shift;
    my $path = shift;
    my @path = @{$path};
    if (@path) {
        my $top = shift @path;
        my @children = grep { $_->id eq $top->node_name } $root->each_Descendent;
        if (@children) {
            # $root has a $child with id eq $top name
            my $child = shift @children;
            merge_path($child, \@path);
        }
        else {
            # add $top and @path below $root
            my $node = $root;
            unshift @path, $top;
            while (@path) {
                my $top = shift @path;
                my $name = $top->node_name;
                my $child = Bio::Tree::Node->new(-id => "$name");
                $node->add_Descendent($child);
                $node = $child;
            }
        }
    }
}

# the full lineage of a species is recovered by traversing the taxonomy
sub get_lineage_nodes {
    my $node = shift;
    my @lineage;
    while ($node->node_name ne "root") {
        $node = $node->get_Parent_Node;
        unshift @lineage, $node;
    }
    return @lineage;
}

=head1 NAME

taxonomy2tree - builds a taxonomic tree based on
the full lineages of a set of species names =head1 DESCRIPTION This script requires that the bioperl-run pkg be also installed. Providing the nodes.dmp and names.dmp files from the NCBI Taxonomy dump (see Bio::DB::Taxonomy::flatfile for more info) is only necessary on the first time running. This will create the local indexes and may take quite a long time. However once created, these indexes will allow fast access for species to taxon id OR taxon id to species name lookups. =cut From MEC at stowers-institute.org Fri Jul 28 12:44:43 2006 From: MEC at stowers-institute.org (Cook, Malcolm) Date: Fri, 28 Jul 2006 11:44:43 -0500 Subject: [Bioperl-l] Getting sequences by base pair locations Message-ID: There are many options. But, it looks like you only have start end coordinates! Where do you know which chromosome/contig the hit was on? Assuming you have this, if you did the blat with a local copy of the blat program and a the genome, then in addition to the blat command, you have the twoBitToFa command which can extract the hits from the blat index (see http://genome.ucsc.edu/goldenPath/help/blatSpec.html ) Or did you do the blat at ucsc? Malcolm Cook Database Applications Manager, Bioinformatics Stowers Institute for Medical Research oh - I replied similarly in the Bio BB forum, but it is held for moderation so am replying here as well ________________________________ From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Yuval Itan Sent: Friday, July 28, 2006 7:08 AM To: bioperl-l at lists.open-bio.org Subject: [Bioperl-l] Getting sequences by base pair locations Hello all, I was BLATing a few hundred human genes against the chimp genome, and kept the best chimp hits for every human gene. I have the base pair start and end location for every chimp hit, and I need to get the sequence for each of these chimp hits. 
Here is an example for a few chimp hits bp locations: Start End 142854 144504 154479 155198 153066 167370 163146 163559 I have one chimp genome file (about 3GB) including all chromosomes, but I could also get one file per chromosome if that would make things easier. Does anyone have a script or a link for an interface that can do the job? Thank you very much. From osborne1 at optonline.net Fri Jul 28 13:25:12 2006 From: osborne1 at optonline.net (Brian Osborne) Date: Fri, 28 Jul 2006 13:25:12 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields) In-Reply-To: <3DB992C6-DF16-42B9-8C36-F3B5C8CCBDE7@lsi.upc.edu> Message-ID: Gabriel, It looks like most of the Bioperl scripts use Getopt::Long. It's documentation says, in part: Options can take multiple values at once, for example --coordinates 52.2 16.4 --rgbcolor 255 255 149 This can be accomplished by adding a repeat specifier to the option specification. Repeat specifiers are very similar to the {...} repeat specifiers that can be used with regular expression patterns. For example, the above command line would be handled as follows: GetOptions('coordinates=f{2}' => \@coor, 'rgbcolor=i{3}' => \@color); So the arguments are space-delimited on the command line. Is the problem that the names can be binomial? Brian O. On 7/28/06 12:35 PM, "Gabriel Valiente" wrote: > The minor one is provision for cmdline input. > How would you like to input an array of names? From golharam at umdnj.edu Fri Jul 28 14:03:39 2006 From: golharam at umdnj.edu (Ryan Golhar) Date: Fri, 28 Jul 2006 14:03:39 -0400 Subject: [Bioperl-l] Bio::Align::DNAStatistics module has errors? Message-ID: <01a701c6b270$28232130$2f01a8c0@GOLHARMOBILE1> This is from the description: This object contains routines for calculating various statistics and distances for DNA alignments. The routines are not well tested and do contain errors at this point. Work is underway to correct them, but do not expect this code to give you the right answer currently! 
Use dnadist/distmat in the PHYLIP or EMBOSS packages to calculate the distances.

Any idea what the errors are and what is/is not usable?

From lzhtom at hotmail.com Fri Jul 28 22:00:23 2006
From: lzhtom at hotmail.com (zhihua li)
Date: Sat, 29 Jul 2006 02:00:23 +0000
Subject: [Bioperl-l] how to get annotations (especially ensembl IDs) for a list of genes?
Message-ID:

Hi all,

I have a list of about 300 genes (actually their refseq IDs). Now I want to get more information (annotations) for each of the genes. Specifically, I want a mapping of the refseq IDs to Ensembl gene IDs.

I know how to do it through a web page. But I'm wondering if I can also do it via bioperl, by using some modules or packages. Can anyone help me out here?

Thanks a lot!

From jason.stajich at duke.edu Sat Jul 29 01:18:50 2006
From: jason.stajich at duke.edu (Jason Stajich)
Date: Fri, 28 Jul 2006 22:18:50 -0700
Subject: [Bioperl-l] Bio::Align::DNAStatistics module has errors?
Message-ID:

I think that msg was CYA by me at some point - I am pretty sure I made tests based on numbers from PHYLIP and EMBOSS but was hoping for someone else to help. At this point I have no reliable time to really work on it, but I hope someone who is interested in it will give it a whirl.

There may be some boundary cases that don't work where seqs are too short or have a zero count of a particular nt, but in general the numbers should agree. I am not sure about all the NG Ks and Ka as I didn't write those, but I believe Richard vetted them pretty well first.

There are a couple of methods not implemented too - am always hopeful other people will see this as a great starting point and roll up their sleeves to join in...

-jason
--
Jason Stajich
Duke University
http://www.duke.edu/~jes12

From bix at sendu.me.uk Sat Jul 29 03:25:38 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Sat, 29 Jul 2006 08:25:38 +0100
Subject: [Bioperl-l] how to get annotations (especially ensembl IDs) for a list of genes?
In-Reply-To:
References:
Message-ID: <44CB0D72.20104@sendu.me.uk>

zhihua li wrote:
> Hi all,
>
> I have a list of about 300 genes (actually their refseq IDs). Now I
> want to get more information (annotations) for each of the genes.
> Specifically, I want a mapping of the refseq IDs to Ensembl gene IDs.
>
> I know how to do it through a web page. But I'm wondering if I can also
> do it via bioperl

One possible way is to use the Ensembl perl API:

http://www.ensembl.org/info/software/core/core_tutorial.html

You'd get a gene or transcript adaptor and use fetch_all_by_external_name() iirc. I'm aware that not every entrez id can be mapped that way, but perhaps most if not all refseqs will work.

From bix at sendu.me.uk Sat Jul 29 03:54:52 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Sat, 29 Jul 2006 08:54:52 +0100
Subject: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields)
In-Reply-To: <001301c6b24b$da38ba80$15327e82@pyrimidine>
References: <001301c6b24b$da38ba80$15327e82@pyrimidine>
Message-ID: <44CB144C.6050507@sendu.me.uk>

Chris Fields wrote:
>
> As for branch lengths, I think you're confusing 'taxonomy' (classification
> of organisms based on just about anything) with 'phylogeny' (evolutionary
> relatedness). Note in the Wikipedia article below the use of the term
> 'phylogenetic taxonomy', which is the classification of organisms based on
> evolutionary relationships.
>
> http://en.wikipedia.org/wiki/Taxonomy
>
> http://en.wikipedia.org/wiki/Phylogeny

Indeed. The two can be considered closely intertwined - if you were making a phylogeny you might hang it on a taxonomy. At any rate, you need to know a bunch of evolutionarily related species names before you start work, and Bio::Taxonomy::Node has been as good a place as any to get that.
> There are HOWTOs on tree manipulation, population genetics, and PAML on the
> wiki which might be a good start for Bioperl phylogenetic methods:
>
> http://www.bioperl.org/wiki/HOWTO:Trees

Which is why the Trees HOWTO talks about taxa, and a number of the Taxonomy modules have phylogenetic methods like get_lca. (And why there is Bio::Taxonomy::Taxon and Tree.)

I suppose this is another reason to make Bio::Taxonomy::Node (ne Bio::Taxon) implement Bio::Tree::NodeI.

(For these reasons I don't think Gabriel's method is best suited to a script - it's something you might do all the time, as a matter of course. If Bio::Taxon was a Bio::Tree::NodeI you would just do

my $tree = new Bio::Tree::Tree(-root => $bio_taxon);

and blamo, instant phylogenetic taxonomy.)

From cjfields at uiuc.edu Sat Jul 29 07:49:29 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Sat, 29 Jul 2006 06:49:29 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields)
In-Reply-To: <44CB144C.6050507@sendu.me.uk>
References: <001301c6b24b$da38ba80$15327e82@pyrimidine> <44CB144C.6050507@sendu.me.uk>
Message-ID: <5C765CAF-4976-45D1-BDA4-1E432081FF4D@uiuc.edu>

On Jul 29, 2006, at 2:54 AM, Sendu Bala wrote:

> Chris Fields wrote:
>>
>> As for branch lengths, I think you're confusing 'taxonomy' (classification
>> of organisms based on just about anything) with 'phylogeny' (evolutionary
>> relatedness). Note in the Wikipedia article below the use of the term
>> 'phylogenetic taxonomy', which is the classification of organisms based on
>> evolutionary relationships.
>>
>> http://en.wikipedia.org/wiki/Taxonomy
>>
>> http://en.wikipedia.org/wiki/Phylogeny
>
> Indeed. The two can be considered closely intertwined - if you were
> making a phylogeny you might hang it on a taxonomy. At any rate, you
> need to know a bunch of evolutionarily related species names before you
> start work, and Bio::Taxonomy::Node has been as good a place as any to
> get that.
Intertwined, yes, but not exactly the same. Hence the NCBI disclaimer I mentioned: How to reference the NCBI taxonomy database The NCBI taxonomy database is not a primary source for taxonomic or phylogenetic information. Furthermore, the database does not follow a single taxonomic treatise but rather attempts to incorporate phylogenetic and taxonomic knowledge from a variety of sources, including the published literature, web-based databases, and the advice of sequence submitters and outside taxonomy experts. Consequently, the NCBI taxonomy database is not a phylogenetic or taxonomic authority and should not be cited as such. >> There are HOWTOs on tree manipulation, population genetics, and >> PAML on the >> wiki which might be a good start for Bioperl phylogenetic methods: >> >> http://www.bioperl.org/wiki/HOWTO:Trees > > Which is why the Trees HOWTO talks about taxa, and a number of the > Taxonomy modules have phylogenetic methods like get_lca. (And why > there > is Bio::Taxonomy::Taxon and Tree.) Are we still thinking about deprecating those? I have seen very little mention of those modules from the mail list archives, and Jason mentioned that Bio::Taxonomy::Taxon hasn't been modified in a long time. > I suppose this is another reason to make Bio::Taxonomy::Node (ne > Bio::Taxon) implement Bio::Tree::NodeI. > > (for these reasons I don't think Gabriel's method isn't best > appropriate > as a script - it's something you might do all the time, as a matter of > course. If Bio::Taxon wasa Bio::Tree::NodeI you would just do my > $tree = > new Bio::Tree::Tree(-root => $bio_taxon); and blamo, instant > phylogenetic taxonomy) Brian already deposited the script (see bioperl-guts). You could use it for the methods, of course noting Gabriel's contribution. Sounds like a good plan to me ; > Christopher Fields Postdoctoral Researcher Lab of Dr. 
Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign

From nabil at broad.mit.edu Sun Jul 30 00:28:00 2006
From: nabil at broad.mit.edu (Nabil Hafez)
Date: Sun, 30 Jul 2006 00:28:00 -0400
Subject: [Bioperl-l] Strangeness in parsing blast file
Message-ID: <44CC3550.5070105@broad.mit.edu>

Hi,

I am having a somewhat similar problem to what was posted in
http://bioperl.org/pipermail/bioperl-l/2006-May/021416.html
however, I have read through all of that thread and I don't believe what
I am experiencing is the exact same problem. I also realize that
Bioperl version 1.5.1 fixes a problem with blast parsing.

My problem:
My blast results file parses fine and everything works swimmingly if I
pass the blast output file by name, such as
$blast_result = 'test.blastout';
however when I do
$blast_result = &do_blast($sample_fasta);
even though in both cases $blast_result evaluates to "test.blastout", the
parsing doesn't work. More specifically, it gets an undefined value for
$result in
while( my $result = $report_obj->next_result ) {

Sorry for the long email - any help would be appreciated,

Thanks - Nabil

The code, with non-relevant parts trimmed for size constraints; debugging
output from the working and non-working versions follows after the code.

use strict;
use Bio::SearchIO;
use Getopt::Std;
use List::Util qw(shuffle);
use Benchmark;

my ($inputfile, $samplefile, $blastfile, $blast_result, $blast_report, $blast_verbose); #files generated

#------------------#
# Subroutine Calls #
#------------------#
my $test = &create_sample_file($inputfile); #inputfile being a fasta file containing nucleotide sequence
$blast_result = &do_blast($test);
##$blast_result = 'test.blastout'; #when this is uncommented and replaces the previous two lines (test.blastout being normal blast output), the script works fine.
&parse_blast($blast_result);

#######################
# create_sample_file
#
# Input: Original Fasta File
#
# Output: Fasta file containing randomly sampled reads
#
#
sub create_sample_file {
    my $in = shift;
    my $linecount = 0;
    my @lines;
    $samplefile = $in . "_sample";

    #Determine total # of reads in input fasta
    $totalreads = `$grep -c '>' $inputfile`;
    $totalreads =~ s/\s+//;
    chomp $totalreads;

    if ($totalreads > 1000) { #sample if more than 1000 reads
        $sample_reads = sprintf("%.0f", $totalreads * ($per_to_sample/100)); #number of reads to sample
    }
    else { #otherwise use all reads
        $sample_reads = $totalreads;
    }

    $/ = '>'; #define fasta record input separator
    open (IN, "<$in") or die "Cannot open $in $!\n";
    open (OUT, ">$samplefile") or die "Cannot open $samplefile $!\n";
    while (<IN>) { #read records into an array
        chomp;
        push (@lines, $_);
    }
    @lines = shuffle(@lines); #shuffle array
    foreach (@lines) {
        print OUT ">$_" if $linecount <= $sample_reads; #output to file sampled number of reads
        $linecount++;
    }
    close IN;
    close OUT;
    return $samplefile;
}#end create_sample_file

#######################
# do_blast
#
# Input: Fasta File containing SCREENED sampled reads
#
# Output: Blast File
#
#
sub do_blast {
    my $bf = shift;
    my $blastoutput = $bf . ".blastout";
    print "Blasting against $db ...\n";
    `blast/blast-2.2.10/bin/megablast -d /prodinfo/proddata_ntblastdb/nt -e 1e-50 -p 99 -D 2 -i test -o test.blastout`;
    return $blastoutput;
}#end do_blast

#######################
# parse_blast
#
# Input: Blast file
#
# Output: Creates hash containing best hit for each read
#
#
sub parse_blast {
    my $blastoutfile = shift;
    if (!
-e $blastoutfile) { die "$blastoutfile does not exist $!\n"; } print "Parsing blast hits ...\n"; my $report_obj = new Bio::SearchIO(-verbose => 1, -format => 'blast', -file => $blastoutfile); die "no valid $report_obj" unless defined $report_obj; while( my $result = $report_obj->next_result ) { die "no valid $result" unless defined $result; while( my $hit = $result->next_hit ) { while( my $hsp = $hit->next_hsp ) { my $name = $result->query_name; my $hitDesc = $hit->description; my $length = $hsp->length('total'); my $per_id = sprintf("%.2f", $hsp->percent_identity); my $eval = $hsp->evalue; next if (defined $blast_results{$name} && $blast_results{$name}->[0] > $length); #only keep best hit for any read $blast_results{$name} = [$length, $per_id, $eval, $hitDesc]; #store in hash of arrays } } } } #end parse_blast Debug of NON-working blast-parse: main::(454/scripts/fasta_blasta_mb.pl:151): 151: my $sample_fasta = &create_sample_file($inputfile); DB<2> n main::(454/scripts/fasta_blasta_mb.pl:152): 152: $blast_result = &do_blast($sample_fasta); DB<2> x $sample_fasta 0 'G782.2005-08-16-16-48.fasta_sample' DB<3> n Blasting against NT ... main::(454/scripts/fasta_blasta_mb.pl:154): 154: &parse_blast($blast_result); DB<3> s main::parse_blast(454/scripts/fasta_blasta_mb.pl:293): 293: my $blastoutfile = shift; DB<3> s main::parse_blast(454/scripts/fasta_blasta_mb.pl:295): 295: if (! -e $blastoutfile) { DB<3> x $blastoutfile 0 'G782.2005-08-16-16-48.fasta_sample.blastout' DB<4> s main::parse_blast(454/scripts/fasta_blasta_mb.pl:299): 299: print "Parsing blast hits ...\n"; DB<4> s Parsing blast hits ... 
main::parse_blast(454/scripts/fasta_blasta_mb.pl:302): 302: my $report_obj = new Bio::SearchIO(-verbose => 1, 303: -format => 'blast', 304: -file => $blastoutfile);#or die "Could not open blast report $!"; DB<4> s Bio::SearchIO::new(/util/lib/perl5/site_perl/5.8.0/Bio/SearchIO.pm:129): 129: my($caller, at args) = @_; DB<4> r scalar context return from Bio::SearchIO::new: '_file' => 'G782.2005-08-16-16-48.fasta_sample.blastout' '_filehandle' => GLOB(0x8cef40c) -> *Symbol::GEN1 FileHandle({*Symbol::GEN1}) => fileno(3) '_flush_on_write' => 1 '_handler' => Bio::SearchIO::IteratedSearchResultEventBuilder=HASH(0x9505724) '_factories' => HASH(0x95054c0) 'hit' => Bio::Factory::ObjectFactory=HASH(0x95017b8) '_loaded_types' => HASH(0x9506c0c) 'Bio::Search::Hit::BlastHit' => 1 '_root_verbose' => 0 'interface' => 'Bio::Search::Hit::HitI' 'type' => 'Bio::Search::Hit::BlastHit' 'hsp' => Bio::Factory::ObjectFactory=HASH(0x9500e10) '_loaded_types' => HASH(0x9506c18) 'Bio::Search::HSP::GenericHSP' => 1 '_root_verbose' => 0 'interface' => 'Bio::Search::HSP::HSPI' 'type' => 'Bio::Search::HSP::GenericHSP' 'iteration' => Bio::Factory::ObjectFactory=HASH(0x9506c60) '_loaded_types' => HASH(0x9506af8) 'Bio::Search::Iteration::GenericIteration' => 1 '_root_verbose' => 0 'interface' => 'Bio::Search::Iteration::IterationI' 'type' => 'Bio::Search::Iteration::GenericIteration' 'result' => Bio::Factory::ObjectFactory=HASH(0x9504c80) '_loaded_types' => HASH(0x9501f74) 'Bio::Search::Result::BlastResult' => 1 '_root_verbose' => 0 'interface' => 'Bio::Search::Result::ResultI' 'type' => 'Bio::Search::Result::BlastResult' '_inclusion_threshold' => 0.001 '_root_verbose' => 1 '_handler_cache' => Bio::SearchIO::IteratedSearchResultEventBuilder=HASH(0x9505724) -> REUSED_ADDRESS '_notfirsttime' => 0 '_reporttype' => '' '_root_cleanup_methods' => ARRAY(0x8cde434) 0 CODE(0x82a9aec) -> &Bio::Root::IO::_io_cleanup in /util/lib/perl5/site_perl/5.8.0/Bio/Root/IO.pm:544-570 1 CODE(0x82a9aec) -> REUSED_ADDRESS 
'_root_verbose' => 1 main::parse_blast(454/scripts/fasta_blasta_mb.pl:307): 307: die "no valid $report_obj" unless defined $report_obj; DB<4> s main::parse_blast(454/scripts/fasta_blasta_mb.pl:310): 310: while( my $result = $report_obj->next_result ) { DB<4> s Bio::SearchIO::blast::next_result(/util/lib/perl5/site_perl/5.8.0/Bio/SearchIO/blast.pm:389): 389: my ($self) = @_; DB<4> r scalar context return from Bio::SearchIO::blast::next_result: undef Bio::SearchIO::DESTROY(/util/lib/perl5/site_perl/5.8.0/Bio/SearchIO.pm:438): 438: my $self = shift; DB<4> r scalar context return from Bio::SearchIO::DESTROY: '' Bio::Root::Root::DESTROY(/util/lib/perl5/site_perl/5.8.0/Bio/Root/Root.pm:404): 404: my $self = shift; DB<4> r scalar context return from Bio::Root::Root::DESTROY: undef Bio::Root::Root::DESTROY(/util/lib/perl5/site_perl/5.8.0/Bio/Root/Root.pm:404): 404: my $self = shift; DB<4> r scalar context return from Bio::Root::Root::DESTROY: undef Bio::Root::Root::DESTROY(/util/lib/perl5/site_perl/5.8.0/Bio/Root/Root.pm:404): 404: my $self = shift; DB<4> r scalar context return from Bio::Root::Root::DESTROY: undef Bio::Root::Root::DESTROY(/util/lib/perl5/site_perl/5.8.0/Bio/Root/Root.pm:404): 404: my $self = shift; DB<4> r scalar context return from Bio::Root::Root::DESTROY: undef Bio::Root::Root::DESTROY(/util/lib/perl5/site_perl/5.8.0/Bio/Root/Root.pm:404): 404: my $self = shift; DB<4> r scalar context return from Bio::Root::Root::DESTROY: undef main::(454/scripts/fasta_blasta_mb.pl:155): 155: &output_results(); DB<4> x $result 0 undef Debug of WORKING blast-parse: Parsing blast hits ... 
main::parse_blast(454/scripts/fasta_blasta_mb.pl:302): 302: my $report_obj = new Bio::SearchIO(-verbose => 1, 303: -format => 'blast', 304: -file => $blastoutfile);#or die "Could not open blast report $!"; DB<3> s Bio::SearchIO::new(/util/lib/perl5/site_perl/5.8.0/Bio/SearchIO.pm:129): 129: my($caller, at args) = @_; DB<3> r scalar context return from Bio::SearchIO::new: '_file' => 'G782.2005-08-16-16-48.fasta_sample.blastout' '_filehandle' => GLOB(0x8763100) -> *Symbol::GEN1 FileHandle({*Symbol::GEN1}) => fileno(3) '_flush_on_write' => 1 '_handler' => Bio::SearchIO::IteratedSearchResultEventBuilder=HASH(0x8ab3be4) '_factories' => HASH(0x8ab1594) 'hit' => Bio::Factory::ObjectFactory=HASH(0x8a7b7c0) '_loaded_types' => HASH(0x8abee10) 'Bio::Search::Hit::BlastHit' => 1 '_root_verbose' => 0 'interface' => 'Bio::Search::Hit::HitI' 'type' => 'Bio::Search::Hit::BlastHit' 'hsp' => Bio::Factory::ObjectFactory=HASH(0x8a87200) '_loaded_types' => HASH(0x8abee1c) 'Bio::Search::HSP::GenericHSP' => 1 '_root_verbose' => 0 'interface' => 'Bio::Search::HSP::HSPI' 'type' => 'Bio::Search::HSP::GenericHSP' 'iteration' => Bio::Factory::ObjectFactory=HASH(0x8abee64) '_loaded_types' => HASH(0x8abecfc) 'Bio::Search::Iteration::GenericIteration' => 1 '_root_verbose' => 0 'interface' => 'Bio::Search::Iteration::IterationI' 'type' => 'Bio::Search::Iteration::GenericIteration' 'result' => Bio::Factory::ObjectFactory=HASH(0x8a81a84) '_loaded_types' => HASH(0x8a96ce8) 'Bio::Search::Result::BlastResult' => 1 '_root_verbose' => 0 'interface' => 'Bio::Search::Result::ResultI' 'type' => 'Bio::Search::Result::BlastResult' '_inclusion_threshold' => 0.001 '_root_verbose' => 1 '_handler_cache' => Bio::SearchIO::IteratedSearchResultEventBuilder=HASH(0x8ab3be4) -> REUSED_ADDRESS '_notfirsttime' => 0 '_reporttype' => '' '_root_cleanup_methods' => ARRAY(0x8762efc) 0 CODE(0x82a9aec) -> &Bio::Root::IO::_io_cleanup in /util/lib/perl5/site_perl/5.8.0/Bio/Root/IO.pm:544-570 1 CODE(0x82a9aec) -> REUSED_ADDRESS 
'_root_verbose' => 1 main::parse_blast(454/scripts/fasta_blasta_mb.pl:307): 307: die "no valid $report_obj" unless defined $report_obj; DB<3> s main::parse_blast(454/scripts/fasta_blasta_mb.pl:310): 310: while( my $result = $report_obj->next_result ) { DB<3> s Bio::SearchIO::blast::next_result(/util/lib/perl5/site_perl/5.8.0/Bio/SearchIO/blast.pm:389): 389: my ($self) = @_; DB<3> r blast.pm: unrecognized line Reference: Zheng Zhang, Scott Schwartz, Lukas Wagner, and Webb Miller (2000), blast.pm: unrecognized line "A greedy algorithm for aligning DNA sequences", blast.pm: unrecognized line J Comput Biol 2000; 7(1-2):203-14. blast.pm: unrecognized line Score E Got NCBI HSP score=354, evalue 0.0 scalar context return from Bio::SearchIO::blast::next_result: '_algorithm' => 'MEGABLAST' '_algorithm_version' => '2.2.10 [Oct-19-2004]' '_dbentries' => 4249067 '_dbletters' => 17735149364 '_dbname' => 'All GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS,GSS,environmental samples or phase 0, 1 or 2 HTGS sequences) ' '_hitindex' => 0 '_hits' => ARRAY(0x8b2acd0) empty array '_inclusion_threshold' => 0.001 '_iteration_count' => 1 '_iteration_index' => 0 '_iterations' => ARRAY(0x8b2ac4c) 0 Bio::Search::Iteration::GenericIteration=HASH(0x8b1cacc) '_newhits_below_threshold' => ARRAY(0x8b1ca84) 0 Bio::Search::Hit::BlastHit=HASH(0x8b1cf64) '_accession' => 'AE004091' '_algorithm' => 'MEGABLAST' '_description' => 'Pseudomonas aeruginosa PAO1, complete genome' '_hsps' => ARRAY(0x8b1ceb0) 0 Bio::Search::HSP::GenericHSP=HASH(0x8b2098c) '_algorithm' => 'MEGABLAST' '_frac_conserved' => HASH(0x8b266a0) 'hit' => 0.991803278688525 'query' => 0.991803278688525 'total' => 0.991803278688525 '_frac_identical' => HASH(0x8b2658c) 'hit' => 0.991803278688525 'query' => 0.991803278688525 'total' => 0.991803278688525 '_gaps' => HASH(0x8b24d94) 'hit' => 0 'query' => 0 'total' => 0 '_gsf_tag_hash' => HASH(0x8b20998) empty hash '_hit_string' => 
'cctgacctccgctcaactgcgcaaatacgccagcgccggtcggccgttccccgaagggcgcctgctggccgcctcctgccacgacgcggaggaactggccctggctgcctcgatgggagtggagttcgtcaccctttcgccggtacagccgaccgagagccatcccggcgagccggcgctgggttgggacaaggccgccgaactgatcgccggcttcaaccagccggtctacctgctgggtggcctcggtccgcagcaagccgagcaggcttgggagcatggagcccagggcgtggcgggtatccgtgcgttctggccgggcggcctttgacggtggaatgaagaaaaaaggaggcttcggcctcc' '_homology_string' => '|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ||||||||||||||||||||||||||||||||||||||||| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||' etc...... From torsten.seemann at infotech.monash.edu.au Sun Jul 30 01:41:30 2006 From: torsten.seemann at infotech.monash.edu.au (Torsten Seemann) Date: Sun, 30 Jul 2006 15:41:30 +1000 Subject: [Bioperl-l] Strangeness in parsing blast file In-Reply-To: <44CC3550.5070105@broad.mit.edu> References: <44CC3550.5070105@broad.mit.edu> Message-ID: <44CC468A.40700@infotech.monash.edu.au> > sub do_blast { > my $bf = shift; > my $blastoutput = $bf . ".blastout"; > print "Blasting against $db ...\n"; > `blast/blast-2.2.10/bin/megablast -d /prodinfo/proddata_ntblastdb/nt > -e 1e-50 -p 99 -D 2 -i test -o test.blastout`; > return $blastoutput; > }#end do_blast Should "-o test.blastoutput" be "-o $blastoutput" ? Otherwise you are returning the name of a non-existent file, which naturally Bio::SearchIO will not be able to find a blast result in. Alternatively use Bio::Tools::Run::StandaloneBlast to invoke megablast rather than back-ticks - that way you avoid any intermediate file and get a Bio::SearchIO object back directly. 
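
Torsten's first point (return the same file name that the command actually wrote) can be sketched with core Perl only. This is a minimal illustration, not the poster's code: run_and_check is a hypothetical helper, and a one-line perl command stands in for megablast so the sketch runs anywhere. The idea is to build the output name once, pass that same variable to the command, and refuse to return it unless the command succeeded and the file is non-empty.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper (not part of BioPerl): run a command that is expected
# to create $outfile, and return $outfile only if the command exited cleanly
# and the file exists with non-zero size.
sub run_and_check {
    my ($outfile, @cmd) = @_;
    system(@cmd) == 0
        or die "Command failed (status $?): @cmd\n";
    -s $outfile
        or die "Command succeeded but '$outfile' is missing or empty\n";
    return $outfile;
}

my $dir         = tempdir(CLEANUP => 1);
my $input       = "$dir/reads.fasta_sample";   # illustrative name
my $blastoutput = "$input.blastout";           # built once, used everywhere

# Stand-in for the real call, which in the thread looks like
#   megablast -d $blastdb -e $e_val -p $percent_id -D 2 -i $bf -o $blastoutput
my @cmd = (
    $^X, '-e',
    'open my $fh, ">", $ARGV[0] or die $!; print {$fh} "stand-in report\n"; close $fh;',
    $blastoutput,
);

my $result_file = run_and_check($blastoutput, @cmd);
print "Report written to $result_file\n";
```

The list form of system() also avoids the void-context backticks and any shell quoting of the file names.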
--Torsten From nabil at broad.mit.edu Sun Jul 30 10:11:03 2006 From: nabil at broad.mit.edu (Nabil Hafez) Date: Sun, 30 Jul 2006 10:11:03 -0400 Subject: [Bioperl-l] Strangeness in parsing blast file In-Reply-To: <44CC468A.40700@infotech.monash.edu.au> References: <44CC3550.5070105@broad.mit.edu> <44CC468A.40700@infotech.monash.edu.au> Message-ID: <44CCBDF7.2010601@broad.mit.edu> I had modified the variables a bit to try and make them more readable than what is in my code, in my code -o $blastoutput is what it is, like I said, the blast portion works absolutely fine - i.e. the do_blast sub routine is fully functional. here's a cut and paste from my actual code my $MBLAST = "/prodinfo/prod3pty/blast/blast-2.2.10/bin/megablast"; my $blastdb = "/prodinfo/proddata_ntblastdb/nt"; my $e_val = "1e-50"; #Default e-value Getopt_long my $percent_id = "99"; #Default percentage identity my $per_to_sample ="10"; #Default for percentage of reads to sample sub do_blast { my $bf = shift; my $blastoutput = $bf . ".blastout"; print "Blasting against $db ...\n"; `$MBLAST -d $blastdb -e $e_val -p $percent_id -D 2 -i $bf -o $blastoutput`; return $blastoutput; } I will try your suggestion to use the Bio::Tools::Run::StandaloneBlast, is megablast supported by this module? Thanks Nabil Torsten Seemann wrote: > >> sub do_blast { >> my $bf = shift; >> my $blastoutput = $bf . ".blastout"; >> print "Blasting against $db ...\n"; >> `blast/blast-2.2.10/bin/megablast -d >> /prodinfo/proddata_ntblastdb/nt -e 1e-50 -p 99 -D 2 -i test -o >> test.blastout`; > > > return $blastoutput; > > }#end do_blast > > Should "-o test.blastoutput" be "-o $blastoutput" ? > > Otherwise you are returning the name of a non-existent file, which > naturally Bio::SearchIO will not be able to find a blast result in. > > Alternatively use Bio::Tools::Run::StandaloneBlast to invoke megablast > rather than back-ticks - that way you avoid any intermediate file and > get a Bio::SearchIO object back directly. 
> > --Torsten > From bix at sendu.me.uk Sun Jul 30 12:20:54 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Sun, 30 Jul 2006 17:20:54 +0100 Subject: [Bioperl-l] Strangeness in parsing blast file In-Reply-To: <44CCBDF7.2010601@broad.mit.edu> References: <44CC3550.5070105@broad.mit.edu> <44CC468A.40700@infotech.monash.edu.au> <44CCBDF7.2010601@broad.mit.edu> Message-ID: <44CCDC66.2030604@sendu.me.uk> Nabil Hafez wrote: > I had modified the variables a bit to try and make them more readable > than what is in my code, in my code -o $blastoutput is > what it is, like I said, the blast portion works absolutely fine - i.e. > the do_blast sub routine is fully functional. How do you know? > `$MBLAST -d $blastdb -e $e_val -p $percent_id -D 2 -i $bf -o > $blastoutput`; Does this command definitely produce exactly the same file as the one you use to show that parse_blast() does sometimes work (when you avoid using do_blast())? Btw, http://perldoc.perl.org/perlfaq8.html#What's-wrong-with-using-backticks-in-a-void-context%3f > I will try your suggestion to use the Bio::Tools::Run::StandaloneBlast, > is megablast supported by this module? No, it doesn't. You could cheat and call _runblast() directly (give it an executable string and a string of args to megablast), and provide -outfile to new(). From nabil at broad.mit.edu Sun Jul 30 20:13:16 2006 From: nabil at broad.mit.edu (Nabil Hafez) Date: Sun, 30 Jul 2006 20:13:16 -0400 Subject: [Bioperl-l] Strangeness in parsing blast file In-Reply-To: <44CCDC66.2030604@sendu.me.uk> References: <44CC3550.5070105@broad.mit.edu> <44CC468A.40700@infotech.monash.edu.au> <44CCBDF7.2010601@broad.mit.edu> <44CCDC66.2030604@sendu.me.uk> Message-ID: <44CD4B1C.5070907@broad.mit.edu> Sendu Bala wrote: >Nabil Hafez wrote: > > >>I had modified the variables a bit to try and make them more readable >>than what is in my code, in my code -o $blastoutput is >>what it is, like I said, the blast portion works absolutely fine - i.e. 
>>the do_blast sub routine is fully functional.
>
>How do you know?

Because it creates a file containing all of the blast output; this works
every time - a file is created with the blast output.

>> `$MBLAST -d $blastdb -e $e_val -p $percent_id -D 2 -i $bf -o
>>$blastoutput`;
>
>Does this command definitely produce exactly the same file as the one
>you use to show that parse_blast() does sometimes work (when you avoid
>using do_blast())?

Yes - the exact same file, because I produce the file with do_blast() and
then, when it fails to parse, the script ends but there is a blast output
file created in my directory. If I re-run the script, just feeding in the
name of the file that was created, it parses it just fine. So basically
the parsing works whenever I feed it a blast output file by name, but it
can't seem to parse the same file when it has just been created and then
passed to the parse_blast() subroutine.

>Btw,
>http://perldoc.perl.org/perlfaq8.html#What's-wrong-with-using-backticks-in-a-void-context%3f

Good to know. Thanks.

>>I will try your suggestion to use the Bio::Tools::Run::StandaloneBlast,
>>is megablast supported by this module?
>
>No, it doesn't. You could cheat and call _runblast() directly (give it
>an executable string and a string of args to megablast), and provide
>-outfile to new().

I still don't think the blast is a problem since I get perfect blast
output every time.

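
One possibility that nobody in this thread has confirmed, but that would fit the symptoms (same file parses when create_sample_file is skipped, fails when it has run): the posted create_sample_file assigns $/ = '>' without local, so the input record separator stays '>' for the rest of the program, including any line-by-line reads done later by a parser. The sketch below uses core Perl only and made-up sub names (read_fasta_leaky, read_fasta_scoped, count_records are all hypothetical) to show the difference between a global assignment and local $/.

```perl
use strict;
use warnings;

# Three newline-terminated lines, one of which contains a '>' character.
my $report = "line one\nline two > with a marker\nline three\n";

# Count the records a plain <$fh> loop sees under the current value of $/.
sub count_records {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    my $n = 0;
    $n++ while <$fh>;
    close $fh;
    return $n;
}

# Global assignment: the change outlives the sub (as in the posted code).
sub read_fasta_leaky {
    $/ = '>';
    return;
}

# Scoped assignment: $/ is restored when the sub returns.
sub read_fasta_scoped {
    local $/ = '>';
    return;
}

my $before = count_records($report);        # default $/ is "\n"
read_fasta_scoped();
my $after_scoped = count_records($report);  # unchanged
read_fasta_leaky();
my $after_leaky = count_records($report);   # now split on '>' instead
$/ = "\n";                                   # put the default back

print "before: $before, after scoped: $after_scoped, after leaky: $after_leaky\n";
```

If this is what is happening, changing the assignment in create_sample_file to local $/ = '>'; would be the smallest fix; that is an assumption to test, not a diagnosis.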
>_______________________________________________ >Bioperl-l mailing list >Bioperl-l at lists.open-bio.org >http://lists.open-bio.org/mailman/listinfo/bioperl-l > > From cjfields at uiuc.edu Sun Jul 30 22:52:16 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Sun, 30 Jul 2006 21:52:16 -0500 Subject: [Bioperl-l] Strangeness in parsing blast file In-Reply-To: <44CD4B1C.5070907@broad.mit.edu> References: <44CC3550.5070105@broad.mit.edu> <44CC468A.40700@infotech.monash.edu.au> <44CCBDF7.2010601@broad.mit.edu> <44CCDC66.2030604@sendu.me.uk> <44CD4B1C.5070907@broad.mit.edu> Message-ID: <81C49D1F-0468-4B63-8D7A-09E1C48573F0@uiuc.edu> As an aside, BLAST 2.2.13 or later cannot be parsed using Bioperl 1.5.1. You have to update to the latest bioperl-live (from CVS). Chris On Jul 30, 2006, at 7:13 PM, Nabil Hafez wrote: > > > Sendu Bala wrote: > >> Nabil Hafez wrote: >> >> >>> I had modified the variables a bit to try and make them more >>> readable >>> than what is in my code, in my code -o $blastoutput is >>> what it is, like I said, the blast portion works absolutely fine >>> - i.e. >>> the do_blast sub routine is fully functional. >>> >>> >> >> How do you know? >> >> >> > Because it creates a file containing all of the blastoutput, this > works > every time - a file is created with the > blastoutput. > >>> `$MBLAST -d $blastdb -e $e_val -p $percent_id -D 2 -i $bf -o >>> $blastoutput`; >>> >>> >> >> Does this command definitely produce exactly the same file as the one >> you use to show that parse_blast() does sometimes work (when you >> avoid >> using do_blast())? >> >> >> > Yes - the exact same file because I produce the file with do_blast() > and then when it fails to parse it ends but > there is a blastoutput file created in my directory. If i re-run the > script again just feeding in the name of the file that was > created, it parses it just fine. 
So basically the parsing works > whenever I feed it a blastoupt file but it can't seem to parse > the same file that was created and then passed to the parse_blast() > subroutine > >> Btw, >> http://perldoc.perl.org/perlfaq8.html#What's-wrong-with-using- >> backticks-in-a-void-context%3f >> >> Good to know. Thanks. >> >> >>> I will try your suggestion to use the >>> Bio::Tools::Run::StandaloneBlast, >>> is megablast supported by this module? >>> >>> >> >> No, it doesn't. You could cheat and call _runblast() directly >> (give it >> an executable string and a string of args to megablast), and provide >> -outfile to new(). >> >> >> > I still don't think the blast is a problem since I get perfect > blastoutput everytime. > >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> >> > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From bix at sendu.me.uk Mon Jul 31 04:29:28 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Mon, 31 Jul 2006 09:29:28 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields) In-Reply-To: <5C765CAF-4976-45D1-BDA4-1E432081FF4D@uiuc.edu> References: <001301c6b24b$da38ba80$15327e82@pyrimidine> <44CB144C.6050507@sendu.me.uk> <5C765CAF-4976-45D1-BDA4-1E432081FF4D@uiuc.edu> Message-ID: <44CDBF68.2040803@sendu.me.uk> Chris Fields wrote: > On Jul 29, 2006, at 2:54 AM, Sendu Bala wrote: > >>> http://www.bioperl.org/wiki/HOWTO:Trees >> Which is why the Trees HOWTO talks about taxa, and a number of the >> Taxonomy modules have phylogenetic methods like get_lca. (And why >> there >> is Bio::Taxonomy::Taxon and Tree.) > > Are we still thinking about deprecating those? 
I have seen very > little mention of those modules from the mail list archives, and > Jason mentioned that Bio::Taxonomy::Taxon hasn't been modified in a > long time. Yes, they would both be redundant and nonsensical with the planned changes to Bio::Species. From Xianjun.Dong at bccs.uib.no Mon Jul 31 07:55:59 2006 From: Xianjun.Dong at bccs.uib.no (Xianjun Dong) Date: Mon, 31 Jul 2006 13:55:59 +0200 Subject: [Bioperl-l] PAML + Codeml problem.. In-Reply-To: 4A98ACB8EC146149872BAC9A132A582C277AC4@icex5.ic.ac.uk Message-ID: <1154346960.6517.19.camel@lauvtre.ii.uib.no> Hi, I have a problem during running the Codeml Wiki-HOWTO code: Here is the error message: //////////////////////////////////////////////////////////////// [xianjund at lauvtre kaks]$ perl paml.pl test.fa -------------------- WARNING --------------------- MSG: There was an error - see error_string for the program output STACK Bio::Tools::Run::Phylo::PAML::Codeml::run /Home/extern/xianjund/src/bioperl/bioperl-run/Bio/Tools/Run/Phylo/PAML/Codeml.pm:581 STACK toplevel paml.pl:61 ------------- EXCEPTION: Bio::Root::NotImplemented ------------- MSG: Unknown format of PAML output STACK: Error::throw STACK: Bio::Root::Root::throw /usr/lib/perl5/site_perl/5.8.5/Bio/Root/Root.pm:328 STACK: Bio::Tools::Phylo::PAML::_parse_summary /usr/lib/perl5/site_perl/5.8.5/Bio/Tools/Phylo/PAML.pm:359 STACK: Bio::Tools::Phylo::PAML::next_result /usr/lib/perl5/site_perl/5.8.5/Bio/Tools/Phylo/PAML.pm:224 STACK: paml.pl:62 ---------------------------------------------------------------- //////////////////////////////////////////////////////////////// My test sequence is: >ENST00000361390 
ATACCCATGGCCAACCTCCTACTCCTCATTGTACCCATTCTAATCGCAATGGCATTCCTAATGCTTACCGAACGAAAAATTCTAGGCTATATACAACTACGCAAAGGCCCCAACGTTGTAGGCCCCTACGGGCTACTACAACCCTTCGCTGACGCCATAAAACTCTTCACCAAAGAGCCCCTAAAACCCGCCACATCTACCATCACCCTCTACATCACCGCCCCGACCTTAGCTCTCACCATCGCTCTTCTACTATGAACCCCCCTCCCCATACCCAACCCCCTGGTCAACCTCAACCTAGGCCTCCTATTTATTCTAGCCACCTCTAGCCTAGCCGTTTACTCAATCCTCTGATCAGGGTGAGCATCAAACTCAAACTACGCCCTGATCGGCGCACTGCGAGCAGTAGCCCAAACAATCTCATATGAAGTCACCCTAGCCATCATTCTACTATCAACATTACTAATAAGTGGCTCCTTTAACCTCTCCACCCTTATCACAACACAAGAACACCTCTGATTACTCCTGCCATCATGACCCTTGGCCATAATATGATTTATCTCCACACTAGCAGAGACCAACCGAACCCCCTTCGACCTTGCCGAAGGGGAGTCCGAACTAGTCTCAGGCTTCAACATCGAATACGCCGCAGGCCCCTTCGCCCTATTCTTCATAGCCGAATACACAAACATTATTATAATAAACACCCTCACCACTACAATCTTCCTAGGAACAACATATGACGCACTCTCCCCTGAACTCTACACAACATATTTTGTCACCAAGACCCTACTTCTAACCTCCCTGTTCTTATGAATTCGAACAGCATACCCCCGATTCCGCTACGACCAACTCATACACCTCCTATGAAAAAACTTCCTACCACTCACCCTAGCATTACTTATATGATATGTCTCCATACCCATTACAATCTCCAGCATTCCCCCTCAAACCTAA >ENSMUST00000082392 GTGTTCTTTATTAATATCCTAACACTCCTCGTCCCCATTCTAATCGCCATAGCCTTCCTAACATTAGTAGAACGCAAAATCTTAGGGTACATACAACTACGAAAAGGCCCTAACATTGTTGGTCCATACGGCATTTTACAACCATTTGCAGACGCCATAAAATTATTTATAAAAGAACCAATACGCCCTTTAACAACCTCTATATCCTTATTTATTATTGCACCTACCCTATCACTCACACTAGCATTAAGTCTATGAGTTCCCCTACCAATACCACACCCATTAATTAATTTAAACCTAGGGATTTTATTTATTTTAGCAACATCTAGCCTATCAGTTTACTCCATTCTATGATCAGGATGAGCCTCAAACTCCAAATACTCACTATTCGGAGCTTTACGAGCCGTAGCCCAAACAATTTCATATGAAGTAACCATAGCTATTATCCTTTTATCAGTTCTATTAATAAATGGATCCTACTCTCTACAAACACTTATTACAACCCAAGAACACATATGATTACTTCTGCCAGCCTGACCCATAGCCATAATATGATTTATCTCAACCCTAGCAGAAACAAACCGGGCCCCCTTCGACCTGACAGAAGGAGAATCAGAATTAGTATCAGGGTTTAACGTAGAATACGCAGCCGGCCCATTCGCGTTATTCTTTATAGCAGAGTACACTAACATTATTCTAATAAACGCCCTAACAACTATTATCTTCCTAGGACCCCTATACTATATCAATTTACCAGAACTCTACTCAACTAACTTCATAATAGAAGCTCTACTACTATCATCAACATTCCTATGGATCCGAGCATCTTATCCACGCTTCCGTTACGATCAACTTATACATCTTCTATGAAAAAACTTTCTACCCCTAACACTAGCATTATGTATGTGACATATTTCTTTACCAATTTTTACAGCGGGAGTACCACCATACATATAG Sure, I checked it. There is some stop codon in it. 
If I replace it with non-stop codon, it works. For example, >ENST00000361390 ATACCCATGGCCAACCTCCTACTCCTCATTGTACCCATTCcaaTCGCAATGGCATTCCcaaTGCTTACCGAACGAAAAATTCcaaGCTATATACAACTACGCAAAGGCCCCAACGTTGcaaGCCCCTACGGGCTACTACAACCCTTCGCcaaCGCCAcaaAACTCTTCACCAAAGAGCCCCcaaAACCCGCCACATCTACCATCACCCTCTACATCACCGCCCCGACCTcaaCTCTCACCATCGCTCTTCTACTAcaaACCCCCCTCCCCATACCCAACCCCCTGGTCAACCTCAACCcaaGCCTCCTATTTATTCcaaCCACCTCcaaCCcaaCCGTTTACTCAATCCTCcaaTCAGGGcaaGCATCAAACTCAAACTACGCCCcaaTCGGCGCACTGCGAGCAGcaaCCCAAACAATCTCATAcaaAGTCACCCcaaCCATCATTCTACTATCAACATTACcaacaaGTGGCTCCTTcaaCCTCTCCACCCTTATCACAACACAAGAACACCTCcaaTTACTCCTGCCATCAcaaCCCTTGGCCAcaaTAcaaTTTATCTCCACACcaaCAGAGACCAACCGAACCCCCTTCGACCTTGCCGAAGGGGAGTCCGAACcaaTCTCAGGCTTCAACATCGAATACGCCGCAGGCCCCTTCGCCCTATTCTTCAcaaCCGAATACACAAACATTATTAcaacaaACACCCTCACCACTACAATCTTCCcaaGAACAACATAcaaCGCACTCTCCCCcaaACTCTACACAACATATTTTGTCACCAAGACCCTACTTCcaaCCTCCCTGTTCTTAcaaATTCGAACAGCATACCCCCGATTCCGCTACGACCAACTCATACACCTCCTAcaaAAAAACTTCCTACCACTCACCCcaaCATTACTTATAcaaTATGTCTCCATACCCATTACAATCTCCAGCATTCCCCCTCAAACCcaa >ENSMUST00000082392 
GTGTTCTTTATcaaTATCCcaaCACTCCTCGTCCCCATTCcaaTCGCCAcaaCCTTCCcaaCATcaacaaAACGCAAAATCTcaaGGTACATACAACTACGAAAAGGCCCcaaCATTGTTGGTCCATACGGCATTTTACAACCATTTGCAGACGCCAcaaAATTATTTAcaaAAGAACCAATACGCCCTTcaaCAACCTCTATATCCTTATTTATTATTGCACCTACCCTATCACTCACACcaaCATcaaGTCTAcaaGTTCCCCTACCAATACCACACCCATcaaTcaaTTcaaACCcaaGGATTTTATTTATTTcaaCAACATCcaaCCTATCAGTTTACTCCATTCTAcaaTCAGGAcaaGCCTCAAACTCCAAATACTCACTATTCGGAGCTTTACGAGCCGcaaCCCAAACAATTTCATAcaaAGcaaCCAcaaCTATTATCCTTTTATCAGTTCTATcaacaaATGGATCCTACTCTCTACAAACACTTATTACAACCCAAGAACACATAcaaTTACTTCTGCCAGCCcaaCCCAcaaCCAcaaTAcaaTTTATCTCAACCCcaaCAGAAACAAACCGGGCCCCCTTCGACCcaaCAGAAGGAGAATCAGAATcaaTATCAGGGTTcaaCGcaaAATACGCAGCCGGCCCATTCGCGTTATTCTTTAcaaCAGAGTACACcaaCATTATTCcaacaaACGCCCcaaCAACTATTATCTTCCcaaGACCCCTATACTATATCAATTTACCAGAACTCTACTCAACcaaCTTCAcaacaaAAGCTCTACTACTATCATCAACATTCCTATGGATCCGAGCATCTTATCCACGCTTCCGTTACGATCAACTTATACATCTTCTAcaaAAAAACTTTCTACCCCcaaCACcaaCATTATGTATGcaaCATATTTCTTTACCAATTTTTACAGCGGGAGTACCACCATACATAcaa But my question is: it does not occur in the codon position (say, the third codon's position is not a times of 3). Why it effect the result? And also there is code to filter out the stop codon in the sample code (as the following shown) /////////////////////////////// if( $pseq =~ /\*/ && $pseq !~ /\*$/ ) { warn("provided a CDS sequence with a stop codon, PAML will choke!"); exit(0); } # Tcoffee can't handle '*' even if it is trailing $pseq =~ s/\*//g; ///////////////////////////// So, when translate back from aa_aln to dna_aln, there should be no stop codon included. SO, why it can not pass? Thanks for answer! 
P.S: attach my code here:
/////////////////////////////////////////////////////////
#!/usr/bin/perl -w
use strict;
use Bio::Tools::Run::Phylo::PAML::Codeml;
use Bio::Tools::Run::Alignment::Clustalw;

# for projecting alignments from protein to R/DNA space
use Bio::Align::Utilities qw(aa_to_dna_aln);

# for input of the sequence data
use Bio::SeqIO;
use Bio::AlignIO;

my $aln_factory = Bio::Tools::Run::Alignment::Clustalw->new('quiet'=>1);
my $seqdata = shift || 'test.fa';

my $seqio = new Bio::SeqIO(-file => $seqdata,
                           -format => 'fasta');
my %seqs;
my @prots;
# process each sequence
while ( my $seq = $seqio->next_seq ) {
    $seqs{$seq->display_id} = $seq;
    # translate them into protein
    my $protein = $seq->translate();
    my $pseq = $protein->seq();
    if( $pseq =~ /\*/ &&
        $pseq !~ /\*$/ ) {
        warn("provided a CDS sequence with a stop codon, PAML will choke!");
        exit(0);
    }
    # Tcoffee can't handle '*' even if it is trailing
    $pseq =~ s/\*//g;
    $protein->seq($pseq);
    push @prots, $protein;
}

if( @prots < 2 ) {
    warn("Need at least 2 CDS sequences to proceed");
    exit(0);
}

# open(OUT, ">align_output.txt") || die("cannot open output align_output for writing");
# Align the sequences with clustalw
my $aa_aln = $aln_factory->align(\@prots);
# project the protein alignment back to CDS coordinates
my $dna_aln = aa_to_dna_aln($aa_aln, \%seqs);

my @each = $dna_aln->each_seq();

my $kaks_factory = Bio::Tools::Run::Phylo::PAML::Codeml->new
    ( -params => { 'runmode' => -2,
                   'seqtype' => 1,
                 },
      -save_tempfiles => 1,
      -verbose => 1);

# set the alignment object
$kaks_factory->alignment($dna_aln);

# run the KaKs analysis
my ($rc,$parser) = $kaks_factory->run();
my $result = $parser->next_result;
my $MLmatrix = $result->get_MLmatrix();

my @otus = $result->get_seqs();
# this gives us a mapping from the PAML order of sequences back to
# the input order (since names get truncated)
my @pos = map {
    my $c= 1;
    foreach my $s ( @each ) {
        last if( $s->display_id eq $_->display_id );
        $c++;
    }
    $c;
} @otus;

print join("\t",
qw(SEQ1 SEQ2 Ka Ks Ka/Ks PROT_PERCENTID CDNA_PERCENTID)),"\n"; for( my $i = 0; $i < (scalar @otus -1) ; $i++) { for( my $j = $i+1; $j < (scalar @otus); $j++ ) { my $sub_aa_aln = $aa_aln->select_noncont($pos[$i],$pos[$j]); my $sub_dna_aln = $dna_aln->select_noncont($pos[$i],$pos[$j]); print join("\t", $otus[$i]->display_id, $otus[$j]->display_id,$MLmatrix->[$i]->[$j]- >{'dN'}, $MLmatrix->[$i]->[$j]->{'dS'}, $MLmatrix->[$i]->[$j]->{'omega'}, sprintf("%.2f",$sub_aa_aln- >percentage_identity), sprintf("%.2f",$sub_dna_aln- >percentage_identity), ), "\n"; } } -- Xianjun Dong PhD Student Computational Biology Unit Bergen Center for Computational Science University of Bergen H?yteknologisenteret, Thorm?hlensgate 55 N-5008 Bergen,Norway. Webpage: http://www.ii.uib.no/~xianjund/ MSN: sterding at hotmail.com Phone No: +47 - 55584354 (office) +47 - 47361688 (mobile) Fax No: +47 - 55584295 From golharam at umdnj.edu Mon Jul 31 11:20:33 2006 From: golharam at umdnj.edu (Ryan Golhar) Date: Mon, 31 Jul 2006 11:20:33 -0400 Subject: [Bioperl-l] PAML + Codeml problem.. In-Reply-To: <1154346960.6517.19.camel@lauvtre.ii.uib.no> Message-ID: <027201c6b4b4$ddc201f0$2f01a8c0@GOLHARMOBILE1> Hi Xianjun, I just did some work on this module including the example. >> it does not occur in the codon position >>(say, the third codon's position is not a times of 3). >>Why it effect the result? If I'm interpreting your question correctly, the stop codons in your sequence occur in-frame. This is why it is choking. >>So, when translate back from aa_aln to dna_aln, there should be no stop codon included. SO, why it can not pass? The Ka and Ks statistics are not calculated based on the protein sequence, they are calculated based on the DNA sequence. The protein sequence is used to provide a alignment for the codons of the DNA sequence. Checking the protein sequence for * is easier to identify in-frame stop codons than scanning the DNA sequence. 
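
To make the in-frame point above concrete, here is a minimal sketch that scans a CDS string codon by codon and reports internal stop codons. It is not part of the HOWTO script or of BioPerl: the sub name internal_stop_codons and the toy sequences are made up for illustration, and the default stop set (TAA, TAG, TGA) assumes the standard genetic code; other genetic codes use different stop codons, so the set can be passed in.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not a BioPerl function).
# Returns the 1-based codon numbers of in-frame stop codons, reading
# from the first base. The final codon is skipped, so a normal
# trailing stop is not reported.
sub internal_stop_codons {
    my ($cds, $stops) = @_;
    $stops ||= [qw(TAA TAG TGA)];    # assumption: standard genetic code
    my %is_stop;
    $is_stop{ uc($_) } = 1 for @$stops;

    my $dna     = uc $cds;
    my $ncodons = int(length($dna) / 3);
    my @hits;
    for my $i (0 .. $ncodons - 2) {  # empty range when fewer than 2 codons
        push @hits, $i + 1 if $is_stop{ substr($dna, 3 * $i, 3) };
    }
    return @hits;
}

# Toy sequences, invented for this example.
my %toy = (
    clean => 'ATGGCCAAATAA',   # ATG GCC AAA TAA : trailing stop only
    bad   => 'ATGTAGGCCTAA',   # ATG TAG GCC TAA : stop at codon 2
    shift => 'ATGCTAGCCTAA',   # ATG CTA GCC TAA : "TAG" is out of frame
);

for my $name (sort keys %toy) {
    my @hits = internal_stop_codons($toy{$name});
    printf "%-6s %s\n", $name, (join(',', @hits) || 'none');
}
# prints:
# bad    2
# clean  none
# shift  none
```

Only the "bad" sequence would trip the `$pseq =~ /\*/ && $pseq !~ /\*$/` check in the script, because only there does the stop triplet line up with the reading frame.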
The two checks for stop codons you mentioned are to check for stop codons within the sequence without worry for the last amino acid. The second part remove the * at the end of the sequence (not the middle). If you want to remove the in-frame stop codons, you can, but do so before translating it to protein sequences. Ryan -----Original Message----- From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Xianjun Dong Sent: Monday, July 31, 2006 7:56 AM To: bioperl-l at lists.open-bio.org Subject: [Bioperl-l] PAML + Codeml problem.. Hi, I have a problem during running the Codeml Wiki-HOWTO code: Here is the error message: //////////////////////////////////////////////////////////////// [xianjund at lauvtre kaks]$ perl paml.pl test.fa -------------------- WARNING --------------------- MSG: There was an error - see error_string for the program output STACK Bio::Tools::Run::Phylo::PAML::Codeml::run /Home/extern/xianjund/src/bioperl/bioperl-run/Bio/Tools/Run/Phylo/PAML/C odeml.pm:581 STACK toplevel paml.pl:61 ------------- EXCEPTION: Bio::Root::NotImplemented ------------- MSG: Unknown format of PAML output STACK: Error::throw STACK: Bio::Root::Root::throw /usr/lib/perl5/site_perl/5.8.5/Bio/Root/Root.pm:328 STACK: Bio::Tools::Phylo::PAML::_parse_summary /usr/lib/perl5/site_perl/5.8.5/Bio/Tools/Phylo/PAML.pm:359 STACK: Bio::Tools::Phylo::PAML::next_result /usr/lib/perl5/site_perl/5.8.5/Bio/Tools/Phylo/PAML.pm:224 STACK: paml.pl:62 ---------------------------------------------------------------- //////////////////////////////////////////////////////////////// My test sequence is: >ENST00000361390 ATACCCATGGCCAACCTCCTACTCCTCATTGTACCCATTCTAATCGCAATGGCATTCCTAATGCTTACCGAA CGAAAAATTCTAGGCTATATACAACTACGCAAAGGCCCCAACGTTGTAGGCCCCTACGGGCTACTACAACCC TTCGCTGACGCCATAAAACTCTTCACCAAAGAGCCCCTAAAACCCGCCACATCTACCATCACCCTCTACATC ACCGCCCCGACCTTAGCTCTCACCATCGCTCTTCTACTATGAACCCCCCTCCCCATACCCAACCCCCTGGTC 
AACCTCAACCTAGGCCTCCTATTTATTCTAGCCACCTCTAGCCTAGCCGTTTACTCAATCCTCTGATCAGGG TGAGCATCAAACTCAAACTACGCCCTGATCGGCGCACTGCGAGCAGTAGCCCAAACAATCTCATATGAAGTC ACCCTAGCCATCATTCTACTATCAACATTACTAATAAGTGGCTCCTTTAACCTCTCCACCCTTATCACAACA CAAGAACACCTCTGATTACTCCTGCCATCATGACCCTTGGCCATAATATGATTTATCTCCACACTAGCAGAG ACCAACCGAACCCCCTTCGACCTTGCCGAAGGGGAGTCCGAACTAGTCTCAGGCTTCAACATCGAATACGCC GCAGGCCCCTTCGCCCTATTCTTCATAGCCGAATACACAAACATTATTATAATAAACACCCTCACCACTACA ATCTTCCTAGGAACAACATATGACGCACTCTCCCCTGAACTCTACACAACATATTTTGTCACCAAGACCCTA CTTCTAACCTCCCTGTTCTTATGAATTCGAACAGCATACCCCCGATTCCGCTACGACCAACTCATACACCTC CTATGAAAAAACTTCCTACCACTCACCCTAGCATTACTTATATGATATGTCTCCATACCCATTACAATCTCC AGCATTCCCCCTCAAACCTAA >ENSMUST00000082392 GTGTTCTTTATTAATATCCTAACACTCCTCGTCCCCATTCTAATCGCCATAGCCTTCCTAACATTAGTAGAA CGCAAAATCTTAGGGTACATACAACTACGAAAAGGCCCTAACATTGTTGGTCCATACGGCATTTTACAACCA TTTGCAGACGCCATAAAATTATTTATAAAAGAACCAATACGCCCTTTAACAACCTCTATATCCTTATTTATT ATTGCACCTACCCTATCACTCACACTAGCATTAAGTCTATGAGTTCCCCTACCAATACCACACCCATTAATT AATTTAAACCTAGGGATTTTATTTATTTTAGCAACATCTAGCCTATCAGTTTACTCCATTCTATGATCAGGA TGAGCCTCAAACTCCAAATACTCACTATTCGGAGCTTTACGAGCCGTAGCCCAAACAATTTCATATGAAGTA ACCATAGCTATTATCCTTTTATCAGTTCTATTAATAAATGGATCCTACTCTCTACAAACACTTATTACAACC CAAGAACACATATGATTACTTCTGCCAGCCTGACCCATAGCCATAATATGATTTATCTCAACCCTAGCAGAA ACAAACCGGGCCCCCTTCGACCTGACAGAAGGAGAATCAGAATTAGTATCAGGGTTTAACGTAGAATACGCA GCCGGCCCATTCGCGTTATTCTTTATAGCAGAGTACACTAACATTATTCTAATAAACGCCCTAACAACTATT ATCTTCCTAGGACCCCTATACTATATCAATTTACCAGAACTCTACTCAACTAACTTCATAATAGAAGCTCTA CTACTATCATCAACATTCCTATGGATCCGAGCATCTTATCCACGCTTCCGTTACGATCAACTTATACATCTT CTATGAAAAAACTTTCTACCCCTAACACTAGCATTATGTATGTGACATATTTCTTTACCAATTTTTACAGCG GGAGTACCACCATACATATAG Sure, I checked it. There is some stop codon in it. If I replace it with non-stop codon, it works. 
For example, >ENST00000361390 ATACCCATGGCCAACCTCCTACTCCTCATTGTACCCATTCcaaTCGCAATGGCATTCCcaaTGCTTACCGAA CGAAAAATTCcaaGCTATATACAACTACGCAAAGGCCCCAACGTTGcaaGCCCCTACGGGCTACTACAACCC TTCGCcaaCGCCAcaaAACTCTTCACCAAAGAGCCCCcaaAACCCGCCACATCTACCATCACCCTCTACATC ACCGCCCCGACCTcaaCTCTCACCATCGCTCTTCTACTAcaaACCCCCCTCCCCATACCCAACCCCCTGGTC AACCTCAACCcaaGCCTCCTATTTATTCcaaCCACCTCcaaCCcaaCCGTTTACTCAATCCTCcaaTCAGGG caaGCATCAAACTCAAACTACGCCCcaaTCGGCGCACTGCGAGCAGcaaCCCAAACAATCTCATAcaaAGTC ACCCcaaCCATCATTCTACTATCAACATTACcaacaaGTGGCTCCTTcaaCCTCTCCACCCTTATCACAACA CAAGAACACCTCcaaTTACTCCTGCCATCAcaaCCCTTGGCCAcaaTAcaaTTTATCTCCACACcaaCAGAG ACCAACCGAACCCCCTTCGACCTTGCCGAAGGGGAGTCCGAACcaaTCTCAGGCTTCAACATCGAATACGCC GCAGGCCCCTTCGCCCTATTCTTCAcaaCCGAATACACAAACATTATTAcaacaaACACCCTCACCACTACA ATCTTCCcaaGAACAACATAcaaCGCACTCTCCCCcaaACTCTACACAACATATTTTGTCACCAAGACCCTA CTTCcaaCCTCCCTGTTCTTAcaaATTCGAACAGCATACCCCCGATTCCGCTACGACCAACTCATACACCTC CTAcaaAAAAACTTCCTACCACTCACCCcaaCATTACTTATAcaaTATGTCTCCATACCCATTACAATCTCC AGCATTCCCCCTCAAACCcaa >ENSMUST00000082392 GTGTTCTTTATcaaTATCCcaaCACTCCTCGTCCCCATTCcaaTCGCCAcaaCCTTCCcaaCATcaacaaAA CGCAAAATCTcaaGGTACATACAACTACGAAAAGGCCCcaaCATTGTTGGTCCATACGGCATTTTACAACCA TTTGCAGACGCCAcaaAATTATTTAcaaAAGAACCAATACGCCCTTcaaCAACCTCTATATCCTTATTTATT ATTGCACCTACCCTATCACTCACACcaaCATcaaGTCTAcaaGTTCCCCTACCAATACCACACCCATcaaTc aaTTcaaACCcaaGGATTTTATTTATTTcaaCAACATCcaaCCTATCAGTTTACTCCATTCTAcaaTCAGGA caaGCCTCAAACTCCAAATACTCACTATTCGGAGCTTTACGAGCCGcaaCCCAAACAATTTCATAcaaAGca aCCAcaaCTATTATCCTTTTATCAGTTCTATcaacaaATGGATCCTACTCTCTACAAACACTTATTACAACC CAAGAACACATAcaaTTACTTCTGCCAGCCcaaCCCAcaaCCAcaaTAcaaTTTATCTCAACCCcaaCAGAA ACAAACCGGGCCCCCTTCGACCcaaCAGAAGGAGAATCAGAATcaaTATCAGGGTTcaaCGcaaAATACGCA GCCGGCCCATTCGCGTTATTCTTTAcaaCAGAGTACACcaaCATTATTCcaacaaACGCCCcaaCAACTATT ATCTTCCcaaGACCCCTATACTATATCAATTTACCAGAACTCTACTCAACcaaCTTCAcaacaaAAGCTCTA CTACTATCATCAACATTCCTATGGATCCGAGCATCTTATCCACGCTTCCGTTACGATCAACTTATACATCTT CTAcaaAAAAACTTTCTACCCCcaaCACcaaCATTATGTATGcaaCATATTTCTTTACCAATTTTTACAGCG GGAGTACCACCATACATAcaa But my 
question is: it does not occur in the codon position (say, the third
codon's position is not a times of 3). Why it effect the result?

And also there is code to filter out the stop codon in the sample code
(as the following shown)
///////////////////////////////
    if( $pseq =~ /\*/ && $pseq !~ /\*$/ ) {
        warn("provided a CDS sequence with a stop codon, PAML will choke!");
        exit(0);
    }
    # Tcoffee can't handle '*' even if it is trailing
    $pseq =~ s/\*//g;
/////////////////////////////

So, when translate back from aa_aln to dna_aln, there should be no stop
codon included. SO, why it can not pass?

Thanks for answer!

P.S: attach my code here:
/////////////////////////////////////////////////////////
#!/usr/bin/perl -w
use strict;
use Bio::Tools::Run::Phylo::PAML::Codeml;
use Bio::Tools::Run::Alignment::Clustalw;

# for projecting alignments from protein to R/DNA space
use Bio::Align::Utilities qw(aa_to_dna_aln);

# for input of the sequence data
use Bio::SeqIO;
use Bio::AlignIO;

my $aln_factory = Bio::Tools::Run::Alignment::Clustalw->new('quiet'=>1);
my $seqdata = shift || 'test.fa';
my $seqio = new Bio::SeqIO(-file => $seqdata, -format => 'fasta');
my %seqs;
my @prots;

# process each sequence
while ( my $seq = $seqio->next_seq ) {
    $seqs{$seq->display_id} = $seq;
    # translate them into protein
    my $protein = $seq->translate();
    my $pseq = $protein->seq();
    if( $pseq =~ /\*/ && $pseq !~ /\*$/ ) {
        warn("provided a CDS sequence with a stop codon, PAML will choke!");
        exit(0);
    }
    # Tcoffee can't handle '*' even if it is trailing
    $pseq =~ s/\*//g;
    $protein->seq($pseq);
    push @prots, $protein;
}

if( @prots < 2 ) {
    warn("Need at least 2 CDS sequences to proceed");
    exit(0);
}

# open(OUT, ">align_output.txt") || die("cannot open output align_output for writing");

# Align the sequences with clustalw
my $aa_aln = $aln_factory->align(\@prots);

# project the protein alignment back to CDS coordinates
my $dna_aln = aa_to_dna_aln($aa_aln, \%seqs);
my @each = $dna_aln->each_seq();

my $kaks_factory = Bio::Tools::Run::Phylo::PAML::Codeml->new
    ( -params => { 'runmode' => -2,
                   'seqtype' => 1,
                 },
      -save_tempfiles => 1,
      -verbose => 1);

# set the alignment object
$kaks_factory->alignment($dna_aln);

# run the KaKs analysis
my ($rc,$parser) = $kaks_factory->run();
my $result = $parser->next_result;
my $MLmatrix = $result->get_MLmatrix();
my @otus = $result->get_seqs();

# this gives us a mapping from the PAML order of sequences back to
# the input order (since names get truncated)
my @pos = map {
    my $c= 1;
    foreach my $s ( @each ) {
        last if( $s->display_id eq $_->display_id );
        $c++;
    }
    $c;
} @otus;

print join("\t", qw(SEQ1 SEQ2 Ka Ks Ka/Ks PROT_PERCENTID CDNA_PERCENTID)),"\n";
for( my $i = 0; $i < (scalar @otus -1) ; $i++) {
    for( my $j = $i+1; $j < (scalar @otus); $j++ ) {
        my $sub_aa_aln = $aa_aln->select_noncont($pos[$i],$pos[$j]);
        my $sub_dna_aln = $dna_aln->select_noncont($pos[$i],$pos[$j]);
        print join("\t",
                   $otus[$i]->display_id,
                   $otus[$j]->display_id,
                   $MLmatrix->[$i]->[$j]->{'dN'},
                   $MLmatrix->[$i]->[$j]->{'dS'},
                   $MLmatrix->[$i]->[$j]->{'omega'},
                   sprintf("%.2f",$sub_aa_aln->percentage_identity),
                   sprintf("%.2f",$sub_dna_aln->percentage_identity),
                   ), "\n";
    }
}

-- 
Xianjun Dong
PhD Student
Computational Biology Unit
Bergen Center for Computational Science
University of Bergen
Høyteknologisenteret, Thormøhlensgate 55
N-5008 Bergen, Norway.
Webpage: http://www.ii.uib.no/~xianjund/
MSN: sterding at hotmail.com
Phone No: +47 - 55584354 (office)
          +47 - 47361688 (mobile)
Fax No: +47 - 55584295

_______________________________________________
Bioperl-l mailing list
Bioperl-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l

From nabil at broad.mit.edu Mon Jul 31 14:57:48 2006
From: nabil at broad.mit.edu (Nabil Hafez)
Date: Mon, 31 Jul 2006 14:57:48 -0400
Subject: [Bioperl-l] Strangeness in parsing blast file
In-Reply-To: <44CC3550.5070105@broad.mit.edu>
References: <44CC3550.5070105@broad.mit.edu>
Message-ID: <44CE52AC.4080108@broad.mit.edu>

I have figured out the problem - not a problem with Bioperl.

In my create_sample_file() subroutine I defined

    $/ = '>'; #define fasta record input seperator

when it should have been this

    local $/ = "\n>";

the use of local made a big difference.

Thanks to all for your help.

Nabil Hafez

Nabil Hafez wrote:
> Hi,
> I am having a somewhat similar problem to what was posted in
> http://bioperl.org/pipermail/bioperl-l/2006-May/021416.html
> however, I have read through all of that thread and I don't believe what
> I am experiencing is the exact same problem. I also realize that the
> Bioperl version 1.5.1 fixes a problem with blast parsing.
>
> My problem:
> My blastresults file parses fine and everything works swimmingly if
> I pass the blast output file by name such as
>     $blast_result = 'test.blastout';
>
> however when I do
>     $blast_result = &do_blast($sample_fasta);
>
> even though in both cases $blast_result evaluate to "test.blastout", the
> parsing doesn't work, more specifically
> it gets an undefined variable for $result in
>     while( my $result = $report_obj->next_result ) {
>
> Sorry for the long email - any help would be appreciated,
> Thanks - Nabil
>
>
> The code...non relevant parts trimmed for size constraints....debugging
> from working and non-working versions after the code.
>
> use strict;
> use Bio::SearchIO;
> use Getopt::Std;
> use List::Util qw(shuffle);
> use Benchmark;
>
> my ($inputfile, $samplefile, $blastfile, $blast_result, $blast_report, $blast_verbose); #files generated
>
>
> #------------------#
> # Subroutine Calls #
> #------------------#
>
> my $test = &create_sample_file($inputfile); #inputfile being a fasta file containing nucleotide sequence
> $blast_result = &do_blast($test);
> ##$blast_result = 'test.blastout'; #when this is uncommented and replace the previous two lines with test.blastout being normal blast output - the script works fine.
> &parse_blast($blast_result);
>
>
> #######################
> # create_sample_file
> #
> # Input: Original Fasta File
> #
> # Output: Fasta file containing randomly sampled reads
> #
> #
> sub create_sample_file {
>     my $in = shift;
>     my $linecount = 0;
>     my @lines;
>
>     $samplefile = $in . "_sample";
>
>     #Determine total # of reads in input fasta
>     $totalreads = `$grep -c '>' $inputfile`;
>     $totalreads =~ s/\s+//;
>     chomp $totalreads;
>
>     if ($totalreads > 1000) { #sample if more than 1000 reads
>         $sample_reads = sprintf("%.0f", $totalreads * ($per_to_sample/100)); #number of reads to sample
>     }
>     else { #otherwise use all reads
>         $sample_reads = $totalreads;
>     }
>
>     $/ = '>'; #define fasta record input seperator
>
>     open (IN, "<$in") or die "Cannot open $in $!\n";
>     open (OUT, ">$samplefile") or die "Cannot open $samplefile $!\n";
>
>
>     while (<IN>) { #read lines into an array
>         chomp;
>         push (@lines, $_);
>     }
>
>     @lines = shuffle(@lines); #shuffle array
>     foreach (@lines) {
>         print OUT ">$_" if $linecount <= $sample_reads; #output to file sampled number of reads
>         $linecount++;
>     }
>
>     close IN;
>     close OUT;
>
>     return $samplefile;
>
> }#end create_sample_file
>
>
> #######################
> # do_blast
> #
> # Input: Fasta File containing SCREENED sampled reads
> #
> # Output: Blast File
> #
> #
>
> sub do_blast {
>     my $bf = shift;
>     my $blastoutput = $bf . ".blastout";
>
>     print "Blasting against $db ...\n";
>
>     `blast/blast-2.2.10/bin/megablast -d /prodinfo/proddata_ntblastdb/nt -e 1e-50 -p 99 -D 2 -i test -o test.blastout`;
>
>     return $blastoutput;
>
> }#end do_blast
>
>
>
> #######################
> # parse_blast
> #
> # Input: Blast file
> #
> # Output: Creates hash containing best hit for each read
> #
> #
>
> sub parse_blast {
>     my $blastoutfile = shift;
>
>     if (! -e $blastoutfile) {
>         die "$blastoutfile does not exist $!\n";
>     }
>
>     print "Parsing blast hits ...\n";
>
>
>     my $report_obj = new Bio::SearchIO(-verbose => 1,
>                                        -format => 'blast',
>                                        -file => $blastoutfile);
>
>
>     die "no valid $report_obj" unless defined $report_obj;
>
>
>     while( my $result = $report_obj->next_result ) {
>         die "no valid $result" unless defined $result;
>         while( my $hit = $result->next_hit ) {
>             while( my $hsp = $hit->next_hsp ) {
>                 my $name = $result->query_name;
>                 my $hitDesc = $hit->description;
>                 my $length = $hsp->length('total');
>                 my $per_id = sprintf("%.2f", $hsp->percent_identity);
>                 my $eval = $hsp->evalue;
>                 next if (defined $blast_results{$name} && $blast_results{$name}->[0] > $length); #only keep best hit for any read
>                 $blast_results{$name} = [$length, $per_id, $eval, $hitDesc]; #store in hash of arrays
>             }
>         }
>     }
>
> } #end parse_blast
>
>
>
>
>
> Debug of NON-working blast-parse:
>
> main::(454/scripts/fasta_blasta_mb.pl:151):
> 151: my $sample_fasta = &create_sample_file($inputfile);
> DB<2> n
> main::(454/scripts/fasta_blasta_mb.pl:152):
> 152: $blast_result = &do_blast($sample_fasta);
> DB<2> x $sample_fasta
> 0 'G782.2005-08-16-16-48.fasta_sample'
> DB<3> n
> Blasting against NT ...
> main::(454/scripts/fasta_blasta_mb.pl:154):
> 154: &parse_blast($blast_result);
> DB<3> s
> main::parse_blast(454/scripts/fasta_blasta_mb.pl:293):
> 293: my $blastoutfile = shift;
> DB<3> s
> main::parse_blast(454/scripts/fasta_blasta_mb.pl:295):
> 295: if (!
-e $blastoutfile) { > DB<3> x $blastoutfile > 0 'G782.2005-08-16-16-48.fasta_sample.blastout' > DB<4> s > main::parse_blast(454/scripts/fasta_blasta_mb.pl:299): > 299: print "Parsing blast hits ...\n"; > DB<4> s > Parsing blast hits ... > main::parse_blast(454/scripts/fasta_blasta_mb.pl:302): > 302: my $report_obj = new Bio::SearchIO(-verbose => 1, > 303: -format => 'blast', > 304: -file => > $blastoutfile);#or die "Could not open blast report $!"; > DB<4> s > Bio::SearchIO::new(/util/lib/perl5/site_perl/5.8.0/Bio/SearchIO.pm:129): > 129: my($caller, at args) = @_; > DB<4> r > scalar context return from Bio::SearchIO::new: '_file' => > 'G782.2005-08-16-16-48.fasta_sample.blastout' > '_filehandle' => GLOB(0x8cef40c) > -> *Symbol::GEN1 > FileHandle({*Symbol::GEN1}) => fileno(3) > '_flush_on_write' => 1 > '_handler' => > Bio::SearchIO::IteratedSearchResultEventBuilder=HASH(0x9505724) > '_factories' => HASH(0x95054c0) > 'hit' => Bio::Factory::ObjectFactory=HASH(0x95017b8) > '_loaded_types' => HASH(0x9506c0c) > 'Bio::Search::Hit::BlastHit' => 1 > '_root_verbose' => 0 > 'interface' => 'Bio::Search::Hit::HitI' > 'type' => 'Bio::Search::Hit::BlastHit' > 'hsp' => Bio::Factory::ObjectFactory=HASH(0x9500e10) > '_loaded_types' => HASH(0x9506c18) > 'Bio::Search::HSP::GenericHSP' => 1 > '_root_verbose' => 0 > 'interface' => 'Bio::Search::HSP::HSPI' > 'type' => 'Bio::Search::HSP::GenericHSP' > 'iteration' => Bio::Factory::ObjectFactory=HASH(0x9506c60) > '_loaded_types' => HASH(0x9506af8) > 'Bio::Search::Iteration::GenericIteration' => 1 > '_root_verbose' => 0 > 'interface' => 'Bio::Search::Iteration::IterationI' > 'type' => 'Bio::Search::Iteration::GenericIteration' > 'result' => Bio::Factory::ObjectFactory=HASH(0x9504c80) > '_loaded_types' => HASH(0x9501f74) > 'Bio::Search::Result::BlastResult' => 1 > '_root_verbose' => 0 > 'interface' => 'Bio::Search::Result::ResultI' > 'type' => 'Bio::Search::Result::BlastResult' > '_inclusion_threshold' => 0.001 > '_root_verbose' => 1 > 
'_handler_cache' => > Bio::SearchIO::IteratedSearchResultEventBuilder=HASH(0x9505724) > -> REUSED_ADDRESS > '_notfirsttime' => 0 > '_reporttype' => '' > '_root_cleanup_methods' => ARRAY(0x8cde434) > 0 CODE(0x82a9aec) > -> &Bio::Root::IO::_io_cleanup in > /util/lib/perl5/site_perl/5.8.0/Bio/Root/IO.pm:544-570 > 1 CODE(0x82a9aec) > -> REUSED_ADDRESS > '_root_verbose' => 1 > main::parse_blast(454/scripts/fasta_blasta_mb.pl:307): > 307: die "no valid $report_obj" unless defined $report_obj; > DB<4> s > main::parse_blast(454/scripts/fasta_blasta_mb.pl:310): > 310: while( my $result = $report_obj->next_result ) { > DB<4> s > Bio::SearchIO::blast::next_result(/util/lib/perl5/site_perl/5.8.0/Bio/SearchIO/blast.pm:389): > 389: my ($self) = @_; > DB<4> r > scalar context return from Bio::SearchIO::blast::next_result: undef > Bio::SearchIO::DESTROY(/util/lib/perl5/site_perl/5.8.0/Bio/SearchIO.pm:438): > 438: my $self = shift; > DB<4> r > scalar context return from Bio::SearchIO::DESTROY: '' > Bio::Root::Root::DESTROY(/util/lib/perl5/site_perl/5.8.0/Bio/Root/Root.pm:404): > 404: my $self = shift; > DB<4> r > scalar context return from Bio::Root::Root::DESTROY: undef > Bio::Root::Root::DESTROY(/util/lib/perl5/site_perl/5.8.0/Bio/Root/Root.pm:404): > 404: my $self = shift; > DB<4> r > scalar context return from Bio::Root::Root::DESTROY: undef > Bio::Root::Root::DESTROY(/util/lib/perl5/site_perl/5.8.0/Bio/Root/Root.pm:404): > 404: my $self = shift; > DB<4> r > scalar context return from Bio::Root::Root::DESTROY: undef > Bio::Root::Root::DESTROY(/util/lib/perl5/site_perl/5.8.0/Bio/Root/Root.pm:404): > 404: my $self = shift; > DB<4> r > scalar context return from Bio::Root::Root::DESTROY: undef > Bio::Root::Root::DESTROY(/util/lib/perl5/site_perl/5.8.0/Bio/Root/Root.pm:404): > 404: my $self = shift; > DB<4> r > scalar context return from Bio::Root::Root::DESTROY: undef > main::(454/scripts/fasta_blasta_mb.pl:155): > 155: &output_results(); > DB<4> x $result > 0 undef > > > > Debug 
of WORKING blast-parse: > Parsing blast hits ... > main::parse_blast(454/scripts/fasta_blasta_mb.pl:302): > 302: my $report_obj = new Bio::SearchIO(-verbose => 1, > 303: -format => 'blast', > 304: -file => > $blastoutfile);#or die "Could not open blast report $!"; > DB<3> s > Bio::SearchIO::new(/util/lib/perl5/site_perl/5.8.0/Bio/SearchIO.pm:129): > 129: my($caller, at args) = @_; > DB<3> r > scalar context return from Bio::SearchIO::new: '_file' => > 'G782.2005-08-16-16-48.fasta_sample.blastout' > '_filehandle' => GLOB(0x8763100) > -> *Symbol::GEN1 > FileHandle({*Symbol::GEN1}) => fileno(3) > '_flush_on_write' => 1 > '_handler' => > Bio::SearchIO::IteratedSearchResultEventBuilder=HASH(0x8ab3be4) > '_factories' => HASH(0x8ab1594) > 'hit' => Bio::Factory::ObjectFactory=HASH(0x8a7b7c0) > '_loaded_types' => HASH(0x8abee10) > 'Bio::Search::Hit::BlastHit' => 1 > '_root_verbose' => 0 > 'interface' => 'Bio::Search::Hit::HitI' > 'type' => 'Bio::Search::Hit::BlastHit' > 'hsp' => Bio::Factory::ObjectFactory=HASH(0x8a87200) > '_loaded_types' => HASH(0x8abee1c) > 'Bio::Search::HSP::GenericHSP' => 1 > '_root_verbose' => 0 > 'interface' => 'Bio::Search::HSP::HSPI' > 'type' => 'Bio::Search::HSP::GenericHSP' > 'iteration' => Bio::Factory::ObjectFactory=HASH(0x8abee64) > '_loaded_types' => HASH(0x8abecfc) > 'Bio::Search::Iteration::GenericIteration' => 1 > '_root_verbose' => 0 > 'interface' => 'Bio::Search::Iteration::IterationI' > 'type' => 'Bio::Search::Iteration::GenericIteration' > 'result' => Bio::Factory::ObjectFactory=HASH(0x8a81a84) > '_loaded_types' => HASH(0x8a96ce8) > 'Bio::Search::Result::BlastResult' => 1 > '_root_verbose' => 0 > 'interface' => 'Bio::Search::Result::ResultI' > 'type' => 'Bio::Search::Result::BlastResult' > '_inclusion_threshold' => 0.001 > '_root_verbose' => 1 > '_handler_cache' => > Bio::SearchIO::IteratedSearchResultEventBuilder=HASH(0x8ab3be4) > -> REUSED_ADDRESS > '_notfirsttime' => 0 > '_reporttype' => '' > '_root_cleanup_methods' => 
ARRAY(0x8762efc) > 0 CODE(0x82a9aec) > -> &Bio::Root::IO::_io_cleanup in > /util/lib/perl5/site_perl/5.8.0/Bio/Root/IO.pm:544-570 > 1 CODE(0x82a9aec) > -> REUSED_ADDRESS > '_root_verbose' => 1 > main::parse_blast(454/scripts/fasta_blasta_mb.pl:307): > 307: die "no valid $report_obj" unless defined $report_obj; > DB<3> s > main::parse_blast(454/scripts/fasta_blasta_mb.pl:310): > 310: while( my $result = $report_obj->next_result ) { > DB<3> s > Bio::SearchIO::blast::next_result(/util/lib/perl5/site_perl/5.8.0/Bio/SearchIO/blast.pm:389): > 389: my ($self) = @_; > DB<3> r > blast.pm: unrecognized line Reference: Zheng Zhang, Scott Schwartz, > Lukas Wagner, and Webb Miller (2000), > blast.pm: unrecognized line "A greedy algorithm for aligning DNA > sequences", > blast.pm: unrecognized line J Comput Biol 2000; 7(1-2):203-14. > blast.pm: unrecognized > line > Score E > Got NCBI HSP score=354, evalue 0.0 > scalar context return from Bio::SearchIO::blast::next_result: > '_algorithm' => 'MEGABLAST' > '_algorithm_version' => '2.2.10 [Oct-19-2004]' > '_dbentries' => 4249067 > '_dbletters' => 17735149364 > '_dbname' => 'All GenBank+EMBL+DDBJ+PDB sequences (but no EST, > STS,GSS,environmental samples or phase 0, 1 or 2 HTGS sequences) ' > '_hitindex' => 0 > '_hits' => ARRAY(0x8b2acd0) > empty array > '_inclusion_threshold' => 0.001 > '_iteration_count' => 1 > '_iteration_index' => 0 > '_iterations' => ARRAY(0x8b2ac4c) > 0 Bio::Search::Iteration::GenericIteration=HASH(0x8b1cacc) > '_newhits_below_threshold' => ARRAY(0x8b1ca84) > 0 Bio::Search::Hit::BlastHit=HASH(0x8b1cf64) > '_accession' => 'AE004091' > '_algorithm' => 'MEGABLAST' > '_description' => 'Pseudomonas aeruginosa PAO1, complete genome' > '_hsps' => ARRAY(0x8b1ceb0) > 0 Bio::Search::HSP::GenericHSP=HASH(0x8b2098c) > '_algorithm' => 'MEGABLAST' > '_frac_conserved' => HASH(0x8b266a0) > 'hit' => 0.991803278688525 > 'query' => 0.991803278688525 > 'total' => 0.991803278688525 > '_frac_identical' => HASH(0x8b2658c) > 'hit' => 
0.991803278688525 > 'query' => 0.991803278688525 > 'total' => 0.991803278688525 > '_gaps' => HASH(0x8b24d94) > 'hit' => 0 > 'query' => 0 > 'total' => 0 > '_gsf_tag_hash' => HASH(0x8b20998) > empty hash > '_hit_string' => > 'cctgacctccgctcaactgcgcaaatacgccagcgccggtcggccgttccccgaagggcgcctgctggccgcctcctgccacgacgcggaggaactggccctggctgcctcgatgggagtggagttcgtcaccctttcgccggtacagccgaccgagagccatcccggcgagccggcgctgggttgggacaaggccgccgaactgatcgccggcttcaaccagccggtctacctgctgggtggcctcggtccgcagcaagccgagcaggcttgggagcatggagcccagggcgtggcgggtatccgtgcgttctggccgggcggcctttgacggtggaatgaagaaaaaaggaggcttcggcctcc' > '_homology_string' => > '|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| > ||||||||||||||||||||||||||||||||||||||||| > ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| > |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||' > etc...... > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From andreo_beck at yahoo.com Mon Jul 31 22:59:30 2006 From: andreo_beck at yahoo.com (Andreo Beck) Date: Mon, 31 Jul 2006 19:59:30 -0700 (PDT) Subject: [Bioperl-l] $hit_object->frac_aligned_hit/$hit_object->frac_aligned_query Message-ID: <20060801025930.96806.qmail@web55705.mail.re3.yahoo.com> Hi, Can $hit_object->frac_aligned_hit or $hit_object->frac_aligned_query give outputs > 1 ? I get some > 1 values. Does using the parentheses (e.g. $hit_object->frac_aligned_hit()) make any difference? Thanks, Andy --------------------------------- Talk is cheap. Use Yahoo! Messenger to make PC-to-Phone calls. Great rates starting at 1?/min. 
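
On the parentheses part of the question above: in Perl, a method call with no arguments behaves the same with or without the trailing parentheses. The sketch below shows this with a hypothetical stand-in class, ToyHit, which is not the BioPerl hit object; the stored value 0.87 is invented. It says nothing about why the real methods might return values above 1.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in class for illustration only (NOT Bio::Search::Hit).
package ToyHit;

sub new {
    my ($class, %args) = @_;
    return bless { frac => $args{frac} }, $class;
}

# Accessor named after the method in the question; it just returns
# the value given to the constructor.
sub frac_aligned_query {
    my ($self) = @_;
    return $self->{frac};
}

package main;

my $hit = ToyHit->new(frac => 0.87);

my $without = $hit->frac_aligned_query;      # no parentheses
my $with    = $hit->frac_aligned_query();    # empty parentheses

print "identical\n" if $without == $with;    # prints "identical"
```

One place where the notation does trip people up is string interpolation: "$hit->frac_aligned_query" inside double quotes does not call the method at all, with or without parentheses, so the value needs to be assigned to a variable or concatenated.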
From andreo_beck at yahoo.com Mon Jul 31 22:56:45 2006
From: andreo_beck at yahoo.com (Andreo Beck)
Date: Mon, 31 Jul 2006 19:56:45 -0700 (PDT)
Subject: [Bioperl-l] $hit_object->frac_aligned_hit/$hit_object->frac_aligned_query
Message-ID: <20060801025645.12106.qmail@web55703.mail.re3.yahoo.com>

Hi,

Can $hit_object->frac_aligned_hit or $hit_object->frac_aligned_query give outputs > 1 ? I get them.

Does using the parentheses (e.g. $hit_object->frac_aligned_hit()) make any difference?

Thanks,
Andy

__________________________________________________
Do You Yahoo!?
Tired of spam? Yahoo! Mail has the best spam protection around
http://mail.yahoo.com

From darin.london at duke.edu Mon Jul 3 12:41:33 2006
From: darin.london at duke.edu (Darin London)
Date: Mon, 03 Jul 2006 08:41:33 -0400
Subject: [Bioperl-l] Call For Birds of a Feather Suggestions
Message-ID: <44A9107D.2050304@duke.edu>

The BOSC organizing committee is currently seeking suggestions for Birds
of a Feather meeting ideas. Birds of a Feather meetings are one of the
more popular activities at BOSC, occurring at the end of each day's
session. These are free-form meetings organized by the attendees
themselves to discuss one or a few topics of interest in greater detail.
BOFs have been formed to allow developers and users of individual OBF
software to meet each other face-to-face to discuss the project, or to
discuss completely new ideas, and even start new software development
projects. These meetings offer a unique opportunity for individuals to
explore more about the activities of the various Open Source Projects,
and, in some cases, even take an active role influencing the future of
Open Source Software development.

If you would like to create a BOF, just sign up for a wiki account, log
in, and edit the BOSC 2006 Birds of a Feather page.
From bix at sendu.me.uk Wed Jul 5 12:37:34 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Wed, 05 Jul 2006 13:37:34 +0100 Subject: [Bioperl-l] checkout_all fails on biodata Message-ID: <44ABB28E.2000203@sendu.me.uk> I'm doing: cvs -d:ext:sendu at dev.open-bio.org:/home/repository/bioperl co bioperl_all to check out all the bioperl packages at once. However it only checks out core, db, pedigree, pipeline and run before failing on biodata: cvs checkout: Updating biodata cvs checkout: failed to create lock directory for `/home/repository/bioperl/biodata' (/home/repository/bioperl/biodata/#cvs.lock): Permission denied cvs checkout: failed to obtain dir lock in repository `/home/repository/bioperl/biodata' cvs [checkout aborted]: read lock failed - giving up This failure is consistent for me (had it multiple times, different days, never worked). Biodata isn't even mentioned as a possible package at http://bioperl.org/wiki/Using_CVS. What is it? Could it be moved to the end of the alias list so it is checked out last, letting all the other packages be checked out before failure? PS. neither biodata nor pipeline are mentioned as a package on that wiki page or at http://bioperl.org/wiki/Category:BioPerl_Packages. Are there yet more packages? Cheers, Sendu. From hlapp at gmx.net Wed Jul 5 12:55:42 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Wed, 5 Jul 2006 08:55:42 -0400 Subject: [Bioperl-l] checkout_all fails on biodata In-Reply-To: <44ABB28E.2000203@sendu.me.uk> References: <44ABB28E.2000203@sendu.me.uk> Message-ID: <9C3FE120-6B3D-4C81-99E1-921E81926CCA@gmx.net> Should have been fixed - I can cvs update. did you try again? On Jul 5, 2006, at 8:37 AM, Sendu Bala wrote: > I'm doing: > > cvs -d:ext:sendu at dev.open-bio.org:/home/repository/bioperl co > bioperl_all > > to check out all the bioperl packages at once. 
However it only checks > out core, db, pedigree, pipeline and run before failing on biodata: > > cvs checkout: Updating biodata > cvs checkout: failed to create lock directory for > `/home/repository/bioperl/biodata' > (/home/repository/bioperl/biodata/#cvs.lock): Permission denied > cvs checkout: failed to obtain dir lock in repository > `/home/repository/bioperl/biodata' > cvs [checkout aborted]: read lock failed - giving up > > This failure is consistent for me (had it multiple times, different > days, never worked). > > Biodata isn't even mentioned as a possible package at > http://bioperl.org/wiki/Using_CVS. What is it? Could it be moved to > the > end of the alias list so it is checked out last, letting all the other > packages be checked out before failure? > > PS. neither biodata nor pipeline are mentioned as a package on that > wiki > page or at http://bioperl.org/wiki/Category:BioPerl_Packages. Are > there > yet more packages? > > Cheers, > Sendu. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From bix at sendu.me.uk Wed Jul 5 13:03:50 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Wed, 05 Jul 2006 14:03:50 +0100 Subject: [Bioperl-l] checkout_all fails on biodata In-Reply-To: <9C3FE120-6B3D-4C81-99E1-921E81926CCA@gmx.net> References: <44ABB28E.2000203@sendu.me.uk> <9C3FE120-6B3D-4C81-99E1-921E81926CCA@gmx.net> Message-ID: <44ABB8B6.5040707@sendu.me.uk> Hilmar Lapp wrote: > Should have been fixed - I can cvs update. did you try again? Still doesn't work, no change. I can manually check out the other packages, I just can't do it with bioperl_all alias. 
co bioperl-biodata fails because: cvs server: cannot find module `bioperl-biodata' - ignored cvs [checkout aborted]: cannot expand modules (not that I want it - if its no longer a bioperl package can it be removed from the alias?) From hlapp at gmx.net Wed Jul 5 13:41:27 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Wed, 5 Jul 2006 09:41:27 -0400 Subject: [Bioperl-l] checkout_all fails on biodata In-Reply-To: <44ABB8B6.5040707@sendu.me.uk> References: <44ABB28E.2000203@sendu.me.uk> <9C3FE120-6B3D-4C81-99E1-921E81926CCA@gmx.net> <44ABB8B6.5040707@sendu.me.uk> Message-ID: The idea was once that Bioperl, Biojava, etc had all those unit tests that use specific sample data which take up quite a bit of space. Unifying the largely redundant test data into a single shared repository would save quite a bit of space and therefore download/ update time for people who work on/use more than one Bio* project. However, this was never fully implemented AFAIK. I.e., you don't need biodata. I guess it could be removed from the alias since it's not integrated anyway. Any other opinions? I also forwarded your report to root-l as I couldn't find the offending (stale) lock file. -hilmar On Jul 5, 2006, at 9:03 AM, Sendu Bala wrote: > Hilmar Lapp wrote: >> Should have been fixed - I can cvs update. did you try again? > > Still doesn't work, no change. I can manually check out the other > packages, I just can't do it with bioperl_all alias. > > co bioperl-biodata fails because: > cvs server: cannot find module `bioperl-biodata' - ignored > cvs [checkout aborted]: cannot expand modules > > (not that I want it - if its no longer a bioperl package can it be > removed from the alias?) 
> _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From cjfields at uiuc.edu Wed Jul 5 13:48:03 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Wed, 5 Jul 2006 08:48:03 -0500 Subject: [Bioperl-l] checkout_all fails on biodata In-Reply-To: <44ABB8B6.5040707@sendu.me.uk> Message-ID: <000f01c6a039$a7a24f10$15327e82@pyrimidine> Bioperl-data was a directory started up a few years ago to hold various data files for testing and as examples (BLAST file examples, GenBank files, etc), somewhat like the t/data directory but cleaned up a bit more. It hasn't been updated in a while. Regardless, you should be able to check it out. As for the problem, looks like Hilmar's checking up on a possible lock file issue. Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Sendu Bala > Sent: Wednesday, July 05, 2006 8:04 AM > To: bioperl-l at lists.open-bio.org > Subject: Re: [Bioperl-l] checkout_all fails on biodata > > Hilmar Lapp wrote: > > Should have been fixed - I can cvs update. did you try again? > > Still doesn't work, no change. I can manually check out the other > packages, I just can't do it with bioperl_all alias. > > co bioperl-biodata fails because: > cvs server: cannot find module `bioperl-biodata' - ignored > cvs [checkout aborted]: cannot expand modules > > (not that I want it - if its no longer a bioperl package can it be > removed from the alias?) 
> _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Wed Jul 5 15:06:30 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Wed, 5 Jul 2006 10:06:30 -0500 Subject: [Bioperl-l] checkout_all fails on biodata In-Reply-To: Message-ID: <001901c6a044$999a14b0$15327e82@pyrimidine> I use TortoiseCVS via WinXP and I'm getting the same issue that Sendu has: --------------------------- In C:\Perl\src: "C:\Program Files\TortoiseCVS\cvs.exe" "-q" "--lf" "checkout" "-P" "bioperl_all" CVSROOT=:ext:cjfields at dev.open-bio.org:/home/repository/bioperl ... cvs checkout: failed to create lock directory for `/home/repository/bioperl/biodata' (/home/repository/bioperl/biodata/#cvs.lock): Permission denied cvs checkout: failed to obtain dir lock in repository `/home/repository/bioperl/biodata' cvs [checkout aborted]: read lock failed - giving up cvs.exe checkout: in directory bioperl: cvs.exe checkout: cannot open CVS/Entries for reading: No such file or directory --------------------------- I had the same problem with schema (BioSQL) a while back. I tried again, and... --------------------------- cvs checkout: failed to create lock directory for `/home/repository/bioperl/biosql-schema' (/home/repository/bioperl/biosql-schema/#cvs.lock): Permission denied cvs checkout: failed to obtain dir lock in repository `/home/repository/bioperl/biosql-schema' cvs [checkout aborted]: read lock failed - giving up cvs.exe checkout: in directory .: cvs.exe checkout: cannot open CVS/Entries for reading: No such file or directory --------------------------- I believe it had something to do with CVS commit privileges (i.e. I had none for schema, which was fine). So maybe this is a permissions issue via the lock file? 
Looking at the alias:

bioperl_all -d bioperl &core &db &run &pipeline &pedigree &biodata &schema &network &microarray

This may mean that if anyone w/o commit privs for any of the above (specifically schema and biodata) tries checkout/update using bioperl_all, they may run into this problem.

Since it's not integrated I don't see the problem with removing it from the alias, but if we follow the same line of logic (and privileges are the issue) then schema must be removed as well. To me it doesn't make much sense to not include schema though, since we can checkout/update bioperl-db.

BTW, I like the idea of biodata as you've outlined it. Would be nice to gear the test suite towards a more general set of data for all the Bio* projects versus having each one come with their own, and the data could be updated a bit more frequently than t/data is. Seems like it would definitely save a large chunk of real estate for the distributions. If one wanted to run the full test suite then they would have to download biodata separately, though, but not a bad compromise. Though, if this is/was its intent, why would it need a lock file?

Chris

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of Hilmar Lapp
> Sent: Wednesday, July 05, 2006 8:41 AM
> To: Sendu Bala
> Cc: bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] checkout_all fails on biodata
>
> The idea was once that Bioperl, Biojava, etc had all those unit tests
> that use specific sample data which take up quite a bit of space.
> Unifying the largely redundant test data into a single shared
> repository would save quite a bit of space and therefore download/
> update time for people who work on/use more than one Bio* project.
>
> However, this was never fully implemented AFAIK. I.e., you don't need
> biodata. I guess it could be removed from the alias since it's not
> integrated anyway.
>
> Any other opinions?
> > I also forwarded your report to root-l as I couldn't find the > offending (stale) lock file. > > -hilmar > > On Jul 5, 2006, at 9:03 AM, Sendu Bala wrote: > > > Hilmar Lapp wrote: > >> Should have been fixed - I can cvs update. did you try again? > > > > Still doesn't work, no change. I can manually check out the other > > packages, I just can't do it with bioperl_all alias. > > > > co bioperl-biodata fails because: > > cvs server: cannot find module `bioperl-biodata' - ignored > > cvs [checkout aborted]: cannot expand modules > > > > (not that I want it - if its no longer a bioperl package can it be > > removed from the alias?) > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Wed Jul 5 15:36:33 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Wed, 5 Jul 2006 10:36:33 -0500 Subject: [Bioperl-l] Bio::SearchIO::hmmer hsp behaviour In-Reply-To: Message-ID: <001a01c6a048$cb802420$15327e82@pyrimidine> Okay, I managed to figure out what the problem was. I committed a fix in CVS for the initial bug (Selvi's missing hits). Still has one HSP per hit for now; I think it will take a bit more effort to get a BLAST-like multi HSP/hit up and running. Selvi, update from CVS to see if that works. 
Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Chris Fields > Sent: Friday, June 30, 2006 12:44 PM > To: Sendu Bala; Jason Stajich > Cc: bioperl-l at lists.open-bio.org list > Subject: Re: [Bioperl-l] Bio::SearchIO::hmmer hsp behaviour > > I'll try looking at it this weekend. A suggested workaround is to > either try setting -A for no alignments or setting it to a high > number to retrieve all of them. It's pretty serious as the error > silently dumps those domains, so for those using automated annotation > pipelines would miss it unless they are also checking the raw output. > > You could design a SearchIO::hmmpfam parser then expand it to take in > hmmsearch output at a later point, or keep them separate. I like the > idea of having modules that are more specific about what they parse; > seems at some point you reach serious code bloat and maintenance > becomes an issue. Look at SearchIO::blast; it parses various text > BLAST output very well but with some serious obfuscation. Just don't > know how productive it would be to separate out the PSI-BLAST and > bl2seq stuff since they are pretty close to a standard BLAST > report... oh well. > > To Jason : good luck on your move. Drop us a line here to let us > know everything went well. > > Chris > > On Jun 30, 2006, at 11:14 AM, Sendu Bala wrote: > > > Chris Fields wrote: > >> It may have been just simpler to have it be one HSP (domain) per Hit > >> (model) as that's how the reports are generated. My reasoning was > >> that > >> using the one domain per model made sense based on what you are > >> actually > >> trying to do, which is annotate the sequence based on the order the > >> domain appears. Most others may not view it that way, which is fine. > >> One can always gather the relevant HSP's, convert to seqfeatures, > >> then > >> sort them if order is important, I suppose. 
> >> > >> I would say, if the overall consensus is to modify it to have > >> multiple > >> domain hits per model (similar to BLAST) then Sendu should go > >> ahead and > >> make those changes then announce it on the list so no one can gripe > >> about it later. My main concern was not changing things so > >> dramatically > >> that it'll break for someone > > > > Going on your earlier suggestion, I was thinking about making > > SearchIO::hmmpfam instead, which would get used if you set the > > format to > > 'hmmpfam' instead of the generic 'hmmer' when making a SearchIO. I > > suppose I would make a SearchIO::hmmsearch as well, if necessary. > > > > > > [...] > >> that the reported bug about missing hits (Bug 2036) is fixed as well. > > > > However, having never made a SearchIO plugin before, it will be some > > time before I get my head around it. I'll want to make one the current > > HOWTO:SearchIO way before I can think about doing it a better way > > (hashes) as well. So I can say I'll make a move on this at some > > point in > > the future, but if someone wants to fix Bug 2036 in the mean time, > > they > > are welcome to. Again as suggested, my priority is Bio::Map right now. > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > Christopher Fields > Postdoctoral Researcher > Lab of Dr. 
Robert Switzer > Dept of Biochemistry > University of Illinois Urbana-Champaign > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From arareko at campus.iztacala.unam.mx Wed Jul 5 15:38:14 2006 From: arareko at campus.iztacala.unam.mx (Mauricio Herrera Cuadra) Date: Wed, 05 Jul 2006 10:38:14 -0500 Subject: [Bioperl-l] checkout_all fails on biodata In-Reply-To: <001901c6a044$999a14b0$15327e82@pyrimidine> References: <001901c6a044$999a14b0$15327e82@pyrimidine> Message-ID: <44ABDCE6.7090906@campus.iztacala.unam.mx> Same problem here. I've never used the bioperl_all alias before (I always check-out dirs individually), but to me it seems like a privileges issue as Chris suggests. Also browsed through all the repository in dev.open-bio.org and didn't found such lock file. I guess Chris D. or Jason will know better what's happening here. Mauricio. Chris Fields wrote: > I use TortoiseCVS via WinXP and I'm getting the same issue that Sendu has: > --------------------------- > In C:\Perl\src: "C:\Program Files\TortoiseCVS\cvs.exe" "-q" "--lf" > "checkout" "-P" "bioperl_all" > CVSROOT=:ext:cjfields at dev.open-bio.org:/home/repository/bioperl > > ... > > cvs checkout: failed to create lock directory for > `/home/repository/bioperl/biodata' > (/home/repository/bioperl/biodata/#cvs.lock): Permission denied > cvs checkout: failed to obtain dir lock in repository > `/home/repository/bioperl/biodata' > cvs [checkout aborted]: read lock failed - giving up > cvs.exe checkout: in directory bioperl: > cvs.exe checkout: cannot open CVS/Entries for reading: No such file or > directory > --------------------------- > > I had the same problem with schema (BioSQL) a while back. I tried again, > and... 
>
> ---------------------------
> cvs checkout: failed to create lock directory for
> `/home/repository/bioperl/biosql-schema'
> (/home/repository/bioperl/biosql-schema/#cvs.lock): Permission denied
> cvs checkout: failed to obtain dir lock in repository
> `/home/repository/bioperl/biosql-schema'
> cvs [checkout aborted]: read lock failed - giving up
> cvs.exe checkout: in directory .:
> cvs.exe checkout: cannot open CVS/Entries for reading: No such file or
> directory
> ---------------------------
>
> I believe it had something to do with CVS commit privileges (i.e. I had none
> for schema, which was fine). So maybe this is a permissions issue via the
> lock file? Looking at the alias:
>
> bioperl_all -d bioperl &core &db &run &pipeline &pedigree &biodata &schema
> &network &microarray
>
> This may mean that if anyone w/o commit privs for any of the above (specifically
> schema and biodata) tries checkout/update using bioperl_all, they may run
> into this problem.
>
> Since it's not integrated I don't see the problem with removing it from the
> alias, but if we follow the same line of logic (and privileges are the
> issue) then schema must be removed as well. To me it doesn't make much
> sense to not include schema though, since we can checkout/update bioperl-db.
>
> BTW, I like the idea of biodata as you've outlined it. Would be nice to
> gear the test suite towards a more general set of data for all the Bio*
> projects versus having each one come with their own, and the data could be
> updated a bit more frequently than t/data is. Seems like it would
> definitely save a large chunk of real estate for the distributions. If one
> wanted to run the full test suite then they would have to download biodata
> separately, though, but not a bad compromise. Though, if this is/was its
> intent, why would it need a lock file?
-- 
MAURICIO HERRERA CUADRA
arareko at campus.iztacala.unam.mx
Laboratorio de Genética
Unidad de Morfofisiología y Función
Facultad de Estudios Superiores Iztacala, UNAM

From bix at sendu.me.uk Thu Jul 6 08:41:57 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Thu, 06 Jul 2006 09:41:57 +0100
Subject: [Bioperl-l] Bio::Map changes
In-Reply-To: <449A9AF9.2000305@sendu.me.uk>
References: <44985915.8010607@sendu.me.uk> <449A9AF9.2000305@sendu.me.uk>
Message-ID: <44ACCCD5.3030309@sendu.me.uk>

Sendu Bala wrote:
> The next step is to tidy up all of Bio::Map*, which involves a major
> reimplementation of the whole system
[...]
> The reimplementation will make Position central to the model, allowing
> for lots of other things to work properly without anything becoming
> inconsistent (as is currently the case).

This is now done. It uses a new PositionHandler class behind the scenes.

The next step is to introduce relative positioning across the board, possibly in a way that makes OrderedPosition redundant or an implementer of the system.

Has anyone here ever used Bio::Map* modules for anything?
I would appreciate you sending me your code, especially if you've used MapIO, Physical (encompassing Clone, Contig, FPCMarker, OrderedPositionWithDistance) or LinkageMap (encompassing LinkagePosition, OrderedPosition, Microsatellite) since these have insufficient tests at the moment.

From nidage at yahoo.com Thu Jul 6 18:13:12 2006
From: nidage at yahoo.com (sss lll)
Date: Thu, 6 Jul 2006 11:13:12 -0700 (PDT)
Subject: [Bioperl-l] PrimarySeqI object Exception
Message-ID: <20060706181312.63581.qmail@web35607.mail.mud.yahoo.com>

Hi there,

I encountered a problem while calling module PrimarySeqI, with the following code:

    my $db=Bio::DB::Fasta->new($fafile);
    my $obj=$db->get_Seq_by_id($array_gene_name[$p]);
    $seqio->write_seq($obj);

The error message was:

    MSG: Did not provide a valid Bio::PrimarySeqI object
    STACK Bio::SeqIO::fasta::write_seq /usr/lib/perl5/site_perl/5.8.0/Bio/SeqIO/fasta.pm:178

We think it had to do with the length of the gene name. For example, the following gene name was a problem:

    gi|59711891|ref|YP_204667.1| NAD-specific glutamate dehydrogenase [Vibrio fischeri ES114]*VIB*COG2902*E

Any ideas on what happened?

Thanks

From rmb32 at cornell.edu Thu Jul 6 23:11:00 2006
From: rmb32 at cornell.edu (Robert Buels)
Date: Thu, 06 Jul 2006 16:11:00 -0700
Subject: [Bioperl-l] parser for GeneSeqer
In-Reply-To: <2FB066C7-12E6-46D8-8F4A-BD096BE2A0CA@uiuc.edu>
References: <44A558F2.2050304@cornell.edu> <2FB066C7-12E6-46D8-8F4A-BD096BE2A0CA@uiuc.edu>
Message-ID: <44AD9884.6040507@cornell.edu>

The Annotation/Annotatable stuff was going to be talked about at the GMOD meeting that just happened, wasn't it? What's the scoop on that?

Rob

Chris Fields wrote:
> If you plan on generating seqfeatures from this output you could check
> out the Bio::Tools core modules for examples.
> There are a few there
> that take program output and convert them to Bio::SeqFeature::Generic
> objects, including Bio::Tools::RNAMotif and Bio::Tools::tRNAscanSE. If
> alignments are involved you might want something like
> Bio::SeqFeature::FeaturePair. Not sure about using the
> SeqFeature::Annotation or others; I thought that some of the
> Annotation/Annotatable stuff might be changing soon but I may be wrong.
>
> Chris
>
> On Jun 30, 2006, at 12:01 PM, Robert Buels wrote:
>
>> Hi all,
>>
>> I find myself needing a parser for GeneSeqer output, so I'm writing one
>> (which I will submit for your consideration when it's working). In a
>> nutshell, GeneSeqer is a (kind of old) program for aligning a bunch of
>> ESTs to genomic sequence, then using those alignments to predict where
>> in the genomic sequence the genes are. So really what you get from this
>> is a bunch of hierarchical features.
>>
>> I don't really know where I should put it in the bioperl hierarchy
>> though. Probably FeatureIO?
>>
>> And what's the current fashion for objects it should emit?
>> Bio::SeqFeature::Generic? Bio::SeqFeature::Annotated?
>>
>> Rob
>>
>> --Robert Buels
>> SGN Bioinformatics Analyst
>> 252A Emerson Hall, Cornell University
>> Ithaca, NY 14853
>> Tel: 503-889-8539
>> rmb32 at cornell.edu
>> http://www.sgn.cornell.edu
>
> Christopher Fields
> Postdoctoral Researcher
> Lab of Dr.
> Robert Switzer
> Dept of Biochemistry
> University of Illinois Urbana-Champaign

-- 
Robert Buels
SGN Bioinformatics Analyst
252A Emerson Hall, Cornell University
Ithaca, NY 14853
Tel: 503-889-8539
rmb32 at cornell.edu
http://www.sgn.cornell.edu

From hlapp at gmx.net Thu Jul 6 23:27:31 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Thu, 6 Jul 2006 19:27:31 -0400
Subject: [Bioperl-l] parser for GeneSeqer
In-Reply-To: <44AD9884.6040507@cornell.edu>
References: <44A558F2.2050304@cornell.edu> <2FB066C7-12E6-46D8-8F4A-BD096BE2A0CA@uiuc.edu> <44AD9884.6040507@cornell.edu>
Message-ID: <6B530ED6-5825-47C4-A677-2C75E0F97E26@gmx.net>

No scoop b/c no time. I am busy w/ a grant and Lincoln had to leave early as well on Friday. Sorry.

On Jul 6, 2006, at 7:11 PM, Robert Buels wrote:

> The Annotation/Annotatable stuff was going to be talked about at the
> GMOD meeting that just happened, wasn't it? What's the scoop on that?
>
> Rob

-- 
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From cjfields at uiuc.edu Thu Jul 6 23:28:09 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 6 Jul 2006 18:28:09 -0500
Subject: [Bioperl-l] parser for GeneSeqer
In-Reply-To: <44AD9884.6040507@cornell.edu>
Message-ID: <000001c6a153$d78b83c0$15327e82@pyrimidine>

Not any word yet. Been pretty quiet, likely b/c everybody was there planning a roadmap for v1.6.
Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Robert Buels > Sent: Thursday, July 06, 2006 6:11 PM > To: bioperl-l at bioperl.org > Subject: Re: [Bioperl-l] parser for GeneSeqer > > The Annotation/Annotatable stuff was going to be talked about at the > GMOD meeting that just happened, wasn't it? What's the scoop on that? > > Rob > > > Chris Fields wrote: > > If you plan on generating seqfeatures from this output you could check > > out the Bio::Tools core modules for examples. There are a few there > > that take program output and convert them to Bio::SeqFeature::Generic > > objects, including Bio::Tools:RNAMotif and Bio::Tools::tRNAscanSE. If > > alignments are involved you might want something like > > Bio::SeqFeature::FeaturePair. Not sure about using the > > SeqFeature::Annotation or others; I thought that the some of the > > Annotation/Annotatable stuff might be changing soon but I may be wrong. > > > > Chris > > > > On Jun 30, 2006, at 12:01 PM, Robert Buels wrote: > > > >> Hi all, > >> > >> I find myself needing a parser for GeneSeqer output, so I'm writing one > >> (which I will submit for your consideration when it's working). In a > >> nutshell, GeneSeqer is a (kind of old) program for aligning a bunch of > >> ESTs to genomic sequence, then using those alignments to predict where > >> in the genomic sequence the genes are. So really what you get from > this > >> is a bunch of hierarchical features. > >> > >> I don't really know where I should put it in the bioperl hierarchy > >> though. Probably FeatureIO? > >> > >> And what's the current fashion for objects it should emit? > >> Bio::SeqFeature::Generic? Bio::SeqFeature::Annotated? 
> >> > >> Rob > >> > >> --Robert Buels > >> SGN Bioinformatics Analyst > >> 252A Emerson Hall, Cornell University > >> Ithaca, NY 14853 > >> Tel: 503-889-8539 > >> rmb32 at cornell.edu > >> http://www.sgn.cornell.edu > >> > >> > >> _______________________________________________ > >> Bioperl-l mailing list > >> Bioperl-l at lists.open-bio.org > >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > Christopher Fields > > Postdoctoral Researcher > > Lab of Dr. Robert Switzer > > Dept of Biochemistry > > University of Illinois Urbana-Champaign > > > > > > > > -- > Robert Buels > SGN Bioinformatics Analyst > 252A Emerson Hall, Cornell University > Ithaca, NY 14853 > Tel: 503-889-8539 > rmb32 at cornell.edu > http://www.sgn.cornell.edu > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From hlapp at gmx.net Thu Jul 6 23:41:44 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 6 Jul 2006 19:41:44 -0400 Subject: [Bioperl-l] parser for GeneSeqer In-Reply-To: <000001c6a153$d78b83c0$15327e82@pyrimidine> References: <000001c6a153$d78b83c0$15327e82@pyrimidine> Message-ID: Uhm - roadmap - I guess yes, but more that of the Golden State, or other states on the way, for Jason. On Jul 6, 2006, at 7:28 PM, Chris Fields wrote: > Not any word yet. Been pretty quiet, likely b/c everybody was there > planning a roadmap for v1.6. > > Chris > >> -----Original Message----- >> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- >> bounces at lists.open-bio.org] On Behalf Of Robert Buels >> Sent: Thursday, July 06, 2006 6:11 PM >> To: bioperl-l at bioperl.org >> Subject: Re: [Bioperl-l] parser for GeneSeqer >> >> The Annotation/Annotatable stuff was going to be talked about at the >> GMOD meeting that just happened, wasn't it? What's the scoop on >> that? 
>> >> Rob >> >> >> Chris Fields wrote: >>> If you plan on generating seqfeatures from this output you could >>> check >>> out the Bio::Tools core modules for examples. There are a few there >>> that take program output and convert them to >>> Bio::SeqFeature::Generic >>> objects, including Bio::Tools:RNAMotif and >>> Bio::Tools::tRNAscanSE. If >>> alignments are involved you might want something like >>> Bio::SeqFeature::FeaturePair. Not sure about using the >>> SeqFeature::Annotation or others; I thought that the some of the >>> Annotation/Annotatable stuff might be changing soon but I may be >>> wrong. >>> >>> Chris >>> >>> On Jun 30, 2006, at 12:01 PM, Robert Buels wrote: >>> >>>> Hi all, >>>> >>>> I find myself needing a parser for GeneSeqer output, so I'm >>>> writing one >>>> (which I will submit for your consideration when it's working). >>>> In a >>>> nutshell, GeneSeqer is a (kind of old) program for aligning a >>>> bunch of >>>> ESTs to genomic sequence, then using those alignments to predict >>>> where >>>> in the genomic sequence the genes are. So really what you get from >> this >>>> is a bunch of hierarchical features. >>>> >>>> I don't really know where I should put it in the bioperl hierarchy >>>> though. Probably FeatureIO? >>>> >>>> And what's the current fashion for objects it should emit? >>>> Bio::SeqFeature::Generic? Bio::SeqFeature::Annotated? >>>> >>>> Rob >>>> >>>> --Robert Buels >>>> SGN Bioinformatics Analyst >>>> 252A Emerson Hall, Cornell University >>>> Ithaca, NY 14853 >>>> Tel: 503-889-8539 >>>> rmb32 at cornell.edu >>>> http://www.sgn.cornell.edu >>>> >>>> >>>> _______________________________________________ >>>> Bioperl-l mailing list >>>> Bioperl-l at lists.open-bio.org >>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >>> >>> Christopher Fields >>> Postdoctoral Researcher >>> Lab of Dr. 
Robert Switzer
>>> Dept of Biochemistry
>>> University of Illinois Urbana-Champaign

--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From cjfields at uiuc.edu Thu Jul 6 23:49:23 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 6 Jul 2006 18:49:23 -0500
Subject: [Bioperl-l] parser for GeneSeqer
In-Reply-To:
Message-ID: <000101c6a156$cee60bc0$15327e82@pyrimidine>

Oh well. There's always BOSC...

Chris

> -----Original Message-----
> From: Hilmar Lapp [mailto:hlapp at gmx.net]
> Sent: Thursday, July 06, 2006 6:42 PM
> To: Chris Fields
> Cc: 'Robert Buels'; bioperl-l at bioperl.org
> Subject: Re: [Bioperl-l] parser for GeneSeqer
>
> Uhm - roadmap - I guess yes, but more that of the Golden State, or
> other states on the way, for Jason.

From osborne1 at optonline.net Fri Jul 7 01:06:32 2006
From: osborne1 at optonline.net (Brian Osborne)
Date: Thu, 06 Jul 2006 21:06:32 -0400
Subject: [Bioperl-l] PrimarySeqI object Exception
In-Reply-To: <20060706181312.63581.qmail@web35607.mail.mud.yahoo.com>
Message-ID:

sss lll,

What this error means is that $obj is not a valid Sequence object, this is what's passed to the write_seq method. What identifier is $array_gene_name[$p]?

Brian O.
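One way to approach the question above is to compare the lookup key with the ID the index is likely to hold. The sketch below is plain Perl and makes an assumption that this message does not confirm: that the index keys each record on the first whitespace-delimited word after ">" in the FASTA header. The helper name default_fasta_id is hypothetical and is not a BioPerl function; it only mimics that assumed default.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): return the ID an index would
# hold if it keyed each record on the first whitespace-delimited word that
# follows '>' in the header line. That default is an assumption here.
sub default_fasta_id {
    my ($header) = @_;
    return $header =~ /^>?(\S+)/ ? $1 : undef;
}

# Full description line as posted by the original poster.
my $name = 'gi|59711891|ref|YP_204667.1| NAD-specific glutamate '
         . 'dehydrogenase [Vibrio fischeri ES114]*VIB*COG2902*E';

my $id = default_fasta_id(">$name");

# Report when the key used for the lookup is not the assumed index ID.
if ($id ne $name) {
    print "lookup key differs from the assumed index ID\n";
    print "  key: $name\n";
    print "  id:  $id\n";
}
```

If the two strings differ, the lookup would presumably come back empty, which would be consistent with write_seq rejecting $obj. Testing `defined $obj` before calling write_seq is a cheap way to confirm that.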
On 7/6/06 2:13 PM, "sss lll" wrote: > Hi there, > > I encountered a problem while calling module > PrimarySeqI, with the following code: > > my $db=Bio::DB::Fasta->new($fafile); > my $obj=$db->get_Seq_by_id($array_gene_name[$p]); > $seqio->write_seq($obj); > > The error message was: > MSG: Did not provide a valid Bio::PrimarySeqI object > STACK Bio::SeqIO::fasta::write_seq > /usr/lib/perl5/site_perl/5.8.0/Bio/SeqIO/fasta.pm:178 > > We think it had to do with the lengh of the gene name. > For example the following gene name was a problem: > > gi|59711891|ref|YP_204667.1| NAD-specific glutamate > dehydrogenase [Vibrio fischeri ES114]*VIB*COG2902*E > > Any ideas on what happened? > > Thanks > > __________________________________________________ > Do You Yahoo!? > Tired of spam? Yahoo! Mail has the best spam protection around > http://mail.yahoo.com > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From rmb32 at cornell.edu Fri Jul 7 01:24:40 2006 From: rmb32 at cornell.edu (Robert Buels) Date: Thu, 06 Jul 2006 18:24:40 -0700 Subject: [Bioperl-l] FeatureIO tests failing on linux debian sarge Message-ID: <44ADB7D8.7080102@cornell.edu> I am stumped. On a fresh checkout from cvs (as of like 10 seconds ago): rob at rubisco:/usr/local/lib/site_perl/bioperl-live$ perl -v This is perl, v5.8.4 built for i386-linux-thread-multi Copyright 1987-2004, Larry Wall Perl may be copied only under the terms of either the Artistic License or the GNU General Public License, which may be found in the Perl 5 source kit. Complete documentation for Perl, including FAQ lists, should be found on this system using `man perl' or `perldoc perl'. If you have access to the Internet, point your browser at http://www.perl.com/, the Perl Home Page. 
rob at rubisco:/usr/local/lib/site_perl/Bio$ perl t/FeatureIO.t
1..22
ok 1
ok 2
ok 3
ok 4
ok 5
ok 6
Can't locate object method "get_Annotations" via package "Bio::SeqFeature::Annotated" at /usr/local/lib/site_perl/Bio/SeqFeature/Annotated.pm line 292, line 2.
ok 7 # Cannot complete FeatureIO tests
ok 8 # Cannot complete FeatureIO tests
ok 9 # Cannot complete FeatureIO tests
ok 10 # Cannot complete FeatureIO tests
ok 11 # Cannot complete FeatureIO tests
ok 12 # Cannot complete FeatureIO tests
ok 13 # Cannot complete FeatureIO tests
ok 14 # Cannot complete FeatureIO tests
ok 15 # Cannot complete FeatureIO tests
ok 16 # Cannot complete FeatureIO tests
ok 17 # Cannot complete FeatureIO tests
ok 18 # Cannot complete FeatureIO tests
ok 19 # Cannot complete FeatureIO tests
ok 20 # Cannot complete FeatureIO tests
ok 21 # Cannot complete FeatureIO tests
ok 22 # Cannot complete FeatureIO tests

However, the same code runs fine on my debian unstable machine (perl 5.8.8). Perhaps this is a bug in debian stable's perl?

I did some poking around through the code, changing @ISA = qw/.../ to use base and switching the order of inclusion in the ISA at the top of Bio::SeqFeature::Annotated: no dice.

Anybody able to reproduce this? Anyone have any ideas?

Rob

--
Robert Buels
SGN Bioinformatics Analyst
252A Emerson Hall, Cornell University
Ithaca, NY 14853
Tel: 503-889-8539
rmb32 at cornell.edu
http://www.sgn.cornell.edu

From cjfields at uiuc.edu Fri Jul 7 02:30:25 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 6 Jul 2006 21:30:25 -0500
Subject: [Bioperl-l] FeatureIO tests failing on linux debian sarge
In-Reply-To: <44ADB7D8.7080102@cornell.edu>
Message-ID: <000001c6a16d$4dd7e6e0$15327e82@pyrimidine>

I don't get any issues (all tests pass), except a few warning messages, which are normal; some ontology handling is not implemented.

Usually when running tests I use 'perl -I. t/test.t' to force it to use the core directory first. You might try that to see if it 'fixes' the problem. If it does, there may be another bioperl installation in @INC being used instead of your current directory.

Chris

From chandan.kr.singh at gmail.com Fri Jul 7 05:23:40 2006
From: chandan.kr.singh at gmail.com (CHANDAN SINGH)
Date: Fri, 7 Jul 2006 10:53:40 +0530
Subject: [Bioperl-l] PrimarySeqI object Exception
In-Reply-To:
References: <20060706181312.63581.qmail@web35607.mail.mud.yahoo.com>
Message-ID: <2d4f320607062223y520a1375lb30cf40c1c883702@mail.gmail.com>

Hi,

By default, the id is the first word encountered, i.e. the first string after ">", separated from the rest by a space. The sample id you mentioned in your first mail contains spaces and, as I mentioned in my previous mail, I am sure the ids made by indexing and the ones you are using in the array are different. You can see the ids used in indexing by using

@ids = $db->ids();
print join("\n", @ids);

Cheers
Chandan

On 7/7/06, Brian Osborne wrote:
>
> sss lll,
>
> What this error means is that $obj is not a valid Sequence object, this is
> what's passed to the write_seq method. What identifier is
> $array_gene_name[$p]?
>
> Brian O.
>
>
> On 7/6/06 2:13 PM, "sss lll" wrote:
>
> > Hi there,
> >
> > I encountered a problem while calling module
> > PrimarySeqI, with the following code:
> >
> > my $db=Bio::DB::Fasta->new($fafile);
> > my $obj=$db->get_Seq_by_id($array_gene_name[$p]);
> > $seqio->write_seq($obj);
> >
> > The error message was:
> > MSG: Did not provide a valid Bio::PrimarySeqI object
> > STACK Bio::SeqIO::fasta::write_seq
> > /usr/lib/perl5/site_perl/5.8.0/Bio/SeqIO/fasta.pm:178
> >
> > We think it had to do with the lengh of the gene name.
> > For example the following gene name was a problem:
> >
> > gi|59711891|ref|YP_204667.1| NAD-specific glutamate
> > dehydrogenase [Vibrio fischeri ES114]*VIB*COG2902*E
> >
> > Any ideas on what happened?
> >
> > Thanks
> >
> > __________________________________________________
> > Do You Yahoo!?
> > Tired of spam? Yahoo!
Mail has the best spam protection around > > http://mail.yahoo.com > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > From selvik at ufl.edu Fri Jul 7 16:07:03 2006 From: selvik at ufl.edu (Selvi Kadirvel) Date: Fri, 7 Jul 2006 12:07:03 -0400 Subject: [Bioperl-l] Bio::SearchIO::hmmer hsp behaviour In-Reply-To: <001a01c6a048$cb802420$15327e82@pyrimidine> References: <001a01c6a048$cb802420$15327e82@pyrimidine> Message-ID: <1A5235F4-87E6-42D7-9796-7FEB8F7C04E5@ufl.edu> Chris: I just tried it out, and it looks like this solution works fine for me. Thank you for the fix! -Selvi On Jul 5, 2006, at 11:36 AM, Chris Fields wrote: > Okay, I managed to figure out what the problem was. I committed a > fix in > CVS for the initial bug (Selvi's missing hits). Still has one HSP > per hit > for now; I think it will take a bit more effort to get a BLAST-like > multi > HSP/hit up and running. > > Selvi, update from CVS to see if that works. > > Chris > >> -----Original Message----- >> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- >> bounces at lists.open-bio.org] On Behalf Of Chris Fields >> Sent: Friday, June 30, 2006 12:44 PM >> To: Sendu Bala; Jason Stajich >> Cc: bioperl-l at lists.open-bio.org list >> Subject: Re: [Bioperl-l] Bio::SearchIO::hmmer hsp behaviour >> >> I'll try looking at it this weekend. A suggested workaround is to >> either try setting -A for no alignments or setting it to a high >> number to retrieve all of them. It's pretty serious as the error >> silently dumps those domains, so for those using automated annotation >> pipelines would miss it unless they are also checking the raw output. 
>> >> You could design a SearchIO::hmmpfam parser then expand it to take in >> hmmsearch output at a later point, or keep them separate. I like the >> idea of having modules that are more specific about what they parse; >> seems at some point you reach serious code bloat and maintenance >> becomes an issue. Look at SearchIO::blast; it parses various text >> BLAST output very well but with some serious obfuscation. Just don't >> know how productive it would be to separate out the PSI-BLAST and >> bl2seq stuff since they are pretty close to a standard BLAST >> report... oh well. >> >> To Jason : good luck on your move. Drop us a line here to let us >> know everything went well. >> >> Chris >> >> On Jun 30, 2006, at 11:14 AM, Sendu Bala wrote: >> >>> Chris Fields wrote: >>>> It may have been just simpler to have it be one HSP (domain) per >>>> Hit >>>> (model) as that's how the reports are generated. My reasoning was >>>> that >>>> using the one domain per model made sense based on what you are >>>> actually >>>> trying to do, which is annotate the sequence based on the order the >>>> domain appears. Most others may not view it that way, which is >>>> fine. >>>> One can always gather the relevant HSP's, convert to seqfeatures, >>>> then >>>> sort them if order is important, I suppose. >>>> >>>> I would say, if the overall consensus is to modify it to have >>>> multiple >>>> domain hits per model (similar to BLAST) then Sendu should go >>>> ahead and >>>> make those changes then announce it on the list so no one can gripe >>>> about it later. My main concern was not changing things so >>>> dramatically >>>> that it'll break for someone >>> >>> Going on your earlier suggestion, I was thinking about making >>> SearchIO::hmmpfam instead, which would get used if you set the >>> format to >>> 'hmmpfam' instead of the generic 'hmmer' when making a SearchIO. I >>> suppose I would make a SearchIO::hmmsearch as well, if necessary. >>> >>> >>> [...] 
>>>> that the reported bug about missing hits (Bug 2036) is fixed as >>>> well. >>> >>> However, having never made a SearchIO plugin before, it will be some >>> time before I get my head around it. I'll want to make one the >>> current >>> HOWTO:SearchIO way before I can think about doing it a better way >>> (hashes) as well. So I can say I'll make a move on this at some >>> point in >>> the future, but if someone wants to fix Bug 2036 in the mean time, >>> they >>> are welcome to. Again as suggested, my priority is Bio::Map right >>> now. >>> _______________________________________________ >>> Bioperl-l mailing list >>> Bioperl-l at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> >> Christopher Fields >> Postdoctoral Researcher >> Lab of Dr. Robert Switzer >> Dept of Biochemistry >> University of Illinois Urbana-Champaign >> >> >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > > From cjfields at uiuc.edu Fri Jul 7 16:16:30 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Fri, 7 Jul 2006 11:16:30 -0500 Subject: [Bioperl-l] Bio::SeqFeatureI spliced_seq Message-ID: <002a01c6a1e0$b4e2b360$15327e82@pyrimidine> There is a reported bug (Bug 2039) which I found an easy fix for; the issue is that spliced_seq, as currently implemented, has two optional arguments: my ($self, $db, $nosort) = @_; $db is-a Bio::DB::RandomAccessI; $nosort is a flag so that locations aren't sorted before splicing, which is crux of the bug. So, to set $nosort you must also set $db to either undef or a Bio::DB::RandomAccessI (a point not made in the docs and not immediately clear to the user). Would it make more sense to have something like this (using $self->_rearrange to get the options)? 

my $seq = $sf->spliced_seq(-nosort => 1);
my $seq = $sf->spliced_seq(-db => $db);
my $seq = $sf->spliced_seq(-nosort => 1, -db => $db);

Christopher Fields
Postdoctoral Researcher - Switzer Lab
Dept. of Biochemistry
University of Illinois Urbana-Champaign

From vebaev at gmail.com Sat Jul 8 20:59:40 2006
From: vebaev at gmail.com (Vesselin Baev)
Date: Sat, 08 Jul 2006 23:59:40 +0300
Subject: [Bioperl-l] BLAST running options
Message-ID: <44B01CBC.9070404@gmail.com>

Hi,

I'm parsing BLAST results, but I need a BLAST option to limit and decrease the number of results. I'm blasting an oligo of about 40 nt and I need only results that either have mismatches (not more than 10) or match exactly, but over the same length as the query (40). I do not want the large number of results that BLAST gives me for shorter matches.

Does anyone know what kind of BLAST option to use?

Thanks

--
------------------------------------------------

University of Plovdiv
Faculty of Biology
Dept. Molecular Biology and Plant Physiology
Tzar Asen 24
Plovdiv 4000, BULGARIA
vebaev at gmail.com
tel.00359889034044

From cjfields at uiuc.edu Sat Jul 8 23:15:29 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Sat, 8 Jul 2006 18:15:29 -0500
Subject: [Bioperl-l] BLAST running options
In-Reply-To: <44B01CBC.9070404@gmail.com>
References: <44B01CBC.9070404@gmail.com>
Message-ID: <95D47990-9B63-444D-B386-04219D21DC39@uiuc.edu>

There were some posts about this a few months back.

http://bioperl.org/pipermail/bioperl-l/2006-April/021341.html

Essentially, most responders suggested not using BLAST, but I believe there were a few who gave pointers.

Chris

On Jul 8, 2006, at 3:59 PM, Vesselin Baev wrote:

> Hi,
> I'm parsing Blast results, but I need an Blast option to limit
> limit and
> decrease the Blast number of results.
> I'm blasting an oligo about 40nt and I need only results which are
> with
> mismatches (not more than 10) or exactly matching but in the length as
> the query - 40.
> I do not want all the big amount of results that blast gave me about > shorter matching. > > Do anyone knows what king of BLAST option to use? > Thanks > > -- > ------------------------------------------------ > > University of Plovdiv > Faculty of Biology > Dept. Molecular Biology and Plant Physiology > Tzar Asen 24 > Plovdiv 4000, BULGARIA > vebaev at gmail.com > tel.00359889034044 > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From cjfields at uiuc.edu Mon Jul 10 21:09:12 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 10 Jul 2006 16:09:12 -0500 Subject: [Bioperl-l] How to use gi2taxonid Message-ID: <000301c6a465$182025d0$15327e82@pyrimidine> Hubert, In case you didn't get this going, there may be another option now. I have started work on a new set of modules called Bio::DB::EUtilities in bioperl-live, intended as a back-end for NCBI database searches. It can be used directly if needed though. You can use EPost/Elink to directly retrieve the taxonIDs using the following script (pass a file containing the protein/nucleotide primary ID on command line). The below retrieves taxonid's using protein GI's: use Bio::DB::EUtilities; my @ids; while (my $id = <>) { chomp $id; push @ids, $id; } my $epost = Bio::DB::EUtilities->new( -eutil => 'epost', -db => 'protein', -id => \@ids, ); $epost->get_response; my $elink = Bio::DB::EUtilities->new( -eutil => 'elink', -cookie => $epost->next_cookie, -db => 'taxonomy', ); $elink->get_response; my @tax_ids = $elink->get_db_ids; Chris > hi, > I have downloaded the gi2taxonid file to get the taxonid for a GI > number > taken from a report as recommended here, but I don't know how to > use the > gi2taxonid file. 
> Jason wrote in a previous post that you have to make a DB_File out of
> it, but I don't know how....and finally tie it to a hash....
> Can anybody give me a hint how to use it..... my final goal is to get
> the taxonomy.
>
> thanks
> Hubert

Christopher Fields
Postdoctoral Researcher - Switzer Lab
Dept. of Biochemistry
University of Illinois Urbana-Champaign

From hubert.prielinger at gmx.at Mon Jul 10 23:53:26 2006
From: hubert.prielinger at gmx.at (Hubert Prielinger)
Date: Mon, 10 Jul 2006 17:53:26 -0600
Subject: [Bioperl-l] How to use gi2taxonid
In-Reply-To: <000301c6a465$182025d0$15327e82@pyrimidine>
References: <000301c6a465$182025d0$15327e82@pyrimidine>
Message-ID: <44B2E876.2020200@gmx.at>

Hi Chris,

thanks for your response. Actually, I have done it with the EUtils, because I have only accession ids and there is no possibility to retrieve the taxonomy directly for an accession id. Because the xml files you retrieve are very small, I first pass the accession id to esearch and parse the Uid from the xml file, then pass the Uid to esummary and parse the tax id from the xml, and finally pass the tax id to esummary again, retrieve the taxonomy and parse it. I know it is a little bit intricate, but it works fine. Thanks.

regards
Hubert

Chris Fields wrote:
> Hubert,
>
> In case you didn't get this going, there may be another option now. I have
> started work on a new set of modules called Bio::DB::EUtilities in
> bioperl-live, intended as a back-end for NCBI database searches. It can be
> used directly if needed though. You can use EPost/Elink to directly
> retrieve the taxonIDs using the following script (pass a file containing the
> protein/nucleotide primary ID on command line).
The below retrieves > taxonid's using protein GI's: > > > use Bio::DB::EUtilities; > my @ids; > > while (my $id = <>) { > chomp $id; > push @ids, $id; > } > > my $epost = Bio::DB::EUtilities->new( > -eutil => 'epost', > -db => 'protein', > -id => \@ids, > ); > > $epost->get_response; > > my $elink = Bio::DB::EUtilities->new( > -eutil => 'elink', > -cookie => $epost->next_cookie, > -db => 'taxonomy', > ); > > $elink->get_response; > > my @tax_ids = $elink->get_db_ids; > > > > Chris > > >> hi, >> I have downloaded the gi2taxonid file to get the taxonid for a GI >> number >> taken from a report as recommended here, but I don't know how to >> use the >> gi2taxonid file. >> Jason wrote in a previous post that you have to make a DB_File out of >> it, but I don't know how....and finally tie it to a hash.... >> Can anybody give me a hint how to use it..... my final goal is to get >> the taxonomy. >> >> thanks >> Hubert >> > > Christopher Fields > Postdoctoral Researcher - Switzer Lab > Dept. of Biochemistry > University of Illinois Urbana-Champaign > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > From MEC at stowers-institute.org Tue Jul 11 00:25:11 2006 From: MEC at stowers-institute.org (Cook, Malcolm) Date: Mon, 10 Jul 2006 19:25:11 -0500 Subject: [Bioperl-l] Bio::Tools::Fgenesh bug? and fix? Message-ID: I am finding the Bio::Tools::Fgenesh parser to incorrectly handle the feature coordinates on - strand predictions. In particular, start & end are deliberately reversed if the strand is '-'. I guess this was a holdover from Genscan.pm and wasn't really tested !?!?! Or, perhaps fgenesh v 2.4 which I am running has different output in this respect compared to the version 2.0, against which this module was written. Or, perhaps my understanding is blotto (known to happen). Does anyone know for sure? If I comment out selected lines... 

#    if($predobj->strand() == 1) {
        $predobj->start($start);
        $predobj->end($end);
#    } else {
#        $predobj->end($start);
#        $predobj->start($end);
#    }

... then GFF produced by my naive fgenesh2gff script below is correct (at least w.r.t. strand and coordinates - GFF compatibility purists might wince).

Should I commit this change to head?

Malcolm Cook
Database Applications Manager, Bioinformatics
Stowers Institute for Medical Research

#!/usr/bin/env perl

# fgenesh2gff
# PURPOSE: parse fgenesh output into gff
# USAGE: fgenesh fish somefish.dna | fgenesh2gff > somefish.dna.fgenesh.gff

use strict;
use warnings;
use Bio::Tools::Fgenesh;
use Bio::FeatureIO;
use Bio::Tools::GFF;    # loaded explicitly because the writer below is a Bio::Tools::GFF

# Remaining options should name files to process, but if none, process
# standard input:
@ARGV = ('-') unless @ARGV;
my $fgenesh = Bio::Tools::Fgenesh->new(-fh => \*ARGV);

my $featureout = new Bio::Tools::GFF(
    -gff_version => 2,    #whatever ;)
);
my $IDNUM = 0;
while (my $gene = $fgenesh->next_prediction()) {
    # Tag each gene with a running ID and write it, then write its exons
    # with a Parent tag pointing back at that gene.
    my $ID = "fgenesh" . ++ $IDNUM;
    $gene->add_tag_value('ID', $ID);
    $featureout->write_feature($gene);
    foreach ($gene->exons()) {
        $_->add_tag_value('Parent', $ID);
        $_->seq_id($gene->seq_id);
        $featureout->write_feature($_);
    }
}
$fgenesh->close();

exit 0;

From chris at dwan.org Tue Jul 11 02:06:41 2006
From: chris at dwan.org (Christopher Dwan)
Date: Mon, 10 Jul 2006 22:06:41 -0400
Subject: [Bioperl-l] Bio::Tools::Fgenesh bug? and fix?
In-Reply-To:
References:
Message-ID:

I'm not surprised that there are parts that don't work right. I copied genscan.pm and made the absolute minimal changes required to get what I needed working. Haven't touched it since.

Please feel free to do what needs to be done, and sorry about the mess.

-Chris Dwan

On Jul 10, 2006, at 8:25 PM, Cook, Malcolm wrote:

> I am finding the Bio::Tools::Fgenesh parser to incorrectly handle the
> feature coordinates on - strand predictions.
>
> In particular, start & end are deliberately reversed if the strand is
> '-'.
> > I guess this was a holdover from Genscan.pm and wasn't really tested > !?!?! > > Or, perhaps fgenesh v 2.4 which I am running has different output in > this respect compared to the version 2.0, against which this module > was > written. > > Or, perhaps my understanding is blotto (known to happen). > > Does anyone know for sure? > > If I comment out selected lines... > > # if($predobj->strand() == 1) { > $predobj->start($start); > $predobj->end($end); > # } else { > # $predobj->end($start); > # $predobj->start($end); > # } > > ... then GFF produced by my naive fgenesh2gff script below is correct > (at least w.r.t. strand and coordinates - GFF compatibility purists > might wince). > > Should I commit this change to head? > > > Malcolm Cook > Database Applications Manager, Bioinformatics > Stowers Institute for Medical Research > > > > #!/usr/bin/env perl > > # fgenesh2gff > # PURPOSE: parse fgenesh output into gff > # USAGE: fgenesh fish somefish.dna | fgenesh2gff > > somefish.dna.fgenesh.gff > > use strict; > use warnings; > use Bio::Tools::Fgenesh; > use Bio::FeatureIO; > > # Remaining options should name files to process, but if none, process > # standard input: > @ARGV = ('-') unless @ARGV; > my $fgenesh = Bio::Tools::Fgenesh->new(-fh => \*ARGV); > > my $featureout = new Bio::Tools::GFF( > -gff_version => 2, #whatever ;) > ); > my $IDNUM = 0; > while (my $gene = $fgenesh->next_prediction()) { > my $ID = "fgenesh" . ++ $IDNUM; > $gene->add_tag_value('ID', $ID); > $featureout->write_feature($gene); > foreach ($gene->exons()) { > $_->add_tag_value('Parent', $ID); > $_->seq_id($gene->seq_id); > $featureout->write_feature($_); > } > } > $fgenesh->close(); > > exit 0; > From rvosa at sfu.ca Tue Jul 11 08:58:46 2006 From: rvosa at sfu.ca (Rutger Vos) Date: Tue, 11 Jul 2006 01:58:46 -0700 Subject: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading? 
Message-ID: <44B36846.8070103@sfu.ca>

Dear all,

would it be possible to overload Bio::Root::RootI's 'throw' method to accept an additional, optional (positional) argument to define the exception class, e.g. using Exception::Class:

# ...somewhere ...
sub makefh {
    my ( $self, $filename ) = @_;
    open my $fh, '<', $filename
        or $self->throw("Can't open file: $!", 'Bio::Exceptions::FileIO'); # NOTE second argument
    return $fh;
}

#.... somewhere else
my $fh;
eval { $fh = $obj->makefh( 'data.txt'); };
if ( $@ and $@->isa('Bio::Exceptions::FileIO') ) {
    # something's wrong with the file?
}

--
++++++++++++++++++++++++++++++++++++++++++++++++++++
Rutger Vos, PhD. candidate
Department of Biological Sciences
Simon Fraser University
8888 University Drive
Burnaby, BC, V5A1S6
Phone: 604-291-5625
Fax: 604-291-3496
Personal site: http://www.sfu.ca/~rvosa
FAB* lab: http://www.sfu.ca/~fabstar
Bio::Phylo: http://search.cpan.org/~rvosa/Bio-Phylo/
++++++++++++++++++++++++++++++++++++++++++++++++++++

From khoiwal_tara at yahoo.co.in Tue Jul 11 12:19:17 2006
From: khoiwal_tara at yahoo.co.in (Khoiwal Tara)
Date: Tue, 11 Jul 2006 05:19:17 -0700 (PDT)
Subject: [Bioperl-l] Need help in needle parser
Message-ID: <20060711121918.64429.qmail@web8513.mail.in.yahoo.com>

Hi,

I want to parse the output of needle. I tried but wasn't able to get the expected output.

My code is as follows:

#!/usr/local/bin/perl

use strict;
use warnings;
use Bio::AlignIO;
my $needleReport = $ARGV[0];

my $in = new Bio::AlignIO(-format => 'emboss', -file => $needleReport);

while (my $align = $in->next_aln()) {
    print "Alignment Length:".$align->length()."\n";
    print "Percentage Identity:".$align->percentage_identity()."\n";
    print "Consensus string:".$align->consensus_string()."\n";
    print "Number of sequences:".$align->no_sequence()."\n";
    print "Number of residues:".$align->no_residues()."\n";
}

But it doesn't go inside the while loop. Please help me.
How do I find the alignment position of the query sequence on the target sequence from the needle output? Where can I find good documentation on the needle parser and its usage? Is there a good document on BioPerl for beginners? Regards, Tara Khoiwal. From cjfields at uiuc.edu Tue Jul 11 12:59:07 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 11 Jul 2006 07:59:07 -0500 Subject: [Bioperl-l] Need help in needle parser In-Reply-To: <20060711121918.64429.qmail@web8513.mail.in.yahoo.com> References: <20060711121918.64429.qmail@web8513.mail.in.yahoo.com> Message-ID: <250EEE60-48D0-4844-B0C0-13E17E60965C@uiuc.edu> perldoc Bio::AlignIO perldoc Bio::AlignIO::emboss http://www.bioperl.org/wiki/FAQ http://www.bioperl.org/wiki/HOWTO:Beginners http://www.bioperl.org/wiki/Bptutorial.pl http://www.catb.org/~esr/faqs/smart-questions.html Google is your friend! If it isn't entering the while loop, there are two possibilities: 1) Something is wrong with the file 2) The parser isn't reading the file correctly In order to know which, we will need to see the alignment itself. Chris On Jul 11, 2006, at 7:19 AM, Khoiwal Tara wrote: > Hi, > I want to parse the output of needle.I tried but didn't able to > get expected output. > > My code is as follows: > > #!/usr/local/bin/perl > > use strict; > use warnings; > use Bio::AlignIO; > my $needleReport = $ARGV[0]; > > my $in = new Bio::AlignIO(-format => 'emboss',-file =>$needleReport); > > while(my $align = $in->next_aln()){ > print "Alignment Length:".$align->length()."\n"; > print "Percentage Identity:".$align->percentage_identity()."\n"; > print "Consensus string:".$align->consensus_string()."\n"; > print "Number of sequences:".$align->no_sequence()."\n"; > print "Number of residues:".$align->no_residues()."\n"; > } > > But it doesn't go inside the while loop. > Pls help me.
> How to find the alignment position for the query sequence on the > target sequence from the needle output? > Where can i find the good documentation on needle parser and its > usage? > Good document on bioperl for beginners. > > Regards, > Tara Khoiwal. > > > --------------------------------- > Sneak preview the all-new Yahoo.com. It's not radically different. > Just radically better. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From cjfields at uiuc.edu Tue Jul 11 13:13:23 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 11 Jul 2006 08:13:23 -0500 Subject: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading? In-Reply-To: <44B36846.8070103@sfu.ca> References: <44B36846.8070103@sfu.ca> Message-ID: <954D76EB-F8D3-4652-957B-6ED4BBAEE209@uiuc.edu> I suppose you could; Bio::Root::Root does that using Error.pm (if it is installed). It almost sounds like what Bio::Root::Root does is what you want, but you want a little more information when exceptions are thrown maybe? from perldoc Bio::Root::Root: ... # Alternatively, using the new typed exception syntax in the throw() call: $obj->throw( -class => 'Bio::Root::BadParameter', -text => "Can not open file $file", -value => $file); ... Typed Exception Syntax The typed exception syntax of throw() has the advantage of plainly indicating the nature of the trouble, since the name of the class is included in the title of the exception output. To take advantage of this capability, you must specify arguments as named parameters in the throw() call. Here are the parameters: -class name of the class of the exception. 
This should be one of the classes defined in Bio::Root::Exception, or a custom error of yours that extends one of the exceptions defined in Bio::Root::Exception. -text a sensible message for the exception -value the value causing the exception or $!, if appropriate. Note that Bio::Root::Exception does not need to be imported into your module (or script) namespace in order to throw exceptions via Bio::Root::Root::throw(), since Bio::Root::Root imports it. Chris On Jul 11, 2006, at 3:58 AM, Rutger Vos wrote: > Dear all, > > would it be possible to overload Bio::Root::RootI's 'throw' method to > accept an additional, optional (positional) argument to define the > exception class, e.g. using Exception::Class: > > # ...somewhere ... > > sub makefh { > my ( $self, $filename ) = @_; > open my $fh, '<' $filename or $self->throw("Can't open file: $!", > 'Bio::Exceptions::FileIO'); # NOTE second argument > return $fh; > } > > #.... somewhere else > my $fh; > eval { > $fh = $obj->makefh( 'data.txt'); > } > if ( $@ and $@->isa('Bio::Exceptions::FileIO') ) { > # something's wrong with the file? > } > > -- > ++++++++++++++++++++++++++++++++++++++++++++++++++++ > Rutger Vos, PhD. candidate > Department of Biological Sciences > Simon Fraser University > 8888 University Drive > Burnaby, BC, V5A1S6 > Phone: 604-291-5625 > Fax: 604-291-3496 > Personal site: http://www.sfu.ca/~rvosa > FAB* lab: http://www.sfu.ca/~fabstar > Bio::Phylo: http://search.cpan.org/~rvosa/Bio-Phylo/ > ++++++++++++++++++++++++++++++++++++++++++++++++++++ > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. 
Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From cjfields at uiuc.edu Tue Jul 11 15:25:32 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 11 Jul 2006 10:25:32 -0500 Subject: [Bioperl-l] Need help in needle parser In-Reply-To: <20060711132601.46368.qmail@web8510.mail.in.yahoo.com> Message-ID: <001601c6a4fe$3ff7ca10$15327e82@pyrimidine> There are a few odd things about the data you sent; the FASTA files aren't in FASTA format (they are raw) and the needle output doesn't have sequence names. You could try running these through needle with descriptors to see if that helps, but it is very likely my option #2 (i.e. the parser doesn't recognize the format). There is a thread on the mail list about this issue: http://thread.gmane.org/gmane.comp.lang.perl.bio.general/8926/focus=8935 Basically, it looks like the needle output has changed dramatically in EMBOSS v3. Jason's suggested options from the above thread (as well as mine): I think the "emboss" format changed in 3.0.0. Solutions: a) fix the AlignIO::emboss parser to handle both flavors (old and new) b) have it output MSF format and use AlignIO::msf. So, as a workaround, use MSF output. I won't have time to look at this anytime soon as I'm busy at $job and getting ready for a summer institute; I'll submit this as a bug to see if someone else can tackle it before I get back in early August. Chris _____ From: Khoiwal Tara [mailto:khoiwal_tara at yahoo.co.in] Sent: Tuesday, July 11, 2006 8:26 AM To: Chris Fields Subject: Re: [Bioperl-l] Need help in needle parser I am sending my testing data to you. I have two fasta files "GenomicSeq.fasta" and "TranscriptSeq.fasta". I ran needle on these files as follows: $ needle GenomicSeq.fasta TranscriptSeq.fasta outfile.needle So the output of needle will get stored in outfile.needle. I am attaching the output file also. Please check it and tell me if it has any problem. Is my output file correct? Thanks and Regards, Tara.
Chris Fields wrote: perldoc Bio::AlignIO perldoc Bio::AlignIO::needle http://www.bioperl.org/wiki/FAQ http://www.bioperl.org/wiki/HOWTO:Beginners http://www.bioperl.org/wiki/Bptutorial.pl http://www.catb.org/~esr/faqs/smart-questions.html Google is your friend! If it isn't entering the while loop, there are two possibilities: 1) Something is wrong with the file 2) The parser isn't reading the file correctly In order to know which, we will need to see the alignment itself. Chris On Jul 11, 2006, at 7:19 AM, Khoiwal Tara wrote: > Hi, > I want to parse the output of needle.I tried but didn't able to > get expected output. > > My code is as follows: > > #!/usr/local/bin/perl > > use strict; > use warnings; > use Bio::AlignIO; > my $needleReport = $ARGV[0]; > > my $in = new Bio::AlignIO(-format => 'emboss',-file =>$needleReport); > > while(my $align = $in->next_aln()){ > print "Alignment Length:".$align->length()."\n"; > print "Percentage Identity:".$align->percentage_identity()."\n"; > print "Consensus string:".$align->consensus_string()."\n"; > print "Number of sequences:".$align->no_sequence()."\n"; > print "Number of residues:".$align->no_residues()."\n"; > } > > But it doesn't go inside the while loop. > Pls help me. > How to find the alignment position for the query sequence on the > target sequence from the needle output? > Where can i find the good documentation on needle parser and its > usage? > Good document on bioperl for beginners. > > Regards, > Tara Khoiwal. > > > --------------------------------- > Sneak preview the all-new Yahoo.com. It's not radically different. > Just radically better. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign _____ Do you Yahoo!? Next-gen email? Have it all with the all-new Yahoo! 
Mail Beta. From MEC at stowers-institute.org Tue Jul 11 15:56:40 2006 From: MEC at stowers-institute.org (Cook, Malcolm) Date: Tue, 11 Jul 2006 10:56:40 -0500 Subject: [Bioperl-l] Bio::Tools::Fgenesh bug? and fix? Message-ID: Got it. Commits made. Thanks for the history lesson. Cheers, Malcolm Cook >-----Original Message----- >From: Christopher Dwan [mailto:chris at dwan.org] >Sent: Monday, July 10, 2006 9:07 PM >To: Cook, Malcolm >Cc: bioperl-l >Subject: Re: Bio::Tools::Fgenesh bug? and fix? > > >I'm not surprised that there are parts that don't work right, I coped >genscan.pm and made the absolute minimal changes required to get what >I needed working. Haven't touched it since. > >Please feel free to do what needs to be done, and sorry about the mess. > >-Chris Dwan > >On Jul 10, 2006, at 8:25 PM, Cook, Malcolm wrote: > >> I am finding the Bio::Tools::Fgenesh parser to incorrectly handle the >> feature coordinates on - strand predictions. >> >> In particular, start & end are deliberately reversed if the strand is >> '-'. >> >> I guess this was a holdover from Genscan.pm and wasn't really tested >> !?!?! >> >> Or, perhaps fgenesh v 2.4 which I am running has different output in >> this respect compared to the version 2.0, against which this module >> was >> written. >> >> Or, perhaps my understanding is blotto (known to happen). >> >> Does anyone know for sure? >> >> If I comment out selected lines... >> >> # if($predobj->strand() == 1) { >> $predobj->start($start); >> $predobj->end($end); >> # } else { >> # $predobj->end($start); >> # $predobj->start($end); >> # } >> >> ... then GFF produced by my naive fgenesh2gff script below is correct >> (at least w.r.t. strand and coordinates - GFF compatibility purists >> might wince). >> >> Should I commit this change to head? 
>> >> >> Malcolm Cook >> Database Applications Manager, Bioinformatics >> Stowers Institute for Medical Research >> >> >> >> #!/usr/bin/env perl >> >> # fgenesh2gff >> # PURPOSE: parse fgenesh output into gff >> # USAGE: fgenesh fish somefish.dna | fgenesh2gff > >> somefish.dna.fgenesh.gff >> >> use strict; >> use warnings; >> use Bio::Tools::Fgenesh; >> use Bio::FeatureIO; >> >> # Remaining options should name files to process, but if >none, process >> # standard input: >> @ARGV = ('-') unless @ARGV; >> my $fgenesh = Bio::Tools::Fgenesh->new(-fh => \*ARGV); >> >> my $featureout = new Bio::Tools::GFF( >> -gff_version => 2, #whatever ;) >> ); >> my $IDNUM = 0; >> while (my $gene = $fgenesh->next_prediction()) { >> my $ID = "fgenesh" . ++ $IDNUM; >> $gene->add_tag_value('ID', $ID); >> $featureout->write_feature($gene); >> foreach ($gene->exons()) { >> $_->add_tag_value('Parent', $ID); >> $_->seq_id($gene->seq_id); >> $featureout->write_feature($_); >> } >> } >> $fgenesh->close(); >> >> exit 0; >> > > From cjfields at uiuc.edu Tue Jul 11 16:04:49 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 11 Jul 2006 11:04:49 -0500 Subject: [Bioperl-l] Need help in needle parser In-Reply-To: <20060711132601.46368.qmail@web8510.mail.in.yahoo.com> Message-ID: <000101c6a503$bd982eb0$15327e82@pyrimidine> Okay, I take that back. Bio::AlignIO::emboss does parse EMBOSS v3 needle output! The fact that it doesn't parse your alignment is b/c there are no sequence descriptors in the file for the sequences (your FASTA files didn't have them either). Modifying the file to contain descriptions for both the alignment and the 'Aligned_sequences:' section gets your test alignment to work. I consider this a feature and not a bug; how would others be able to distinguish between numerous sequences in an alignment w/o identifiers of some sort? It shouldn't just toss this out without a warning however; I'll try to add a little exception handling. 
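For anyone who wants to check a report before handing it to Bio::AlignIO, here is a minimal plain-Perl sketch. It is not part of BioPerl and it is not the exception handling mentioned above. It assumes the report header lists the sequences on lines of the form "# 1: name" and "# 2: name" under "# Aligned_sequences:", and that a missing descriptor shows up as an empty name; both sub names are made up for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: collect the names given on the "# N: name" header
# lines of an EMBOSS pairwise report. A name is '' when the descriptor
# is missing (assumed layout, see the note above).
sub aligned_sequence_names {
    my ($fh) = @_;
    my @names;
    while ( my $line = <$fh> ) {
        # Only numbered header lines match; "# Matrix: ..." etc. do not.
        if ( $line =~ /^#\s+(\d+):\s*(\S*)/ ) {
            $names[ $1 - 1 ] = $2;
        }
    }
    return @names;
}

# Hypothetical helper: how many of the collected names are empty or absent.
sub missing_name_count {
    my (@names) = @_;
    return scalar grep { !defined $_ || $_ eq '' } @names;
}

# Command-line use: warn about a report that lacks sequence identifiers.
if (@ARGV) {
    my $file = $ARGV[0];
    open my $fh, '<', $file or die "Can't open $file: $!";
    my @names = aligned_sequence_names($fh);
    close $fh;

    warn "$file: no numbered sequence lines found in the header\n"
        unless @names;
    my $missing = missing_name_count(@names);
    warn "$file: $missing sequence(s) have no identifier\n" if $missing;
}
```

Run it with the report as the only argument (for example the outfile.needle mentioned earlier in the thread); if it warns, give the input sequences proper FASTA description lines and rerun needle before parsing.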
BTW, one line is incorrect in your script; it should be print "Number of sequences:".$align->no_sequences()."\n"; you have print "Number of sequences:".$align->no_sequence()."\n"; Chris _____ From: Khoiwal Tara [mailto:khoiwal_tara at yahoo.co.in] Sent: Tuesday, July 11, 2006 8:26 AM To: Chris Fields Subject: Re: [Bioperl-l] Need help in needle parser I am sending my testing data to you. I have two fasta files "GenomicSeq.fasta" and "TranscriptSeq.fasta". I ran needle on these files as follows: $ needle GenomicSeq.fasta TranscriptSeq.fasta outfile.needle So the out put of the needle will get stored in outfile.needle. I am attaching the output file also. Please check it and tell me if it has any problem. Is my output file is correct? Thanks and Regards, Tara. Chris Fields wrote: perldoc Bio::AlignIO perldoc Bio::AlignIO::needle http://www.bioperl.org/wiki/FAQ http://www.bioperl.org/wiki/HOWTO:Beginners http://www.bioperl.org/wiki/Bptutorial.pl http://www.catb.org/~esr/faqs/smart-questions.html Google is your friend! If it isn't entering the while loop, there are two possibilities: 1) Something is wrong with the file 2) The parser isn't reading the file correctly In order to know which, we will need to see the alignment itself. Chris On Jul 11, 2006, at 7:19 AM, Khoiwal Tara wrote: > Hi, > I want to parse the output of needle.I tried but didn't able to > get expected output. > > My code is as follows: > > #!/usr/local/bin/perl > > use strict; > use warnings; > use Bio::AlignIO; > my $needleReport = $ARGV[0]; > > my $in = new Bio::AlignIO(-format => 'emboss',-file =>$needleReport); > > while(my $align = $in->next_aln()){ > print "Alignment Length:".$align->length()."\n"; > print "Percentage Identity:".$align->percentage_identity()."\n"; > print "Consensus string:".$align->consensus_string()."\n"; > print "Number of sequences:".$align->no_sequence()."\n"; > print "Number of residues:".$align->no_residues()."\n"; > } > > But it doesn't go inside the while loop. 
> Pls help me. > How to find the alignment position for the query sequence on the > target sequence from the needle output? > Where can i find the good documentation on needle parser and its > usage? > Good document on bioperl for beginners. > > Regards, > Tara Khoiwal. > > > --------------------------------- > Sneak preview the all-new Yahoo.com. It's not radically different. > Just radically better. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign _____ Do you Yahoo!? Next-gen email? Have it all with the all-new Yahoo! Mail Beta. From wrp at virginia.edu Tue Jul 11 18:05:29 2006 From: wrp at virginia.edu (William R. Pearson) Date: Tue, 11 Jul 2006 14:05:29 -0400 Subject: [Bioperl-l] Course announcement: CSHL Computational Genomics Course In-Reply-To: References: Message-ID: <45D80228-35DE-44B0-9E11-48EC76CE0DE7@virginia.edu> Course announcement - Application deadline, July 15, 2006 ================================================================ Cold Spring Harbor COMPUTATIONAL & COMPARATIVE GENOMICS November 8 - 14, 2006 Application Deadline: July 15, 2006 INSTRUCTORS: Pearson, William, Ph.D., University of Virginia, Charlottesville, VA Smith, Randall, Ph.D., SmithKline Beecham Pharmaceuticals, King of Prussia, PA Beyond BLAST and FASTA - Alignment: from proteins to genomes - This course presents a comprehensive overview of the theory and practice of computational methods for extracting the maximum amount of information from protein and DNA sequence similarity through sequence database searches, statistical analysis, and multiple sequence alignment, and genome scale alignment. 
Additional topics include gene finding, identifying signals in unaligned sequences, and integration of genetic and sequence information in biological databases. The course combines lectures with hands-on exercises; students are encouraged to pose challenging sequence analysis problems using their own data. The course makes extensive use of local WWW pages to present problem sets and the computing tools to solve them. Students use Windows and Mac workstations attached to a UNIX server; participants should be comfortable using the Unix operating system and a Unix text editor. The course is designed for biologists seeking advanced training in biological sequence analysis, computational biology core resource directors and staff, and for scientists in other disciplines, such as computer science, who wish to survey current research problems in biological sequence analysis and comparative genomics. The primary focus of the Computational and Comparative Genomics Course is the theory and practice of algorithms used in computational biology, with the goal of using current methods more effectively and developing new algorithms. Cold Spring Harbor also offers a "Programming for Biology" course, which focuses more on software development. Over the past few years, the course has been expanded to cover more algorithms and exercises on comparative genomics and genome databases. For additional information and the lecture schedule and problem sets for the 2005 course, see: http://fasta.bioch.virginia.edu/cshl05 ================================================================ To apply to the course, fill out the form at: http://meetings.cshl.edu/courses/courseapplication.asp ================================================================ From rvosa at sfu.ca Tue Jul 11 18:58:25 2006 From: rvosa at sfu.ca (Rutger Vos) Date: Tue, 11 Jul 2006 11:58:25 -0700 Subject: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading?
In-Reply-To: <954D76EB-F8D3-4652-957B-6ED4BBAEE209@uiuc.edu> References: <44B36846.8070103@sfu.ca> <954D76EB-F8D3-4652-957B-6ED4BBAEE209@uiuc.edu> Message-ID: <44B3F4D1.7090804@sfu.ca> I must have overlooked this. I think it does what I want. So could I do something like: $obj->throw_not_implemented( -class => 'Bio::Root::NotImplemented' ); ...in interfaces? Chris Fields wrote: > I suppose you could; Bio::Root::Root does that using Error.pm (if it > is installed). It almost sounds like what Bio::Root::Root does is > what you want, but you want a little more information when exceptions > are thrown maybe? > > from perldoc Bio::Root::Root: > > ... > # Alternatively, using the new typed exception syntax in > the throw() call: > > $obj->throw( -class => 'Bio::Root::BadParameter', > -text => "Can not open file $file", > -value => $file); > ... > > Typed Exception Syntax > > The typed exception syntax of throw() has the advantage of > plainly > indicating the nature of the trouble, since the name of the > class is > included in the title of the exception output. > > To take advantage of this capability, you must specify > arguments as > named parameters in the throw() call. Here are the parameters: > > -class > name of the class of the exception. This should be one > of the > classes defined in Bio::Root::Exception, or a custom > error of yours > that extends one of the exceptions defined in > Bio::Root::Exception. > > -text > a sensible message for the exception > > -value > the value causing the exception or $!, if appropriate. > > Note that Bio::Root::Exception does not need to be imported > into your > module (or script) namespace in order to throw exceptions via > Bio::Root::Root::throw(), since Bio::Root::Root imports it.
> > > Chris > > On Jul 11, 2006, at 3:58 AM, Rutger Vos wrote: > > >> Dear all, >> >> would it be possible to overload Bio::Root::RootI's 'throw' method to >> accept an additional, optional (positional) argument to define the >> exception class, e.g. using Exception::Class: >> >> # ...somewhere ... >> >> sub makefh { >> my ( $self, $filename ) = @_; >> open my $fh, '<' $filename or $self->throw("Can't open file: $!", >> 'Bio::Exceptions::FileIO'); # NOTE second argument >> return $fh; >> } >> >> #.... somewhere else >> my $fh; >> eval { >> $fh = $obj->makefh( 'data.txt'); >> } >> if ( $@ and $@->isa('Bio::Exceptions::FileIO') ) { >> # something's wrong with the file? >> } >> >> -- >> ++++++++++++++++++++++++++++++++++++++++++++++++++++ >> Rutger Vos, PhD. candidate >> Department of Biological Sciences >> Simon Fraser University >> 8888 University Drive >> Burnaby, BC, V5A1S6 >> Phone: 604-291-5625 >> Fax: 604-291-3496 >> Personal site: http://www.sfu.ca/~rvosa >> FAB* lab: http://www.sfu.ca/~fabstar >> Bio::Phylo: http://search.cpan.org/~rvosa/Bio-Phylo/ >> ++++++++++++++++++++++++++++++++++++++++++++++++++++ >> >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> > > Christopher Fields > Postdoctoral Researcher > Lab of Dr. Robert Switzer > Dept of Biochemistry > University of Illinois Urbana-Champaign > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > -- ++++++++++++++++++++++++++++++++++++++++++++++++++++ Rutger Vos, PhD. 
candidate Department of Biological Sciences Simon Fraser University 8888 University Drive Burnaby, BC, V5A1S6 Phone: 604-291-5625 Fax: 604-291-3496 Personal site: http://www.sfu.ca/~rvosa FAB* lab: http://www.sfu.ca/~fabstar Bio::Phylo: http://search.cpan.org/~rvosa/Bio-Phylo/ ++++++++++++++++++++++++++++++++++++++++++++++++++++ From hlapp at gmx.net Tue Jul 11 19:05:03 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Tue, 11 Jul 2006 15:05:03 -0400 Subject: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading? In-Reply-To: <44B36846.8070103@sfu.ca> References: <44B36846.8070103@sfu.ca> Message-ID: <18C839F9-B099-4A4A-9957-2BF4EB7CFB85@gmx.net> I think it does this already, except that I believe you need to create the exception object and initialize with the message upfront. Steve, can you comment? Is this at least somewhat right? -hilmar On Jul 11, 2006, at 4:58 AM, Rutger Vos wrote: > Dear all, > > would it be possible to overload Bio::Root::RootI's 'throw' method to > accept an additional, optional (positional) argument to define the > exception class, e.g. using Exception::Class: > > # ...somewhere ... > > sub makefh { > my ( $self, $filename ) = @_; > open my $fh, '<' $filename or $self->throw("Can't open file: $!", > 'Bio::Exceptions::FileIO'); # NOTE second argument > return $fh; > } > > #.... somewhere else > my $fh; > eval { > $fh = $obj->makefh( 'data.txt'); > } > if ( $@ and $@->isa('Bio::Exceptions::FileIO') ) { > # something's wrong with the file? > } > > -- > ++++++++++++++++++++++++++++++++++++++++++++++++++++ > Rutger Vos, PhD. 
candidate > Department of Biological Sciences > Simon Fraser University > 8888 University Drive > Burnaby, BC, V5A1S6 > Phone: 604-291-5625 > Fax: 604-291-3496 > Personal site: http://www.sfu.ca/~rvosa > FAB* lab: http://www.sfu.ca/~fabstar > Bio::Phylo: http://search.cpan.org/~rvosa/Bio-Phylo/ > ++++++++++++++++++++++++++++++++++++++++++++++++++++ > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at gmx.net Tue Jul 11 19:05:54 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Tue, 11 Jul 2006 15:05:54 -0400 Subject: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading? In-Reply-To: <954D76EB-F8D3-4652-957B-6ED4BBAEE209@uiuc.edu> References: <44B36846.8070103@sfu.ca> <954D76EB-F8D3-4652-957B-6ED4BBAEE209@uiuc.edu> Message-ID: <297D4770-A963-4039-8D90-987CC570BA94@gmx.net> Alright - well spotted Chris. This is what I was looking for. On Jul 11, 2006, at 9:13 AM, Chris Fields wrote: > I suppose you could; Bio::Root::Root does that using Error.pm (if it > is installed). It almost sounds like what Bio::Root::Root does is > what you want, but you want a little more information when exceptions > are thrown maybe? > > from perldoc Bio::Root::Root: > > ... > # Alternatively, using the new typed exception syntax in > the throw() call: > > $obj->throw( -class => 'Bio::Root::BadParameter', > -text => "Can not open file $file", > -value => $file); > ... > > Typed Exception Syntax > > The typed exception syntax of throw() has the advantage of > plainly > indicating the nature of the trouble, since the name of the > class is > included in the title of the exception output. 
> > To take advantage of this capability, you must specify > arguments as > named parameters in the throw() call. Here are the parameters: > > -class > name of the class of the exception. This should be one > of the > classes defined in Bio::Root::Exception, or a custom > error of yours > that extends one of the exceptions defined in > Bio::Root::Exception. > > -text > a sensible message for the exception > > -value > the value causing the exception or $!, if appropriate. > > Note that Bio::Root::Exception does not need to be imported > into your > module (or script) namespace in order to throw exceptions via > Bio::Root::Root::throw(), since Bio::Root::Root imports it. > > > Chris > > On Jul 11, 2006, at 3:58 AM, Rutger Vos wrote: > >> Dear all, >> >> would it be possible to overload Bio::Root::RootI's 'throw' method to >> accept an additional, optional (positional) argument to define the >> exception class, e.g. using Exception::Class: >> >> # ...somewhere ... >> >> sub makefh { >> my ( $self, $filename ) = @_; >> open my $fh, '<' $filename or $self->throw("Can't open file: $!", >> 'Bio::Exceptions::FileIO'); # NOTE second argument >> return $fh; >> } >> >> #.... somewhere else >> my $fh; >> eval { >> $fh = $obj->makefh( 'data.txt'); >> } >> if ( $@ and $@->isa('Bio::Exceptions::FileIO') ) { >> # something's wrong with the file? >> } >> >> -- >> ++++++++++++++++++++++++++++++++++++++++++++++++++++ >> Rutger Vos, PhD. 
candidate >> Department of Biological Sciences >> Simon Fraser University >> 8888 University Drive >> Burnaby, BC, V5A1S6 >> Phone: 604-291-5625 >> Fax: 604-291-3496 >> Personal site: http://www.sfu.ca/~rvosa >> FAB* lab: http://www.sfu.ca/~fabstar >> Bio::Phylo: http://search.cpan.org/~rvosa/Bio-Phylo/ >> ++++++++++++++++++++++++++++++++++++++++++++++++++++ >> >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > > Christopher Fields > Postdoctoral Researcher > Lab of Dr. Robert Switzer > Dept of Biochemistry > University of Illinois Urbana-Champaign > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From cjfields at uiuc.edu Tue Jul 11 20:42:35 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 11 Jul 2006 15:42:35 -0500 Subject: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading? In-Reply-To: <44B3F4D1.7090804@sfu.ca> Message-ID: <000b01c6a52a$8c77f8c0$15327e82@pyrimidine> Bio::Root::Root doesn't overload throw_not_implemented from Bio::Root::RootI; from the comments looks like Steve C and Ewan B couldn't work out some of the Error.pm issues. Judging by the POD for Bio::Root::RootI, throw_not_implemented doesn't accept arguments; it throws a Bio::Root::NotImplemented exception automatically. Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Rutger Vos > Sent: Tuesday, July 11, 2006 1:58 PM > To: Chris Fields > Cc: 'Bioperl List' > Subject: Re: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading? > > I must have overlooked this. 
I think it does what I want. So could I do
> something like:
>
> $obj->throw_not_implemented( -class => 'Bio::Root::NotImplemented' );
>
> ...in interfaces?
>
> Chris Fields wrote:
> > I suppose you could; Bio::Root::Root does that using Error.pm (if it
> > is installed). It almost sounds like what Bio::Root::Root does is
> > what you want, but you want a little more information when exceptions
> > are thrown maybe?
> >
> > from perldoc Bio::Root::Root:
> >
> > ...
> >     # Alternatively, using the new typed exception syntax in
> >     # the throw() call:
> >
> >     $obj->throw( -class => 'Bio::Root::BadParameter',
> >                  -text  => "Can not open file $file",
> >                  -value => $file );
> > ...
> >
> > Typed Exception Syntax
> >
> > The typed exception syntax of throw() has the advantage of plainly
> > indicating the nature of the trouble, since the name of the class is
> > included in the title of the exception output.
> >
> > To take advantage of this capability, you must specify arguments as
> > named parameters in the throw() call. Here are the parameters:
> >
> > -class
> >     name of the class of the exception. This should be one of the
> >     classes defined in Bio::Root::Exception, or a custom error of yours
> >     that extends one of the exceptions defined in Bio::Root::Exception.
> >
> > -text
> >     a sensible message for the exception
> >
> > -value
> >     the value causing the exception or $!, if appropriate.
> >
> > Note that Bio::Root::Exception does not need to be imported into your
> > module (or script) namespace in order to throw exceptions via
> > Bio::Root::Root::throw(), since Bio::Root::Root imports it.
> >
> > Chris
> >
> > On Jul 11, 2006, at 3:58 AM, Rutger Vos wrote:
> >
> >> Dear all,
> >>
> >> would it be possible to overload Bio::Root::RootI's 'throw' method to
> >> accept an additional, optional (positional) argument to define the
> >> exception class, e.g. using Exception::Class:
> >>
> >> # ...somewhere ...
> >>
> >> sub makefh {
> >>     my ( $self, $filename ) = @_;
> >>     open my $fh, '<', $filename or $self->throw("Can't open file: $!",
> >>         'Bio::Exceptions::FileIO'); # NOTE second argument
> >>     return $fh;
> >> }
> >>
> >> #.... somewhere else
> >> my $fh;
> >> eval {
> >>     $fh = $obj->makefh( 'data.txt');
> >> };
> >> if ( $@ and $@->isa('Bio::Exceptions::FileIO') ) {
> >>     # something's wrong with the file?
> >> }
> >>
> >> --
> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++
> >> Rutger Vos, PhD. candidate
> >> Department of Biological Sciences
> >> Simon Fraser University
> >> 8888 University Drive
> >> Burnaby, BC, V5A1S6
> >> Phone: 604-291-5625
> >> Fax: 604-291-3496
> >> Personal site: http://www.sfu.ca/~rvosa
> >> FAB* lab: http://www.sfu.ca/~fabstar
> >> Bio::Phylo: http://search.cpan.org/~rvosa/Bio-Phylo/
> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>
> >> _______________________________________________
> >> Bioperl-l mailing list
> >> Bioperl-l at lists.open-bio.org
> >> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >
> > Christopher Fields
> > Postdoctoral Researcher
> > Lab of Dr. Robert Switzer
> > Dept of Biochemistry
> > University of Illinois Urbana-Champaign

From frederick.partridge at st-johns.oxford.ac.uk Tue Jul 11 21:23:28 2006
From: frederick.partridge at st-johns.oxford.ac.uk (Frederick Partridge)
Date: Tue, 11 Jul 2006 22:23:28 +0100 (BST)
Subject: [Bioperl-l] Get nucleotide sequence when expecting protein from genpept
Message-ID:

I am trying to retrieve various protein sequences from genpept using
get_Seq_by_acc. All of them work ok, except one, T16005.

If I try and retrieve it with a reduced program:

#!/usr/bin/perl -w
use strict;
use Bio::Perl;
use Bio::SeqIO;
use Bio::DB::GenPept;

my $genpept = Bio::DB::GenPept->new;
my $seq = $genpept->get_Seq_by_acc('T16005');
print $seq->seq(), "\n";

I get back a nucleotide sequence, which is another sequence at NCBI with
the same accession number. (I thought these were meant to be unique? but
evidently not.)

I am using bioperl 1.5.1, perl 5.8.1, Mac OS 10.3

Could anyone help me to get this protein sequence with my program?

Many thanks,

Freddie Partridge
University of Oxford

From qfdong at iastate.edu Tue Jul 11 21:32:56 2006
From: qfdong at iastate.edu (Qunfeng)
Date: Tue, 11 Jul 2006 16:32:56 -0500
Subject: [Bioperl-l] Get nucleotide sequence when expecting protein from genpept
In-Reply-To:
References:
Message-ID: <6.1.2.0.2.20060711163128.08086570@qfdong.mail.iastate.edu>

This particular protein record (acc# T16005) was imported from PIR. In
other words, this is not an original GenBank protein record. When GenBank
imports protein records from other DBs, it keeps their original acc#.
However, gi# should be unique.

Q

From cjfields at uiuc.edu Tue Jul 11 22:05:09 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Tue, 11 Jul 2006 17:05:09 -0500
Subject: [Bioperl-l] Get nucleotide sequence when expecting protein from genpept
In-Reply-To:
Message-ID: <000001c6a536$141befb0$15327e82@pyrimidine>

It's an imported PIR record, so there probably is no accession recorded in
the database. I think NCBI uses a fallback to nucleotide if it can't find a
particular accession via protein. Using the primary ID (the GI#, 7498730)
works.

Chris

From cjfields at uiuc.edu Tue Jul 11 22:47:38 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Tue, 11 Jul 2006 17:47:38 -0500
Subject: [Bioperl-l] Get nucleotide sequence when expecting protein from genpept
In-Reply-To: <000001c6a536$141befb0$15327e82@pyrimidine>
Message-ID: <000201c6a53c$03970ed0$15327e82@pyrimidine>

Okay, now try this:

use Bio::DB::GenPept;
use Bio::SeqIO;

# ask for FASTA and fetch every record that matches the accession
my $factory = Bio::DB::GenPept->new(-format => 'fasta');
my $seqin   = $factory->get_Stream_by_acc('T16005');

# write each returned record to STDOUT as FASTA
my $seqout = Bio::SeqIO->new(-fh => \*STDOUT, -format => 'fasta');
while (my $seq = $seqin->next_seq) {
    $seqout->write_seq($seq);
}

This returns both the nucleotide sequence and the correct protein sequence;
the protein was returned second for some reason, so get_Seq_by_acc misses it
while get_Stream_by_acc doesn't. I have notified NCBI about this issue, but
they will likely just tell me to use the GI number for searches as they are
unique. Probably a good warning for anyone using accessions for all their
work (I use the GI myself).
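
Because the stream can hand back a nucleotide record ahead of the protein,
one option is to walk every record and keep only those that look like
protein. The following is a minimal sketch in core Perl only, not BioPerl
code: the FASTA text and its sequences are invented for illustration, and
parse_fasta and looks_like_protein are hypothetical helpers. The heuristic
is crude and would misclassify a protein made up solely of letters that are
also nucleotide codes.

```perl
use strict;
use warnings;

# Made-up FASTA text standing in for a stream that returns two records
# for one accession; headers and sequences are invented for illustration.
my $fasta = <<'END';
>record1 hypothetical nucleotide entry
ACGTTGCANNACGT
>record2 hypothetical protein entry
MKVLAAGIVGLE
END

# Split FASTA text into [header, sequence] pairs.
sub parse_fasta {
    my ($text) = @_;
    my @records;
    for my $chunk (split /^>/m, $text) {
        next unless $chunk =~ /\S/;    # skip the empty leading field
        my ($header, @lines) = split /\n/, $chunk;
        push @records, [ $header, join('', @lines) ];
    }
    return @records;
}

# Crude heuristic (an assumption, not BioPerl's logic): call a sequence
# protein if it has any character outside the IUPAC nucleotide codes.
sub looks_like_protein {
    my ($seq) = @_;
    return $seq =~ /[^ACGTUNRYSWKMBDHV\-]/i ? 1 : 0;
}

# Keep only the records that look like protein and print their headers.
my @proteins = grep { looks_like_protein($_->[1]) } parse_fasta($fasta);
print "$_->[0]\n" for @proteins;
```

With real BioPerl sequence objects the same filtering idea would more
naturally test each object's alphabet instead of inspecting the letters by
hand.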
Chris

From Steve_Chervitz at affymetrix.com Wed Jul 12 00:21:16 2006
From: Steve_Chervitz at affymetrix.com (Steve Chervitz)
Date: Tue, 11 Jul 2006 17:21:16 -0700
Subject: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading?
In-Reply-To: <18C839F9-B099-4A4A-9957-2BF4EB7CFB85@gmx.net>
Message-ID:

The Bio::Root::Root object is rigged to use the Error.pm module if
available, so you can throw and catch exception objects derived from
Error. The motivation here was to provide a recommended path for folks that
want to use more structured exception handling logic in their bioperl code.

There are a number of pre-defined subclasses of exceptions that cover common
problems (such as FileOpenException), but you can also define your own. See
a list of the predefined exceptions as well as some how-to docs in the POD
for Bio::Root::Exception:

http://search.cpan.org/~birney/bioperl-1.4/Bio/Root/Exception.pm

There's a bunch more info about Bioperl exception fun available from the
bioperl distribution under the examples/root directory. See the README in
that directory to get oriented. There are a number of demo scripts there, too.
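
For readers without Error.pm or BioPerl at hand, the general pattern under
discussion (throw an object of a named class, then test its class in the
handler) can be sketched in core Perl. The package names My::Exception and
My::Exception::FileIO and the helper open_or_throw are hypothetical, made up
for this illustration; this is not how Bio::Root::Root implements throw().

```perl
use strict;
use warnings;

# Hypothetical base exception class (not part of BioPerl or Error.pm).
package My::Exception;

sub new {
    my ($class, %args) = @_;
    return bless { text => $args{-text}, value => $args{-value} }, $class;
}

# Build an exception object and raise it with die.
sub throw { my $class = shift; die $class->new(@_); }
sub text  { $_[0]{text} }
sub value { $_[0]{value} }

# Hypothetical subclass for file problems.
package My::Exception::FileIO;
our @ISA = ('My::Exception');

package main;
use Scalar::Util qw(blessed);

# Open a file for reading, or throw a typed exception naming the file.
sub open_or_throw {
    my ($filename) = @_;
    open(my $fh, '<', $filename)
        or My::Exception::FileIO->throw(
            -text  => "Can't open file: $!",
            -value => $filename,
        );
    return $fh;
}

# Catch with eval, copy $@ right away, and check the class before use.
my $fh  = eval { open_or_throw('/no/such/dir/data.txt') };
my $err = $@;
if (blessed($err) && $err->isa('My::Exception::FileIO')) {
    print "FileIO problem with ", $err->value, "\n";
}
```

Checking blessed() first avoids calling isa() on a plain string, which is
what $@ holds when something dies with an ordinary message.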
Bio::Root::Root doesn't know anything about Exception::Class, but I see you
can use it with Error.pm as described here:

http://search.cpan.org/~drolsky/Exception-Class-1.23/lib/Exception/Class.pm#OTHER_EXCEPTION_MODULES_(try%2Fcatch_syntax)

Cheers,
Steve

> From: Hilmar Lapp
> Date: Tue, 11 Jul 2006 15:05:03 -0400
> To: Rutger Vos
> Cc: Bioperl, Steve Chervitz
> Subject: Re: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading?
>
> I think it does this already, except that I believe you need to
> create the exception object and initialize with the message upfront.
>
> Steve, can you comment? Is this at least somewhat right?
>
> -hilmar
>
> --
> ===========================================================
> : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
> ===========================================================

From Steve_Chervitz at affymetrix.com Wed Jul 12 01:07:06 2006
From: Steve_Chervitz at affymetrix.com (Steve_Chervitz)
Date: Tue, 11 Jul 2006 18:07:06 -0700
Subject: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading?
In-Reply-To: <000b01c6a52a$8c77f8c0$15327e82@pyrimidine>
References: <000b01c6a52a$8c77f8c0$15327e82@pyrimidine>
Message-ID: <337D69E3-4075-426B-A53E-31D37ED6CA2E@affymetrix.com>

On Jul 11, 2006, at 1:42 PM, Chris Fields wrote:

> Bio::Root::Root doesn't overload throw_not_implemented from
> Bio::Root::RootI; from the comments looks like Steve C and Ewan B couldn't
> work out some of the Error.pm issues.

The issue (I believe) was that Bio::Root::RootI::throw_not_implemented was
doing some checking for the presence of Error.pm and calling Error::throw.
I changed it so that this fanciness only happens in Root.pm.

> Judging by the POD for Bio::Root::RootI, throw_not_implemented doesn't
> accept arguments; it throws a Bio::Root::NotImplemented exception
> automatically.

Looking at the code now, throw_not_implemented() does not throw a
Bio::Root::NotImplemented exception. It just throws a simple, unclassed message.
We could allow it to throw an exception of class Bio::Root::NotImplemented
by changing this code:

    if( $self->can('throw') ) {
        $self->throw($message);
    }...

to this:

    if( $self->can('throw') ) {
        $self->throw(-text  => $message,
                     -class => 'Bio::Root::NotImplemented');
    }...

This does not create any dependency on Error.pm, but permits it to be used
if available. If Error.pm is not loaded, the only change is that the class
string is included in the error message, which is kind of handy.

Trouble would occur if the implementing class:

* does not derive from Bio::Root::Root,
* does not import Bio::Root::Exception,
* fails to implement a method which gets called, and
* Error.pm is available.

I don't know if such implementations exist in bioperl now, but I suspect
they would be rare (and discouraged).

Steve

From cjfields at uiuc.edu Wed Jul 12 03:27:37 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Tue, 11 Jul 2006 22:27:37 -0500
Subject: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading?
In-Reply-To: <337D69E3-4075-426B-A53E-31D37ED6CA2E@affymetrix.com>
References: <000b01c6a52a$8c77f8c0$15327e82@pyrimidine> <337D69E3-4075-426B-A53E-31D37ED6CA2E@affymetrix.com>
Message-ID: <18B66484-FE46-4F77-BDD1-B97085D04C95@uiuc.edu>

Makes sense to keep most of the magic in Root instead of RootI.pm. The POD
for RootI does state that the class exception thrown is
Bio::Root::NotImplemented, so we should probably either change the POD to
reflect what really happens or change throw_not_implemented like you suggest
(my vote is the latter). I don't think many (if any) implementing classes
fall into your 'trouble' category, though I can't be sure how many actually
import Bio::Root::Exception.

Chris

Christopher Fields
Postdoctoral Researcher
Lab of Dr.
Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign

From frederick.partridge at st-johns.oxford.ac.uk Wed Jul 12 15:16:33 2006
From: frederick.partridge at st-johns.oxford.ac.uk (Frederick Partridge)
Date: Wed, 12 Jul 2006 16:16:33 +0100 (BST)
Subject: [Bioperl-l] Get nucleotide sequence when expecting protein from genpept
In-Reply-To: <000201c6a53c$03970ed0$15327e82@pyrimidine>
References: <000201c6a53c$03970ed0$15327e82@pyrimidine>
Message-ID:

On Tue, 11 Jul 2006, Chris Fields wrote:
> This returns both the nucleotide sequence and the correct protein sequence;
> the protein was returned second for some reason, so get_Seq_by_acc misses it
> while get_Stream_by_acc doesn't. I have notified NCBI about this issue, but
> they will likely just tell me to use the GI number for searches as they are
> unique. Probably a good warning for anyone using accessions for all their
> work (I use the GI myself).

Thank you both for your help, I have converted to GIs and it works much
better.

As an aside, it might be nice to have a $hit->gi method as well as
$hit->accession for parsing blast reports. (I now realise that you can
derive the gi from $hit->name, but this might have encouraged me to start
off using gi instead of accession numbers).

Freddie Partridge
University of Oxford

From cjfields at uiuc.edu Wed Jul 12 15:39:39 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Wed, 12 Jul 2006 10:39:39 -0500
Subject: [Bioperl-l] Get nucleotide sequence when expecting protein from genpept
In-Reply-To:
Message-ID: <000b01c6a5c9$635a7540$15327e82@pyrimidine>

Problem is, you may or may not have GIs for a BLAST hit depending on how you
retrieve the BLAST report, what interface you use, etc. NCBI is pretty
ambiguous when it comes to GI vs. accession; the sequence database guys want
you to use the GI for searches (since that's the unique ID for NCBI's
databases) and don't promise getting the correct sequence using the accession.
However, the BLAST interface guys have set up the BLAST CGI server
(accessible through Bio::Tools::Run::RemoteBlast) to not return the GI by
default. Even more confusing, if you use the NCBI BLAST web interface, this
option is turned on by default. Don't know what blastcl3 or blastall does,
haven't checked in a while.

Anyway, this could be why there is no $hit->gi method for
GenericHit/BlastHit. It could be added; I will need to look at
SearchIO::blast/blastxml/blasttable to see how this is parsed out.

BTW, what I do as a work-around, when using RemoteBlast, is below (you could
use the while loop to grab the GIs using SearchIO::blast if they are present
in the BLAST report). This grabs all the GIs from the description line (not
just the best hit).

# sets retrieval header to include the GI always
$Bio::Tools::Run::RemoteBlast::RETRIEVALHEADER{'NCBI_GI'} = 'yes';
...
while ( my $hit = $result->next_hit) {
    my $description = $hit->description;
    # collect every gi|NNN| token found in the description line
    while ($description =~ /gi\|(.*?)\|/g) {
        my $gi = $1;
        push @gis, $gi;
    }
}

Chris

From Steve_Chervitz at affymetrix.com Wed Jul 12 18:53:22 2006
From: Steve_Chervitz at affymetrix.com (Steve_Chervitz)
Date: Wed, 12 Jul 2006 11:53:22 -0700
Subject: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading?
In-Reply-To: <18B66484-FE46-4F77-BDD1-B97085D04C95@uiuc.edu>
References: <000b01c6a52a$8c77f8c0$15327e82@pyrimidine> <337D69E3-4075-426B-A53E-31D37ED6CA2E@affymetrix.com> <18B66484-FE46-4F77-BDD1-B97085D04C95@uiuc.edu>
Message-ID: <3E119694-68C5-47A6-971B-8E035CBB6429@affymetrix.com>

For modules that derive from Bio::Root::Root, there's no need to import
Bio::Root::Exception since the Root object does it.

I also favor adding the -class parameter to throw_not_implemented in RootI.
I just committed this change in bioperl-live. I also added a test for it in
t/RootI.t.

I haven't run the complete suite of tests after making this change, but I
don't suspect there'll be any trouble (famous last words). Really, if any
test leads to the calling of throw_not_implemented (besides the test I just
added), that in itself is trouble.

Steve
Robert Switzer >>>>> Dept of Biochemistry >>>>> University of Illinois Urbana-Champaign >>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> Bioperl-l mailing list >>>>> Bioperl-l at lists.open-bio.org >>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >>>>> >>>>> >>>>> >>>>> >>>> >>>> -- >>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++ >>>> Rutger Vos, PhD. candidate >>>> Department of Biological Sciences >>>> Simon Fraser University >>>> 8888 University Drive >>>> Burnaby, BC, V5A1S6 >>>> Phone: 604-291-5625 >>>> Fax: 604-291-3496 >>>> Personal site: http://www.sfu.ca/~rvosa >>>> FAB* lab: http://www.sfu.ca/~fabstar >>>> Bio::Phylo: http://search.cpan.org/~rvosa/Bio-Phylo/ >>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++ >>>> >>>> >>>> _______________________________________________ >>>> Bioperl-l mailing list >>>> Bioperl-l at lists.open-bio.org >>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >>> >>> _______________________________________________ >>> Bioperl-l mailing list >>> Bioperl-l at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> > > Christopher Fields > Postdoctoral Researcher > Lab of Dr. Robert Switzer > Dept of Biochemistry > University of Illinois Urbana-Champaign > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Wed Jul 12 19:23:33 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Wed, 12 Jul 2006 14:23:33 -0500 Subject: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading? In-Reply-To: <3E119694-68C5-47A6-971B-8E035CBB6429@affymetrix.com> Message-ID: <000901c6a5e8$aaca53e0$15327e82@pyrimidine> Thanks Steve! 
Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Steve_Chervitz > Sent: Wednesday, July 12, 2006 1:53 PM > To: Chris Fields > Cc: Rutger Vos; Bioperl List > Subject: Re: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) overloading? > > For modules that derive from Bio::Root::Root, there's no need to > import Bio::Root::Exception since the Root object does it. > > I also favor adding the -class parameter to throw_not_implemented in > RootI. I just committed this change in in bioperl-live. I also added > a test for it in t/RootI.t > > I haven't run the complete suite of tests after making this change, > but I don't suspect there'll be any trouble (famous last words). > Really, if any test leads to the calling of throw_not_implemented > (besides the test I just added), that in itself is trouble. > > Steve > > On Jul 11, 2006, at 8:27 PM, Chris Fields wrote: > > > Makes sense to keep most of the magic in Root instead of RootI.pm. > > The POD for RootI does state that the class exception thrown is > > Bio::Root::NotImplemented, so we should probably either change the > > POD to reflect what really happens or change throw_not_implemented > > like you suggest (my vote is the latter). I don't think many (if > > any) implementing classes fall into your 'trouble' category, though I > > can't be sure how many actually import Bio::Root::Exception. > > > > Chris > > > > On Jul 11, 2006, at 8:07 PM, Steve_Chervitz wrote: > > > >> On Jul 11, 2006, at 1:42 PM, Chris Fields wrote: > >> > >>> Bio::Root::Root doesn't overload throw_not_implemented from > >>> Bio::Root::RootI; from the comments looks like Steve C and Ewan B > >>> couldn't > >>> work out some of the Error.pm issues. > >> > >> The issue (I believe) was that > >> Bio::Root::RootI::throw_not_implemented was doing some checking for > >> the presence of Error.pm and calling Error::throw. 
I changed it so > >> that this fanciness only happens in Root.pm. > >> > >>> Judging by the POD for Bio::Root::RootI, throw_not_implemented > >>> doesn't > >>> accept arguments; it throws a Bio::Root::NotImplemented exception > >>> automatically. > >> > >> Looking at the code now, throw_not_implemented() does not throw a > >> Bio::Root::NotImplemented exception. It just throws a simple, > >> unclassed message. We could allow it to throw an exception of class > >> Bio::Root:NotImplemented by changing this code: > >> > >> if( $self->can('throw') ) { > >> $self->throw($message); > >> }... > >> > >> to this > >> > >> if( $self->can('throw') ) { > >> $self->throw(-text=>$message, - > >> class=>'Bio::Root::NotImplemented'); > >> }... > >> > >> This does not create any dependency on Error.pm, but permits it to > >> be used if available. If Error.pm is not loaded, the only change is > >> that the class string is included in the error message, which is > >> kind of handy. > >> > >> Trouble would occur if the implementing class: > >> > >> * does not derive from Bio::Root::Root, > >> * does not import Bio::Root::Exception, > >> * fails to implement a method which gets called, and > >> * Error.pm is available. > >> > >> I don't know if such implementations exist in bioperl now, but I > >> suspect they would be rare (and discouraged). > >> > >> Steve > >> > >> > >>> Chris > >>> > >>>> -----Original Message----- > >>>> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > >>>> bounces at lists.open-bio.org] On Behalf Of Rutger Vos > >>>> Sent: Tuesday, July 11, 2006 1:58 PM > >>>> To: Chris Fields > >>>> Cc: 'Bioperl List' > >>>> Subject: Re: [Bioperl-l] Bio::Root::RootI->throw($msg,$class) > >>>> overloading? > >>>> > >>>> I must have overlooked this. I think it does what I want. So > >>>> could I do > >>>> something like: > >>>> > >>>> $obj->thow_not_implemented( -class => > >>>> 'Bio::Root::NotImplemented' ); > >>>> > >>>> ...in interfaces? 
> >>>> > >>>> Chris Fields wrote: > >>>>> I suppose you could; Bio::Root::Root does that using Error.pm > >>>>> (if it > >>>>> is installed). It almost sounds like what Bio::Root::Root does is > >>>>> what you want, but you want a little more information when > >>>>> exceptions > >>>>> are thrown maybe? > >>>>> > >>>>> from perldoc Bio::Root::Root: > >>>>> > >>>>> ... > >>>>> # Alternatively, using the new typed exception syntax in > >>>>> the throw() call: > >>>>> > >>>>> $obj->throw( -class => 'Bio::Root::BadParameter', > >>>>> -text => "Can not open file $file", > >>>>> -value => $file); > >>>>> ... > >>>>> > >>>>> Typed Exception Syntax > >>>>> > >>>>> The typed exception syntax of throw() has the advantage of > >>>>> plainly > >>>>> indicating the nature of the trouble, since the name of > >>>>> the > >>>>> class is > >>>>> included in the title of the exception output. > >>>>> > >>>>> To take advantage of this capability, you must specify > >>>>> arguments as > >>>>> named parameters in the throw() call. Here are the > >>>>> parameters: > >>>>> > >>>>> -class > >>>>> name of the class of the exception. This should be > >>>>> one > >>>>> of the > >>>>> classes defined in Bio::Root::Exception, or a custom > >>>>> error of yours > >>>>> that extends one of the exceptions defined in > >>>>> Bio::Root::Exception. > >>>>> > >>>>> -text > >>>>> a sensible message for the exception > >>>>> > >>>>> -value > >>>>> the value causing the exception or $!, if appropriate. > >>>>> > >>>>> Note that Bio::Root::Exception does not need to be > >>>>> imported > >>>>> into your > >>>>> module (or script) namespace in order to throw > >>>>> exceptions via > >>>>> Bio::Root::Root::throw(), since Bio::Root::Root imports > >>>>> it. 
> >>>>> > >>>>> > >>>>> Chris > >>>>> > >>>>> On Jul 11, 2006, at 3:58 AM, Rutger Vos wrote: > >>>>> > >>>>> > >>>>>> Dear all, > >>>>>> > >>>>>> would it be possible to overload Bio::Root::RootI's 'throw' > >>>>>> method to > >>>>>> accept an additional, optional (positional) argument to define > >>>>>> the > >>>>>> exception class, e.g. using Exception::Class: > >>>>>> > >>>>>> # ...somewhere ... > >>>>>> > >>>>>> sub makefh { > >>>>>> my ( $self, $filename ) = @_; > >>>>>> open my $fh, '<' $filename or $self->throw("Can't open > >>>>>> file: $!", > >>>>>> 'Bio::Exceptions::FileIO'); # NOTE second argument > >>>>>> return $fh; > >>>>>> } > >>>>>> > >>>>>> #.... somewhere else > >>>>>> my $fh; > >>>>>> eval { > >>>>>> $fh = $obj->makefh( 'data.txt'); > >>>>>> } > >>>>>> if ( $@ and $@->isa('Bio::Exceptions::FileIO') ) { > >>>>>> # something's wrong with the file? > >>>>>> } > >>>>>> > >>>>>> -- > >>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++ > >>>>>> Rutger Vos, PhD. candidate > >>>>>> Department of Biological Sciences > >>>>>> Simon Fraser University > >>>>>> 8888 University Drive > >>>>>> Burnaby, BC, V5A1S6 > >>>>>> Phone: 604-291-5625 > >>>>>> Fax: 604-291-3496 > >>>>>> Personal site: http://www.sfu.ca/~rvosa > >>>>>> FAB* lab: http://www.sfu.ca/~fabstar > >>>>>> Bio::Phylo: http://search.cpan.org/~rvosa/Bio-Phylo/ > >>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++ > >>>>>> > >>>>>> > >>>>>> _______________________________________________ > >>>>>> Bioperl-l mailing list > >>>>>> Bioperl-l at lists.open-bio.org > >>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l > >>>>>> > >>>>> > >>>>> Christopher Fields > >>>>> Postdoctoral Researcher > >>>>> Lab of Dr. 
Robert Switzer > >>>>> Dept of Biochemistry > >>>>> University of Illinois Urbana-Champaign > >>>>> > >>>>> > >>>>> > >>>>> _______________________________________________ > >>>>> Bioperl-l mailing list > >>>>> Bioperl-l at lists.open-bio.org > >>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l > >>>>> > >>>>> > >>>>> > >>>>> > >>>> > >>>> -- > >>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++ > >>>> Rutger Vos, PhD. candidate > >>>> Department of Biological Sciences > >>>> Simon Fraser University > >>>> 8888 University Drive > >>>> Burnaby, BC, V5A1S6 > >>>> Phone: 604-291-5625 > >>>> Fax: 604-291-3496 > >>>> Personal site: http://www.sfu.ca/~rvosa > >>>> FAB* lab: http://www.sfu.ca/~fabstar > >>>> Bio::Phylo: http://search.cpan.org/~rvosa/Bio-Phylo/ > >>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++ > >>>> > >>>> > >>>> _______________________________________________ > >>>> Bioperl-l mailing list > >>>> Bioperl-l at lists.open-bio.org > >>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l > >>> > >>> _______________________________________________ > >>> Bioperl-l mailing list > >>> Bioperl-l at lists.open-bio.org > >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l > >> > > > > Christopher Fields > > Postdoctoral Researcher > > Lab of Dr. 
Robert Switzer > > Dept of Biochemistry > > University of Illinois Urbana-Champaign > > > > > > > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From dsche at uga.edu Thu Jul 13 18:55:03 2006 From: dsche at uga.edu (Dongsheng Che) Date: Thu, 13 Jul 2006 14:55:03 -0400 (EDT) Subject: [Bioperl-l] remoteBlast problem Message-ID: <20060713145503.CIV61560@punts2.cc.uga.edu> To whom it may concern: I'm trying to do blast search remotely, so I downloaded bioperl-1.5, and followed the installation procedure, ie, perl Makefile.PL, make, make test. make install. I know there are some installation failure during the installation. Since my main purpose is to get remoteBlast worked, I don't want bother to figure out all failures. but I run remote Blast, it gave me some erorrs from examples (bptutorial). ------------------------------------------------------------- Beginning run_remoteblast example... Use of uninitialized value in numeric lt (<) at bptutorial.pl line 3303. **Warning**: Couldn't connect to NCBI with Bio::Tools::Run::StandAloneBlast.pm! Probably no network access. Skipping Test ---------------------------------------------------------------- I wondering what cause the problem. Thanks in advance! Dongsheng From vrramnar at student.cs.uwaterloo.ca Thu Jul 13 22:39:19 2006 From: vrramnar at student.cs.uwaterloo.ca (vrramnar at student.cs.uwaterloo.ca) Date: Thu, 13 Jul 2006 18:39:19 -0400 Subject: [Bioperl-l] Remote Blast - Blast Human Genome Message-ID: <1152830359.44b6cb97ef16c@www.nexusmail.uwaterloo.ca> Hello Again, I have another question regarding Remote blast but this time using Genome Blast. 
Here is the link: http://www.ncbi.nlm.nih.gov/genome/seq/BlastGen/BlastGen.cgi?taxid=9606 which again uses the main Blast web site: http://www.ncbi.nlm.nih.gov/BLAST/Blast.cgi Again I am not sure what to add or what HEADER information to change within my script. Here is my program, which was the same as the last email: #!/usr/bin/perl -w use Bio::Perl; use Bio::Tools::Run::RemoteBlast; my $prog = "blastn"; my $db = "refseq_genomic"; my $e_val = 0.01; my @params = ( '-prog' => $prog, '-data' => $db, '-expect' => $e_val); my $factory = new Bio::Tools::Run::RemoteBlast->new(@params); $Bio::Tools::Run::RemoteBlast::HEADER{'WWW_BLAST_TYPE'} = '????'; <----- what do I put here #$Bio::Tools::Run::RemoteBlast::HEADER{'?????'} = '????'; <--- Do I need to add any other values to the form inputs $factory->submit_blast("blast.in"); $v = 1; while (my @rids = $factory->each_rid) { foreach my $rid ( @rids ) { my $rc = $factory->retrieve_blast($rid); if( !ref($rc) ) { if( $rc < 0 ) { $factory->remove_rid($rid); } print STDERR "." if ( $v > 0 ); sleep 5; } else { my $result = $rc->next_result(); my $filename = $result->query_name()."\.out"; $factory->save_output($filename); $factory->remove_rid($rid); print "\nQuery Name: ", $result->query_name(), "\n"; } } } Both of my questions are very similiar as in I know how to use remote blast but not sure what to change to access the specific blast I want. Again, any help would be very appreciated!! Rohan ---------------------------------------- This mail sent through www.mywaterloo.ca From vrramnar at student.cs.uwaterloo.ca Thu Jul 13 22:31:38 2006 From: vrramnar at student.cs.uwaterloo.ca (vrramnar at student.cs.uwaterloo.ca) Date: Thu, 13 Jul 2006 18:31:38 -0400 Subject: [Bioperl-l] Remote Blast - SNP data base Message-ID: <1152829898.44b6c9cab7a3a@www.nexusmail.uwaterloo.ca> Hello, 1. I was wondering if anyone knew how to use SNP Blast via the Remote Blast module?? 
Basically I want to blast my sequence against the dbSNP database and you can normally do this through NCBI's website: http://www.ncbi.nlm.nih.gov/SNP/snp_blastByOrg.cgi The site basically takes your info and submits it to the main blast site: http://www.ncbi.nlm.nih.gov/BLAST/Blast.cgi I am just not sure what settings to change within my script. I have something like this: #!/usr/bin/perl -w use Bio::Perl; use Bio::Tools::Run::RemoteBlast; my $prog = "blastn"; my $db = "refseq_genomic"; <--- What db should I use?? my $e_val = 0.01; my @params = ( '-prog' => $prog, '-data' => $db, '-expect' => $e_val); my $factory = new Bio::Tools::Run::RemoteBlast->new(@params); $factory->submit_blast("blast.in"); <--- Name of my file in fasta format $v = 1; while (my @rids = $factory->each_rid) { foreach my $rid ( @rids ) { my $rc = $qu->retrieve_blast($rid); if( !ref($rc) ) { if( $rc < 0 ) { $factory->remove_rid($rid); } print STDERR "." if ( $v > 0 ); sleep 5; } else { my $result = $rc->next_result(); my $filename = $result->query_name()."\.out"; $factory->save_output($filename); $factory->remove_rid($rid); print "\nQuery Name: ", $result->query_name(), "\n"; } } } I think something like this should be added to have the correct form inputs but I am unsure: $Bio::Tools::Run::RemoteBlast::HEADER{'???'} = '????'; Any help on this topic would greatly be appreciated!! Rohan ---------------------------------------- This mail sent through www.mywaterloo.ca From cjfields at uiuc.edu Fri Jul 14 00:42:57 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 13 Jul 2006 19:42:57 -0500 Subject: [Bioperl-l] remoteBlast problem In-Reply-To: <20060713145503.CIV61560@punts2.cc.uga.edu> Message-ID: <000401c6a6de$737fe570$15327e82@pyrimidine> 1) Before I get wound up in the obvious here, you need to upgrade to CVS; RemoteBlast and SearchIO::blast were fixed post v.-1.5.1 (i.e. 
in CVS) to account for changes in BLAST output at the NCBI 2) The Bio::Tools::Run::StandAloneBlast.pm bit worried me a little, so I did a little digging; that's a typo. Now corrected in CVS, along with some BPLite cruft left over. 3) Speaking bluntly? Come on. The error is stated as plainly as possible. No? How about this (note the arrows): -----------> **Warning**: Couldn't connect to NCBI with -----------> Bio::Tools::Run::StandAloneBlast.pm! -----------> Probably no network access. Skipping Test Check your network connections, preferably AFTER you update to CVS. It's possible that it's a proxy issue, but that should also be fixed in CVS. Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Dongsheng Che > Sent: Thursday, July 13, 2006 1:55 PM > To: bioperl-l at bioperl.org > Subject: [Bioperl-l] remoteBlast problem > > To whom it may concern: > > I'm trying to do blast search remotely, so I downloaded bioperl-1.5, and > followed the installation procedure, ie, perl Makefile.PL, make, make > test. make install. I know there are some installation failure during the > installation. > > Since my main purpose is to get remoteBlast worked, I don't want bother to > figure out all failures. but I run remote Blast, it gave me some erorrs > from examples (bptutorial). > ------------------------------------------------------------- > Beginning run_remoteblast example... > Use of uninitialized value in numeric lt (<) at bptutorial.pl line 3303. > > > **Warning**: Couldn't connect to NCBI with > Bio::Tools::Run::StandAloneBlast.pm! > Probably no network access. > Skipping Test > ---------------------------------------------------------------- > > I wondering what cause the problem. > > Thanks in advance! 
> > Dongsheng > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Fri Jul 14 01:56:30 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 13 Jul 2006 20:56:30 -0500 Subject: [Bioperl-l] Remote Blast - Blast Human Genome In-Reply-To: <1152830359.44b6cb97ef16c@www.nexusmail.uwaterloo.ca> Message-ID: <000501c6a6e8$b9c24d20$15327e82@pyrimidine> I added a method to RemoteBlast in bioperl-live (CVS) if you want to play with changing the URL. I have been thinking about doing this for a bit now but I already see problems. Here's the issue: the BLAST page you see is NOT the NCBI BLAST page (note the differences in the URL) but a user-friendly request page, generated on the fly by Genome, to submit BLAST requests for the relevant database. So changing the URL will not work (even by adding extra parameters); you only get the original HTML web page. You could try changing the database or limiting the search using an Entrez term (which you should be able to include in the request, probably by adding it to the HEADER). Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of vrramnar at student.cs.uwaterloo.ca > Sent: Thursday, July 13, 2006 5:39 PM > To: bioperl-l at lists.open-bio.org > Subject: [Bioperl-l] Remote Blast - Blast Human Genome > > > Hello Again, > > I have another question regarding Remote blast but this time using Genome > Blast. > > Here is the link: > > http://www.ncbi.nlm.nih.gov/genome/seq/BlastGen/BlastGen.cgi?taxid=9606 > > which again uses the main Blast web site: > > http://www.ncbi.nlm.nih.gov/BLAST/Blast.cgi > > Again I am not sure what to add or what HEADER information to change > within my > script. 
> > Here is my program, which was the same as the last email: > > #!/usr/bin/perl -w > > use Bio::Perl; > use Bio::Tools::Run::RemoteBlast; > > my $prog = "blastn"; > my $db = "refseq_genomic"; > my $e_val = 0.01; > > my @params = ( '-prog' => $prog, > '-data' => $db, > '-expect' => $e_val); > > my $factory = new Bio::Tools::Run::RemoteBlast->new(@params); > $Bio::Tools::Run::RemoteBlast::HEADER{'WWW_BLAST_TYPE'} = '????'; <----- > what > do I put here > #$Bio::Tools::Run::RemoteBlast::HEADER{'?????'} = '????'; <--- Do I need > to add > any other values to the form inputs > > $factory->submit_blast("blast.in"); > $v = 1; > > while (my @rids = $factory->each_rid) > { foreach my $rid ( @rids ) > { my $rc = $factory->retrieve_blast($rid); > if( !ref($rc) ) > { if( $rc < 0 ) > { $factory->remove_rid($rid); > } > print STDERR "." if ( $v > 0 ); > sleep 5; > } > else > { my $result = $rc->next_result(); > my $filename = $result->query_name()."\.out"; > $factory->save_output($filename); > $factory->remove_rid($rid); > print "\nQuery Name: ", $result->query_name(), "\n"; > } > } > } > > > Both of my questions are very similiar as in I know how to use remote > blast but > not sure what to change to access the specific blast I want. > > Again, any help would be very appreciated!! > > Rohan > > > > ---------------------------------------- > This mail sent through www.mywaterloo.ca > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From smart_bioit at yahoo.com Fri Jul 14 17:25:51 2006 From: smart_bioit at yahoo.com (raj sharma) Date: Fri, 14 Jul 2006 10:25:51 -0700 (PDT) Subject: [Bioperl-l] advice Message-ID: <20060714172551.60978.qmail@web37309.mail.mud.yahoo.com> i have one problem in perl i want to make one program which whn run online can download required data from data bank to local server frm where i shld start --------------------------------- Yahoo! 
Music Unlimited - Access over 1 million songs.Try it free.

From charlesh at stedwards.edu Sat Jul 15 19:29:46 2006
From: charlesh at stedwards.edu (Charles Hauser)
Date: Sat, 15 Jul 2006 14:29:46 -0500
Subject: [Bioperl-l] Finding locations of a string within a fasta file
Message-ID: <040D0FF5-0BD1-49DA-B66C-32E359F254BA@stedwards.edu>

All,

I'm trying to determine where (the start .. end positions) within a
genomic scaffold sequence gaps occur.
The gaps are denoted as runs of N's.

Suggestions on how to easily retrieve this would be appreciated.

ch

From cjfields at uiuc.edu Sat Jul 15 21:22:15 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Sat, 15 Jul 2006 16:22:15 -0500
Subject: [Bioperl-l] Finding locations of a string within a fasta file
In-Reply-To: <040D0FF5-0BD1-49DA-B66C-32E359F254BA@stedwards.edu>
Message-ID: <000001c6a854$bee47400$15327e82@pyrimidine>

You can retrieve the original GenBank CONTIG file using Bio::DB::GenBank if
the format is set to 'gb' (it is now set to 'gbwithparts' by default). The
CONTIG lines are currently stored in a series of
Bio::Annotation::SimpleValue objects; get the accessions using the following
script.

use strict;
use warnings;

use Bio::DB::GenBank;
use Bio::SeqIO;    # needed for the Bio::SeqIO->new call below

my $factory = Bio::DB::GenBank->new(-format => 'gb');

my $seq = $factory->get_Seq_by_id(shift);

my $seqout = Bio::SeqIO->new(-fh     => \*STDOUT,
                             -format => 'genbank');

# greps only annotations with CONTIG tagname, joins all together
my $contig = join '', grep {$_->tagname eq 'CONTIG'}
                      $seq->get_Annotations();

# split each region, getting rid of gaps and join(), then split into acc/span
for (grep {$_ !~ m{gap|join}}
     split ',', $contig) {
    my ($acc, $span) = split ':', $_;
    $span =~ s{\)}{}g; # spurious ')'
    print "ACC: $acc\n\tSpan:$span\n";
}

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of Charles Hauser
> Sent: Saturday, July 15, 2006 2:30 PM
> To: bioperl-l at lists.open-bio.org
> Subject: [Bioperl-l] Finding locations of a string within a fasta file
>
> All,
>
> I'm trying to determine where (the start .. end positions) within a
> genomic scaffold sequence gaps occur.
> The gaps are denoted as runs of N's.
>
> Suggestions on how to easily retrieve this would be appreciated.
>
> ch
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

From sudhaneti at yahoo.com Sat Jul 15 19:26:01 2006
From: sudhaneti at yahoo.com (Sudha Gunturu)
Date: Sat, 15 Jul 2006 12:26:01 -0700 (PDT)
Subject: [Bioperl-l] BLOSUM matrix
Message-ID: <20060715192601.36517.qmail@web53315.mail.yahoo.com>

Being a beginner at Perl programming, I was wondering if anyone can help me
with implementing the BLOSUM 65 matrix for the following alignments, or in
general. Any inputs or websites that could help with this are appreciated.

AILCAA
ALLLAA
ILIICL

Thanks
Sudha
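
One possible reading of the BLOSUM question above, offered only as a rough sketch: keep the substitution scores in a hash of hashes and use them in a standard Needleman-Wunsch global alignment. Nothing here comes from the thread itself. The scores are the BLOSUM62 entries for the residues A, C, I and L (used as a stand-in because BLOSUM62 is the matrix commonly distributed), the linear gap penalty of -4 is an arbitrary assumption, and the names %SUBST, pair_score and global_align are made up for illustration.

```perl
use strict;
use warnings;

# ASSUMPTION: BLOSUM62 scores for the residues A, C, I and L only, standing
# in for whichever full matrix you actually want to use.
my %SUBST = (
    A => { A =>  4, C =>  0, I => -1, L => -1 },
    C => { A =>  0, C =>  9, I => -1, L => -1 },
    I => { A => -1, C => -1, I =>  4, L =>  2 },
    L => { A => -1, C => -1, I =>  2, L =>  4 },
);
my $GAP = -4;    # ASSUMPTION: linear gap penalty, chosen for illustration

# Look up one residue pair; die on residues missing from the table.
sub pair_score {
    my ($x, $y) = @_;
    my $s = $SUBST{$x}{$y};
    die "no score for $x/$y\n" unless defined $s;
    return $s;
}

# Needleman-Wunsch global alignment of two strings.
# Returns (score, aligned first string, aligned second string).
sub global_align {
    my ($s, $t) = @_;
    my @x = split //, $s;
    my @y = split //, $t;
    my ($n, $m) = (scalar @x, scalar @y);

    # $score[$i][$j]: best score for the first $i residues of $s
    # aligned against the first $j residues of $t.
    my @score;
    $score[$_][0] = $_ * $GAP for 0 .. $n;
    $score[0][$_] = $_ * $GAP for 0 .. $m;
    for my $i (1 .. $n) {
        for my $j (1 .. $m) {
            my $best = $score[$i - 1][$j - 1] + pair_score($x[$i - 1], $y[$j - 1]);
            for my $alt ($score[$i - 1][$j] + $GAP, $score[$i][$j - 1] + $GAP) {
                $best = $alt if $alt > $best;
            }
            $score[$i][$j] = $best;
        }
    }

    # Trace back from the bottom-right cell to recover one optimal alignment.
    my ($i, $j) = ($n, $m);
    my ($row_x, $row_y) = ('', '');
    while ($i > 0 || $j > 0) {
        if ($i > 0 && $j > 0
            && $score[$i][$j] == $score[$i - 1][$j - 1] + pair_score($x[$i - 1], $y[$j - 1])) {
            $row_x = $x[--$i] . $row_x;    # residue against residue
            $row_y = $y[--$j] . $row_y;
        }
        elsif ($i > 0 && $score[$i][$j] == $score[$i - 1][$j] + $GAP) {
            $row_x = $x[--$i] . $row_x;    # gap in the second sequence
            $row_y = '-' . $row_y;
        }
        else {
            $row_x = '-' . $row_x;         # gap in the first sequence
            $row_y = $y[--$j] . $row_y;
        }
    }
    return ($score[$n][$m], $row_x, $row_y);
}

my ($score, $top, $bottom) = global_align('AILCAA', 'ALLLAA');
print "score $score\n$top\n$bottom\n";
```

With these assumed values the first two sequences align without gaps (4 + 2 + 4 - 1 + 4 + 4 = 17). For real work you would load a complete matrix rather than typing values in by hand.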
From charlesh at stedwards.edu Sun Jul 16 23:32:38 2006
From: charlesh at stedwards.edu (Charles Hauser)
Date: Sun, 16 Jul 2006 18:32:38 -0500
Subject: [Bioperl-l] Finding locations of a string within a fasta file
In-Reply-To: <000001c6a854$bee47400$15327e82@pyrimidine>
References: <000001c6a854$bee47400$15327e82@pyrimidine>
Message-ID:

Hi Chris,

Thanks for the info. Unfortunately, I was not clear that the sequence
is unannotated, i.e. there is no GenBank record. I need to extract the
locations of the gaps from a raw fasta file.

ch

On Jul 15, 2006, at 4:22 PM, Chris Fields wrote:

> You can retrieve the original GenBank CONTIG file using
> Bio::DB::GenBank if the format is set to 'gb' (it is now set to
> 'gbwithparts' by default. The CONTIG lines are currently stored in a
> series of Bio::Annotation::SimpleValue objects; get the accessions
> using the following script.
>
> use strict;
> use warnings;
>
> use Bio::DB::GenBank;
>
> my $factory = Bio::DB::GenBank->new(-format => 'gb');
>
> my $seq = $factory->get_Seq_by_id(shift);
>
> my $seqout = Bio::SeqIO->new(-fh => \*STDOUT,
>                              -format => 'genbank');
>
> # greps only annotations with CONTIG tagname, joins all together
> my $contig = join '', grep {$_->tagname eq 'CONTIG'}
>                       $seq->get_Annotations();
>
> # split each region, getting rid of gaps and join(), then split into acc/span
> for (grep {$_ !~ m{gap|join}}
>      split ',', $contig) {
>     my ($acc, $span) = split ':', $_;
>     $span =~ s{\)}{}g; # spurious ')'
>     print "ACC: $acc\n\tSpan:$span\n";
> }
>
>
>
>> -----Original Message-----
>> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
>> bounces at lists.open-bio.org] On Behalf Of Charles Hauser
>> Sent: Saturday, July 15, 2006 2:30 PM
>> To: bioperl-l at lists.open-bio.org
>> Subject: [Bioperl-l] Finding locations of a string within a fasta
>> file
>>
>> All,
>>
>> I'm trying to determine where (the start .. end positions) within a
>> genomic scaffold sequence gaps occur.
>> The gaps are denoted as runs of N's. >> >> Suggestions on how to easily retrieve this would be appreciated. >> >> ch >> >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > From torsten.seemann at infotech.monash.edu.au Mon Jul 17 02:23:51 2006 From: torsten.seemann at infotech.monash.edu.au (Torsten Seemann) Date: Mon, 17 Jul 2006 12:23:51 +1000 Subject: [Bioperl-l] advice In-Reply-To: <20060714172551.60978.qmail@web37309.mail.mud.yahoo.com> References: <20060714172551.60978.qmail@web37309.mail.mud.yahoo.com> Message-ID: <44BAF4B7.8090508@infotech.monash.edu.au> raj sharma wrote: > i have one problem in perl is this Bio::Perl related? > i want to make one program which whn run online do you mean runs on a web server as a CGI script, or access on-line data? > can download required data from data bank to local server which databank - genbank or ... ? > frm where i shld start http://www.oreilly.com/catalog/lperl3/ -- Dr Torsten Seemann http://www.vicbioinformatics.com Victorian Bioinformatics Consortium, Monash University, Australia From torsten.seemann at infotech.monash.edu.au Mon Jul 17 02:21:31 2006 From: torsten.seemann at infotech.monash.edu.au (Torsten Seemann) Date: Mon, 17 Jul 2006 12:21:31 +1000 Subject: [Bioperl-l] Finding locations of a string within a fasta file In-Reply-To: <040D0FF5-0BD1-49DA-B66C-32E359F254BA@stedwards.edu> References: <040D0FF5-0BD1-49DA-B66C-32E359F254BA@stedwards.edu> Message-ID: <44BAF42B.8080102@infotech.monash.edu.au> > I'm trying to determine where (the start .. end positions) within a > genomic scaffold sequence gaps occur. > The gaps are denoted as runs of N's. > Suggestions on how to easily retrieve this would be appreciated. First you need to get the sequence into a string within Perl. As your email Subject: says it is in the Fasta file, you need to 1. open the fasta file - see Bio::SeqIO 2. 
read first sequence (as an object) - see next_seq() 3. get the string of the sequence in the object - see seq() Then you could just use the built-in Perl function index() to loop through all the occurrences of 'N' - type 'perldoc -f index' for help. Alternatively, use regexp matching, e.g. m/(N+)/g, and the pos() function. -- Dr Torsten Seemann http://www.vicbioinformatics.com Victorian Bioinformatics Consortium, Monash University, Australia From sudhaneti at yahoo.com Mon Jul 17 02:33:20 2006 From: sudhaneti at yahoo.com (Sudha Gunturu) Date: Sun, 16 Jul 2006 19:33:20 -0700 (PDT) Subject: [Bioperl-l] BLOSUM matrix In-Reply-To: <44BAF316.9020301@infotech.monash.edu.au> Message-ID: <20060717023320.6402.qmail@web53313.mail.yahoo.com> Sorry for not being clear with my question. Let me try to explain. I want to implement dynamic programming using BLOSUM as the scoring matrix. 1. I want to know how to define the values of BLOSUM in an array. 2. What functions are suitable for global alignment of two sequences? Etc. Being a beginner programmer, I want some direction, books, and good websites which can help me in achieving the implementation. It would be great if someone could walk me through this. Thanks Sudha Torsten Seemann wrote: Sudha, > Being a beginner perl programming, was wondering if anyone can help me with implementation of BLOSUM 65 matrix for the following alignments or in general. Any inputs, websites to help with this are appreciated. > AILCAA > ALLLAA > ILIICL The BLOSUM65 matrix does not define a method for alignment, it just provides some parameters. Perhaps you should read this first: http://en.wikipedia.org/wiki/Sequence_alignment -- Dr Torsten Seemann http://www.vicbioinformatics.com Victorian Bioinformatics Consortium, Monash University, Australia
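
Torsten's regexp suggestion in the gap-finding thread above can be sketched roughly as follows. This is only a minimal illustration, not code posted to the list: find_gaps and the toy scaffold string are made-up names for the example, and in practice the sequence string would come from Bio::SeqIO's next_seq() and seq() as he describes. Reporting positions as 1-based and inclusive is an assumption about what is wanted.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): return [start, end] pairs,
# 1-based and inclusive, for every run of N/n in a sequence string.
sub find_gaps {
    my ($seq) = @_;
    my @gaps;
    while ($seq =~ /(N+)/gi) {
        my $end   = pos($seq);              # 0-based offset just past the match
        my $start = $end - length($1) + 1;  # which equals the 1-based end position
        push @gaps, [$start, $end];
    }
    return @gaps;
}

# Toy scaffold; a real string would be read from the fasta file
# via Bio::SeqIO, next_seq() and seq().
my $scaffold = 'ACGTNNNNACGTNNACGT';
printf "gap: %d..%d\n", @$_ for find_gaps($scaffold);
```

Because pos() is only consulted after a successful match, the loop ends cleanly when no further run of N is found.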
From torsten.seemann at infotech.monash.edu.au Mon Jul 17 02:16:54 2006 From: torsten.seemann at infotech.monash.edu.au (Torsten Seemann) Date: Mon, 17 Jul 2006 12:16:54 +1000 Subject: [Bioperl-l] BLOSUM matrix In-Reply-To: <20060715192601.36517.qmail@web53315.mail.yahoo.com> References: <20060715192601.36517.qmail@web53315.mail.yahoo.com> Message-ID: <44BAF316.9020301@infotech.monash.edu.au> Sudha, > Being a beginner perl programming, was wondering if anyone can help me with implementation of BLOSUM 65 matrix for the following alignments or in general. Any inputs, websites to help with this are appreciated. > AILCAA > ALLLAA > ILIICL The BLOSUM65 matrix does not define a method for alignment, it just provides some parameters. Perhaps you should read this first: http://en.wikipedia.org/wiki/Sequence_alignment -- Dr Torsten Seemann http://www.vicbioinformatics.com Victorian Bioinformatics Consortium, Monash University, Australia From smart_bioit at yahoo.com Mon Jul 17 04:21:41 2006 From: smart_bioit at yahoo.com (raj sharma) Date: Sun, 16 Jul 2006 21:21:41 -0700 (PDT) Subject: [Bioperl-l] advice In-Reply-To: <44BAF4B7.8090508@infotech.monash.edu.au> Message-ID: <20060717042141.9069.qmail@web37307.mail.mud.yahoo.com> hi trston well here i will make u clear my problem i want to make one data base of marine species u can say this as mirror of data so at present whn i click there on line data base of ncbi gets open so i want to dowload data of marine species (ny one) nd whn ever i click on tht link local data which i have downloaded shld open nd data shld also b updated online after some time waiting for ur reply --------------------------------- Do you Yahoo!? Next-gen email? Have it all with the all-new Yahoo! Mail Beta. 
From cjfields at uiuc.edu Mon Jul 17 04:51:20 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Sun, 16 Jul 2006 23:51:20 -0500 Subject: [Bioperl-l] BLOSUM matrix In-Reply-To: <20060717023320.6402.qmail@web53313.mail.yahoo.com> References: <20060717023320.6402.qmail@web53313.mail.yahoo.com> Message-ID: Hmm, beginner programmer, wants to learn Perl? Here are some directions: http://learn.perl.org/ Start with Schwartz's latest incarnation of Learning Perl, then work your way up to Intermediate Perl (I think Mastering Perl is on the horizon...) For some pointers using Perl and bioinformatics, pick up Tisdall's books Beginning/Mastering Perl for Bioinformatics. This is really a list for BioPerl, not Perl and bioinformatics (though the two cross here all the time!). We normally don't mind answering questions but we typically don't do people's homework unless we're unusually bored. And we can be excessively cranky when someone repeatedly posts requests for something that shouldn't take much reading and Googling to find out. Again, we're not into that homework gig, i.e. 'walking you through it' is tantamount to 'doing it for you.' 1) Arrays and how to use them are in Learning Perl; there are probably better ways to do this than an array, though... 2) Use Torsten's link to get you started. Chris On Jul 16, 2006, at 9:33 PM, Sudha Gunturu wrote: > Sorry for not being clear with my question. Let me try to > explain. I want to Implement dynamic programing using Blosum as > scoring matrix. > > 1. I want to know how to define the values of Blosum in an array. > 2. What functions are suitable for global alignment of two > sequences. Etc., > > Being a beginer programer want some direction, books, and good > websites which can help me in achieving the implementation. It > would be great if someone can walk me through this. 
> > Thanks > Sudha > > Torsten Seemann wrote: > Sudha, > >> Being a beginner perl programming, was wondering if anyone can >> help me with implementation of BLOSUM 65 matrix for the following >> alignments or in > general. Any inputs, websites to help with this are appreciated. >> AILCAA >> ALLLAA >> ILIICL > > The BLOSUM65 matrix does not define a method for alignment, it just > provides some parameters. Perhaps you should read this first: > > http://en.wikipedia.org/wiki/Sequence_alignment > > -- > Dr Torsten Seemann http://www.vicbioinformatics.com > Victorian Bioinformatics Consortium, Monash University, Australia > > > > > --------------------------------- > Do you Yahoo!? > Everyone is raving about the all-new Yahoo! Mail Beta. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From cjfields at uiuc.edu Mon Jul 17 05:01:53 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 17 Jul 2006 00:01:53 -0500 Subject: [Bioperl-l] advice In-Reply-To: <20060717042141.9069.qmail@web37307.mail.mud.yahoo.com> References: <20060717042141.9069.qmail@web37307.mail.mud.yahoo.com> Message-ID: <82C51420-A18B-4DEA-A519-CE1D7B9C7B10@uiuc.edu> This is a Bioperl list. If you don't have a Bioperl-related question, you will very likely get testy replies. I don't believe that you quite understand Torsten's response, so I'll just copy-and-paste from a reply I just gave a second ago to save myself the typing: Hmm, beginner programmer, wants to learn perl? Here are some directions: http://learn.perl.org/ Start with Schwartz's latest incarnation of Learning Perl, then work your way up to Intermediate Perl (I think Mastering Perl is on the horizon...) 
For some pointers using Perl and bioinformatics, pick up Tisdall's books Beginning/Mastering Perl for Bioinformatics. This is really a list for BioPerl, not Perl and bioinformatics (though the two cross here all the time!). We normally don't mind answering questions but we typically don't do people's homework unless we're unusually bored. And we can be excessively cranky when someone repeatedly posts requests for something that shouldn't take much reading and Googling to find out. Again, we're not into that homework gig, i.e. 'walking you through it' is tantamount to 'doing it for you.' For your particular instance, you might want to brush up on web services, CGI, and a little web etiquette. http://catb.org/esr/faqs/smart-questions.html I think you may be waiting for a long time for a reply! Chris On Jul 16, 2006, at 11:21 PM, raj sharma wrote: > > hi trston > well here i will make u clear my problem > > i want to make one data base of marine species u can say this as > mirror of data > > > so at present whn i click there on line data base of ncbi gets open > > so i want to dowload data of marine species (ny one) > nd whn ever i click on tht link local data which i have > downloaded shld open > nd data shld also b updated online after some time > > waiting for ur reply Christopher Fields Postdoctoral Researcher Lab of Dr. 
Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From bmoore at genetics.utah.edu Mon Jul 17 05:25:32 2006 From: bmoore at genetics.utah.edu (Barry Moore) Date: Sun, 16 Jul 2006 23:25:32 -0600 Subject: [Bioperl-l] advice In-Reply-To: <20060714172551.60978.qmail@web37309.mail.mud.yahoo.com> Message-ID: By reading this: http://catb.org/esr/faqs/smart-questions.html -----Original Message----- From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of raj sharma Sent: Friday, July 14, 2006 11:26 AM To: bioperl-l at lists.open-bio.org Subject: [Bioperl-l] advice i have one problem in perl i want to make one program which whn run online can download required data from data bank to local server frm where i shld start _______________________________________________ Bioperl-l mailing list Bioperl-l at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/bioperl-l From bmoore at genetics.utah.edu Mon Jul 17 05:34:58 2006 From: bmoore at genetics.utah.edu (Barry Moore) Date: Sun, 16 Jul 2006 23:34:58 -0600 Subject: [Bioperl-l] advice In-Reply-To: <20060717042141.9069.qmail@web37307.mail.mud.yahoo.com> Message-ID: If you're on a Unix-type system, look at wget --mirror and its variations. 
B -----Original Message----- From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of raj sharma Sent: Sunday, July 16, 2006 10:22 PM To: Torsten Seemann Subject: Re: [Bioperl-l] advice hi trston well here i will make u clear my problem i want to make one data base of marine species u can say this as mirror of data so at present whn i click there on line data base of ncbi gets open so i want to dowload data of marine species (ny one) nd whn ever i click on tht link local data which i have downloaded shld open nd data shld also b updated online after some time waiting for ur reply _______________________________________________ Bioperl-l mailing list Bioperl-l at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/bioperl-l From bix at sendu.me.uk Mon Jul 17 14:32:13 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Mon, 17 Jul 2006 15:32:13 +0100 Subject: [Bioperl-l] Bio::Map changes In-Reply-To: <44ACCCD5.3030309@sendu.me.uk> References: <44985915.8010607@sendu.me.uk> <449A9AF9.2000305@sendu.me.uk> <44ACCCD5.3030309@sendu.me.uk> Message-ID: <44BB9F6D.10005@sendu.me.uk> Sendu Bala wrote: > Sendu Bala wrote: >> The reimplementation will make Position central to the model, allowing >> for lots of other things to work properly without anything becoming >> inconsistent (as is currently the case). > > This is now done. It uses a new PositionHandler class behind the scenes. > > The next step is to introduce relative positioning across the board This is now done. It uses a new Relative class to describe what a given position is relative to. I also made Bio::Map::MapI an AnnotatableI and SimpleMap an implementor. I think this pretty much brings an end to my changes to Bio::Map. Unless anyone thinks the changes lack sanity, I think the API of the new things should be somewhat stable. 
> possibly in a way that makes OrderedPosition redundant or an implementer > of the system. I haven't yet touched the other kinds of Positions to update/remove them. Docs in general could probably do with an update/ improvement. I plan to do this 'soon'. From golharam at umdnj.edu Mon Jul 17 14:13:20 2006 From: golharam at umdnj.edu (Ryan Golhar) Date: Mon, 17 Jul 2006 10:13:20 -0400 Subject: [Bioperl-l] advice In-Reply-To: Message-ID: <004301c6a9ab$2896c8b0$2f01a8c0@GOLHARMOBILE1> I apologize that this is off-topic, but it is an interesting email. Notice the lack of vowels (whn, ny, nd, shld, b) however in other words, the vowels are clearly included. Am I getting old or is "internet spelling" starting to differ from "english spelling"? Or is it that the younger generation (not that I'm old...a mere 32 is not old), using shorthand for frequently used words? -----Original Message----- From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Barry Moore Sent: Monday, July 17, 2006 1:35 AM To: raj sharma Cc: bioperl-l Subject: Re: [Bioperl-l] advice If you're on a unix type system look at wget -mirror and it's variations. B -----Original Message----- From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of raj sharma Sent: Sunday, July 16, 2006 10:22 PM To: Torsten Seemann Subject: Re: [Bioperl-l] advice hi trston well here i will make u clear my problem i want to make one data base of marine species u can say this as mirror of data so at present whn i click there on line data base of ncbi gets open so i want to dowload data of marine species (ny one) nd whn ever i click on tht link local data which i have downloaded shld open nd data shld also b updated online after some time waiting for ur reply --------------------------------- Do you Yahoo!? Next-gen email? Have it all with the all-new Yahoo! Mail Beta. 
_______________________________________________ Bioperl-l mailing list Bioperl-l at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/bioperl-l _______________________________________________ Bioperl-l mailing list Bioperl-l at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/bioperl-l From arareko at campus.iztacala.unam.mx Mon Jul 17 15:31:09 2006 From: arareko at campus.iztacala.unam.mx (Mauricio Herrera Cuadra) Date: Mon, 17 Jul 2006 10:31:09 -0500 Subject: [Bioperl-l] advice In-Reply-To: <004301c6a9ab$2896c8b0$2f01a8c0@GOLHARMOBILE1> References: <004301c6a9ab$2896c8b0$2f01a8c0@GOLHARMOBILE1> Message-ID: <44BBAD3D.2040203@campus.iztacala.unam.mx> Maybe it's a new "obscure" perl6 syntax :) Ryan Golhar wrote: > I apologize that this is off-topic, but it is an interesting email. > Notice the lack of vowels (whn, ny, nd, shld, b) however in other > words, the vowels are clearly included. > > Am I getting old or is "internet spelling" starting to differ from > "english spelling"? Or is it that the younger generation (not that I'm > old...a mere 32 is not old), using shorthand for frequently used words? > > > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org > [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Barry Moore > Sent: Monday, July 17, 2006 1:35 AM > To: raj sharma > Cc: bioperl-l > Subject: Re: [Bioperl-l] advice > > > If you're on a unix type system look at wget -mirror and it's > variations. 
> > B > > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org > [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of raj sharma > Sent: Sunday, July 16, 2006 10:22 PM > To: Torsten Seemann > Subject: Re: [Bioperl-l] advice > > > hi trston > well here i will make u clear my problem > > i want to make one data base of marine species u can say this as > mirror of data > > > so at present whn i click there on line data base of ncbi gets open > > so i want to dowload data of marine species (ny one) > nd whn ever i click on tht link local data which i have downloaded > shld open > nd data shld also b updated online after some time > > waiting for ur reply > -- MAURICIO HERRERA CUADRA arareko at campus.iztacala.unam.mx Laboratorio de Genética Unidad de Morfofisiología y Función Facultad de Estudios Superiores Iztacala, UNAM From cjfields at uiuc.edu Mon Jul 17 16:09:27 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 17 Jul 2006 11:09:27 -0500 Subject: [Bioperl-l] advice In-Reply-To: <004301c6a9ab$2896c8b0$2f01a8c0@GOLHARMOBILE1> Message-ID: <000a01c6a9bb$6478ba90$15327e82@pyrimidine> Ha! I *almost* added something about that. I thought his vowel keys were broken for a bit, maybe from pounding the keyboard with extreme frustration! 
As an aside, doesn't Damian Conway say something about the non-use of vowels in 'Perl Best Practices?' I think it was in relation to variables, though... Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Ryan Golhar > Sent: Monday, July 17, 2006 9:13 AM > To: 'bioperl-l' > Subject: Re: [Bioperl-l] advice > > I apologize that this is off-topic, but it is an interesting email. > Notice the lack of vowels (whn, ny, nd, shld, b) however in other > words, the vowels are clearly included. > > Am I getting old or is "internet spelling" starting to differ from > "english spelling"? Or is it that the younger generation (not that I'm > old...a mere 32 is not old), using shorthand for frequently used words? > > > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org > [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Barry Moore > Sent: Monday, July 17, 2006 1:35 AM > To: raj sharma > Cc: bioperl-l > Subject: Re: [Bioperl-l] advice > > > If you're on a unix type system look at wget -mirror and it's > variations. > > B > > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org > [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of raj sharma > Sent: Sunday, July 16, 2006 10:22 PM > To: Torsten Seemann > Subject: Re: [Bioperl-l] advice > > > hi trston > well here i will make u clear my problem > > i want to make one data base of marine species u can say this as > mirror of data > > > so at present whn i click there on line data base of ncbi gets open > > so i want to dowload data of marine species (ny one) > nd whn ever i click on tht link local data which i have downloaded > shld open > nd data shld also b updated online after some time > > waiting for ur reply > > > > > > > --------------------------------- > Do you Yahoo!? > Next-gen email? Have it all with the all-new Yahoo! Mail Beta. 
From bix at sendu.me.uk Mon Jul 17 16:31:37 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Mon, 17 Jul 2006 17:31:37 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes Message-ID: <44BBBB69.6000906@sendu.me.uk>

I see strange node names via Bio::DB::Taxonomy::flatfile:

use Bio::DB::Taxonomy;

my $db = new Bio::DB::Taxonomy(-source    => 'flatfile',
                               -directory => $taxonomy_dir,
                               -nodesfile => $taxonomy_dir.'nodes.dmp',
                               -namesfile => $taxonomy_dir.'names.dmp');

my $tax_id = 89593;
my $node = $db->get_Taxonomy_Node($tax_id);

print "node $tax_id has name '", @{$node->name('common')},
      "' and rank '", $node->rank, "'\n";

Results in:
node 89593 has name 'Craniata ' and rank 'subphylum'

Other examples:
node 2 has name 'Bacteria ' and rank 'superkingdom'
node 1386 has name 'Bacillus ' and rank 'genus'
node 7776 has name 'Gnathostomata ' and rank 'superclass'
etc.

For me the bits in <> are inappropriate and shouldn't be there. The NCBI website agrees, and you won't see these things if you use -source => 'entrez'. Should they be removed by the flatfile parser as a matter of course, with no warnings or option? Or do people want them? Typically they are just the name of the parent node, so I don't see why anyone would /need/ them, and I argue it's invalid for parent node information to be duplicated here.

If there are no objections I'll strip the <> bits. 
I also plan to make $node->name('scientific', 'sapiens'); set and get the node name, and have flatfile and entrez store all common names with $obj->name('common', 'human', 'man');. As these changes will make the implementation match the docs I don't see any problems, except that flatfile users will now find the node name in a different place (@{$node->name('scientific')} instead of @{$node->name('common')}). I'll also fix the problem with node names for ranks species and lower, as discussed in thread 'Bio::DB::Taxonomy:: mishandles species, subspecies/variant names', in the way I suggested there. If anyone can see a problem with any of these changes, let me know asap. From hlapp at gmx.net Mon Jul 17 17:53:17 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Mon, 17 Jul 2006 13:53:17 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BBBB69.6000906@sendu.me.uk> References: <44BBBB69.6000906@sendu.me.uk> Message-ID: <8C935CDE-BEFF-48BC-8CC8-0C3FFEF2EF17@gmx.net> Sound good to me. BTW NCBI guarantees (well, promises) that there will only be one node name of class 'scientific'. -hilmar On Jul 17, 2006, at 12:31 PM, Sendu Bala wrote: > I see strange node names via Bio::DB::Taxonomy::flatfile: > > use Bio::DB::Taxonomy; > > my $db = new Bio::DB::Taxonomy(-source => 'flatfile', -directory => > $taxonomy_dir, -nodesfile => $taxonomy_dir.'nodes.dmp', -namesfile => > $taxonomy_dir.'names.dmp'); > > my $tax_id = 89593; > my $node = $db->get_Taxonomy_Node($tax_id); > > print "node $tax_id has name '", @{$node->name('common')}, "' and rank > '", $node->rank, "'\n"; > > Results in: > node 89593 has name 'Craniata ' and rank 'subphylum' > > Other examples: > node 2 has name 'Bacteria ' and rank 'superkingdom' > node 1386 has name 'Bacillus ' and rank 'genus' > node 7776 has name 'Gnathostomata ' and rank 'superclass' > etc. > > For me the bits in <> are inappropriate and shouldn't be there. 
The > NCBI > website agrees, and you won't see these things if you use -source => > 'entrez'. Should they be removed by the flatfile parser as a matter of > course, with no warnings or option? Or do people want them? Typically > they are just the name of the parent node, so I don't see why anyone > would /need/ them, and I argue it's invalid for parent node > information > to be duplicated here. > > If there are no objections I'll strip the <> bits. I also plan to make > $node->name('scientific', 'sapiens'); set and get the node name, and > have flatfile and entrez store all common names with > $obj->name('common', 'human', 'man');. As these changes will make the > implementation match the docs I don't see any problems, except that > flatfile users will now find the node name in a different place > (@{$node->name('scientific')} instead of @{$node->name('common')}). > > I'll also fix the problem with node names for ranks species and lower, > as discussed in thread 'Bio::DB::Taxonomy:: mishandles species, > subspecies/variant names', in the way I suggested there. > > If anyone can see a problem with any of these changes, let me know > asap. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From cjfields at uiuc.edu Mon Jul 17 18:31:08 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 17 Jul 2006 13:31:08 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <8C935CDE-BEFF-48BC-8CC8-0C3FFEF2EF17@gmx.net> Message-ID: <001d01c6a9cf$2cf50f60$15327e82@pyrimidine> I agree. Would be nice to get this to play well with weird bacterial names! 
I plan on doing some behind-the-scenes work on Bio::DB::Taxonomy::entrez at some point soon to test out Bio::DB::EUtilities as the user agent; it currently uses Bio::Root::HTTPget, I think. Reason I'm doing this is to quickly get tax info based on any primary ID, primarily for grabbing related Tax information from the sequence GI w/o parsing the sequence for the TaxID; this uses NCBI's ELink which I've now implemented. I'll make sure everything passes tests before I commit. Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Hilmar Lapp > Sent: Monday, July 17, 2006 12:53 PM > To: Sendu Bala > Cc: bioperl-l at lists.open-bio.org > Subject: Re: [Bioperl-l] Bio::*Taxonomy* changes > > Sound good to me. > > BTW NCBI guarantees (well, promises) that there will only be one node > name of class 'scientific'. > > -hilmar > > On Jul 17, 2006, at 12:31 PM, Sendu Bala wrote: > > > I see strange node names via Bio::DB::Taxonomy::flatfile: > > > > use Bio::DB::Taxonomy; > > > > my $db = new Bio::DB::Taxonomy(-source => 'flatfile', -directory => > > $taxonomy_dir, -nodesfile => $taxonomy_dir.'nodes.dmp', -namesfile => > > $taxonomy_dir.'names.dmp'); > > > > my $tax_id = 89593; > > my $node = $db->get_Taxonomy_Node($tax_id); > > > > print "node $tax_id has name '", @{$node->name('common')}, "' and rank > > '", $node->rank, "'\n"; > > > > Results in: > > node 89593 has name 'Craniata ' and rank 'subphylum' > > > > Other examples: > > node 2 has name 'Bacteria ' and rank 'superkingdom' > > node 1386 has name 'Bacillus ' and rank 'genus' > > node 7776 has name 'Gnathostomata ' and rank 'superclass' > > etc. > > > > For me the bits in <> are inappropriate and shouldn't be there. The > > NCBI > > website agrees, and you won't see these things if you use -source => > > 'entrez'. Should they be removed by the flatfile parser as a matter of > > course, with no warnings or option? 
Or do people want them? Typically > > they are just the name of the parent node, so I don't see why anyone > > would /need/ them, and I argue it's invalid for parent node > > information > > to be duplicated here. > > > > If there are no objections I'll strip the <> bits. I also plan to make > > $node->name('scientific', 'sapiens'); set and get the node name, and > > have flatfile and entrez store all common names with > > $obj->name('common', 'human', 'man');. As these changes will make the > > implementation match the docs I don't see any problems, except that > > flatfile users will now find the node name in a different place > > (@{$node->name('scientific')} instead of @{$node->name('common')}). > > > > I'll also fix the problem with node names for ranks species and lower, > > as discussed in thread 'Bio::DB::Taxonomy:: mishandles species, > > subspecies/variant names', in the way I suggested there. > > > > If anyone can see a problem with any of these changes, let me know > > asap. 
> > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From bix at sendu.me.uk Mon Jul 17 18:09:44 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Mon, 17 Jul 2006 19:09:44 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <8C935CDE-BEFF-48BC-8CC8-0C3FFEF2EF17@gmx.net> References: <44BBBB69.6000906@sendu.me.uk> <8C935CDE-BEFF-48BC-8CC8-0C3FFEF2EF17@gmx.net> Message-ID: <44BBD268.2060308@sendu.me.uk> Hilmar Lapp wrote: >> I also plan to make $node->name('scientific', 'sapiens'); set and >> get the node name, [...] users will now find the node name in [...] >> @{$node->name('scientific')} > > BTW NCBI guarantees (well, promises) that there will only be one node > name of class 'scientific'. Yes, which is why I feel the API for name() isn't ideal, but thought it would be best to play along. Would having a new scientific_name() method be better, which gets/sets a single value? Perhaps it could just be a more 'sane' shorthand to setting @{$node->name('scientific')} to a list with only the supplied name, and getting ${$node->name('scientific')}[0] ? 
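
The shorthand Sendu proposes could look roughly like the sketch below. ToyNode is a hypothetical stand-in written only for this illustration; it is not Bio::Taxonomy::Node, and its internals are assumptions. The only behaviour taken from the thread is that name($class) returns an array reference and that a node has a single scientific name.

```perl
use strict;
use warnings;

# ToyNode is a hypothetical class for illustration, NOT Bio::Taxonomy::Node.
package ToyNode;

sub new {
    my ($class) = @_;
    my $self = { names => {} };
    return bless $self, $class;
}

# name($name_class)         -> array ref of names stored under that class
# name($name_class, @names) -> replace the names stored under that class
sub name {
    my ($self, $name_class, @names) = @_;
    $self->{names}{$name_class} = [@names] if @names;
    return $self->{names}{$name_class} || [];
}

# Shorthand get/set for the single scientific name, built on name().
sub scientific_name {
    my ($self, $name) = @_;
    $self->name('scientific', $name) if defined $name;
    return $self->name('scientific')->[0];
}

package main;

my $node = ToyNode->new;
$node->scientific_name('sapiens');
$node->name('common', 'human', 'man');

print $node->scientific_name, "\n";
print join(', ', @{ $node->name('common') }), "\n";
```

Keeping scientific_name() as a thin wrapper means the list-based name() API stays unchanged for callers that already use it.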
From hlapp at gmx.net Mon Jul 17 19:31:55 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Mon, 17 Jul 2006 15:31:55 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BBD268.2060308@sendu.me.uk> References: <44BBBB69.6000906@sendu.me.uk> <8C935CDE-BEFF-48BC-8CC8-0C3FFEF2EF17@gmx.net> <44BBD268.2060308@sendu.me.uk> Message-ID: <5B62229C-BAB7-4320-BBAE-87A483B0EC15@gmx.net> Yes I think $node->scientific_name() as shorthand would be good to have. Same BTW for $node->common_names() (which would return an array). -hilmar On Jul 17, 2006, at 2:09 PM, Sendu Bala wrote: > Hilmar Lapp wrote: >>> I also plan to make $node->name('scientific', 'sapiens'); set and >>> get the node name, [...] users will now find the node name in [...] >>> @{$node->name('scientific')} >> >> BTW NCBI guarantees (well, promises) that there will only be one node >> name of class 'scientific'. > > Yes, which is why I feel the API for name() isn't ideal, but > thought it > would be best to play along. Would having a new scientific_name() > method > be better, which gets/sets a single value? Perhaps it could just be a > more 'sane' shorthand to setting @{$node->name('scientific')} to a > list > with only the supplied name, and getting ${$node->name > ('scientific')}[0] ? 
-- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : ===========================================================
From cjfields at uiuc.edu Mon Jul 17 20:44:18 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 17 Jul 2006 15:44:18 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <5B62229C-BAB7-4320-BBAE-87A483B0EC15@gmx.net> Message-ID: <000001c6a9e1$c6b51610$15327e82@pyrimidine> There was some interest in getting Bio::Species to delegate to Bio::Taxonomy::Node, so having scientific_name() would help quite a bit since the name used on the ORGANISM line is the scientific name (well, is supposed to be; famous last words). Don't know about SwissProt, EMBL, and others though... Chris
From vrramnar at student.cs.uwaterloo.ca Mon Jul 17 20:46:32 2006 From: vrramnar at student.cs.uwaterloo.ca (vrramnar at student.cs.uwaterloo.ca) Date: Mon, 17 Jul 2006 16:46:32 -0400 Subject: [Bioperl-l] Remote Blast - Blast Human Genome In-Reply-To: <000501c6a6e8$b9c24d20$15327e82@pyrimidine> References: <000501c6a6e8$b9c24d20$15327e82@pyrimidine> Message-ID: <1153169192.44bbf728056fd@www.nexusmail.uwaterloo.ca> Hi Chris, 1. I have tried changing the database to snp or dbSNP but neither works. It seems that depending on which type of blast you use(ie, Genome Blast, Blast SNP, normal blast such as blastn, etc...) you see a different listing of databases available for querys. Since you mention that the Blast page I see was generated by Genome, where could I go to see a complete listing of databases I can query?? Or if you knew off hand which database to search if I only wanted dbSNP hits? 2. You also mention, I can limit the search by using Entrez terms. Do you mean like: $Bio::Tools::Run::RemoteBlast::HEADER{'ENTREZ_QUERY'} = 'abc'; where 'abc' is the name of the subject with which you would only like to see result of. For example if you put it as 'Homo sapiens[Organism]' then only human sequences would be in hit lists.
If this is what you mean, what would I change it to, to see only hits from dbSNP? Thanks for the ongoing help, Rohan Quoting Chris Fields : > I added a method to RemoteBlast in bioperl-live (CVS) if you want to play > with changing the URL. I have been thinking about doing this for a bit now > but I already see problems. > > Here's the issue: the BLAST page you see is NOT the NCBI BLAST page (note > the differences in the URL) but a user-friendly request page, generated on > the fly by Genome, to submit BLAST requests for the relevant database. So > changing the URL will not work (even by adding extra parameters); you only > get the original HTML web page. > > You could try changing the database or limiting the search using an Entrez > term (which you should be able to include in the request, probably by adding > it to the HEADER). > > Chris > > > -----Original Message----- > > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > > bounces at lists.open-bio.org] On Behalf Of vrramnar at student.cs.uwaterloo.ca > > Sent: Thursday, July 13, 2006 5:39 PM > > To: bioperl-l at lists.open-bio.org > > Subject: [Bioperl-l] Remote Blast - Blast Human Genome > > > > > > Hello Again, > > > > I have another question regarding Remote blast but this time using Genome > > Blast. > > > > Here is the link: > > > > http://www.ncbi.nlm.nih.gov/genome/seq/BlastGen/BlastGen.cgi?taxid=9606 > > > > which again uses the main Blast web site: > > > > http://www.ncbi.nlm.nih.gov/BLAST/Blast.cgi > > > > Again I am not sure what to add or what HEADER information to change > > within my > > script. 
> > > > Here is my program, which was the same as the last email: > > > > #!/usr/bin/perl -w > > > > use Bio::Perl; > > use Bio::Tools::Run::RemoteBlast; > > > > my $prog = "blastn"; > > my $db = "refseq_genomic"; > > my $e_val = 0.01; > > > > my @params = ( '-prog' => $prog, > > '-data' => $db, > > '-expect' => $e_val); > > > > my $factory = new Bio::Tools::Run::RemoteBlast->new(@params); > > $Bio::Tools::Run::RemoteBlast::HEADER{'WWW_BLAST_TYPE'} = '????'; <----- > > what > > do I put here > > #$Bio::Tools::Run::RemoteBlast::HEADER{'?????'} = '????'; <--- Do I need > > to add > > any other values to the form inputs > > > > $factory->submit_blast("blast.in"); > > $v = 1; > > > > while (my @rids = $factory->each_rid) > > { foreach my $rid ( @rids ) > > { my $rc = $factory->retrieve_blast($rid); > > if( !ref($rc) ) > > { if( $rc < 0 ) > > { $factory->remove_rid($rid); > > } > > print STDERR "." if ( $v > 0 ); > > sleep 5; > > } > > else > > { my $result = $rc->next_result(); > > my $filename = $result->query_name()."\.out"; > > $factory->save_output($filename); > > $factory->remove_rid($rid); > > print "\nQuery Name: ", $result->query_name(), "\n"; > > } > > } > > } > > > > > > Both of my questions are very similiar as in I know how to use remote > > blast but > > not sure what to change to access the specific blast I want. > > > > Again, any help would be very appreciated!! 
> > > > Rohan > > > > > > > > ---------------------------------------- > > This mail sent through www.mywaterloo.ca > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > ---------------------------------------- This mail sent through www.mywaterloo.ca
From cjfields at uiuc.edu Mon Jul 17 21:25:54 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 17 Jul 2006 16:25:54 -0500 Subject: [Bioperl-l] Remote Blast - Blast Human Genome In-Reply-To: <1153169192.44bbf728056fd@www.nexusmail.uwaterloo.ca> Message-ID: <001001c6a9e7$962b56c0$15327e82@pyrimidine> Okay, I think I may know what's going on a little more now with NCBI's BLAST interface. Looks like any NCBI BLAST query must use the default URL and so must set up to proper GET/PUT commands to retrieve everything correctly. Here's the API description for it all: http://www.ncbi.nlm.nih.gov/BLAST/Doc/urlapi.html You could try setting the database to 'snp' or something along those lines instead of 'nr'; or you could see what the name of the database is when you use the web form and try setting it to that. According to this page, this should be possible: http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=helpsnpfaq.section.SearchdbSNP _test._Search_dbSNP_Using_B The Entrez Query limit was a recommendation for limiting your search to a set of sequences for human, for instance. I'll try looking into it a bit more but I'm pretty busy. If you find anything out you should probably post it here . Chris
From bix at sendu.me.uk Mon Jul 17 21:33:26 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Mon, 17 Jul 2006 22:33:26 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <000001c6a9e1$c6b51610$15327e82@pyrimidine> References: <000001c6a9e1$c6b51610$15327e82@pyrimidine> Message-ID: <44BC0226.1080605@sendu.me.uk> Chris Fields wrote: > There was some interest in getting Bio::Species to delegate to > Bio::Taxonomy::Node, so having scientific_name() would help quite a bit > since the name used on the ORGANISM line is the scientific name (well, is > supposed to be; famous last words). Can you clarify exactly what you mean here? Preferably with an example? ORGANISM line of which file format? The reason I ask is that I still feel we need to do parsing of the names for species rank and lower: # The 'scientific name' for humans could be considered to be 'Homo sapiens'. # Taxid 9606 in the NCBI taxonomy database has rank 'species' and ScientificName 'Homo sapiens'. # For sanity, Bio::*Taxonomy* likes to interpret this ScientificName as 'sapiens' so that the genus is not held redundantly. It provides a binomial() method to give you 'Homo sapiens' again if you want it. # I plan on maintaining this; scientific_name() would give you the non-redundant sibling-unique name 'sapiens'. binomial() on a species rank and lower would give you 'Homo sapiens' (presumably grabbing the 'Homo' from the parent node with rank 'genus', or similar). Good, bad or ugly?
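The behaviour proposed in the points above can be sketched roughly as follows. This is only an illustration under stated assumptions: Sketch::Node is a hypothetical stand-in class, not the Bio::Taxonomy::Node API, and it handles only rank 'species' (not the lower ranks mentioned above).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for a taxonomy node; NOT the Bio::Taxonomy::Node API.
package Sketch::Node;

sub new {
    my ($class, %args) = @_;
    my $self = {
        name   => $args{name},      # non-redundant, sibling-unique name
        rank   => $args{rank},
        parent => $args{parent},    # another Sketch::Node, or undef at the root
    };
    return bless $self, $class;
}

sub scientific_name { $_[0]{name} }
sub rank            { $_[0]{rank} }
sub parent          { $_[0]{parent} }

# For a species node, walk up the parents until a node of rank 'genus'
# is found and join the two names. Returns undef for any other rank.
sub binomial {
    my ($self) = @_;
    return unless $self->rank eq 'species';
    my $node = $self->parent;
    while ($node) {
        return $node->scientific_name . ' ' . $self->scientific_name
            if $node->rank eq 'genus';
        $node = $node->parent;
    }
    return;
}

package main;

my $genus   = Sketch::Node->new(name => 'Homo',    rank => 'genus');
my $species = Sketch::Node->new(name => 'sapiens', rank => 'species',
                                parent => $genus);

print $species->scientific_name, "\n";
print $species->binomial, "\n";
```

The design point is that the genus string is stored once, on the genus node, and the species node only ever holds its own part of the name.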
I would prefer it works like this and we agree to differ with NCBI on what the 'scientific name' of a species node should be. Bio::Species can still delegate to Bio::Taxonomy::Node by calling binomial() (which I propose will actually give the correct answer, even for bacteria and viruses). Perhaps the short-hand (and the classifier used in name()) shouldn't mention the word 'scientific' to avoid confusion? But a) what else would we call it?, and b) for all ranks above species it /is/ the scientific name.
From hlapp at gmx.net Mon Jul 17 23:47:24 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Mon, 17 Jul 2006 19:47:24 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BC0226.1080605@sendu.me.uk> References: <000001c6a9e1$c6b51610$15327e82@pyrimidine> <44BC0226.1080605@sendu.me.uk> Message-ID: <64AB2D48-B342-4FC4-BB4E-9927B24C972C@gmx.net> I don't think we should differ from NCBI in places where the connection between a method name and the NCBI data file is obvious or otherwise we will confuse people and send them into traps. $node->scientific_name() should simply report what NCBI reports. For simple species this will be identical to what $node->binomial() returns, but for others it may not, e.g., strains, varieties, etc or the weird world of viri and bacteria. This will also absolve us from retaining the business logic for how to construct the scientific name from genus, species, and possibly strain or whatever. binomial() isn't part of the NCBI taxonomy definition, so you have freedom there to report what suits you. -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : ===========================================================
From osborne1 at optonline.net Tue Jul 18 00:52:04 2006 From: osborne1 at optonline.net (Brian Osborne) Date: Mon, 17 Jul 2006 20:52:04 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BC0226.1080605@sendu.me.uk> Message-ID: Sendu, The string "sapiens" is not what a biology textbook would call a scientific name. You're going to have to respect decades of convention and have scientific_name() return the genus and species name. Brian O. On 7/17/06 5:33 PM, "Sendu Bala" wrote: > # I plan on maintaining this; scientific_name() would give you the > non-redundant sibling-unique name 'sapiens'. binomial() on a species > rank and lower would give you 'Homo sapiens' (presumably grabbing the > 'Homo' from the parent node with rank 'genus', or similar).
From cjfields at uiuc.edu Tue Jul 18 01:36:12 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 17 Jul 2006 20:36:12 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BC0226.1080605@sendu.me.uk> References: <000001c6a9e1$c6b51610$15327e82@pyrimidine> <44BC0226.1080605@sendu.me.uk> Message-ID: <1345AB61-E7AB-447A-AB40-2170244404B2@uiuc.edu> On Jul 17, 2006, at 4:33 PM, Sendu Bala wrote: > Chris Fields wrote: >> There was some interest in getting Bio::Species to delegate to >> Bio::Taxonomy::Node, so having scientific_name() would help quite >> a bit >> since the name used on the ORGANISM line is the scientific name >> (well, is >> supposed to be; famous last words). > > Can you clarify exactly what you mean here? Preferably with an > example? > ORGANISM line of which file format?
> The reason I ask is that I still feel we need to do parsing of the > names > for species rank and lower: Sorry, should have clarified; GenBank sequence format. Here's the link: http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html The ORGANISM annotation line for a GenBank record contains the formal scientific name for the organism along with the lineage. I believe SwissProt/EMBL and several other RichSeq formats do the same. The lineage that is also present is almost always abbreviated, so it's not always possible to determine the formal rankings strictly from the file with any real degree of reliability (hence the past problems with Bio::Species). > > # The 'scientific name' for humans could be considered to be 'Homo > sapiens'. > # Taxid 9606 in the NCBI taxonomy database has rank 'species' and > ScientificName 'Homo sapiens'. > # For sanity, Bio::*Taxonomy* likes to interpret this > ScientificName as > 'sapiens' so that the genus is not held redundantly. It provides a > binomial() method to give you 'Homo sapiens' again if you want it. > # I plan on maintaining this; scientific_name() would give you the > non-redundant sibling-unique name 'sapiens'. binomial() on a species > rank and lower would give you 'Homo sapiens' (presumably grabbing the > 'Homo' from the parent node with rank 'genus', or similar). I think you should use scientific_name to designate the full formal scientific name for an organism according to the way NCBI describes it for that particular node (nothing more, except removing the <> stuff you mentioned earlier) and as it would appear for the ORGANISM line. Otherwise you'll run into serious species/subspecies/strain headaches (see below). If you want real genus/species (i.e. nothing extra, like strains or subspecies), separate them out and store them using a genus/species get/set if possible; binomial() will then give back the two-name genus/species designation. Here are a couple of examples (this is in XML, using EUtilities).
These were retrieved by NCBI TaxID, using ELink from a list of protein GIs (~700 of them in total), so they represent the actual NCBI TaxID linked with the sequence file. If you try breaking these apart into species, what happens to the strain/subspecies stuff? Notice that many of these nodes, which come directly from protein GIs, also have no rank. (The XML tags were lost in the archive; the fields below are listed in the order they appeared.)
...
TaxId 376686; ScientificName: Flavobacterium johnsoniae UW101; other names: Flavobacterium johnsoniae NBRC 14942, Flavobacterium johnsoniae IFO 14942, Flavobacterium johnsoniae IAM 14304, Flavobacterium johnsoniae MYX.1.1.1, Flavobacterium johnsoniae NCIB 11054, Flavobacterium johnsoniae DSM 2064, Flavobacterium johnsoniae LMG 1341, Flavobacterium johnsoniae ATCC 17061, Flavobacterium johnsoniae strain UW101, Flavobacterium johnsoniae str. UW101; parent TaxId 986; rank: no rank; division: Bacteria
...
TaxId 370552; ScientificName: Streptococcus pyogenes MGAS10270; other names: Streptococcus pyogenes strain MGAS10270, Streptococcus pyogenes str. MGAS10270; parent TaxId 301448; rank: no rank; division: Bacteria
...
TaxId 224308; ScientificName: Bacillus subtilis subsp. subtilis str. 168; other names: Bacillus subtilis subsp. subtilis 168; parent TaxId 135461; rank: no rank; division: Bacteria
> Good, bad or ugly? I would prefer it works like this and we agree to > differ with NCBI on what the 'scientific name' of a species node > should > be. Bio::Species can still delegate to Bio::Taxonomy::Node by calling > binomial() (which I propose will actually give the correct answer, > even > for bacteria and viruses). This is where I would strongly disagree (though I agree that the way NCBI uses 'scientific name' is a bit off). We are using the NCBI tax database, and as such we are somewhat at the mercy of the NCBI tax nomenclature, unfortunately. If NCBI decides to change their official definition for the scientific name to something that made a bit more sense, the XML and dump data will reflect that and we won't have many problems adapting since the scientific name will always conform to their definition. But if we split the information up ad hoc then we are bound for disaster; it's just way too much headache to worry about.
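To make the round-trip problem concrete, here is a minimal sketch. naive_split() is a hypothetical helper written for this illustration (it is not how Bio::Species parses names); it simply takes the first two whitespace-separated words as genus and species, then tries to rebuild the name from them, using the ScientificName values quoted above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: treat the first two words as genus and species.
# This is deliberately naive; it is NOT the Bio::Species parsing logic.
sub naive_split {
    my ($name) = @_;
    my ($genus, $species) = split ' ', $name;
    return ($genus, $species);
}

# ScientificName values as NCBI reports them for the nodes quoted above.
my @ncbi_names = (
    'Homo sapiens',
    'Flavobacterium johnsoniae UW101',
    'Streptococcus pyogenes MGAS10270',
    'Bacillus subtilis subsp. subtilis str. 168',
);

for my $name (@ncbi_names) {
    my ($genus, $species) = naive_split($name);
    my $rebuilt = "$genus $species";
    my $status  = $rebuilt eq $name ? 'round-trips' : 'loses information';
    printf "%-45s -> %-28s (%s)\n", $name, $rebuilt, $status;
}
```

Only the simple species name survives the split-and-rejoin; the strain and subspecies parts of the other names are dropped, which is the argument for storing the NCBI string verbatim.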
We could always point to the official NCBI definition as the one we adopt and then assign the tagged information from the node directly to scientific_name (no globbing together at all). Bio::Species could delegate likewise for the ORGANISM line, so there's no piecemeal attempts to get Humpty Dumpty to fit back together again. You could go through and get the lineage from the XML/dump file data and try to sort the genus/species out, then paste it all back together (fingers crossed!), but I think it's more headache than it's worth to split these up, then hope that you can paste them back together again and always expect to get the same results. Chris > Perhaps the short-hand (and the classifier used in name()) shouldn't > mention the word 'scientific' to avoid confusion? But a) what else > would > we call it?, and b) for all ranks above species it /is/ the > scientific name. Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign
From cjfields at uiuc.edu Tue Jul 18 01:55:28 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 17 Jul 2006 20:55:28 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: References: Message-ID: <07041A82-BE01-4998-AF26-35212CA1F0F2@uiuc.edu> I agree with Hilmar's assessment, not b/c I disagree with your definition of scientific name or the reasoning Sendu proposes. I think we are somewhat bound to NCBI's nomenclature for their tax database. If we veer away from NCBI's definition for 'scientific name' it will just confuse users and lead to more trouble than it's worth, frankly. If we stick with it then any changes NCBI makes should be easier to deal with.
Leaving the scientific_name as NCBI designates it, though it probably disagrees with ~99% of the world's textbooks, may be the most maintainable solution. Now, binomial() on the other hand... Chris Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign
From hlapp at gmx.net Tue Jul 18 02:06:01 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Mon, 17 Jul 2006 22:06:01 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <07041A82-BE01-4998-AF26-35212CA1F0F2@uiuc.edu> References: <07041A82-BE01-4998-AF26-35212CA1F0F2@uiuc.edu> Message-ID: On Jul 17, 2006, at 9:55 PM, Chris Fields wrote: > Leaving the scientific_name as NCBI designates it, though it probably > disagrees with ~99% of the world's textbooks, may be the most > maintainable solution. It doesn't disagree, it's quite like what the world's textbooks give you as a 'scientific name'.
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : ===========================================================
From cjfields at uiuc.edu Tue Jul 18 04:24:50 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 17 Jul 2006 23:24:50 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: References: <07041A82-BE01-4998-AF26-35212CA1F0F2@uiuc.edu> Message-ID: <7BCA093B-90FB-4B0A-91FD-A6E0B34C96DD@uiuc.edu> If you mean genus-species, then yes. But parent nodes? If you trust Wikipedia, the scientific name == binomial nomenclature. Which could mean no subspecies, strains, etc if one were to be really strict about it, though that may be a grey area; I'm no taxonomist. http://en.wikipedia.org/wiki/Scientific_name The parent nodes shouldn't have a scientific name if one were to adhere strictly to the standard definition above, but NCBI refers to the names for the parent nodes as 'scientific name' (the XML element is still ScientificName, just like the child node). I'm not sure what the tax dump file is, though, so that may be different. Here's the lineage for Taxid 312284 (marine actinobacterium PHSC20C1). I cut out the irrelevant bits and just show the lineage with all the parent nodes, as taxID, name, rank:
131567, cellular organisms, no rank
2, Bacteria, superkingdom
201174, Actinobacteria, phylum
1760, Actinobacteria (class), class
52018, unclassified Actinobacteria, no rank
78537, unclassified Actinobacteria (miscellaneous), no rank
....
Seems to me the easiest thing to do here, when looking at a particular node, is to use scientific_name() to hold that particular element for the node and have binomial represent the true 'scientific name', much as Sendu proposed.
It would also make life much easier when parsing GenBank/SwissProt/EMBL (SeqIO) to have the data designating the formal scientific name (according to NCBI) be assigned to a scientific_name() get/set method in Bio::Species for later writing; then if we want to delegate this over to Bio::Taxonomy::Node from Bio::Species it would be that much easier. This would also get around some of the problems I have been seeing with bacterial names when passing GenBank data through SeqIO, since you wouldn't be required to glop the name together from the way Bio::Species tried to guess the lineage. Chris Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign
From bix at sendu.me.uk Tue Jul 18 07:27:49 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Tue, 18 Jul 2006 08:27:49 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <64AB2D48-B342-4FC4-BB4E-9927B24C972C@gmx.net> References: <000001c6a9e1$c6b51610$15327e82@pyrimidine> <44BC0226.1080605@sendu.me.uk> <64AB2D48-B342-4FC4-BB4E-9927B24C972C@gmx.net> Message-ID: <44BC8D75.1080806@sendu.me.uk> Hilmar Lapp wrote: > I don't think we should differ from NCBI in places where the > connection between a method name and the NCBI data file is obvious or > otherwise we will confuse people and send them into traps. > > $node->scientific_name() should simply report what NCBI reports. For > simple species this will be identical to what $node->binomial() > returns, but for others it may not, e.g., strains, varieties, etc or > the weird world of viri and bacteria. Ok, well this certainly seems to be consensus so I'll abide. > This will also absolve us from retaining the business logic for how > to construct the scientific name from genus, species, and possibly > strain or whatever. What about the existing genus(), species(), sub_species() and variant() methods? There would be no need for any logic to join things together, but I would still like to be able to get just 'sapiens' from somewhere. Can I use species() for that purpose (though again, species is strictly 'Homo sapiens')? Likewise sub_species() and variant() could hold the remaining non-redundant names. Or should all of these be deprecated because they don't really have a place in a generic Node class? What about node_name()? Yet another synonym of scientific_name? (right now it grabs the common name(s)). Ugh. What should I do with the classification array? Should it hold the raw ScientificName like: join(',', $node->classification) eq 'Homo sapiens, Homo, Homo/Pan/Gorilla group [...]'?
Or should it be like: join(',', $node->classification) eq 'sapiens, Homo, Homo/Pan/Gorilla group [...]'? The latter is how it currently works (when it works correctly); I would rather fix it than lose the logic completely, but if we're staying true to proper classification (vs. what a programmer might expect), I guess I must use the raw ScientificName? > binomial() isn't part of the NCBI taxonomy definition, so you have > freedom there to report what suits you. I don't think binomial() would serve any useful purpose now, however. I can either deprecate it or make it a synonym of scientific_name() or both. Or binomial() can be a version of scientific_name() that complains if you use it on a rank higher or lower than species. As for species() et al., it may have no place in a generic Node class. Thoughts? From bix at sendu.me.uk Tue Jul 18 08:43:43 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Tue, 18 Jul 2006 09:43:43 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BBBB69.6000906@sendu.me.uk> References: <44BBBB69.6000906@sendu.me.uk> Message-ID: <44BC9F3F.2040500@sendu.me.uk> Sendu Bala wrote: [snip proposed changes to Bio::DB::Taxonomy::* and Bio::Taxonomy::Node] > If anyone can see a problem with any of these changes, let me know asap. I've just realised that there are currently no tests for Bio::DB::Taxonomy::flatfile, and that the ones for entrez get skipped. Node doesn't get an especially thorough work-out either (in the skipped section). I'm guessing it's not feasible to include the full taxdump from NCBI (~40MB) in t/data... do people think it would be reasonable to create some sort of small subset of the data? I could just pull out the lines from names.dmp and nodes.dmp relevant to a few example organisms. Say, for human and a tricky bacteria and virus? For the purposes of running the test, where should the index files be kept? In t/data with the .dmp files or in /tmp? Should the test script delete them afterwards, or leave them be? 
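One way the small-subset idea could look, as a rough sketch only: subset_dump() is a hypothetical helper, the three sample lines are made up for the demonstration, and the code assumes the taxdump layout in which fields are separated by tab, pipe, tab and each line ends with tab, pipe. Writing into a File::Temp directory with CLEANUP set is one possible answer to the question of where generated files should live and whether to delete them.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;

# Hypothetical helper: copy only the dump lines whose first column
# (the tax_id) is a key of %$keep. Returns the number of lines kept.
sub subset_dump {
    my ($in_path, $out_path, $keep) = @_;
    open my $in,  '<', $in_path  or die "Cannot read $in_path: $!";
    open my $out, '>', $out_path or die "Cannot write $out_path: $!";
    my $kept = 0;
    while (my $line = <$in>) {
        my ($tax_id) = $line =~ /^(\d+)\t\|/;
        next unless defined $tax_id && $keep->{$tax_id};
        print {$out} $line;
        $kept++;
    }
    close $in;
    close $out or die "Cannot close $out_path: $!";
    return $kept;
}

# Scratch directory that is removed automatically when the script exits.
my $dir    = tempdir(CLEANUP => 1);
my $full   = File::Spec->catfile($dir, 'names.dmp');
my $subset = File::Spec->catfile($dir, 'names_subset.dmp');

# Made-up sample standing in for the real (~40MB) names.dmp.
open my $fh, '>', $full or die "Cannot write $full: $!";
print {$fh} join("\t|\t", 9606, 'Homo sapiens', '', 'scientific name'), "\t|\n";
print {$fh} join("\t|\t", 9606, 'human', '', 'genbank common name'), "\t|\n";
print {$fh} join("\t|\t", 2, 'Bacteria', '', 'scientific name'), "\t|\n";
close $fh or die "Cannot close $full: $!";

my $kept = subset_dump($full, $subset, { 9606 => 1 });
print "kept $kept of 3 name lines in $subset\n";
```

For nodes.dmp the keep set would also need the taxids of every ancestor of the chosen organisms, otherwise the lineage cannot be walked back to the root.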
The entrez tests are skipped to 'avoid blocking', but the test only makes 2 entrez queries with a sleep(3) in-between. Basically, I don't think there's ever any reason to skip. Shall I remove the skip? Lots of other database-accessing tests in the test suite just go right ahead and access their database, no problem. Cheers, Sendu. From torsten.seemann at infotech.monash.edu.au Tue Jul 18 03:53:02 2006 From: torsten.seemann at infotech.monash.edu.au (Torsten Seemann) Date: Tue, 18 Jul 2006 13:53:02 +1000 Subject: [Bioperl-l] advice In-Reply-To: <000a01c6a9bb$6478ba90$15327e82@pyrimidine> References: <000a01c6a9bb$6478ba90$15327e82@pyrimidine> Message-ID: <44BC5B1E.5080600@infotech.monash.edu.au> > Ha ! I *almost* added something about that. I thought his vowel keys were > broken for a bit, maybe from pounding the keyboard with extreme frustration! The wide variety of pronunciation of English around the world can be mostly blamed on those damned vowels... so perhaps removing them helps one to reach a wider audience :-) > As an aside, doesn't Damian Conway say something about the non-use of vowels > in 'Perl Best Practices?' I think it was in relation to variables, > though... Yeah, on page 46 he says NOT to remove vowels in variable names, use prefixes instead. Apprntly rmvng vwls mks t hrdr t ndrstnd stff. (Actually, I studied at Monash University under Damian Conway, and recall his ridiculing of Perl, so I found it kind of ironic that he ended up changing the Perl landscape so significantly! 
He even wrote an internal publication "theStyle - a guide to C programming style" in about 1990 in which he violates some of his later Perl Best Practices :-) -- Dr Torsten Seemann http://www.vicbioinformatics.com Victorian Bioinformatics Consortium, Monash University, Australia From sharma.animesh at gmail.com Tue Jul 18 07:58:41 2006 From: sharma.animesh at gmail.com (Animesh Sharma) Date: Tue, 18 Jul 2006 13:28:41 +0530 Subject: [Bioperl-l] PDB file parser (Separates chain-sequence and chain-structure) Message-ID: <156674e60607180058r653fa8fesbc654508c9c19b5b@mail.gmail.com> Hi Chris, I have written a small script to separate the chains in a PDB file. It stores the sequence (fasta format) and structure (pdb format) of each chain in separate files, with the chain ID as the middle part of the file name. If the PDB file has only one chain, it creates a file with default as middle name. E.g., perl pdb_chain_extract.pl 1HCO.pdb will create 4 files with names: 1HCO.A.fas (Sequence of Chain A in fasta format) 1HCO.A.pdb (Structure of Chain A in pdb format) 1HCO.B.fas (Sequence of Chain B in fasta format) 1HCO.B.pdb (Structure of Chain B in pdb format). I wrote it in the spirit of your example script given @ http://cvs.bioperl.org/cgi-bin/viewcvs/viewcvs.cgi/bioperl-live/examples/structure/structure-io.pl?rev=1.2&content-type=text/vnd.viewcvs-markup Can this be included in the example scripts too? Thanks and regards, Animesh -- ______________________"The Answer Lies in Genome"______________________ http://fuzzylife.org/animesh/ +919868580004 -------------- next part -------------- A non-text attachment was scrubbed... 
Name: pdb_chain_extract.pl Type: application/octet-stream Size: 2593 bytes Desc: not available URL: From bix at sendu.me.uk Tue Jul 18 13:20:34 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Tue, 18 Jul 2006 14:20:34 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BCAE08.8070307@ebi.ac.uk> References: <44BCAE08.8070307@ebi.ac.uk> Message-ID: <44BCE022.5000502@sendu.me.uk> I thought I'd post this here in case anyone wants to discuss the points Nadeem brings up. As far as I can see it is acceptable to remove the <> bits so I still plan to do so. Nadeem Faruque wrote: [off-list, posted here with permission] > In case you didn't realise, odd node names such as 'Gnathostomata > ' are created to uniquify some tax nodes that have identical > scientific names, eg there are 8 entries for Rhodotorula. > > When we parse the ncbi tax dump we store this column as UNIQUE_NAME but > I don't think that we actually use it for anything within EMBL > nucleotide sequence bank. [...] > Also, I note that there are 548 non-unique NAME_TXT of class 'scientific > name', so the UNIQUE_NAME column may be of use to someone (though given > the strength of using a taxid directly I don't see why you'd want to). Indeed. And given that we are building a taxonomy with nodes, it doesn't matter that two different nodes in the entire taxonomy tree share the same name - the position in the tree is implicitly unique. So if you find yourself with a node called 'Rhodotorula' you can find out which one it is by looking at the closest ranked parent. That said, for 'Rhodotorula ' the closest ranked parent is 'Sporidiobolales' and not 'Sporidiobolaceae'. Is that a problem? Do we need to care about this word 'Sporidiobolaceae' that is effectively just a synonym of 'Sporidiobolales'? [Nadeem later replied "...I can't imagine the <> value to be of any use.". 
He also clarified that if species have identical names and you store those, you can't work out what the corresponding taxid is. Without the <> bit you need some other information, like the classification. I think this other information will be present in input file formats and it must be up to the user to store the extra when outputting from bioperl] From osborne1 at optonline.net Tue Jul 18 14:50:48 2006 From: osborne1 at optonline.net (Brian Osborne) Date: Tue, 18 Jul 2006 10:50:48 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BC9F3F.2040500@sendu.me.uk> Message-ID: Sendu, The idea to create mini *dmp files is a good one, I think. With respect to temporary files I'm fairly sure that most tests that use them create them some where in t/data and then delete them after. Brian O. On 7/18/06 4:43 AM, "Sendu Bala" wrote: > (~40MB) in t/data... do people think it would be reasonable to create > some sort of small subset of the data? I could just pull out the lines > from names.dmp and nodes.dmp relevant to a few example organisms. Say, > for human and a tricky bacteria and virus? > For the purposes of running the test, where should the index files be > kept? In t/data with the .dmp files or in /tmp? Should the test script > delete them afterwards, or leave them be? From cjfields at uiuc.edu Tue Jul 18 15:44:07 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 18 Jul 2006 10:44:07 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BC8D75.1080806@sendu.me.uk> Message-ID: <003201c6aa81$01db9a30$15327e82@pyrimidine> > What about the existing genus(), species(), sub_species() and variant() > methods? There would be no need for any logic to join things together, > but I would still like to be able to get just 'sapiens' from somewhere. > Can I use species() for that purpose (though again, species is strictly > 'Homo sapiens')? Likewise sub_species() and variant() could hold the > remaining non-redundant names. 
Or should all of these be deprecated > because they don't really have a place in a generic Node class? This is where Hilmar suggests that you have a bit of freedom in doing what you want, as with binomial(). So species() should return species ('sapiens'), genus return genus, etc. At that level there will need to be some additional data munging since the ranks below species seem to include the entire name, not just the species. But this could be done from the lineage if all nodes are present and tagged as such. > What about node_name()? Yet another synonym of scientific_name? (right > now it grabs the common name(s)). Ugh. I agree things need cleaning up. You could always make node_name() an alias for scientific_name() though it could just be deprecated. > What should I do with the classification array? Should it hold the raw > ScientificName like: > join(',', $node->classification) eq 'Homo sapiens, Homo, > Homo/Pan/Gorilla group [...]'? > Or should it be like: > join(',', $node->classification) eq 'sapiens, Homo, Homo/Pan/Gorilla > group [...]'? Don't know what the dump file gives; the XML output using efetch via entrez has the raw lineage (as appears in a GenBank sequence file) and the actual full lineage with TaxID, rank, 'scientific name,' in the actual lineage order. I think one problem area will be the 'no rank' designations in the lineage. Note that the below example also has a species and no genus; tricky! 312284 marine actinobacterium PHSC20C1 marine actinobacterium strain PHSC20C1 marine actinobacterium str. PHSC20C1 78537 species Bacteria ... 
Raw lineage: cellular organisms; Bacteria; Actinobacteria; Actinobacteria (class); unclassified Actinobacteria; unclassified Actinobacteria (miscellaneous). Full lineage (TaxID, scientific name, rank): 131567, cellular organisms, no rank; 2, Bacteria, superkingdom; 201174, Actinobacteria, phylum; 1760, Actinobacteria (class), class; 52018, unclassified Actinobacteria, no rank; 78537, unclassified Actinobacteria (miscellaneous), no rank. > The latter is how it currently works (when it works correctly); I would > rather fix it than lose the logic completely, but if we're staying true > to proper classification (vs. what a programmer might expect), I guess I > must use the raw ScientificName? > > > binomial() isn't part of the NCBI taxonomy definition, so you have > > freedom there to report what suits you. > > I don't think binomial() would serve any useful purpose now, however. I > can either deprecate it or make it a synonym of scientific_name() or > both. Or binomial() can be a version of scientific_name() that complains > if you use it on a rank higher or lower than species. As for species() > et al., it may have no place in a generic Node class. Thoughts? The use of scientific_name() in this context would be more to conform with what NCBI defines it as rather than as the actual definition; this should be explicitly stated as such in POD and is more for long-term maintainability. No matter what is done here, you will have some degree of confusion: those who want strict adherence to the term 'scientific name' and those who want the method to conform to NCBI's definition. Better to document the reasoning for it in some way than risk the random masses complaining. We could use binomial() for the 'scientific name' as the rest of the world knows it (as in binomial nomenclature), having it built from genus-species like you had originally suggested. That's what Hilmar suggested as an 'experimental' area of sorts, since NCBI doesn't use that particular term in its taxonomy definition. 
Chris From cjfields at uiuc.edu Tue Jul 18 15:48:36 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 18 Jul 2006 10:48:36 -0500 Subject: [Bioperl-l] advice In-Reply-To: <44BC5B1E.5080600@infotech.monash.edu.au> Message-ID: <003301c6aa81$a34fd8e0$15327e82@pyrimidine> Guess Dr. Conway became a Perl convert. The reviews of the book state that the 'best practices' really come from his experience as a Perl programmer over the last couple of decades, so maybe he learned something since 1990. Chris > > Ha ! I *almost* added something about that. I thought his vowel keys > were > > broken for a bit, maybe from pounding the keyboard with extreme > frustration! > > The wide variety of pronunciation of English around the world can be > mostly blamed on those damned vowels... so perhaps removing them helps > one to reach a wider audience :-) > > > As an aside, doesn't Damian Conway say something about the non-use of > vowels > > in 'Perl Best Practices?' I think it was in relation to variables, > > though... > > Yeah, on page 46 he says NOT to remove vowels in variable names, use > prefixes instead. Apprntly rmvng vwls mks t hrdr t ndrstnd stff. > > (Actually, I studied at Monash University under Damian Conway, and > recall his ridiculing of Perl, so I found it kind of ironic that he > ended up changing the Perl landscape so significantly! 
He even wrote an > internal publication "theStyle - a guide to C programming style" in > about 1990 in which he violates some of his later Perl Best Practices :-) > > -- > Dr Torsten Seemann http://www.vicbioinformatics.com > Victorian Bioinformatics Consortium, Monash University, Australia > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Tue Jul 18 16:05:48 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 18 Jul 2006 11:05:48 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BC9F3F.2040500@sendu.me.uk> Message-ID: <003401c6aa84$08ff6c80$15327e82@pyrimidine> > I've just realised that there are currently no tests for > Bio::DB::Taxonomy::flatfile, and that the ones for entrez get skipped. > Node doesn't get an especially thorough work-out either (in the skipped > section). > > I'm guessing it's not feasible to include the full taxdump from NCBI > (~40MB) in t/data... do people think it would be reasonable to create > some sort of small subset of the data? I could just pull out the lines > from names.dmp and nodes.dmp relevant to a few example organisms. Say, > for human and a tricky bacteria and virus? > For the purposes of running the test, where should the index files be > kept? In t/data with the .dmp files or in /tmp? Should the test script > delete them afterwards, or leave them be? I would place a small section in t/data or several individual examples in a subdirectory thereof (t/data/taxonomy). > The entrez tests are skipped to 'avoid blocking', but the test only > makes 2 entrez queries with a sleep(3) in-between. Basically, I don't > think there's ever any reason to skip. Shall I remove the skip? Lots of > other database-accessing tests in the test suite just go right ahead and > access their database, no problem. 
Depends on whether there is someone out there who doesn't have a network connection (and there always is). The DB.t tests skip based on testing for the env. variable BIOPERLDEBUG. 1..121 ok 1 # Skipping tests which require remote servers - set env variable BIOPERLDEBUG to test You could always do something along those lines or add a test for a network connection using an eval block and skip the tests if the network test fails, but there you run the risk of the tests failing not b/c of code problems but from remote server issues; I've seen this happen with SwissProt and GenBank testing before during peak hours. Chris > Cheers, > Sendu. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From bix at sendu.me.uk Tue Jul 18 17:03:54 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Tue, 18 Jul 2006 18:03:54 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <003201c6aa81$01db9a30$15327e82@pyrimidine> References: <003201c6aa81$01db9a30$15327e82@pyrimidine> Message-ID: <44BD147A.9020103@sendu.me.uk> Chris Fields wrote: >> What about the existing genus(), species(), sub_species() and variant() >> methods? There would be no need for any logic to join things together, >> but I would still like to be able to get just 'sapiens' from somewhere. >> Can I use species() for that purpose (though again, species is strictly >> 'Homo sapiens')? Likewise sub_species() and variant() could hold the >> remaining non-redundant names. Or should all of these be deprecated >> because they don't really have a place in a generic Node class? > > This is where Hilmar suggests that you have a bit of freedom in doing what > you want, as with binomial(). So species() should return species > ('sapiens'), genus return genus, etc. 
[regarding changes to Bio::Taxonomy::Node] Actually, I'm really strongly leaning toward getting rid of the following methods and new() options (and giving up entirely on being able to keep 'sapiens' somewhere): -organelle, organelle() -division, division() -sub_species, sub_species() -variant, variant() species(), validate_species_name() genus() binomial() As far as I can see none of these methods have any place in a generic Node class. If you want to know what your species is you have to be rank() 'species' and you just call scientific_name(). The above kind of methods belong in something like Bio::Species or similar, NOT in Node. Does anyone disagree? Can anyone offer a justification for keeping these methods? Changes I haven't yet discussed but have already made (but not committed): *parent_taxon_id = \&parent_id; *common_name = \&common_names; -factory and factory() removed, since there is no Bio::Taxonomy::FactoryI-implementing module, nothing in Node to make use of a factory once set, and a factory seems redundant when we're a node with a -dbh. validate_name() removed because it just returns 1. >> What about node_name()? Yet another synonym of scientific_name? (right >> now it grabs the common name(s)). Ugh. > > I agree things need cleaning up. You could always make node_name() an alias > for scientific_name() though it could just be deprecated. Actually, I've gone with node_name as the 'pure' and best method to set the name of your node with, and made scientific_name an alias of it (though it behaves as suggested earlier in the thread). >> What should I do with the classification array? Should it hold the raw >> ScientificName like: >> join(',', $node->classification) eq 'Homo sapiens, Homo, >> Homo/Pan/Gorilla group [...]'? (I've decided to do it the above way for consistency with scientific_name) >> Or should it be like: >> join(',', $node->classification) eq 'sapiens, Homo, Homo/Pan/Gorilla >> group [...]'? 
> > Don't know what the dump file gives; the XML output using efetch via entrez > has the raw lineage (as appears in a GenBank sequence file) and the actual > full lineage with TaxID, rank, 'scientific name,' in the actual lineage > order. I think one problem area will be the 'no rank' designations in the > lineage. Note that the below example also has a species and no genus; > tricky! Currently, flatfile and entrez ignore nodes with a rank of 'no rank' when they build the classification array. I had no intention of changing this behaviour. > 1760 > Actinobacteria (class) > class Ugh. I guess my proposal to remove <> bits via flatfile extends to removing () bits via entrez. We don't need unique names; we can use object_id() when uniqueness matters. >> I don't think binomial() would serve any useful purpose now, however. > > We could use binomial() for the 'scientific name' as the rest of the world > knows it (as in binomial nomenclature), having it built from genus-species > like you had originally suggested. No, see above. I don't think it makes the slightest bit of sense for a Node to go around trying to build things from a parent it may or may not have. Again, binomial() is a method for something like Bio::Species, not a generic Node class. From cjfields at uiuc.edu Tue Jul 18 19:34:29 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 18 Jul 2006 14:34:29 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BD147A.9020103@sendu.me.uk> Message-ID: <000901c6aaa1$328dd3d0$15327e82@pyrimidine> ... 
> [regarding changes to Bio::Taxonomy::Node] > > Actually, I'm really strongly leaning toward getting rid of the > following methods and new() options (and giving up entirely on being > able to keep 'sapiens' somewhere): > > -organelle, organelle() > -division, division() > -sub_species, sub_species() > -variant, variant() > species(), validate_species_name() > genus() > binomial() > > As far as I can see none of these methods have any place in a generic > Node class. If you want to know what your species is you have to be > rank() 'species' and you just call scientific_name(). The above kind of > methods belong in something like Bio::Species or similar, NOT in Node. > Does anyone disagree? Can anyone offer a justification for keeping these > methods? Bio::Species and Bio::Taxonomy::Node are closely linked and plans are to have Bio::Species delegate methods to Bio::Taxonomy::Node. So any changes to Node will affect Bio::Species to some degree. If you can get the lineage from XML, you could set many of these based on the rank given. Jason uses XML::Twig in Bio::DB::Taxonomy::entrez to parse out the XML data into Bio::Taxonomy::Node objects; it shouldn't be difficult to leave some methods based on rank (genus, species, etc) as simple get/set methods for the time being and leave the heavy lifting to the modules dealing directly with the data. Bio::Species could then delegate data/methods over to Bio::Taxonomy::Node fairly easily. If there is no genus/species data to be grabbed (either it doesn't exist or isn't present for some reason), then simply leave it as undef. That's also why I thought binomial() could stick around; if you have both the genus() and species() you could grab both using binomial(), building in special cases or error handling in case genus() or species() or both return undef. I don't see the problem in keeping this as long as users know what it means: by detailing the method in POD. If someone complains we tell them to RTFM. 
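[Editorial sketch, not part of the original message.] The behaviour described above for binomial(), simple get/set accessors for genus and species with undef returned when either part is missing, might be sketched as follows. The package name My::TaxonSketch is hypothetical and stands in for neither Bio::Taxonomy::Node nor Bio::Species; only core Perl is used.

```perl
use strict;
use warnings;

# Hypothetical stand-in class; NOT the real Bio::Taxonomy::Node or
# Bio::Species API.
package My::TaxonSketch;

sub new {
    my ($class, %args) = @_;
    my $self = { genus => $args{genus}, species => $args{species} };
    return bless $self, $class;
}

# Plain get/set accessor: stores a value if one is passed, always returns
# the current value (undef if never set).
sub genus {
    my $self = shift;
    $self->{genus} = shift if @_;
    return $self->{genus};
}

sub species {
    my $self = shift;
    $self->{species} = shift if @_;
    return $self->{species};
}

# Build "Genus species" only when both parts are known; otherwise return
# nothing (undef in scalar context) rather than a partial name.
sub binomial {
    my $self = shift;
    my ($genus, $species) = ($self->genus, $self->species);
    return unless defined($genus) && defined($species);
    return "$genus $species";
}

package main;
```

Returning undef instead of a partial string leaves the decision about special cases (no genus, unranked nodes) to the caller, which is the kind of behaviour that would need to be spelled out in the POD.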
> Changes I haven't yet discussed but have already made (but not committed): > > *parent_taxon_id = \&parent_id; > *common_name = \&common_names; > -factory and factory() removed, since there is no > Bio::Taxonomy::FactoryI-implementing module, nothing in Node to make use > of a factory once set, and a factory seems redundant when we're a node > with a -dbh. > validate_name() removed because it just returns 1. > ... > Actually, I've gone with node_name as the 'pure' and best method to set > the name of your node with, and made scientific_name an alias of it > (though it behaves as suggested earlier in the thread). I don't have any problem with that. As long as it conforms somewhat to the NCBI definition to prevent confusion I think it's okay. > >> What should I do with the classification array? Should it hold the raw > >> ScientificName like: > >> join(',', $node->classification) eq 'Homo sapiens, Homo, > >> Homo/Pan/Gorilla group [...]'? > > (I've decided to do it the above way for consistency with scientific_name) I think that's fine. ... > Currently, flatfile and entrez ignore nodes with a rank of 'no rank' > when they build the classification array. I had no intention of changing > this behaviour. If you ignore nodes with 'no rank' there will be major problems when retrieving certain TaxID's from protein/nucleotide sequences. I had posted some sample XML from many NCBI TaxIDs taken from sequence files and via ELink and a good many of those nodes (most of them from genome projects) have 'no rank'. 376686 Flavobacterium johnsoniae UW101 ... 986 no rank ... 373903 Halothermothrix orenii H 168 ... 31909 no rank These aren't 'edge cases' anymore but now are pretty common from genome sequencing. I would just assign 'no rank' to rank() and have the node retained for DB purposes. It seems that the tax dump loses quite a bit of information somewhere along the way that shows up in the XML. Or am I wrong? > > 1760 > > Actinobacteria (class) > > class > > Ugh. 
I guess my proposal to remove <> bits via flatfile extends to > removing () bits via entrez. We don't need unique names; we can use > object_id() when uniqueness matters. The XML parsing in Taxonomy::entrez will take care of the and retains the character data in between. It would be a matter of setting the parser correctly to grab the relevant data and assign it properly. > >> I don't think binomial() would serve any useful purpose now, however. > > > > We could use binomial() for the 'scientific name' as the rest of the > world > > knows it (as in binomial nomenclature), having it built from genus- > species > > like you had originally suggested. > > No, see above. I don't think it makes the slightest bit of sense for a > Node to go around trying to build things from a parent it may or may not > have. Again, binomial() is a method for something like Bio::Species, not > a generic Node class. Bio::Species, from what I gather, was initially created to hold the tax data from GenBank/EMBL/SwissProt (RichSeq) files and is not DB-aware. Bio::Taxonomy::Node was supposed to be like Bio::Species and also be DB-aware: http://thread.gmane.org/gmane.comp.lang.perl.bio.general/4284/focus=4321 Again, Bio::Species methods are supposed to (eventually) delegate to Bio::Taxonomy::Node, so the two are closely linked along with their methods. Any way we go about it here (keeping certain methods and tossing others, changing the data returned, etc), it looks like there will be API issues down the road which will directly affect anyone using tax data. That affects bioperl-db directly as well as any other bioperl-based DB's which rely on tax data. So we need to tread a bit carefully when making major changes to make sure that they work for bioperl-db and anywhere else that may require it. 
Chris From cjfields at uiuc.edu Tue Jul 18 19:41:31 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 18 Jul 2006 14:41:31 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BD147A.9020103@sendu.me.uk> Message-ID: <000a01c6aaa2$2b4f50c0$15327e82@pyrimidine> Sendu et al, I'll play around with adding a quick method to Bio::Species for scientific_name(); if I can get it to play nice with Bio::SeqIO::genbank and it passes tests I'll commit it. Chris From golharam at umdnj.edu Tue Jul 18 19:36:54 2006 From: golharam at umdnj.edu (Ryan Golhar) Date: Tue, 18 Jul 2006 15:36:54 -0400 Subject: [Bioperl-l] advice In-Reply-To: <003301c6aa81$a34fd8e0$15327e82@pyrimidine> Message-ID: <00a501c6aaa1$86edb620$2f01a8c0@GOLHARMOBILE1> Right. There was a chain letter going around the internet for awhile about how you can leave out certain letters and the human brain will still be able to correctly interpret what the word is supposed to be. Either that or it was something about how Europe was adopting a new variation of English and after many successions it started to sound/look like German. > The wide variety of pronunciation of English around the world can be > mostly blamed on those damned vowels... so perhaps removing them helps > one to reach a wider audience :-) > > > As an aside, doesn't Damian Conway say something about the non-use > > of > vowels > > in 'Perl Best Practices?' I think it was in relation to variables, > > though... > > Yeah, on page 46 he says NOT to remove vowels in variable names, use > prefixes instead. Apprntly rmvng vwls mks t hrdr t ndrstnd stff. 
From cjfields at uiuc.edu Tue Jul 18 21:44:29 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 18 Jul 2006 16:44:29 -0500 Subject: [Bioperl-l] Bio::SeqIO::genbank and Bio::Species Message-ID: <000001c6aab3$58ee7bd0$15327e82@pyrimidine> For a given GenBank file, you'll have the following (this is from NCBI's current flatfile format, http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html): LOCUS SCU49845 5028 bp DNA PLN 21-JUN-1999 DEFINITION Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p (AXL2) and Rev7p (REV7) genes, complete cds. ACCESSION U49845 VERSION U49845.1 GI:1293613 KEYWORDS . SOURCE Saccharomyces cerevisiae (baker's yeast) ORGANISM Saccharomyces cerevisiae Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes; Saccharomycetales; Saccharomycetaceae; Saccharomyces. ... The SOURCE line above, according to NCBI, contains an abbreviated name and a common name (optional); it can also apparently contain additional information, such as organelles and so on. The ORGANISM line contains NCBI's definition of the formal scientific name (see the related thread on Taxonomy proposed changes) along with lineage information Currently, Bio::SeqIO::genbank and Bio::Species are very inconsistent with bacterial names, so when I process everything through SeqIO I get: SOURCE Mycobacterium tuberculosis H37Rv H37Rv ORGANISM Mycobacterium tuberculosis SOURCE Mycobacterium tuberculosis CDC1551 CDC1551 ORGANISM Mycobacterium tuberculosis SOURCE Mycobacterium avium subsp. paratuberculosis K-10 paratuberculosis K-10 ORGANISM Mycobacterium avium subsp. SOURCE Bacillus sp. NRRL B-14911 NRRL B-14911 ORGANISM Bacillus sp. I have added a scientific_name() method to Bio::Species to contain the string on the ORGANISM line and replace it as is, which seems to work well (doesn't chop the name down). The bigger issue is the mess with the SOURCE line. 
This stems from adding back information from sub_species(), which I don't think needs to be done as it's supposed to be an abbreviated name. Anybody mind if I try splitting up the original SOURCE line data into organelle(), abbreviated_name(), and common_name()? This will change common_name a bit (so, instead of 'Saccharomyces cerevisiae' it will give 'baker's yeast') but will also conform more to the NCBI definition of 'common name.' Also, organelle info isn't handled yet; I could toy with adding support for it. Any objections? I may proceed to do the same with EMBL, SwissProt, and others that use Bio::Species if this works out. Christopher Fields Postdoctoral Researcher - Switzer Lab Dept. of Biochemistry University of Illinois Urbana-Champaign From bix at sendu.me.uk Tue Jul 18 22:50:37 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Tue, 18 Jul 2006 23:50:37 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <000901c6aaa1$328dd3d0$15327e82@pyrimidine> References: <000901c6aaa1$328dd3d0$15327e82@pyrimidine> Message-ID: <44BD65BD.4030501@sendu.me.uk> Chris Fields wrote: > ... >> [regarding changes to Bio::Taxonomy::Node] >> >> Actually, I'm really strongly leaning toward getting rid of the >> following methods and new() options (and giving up entirely on being >> able to keep 'sapiens' somewhere): >> >> -organelle, organelle() >> -division, division() >> -sub_species, sub_species() >> -variant, variant() >> species(), validate_species_name() >> genus() >> binomial() > > Bio::Species and Bio::Taxonomy::Node are closely linked and plans are to > have Bio::Species delegate methods to Bio::Taxonomy::Node. So any changes > to Node will affect Bio::Species to some degree. I see from the original postings that Node was intended to be like Species, but I don't think it makes the slightest bit of sense. A /single/ Node need only (must only!) represent the information for a single node in the taxonomy. Or else what do these objects mean? 
What is the object model? It's bad bad bad for it to be sensible one way (when you're making your own taxonomy by making your own nodes) and nonsensical another (when we stuff in methods so that Bio::Species is happy). The way Node is written right now, and what you're suggesting, is that we stuff the entire Taxonomy into the Node. Well, except that you don't even have methods for every taxonomic level - there is genus() but no subphylum(). I can't emphasise strongly enough how insane all this is. The correct thing for Bio::Species to interact with is Bio::Taxonomy. Bio::Taxonomy is a collection of Nodes and has the sort of methods that Bio::Species would need to delegate its current functionality. I'm quite willing to do a proper overhaul here so everything makes sense. You either make your own nodes and add these to a Taxonomy or use a factory (which would use a Bio::DB::Taxonomy presumably). A Taxonomy lets you discover the classification of any node it contains. Bio::Species could implement a method like genus() by: $node = $taxonomy->get_node('genus') || return; return $node->scientific_name; Bio::Taxonomy isn't perfect, but I can certainly get it to do its job. I'd probably make it rank-name and order independent for starters. Bio::Taxonomy::Node needs to be reduced right down to just hold data about the node it represents, and possibly its parent node id (or other way of getting to its parent). So now I'm proposing dropping the classification() method from Node as well. It's simply not necessary; Bio::Taxonomy should give you that information. Bio::Taxonomy::FactoryI doesn't make much sense to me at the moment from its docs, but it could be used to build a Taxonomy (that seems to be its intent, I'm just not sure what some of the methods are really supposed to do) such that Node might not even need any methods for getting its parent or child nodes. The Factory or Taxonomy might be able to deal with that. 
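[Editorial sketch, not part of the original message.] The division of labour argued for above, where a node holds only its own data and a taxonomy collection answers questions about rank, can be illustrated with a small stand-alone model. Sketch::Node and Sketch::Taxonomy are hypothetical classes, not the real Bio::Taxonomy modules, and unlike the one-line example in the message, get_node here takes a starting node id as well as a rank; that signature is an assumption made for illustration.

```perl
use strict;
use warnings;

# Hypothetical node: holds data about itself and its parent id, nothing more.
package Sketch::Node;

sub new {
    my ($class, %args) = @_;
    my $self = { %args };
    return bless $self, $class;
}
sub id              { return $_[0]{id} }
sub parent_id       { return $_[0]{parent_id} }
sub rank            { return $_[0]{rank} }
sub scientific_name { return $_[0]{scientific_name} }

# Hypothetical taxonomy: a collection of nodes indexed by id.
package Sketch::Taxonomy;

sub new { my $class = shift; return bless { nodes => {} }, $class }

sub add_node {
    my ($self, $node) = @_;
    $self->{nodes}{ $node->id } = $node;
    return $node;
}

# Walk from the starting node towards the root and return the first node
# with the requested rank, or nothing if there is none.
sub get_node {
    my ($self, $start_id, $rank) = @_;
    my %seen;    # guards against cycles, e.g. a root that is its own parent
    my $node = $self->{nodes}{$start_id};
    while ($node && !$seen{ $node->id }++) {
        return $node if defined $node->rank && $node->rank eq $rank;
        my $parent_id = $node->parent_id;
        $node = defined $parent_id ? $self->{nodes}{$parent_id} : undef;
    }
    return;
}

# What a Species-like wrapper could delegate genus() to.
sub genus_name {
    my ($self, $start_id) = @_;
    my $node = $self->get_node($start_id, 'genus') or return;
    return $node->scientific_name;
}

package main;

my $taxonomy = Sketch::Taxonomy->new;
$taxonomy->add_node(Sketch::Node->new(
    id => 9605, rank => 'genus', scientific_name => 'Homo'));
$taxonomy->add_node(Sketch::Node->new(
    id => 9606, parent_id => 9605, rank => 'species',
    scientific_name => 'Homo sapiens'));
print $taxonomy->genus_name(9606), "\n";
```

Because the lookup walks parent ids, nodes with no rank can stay in the collection without needing a dedicated method for every taxonomic level.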
In short, I'm proposing a major change to Bio::Taxonomy::Node (make it just a node), and minor changes to (& implementation of) Bio::Taxonomy and Bio::Taxonomy::FactoryI such that they actually get used to do their jobs. > That's also why I thought binomial() could stick around; if you have both > the genus() and species() you could grab both using binomial(), building in > special cases or error handling in case genus() or species() or both return > undef. binomial() would belong in (and is present in) Bio::Taxonomy. But in any case, it's not needed there either; if you want the binomial you just ask for the scientific_name of the species node in your Taxonomy, since this now contains the actual scientific name == binomial. binomial() in Bio::Taxonomy could be reimplemented as: $node = $self->get_node('species') || return; return $node->scientific_name; >> Currently, flatfile and entrez ignore nodes with a rank of 'no rank' >> when they build the classification array. I had no intention of changing >> this behaviour. > > If you ignore nodes with 'no rank' there will be major problems when > retrieving certain TaxID's from protein/nucleotide sequences. This is only for the classification array, which is meaningless anyway (there only for file-format compatibility). If you want the real information you ask your Bio::Taxonomy (which asks each of its nodes). This is the whole point of having Bio::Taxonomy in the first place. It gives you great flexibility to do whatever you want to do. >>> 1760 >>> Actinobacteria (class) >>> class >> Ugh. I guess my proposal to remove <> bits via flatfile extends to >> removing () bits via entrez. We don't need unique names; we can use >> object_id() when uniqueness matters. > > The XML parsing in Taxonomy::entrez will take care of the and retains > the character data in between. You misunderstood. I meant the <> bits I discussed at the very start of this thread, that flatfile gives you. 
Here I'm referring to getting rid of ' (class)' as well. > Any way we go about it here (keeping certain methods and tossing others, > changing the data returned, etc), it looks like there will be API issues > down the road which will directly affect anyone using tax data. That > affects bioperl-db directly as well as any other bioperl-based DB's which > rely on tax data. So we need to tread a bit carefully when making major > changes to make sure that they work for bioperl-db and anywhere else that > may require it. Does anything make serious use of the current Bio::Taxonomy code? Or are they using Bio::Species? From cjfields at uiuc.edu Wed Jul 19 04:38:05 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 18 Jul 2006 23:38:05 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BD65BD.4030501@sendu.me.uk> References: <000901c6aaa1$328dd3d0$15327e82@pyrimidine> <44BD65BD.4030501@sendu.me.uk> Message-ID: I think we should wait a bit for any dramatic changes but implement the ones there seems to be a consensus on. I understand your reasoning for taking this on but I'm not sure completely revamping Bio::Taxonomy w/o input from the core developers is wise, especially since we do NOT know who uses it, why they use it, and how changing/removing methods will affect their code. We are doing nothing productive here by constantly butting heads on this and having different opinions on what we think Bio::Taxonomy/Bio::Species is best suited for, when neither one of us is actually sure about who uses it and why. A reasonable solution is there but we must rely on outside opinions in order to reach it, so I propose a short moratorium on changes to Bio::Taxonomy/Bio::Species that radically redefine the API on either class. BTW, for anybody following, I'm perfectly comfortable if Sendu takes the lead on this and implements his changes; I'm just not sure about stripping the class down to the bare minimum. 
So far, the only thing that has been proposed (and accepted by all) is that scientific_name() hold the data for that tag in a node. I think most here would agree that's fine; I've already added a get/set to Bio::Species but haven't committed it yet. However, what you propose doing below is refactoring the code and changing the API. I agree there needs to be an overhaul but we can't do this w/o guidance or input from the GBE (Great Bioperl Elders). I would like some of the 'senior' core developers chime in a bit more on their thoughts on this. Jason also mentioned somewhere that any changes for Taxonomy/ Species should be tracked on the wiki somewhere as well to make sure everything is kosher and keep users up-to-date. I would like his input here but I think he's still incommunicado at the moment. Chris On Jul 18, 2006, at 5:50 PM, Sendu Bala wrote: > Chris Fields wrote: >> ... >>> [regarding changes to Bio::Taxonomy::Node] >>> >>> Actually, I'm really strongly leaning toward getting rid of the >>> following methods and new() options (and giving up entirely on being >>> able to keep 'sapiens' somewhere): >>> >>> -organelle, organelle() >>> -division, division() >>> -sub_species, sub_species() >>> -variant, variant() >>> species(), validate_species_name() >>> genus() >>> binomial() >> >> Bio::Species and Bio::Taxonomy::Node are closely linked and plans >> are to >> have Bio::Species delegate methods to Bio::Taxonomy::Node. So any >> changes >> to Node will affect Bio::Species to some degree. > > I see from the original postings that Node was intended to be like > Species, but I don't think it makes the slightest bit of sense. A > /single/ Node need only (must only!) represent the information for a > single node in the taxonomy. Or else what do these objects mean? > What is > the object model? 
It's bad bad bad for it to be sensible one way (when > you're making your own taxonomy by making your own nodes) and > nonsensical another (when we stuff in methods so that Bio::Species is > happy). The way Node is written right now, and what you're suggesting, > is that we stuff the entire Taxonomy into the Node. Well, except that > you don't even have methods for every taxonomic level - there is > genus() > but no subphylum(). I can't emphasise strongly enough how insane all > this is. > > The correct thing for Bio::Species to interact with is Bio::Taxonomy. > Bio::Taxonomy is a collection of Nodes and has the sort of methods > that > Bio::Species would need to delegate its current functionality. > > I'm quite willing to do a proper overhaul here so everything makes > sense. You either make your own nodes and add these to a Taxonomy > or use > a factory (which would use a Bio::DB::Taxonomy presumably). A Taxonomy > lets you discover the classification of any node it contains. > Bio::Species could implement a method like genus() by: > $node = $taxonomy->get_node('genus') || return; > return $node->scientific_name; > > Bio::Taxonomy isn't perfect, but I can certainly get it to do its job. > I'd probably make it rank-name and order independent for starters. > > Bio::Taxonomy::Node needs to be reduced right down to just hold data > about the node it represents, and possibly its parent node id (or > other > way of getting to its parent). So now I'm proposing dropping the > classification() method from Node as well. It's simply not necessary; > Bio::Taxonomy should give you that information. > > Bio::Taxnomoy::FactoryI doesn't make much sense to me at the moment > from > its docs, but it could be used to build a Taxonomy (that seems to > be its > intent, I'm just not sure what some of the methods are really supposed > to do) such that Node might not even need any methods for getting its > parent or child nodes. The Factory or Taxonomy might be able to deal > with that. 
> > In short, I'm proposing a major change to Bio::Taxonomy::Node (make it > just a node), and minor changes to (& implementation of) Bio::Taxonomy > and Bio::Taxonomy::FactoryI such that they actually get used to do > their > jobs. > > >> That's also why I thought binomial() could stick around; if you >> have both >> the genus() and species() you could grab both using binomial(), >> building in >> special cases or error handling in case genus() or species() or >> both return >> undef. > > binomial() would belong in (and is present in) Bio::Taxonomy. But > in any > case, it's not needed there either; if you want the binomial you just > ask for the scientific_name of the species node in your Taxonomy, > since > this now contains the actual scientific name == binomial. > > binomial() in Bio::Taxonomy could be reimplemented as: > $node = $self->get_node('species') || return; > return $node->scientific_name; > > >>> Currently, flatfile and entrez ignore nodes with a rank of 'no rank' >>> when they build the classification array. I had no intention of >>> changing >>> this behaviour. >> >> If you ignore nodes with 'no rank' there will be major problems when >> retrieving certain TaxID's from protein/nucleotide sequences. > > This is only for the classification array, which is meaningless anyway > (there only for file-format compatibility). If you want the real > information you ask your Bio::Taxonomy (which asks each of its nodes). > This is the whole point of having Bio::Taxonomy in the first place. > > It gives you great flexibility to do whatever you want to do. > > >>>> 1760 >>>> Actinobacteria (class) >>>> class >>> Ugh. I guess my proposal to remove <> bits via flatfile extends to >>> removing () bits via entrez. We don't need unique names; we can use >>> object_id() when uniqueness matters. >> >> The XML parsing in Taxonomy::entrez will take care of the >> and retains >> the character data in between. > > You misunderstood. 
I meant the <> bits I discussed at the very > start of > this thread, that flatfile gives you. Here I'm referring to getting > rid > of ' (class)' as well. > > >> Any way we go about it here (keeping certain methods and tossing >> others, >> changing the data returned, etc), it looks like there will be API >> issues >> down the road which will directly affect anyone using tax data. That >> affects bioperl-db directly as well as any other bioperl-based >> DB's which >> rely on tax data. So we need to tread a bit carefully when making >> major >> changes to make sure that they work for bioperl-db and anywhere >> else that >> may require it. > > Does anything make serious use of the current Bio::Taxonomy code? > Or are > they using Bio::Species? > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From ong at embl.de Wed Jul 19 07:51:48 2006 From: ong at embl.de (ong at embl.de) Date: Wed, 19 Jul 2006 09:51:48 +0200 Subject: [Bioperl-l] Fwd: Re: BioPerl query Message-ID: <20060719095148.f71b1v3p7qosk440@webmail.embl.de> HI, Anybody have an answer to the below query? Thanks. Regards, Ong ----- Forwarded message from birney at ebi.ac.uk ----- Date: Wed, 19 Jul 2006 08:16:06 +0100 From: Ewan Birney Reply-To: Ewan Birney Subject: Re: BioPerl query To: ong at embl.de On 18 Jul 2006, at 10:26, ong at embl.de wrote: > Dear Birney, > > Good day i wish to get your advise on how do i print out the PSM > matrix from > the code below. Thanks > I would ask this message on the bioperl list, not to me directly. 
> Regards, > Ong > > use Bio::Matrix::PSM::IO; > > my $psmIO=new Bio::Matrix::PSM::IO(-file=>'matrix.dat',- > format=>'transfac'); > while (my $psm=$psmIO->next_psm) { > my $id=$psm->id; > my $an=$psm->accession_number; > my $re = $psm->regexp; > #my $l=$psm->width; > my $cons=$psm->IUPAC; > print"$id\t$an\t$re\t$l\t$cons\t$psm\n"; > } ----- End forwarded message ----- From rmb32 at cornell.edu Wed Jul 19 00:06:02 2006 From: rmb32 at cornell.edu (Robert Buels) Date: Tue, 18 Jul 2006 17:06:02 -0700 Subject: [Bioperl-l] bioperl pulls, xml parsers push, and things get complicated Message-ID: <44BD776A.1080402@cornell.edu> Hi all, Here's a kind of abstract question about Bioperl and XML parsing: I'm thinking about writing a bioperl parser for genomethreader XML, and I'm sort of mulling over the 'impedence mismatch' between the way bioperl Bio::*IO::* modules work and the way all of the current XML parsers work. Bioperl uses a 'pull' model, where every time you want a new chunk of stuff, you call $io_object->next_thing. All the XML parsers (including XML::SAX, XML::Parser::PerlSAX and XML::Twig) use a 'push' model, where every time they parse a chunk, they call _your_ code, usually via a subroutine reference you've given to the XML parser when you start it up. From what I can tell, current Bioperl IO modules that parse XML are using push parsers to parse the whole document, holding stuff in memory, then spoon-feeding it in chunks to the calling program when it calls next_*(). This is fine until the input XML gets really big, in which case you can quickly run out of memory. Does anybody have good ideas for nice, robust ways of writing a bioperl IO module for really big input XML files? There don't seem to be any perl pull parsers for XML. 
All I've dug up so far would be having the XML push parser running in a different thread or process, pushing chunks of data into a pipe or similar structure that blocks the progress of the push parser until the pulling bioperl code wants the next piece of data, but there are plenty of ugly issues with that, whether one were to use perl threads for it (aaagh!) or fork and push some kind of intermediate format through a pipe or socket between the two processes (eek!). So, um, if you've read this far, do you have any ideas? Rob -- Robert Buels SGN Bioinformatics Analyst 252A Emerson Hall, Cornell University Ithaca, NY 14853 Tel: 503-889-8539 rmb32 at cornell.edu http://www.sgn.cornell.edu From alc at sanger.ac.uk Wed Jul 19 10:55:12 2006 From: alc at sanger.ac.uk (Avril Coghlan) Date: Wed, 19 Jul 2006 11:55:12 +0100 Subject: [Bioperl-l] parsing est2genome output Message-ID: <1153306513.27383.12.camel@deskpro104.dynamic.sanger.ac.uk> An embedded and charset-unspecified text was scrubbed... Name: not available URL: From bernd.web at gmail.com Wed Jul 19 11:36:08 2006 From: bernd.web at gmail.com (Bernd Web) Date: Wed, 19 Jul 2006 13:36:08 +0200 Subject: [Bioperl-l] SearchIO HOWTO Message-ID: <716af09c0607190436n5fdd5576m23887051aaf95f8e@mail.gmail.com> Hi, On http://www.bioperl.org/wiki/HOWTO:SearchIO there is a great HOWTO on parsing your BLAST report. In the Table of methods, the third line from the bottom is: "HSP alignment Not available in this report Bio::SimpleAlign object " Would it not be good to add the get_aln method ($hsp->get_aln)? The line in "Using the methods" my $alignment_as_string = $alnIO->write_aln($aln); may be confusing: $alignment_as_string will be "1" on success and the alignment is printed to STDOUT. Should IO::String be introduced here to set up a string filehandle? 
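To illustrate the last point: a writer that prints to its filehandle and returns a success flag can be captured by giving it a filehandle opened on a scalar. The sketch below is only an illustration of that mechanism. It uses Perl's built-in in-memory filehandles rather than IO::String, and write_report() is a hypothetical stand-in for a write_aln()-style method, not Bio::AlignIO itself; passing such a handle as -fh to Bio::AlignIO is assumed to work the same way but is not shown here.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a write_aln()-style method: it prints to the
# filehandle it was given and returns 1 on success, not the text itself.
sub write_report {
    my ($fh, @lines) = @_;
    print {$fh} "$_\n" for @lines;
    return 1;
}

# Open a write handle on a scalar (core Perl, 5.8 and later).
my $buffer = '';
open(my $string_fh, '>', \$buffer)
    or die "Cannot open in-memory handle: $!";

my $status = write_report($string_fh, 'seq1 ACGT', 'seq2 AC-T');
close($string_fh);

print "return value: $status\n";   # the success flag, not the alignment
print "captured:\n$buffer";        # the text that was actually written
```

The return value and the written text stay separate, which is exactly why assigning the result of the write call to a variable named like a string is misleading.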
Best regards, Bernd From hlapp at gmx.net Wed Jul 19 13:40:47 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Wed, 19 Jul 2006 09:40:47 -0400 Subject: [Bioperl-l] bioperl pulls, xml parsers push, and things get complicated In-Reply-To: <44BD776A.1080402@cornell.edu> References: <44BD776A.1080402@cornell.edu> Message-ID: <73755CCF-2966-4580-BBEF-1F8A94CDC55D@gmx.net> In the past the way this was done for potentially big XML files is to use regex-based extraction of chunks that correspond to a object you want to return per call to next_XXX(). That chunk would then be passed on to the XML parser under the hood. This only gets problematic once even the chunks are huge, or the name of the element that encloses your chunk can be ambiguous with what's in your text. The latter is unlikely though if you include the angle brackets. I believe this is how at least some bioperl parsers for XML-based formats were written, and it seemed to work fine. -hilmar On Jul 18, 2006, at 8:06 PM, Robert Buels wrote: > Hi all, > > Here's a kind of abstract question about Bioperl and XML parsing: > > I'm thinking about writing a bioperl parser for genomethreader XML, > and > I'm sort of mulling over the 'impedence mismatch' between the way > bioperl Bio::*IO::* modules work and the way all of the current XML > parsers work. Bioperl uses a 'pull' model, where every time you > want a > new chunk of stuff, you call $io_object->next_thing. All the XML > parsers (including XML::SAX, XML::Parser::PerlSAX and XML::Twig) use a > 'push' model, where every time they parse a chunk, they call _your_ > code, usually via a subroutine reference you've given to the XML > parser > when you start it up. > > From what I can tell, current Bioperl IO modules that parse XML are > using push parsers to parse the whole document, holding stuff in > memory, > then spoon-feeding it in chunks to the calling program when it calls > next_*(). 
This is fine until the input XML gets really big, in which > case you can quickly run out of memory. > > Does anybody have good ideas for nice, robust ways of writing a > bioperl > IO module for really big input XML files? There don't seem to be any > perl pull parsers for XML. All I've dug up so far would be having the > XML push parser running in a different thread or process, pushing > chunks > of data into a pipe or similar structure that blocks the progress > of the > push parser until the pulling bioperl code wants the next piece of > data, > but there are plenty of ugly issues with that, whether one were too > use > perl threads for it (aaagh!) or fork and push some kind of > intermediate > format through a pipe or socket between the two processes (eek!). > > So, um, if you've read this far, do you have any ideas? > > Rob > > -- > Robert Buels > SGN Bioinformatics Analyst > 252A Emerson Hall, Cornell University > Ithaca, NY 14853 > Tel: 503-889-8539 > rmb32 at cornell.edu > http://www.sgn.cornell.edu > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From jay at jays.net Wed Jul 19 13:43:52 2006 From: jay at jays.net (Jay Hannah) Date: Wed, 19 Jul 2006 08:43:52 -0500 (CDT) Subject: [Bioperl-l] Walking multiple bioentries using bioperl-db Message-ID: Howdy -- I'm using bioperl-db + biosql-schema + mySQL. I can now successfully build a biosql-schema instance in mySQL, load taxonomy, then using bioperl-db load a GenBank file from disk, commiting the sequences I want. For a given accession number + version + namespace, I can tell bioperl-db to delete that from mySQL and it does. Yay!! I'll be throwing a "Using bioperl-db" document onto the wiki over the next week. 
What I am currently baffled by: How do I ask bioperl-db to walk over multiple bioentries in my database so I can do things with them? The simplest possible example: print a list of all bioentries in my database. It is trivially easy to just query mySQL directly, but if I'm reading / understanding the documentation correctly bioperl-db intends to be database schema and RDBMS agnostic. In that case, I should use bioperl-db to walk my records. So, how do I do that? Is Bio::DB::Query::BioQuery the way to do this? The only way? If so then can someone help me understand the datacollections() and where() methods?

perldoc Bio::DB::Query::BioQuery

    # all mouse sequences loaded under namespace ensembl that
    # have receptor in their description
    $query->datacollections(["Bio::PrimarySeqI e",
                             "Bio::Species=>Bio::PrimarySeqI sp",
                             "BioNamespace=>Bio::PrimarySeqI db"]);
    $query->where(["sp.binomial like 'Mus *'",
                   "e.desc like '*receptor*'",
                   "db.namespace = 'ensembl'"]);

    # all mouse sequences loaded under namespace ensembl that
    # have receptor in their description, and that also have a
    # cross-reference with SWISS as the database
    $query->datacollections(["Bio::PrimarySeqI e",
                             "Bio::Species=>Bio::PrimarySeqI sp",
                             "BioNamespace=>Bio::PrimarySeqI db",
                             "Bio::Annotation::DBLink xref",

I'm bewildered by this API. Please forgive my ignorance. 1) How do I get *all* bioentries out of my database? 2) Say I did want just the "namespace" 'Pico' (one of my biodatabase.name's). Where did "BioNamespace=>Bio::PrimarySeqI db"]); come from? How was I supposed to figure out the left hand side of that mapping? The right hand side? If that line wasn't sitting in that document was there a way for me to figure it out as a *user* of bioperl-db? Or would I need to be a *programmer* of bioperl-db reading source to figure this out? Where did "db.namespace = 'ensembl'"]); come from? Again, do I have to read source code to know how to invoke that magic? Sorry if I sound like a jerk. That is not my intention. 
Hopefully I can document the answers for future bioperl-db'ers. Thanks in advance, j my current plaything: http://openlab.jays.net From cjfields at uiuc.edu Wed Jul 19 14:34:48 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Wed, 19 Jul 2006 09:34:48 -0500 Subject: [Bioperl-l] bioperl pulls, xml parsers push, and things get complicated In-Reply-To: <44BD776A.1080402@cornell.edu> Message-ID: <002801c6ab40$7cfcd980$15327e82@pyrimidine> The Bio::SearchIO modules are supposed to work like a SAX parser, where results are returned as the report is parsed b/c of the occurrence of specific 'events' (start_element, end_element, and so on). However, the actual behaviour for each module changes depending on the report type and the author's intention. There was a thread about a month ago on HMMPFAM report parsing where there was some contention as to how to build hits(models)/HSPs(domains). HMMPFAM output has one HSP per hit and is sorted on the sequence length so a particular hit can appear more than once, depending on how many times it hits along the sequence length itself. So, to gather all the HSPs together under one hit you would have to parse the entire report and build up a Hit/HSP tree, then use the next_hit/next_hsp iterators to parse through everything. Currently it just reports Hit/HSP pairs and it is up to the user to build that tree. In contrast, BLAST output should be capable of throwing hit/HSP clusters on the fly based on the report output, but is quite slow (even the XML output crawls). Jason thinks it's b/c of object inheritance and instantiation; I think it's probably more complicated than that (there are a ton of method calls which tend to slow things down quite a bit as well). 
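For the HMMPFAM case described above, building the tree yourself can amount to grouping the Hit/HSP pairs by hit name once the whole report has been read. A minimal sketch, assuming made-up model names and coordinates and plain hashes in place of real Hit/HSP objects:

```perl
use strict;
use warnings;

# Made-up (hit name, start, end) pairs in the order a report sorted along
# the sequence might give them; the same model can appear more than once.
my @pairs = (
    { hit => 'modelA', start => 10,  end => 80  },
    { hit => 'modelB', start => 95,  end => 150 },
    { hit => 'modelA', start => 200, end => 270 },
);

my (%hsps_for, @hit_order);
for my $pair (@pairs) {
    my $name = $pair->{hit};
    # Remember first-seen order so the output stays deterministic.
    push @hit_order, $name unless exists $hsps_for{$name};
    push @{ $hsps_for{$name} }, $pair;
}

# Walk the tree: one entry per hit, each with all of its HSP ranges.
for my $name (@hit_order) {
    my @ranges = map { "$_->{start}-$_->{end}" } @{ $hsps_for{$name} };
    print "$name: @ranges\n";
}
```

The cost is the one already mentioned: every pair has to be held in memory before the first grouped hit can be returned, so this does not stream.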
I would say try using SearchIO, but instead of relying directly on object handler calls to create Hit/HSP objects using an object factory (which is where I think a majority of the speed is lost), build the data internally on the fly using start_element/end_element, then return hashes instead based on the element type triggered using end_element. As an aside, I'm trying to switch the SearchIO::blastxml over to XML::SAX (using XML::SAX::ExpatXS/expat) and plan on switching it over to using hashes at some point, possibly starting off with a different SearchIO plugin module. If you have other suggestions (XML parser of choice, ways to speed up parsing/retrieve data) we would be glad to hear them. Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Robert Buels > Sent: Tuesday, July 18, 2006 7:06 PM > To: bioperl-l at bioperl.org > Subject: [Bioperl-l] bioperl pulls, xml parsers push,and things get > complicated > > Hi all, > > Here's a kind of abstract question about Bioperl and XML parsing: > > I'm thinking about writing a bioperl parser for genomethreader XML, and > I'm sort of mulling over the 'impedence mismatch' between the way > bioperl Bio::*IO::* modules work and the way all of the current XML > parsers work. Bioperl uses a 'pull' model, where every time you want a > new chunk of stuff, you call $io_object->next_thing. All the XML > parsers (including XML::SAX, XML::Parser::PerlSAX and XML::Twig) use a > 'push' model, where every time they parse a chunk, they call _your_ > code, usually via a subroutine reference you've given to the XML parser > when you start it up. > > From what I can tell, current Bioperl IO modules that parse XML are > using push parsers to parse the whole document, holding stuff in memory, > then spoon-feeding it in chunks to the calling program when it calls > next_*(). 
This is fine until the input XML gets really big, in which > case you can quickly run out of memory. > > Does anybody have good ideas for nice, robust ways of writing a bioperl > IO module for really big input XML files? There don't seem to be any > perl pull parsers for XML. All I've dug up so far would be having the > XML push parser running in a different thread or process, pushing chunks > of data into a pipe or similar structure that blocks the progress of the > push parser until the pulling bioperl code wants the next piece of data, > but there are plenty of ugly issues with that, whether one were too use > perl threads for it (aaagh!) or fork and push some kind of intermediate > format through a pipe or socket between the two processes (eek!). > > So, um, if you've read this far, do you have any ideas? > > Rob > > -- > Robert Buels > SGN Bioinformatics Analyst > 252A Emerson Hall, Cornell University > Ithaca, NY 14853 > Tel: 503-889-8539 > rmb32 at cornell.edu > http://www.sgn.cornell.edu > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Wed Jul 19 14:44:30 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Wed, 19 Jul 2006 09:44:30 -0500 Subject: [Bioperl-l] SearchIO HOWTO In-Reply-To: <716af09c0607190436n5fdd5576m23887051aaf95f8e@mail.gmail.com> Message-ID: <002901c6ab41$d7f61350$15327e82@pyrimidine> The information in that table is referring to the BLAST report example before the table itself. However, I can tell you that using that report works (sorry if the text wrapping here mangles the output), so the table information is erroneous. I'll do some updating on that. 
Chris

Here's the script:

    use Bio::SearchIO;
    use Bio::AlignIO;

    my $parser = Bio::SearchIO->new(-file   => shift @ARGV,
                                    -format => 'blast');
    my $aln_out = Bio::AlignIO->new(-fh     => \*STDOUT,
                                    -format => 'clustalw');

    while (my $result = $parser->next_result) {
        while (my $hit = $result->next_hit) {
            while (my $hsp = $hit->next_hsp) {
                $aln_out->write_aln($hsp->get_aln);
            }
        }
    }

Output (via STDOUT):
------------------------------------
CLUSTAL W(1.81) multiple sequence alignment

gi|20521485|dbj|AP004641.2/2896-3051   DMGRCSSGCNRYPEPMTPDTMIKLYREKEGLGAYIWMPTPDMSTEGRVQMLP
gb|443893|124775/197-246               DIVQNSSGCNRYPEPMTPDTMIKLYRE-EGL-AYIWMPTPDMSTEGRVQMLP
                                       *: : ********************** *** ********************
------------------------------------

> -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Bernd Web > Sent: Wednesday, July 19, 2006 6:36 AM > To: bioperl-l at bioperl.org > Subject: [Bioperl-l] SearchIO HOWTO > > Hi, > > On http://www.bioperl.org/wiki/HOWTO:SearchIO there is a great HOWTO > parse your BLAST report. > In the Table of methods, the third line from the bottom is: > "HSP alignment Not available in this report Bio::SimpleAlign object " > > Would it not be good to add the get_aln method ( $hsp->get_aln) ? > > The line in "Using the methods" > my $alignment_as_string = $alnIO->write_aln($aln); > > may be confusing: $alignment_as_string will be "1" on success and the > alignment is printed to STDIO. Should IO::String be introduced here > too set up a string filehandle? 
> > > Best regards, > Bernd > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Wed Jul 19 14:55:02 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Wed, 19 Jul 2006 09:55:02 -0500 Subject: [Bioperl-l] ListSummaries delay apologies Message-ID: <002a01c6ab43$508aa5a0$15327e82@pyrimidine> Sorry about the delay for the ListSummaries the past couple months; things have been pretty hectic here which has put me really behind on them (it hasn't ever been my top priority, anyway). We're getting papers ready for publication, I'm going to a summer institute in a few weeks, and research (as always) is full steam ahead. Just so everybody knows, I haven't given up on them, and plan on getting caught up after I get back from the institute in Connecticut (beginning of August). Cheers! Christopher Fields Postdoctoral Researcher - Switzer Lab Dept. of Biochemistry University of Illinois Urbana-Champaign From hlapp at gmx.net Wed Jul 19 15:31:50 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Wed, 19 Jul 2006 11:31:50 -0400 Subject: [Bioperl-l] Walking multiple bioentries using bioperl-db In-Reply-To: References: Message-ID: <62DA6CBC-CD0E-46A7-A669-71FFC808041B@gmx.net> On Jul 19, 2006, at 9:43 AM, Jay Hannah wrote: > Howdy -- > > I'm using bioperl-db + biosql-schema + mySQL. > > I can now successfully build a biosql-schema instance in mySQL, load > taxonomy, then using bioperl-db load a GenBank file from disk, > commiting > the sequences I want. For a given accession number + version + > namespace, > I can tell bioperl-db to delete that from mySQL and it does. Yay!! > I'll be > throwing a "Using bioperl-db" document onto the wiki over the next > week. Excellent! > > What I am current baffled by: > > How do I ask bioperl-db to walk over multiple bioentries in my > database so > I can do things with them? 
The simplest possible example: print a > list of > all bioentries in my database. > > It is trivially easy to just query mySQL directly, but if I'm > reading / > understanding the documentation correctly bioperl-db intends to be > database schema and RDBMS agnostic. In that case, I should use > bioperl-db > to walk my records. So, how do I do that? Bioperl-db indeed intends to be schema(-variant) and RDBMS agnostic, but that doesn't mean that you have to be as well. If you find it trivially easy to query your database using SQL and DBI and you don't care about being RDBMS or schema-variant agnostic, then by all means don't feel obligated to go through the bioperl-db API for querying. Note you can obtain the DBI database handle being used by a persistence adaptor by calling dbh():

    my $dbh = $adaptor->dbh();

(The advantage of this is that you use the same connection, and therefore the same machinery for obtaining connection parameters and building the DSN that the rest of bioperl-db uses. Also, you have the ability to see transactions in progress that have not been committed yet by the adaptor.) What you should not do through SQL directly is modify (UPDATE & DELETE) entities which bioperl-db also holds in a cache (by default terms, dbxrefs), unless you also take care to clear the cache of the respective adaptor. > > Is Bio::DB::Query::BioQuery the way to do this? The only way? Well, yes, unless you want to use SQL directly (which is not a despised option, see above). > > If so then can someone help me understand the datacollections() and > where() methods? datacollections() in essence corresponds to the FROM clause in a SQL statement, including JOIN statements. '=>' joins two entities in a 1:n relationship, '<=>' joins two entities in an n:n relationship. Instead of the table(s) you give the (Bioperl) objects that are to be joined, and bioperl-db will translate the objects to database entities, i.e., tables. Each object may be followed by an alias. 
The alias makes it easier to refer to the object (entity) in the query constraint part (where()). A single alias following a join expression will always apply to the master object (table).

> > perldoc Bio::DB::Query::BioQuery
> >
> > # all mouse sequences loaded under namespace ensembl that
> > # have receptor in their description
> > $query->datacollections(["Bio::PrimarySeqI e",
> >                          "Bio::Species=>Bio::PrimarySeqI sp",
> >                          "BioNamespace=>Bio::PrimarySeqI db"]);

This is short for

    $query->datacollections([
        # enumerate the objects we need:
        "Bio::PrimarySeqI e",
        "Bio::Species sp",
        "BioNamespace db",
        # specify master-detail relationships
        "Bio::Species=>Bio::PrimarySeqI",
        "BioNamespace=>Bio::PrimarySeqI"]);

because the alias following the join statement applies to the master entity.

> $query->where(["sp.binomial like 'Mus *'",
>                "e.desc like '*receptor*'",
>                "db.namespace = 'ensembl'"]);

The where() method corresponds to the WHERE clause in SQL. The default logical operator between constraints is AND. There is more documentation on the syntax of expressing constraints in Bio::DB::Query::QueryConstraint. The column for which to constrain the value is given as the attribute (method) of the (bioperl) object. If there are multiple objects in the 'datacollections' then you need to qualify each attribute by prefixing it with the object, or the alias assigned in datacollections(), followed by a dot; corresponding to typical OO syntax.

> > # all mouse sequences loaded under namespace ensembl that
> > # have receptor in their description, and that also have a
> > # cross-reference with SWISS as the database
> > $query->datacollections(["Bio::PrimarySeqI e",
> >                          "Bio::Species=>Bio::PrimarySeqI sp",
> >                          "BioNamespace=>Bio::PrimarySeqI db",
> >                          "Bio::Annotation::DBLink xref",
> >
> > I'm bewildered by this API. Please forgive my ignorance.

I understand. This part of the API is by far the one with the skimpiest documentation. 
There are a considerable number of tests in t/query.t which may serve as examples. They are also known to work, provided the tests don't fail. The tests don't actually execute any query; instead some internal guts are used to test the translation to SQL, so if you know SQL you may be able to understand better what's going on by seeing the object-level query and the SQL-level query side by side.

>
> 1) How do I get *all* bioentries out of my database?

Your datacollections would consist of the single object Bio::SeqI (or Bio::PrimarySeqI if you didn't want any annotation), and there would be no query constraint:

    my $query = Bio::DB::Query::BioQuery->new(-datacollections => ["Bio::SeqI"]);

>
> 2) Say I did want just the "namespace" 'Pico' (one of my
> biodatabase.name's). Where did
>
> "BioNamespace=>Bio::PrimarySeqI db"]);
>
> come from? How was I supposed to figure out the left hand side of that
> mapping? The right hand side? If that line wasn't sitting in that
> document
> was there a way for me to figure it out as a *user* of bioperl-db?

You would not know from Bioperl itself. The right hand side is a Bioperl class. The left hand side is a kludge because Bioperl does not have a namespace class; instead, objects that have a namespace implement the Bio::IdentifiableI interface directly. This kind of one class mapping to two database entities (biodatabase is a table separate from, in fact a master for, bioentry) is extremely cumbersome to express in a generic way, so I chose to create a Bio::DB::Persistent::BioNamespace class to represent that for the purpose of queries.

> Or would I need to be a *programmer* of bioperl-db reading source
> to figure
> this out? Where did
>
> "db.namespace = 'ensembl'"]);
>
> come from? Again, do I have to read source code to know how to invoke
> that magic?
Well, I'm not sure even reading the source code clears it all up ;) As I said before, the part before the dot is the alias or object, the part after is the attribute (or method) to be constrained.

>
> Sorry if I sound like a jerk. That is not my intention. Hopefully I
> can
> document the answers for future bioperl-db'ers.

No problem, that's fine - and whatever you would be willing to contribute to documentation would be highly appreciated.

-hilmar

>
> Thanks in advance,
>
> j
> my current plaything: http://openlab.jays.net
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>

--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From aaron.j.mackey at gsk.com Wed Jul 19 13:48:55 2006
From: aaron.j.mackey at gsk.com (aaron.j.mackey at gsk.com)
Date: Wed, 19 Jul 2006 09:48:55 -0400
Subject: [Bioperl-l] bioperl pulls, xml parsers push, and things get complicated
In-Reply-To: <44BD776A.1080402@cornell.edu>
Message-ID:

There are 3rd generation XML "Pull" parsers (also called "StAX" for Streaming API for XML), but they seem to still be stuck in Java land (e.g. "MXP1").

You could probably use POE to set up a state machine that used XML::Twig to "push" units of XML content onto a stack, to be read by your "next_*" pull method (where the XML::Twig push "stalled" until the "next_*" method was called, and vice versa).

-Aaron

bioperl-l-bounces at lists.open-bio.org wrote on 07/18/2006 08:06:02 PM:

> Hi all,
>
> Here's a kind of abstract question about Bioperl and XML parsing:
>
> I'm thinking about writing a bioperl parser for genomethreader XML, and
> I'm sort of mulling over the 'impedance mismatch' between the way
> bioperl Bio::*IO::* modules work and the way all of the current XML
> parsers work.
> Bioperl uses a 'pull' model, where every time you want a
> new chunk of stuff, you call $io_object->next_thing. All the XML
> parsers (including XML::SAX, XML::Parser::PerlSAX and XML::Twig) use a
> 'push' model, where every time they parse a chunk, they call _your_
> code, usually via a subroutine reference you've given to the XML parser
> when you start it up.
>
> From what I can tell, current Bioperl IO modules that parse XML are
> using push parsers to parse the whole document, holding stuff in memory,
> then spoon-feeding it in chunks to the calling program when it calls
> next_*(). This is fine until the input XML gets really big, in which
> case you can quickly run out of memory.
>
> Does anybody have good ideas for nice, robust ways of writing a bioperl
> IO module for really big input XML files? There don't seem to be any
> perl pull parsers for XML. All I've dug up so far would be having the
> XML push parser running in a different thread or process, pushing chunks
> of data into a pipe or similar structure that blocks the progress of the
> push parser until the pulling bioperl code wants the next piece of data,
> but there are plenty of ugly issues with that, whether one were to use
> perl threads for it (aaagh!) or fork and push some kind of intermediate
> format through a pipe or socket between the two processes (eek!).
>
> So, um, if you've read this far, do you have any ideas?
>
> Rob
>
> --
> Robert Buels
> SGN Bioinformatics Analyst
> 252A Emerson Hall, Cornell University
> Ithaca, NY 14853
> Tel: 503-889-8539
> rmb32 at cornell.edu
> http://www.sgn.cornell.edu
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>

From arareko at campus.iztacala.unam.mx Wed Jul 19 16:20:21 2006
From: arareko at campus.iztacala.unam.mx (Mauricio Herrera Cuadra)
Date: Wed, 19 Jul 2006 11:20:21 -0500
Subject: [Bioperl-l] bioperl pulls, xml parsers push, and things get complicated
In-Reply-To: <002801c6ab40$7cfcd980$15327e82@pyrimidine>
References: <002801c6ab40$7cfcd980$15327e82@pyrimidine>
Message-ID: <44BE5BC5.5040006@campus.iztacala.unam.mx>

There are a lot of different XML processing strategies. Most fall into two categories: stream-based and tree-based.

With the stream-based strategy, the parser continuously alerts a program to patterns in the XML. The parser functions like a pipeline, taking XML markup on one end and pumping out processed nuggets of data to your program.

With the tree-based strategy, the parser keeps the data to itself until the very end, when it presents a complete model of the document to your program. The whole point to this strategy is that your program can pull out any data it needs, in any order.

Most of the time I use tree-based strategies because they place all of the data into a structure which lets me access any internal node using array/hash references. The simplest parser for this is XML::Simple using XML::Parser as the 'preferred parser' (which is built on top of XML::Parser::Expat, which is a wrapper around the expat library).

More advanced parsers (both stream and tree-based) are:

* XML::LibXML (a wrapper for libxml2's C library)
* XML::Grove (takes a tree and changes it into an object hierarchy.
  Each node type is represented by a different class)
* XML::PYX (for repackaging XML as a stream of easily recognizable and transmutable symbols)
* XML::SimpleObject (changes a hierarchy of lists into a hierarchy of objects)
* XML::XPath (for writing expressions that pinpoint specific pieces of documents)

There are also some standards-based solutions like:

* XML::SAX (Simple API for XML) for event streams.
* XML::DOM (Document Object Model) for tree processing.

Your strategy of choice depends a lot on the type of XML files you want to parse. Understanding the structure of the files and deciding which data you want to extract from them is a fundamental step in choosing the appropriate method/parser to use.

Just my 2 cents :)

Regards,
Mauricio.

Chris Fields wrote:
> The Bio::SearchIO modules are supposed to work like a SAX parser, where results
> are returned as the report is parsed b/c of the occurrence of specific
> 'events' (start_element, end_element, and so on). However, the actual
> behaviour for each module changes depending on the report type and the
> author's intention.
>
> There was a thread about a month ago on HMMPFAM report parsing where there
> was some contention as to how to build hits(models)/HSPs(domains). HMMPFAM
> output has one HSP per hit and is sorted on the sequence length so a
> particular hit can appear more than once, depending on how many times it
> hits along the sequence length itself. So, to gather all the HSPs together
> under one hit you would have to parse the entire report and build up a
> Hit/HSP tree, then use the next_hit/next_hsp iterators to parse through
> everything. Currently it just reports Hit/HSP pairs and it is up to the
> user to build that tree.
>
> In contrast, BLAST output should be capable of throwing hit/HSP clusters on
> the fly based on the report output, but is quite slow (even the XML output
> crawls).
Jason thinks it's b/c of object inheritance and instantiation; I > think it's probably more complicated than that (there are a ton of method > calls which tend to slow things down quite a bit as well). > > I would say try using SearchIO, but instead of relying directly on object > handler calls to create Hit/HSP objects using an object factory (which is > where I think a majority of the speed is lost), build the data internally on > the fly using start_element/end_element, then return hashes instead based on > the element type triggered using end_element. > > As an aside, I'm trying to switch the SearchIO::blastxml over to XML::SAX > (using XML::SAX::ExpatXS/expat) and plan on switching it over to using > hashes at some point, possibly starting off with a different SearchIO plugin > module. If you have other suggestions (XML parser of choice, ways to speed > up parsing/retrieve data) we would be glad to hear them. > > Chris > > > >> -----Original Message----- >> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- >> bounces at lists.open-bio.org] On Behalf Of Robert Buels >> Sent: Tuesday, July 18, 2006 7:06 PM >> To: bioperl-l at bioperl.org >> Subject: [Bioperl-l] bioperl pulls, xml parsers push,and things get >> complicated >> >> Hi all, >> >> Here's a kind of abstract question about Bioperl and XML parsing: >> >> I'm thinking about writing a bioperl parser for genomethreader XML, and >> I'm sort of mulling over the 'impedence mismatch' between the way >> bioperl Bio::*IO::* modules work and the way all of the current XML >> parsers work. Bioperl uses a 'pull' model, where every time you want a >> new chunk of stuff, you call $io_object->next_thing. All the XML >> parsers (including XML::SAX, XML::Parser::PerlSAX and XML::Twig) use a >> 'push' model, where every time they parse a chunk, they call _your_ >> code, usually via a subroutine reference you've given to the XML parser >> when you start it up. 
>> >> From what I can tell, current Bioperl IO modules that parse XML are >> using push parsers to parse the whole document, holding stuff in memory, >> then spoon-feeding it in chunks to the calling program when it calls >> next_*(). This is fine until the input XML gets really big, in which >> case you can quickly run out of memory. >> >> Does anybody have good ideas for nice, robust ways of writing a bioperl >> IO module for really big input XML files? There don't seem to be any >> perl pull parsers for XML. All I've dug up so far would be having the >> XML push parser running in a different thread or process, pushing chunks >> of data into a pipe or similar structure that blocks the progress of the >> push parser until the pulling bioperl code wants the next piece of data, >> but there are plenty of ugly issues with that, whether one were too use >> perl threads for it (aaagh!) or fork and push some kind of intermediate >> format through a pipe or socket between the two processes (eek!). >> >> So, um, if you've read this far, do you have any ideas? 
>>
>> Rob
>>
>> --
>> Robert Buels
>> SGN Bioinformatics Analyst
>> 252A Emerson Hall, Cornell University
>> Ithaca, NY 14853
>> Tel: 503-889-8539
>> rmb32 at cornell.edu
>> http://www.sgn.cornell.edu
>>
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>

--
MAURICIO HERRERA CUADRA
arareko at campus.iztacala.unam.mx
Laboratorio de Genética
Unidad de Morfofisiología y Función
Facultad de Estudios Superiores Iztacala, UNAM

From cjfields at uiuc.edu Wed Jul 19 18:45:55 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Wed, 19 Jul 2006 13:45:55 -0500
Subject: [Bioperl-l] bioperl pulls, xml parsers push, and things get complicated
In-Reply-To: <44BE5BC5.5040006@campus.iztacala.unam.mx>
Message-ID: <000301c6ab63$91d31680$15327e82@pyrimidine>

Yeah, we use XML::SAX, with XML::SAX::ExpatXS and expat, for SearchIO::blastxml. It previously used XML::Parser::PerlSAX but that didn't support SAX2-based parsing. XML::Twig is also used quite a bit.

Jason added his thoughts about this to the wiki:

http://www.bioperl.org/wiki/XML_parsers

Personally, I use XML::Simple with EUtilities because the XML returned is remarkably simple and normally fairly short. The trick is making sure, when parsing data, to dereference everything properly, since XML::Simple stores everything in an elaborate data structure. I plan on switching to XML::SAX::ExpatXS or XML::Twig soon.

Chris

> There are a lot of different XML processing strategies. Most fall into
> two categories: stream-based and tree-based.
>
> With the stream-based strategy, the parser continuously alerts a program
> to patterns in the XML.
The parser functions like a pipeline, taking XML > markup on one end and pumping out processed nuggets of data to your > program. > > With the tree-based strategy, the parser keeps the data to itself until > the very end, when it presents a complete model of the document to your > program. The whole point to this strategy is that your program can pull > out any data it needs, in any order. > > Most of the times I use tree-based strategies because they place all of > the data into a structure which lets me to access any internal node > using array/hash references. The simplest parser for this is XML::Simple > using XML::Parser as the 'preferred parser' (which is built on top of > XML::Parser::Expat, which is a wrapper around the expat library). > > More advanced parsers (both stream and tree-based) are: > > * XML::LibXML (a wrapper for libxml2's C library) > * XML::Grove (takes a tree and changes it into an object hierarchy. Each > node type is represented by a different class) > * XML::PYX (for repackaging XML as a stream of easily recognizable and > transmutable symbols) > * XML::SimpleObject (changes a hierarchy of lists into a hierarchy of > objects) > * XML::XPath (for writing expressions that pinpoint specific pieces of > documents) > > There are also some standards-based solutions like: > > * XML::SAX (Simple API for XML) for event streams. > * XML::DOM (Document Object Model) for tree processing. > > Your strategy of choice depends a lot on the type of XML files you want > to parse. Understanding the structure of the files and deciding which is > the data you want to extract from them is a fundamental step to choose > the appropriate method/parser to use. > > Just my 2 cents :) > > Regards, > Mauricio. > > Chris Fields wrote: > > The Bio::SearchIO modules are supposed work like a SAX parser, where > results > > are returned as the report is parsed b/c of the occurrence of specific > > 'events' (start_element, end_element, and so on). 
However, the actual > > behaviour for each module changes depending on the report type and the > > author's intention. > > > > There was a thread about a month ago on HMMPFAM report parsing where > there > > was some contention as to how to build hits(models)/HSPs(domains). > HMMPFAM > > output has one HSP per hit and is sorted on the sequence length so a > > particular hit can appear more than once, depending on how many times it > > hits along the sequence length itself. So, to gather all the HSPs > together > > under one hit you would have to parse the entire report and build up a > > Hit/HSP tree, then use the next_hit/next_hsp oterators to parse through > > everything. Currently it just reports Hit/HSP pairs and it is up to the > > user to build that tree. > > > > In contrast, BLAST output should be capable of throwing hit/HSP clusters > on > > the fly based on the report output, but is quite slow (event the XML > output > > crawls). Jason thinks it's b/c of object inheritance and instantiation; > I > > think it's probably more complicated than that (there are a ton of > method > > calls which tend to slow things down quite a bit as well). > > > > I would say try using SearchIO, but instead of relying directly on > object > > handler calls to create Hit/HSP objects using an object factory (which > is > > where I think a majority of the speed is lost), build the data > internally on > > the fly using start_element/end_element, then return hashes instead > based on > > the element type triggered using end_element. > > > > As an aside, I'm trying to switch the SearchIO::blastxml over to > XML::SAX > > (using XML::SAX::ExpatXS/expat) and plan on switching it over to using > > hashes at some point, possibly starting off with a different SearchIO > plugin > > module. If you have other suggestions (XML parser of choice, ways to > speed > > up parsing/retrieve data) we would be glad to hear them. 
> > > > Chris > > > > > > > >> -----Original Message----- > >> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > >> bounces at lists.open-bio.org] On Behalf Of Robert Buels > >> Sent: Tuesday, July 18, 2006 7:06 PM > >> To: bioperl-l at bioperl.org > >> Subject: [Bioperl-l] bioperl pulls, xml parsers push,and things get > >> complicated > >> > >> Hi all, > >> > >> Here's a kind of abstract question about Bioperl and XML parsing: > >> > >> I'm thinking about writing a bioperl parser for genomethreader XML, and > >> I'm sort of mulling over the 'impedence mismatch' between the way > >> bioperl Bio::*IO::* modules work and the way all of the current XML > >> parsers work. Bioperl uses a 'pull' model, where every time you want a > >> new chunk of stuff, you call $io_object->next_thing. All the XML > >> parsers (including XML::SAX, XML::Parser::PerlSAX and XML::Twig) use a > >> 'push' model, where every time they parse a chunk, they call _your_ > >> code, usually via a subroutine reference you've given to the XML parser > >> when you start it up. > >> > >> From what I can tell, current Bioperl IO modules that parse XML are > >> using push parsers to parse the whole document, holding stuff in > memory, > >> then spoon-feeding it in chunks to the calling program when it calls > >> next_*(). This is fine until the input XML gets really big, in which > >> case you can quickly run out of memory. > >> > >> Does anybody have good ideas for nice, robust ways of writing a bioperl > >> IO module for really big input XML files? There don't seem to be any > >> perl pull parsers for XML. 
> >> All I've dug up so far would be having the
> >> XML push parser running in a different thread or process, pushing
> chunks
> >> of data into a pipe or similar structure that blocks the progress of
> the
> >> push parser until the pulling bioperl code wants the next piece of
> data,
> >> but there are plenty of ugly issues with that, whether one were to use
> >> perl threads for it (aaagh!) or fork and push some kind of intermediate
> >> format through a pipe or socket between the two processes (eek!).
> >>
> >> So, um, if you've read this far, do you have any ideas?
> >>
> >> Rob
> >>
> >> --
> >> Robert Buels
> >> SGN Bioinformatics Analyst
> >> 252A Emerson Hall, Cornell University
> >> Ithaca, NY 14853
> >> Tel: 503-889-8539
> >> rmb32 at cornell.edu
> >> http://www.sgn.cornell.edu
> >>
> >>
> >> _______________________________________________
> >> Bioperl-l mailing list
> >> Bioperl-l at lists.open-bio.org
> >> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >
>
> --
> MAURICIO HERRERA CUADRA
> arareko at campus.iztacala.unam.mx
> Laboratorio de Genética
> Unidad de Morfofisiología y Función
> Facultad de Estudios Superiores Iztacala, UNAM
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

From rmb32 at cornell.edu Wed Jul 19 19:30:28 2006
From: rmb32 at cornell.edu (Robert Buels)
Date: Wed, 19 Jul 2006 12:30:28 -0700
Subject: [Bioperl-l] bioperl pulls, xml parsers push, and things get complicated
In-Reply-To:
References:
Message-ID: <44BE8854.8010301@cornell.edu>

POE is a really neat thing; I didn't know about it before. Something tells me, however, that I would have trouble convincing people to install POE as a dependency for a genomethreader output parser.
;-) I hope I'll have the opportunity to use it sometime. For the curious, here's a nice intro to POE: http://perl.com/pub/a/2001/01/poe.html And the POE main site: http://poe.perl.org/ Rob aaron.j.mackey at GSK.COM wrote: > There are 3rd generation XML "Pull" parsers (also called "StAX" for > Streaming API for XML), but they seem to still be stuck in Java land (e.g. > "MXP1") > > You could probably use POE to setup a state machine that used XML::Twig to > "push" units of XML content onto a stack, to be read by your "next_*" pull > method (where the XML::Twig push "stalled" until the "next_*" method was > called, and vice versa). > > -Aaron > > bioperl-l-bounces at lists.open-bio.org wrote on 07/18/2006 08:06:02 PM: > > >> Hi all, >> >> Here's a kind of abstract question about Bioperl and XML parsing: >> >> I'm thinking about writing a bioperl parser for genomethreader XML, and >> I'm sort of mulling over the 'impedence mismatch' between the way >> bioperl Bio::*IO::* modules work and the way all of the current XML >> parsers work. Bioperl uses a 'pull' model, where every time you want a >> new chunk of stuff, you call $io_object->next_thing. All the XML >> parsers (including XML::SAX, XML::Parser::PerlSAX and XML::Twig) use a >> 'push' model, where every time they parse a chunk, they call _your_ >> code, usually via a subroutine reference you've given to the XML parser >> when you start it up. >> >> From what I can tell, current Bioperl IO modules that parse XML are >> using push parsers to parse the whole document, holding stuff in memory, >> > > >> then spoon-feeding it in chunks to the calling program when it calls >> next_*(). This is fine until the input XML gets really big, in which >> case you can quickly run out of memory. >> >> Does anybody have good ideas for nice, robust ways of writing a bioperl >> IO module for really big input XML files? There don't seem to be any >> perl pull parsers for XML. 
>> All I've dug up so far would be having the
>> XML push parser running in a different thread or process, pushing chunks
>> of data into a pipe or similar structure that blocks the progress of the
>> push parser until the pulling bioperl code wants the next piece of data,
>> but there are plenty of ugly issues with that, whether one were to use
>> perl threads for it (aaagh!) or fork and push some kind of intermediate
>> format through a pipe or socket between the two processes (eek!).
>>
>> So, um, if you've read this far, do you have any ideas?
>>
>> Rob
>>
>> --
>> Robert Buels
>> SGN Bioinformatics Analyst
>> 252A Emerson Hall, Cornell University
>> Ithaca, NY 14853
>> Tel: 503-889-8539
>> rmb32 at cornell.edu
>> http://www.sgn.cornell.edu
>>
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>

--
Robert Buels
SGN Bioinformatics Analyst
252A Emerson Hall, Cornell University
Ithaca, NY 14853
Tel: 503-889-8539
rmb32 at cornell.edu
http://www.sgn.cornell.edu

From dwaner at scitegic.com Wed Jul 19 19:47:58 2006
From: dwaner at scitegic.com (dwaner at scitegic.com)
Date: Wed, 19 Jul 2006 12:47:58 -0700
Subject: [Bioperl-l] EMBL release 87 format changes.
Message-ID:

BioPerl Users and Developers,

I have updated the EMBL SeqIO parser to work correctly with Release 87 of EMBL (June 19th, 2006). As suggested by Chris Fields in an earlier message, the EMBL parser now reads both new and old formats, but only writes the new format.

I don't think that my changes will affect most users, but if you are using the EMBL format can you review the changes described below and speak up if anything looks like it could create a problem for you?

If I don't hear any objections soon, I will submit a patch to bugzilla.

Thanks,

- David

Parser changes:

- EMBL files no longer contain the "entry name".
  When reading old format files, the EMBL "entry name" from the ID line is used as the Bio::Seq::id and Bio::Seq::display_id, but when reading new format files, the accession number is used for these fields.

Changes to output:

- The ID line was changed to the new format.

- The SV line is never written; SV is now part of the ID line.

- "DNA" and "RNA" are no longer valid EMBL molecule types. They are now written as "unassigned DNA" and "unassigned RNA"

- Strictly speaking, EMBL format should only be used for nucleotide sequences. If the alphabet is 'protein', write_seq() emits a warning and writes the non-standard molecule type "AA" in the ID line.

- Because BioPerl sequences do not have a "data class" attribute, all sequences are written with a data class of "STD" in the ID line.

- The ID line contains the Bio::Seq::accession, unless it is missing, in which case the Bio::Seq::id is used.

- molecule type is strictly validated. Non-EMBL values are output as "unassigned DNA" or "unassigned RNA", depending on the sequence alphabet.

- "taxonomic division" is strictly validated. Non-EMBL values are output as "UNC".

- The taxonomic division code "UNK" is now written as "UNC" (unclassified).

Possible Gotchas for some users:

- Because the EMBL entry name is no longer included anywhere in the file, when round-tripping from old format to new format the entry name will be lost.

- In order to ensure that BioPerl writes valid EMBL files, I have added strict validation to the writer for "molecule type" and "taxonomic division". This could present a problem for users who are using non-standard values for these fields, but I felt it was important to write files that adhere to the EMBL spec.
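
For readers trying to picture the output rules listed above, the following is a rough stand-alone sketch in plain Perl. It is not the code from the patch and does not use Bio::SeqIO; the function names (embl_moltype, embl_division, embl_id_line) and the argument names are made up for illustration, and the two lookup tables are partial, illustrative lists rather than the full sets from the EMBL user manual. It only shows the fallback behaviour described above (invalid molecule type becomes "unassigned DNA"/"unassigned RNA", protein becomes "AA", invalid or "UNK" division becomes "UNC", data class is always "STD") applied to a new-style ID line.

```perl
use strict;
use warnings;

# ASSUMPTION: illustrative subsets only, not the complete EMBL lists.
my %VALID_MOLTYPE = map { $_ => 1 } (
    'genomic DNA', 'genomic RNA', 'mRNA', 'tRNA', 'rRNA',
    'other DNA', 'other RNA', 'unassigned DNA', 'unassigned RNA',
);
my %VALID_DIVISION = map { $_ => 1 } qw(
    ENV FUN HUM INV MAM MUS PHG PLN PRO ROD SYN TGN UNC VRL VRT
);

# Hypothetical helper: choose the molecule type for the ID line.
# $alphabet is expected to be 'dna', 'rna' or 'protein'.
sub embl_moltype {
    my ($moltype, $alphabet) = @_;
    # Non-standard value for protein (the real writer also warns).
    return 'AA' if $alphabet eq 'protein';
    return $moltype if defined $moltype && $VALID_MOLTYPE{$moltype};
    return $alphabet eq 'rna' ? 'unassigned RNA' : 'unassigned DNA';
}

# Hypothetical helper: anything not recognised (including "UNK")
# is written as "UNC".
sub embl_division {
    my ($division) = @_;
    return (defined $division && $VALID_DIVISION{$division}) ? $division : 'UNC';
}

# Hypothetical helper: assemble a new-style ID line, with the sequence
# version inside the ID line and a fixed data class of "STD".
sub embl_id_line {
    my (%arg) = @_;
    return sprintf 'ID   %s; SV %d; %s; %s; %s; %s; %d BP.',
        $arg{accession}, $arg{version}, $arg{topology},
        embl_moltype($arg{moltype}, $arg{alphabet}),
        'STD',
        embl_division($arg{division}),
        $arg{length};
}

print embl_id_line(
    accession => 'X56734', version  => 1,     topology => 'linear',
    moltype   => 'DNA',    alphabet => 'dna', division => 'UNK',
    length    => 1859,
), "\n";
```

With these made-up inputs the old-style "DNA" and "UNK" values should come out as "unassigned DNA" and "UNC" in the printed ID line; the real writer in the patch may well differ in detail.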
From slenk at emich.edu Wed Jul 19 20:04:16 2006
From: slenk at emich.edu (Stephen Gordon Lenk)
Date: Wed, 19 Jul 2006 16:04:16 -0400
Subject: [Bioperl-l] bioperl pulls, xml parsers push, and things get complicated
Message-ID: <13edac5b13ed8208.13ed820813edac5b@emich.edu>

Hi,

I have found that POE fails to execute a periodic task after 32 iterations in a Perl thread; the failure is consistent on both XP and OS X. If I knew how to write up a defect for Perl I would do so (hint: how is this done? I'm *not* asking for an RTFM, etc.), and I am probably remiss for not having done it. I was going to write messages to a Controller Area Network (CAN) to control automotive widgets from Perl, but I wound up using a C executable (piped to from Perl) with its own threads to do this. Oh yes, I believe that bio lab systems can be done this way as well.

But ... POE is really neat if you think in state machine terms. I have an alternate architecture for my test harness (Perlizer) that would use POE to run tests with CAN and GPIB.

Steve Lenk

----- Original Message -----
From: Robert Buels
Date: Wednesday, July 19, 2006 3:30 pm
Subject: Re: [Bioperl-l] bioperl pulls, xml parsers push, and things get complicated

> POE is a really neat thing, I didn't know about it before.
> Something
> tells me, however, that I would have trouble convincing people to
> install POE as a dependency for a genomethreader output parser. ;-) I
> hope I'll have the opportunity to use it sometime.
>
> For the curious, here's a nice intro to POE:
> http://perl.com/pub/a/2001/01/poe.html
> And the POE main site:
> http://poe.perl.org/
>
> Rob
>
> aaron.j.mackey at GSK.COM wrote:
> > There are 3rd generation XML "Pull" parsers (also called "StAX"
> for
> > Streaming API for XML), but they seem to still be stuck in Java
> land (e.g.
> > "MXP1") > > > > You could probably use POE to setup a state machine that used > XML::Twig to > > "push" units of XML content onto a stack, to be read by your > "next_*" pull > > method (where the XML::Twig push "stalled" until the "next_*" > method was > > called, and vice versa). > > > > -Aaron > > > > bioperl-l-bounces at lists.open-bio.org wrote on 07/18/2006 > 08:06:02 PM: > > > > > >> Hi all, > >> > >> Here's a kind of abstract question about Bioperl and XML parsing: > >> > >> I'm thinking about writing a bioperl parser for genomethreader > XML, and > >> I'm sort of mulling over the 'impedence mismatch' between the > way > >> bioperl Bio::*IO::* modules work and the way all of the current > XML > >> parsers work. Bioperl uses a 'pull' model, where every time > you want a > >> new chunk of stuff, you call $io_object->next_thing. All the > XML > >> parsers (including XML::SAX, XML::Parser::PerlSAX and > XML::Twig) use a > >> 'push' model, where every time they parse a chunk, they call > _your_ > >> code, usually via a subroutine reference you've given to the > XML parser > >> when you start it up. > >> > >> From what I can tell, current Bioperl IO modules that parse > XML are > >> using push parsers to parse the whole document, holding stuff > in memory, > >> > > > > > >> then spoon-feeding it in chunks to the calling program when it > calls > >> next_*(). This is fine until the input XML gets really big, in > which > >> case you can quickly run out of memory. > >> > >> Does anybody have good ideas for nice, robust ways of writing a > bioperl > >> IO module for really big input XML files? There don't seem to > be any > >> perl pull parsers for XML. 
All I've dug up so far would be > having the > >> XML push parser running in a different thread or process, > pushing chunks > >> > > > > > >> of data into a pipe or similar structure that blocks the > progress of the > >> > > > > > >> push parser until the pulling bioperl code wants the next piece > of data, > >> > > > > > >> but there are plenty of ugly issues with that, whether one were > too use > >> perl threads for it (aaagh!) or fork and push some kind of > intermediate > >> format through a pipe or socket between the two processes (eek!). > >> > >> So, um, if you've read this far, do you have any ideas? > >> > >> Rob > >> > >> -- > >> Robert Buels > >> SGN Bioinformatics Analyst > >> 252A Emerson Hall, Cornell University > >> Ithaca, NY 14853 > >> Tel: 503-889-8539 > >> rmb32 at cornell.edu > >> http://www.sgn.cornell.edu > >> > >> > >> _______________________________________________ > >> Bioperl-l mailing list > >> Bioperl-l at lists.open-bio.org > >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > >> > >> > > > > > > > > -- > Robert Buels > SGN Bioinformatics Analyst > 252A Emerson Hall, Cornell University > Ithaca, NY 14853 > Tel: 503-889-8539 > rmb32 at cornell.edu > http://www.sgn.cornell.edu > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > From cjfields at uiuc.edu Wed Jul 19 21:46:43 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Wed, 19 Jul 2006 16:46:43 -0500 Subject: [Bioperl-l] EMBL release 87 format changes. In-Reply-To: Message-ID: <000601c6ab7c$d39d8cd0$15327e82@pyrimidine> You can go ahead and submit the patch to Bugzilla anyway. Comments about the proposed changes from the developers can be added there. I think there's some confusion here, though: the EMBL SeqIO change you mentioned I committed is actually for Bio::SeqIO::swiss (SwissProt). I haven't touched Bio::SeqIO::embl (yet). 
'swiss' format now reads old and new swiss data files and writes only new format; no major changes have been made to SeqIO::embl in about a year (and even that was a small one). Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of dwaner at scitegic.com > Sent: Wednesday, July 19, 2006 2:48 PM > To: bioperl-l at lists.open-bio.org > Subject: Re: [Bioperl-l] EMBL release 87 format changes. > > BioPerl Users and Developers, > > I have updated the EMBL SeqIO parser to work correctly with Release 87 of > EMBL (June 19th, 2006). As suggested by Chris Fields in an earlier > message, the EMBL parser now reads both new and old formats, but only > writes the new format. > > I don't think that my changes will affect most users, but if you are using > the EMBL format can you review the changes described below and speak up if > anything looks like it could create a problem for you? > > If I don't hear any objections soon, I will submit a patch to bugzilla. > > Thanks, > > - David > > Parser changes: > > - EMBL files no longer contain the "entry name". When reading old format > files, > the EMBL "entry name" from the ID line is used as the Bio::Seq::id and > Bio::Seq::display_id, but when reading new format files, the accession > number > is used for these fields. > > Changes to output: > > - The ID line was changed to the new format. > > - The SV line is never written; SV is now part of the ID line. > > - "DNA" and "RNA" are no longer valid EMBL molecule types. They are now > written > as "unassigned DNA" and "unassigned RNA" > > - Strictly speaking, EMBL format should only be used for nucleotide > sequences. > If the alphabet is 'protein', write_seq() emits a warning and writes the > > non-standard molecule type "AA" in the ID line. > > - Because BioPerl sequences do not have a "data class" attribute, all > sequences > are written with a data class of "STD" in the ID line. 
> > - The ID line contains the Bio::Seq::accession, unless it is missing, in > which > case the Bio::Seq::id is used. > > - molecule type is strictly validated. Non-EMBL values are output as > "unassigned DNA" or "unassigned RNA", depending on the sequence > alphabet. > > - "taxonomic division" is strictly validated. Non-EMBL values are output > as "UNC". > > - The taxonomic division code "UNK" is now written as "UNC" > (unclassified). > > Possible Gotchas for some users: > > - Because the EMBL entry name is no longer included anywhere in the file, > when round-tripping from old format to new format the entry name will be > lost. > > - In order to ensure that BioPerl writes valid EMBL files, I have added > strict > validation to the writer for "molecule type" and "taxonomic division". > This > could present a problem for users who are using non-standard values for > these > fields, but I felt it was important to write files that adhere to the > EMBL spec. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From stewarta at nmrc.navy.mil Wed Jul 19 22:00:26 2006 From: stewarta at nmrc.navy.mil (Andrew Stewart) Date: Wed, 19 Jul 2006 18:00:26 -0400 Subject: [Bioperl-l] #bioperl Message-ID: Wandering about the new bioperl.org page, I noticed that there's never really been much mention of starting up a bioperl chat channel on IRC for casual bioperl discussion and support. This has worked really well for projects like MediaWiki, etc. I'll sit on the channel for awhile and maybe we can see if the idea picks up. Point your favorite IRC client to... (windows users I would suggest mIRC, mac I would suggest Colloquy) server: irc.freenode.net channel: #bioperl Hope to see you there. 
-- 
Andrew Stewart
Research Assistant, Genomics Team
Navy Medical Research Center (NMRC)
Biological Defense Research Directorate (BDRD)
BDRD Annex
12300 Washington Avenue, 2nd Floor
Rockville, MD 20852
email: stewarta at nmrc.navy.mil
phone: 301-231-6700 Ext 270

From rmb32 at cornell.edu Wed Jul 19 22:40:52 2006
From: rmb32 at cornell.edu (Robert Buels)
Date: Wed, 19 Jul 2006 15:40:52 -0700
Subject: [Bioperl-l] bioperl pulls, xml parsers push, and things get complicated
In-Reply-To: <002801c6ab40$7cfcd980$15327e82@pyrimidine>
References: <002801c6ab40$7cfcd980$15327e82@pyrimidine>
Message-ID: <44BEB4F4.1060407@cornell.edu>

Hi Chris,

It seems to me the SearchIO framework isn't really appropriate for
genomethreader, since it's more of a gene prediction program than a
search/alignment program.

Also, w.r.t. XML parsing and buffering, I don't see how Bio::SearchIO is
fundamentally different from the other bioperl IO systems: it still has a
next_this(), next_that() interface, which means lots of buffering memory
if you're doing your actual parsing with a push parser (or a tree parser,
of course, which is buffering an expanded form of the entire document).
It looks like it just adds another layer of method calls for parser
events, allowing the SearchIO to make different kinds of objects and
stuff. It looks like none of this changes the fact that these are all
push parsers, and bioperl pulls, so you have to buffer a lot of stuff.

I guess the only really general strategies for reducing the buffering are
a.) to break up the XML with regexps and such, like Hilmar said, or b.) to
put your push parser in another process, and somehow keep it blocking in
one of its callbacks until you're ready for its next data.

I think what I'll do with the gthxml parser is find a way to split the
input XML into chunks and run a parser separately on each, like Hilmar
said. If more performance is needed, maybe a multi-process approach would
be appropriate, but not yet.

Anyway, looking at blastxml, I have some ruminations, which fill the rest
of this email:

Looking at SearchIO::blastxml, it looks like it's already using XML::SAX,
which will use XML::SAX::ExpatXS if installed. Is that recent? Is
blastxml faster when using the tempfile option than when putting the
whole report in a string in memory?

If you're looking for speed gains, have you tried running some kind of
profiling on it? Whenever one is out to optimize code, profiling should
be stop number one. Almost every time, you will be surprised at what
parts of the code are actually eating up the most time. Here's a perl
profiling intro: http://perl.com/pub/a/2004/06/25/profiling.html . The
profiling mechanism talked about in that article is kind of old; there
are also a bunch of newer code profiling tools available on CPAN. I
haven't used any of them though. But yeah, I can't emphasize enough the
importance of profiling if you're trying to optimize for speed.

As for memory, the blastxml parser suffers from the same handicap I was
pondering at the start of this thread. To see what I mean, think of what
would happen if there were somehow 10 million HSPs in one of the reports.
It's buffering all of them before returning each result, and your machine
could melt. :-) Things would be beautiful (and fast, probably) if
next_hsp() would actually parse the next HSP in the report instead of
just returning a HSP object that's sitting in memory. But there's not
really anything that can be done about that, I don't think.

One nice thing: the blastxml parser's memory footprint doesn't really
suffer if you have 100,000 blast reports in your input file, because it
splits out the reports and parses each one individually. This I think is
a good illustration of what Hilmar was talking about: breaking the input
XML into chunks cuts down on the amount of buffering you have to do.
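
The chunk-splitting idea above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the blastxml code: the helper name make_chunk_iterator and the element name "report" are made up, the elements are assumed not to nest inside themselves, and no real XML parser is involved (each chunk would be handed to one where the comment says so).

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): returns a closure that hands
# back one complete <$tag>...</$tag> element per call, reading the input
# only as far as the next closing tag. Assumes the element never nests
# inside itself.
sub make_chunk_iterator {
    my ($fh, $tag) = @_;
    my $close = "</$tag>";
    return sub {
        local $/ = $close;    # a "line" now ends at the closing tag
        while (defined(my $buf = <$fh>)) {
            # keep only the element itself, dropping any text before it
            return $1 if $buf =~ m{(<\Q$tag\E[\s>].*\Q$close\E)\z}s;
        }
        return;               # no more complete elements
    };
}

# Tiny in-memory stand-in for a large multi-report XML file.
my $xml = <<'XML';
<reports>
  <report id="1"><hit>A</hit></report>
  <report id="2"><hit>B</hit><hit>C</hit></report>
</reports>
XML

open my $fh, '<', \$xml or die "cannot open in-memory file: $!";
my $next_chunk = make_chunk_iterator($fh, 'report');

my @ids;
while (defined(my $chunk = $next_chunk->())) {
    # Each $chunk could be given to a push or tree parser of choice here;
    # this sketch only pulls the id attribute out with a regexp.
    my ($id) = $chunk =~ /<report id="(\d+)"/;
    push @ids, $id;
    print "report $id: ", length($chunk), " bytes\n";
}
```

The caller pulls one chunk at a time, so only a single report needs to sit in memory, whichever parser ends up processing it.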

As XML parsers go, I kind of like XML::Twig, because it manages to
combine most of the ease of use of a DOM/tree parser with the better
memory usage and speed of a push parser (like SAX and XML::Parser).
Within a parser callback, you have a DOM-like tree that's just the part
of your XML document you're interested in at that time, and then you free
that structure when you're done picking things out of it. I'm not sure
how fast it is, though; probably not as fast as ExpatXS. At any rate, it
is definitely a lot more intuitive to use than a more standard push
parser, since if you make good choices about what elements to use as the
roots of your twigs, you can often do your processing on a self-contained
chunk and not have to keep track of a bunch of parse state like you
typically need with a straight push parser like XML::Parser or a SAX
parser.

Rob

Chris Fields wrote:
> The Bio::SearchIO modules are supposed to work like a SAX parser, where results
> are returned as the report is parsed b/c of the occurrence of specific
> 'events' (start_element, end_element, and so on). However, the actual
> behaviour for each module changes depending on the report type and the
> author's intention.
>
> There was a thread about a month ago on HMMPFAM report parsing where there
> was some contention as to how to build hits(models)/HSPs(domains). HMMPFAM
> output has one HSP per hit and is sorted on the sequence length so a
> particular hit can appear more than once, depending on how many times it
> hits along the sequence length itself. So, to gather all the HSPs together
> under one hit you would have to parse the entire report and build up a
> Hit/HSP tree, then use the next_hit/next_hsp iterators to parse through
> everything. Currently it just reports Hit/HSP pairs and it is up to the
> user to build that tree.
>
> In contrast, BLAST output should be capable of throwing hit/HSP clusters on
> the fly based on the report output, but is quite slow (even the XML output
> crawls).
Jason thinks it's b/c of object inheritance and instantiation; I > think it's probably more complicated than that (there are a ton of method > calls which tend to slow things down quite a bit as well). > > I would say try using SearchIO, but instead of relying directly on object > handler calls to create Hit/HSP objects using an object factory (which is > where I think a majority of the speed is lost), build the data internally on > the fly using start_element/end_element, then return hashes instead based on > the element type triggered using end_element. > > As an aside, I'm trying to switch the SearchIO::blastxml over to XML::SAX > (using XML::SAX::ExpatXS/expat) and plan on switching it over to using > hashes at some point, possibly starting off with a different SearchIO plugin > module. If you have other suggestions (XML parser of choice, ways to speed > up parsing/retrieve data) we would be glad to hear them. > > Chris > > > > >> -----Original Message----- >> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- >> bounces at lists.open-bio.org] On Behalf Of Robert Buels >> Sent: Tuesday, July 18, 2006 7:06 PM >> To: bioperl-l at bioperl.org >> Subject: [Bioperl-l] bioperl pulls, xml parsers push,and things get >> complicated >> >> Hi all, >> >> Here's a kind of abstract question about Bioperl and XML parsing: >> >> I'm thinking about writing a bioperl parser for genomethreader XML, and >> I'm sort of mulling over the 'impedence mismatch' between the way >> bioperl Bio::*IO::* modules work and the way all of the current XML >> parsers work. Bioperl uses a 'pull' model, where every time you want a >> new chunk of stuff, you call $io_object->next_thing. All the XML >> parsers (including XML::SAX, XML::Parser::PerlSAX and XML::Twig) use a >> 'push' model, where every time they parse a chunk, they call _your_ >> code, usually via a subroutine reference you've given to the XML parser >> when you start it up. 
>> >> From what I can tell, current Bioperl IO modules that parse XML are >> using push parsers to parse the whole document, holding stuff in memory, >> then spoon-feeding it in chunks to the calling program when it calls >> next_*(). This is fine until the input XML gets really big, in which >> case you can quickly run out of memory. >> >> Does anybody have good ideas for nice, robust ways of writing a bioperl >> IO module for really big input XML files? There don't seem to be any >> perl pull parsers for XML. All I've dug up so far would be having the >> XML push parser running in a different thread or process, pushing chunks >> of data into a pipe or similar structure that blocks the progress of the >> push parser until the pulling bioperl code wants the next piece of data, >> but there are plenty of ugly issues with that, whether one were too use >> perl threads for it (aaagh!) or fork and push some kind of intermediate >> format through a pipe or socket between the two processes (eek!). >> >> So, um, if you've read this far, do you have any ideas? 
>>
>> Rob
>>
>> --
>> Robert Buels
>> SGN Bioinformatics Analyst
>> 252A Emerson Hall, Cornell University
>> Ithaca, NY 14853
>> Tel: 503-889-8539
>> rmb32 at cornell.edu
>> http://www.sgn.cornell.edu
>>
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>
>

-- 
Robert Buels
SGN Bioinformatics Analyst
252A Emerson Hall, Cornell University
Ithaca, NY 14853
Tel: 503-889-8539
rmb32 at cornell.edu
http://www.sgn.cornell.edu

From skirov at utk.edu Wed Jul 19 21:54:03 2006
From: skirov at utk.edu (Stefan Kirov)
Date: Wed, 19 Jul 2006 17:54:03 -0400
Subject: [Bioperl-l] Accessing TRANSFAC matrices
In-Reply-To: <20060717145000.0aig7ymcmurk4wsk@webmail.embl.de>
References: <20060717145000.0aig7ymcmurk4wsk@webmail.embl.de>
Message-ID: <44BEA9FB.1070009@utk.edu>

I have nothing to do with TFBS (except for using it). I suggest you
contact Boris Lenhard, who is behind TFBS. Please also send bioperl
questions to the list.
Finally, I believe TRANSFAC does not distribute the data files anymore.
However, if you find out this is not the case, please let me know.
Stefan

ong at embl.de wrote:

>Hi,
>
> Good day. I am trying to retrieve TRANSFAC matrices via the TFBS Perl module, but
>it happens that about 50 matrices are missing after M00359. Do you have any idea?
>Also, I wish to try using the Bio::Matrix::PSM::IO object, but can you advise how
>I get the matrix.dat, which is a TRANSFAC file?
>
> Thanks, and hope to hear from you soon.
> >REgards, >Ong > > From bix at sendu.me.uk Thu Jul 20 06:49:45 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 20 Jul 2006 07:49:45 +0100 Subject: [Bioperl-l] Accessing TRANSFAC matrices In-Reply-To: <44BEA9FB.1070009@utk.edu> References: <20060717145000.0aig7ymcmurk4wsk@webmail.embl.de> <44BEA9FB.1070009@utk.edu> Message-ID: <44BF2789.1090204@sendu.me.uk> Stefan Kirov wrote: > Finally, I believe TRANSFAC does not distribute the data files anymore. > However, if you find out this is not the case, please let me know. They get distributed as Transfac 'Pro', for which you need a license (money). > ong at embl.de wrote: >> good day, i am trying to retrieve TRANSFAC matrices via TFBS Perl module, but >> it happens that about 50 matrices are missing after M00359 do you have any idea? What is meant by this? Missing from where? At the least, M00360 is accessible via the website (public database). >> Also i wish to try using the Bio::Matrix::PSM::IO object, but can you advise how >> do i get the matrix.dat which is a transfac file? http://www.biobase-international.com/pages/index.php?id=174 From dhoworth at mrc-lmb.cam.ac.uk Thu Jul 20 09:19:22 2006 From: dhoworth at mrc-lmb.cam.ac.uk (Dave Howorth) Date: Thu, 20 Jul 2006 10:19:22 +0100 Subject: [Bioperl-l] bioperl pulls, xml parsers push, and things get complicated In-Reply-To: <13edac5b13ed8208.13ed820813edac5b@emich.edu> References: <13edac5b13ed8208.13ed820813edac5b@emich.edu> Message-ID: <44BF4A9A.60100@mrc-lmb.cam.ac.uk> Stephen Gordon Lenk wrote: > I have found that POE fails to execute a periodic task after 32 > iterations in a Perl thread, consistent failure on both XP and OSX - > if I knew how to write up a defect for Perl I would do this (hint ? > how is this done - I'm *not* asking RTFM etc) Generally: Go to http://search.cpan.org and search for the module (POE). Click on the distribution link, rather than the doc link (i.e. POE-0.3502, which takes you to http://search.cpan.org/~rcaputo/POE-0.3502/). 
Click on the View/Report Bugs link. Check through the existing bugs and if it's not there click on the Report a new bug link. Cheers, Dave From georg.otto at tuebingen.mpg.de Thu Jul 20 10:53:53 2006 From: georg.otto at tuebingen.mpg.de (Georg Otto) Date: Thu, 20 Jul 2006 12:53:53 +0200 Subject: [Bioperl-l] Features in SeqIO GenBank output Message-ID: Hi, this is probably a FAQ but I could not find anything to solve it. I want to get sequences from GenBank and save them in GenBank format. This works with the script shown below, but the "Features" part is missing and contains references instead (see below). How can I print out the complete GenBank entry? I am running Bioperl 1.5, Perl 5.8.6, Mac 10.4.7 Best, Georg Here is my script: use strict; use warnings; use Bio::Seq; use Bio::SeqIO; use Bio::DB::GenBank; my $acc = 'AB017118'; my $db_obj = Bio::DB::GenBank->new(); my $seq_obj = $db_obj-> get_Seq_by_acc($acc); my $out = Bio::SeqIO->new(-format => 'genbank', -file => '>output.gb'); $out->write_seq($seq_obj); Here is the output: LOCUS AB017118 2038 bp mRNA linear VRT 06-JUN-2006 DEFINITION Danio rerio mRNA for ornithine decarboxylase antizyme long isoform, complete cds. ACCESSION AB017118 VERSION AB017118.1 GI:4239978 KEYWORDS . SOURCE Danio rerio (zebrafish) ORGANISM Danio rerio Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Actinopterygii; Neopterygii; Teleostei; Ostariophysi; Cypriniformes; Cyprinidae; Danio. REFERENCE 1 AUTHORS Saito,T., Hascilowicz,T., Ohkido,I., Kikuchi,Y., Okamoto,H., Hayashi,S., Murakami,Y. and Matsufuji,S. TITLE Two zebrafish (Danio rerio) antizymes with different expression and activities JOURNAL Biochem. J. 345 PT 1, 99-106 (2000) PUBMED 10600644 REFERENCE 2 (bases 1 to 2038) AUTHORS Matsufuji,S. and Saito,T. 
TITLE Direct Submission JOURNAL Submitted (23-AUG-1998) Senya Matsufuji, Jikei University School of Medicine, Department of Biochemistry II; 3-25-8 Nishishinbashi, Minato-ku, Tokyo 105-8461, Japan (E-mail:senya at jikei.ac.jp, Tel:+81-3-3433-1111(ex.2276), Fax:+81-3-3436-3897) FEATURES Location/Qualifiers source 1..2038 /db_xref="Bio::Annotation::SimpleValue=HASH(0x19b9a28)" /mol_type="Bio::Annotation::SimpleValue=HASH(0x19b9b6c)" /dev_stage="Bio::Annotation::SimpleValue=HASH(0x19b9bb4)" /organism="Bio::Annotation::SimpleValue=HASH(0x19bfe18)" /clone_lib="Bio::Annotation::SimpleValue=HASH(0x19bfe60)" CDS join(45..224,226..702) /db_xref="Bio::Annotation::SimpleValue=HASH(0x19c0960)" /ribosomal_slippage="Bio::Annotation::SimpleValue=HASH(0x1 9beecc)" /codon_start=Bio::Annotation::SimpleValue=HASH(0x19bef14) /protein_id="Bio::Annotation::SimpleValue=HASH(0x19bef5c)" /translation="Bio::Annotation::SimpleValue=HASH(0x19befa4) " /product="Bio::Annotation::SimpleValue=HASH(0x19befec)" /note="Bio::Annotation::SimpleValue=HASH(0x19bf034)" CDS 45..227 /db_xref="Bio::Annotation::SimpleValue=HASH(0x19bee24)" /codon_start=Bio::Annotation::SimpleValue=HASH(0x19bf160) /protein_id="Bio::Annotation::SimpleValue=HASH(0x19bf1cc)" /translation="Bio::Annotation::SimpleValue=HASH(0x19c1830) " /note="Bio::Annotation::SimpleValue=HASH(0x19c1878)" polyA_signal 2017..2022 polyA_site 2038 /note="Bio::Annotation::SimpleValue=HASH(0x19bffc8)" BASE COUNT 439 a 377 c 532 g 690 t ORIGIN 1 cagcagccga gcgcacaggc cgccgtgaaa cctcccgagg ccggatggta aaatccaacc 1981 ttatcctcta tagtggtaca cctttgcttc tgtcataata aaaccattat ttaaagac // From cjfields at uiuc.edu Thu Jul 20 12:43:08 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 20 Jul 2006 07:43:08 -0500 Subject: [Bioperl-l] Features in SeqIO GenBank output In-Reply-To: References: Message-ID: <73C89D17-91FE-47E4-80C1-AA6A689FA14E@uiuc.edu> I'll give it a look. You might try upgrading to Bioperl 1.5.1 to see if this was fixed. 

Chris

On Jul 20, 2006, at 5:53 AM, Georg Otto wrote:

>
> Hi,
>
> this is probably a FAQ but I could not find anything to solve it.
>
> I want to get sequences from GenBank and save them in GenBank
> format. This works with the script shown below, but the "Features"
> part is missing and contains references instead (see below). How can I
> print out the complete GenBank entry?
>
> I am running Bioperl 1.5, Perl 5.8.6, Mac 10.4.7
>
> Best,
>
> Georg
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

Christopher Fields
Postdoctoral Researcher
Lab of Dr.
Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign

From bix at sendu.me.uk Thu Jul 20 13:35:43 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Thu, 20 Jul 2006 14:35:43 +0100
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44BBBB69.6000906@sendu.me.uk>
References: <44BBBB69.6000906@sendu.me.uk>
Message-ID: <44BF86AF.8080408@sendu.me.uk>

Sendu Bala wrote:
> node 2 has name 'Bacteria ' and rank 'superkingdom'
> node 1386 has name 'Bacillus ' and rank 'genus'
> node 7776 has name 'Gnathostomata ' and rank 'superclass'
> etc.
>
> For me the bits in <> are inappropriate and shouldn't be there.
> [...]
> If there are no objections I'll strip the <> bits. I also plan to make
> $node->name('scientific', 'sapiens'); set and get the node name, and
> have flatfile and entrez store all common names with
> $obj->name('common', 'human', 'man');.

I'll describe all the changes I've now made and if no-one complains I'll
commit. (I've also made these notes into bug 2047 for easier reference
in the future.)


Bio::DB::Taxonomy::flatfile
---------------------------
# Bug-fixes

Removed invalid requirement that all species nodes have at least 7
named-rank parents.

The names->id solution used by get_taxonid() only stored the last id
associated with a name. However the name used wasn't necessarily unique,
such that multiple ids could match. names->id solution now remembers all
ids that match a name.

API-CHANGE: for this reason I've renamed get_taxonid() to get_taxonids()
and it returns an array of ids in list context. For backward
compatibility it returns one of the ids in scalar context, and
*get_taxonid = \&get_taxonids.

Added missing division ENV 'Environmental samples'.

# Improvements

Like Bio::DB::Taxonomy::entrez, flatfile now retrieves and stores the
common names, genetic code and mitochondrial genetic code in each node it
makes.
NOTE: entrez also stores creation, publication and update dates, but this
data is not available in the taxdump from the NCBI ftp site.
NOTE: the common names are stored in no particular order; the genbank
common name in particular isn't necessarily the first in the list (cf.
old entrez.pm behaviour).

BEHAVIOUR-CHANGE: flatfile used to store within the nodes it makes the
division as a three-letter code, like 'PRI'. However, for consistency
with entrez and the scientific_name() of the node the division is
supposed to correspond to, it is now stored as the full name, like
'Primates'.

The names->id solution also stores the artificially uniqued names like
'Craniata <chordata>', allowing you for the first time to retrieve the
correct id. Previously the search would have simply failed completely.

The names->id solution now handles nodes with scientific names of 'xyz
(class)', allowing you to retrieve the id with both get_taxonids('xyz')
and get_taxonids('xyz (class)'). Previously only the latter would work.

NOTE: the previous 2 changes (and the issues with entrez, see below) make
flatfile better at searching the taxonomy database than the entrez module
or the website, both in terms of speed and completeness of results.

BEHAVIOUR-CHANGE: The scientific name field isn't touched in any way,
always being sent directly to Bio::Taxonomy::Node->new(-name =>
$untouched) or the $node->classification() array. Previously, a species
node would have its name converted from 'Homo sapiens' to 'sapiens', but
the conversion mangled very badly certain other species names.


Bio::DB::Taxonomy::entrez
-------------------------
# Bug-fixes

Special characters like ", ( and ) in the input query string to
get_taxonid() resulted in the failure or inaccuracy of the search. These
characters are now removed prior to submission, allowing for correct
search results.

API-CHANGE: entrez has always been able to return multiple ids that match
a single input name, so I've renamed get_taxonid() to get_taxonids() and
it returns an array of ids in list context. It returns one of the ids in
scalar context. For backward compatibility, *get_taxonid =
\&get_taxonids.

NOTE: the entrez module (and website) cannot cope with the <> bits in the
query, failing searches like 'Craniata <chordata>'. For this reason, if
get_taxonids() is given a query with <> bits it will immediately return
undefined, saving a pointless website access. If you want the id of
'Craniata <chordata>' you must search for 'Craniata', then get the node
for each returned id to see which one has a parent node with a
scientific_name() or common_names() case-insensitively matching
'chordata'.

# Improvements

BEHAVIOUR-CHANGE: now throws on failure to retrieve data from website.

BEHAVIOUR-CHANGE: the ScientificName field isn't touched except for s/
\(class\)$//, being sent directly to Bio::Taxonomy::Node->new(-name =>
$untouched) or the $node->classification() array. Previously, a species
node would have its name converted from 'Homo sapiens' to 'sapiens', but
the conversion mangled very badly certain other species names.

BEHAVIOUR-CHANGE: all common names of a node are now stored in the
resulting Node object with Bio::Taxonomy::Node->new(-common_names =>
\@names). This means that the Genbank common name is now just one amongst
others, and isn't guaranteed to be the first in the list either.


Bio::Taxonomy::Node
-------------------
# Bug-fixes

Uninteresting fixes to get get_Children_Nodes(), get_Lineage_Nodes() and
get_LCA_Node() to work correctly.

classification() has a proper solution to finding the classification when
the array wasn't manually set.

# Improvements

BEHAVIOUR-CHANGE: node_name() used to be an alias to name('common'). Now
it is an alias to name('scientific').
NOTE: node_name is what is set when ->new(-name => $name) is called, so
flatfile and entrez and user-created nodes now implicitly associate the
name of the node they create with its scientific name.

BEHAVIOUR-CHANGE: scientific_name() used to be an alias to binomial().
Now it is *scientific_name = \&node_name.

binomial(), in addition to working the old way (assume first two elements
of classification array are species and genus, combine them), will
shortcut and return the scientific_name() if we are a node with rank
'species' and scientific_name is two words. This makes binomial() an
effective synonym of scientific_name() when Nodes were constructed as per
flatfile or entrez, and when it is used correctly on a species node.

BEHAVIOUR-CHANGE: *parent_taxon_id = \&parent_id. (Previously, you could
assign and retrieve different values to/from each method.)

New method common_names() supersedes common_name(), returning a list of
all common_names. For backward compatibility, returns one of the names in
scalar context, and *common_name = \&common_names.

-factory and factory() removed, since there is no
Bio::Taxonomy::FactoryI-implementing module, nothing in Node to make use
of a factory once set, and a factory seems redundant when we're a node
with a -dbh.

species() and genus() issue a warning when you try to use them on a node
that isn't of rank 'species' (since they interact with the classification
array and not names('method') like the other similar methods).

validate_name() removed because it just returns 1.

validate_species_name() removed because species() can (should) now
contain the real species name, like 'Homo sapiens', not 'sapiens'. But it
could also be any wonderfully complex thing, so there's nothing we can
confidently check for as being 'correct'.


t/Taxonomy.t
------------
Runs a slightly more comprehensive set of tests on entrez, which are now
only skipped if data retrieval fails.

Tests flatfile on a cut-down version of the taxdump.
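
For anyone unsure what the get_taxonid()/get_taxonids() compatibility scheme described above looks like in practice, here is a minimal sketch. It is not the flatfile or entrez code: the package My::TaxonomyDB, its lookup table and the ids 101/202/303 are all invented for illustration. Only wantarray and the typeglob alias are the point.

```perl
use strict;
use warnings;

package My::TaxonomyDB;    # hypothetical stand-in, not a BioPerl module

sub new {
    my ($class) = @_;
    # name => [ids]: one name may legitimately map to several ids
    my %name_to_ids = (
        'Craniata'     => [ 101, 202 ],    # invented ids
        'Homo sapiens' => [ 303 ],         # invented id
    );
    return bless { name_to_ids => \%name_to_ids }, $class;
}

# New-style method: all matching ids in list context,
# one of them in scalar context.
sub get_taxonids {
    my ($self, $name) = @_;
    my @ids = @{ $self->{name_to_ids}{$name} || [] };
    return wantarray ? @ids : $ids[0];
}

# Backward compatibility: the old method name is an alias for the new one.
{
    no warnings 'once';
    *get_taxonid = \&get_taxonids;
}

package main;

my $db  = My::TaxonomyDB->new;
my @all = $db->get_taxonids('Craniata');    # list context
my $one = $db->get_taxonid('Craniata');     # old name, scalar context

print "list context: @all\n";
print "scalar context: $one\n";
```

Old callers that did my $id = $db->get_taxonid($name) keep working unchanged, while new callers can ask for every match.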
> I'll also fix the problem with node names for ranks species and lower, > as discussed in thread 'Bio::DB::Taxonomy:: mishandles species, > subspecies/variant names', in the way I suggested there. This hasn't been done per se, because we now store the real ScientificName so there is no 'mishandling' to fix. From bix at sendu.me.uk Thu Jul 20 13:49:04 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 20 Jul 2006 14:49:04 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BF86AF.8080408@sendu.me.uk> References: <44BBBB69.6000906@sendu.me.uk> <44BF86AF.8080408@sendu.me.uk> Message-ID: <44BF89D0.7090103@sendu.me.uk> Sendu Bala wrote: > > Bio::DB::Taxonomy::flatfile > > BEHAVIOUR-CHANGE: The scientific name field isn't touched in any way, > always being sent directly to Bio::Taxonomy::Node->new(-name => > $untouched) or the $node->classification() array. Previously, a species > node would have its name converted from 'Homo sapiens' to 'sapiens', but > the conversion mangled very badly certain other species names. [...] > Bio::DB::Taxonomy::entrez > > BEHAVIOUR-CHANGE: the ScientificName field isn't touched except for s/ > \(class\)$//, being sent directly to Bio::Taxonomy::Node->new(-name => > $untouched) or the $node->classification() array. Previously, a species > node would have its name converted from 'Homo sapiens' to 'sapiens', but > the conversion mangled very badly certain other species names. Oops. In both cases the scientific name has ' (class)' removed from it, but the original name (with ' (class)') is stored as one of the common names. From georg.otto at tuebingen.mpg.de Thu Jul 20 14:29:33 2006 From: georg.otto at tuebingen.mpg.de (Georg Otto) Date: Thu, 20 Jul 2006 16:29:33 +0200 Subject: [Bioperl-l] Features in SeqIO GenBank output References: <73C89D17-91FE-47E4-80C1-AA6A689FA14E@uiuc.edu> Message-ID: This indeed seems to be the case. After upgrading it works fine. Sorry for stealing your time. 
Georg Chris Fields writes: > I'll give it a look. You might try upgrading to Bioperl 1.5.1 to see > if this was fixed. > > Chris > > On Jul 20, 2006, at 5:53 AM, Georg Otto wrote: > >> >> Hi, >> >> this is probably a FAQ but I could not find anything to solve it. >> >> I want to get sequences from GenBank and save them in GenBank >> format. This works with the script shown below, but the "Features" >> part is missing and contains references instead (see below). How can I >> print out the complete GenBank entry? >> >> I am running Bioperl 1.5, Perl 5.8.6, Mac 10.4.7 >> >> Best, >> >> Georg >> >> >> >> Here is my script: >> >> use strict; >> use warnings; >> >> use Bio::Seq; >> use Bio::SeqIO; >> use Bio::DB::GenBank; >> >> >> my $acc = 'AB017118'; >> my $db_obj = Bio::DB::GenBank->new(); >> my $seq_obj = $db_obj-> get_Seq_by_acc($acc); >> my $out = Bio::SeqIO->new(-format => 'genbank', >> -file => '>output.gb'); >> $out->write_seq($seq_obj); >> >> >> >> Here is the output: >> >> LOCUS AB017118 2038 bp mRNA linear VRT >> 06-JUN-2006 >> DEFINITION Danio rerio mRNA for ornithine decarboxylase antizyme long >> isoform, complete cds. >> ACCESSION AB017118 >> VERSION AB017118.1 GI:4239978 >> KEYWORDS . >> SOURCE Danio rerio (zebrafish) >> ORGANISM Danio rerio >> Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; >> Euteleostomi; >> Actinopterygii; Neopterygii; Teleostei; Ostariophysi; >> Cypriniformes; Cyprinidae; Danio. >> REFERENCE 1 >> AUTHORS Saito,T., Hascilowicz,T., Ohkido,I., Kikuchi,Y., >> Okamoto,H., >> Hayashi,S., Murakami,Y. and Matsufuji,S. >> TITLE Two zebrafish (Danio rerio) antizymes with different >> expression >> and activities >> JOURNAL Biochem. J. 345 PT 1, 99-106 (2000) >> PUBMED 10600644 >> REFERENCE 2 (bases 1 to 2038) >> AUTHORS Matsufuji,S. and Saito,T. 
>> TITLE Direct Submission >> JOURNAL Submitted (23-AUG-1998) Senya Matsufuji, Jikei >> University School >> of Medicine, Department of Biochemistry II; 3-25-8 >> Nishishinbashi, >> Minato-ku, Tokyo 105-8461, Japan (E- >> mail:senya at jikei.ac.jp, >> Tel:+81-3-3433-1111(ex.2276), Fax:+81-3-3436-3897) >> FEATURES Location/Qualifiers >> source 1..2038 >> /db_xref="Bio::Annotation::SimpleValue=HASH >> (0x19b9a28)" >> /mol_type="Bio::Annotation::SimpleValue=HASH >> (0x19b9b6c)" >> /dev_stage="Bio::Annotation::SimpleValue=HASH >> (0x19b9bb4)" >> /organism="Bio::Annotation::SimpleValue=HASH >> (0x19bfe18)" >> /clone_lib="Bio::Annotation::SimpleValue=HASH >> (0x19bfe60)" >> CDS join(45..224,226..702) >> /db_xref="Bio::Annotation::SimpleValue=HASH >> (0x19c0960)" >> / >> ribosomal_slippage="Bio::Annotation::SimpleValue=HASH(0x1 >> 9beecc)" >> /codon_start=Bio::Annotation::SimpleValue=HASH >> (0x19bef14) >> /protein_id="Bio::Annotation::SimpleValue=HASH >> (0x19bef5c)" >> /translation="Bio::Annotation::SimpleValue=HASH >> (0x19befa4) >> " >> /product="Bio::Annotation::SimpleValue=HASH >> (0x19befec)" >> /note="Bio::Annotation::SimpleValue=HASH >> (0x19bf034)" >> CDS 45..227 >> /db_xref="Bio::Annotation::SimpleValue=HASH >> (0x19bee24)" >> /codon_start=Bio::Annotation::SimpleValue=HASH >> (0x19bf160) >> /protein_id="Bio::Annotation::SimpleValue=HASH >> (0x19bf1cc)" >> /translation="Bio::Annotation::SimpleValue=HASH >> (0x19c1830) >> " >> /note="Bio::Annotation::SimpleValue=HASH >> (0x19c1878)" >> polyA_signal 2017..2022 >> polyA_site 2038 >> /note="Bio::Annotation::SimpleValue=HASH >> (0x19bffc8)" >> BASE COUNT 439 a 377 c 532 g 690 t >> ORIGIN >> 1 cagcagccga gcgcacaggc cgccgtgaaa cctcccgagg ccggatggta >> aaatccaacc >> >> >> >> >> 1981 ttatcctcta tagtggtaca cctttgcttc tgtcataata aaaccattat >> ttaaagac >> // >> >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> 
http://lists.open-bio.org/mailman/listinfo/bioperl-l > > Christopher Fields > Postdoctoral Researcher > Lab of Dr. Robert Switzer > Dept of Biochemistry > University of Illinois Urbana-Champaign From prabubio at gmail.com Thu Jul 20 16:01:35 2006 From: prabubio at gmail.com (Prabu R) Date: Thu, 20 Jul 2006 21:31:35 +0530 Subject: [Bioperl-l] Blast Output Parsing Message-ID: Dear All! I am now trying to parse a Blast output using PERL. I have to extract each alignment and have to parse the alignment. I mean, I have to check whether a particular part of the given sequence got aligned 100%. Anybody please tell me what module in PERL I have to use for getting this. I've tried Bio::SearchIO. But I didnt get any method to get the alignment. Kindly help. Thanks, R. Prabu From cjfields at uiuc.edu Thu Jul 20 17:03:17 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 20 Jul 2006 12:03:17 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BF86AF.8080408@sendu.me.uk> Message-ID: <002901c6ac1e$66ea3820$15327e82@pyrimidine> These all seem fine to me. Fantastic work! I added some comments but everything seems fine to me. I still plan on switching Bio::DB::Taxonomy::entrez to use Bio::DB::EUtilities at some point but probably won't get around to it until August; I still need to write up tests for the EUtilities modules. I may add a method for retrieving tax data based on protein/nucleotide sequence primary ID and relevant sequence database, so you could directly retrieve the relevant TaxID w/o parsing sequences directly for them. This would mainly be useful if you gather GIs from a BLAST search, for instance. Anyway, I could add this in then base class Bio::DB::Taxonomy directly so one could used the retrieved TaxIDs for flat-file or entrez searches; this requires, of course, access to the remote Entrez database (it would use ELink). Would that be of interest? If so, I'll work on that and add relevant tests to Taxonomy.t when I can. 
> Bio::DB::Taxonomy::flatfile > --------------------------- ... > API-CHANGE: for this reason I've renamed get_taxonid() to get_taxonids() > and it returns an array of ids in list context. For backward > compatibility it returns one of the ids in scalar context, and > *get_taxonid = \&get_taxonids. Returning a scalar makes sense as long as its noted in the POD. I have seen similar methods return an array ref based on wantarray instead of a scalar, but that largely depends on the complexity of the array (an array of hashes, for instance). ... > Bio::DB::Taxonomy::entrez > ------------------------- ... > NOTE: entrez modules (and website) cannot cope with '' in the > query, failing searches like 'Craniata '. For this reason, if > get_taxonids() is given a query with '' it will immediately > return undefined, saving a pointless website access. If you want the id > of 'Craniata ' you must search for 'Craniata', then get the > node for each returned id to see which one has a parent node with a > scientific_name() or common_names() case-insensitive matching to > 'chordata'. It may be something with the esearch interface, though the direct TaxBrowser query also seems to have problems with this: http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/ I'll try looking into it to see if there is a more direct way to get those (there probably isn't). > # Improvements > BEHAVIOUR-CHANGE: now throws on failure to retrieve data from website. > > BEHAVIOUR-CHANGE: the ScientificName field isn't touched except for s/ > \(class\)$//, being sent directly to Bio::Taxonomy::Node->new(-name => > $untouched) or the $node->classification() array. Previously, a species > node would have its name converted from 'Homo sapiens' to 'sapiens', but > the conversion mangled very badly certain other species names. This actually relates to the similar comment made for Bio::DB::Taxonomy::flatfle. The mangling probably depends on the current node and whether using flatfile or XML (entrez). 
Most of the odd XML examples I posted before, where the TaxID associated with a sequence had extra data, were a rank of 'no rank'. The species rank, if present, has a normal binomial name for : Flavobacterium johnsoniae UW101 ... Flavobacterium johnsoniae species Pseudomonas putida F1 ... Pseudomonas putida species Caldicellulosiruptor saccharolyticus DSM 8903 ... Caldicellulosiruptor saccharolyticus species The genus rank has one name; the subspecies rank has the full species name with 'subsp.' followed by the subspecies name. So, if using XML, one could use the taxon subelements stored in the XML element to sort out genus(), species(), subspecies(), and also higher order elements if someone wanted to implement them. This, of course, isn't necessary for the current changes, but down the road if anybody wanted it... ... > Bio::Taxonomy::Node > ------------------- ... > species() and genus() issue a warning when you try to use them on a node > that isn't of rank 'species' (since they interact with the > classification array and not names('method') like the other similar > methods). I would just have genus() and species() issue warnings if they aren't set to a particular value. So, if the current node is at the genus rank, genus() will be set but species() won't be. And no need to do additional checking! Fabulous work Sendu! Chris From cjfields at uiuc.edu Thu Jul 20 17:23:14 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 20 Jul 2006 12:23:14 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BF89D0.7090103@sendu.me.uk> Message-ID: <002a01c6ac21$2ed16190$15327e82@pyrimidine> Just thought of something... You had mentioned using a stripped-down version of Bio::Taxonomy::Node previously, which led to a bit of contention. 
One way to make everybody happy would be to create an interface class that contains the basic shared methods (Bio::Taxonomy::NodeI), then have the currently-named Bio::Taxonomy::Node (which could be renamed to Bio::Taxonomy::Species or something similar) implement those methods along with the current methods. Another class (your stripped down version, which could then be Bio::Taxonomy::Node) would also implement whatever base class methods were needed. They would both be Bio::Taxonomy::NodeI-implementing, so you could use either object type where required. |------Node NodeI----| |------Species Another option would be to have Bio::Taxonomy::Node itself stripped down, then have another class (Bio::Taxonomy::Species) inherit methods from it and also implement additional methods (genus(), species(), etc). Node----Species Would something like that be feasible? I favor the interface version as it sticks with the interface-implementation design that Bioperl has been migrating towards: http://www.bioperl.org/wiki/Advanced_BioPerl#Bioperl_Interface_design This would also help out with the whole Bio::Species issue; just have Bio::Taxonomy::Species replace it. Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Sendu Bala > Sent: Thursday, July 20, 2006 8:49 AM > To: bioperl-l at lists.open-bio.org > Subject: Re: [Bioperl-l] Bio::*Taxonomy* changes > > Sendu Bala wrote: > > > > Bio::DB::Taxonomy::flatfile > > > > BEHAVIOUR-CHANGE: The scientific name field isn't touched in any way, > > always being sent directly to Bio::Taxonomy::Node->new(-name => > > $untouched) or the $node->classification() array. Previously, a species > > node would have its name converted from 'Homo sapiens' to 'sapiens', but > > the conversion mangled very badly certain other species names. > [...] 
> > Bio::DB::Taxonomy::entrez > > > > BEHAVIOUR-CHANGE: the ScientificName field isn't touched except for s/ > > \(class\)$//, being sent directly to Bio::Taxonomy::Node->new(-name => > > $untouched) or the $node->classification() array. Previously, a species > > node would have its name converted from 'Homo sapiens' to 'sapiens', but > > the conversion mangled very badly certain other species names. > > Oops. In both cases the scientific name has ' (class)' removed from it, > but the original name (with ' (class)') is stored as one of the common > names. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Thu Jul 20 17:31:42 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 20 Jul 2006 12:31:42 -0500 Subject: [Bioperl-l] Blast Output Parsing In-Reply-To: Message-ID: <002b01c6ac22$5d75e1f0$15327e82@pyrimidine> Grab the HSPs, then use get_aln() to generate a Bio::SimpleAlign object. You can then use Bio::AlignIO to generate the alignment output if needed, or use the Bio::SimpleAlign methods to get what you want. http://www.bioperl.org/wiki/HOWTO:Beginners http://www.bioperl.org/wiki/HOWTO:SearchIO http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/SimpleAlign .html Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Prabu R > Sent: Thursday, July 20, 2006 11:02 AM > To: bioperl-l at lists.open-bio.org > Subject: [Bioperl-l] Blast Output Parsing > > Dear All! > > I am now trying to parse a Blast output using PERL. > > I have to extract each alignment and have to parse the alignment. I mean, > I > have to check whether a particular part of the given sequence got aligned > 100%. > > Anybody please tell me what module in PERL I have to use for getting this. > > I've tried Bio::SearchIO. 
But I didnt get any method to get the > alignment. > > Kindly help. > > Thanks, > R. Prabu > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From bix at sendu.me.uk Thu Jul 20 17:53:03 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 20 Jul 2006 18:53:03 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <002901c6ac1e$66ea3820$15327e82@pyrimidine> References: <002901c6ac1e$66ea3820$15327e82@pyrimidine> Message-ID: <44BFC2FF.3030704@sendu.me.uk> Chris Fields wrote: > > I still plan on switching Bio::DB::Taxonomy::entrez to use > Bio::DB::EUtilities at some point but probably won't get around to it until > August; If I may make two feature requests (you've probably already done them, if so apologies)? a) Automatically enforce the 3second wait rule when querying via the ncbi website. b) Automatically cache results locally in a reasonable way, such that repeated queries aiming to get the same result don't have to go via the website. > Anyway, I could add this in then base class Bio::DB::Taxonomy directly so > one could used the retrieved TaxIDs for flat-file or entrez searches; this > requires, of course, access to the remote Entrez database (it would use > ELink). Would that be of interest? Sorry, I don't really understand this paragraph. I'm unable to parse '...then base class Bio::DB::Taxonomy directly so...', for starters. >> Bio::Taxonomy::Node >> ------------------- > > ... > >> species() and genus() issue a warning when you try to use them on a node >> that isn't of rank 'species' (since they interact with the >> classification array and not names('method') like the other similar >> methods). > > I would just have genus() and species() issue warnings if they aren't set to > a particular value. So, if the current node is at the genus rank, genus() > will be set but species() won't be. And no need to do additional checking! 
The problem is, genus() and species() are special cases that aren't normally directly set. They get their values from the classification array: genus() returns (classification())[1] and species() returns (classification())[0]. They set the same values. Doing this is only sane (though is still likely to be wrong, given that there can be ranks between species and genus) when the node is of rank 'species', hence the warnings. I imagine this is to work with pesky file formats like genbank, so I can't really change anything here without major overhaul. And my plans for overhaul involve getting rid of genus() and species(), so I'll just leave them be for now. Anyway, thanks for your comments and input into this thread! It's much appreciated. From bix at sendu.me.uk Thu Jul 20 17:55:56 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 20 Jul 2006 18:55:56 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <002a01c6ac21$2ed16190$15327e82@pyrimidine> References: <002a01c6ac21$2ed16190$15327e82@pyrimidine> Message-ID: <44BFC3AC.8010704@sendu.me.uk> Chris Fields wrote: > Just thought of something... > > You had mentioned using a stripped-down version of Bio::Taxonomy::Node > previously, which led to a bit of contention. One way to make everybody > happy would be to create an interface class that contains the basic shared > methods (Bio::Taxonomy::NodeI), then have the currently-named > Bio::Taxonomy::Node (which could be renamed to Bio::Taxonomy::Species or > something similar) implement those methods along with the current methods. > Another class (your stripped down version, which could then be > Bio::Taxonomy::Node) would also implement whatever base class methods were > needed. They would both be Bio::Taxonomy::NodeI-implementing, so you could > use either object type where required. > > |------Node > NodeI----| > |------Species [...] 
> I favor the interface version as it > sticks with the interface-implementation design that Bioperl has been > migrating towards: > > http://www.bioperl.org/wiki/Advanced_BioPerl#Bioperl_Interface_design > > This would also help out with the whole Bio::Species issue; just have > Bio::Taxonomy::Species replace it. Yes, this sounds good to me. Should I still wait until Jason/elders are able to comment before I start exploring this avenue? From cjfields at uiuc.edu Thu Jul 20 18:21:48 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 20 Jul 2006 13:21:48 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BFC3AC.8010704@sendu.me.uk> Message-ID: <000601c6ac29$5d533a90$15327e82@pyrimidine> I would say go ahead, why not? This would likely lead to the eventual deprecation of Bio::Species, which was in the cards anyway. The only problem I can foresee is which class to use with Bio::DB::Taxonomy*? I guess one could settle on one class by default and have the option to use another Bio::Taxonomy::NodeI-implementing class if you wanted more data/methods available... Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Sendu Bala > Sent: Thursday, July 20, 2006 12:56 PM > To: bioperl-l at bioperl.org > Subject: Re: [Bioperl-l] Bio::*Taxonomy* changes > > Chris Fields wrote: > > Just thought of something... > > > > You had mentioned using a stripped-down version of Bio::Taxonomy::Node > > previously, which led to a bit of contention. One way to make everybody > > happy would be to create an interface class that contains the basic > shared > > methods (Bio::Taxonomy::NodeI), then have the currently-named > > Bio::Taxonomy::Node (which could be renamed to Bio::Taxonomy::Species or > > something similar) implement those methods along with the current > methods. 
> > Another class (your stripped down version, which could then be > > Bio::Taxonomy::Node) would also implement whatever base class methods > were > > needed. They would both be Bio::Taxonomy::NodeI-implementing, so you > could > > use either object type where required. > > > > |------Node > > NodeI----| > > |------Species > [...] > > I favor the interface version as it > > sticks with the interface-implementation design that Bioperl has been > > migrating towards: > > > > http://www.bioperl.org/wiki/Advanced_BioPerl#Bioperl_Interface_design > > > > This would also help out with the whole Bio::Species issue; just have > > Bio::Taxonomy::Species replace it. > > Yes, this sounds good to me. Should I still wait until Jason/elders are > able to comment before I start exploring this avenue? > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From hlapp at gmx.net Thu Jul 20 18:24:19 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 20 Jul 2006 14:24:19 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BFC3AC.8010704@sendu.me.uk> References: <002a01c6ac21$2ed16190$15327e82@pyrimidine> <44BFC3AC.8010704@sendu.me.uk> Message-ID: On Jul 20, 2006, at 1:55 PM, Sendu Bala wrote: > > Yes, this sounds good to me. Should I still wait until Jason/elders > are > able to comment before I start exploring this avenue? Unless you're afraid that your suggestions are going too wild for our palate please do go ahead. The joy of CVS is we can always go back. For my part, I just haven't been able to keep up with the flurry of long emails ... 
I'll have to do some extensive bedtime reading (and then writing ;) soon I guess :-) -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From saunders at uchicago.edu Thu Jul 20 21:47:08 2006 From: saunders at uchicago.edu (Matthew A. Saunders) Date: Thu, 20 Jul 2006 16:47:08 -0500 (CDT) Subject: [Bioperl-l] installing bioperl Message-ID: Dear Bioperl representative, I have been trying to install bioperl (in order to ultimately run some Ensembl APIs) but I seem to be having some problems with the bioperl installation. I have followed the installation directions and I get to the last steps of the "make" process, yet this stage fails with the error message below. Can you possibly tell me what is the problem. I am not sure that I understand the command "make", but I think that it requires that there be a file named "makefile" in the given folder, when I look in my newly formed "bioperl-1.4" folder there is no "makefile" in there. Perhaps that is a problem. If so, how might I rectify the matter? Thanks! Matt ************************************************************* . . . Enjoy the rest of bioperl, which you can use after going 'make install' Checking if your kit is complete... Looks good /usr/bin/perl: symbol lookup error: /usr/lib64/perl5/5.8.5/x86_64-linux-thread-multi/auto/DB_File/DB_File.so: undefined symbol: db_version Running make test Make had some problems, maybe interrupted? Won't test Running make install Make had some problems, maybe interrupted? Won't install *************************************************************** ----------------------------------------------------- Matthew A. Saunders UNCF-MERCK Postdoctoral Research Fellow Dept. 
of Ecology and Evolution University of Chicago (773)834-3964 Skype: mattsaunders555 http://home.uchicago.edu/~saunders ------------------------------------------------------- From saunders at uchicago.edu Thu Jul 20 22:01:53 2006 From: saunders at uchicago.edu (Matthew A. Saunders) Date: Thu, 20 Jul 2006 17:01:53 -0500 (CDT) Subject: [Bioperl-l] installing bioperl In-Reply-To: References: Message-ID: In continuation to my described problem, I have just installed the bioperl-run file from the .tar.gz format and that was successful through the "perl Makefile.PL" and the "make" & "make test" phases. It is the "bioperl core" file that is still giving me the problems described below. Thanks! Matt ******************************** On Thu, 20 Jul 2006, Matthew A. Saunders wrote: > Dear Bioperl representative, > > I have been trying to install bioperl (in order to ultimately run some > Ensembl APIs) but I seem to be having some problems with the bioperl > installation. > > I have followed the installation directions and I get to the last steps of > the "make" process, yet this stage fails with the error message below. Can > you possibly tell me what is the problem. I am not sure that I understand > the command "make", but I think that it requires that there be a file named > "makefile" in the given folder, when I look in my newly formed "bioperl-1.4" > folder there is no "makefile" in there. Perhaps that is a problem. If so, > how might I rectify the matter? > > Thanks! > > Matt > > > ************************************************************* . . . > Enjoy the rest of bioperl, which you can use after going 'make install' > > Checking if your kit is complete... > Looks good > /usr/bin/perl: symbol lookup error: > /usr/lib64/perl5/5.8.5/x86_64-linux-thread-multi/auto/DB_File/DB_File.so: > undefined symbol: db_version > Running make test > Make had some problems, maybe interrupted? Won't test > Running make install > Make had some problems, maybe interrupted? 
Won't install > *************************************************************** > > > > ----------------------------------------------------- > Matthew A. Saunders > UNCF-MERCK Postdoctoral Research Fellow > > Dept. of Ecology and Evolution > University of Chicago > (773)834-3964 > Skype: mattsaunders555 > http://home.uchicago.edu/~saunders > ------------------------------------------------------- > > From bix at sendu.me.uk Thu Jul 20 22:47:33 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 20 Jul 2006 23:47:33 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <001001c6ac2f$02c8c3f0$15327e82@pyrimidine> References: <001001c6ac2f$02c8c3f0$15327e82@pyrimidine> Message-ID: <44C00805.7090403@sendu.me.uk> Chris Fields wrote: > As for caching, > do you mean caching of the tax information or the sequence ID information? Anything you get from entrez. > Caching of tax information would be great, but how would you go about it? I > can see how it would be easy to have a cache for the flatfile using a local > index, but not so much for XML data retrieved from Entrez (a > continually-appended local file, maybe, with a n accompanying index file?). I didn't actually mean a stored file (but that would be possible with a tied hash or something: DB_File, just like flatfile), but an in-memory one for use during the course of program execution. Stored file would probably be dangerous because you wouldn't know if the data has become stale or not - and checking to see if it wasn't would defeat the point. >> The problem is, genus() and species() are special cases that aren't >> normally directly set. They get their values from the classification >> array: genus() returns (classification())[1] and species() returns >> (classification())[0]. They set the same values. Doing this is only sane >> (though is still likely to be wrong, given that there can be ranks >> between species and genus) when the node is of rank 'species', hence the >> warnings. 
>> >> I imagine this is to work with pesky file formats like genbank, so I >> can't really change anything here without major overhaul. And my plans >> for overhaul involve getting rid of genus() and species(), so I'll just >> leave them be for now. > > This would all depend on where the information came from; if the information > came from the Entrez XML element data: > [snip] > > The subspecies(), genus(), and species() could all be set from this instead > of the classification array. The problem lies then with the flatfile data > and how it would be parsed out, if that's at all possible with the flatfile > data. If not, I see why you would rather have this return a stripped-down > Bio::Taxonomy::Node object. > > I would have to look at how everything is indexed in > Bio::DB::Taxonomy::entrez, but I think it's feasible. entrez already parses through LineageEx to build the classification array. flatfile walks up all the parents to do the same. Having the information isn't the issue. We have the information. The methods genus() and species() need to work with the genbank file format; that is the problem. From MEC at stowers-institute.org Thu Jul 20 22:40:55 2006 From: MEC at stowers-institute.org (Cook, Malcolm) Date: Thu, 20 Jul 2006 17:40:55 -0500 Subject: [Bioperl-l] Remote Blast - Blast Human Genome Message-ID: Rohan, 'snp/human/human_snp' is the database name you need to use to BLAST against the human SNP database at NCBI. See the following document for the full list (the link was provided to me via personal correspondence with the NCBI helpdesk). Very useful... Hmm, looking again, there now appear to be two versions: http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/blastdblist.html (last updated 2/7/2006) http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/remote_accessible_blastdblist.html (last updated 5/29/2006) Neither is linked to by any other document on the internet (google sez) including anywhere else at NCBI. Go figure.
It should be IMHO since this info is nowhere else collected. Of course it may be out of date, but it always has got me through. Good luck Malcolm Cook - mec at stowers-institute.org - 816-926-4449 Database Applications Manager - Bioinformatics Stowers Institute for Medical Research - Kansas City, MO USA >-----Original Message----- >From: bioperl-l-bounces at lists.open-bio.org >[mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Chris Fields >Sent: Monday, July 17, 2006 4:26 PM >To: vrramnar at student.cs.uwaterloo.ca; bioperl-l at lists.open-bio.org >Subject: Re: [Bioperl-l] Remote Blast - Blast Human Genome > >Okay, I think I may know what's going on a little more now >with NCBI's BLAST >interface. Looks like any NCBI BLAST query must use the >default URL and so >must set up to proper GET/PUT commands to retrieve everything >correctly. > >Here's the API description for it all: > >http://www.ncbi.nlm.nih.gov/BLAST/Doc/urlapi.html > >You could try setting the database to 'snp' or something along >those lines >instead of 'nr'; or you could see what the name of the >database is when you >use the web form and try setting it to that. According to >this page, this >should be possible: > >http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=helpsnpfaq.sectio >n.SearchdbSNP >_test._Search_dbSNP_Using_B > >The Entrez Query limit was a recommendation for limiting your >search to a >set of sequences for human, for instance. > >I'll try looking into it a bit more but I'm pretty busy. If you find >anything out you should probably post it here . > >Chris > >> Hi Chris, >> >> 1. I have tried changing the database to snp or dbSNP but >neither works. >> It >> seems that depending on which type of blast you use(ie, Genome Blast, >> Blast SNP, >> normal blast such as blastn, etc...) you see a different listing of >> databases >> available for querys. 
From cjfields at uiuc.edu Thu Jul 20 23:01:02 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 20 Jul 2006 18:01:02 -0500
Subject: [Bioperl-l] installing bioperl
In-Reply-To:
References:
Message-ID: <68C6025D-A9FE-47F0-905C-28B79C4B843A@uiuc.edu>

Did you run

    perl Makefile.PL
    make
    make install

'perl Makefile.PL' generates the Makefile. Something screwy with DB_File,
apparently, is also going on.

> /usr/lib64/perl5/5.8.5/x86_64-linux-thread-multi/auto/DB_File/DB_File.so:

Try updating or reinstalling DB_File.

Chris

On Jul 20, 2006, at 4:47 PM, Matthew A. Saunders wrote:

> Dear Bioperl representative,
>
> I have been trying to install bioperl (in order to ultimately run some
> Ensembl APIs) but I seem to be having some problems with the
> bioperl installation.
>
> I have followed the installation directions and I get to the last
> steps of the "make" process, yet this stage fails with the error
> message below. Can you possibly tell me what is the problem. I am not
> sure that I understand the command "make", but I think that it requires
> that there be a file named "makefile" in the given folder, when I look
> in my newly formed "bioperl-1.4" folder there is no "makefile" in there.
> Perhaps that is a problem. If so, how might I rectify the matter?
>
> Thanks!
>
> Matt
>
>
> *************************************************************
> .
> .
> .
> Enjoy the rest of bioperl, which you can use after going 'make install'
>
> Checking if your kit is complete...
> Looks good
> /usr/bin/perl: symbol lookup error:
> /usr/lib64/perl5/5.8.5/x86_64-linux-thread-multi/auto/DB_File/DB_File.so:
> undefined symbol: db_version
> Running make test
>   Make had some problems, maybe interrupted? Won't test
> Running make install
>   Make had some problems, maybe interrupted? Won't install
> ***************************************************************
>
>
>
> -----------------------------------------------------
> Matthew A. Saunders
> UNCF-MERCK Postdoctoral Research Fellow
>
> Dept. of Ecology and Evolution
> University of Chicago
> (773)834-3964
> Skype: mattsaunders555
> http://home.uchicago.edu/~saunders
> -------------------------------------------------------

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign


From cjfields at uiuc.edu Thu Jul 20 23:02:08 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 20 Jul 2006 18:02:08 -0500
Subject: [Bioperl-l] Remote Blast - Blast Human Genome
In-Reply-To:
References:
Message-ID:

Nice to know! I'll add this to the wiki.

Chris

On Jul 20, 2006, at 5:40 PM, Cook, Malcolm wrote:

> Rohan,
>
> 'snp/human/human_snp' is the database name you need to use to blast
> into human snp database at NCBI
>
> See the following document for the full list (which link was provided
> to me via personal correspondence with NCBI helpdesk). Very useful...
>
> Hmm, looking again, there appear now to be two versions:
>
> http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/blastdblist.html
>   (last updated 2/7/2006)
> http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/remote_accessible_blastdblist.html
>   (last updated 5/29/2006)
>
> Neither are linked to by any other document on the internet (google
> sez) including anywhere else at NCBI. Go figure. It should be IMHO
> since this info is nowhere else collected.
>
> Of course it may be out of date, but it always has got me through.
>
> Good luck
>
> Malcolm Cook - mec at stowers-institute.org - 816-926-4449
> Database Applications Manager - Bioinformatics
> Stowers Institute for Medical Research - Kansas City, MO USA

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign


From vrramnar at student.cs.uwaterloo.ca Thu Jul 20 23:07:15 2006
From: vrramnar at student.cs.uwaterloo.ca (vrramnar at student.cs.uwaterloo.ca)
Date: Thu, 20 Jul 2006 19:07:15 -0400
Subject: [Bioperl-l] Remote Blast - Blast Human Genome
In-Reply-To:
References:
Message-ID: <1153436835.44c00ca39f2ee@www.nexusmail.uwaterloo.ca>

Hi Malcolm,

Thanks for the help. I actually figured this out today the same way you
did, through discussions with the NCBI help desk.
He mentioned the main site is:

http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/

But specifically:

http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/remote_accessible_blastdblist.html

So all you would need to do while using remoteblast is set your $db to
one of the following:

snp/human_9606/human_9606    Human SNPs
snp/human_9606/rs_ch1        Human chr 1 SNPs
snp/human_9606/rs_ch10       Human chr 10 SNPs
snp/human_9606/rs_ch11       Human chr 11 SNPs
snp/human_9606/rs_ch12       Human chr 12 SNPs
snp/human_9606/rs_ch13       Human chr 13 SNPs
snp/human_9606/rs_ch14       Human chr 14 SNPs
snp/human_9606/rs_ch15       Human chr 15 SNPs
snp/human_9606/rs_ch16       Human chr 16 SNPs
snp/human_9606/rs_ch17       Human chr 17 SNPs
snp/human_9606/rs_ch18       Human chr 18 SNPs
snp/human_9606/rs_ch19       Human chr 19 SNPs
snp/human_9606/rs_ch2        Human chr 2 SNPs
snp/human_9606/rs_ch20       Human chr 20 SNPs
snp/human_9606/rs_ch21       Human chr 21 SNPs
snp/human_9606/rs_ch22       Human chr 22 SNPs
snp/human_9606/rs_ch3        Human chr 3 SNPs
snp/human_9606/rs_ch4        Human chr 4 SNPs
snp/human_9606/rs_ch5        Human chr 5 SNPs
snp/human_9606/rs_ch6        Human chr 6 SNPs
snp/human_9606/rs_ch7        Human chr 7 SNPs
snp/human_9606/rs_ch8        Human chr 8 SNPs
snp/human_9606/rs_ch9        Human chr 9 SNPs
snp/human_9606/rs_chMT       Human chr Mitochondrial SNPs
snp/human_9606/rs_chMulti    Human SNPs mapped to multiple locations
snp/human_9606/rs_chNotOn    Human SNPs not mapped
snp/human_9606/rs_chUn       Human SNPs mapped to unplaced contigs
snp/human_9606/rs_chX        Human chr x SNPs
snp/human_9606/rs_chY        Human chr y SNPs

The web site has a more complete list of all other databases available
using the remoteblast module.

Rohan

Quoting "Cook, Malcolm" :

> Rohan,
>
> 'snp/human/human_snp' is the database name you need to use to blast into
> human snp database at NCBI
>
> See the following document for the full list (which link was provided to
> me via personal correspondence with NCBI helpdesk). Very useful...
>
> Hmm, looking again, there appear now to be two versions:
>
> http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/blastdblist.html
>   (last updated 2/7/2006)
> http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/remote_accessible_blastdblist.html
>   (last updated 5/29/2006)
>
> Neither are linked to by any other document on the internet (google sez)
> including anywhere else at NCBI. Go figure. It should be IMHO since
> this info is nowhere else collected.
>
> Of course it may be out of date, but it always has got me through.
>
> Good luck
>
> Malcolm Cook - mec at stowers-institute.org - 816-926-4449
> Database Applications Manager - Bioinformatics
> Stowers Institute for Medical Research - Kansas City, MO USA


----------------------------------------
This mail sent through www.mywaterloo.ca


From vrramnar at student.cs.uwaterloo.ca Thu Jul 20 23:18:27 2006
From: vrramnar at student.cs.uwaterloo.ca (vrramnar at student.cs.uwaterloo.ca)
Date: Thu, 20 Jul 2006 19:18:27 -0400
Subject: [Bioperl-l] SNP reference file download
Message-ID: <1153437507.44c00f43b53d4@www.nexusmail.uwaterloo.ca>

Hello All,

I was wondering if anyone knew how to download an entire SNP reference
file from NCBI, or even how to download the sequence data for a
particular SNP.

I know how to do this via Bio::DB::GenBank, Bio::DB::SwissProt, etc.
when referring to NM_##### but when I try to access rs###### files I am
unsure of what Bio::DB to point to, if there is one.

For example, if I had the accession number rs4986950, how could I
retrieve NCBI's entire reference file for this SNP record OR just the
SNP sequence relating to this accession number?

Any help on this subject would be greatly appreciated,

Rohan


----------------------------------------
This mail sent through www.mywaterloo.ca


From cjfields at uiuc.edu Fri Jul 21 04:51:30 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 20 Jul 2006 23:51:30 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44C00805.7090403@sendu.me.uk>
References: <001001c6ac2f$02c8c3f0$15327e82@pyrimidine>
 <44C00805.7090403@sendu.me.uk>
Message-ID: <9777D0A4-5340-4367-879D-F9125393B4FD@uiuc.edu>

> I didn't actually mean a stored file (but that would be possible with a
> tied hash or something: DB_File, just like flatfile), but an in-memory
> one for use during the course of program execution. Stored file would
> probably be dangerous because you wouldn't know if the data has become
> stale or not - and checking to see if it wasn't would defeat the point.

Okay, that wouldn't be a problem. I currently use in-memory caches to
hold NCBI history information and ELink information for EUtilities. It
would just be a matter of doing the same for Bio::DB::Taxonomy.

...

> entrez already parses through LineageEx to build the classification
> array. flatfile walks up all the parents to do the same. Having the
> information isn't the issue. We have the information. The methods
> genus() and species() need to work with the genbank fileformat, that is
> the problem.

The original purpose for Bio::Species was a simple object to hold
taxonomic information. This object was then used in an attempt to hold
the basic organism information (scientific name, common name, lineage
information, etc) contained in a RichSeq file, like GenBank, EMBL,
SwissProt, etc. The problem: trying to determine which term in the
lineage corresponds to which rank and what part of the organism's
scientific name is the genus, the species, and so on based solely on the
data in the file, which comes down to a best-guess scenario for many
organisms. It does work, but not equally well for all RichSeq files, not
for every organism, and definitely not all the time.

So, yes, genus(), species(), binomial(), and other methods are present,
but one must realize that parsing out the data into the appropriate
object data using the various get/sets, with the obvious exceptions, is
not the best way. Unless... you incorporate information available only
outside the actual file itself (i.e. NCBI Taxonomy information). This is
where Bio::Taxonomy seems to come along, as it's not species-specific
(it can represent any rank) and is also DB-aware. Though Bio::Species
was originally going to delegate all its data to Bio::Taxonomy::Node, I
think the purpose was to eventually replace Bio::Species.

So, my question is, why not use a Bio::Taxonomy::Node-like class
initially to contain the appropriate data for a GenBank file (just for
read/write purposes)? This object, since it implements
Bio::Taxonomy::NodeI, is also DB-aware and thus, if set up with a
database, could also get/set the appropriate object data correctly using
the lineage data.

So, for instance, if I called

    $species = $seq->species();

and wanted the classification, scientific_name(), common_name(), and
other information that is gleaned from the file, then there's no need
for a lookup. Once you cross into the bounds of:

    print $species->species();
    print $species->genus();

then there's trouble, since we're working straight from the file (i.e.
parsing is mainly correct, but still guesswork and sometimes wrong). But
what if you could do something like this:

    # normally not needed as this is set by default internally,
    # but as a demo here...
    my $db = Bio::DB::Taxonomy->new(-source => 'entrez');
    $species->db_handle($db);

    # reset the appropriate data (genus, species, etc) based on
    # Entrez tax data; this method, BTW, doesn't exist yet but
    # should be easy to implement
    $species->reset_data();

    print $species->species();
    my $parent = $species->get_Parent_Node;
    my @child  = $species->get_Children_Nodes;

...and so on

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign


From prabubio at gmail.com Fri Jul 21 06:17:41 2006
From: prabubio at gmail.com (Prabu R)
Date: Fri, 21 Jul 2006 11:47:41 +0530
Subject: [Bioperl-l] Blast Output Parsing
In-Reply-To: <002b01c6ac22$5d75e1f0$15327e82@pyrimidine>
References: <002b01c6ac22$5d75e1f0$15327e82@pyrimidine>
Message-ID:

It works great. Thanks a lot, Mr. Chris.

R. Prabu

On 7/20/06, Chris Fields wrote:
>
> Grab the HSPs, then use get_aln() to generate a Bio::SimpleAlign object.
> You can then use Bio::AlignIO to generate the alignment output if needed,
> or use the Bio::SimpleAlign methods to get what you want.
>
> http://www.bioperl.org/wiki/HOWTO:Beginners
>
> http://www.bioperl.org/wiki/HOWTO:SearchIO
>
> http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/SimpleAlign.html
>
> Chris

--
"Every noble work is at first impossible." - Thomas Carlyle


From mh6 at sanger.ac.uk Fri Jul 21 09:04:57 2006
From: mh6 at sanger.ac.uk (Michael Han)
Date: Fri, 21 Jul 2006 10:04:57 +0100
Subject: [Bioperl-l] PAML parser
Message-ID: <44C098B9.4090003@sanger.ac.uk>

Hi,

I have some questions about the PAML parser (Bio::Tools::Phylo::PAML in
CVS HEAD). Maybe some of you could help.

If you call next_result, $self->_parse_summary might be called, which
loops over $self->_readline.

Later in next_result, when "while (defined ($_=$self->_readline))" is
used, isn't the filepointer/filehandle already at the end of the output
file, so that it returns undef and breaks the parsing?

I added a crude seek($self->{_filehandle},0,0) after the _parse_summary
and it seemed to work, but I wonder if I missed something obvious.

thanks,

Mike


From cjfields at uiuc.edu Fri Jul 21 12:22:01 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Fri, 21 Jul 2006 07:22:01 -0500
Subject: [Bioperl-l] PAML parser
In-Reply-To: <44C098B9.4090003@sanger.ac.uk>
References: <44C098B9.4090003@sanger.ac.uk>
Message-ID:

Normally when you parse a report you use a loop to iterate through
results:

    while (my $result = $parser->next_result) {
        # do work here
    }

So returning undef is necessary to end the loop. This type of loop
construct is common in BioPerl (and in Perl in general).

There is a HOWTO for PAML:

http://www.bioperl.org/wiki/HOWTO:PAML

Chris

On Jul 21, 2006, at 4:04 AM, Michael Han wrote:

> Hi,
>
> I have some questions about the PAML parser
> (Bio::Tools::Phylo::PAML in CVS HEAD). Maybe some of you could help.
>
> If you call next_result, $self->_parse_summary might be called,
> which loops over $self->_readline .
>
> Later in next_result when "while (defined ($_=$self->_readline))"
> is used isn't the filepointer/filehandle
> already at the end of the output file and should return undef
> breaking the parsing?
>
> I added a crude seek($self->{_filehandle},0,0) after the
> _parse_summary and it seemed to work, but I wonder if I missed
> something obvious.
>
> thanks,
>
> Mike

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign


From cjfields at uiuc.edu Fri Jul 21 15:50:20 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Fri, 21 Jul 2006 10:50:20 -0500
Subject: [Bioperl-l] SNP reference file download
In-Reply-To: <1153437507.44c00f43b53d4@www.nexusmail.uwaterloo.ca>
Message-ID: <000901c6acdd$5f38ddb0$15327e82@pyrimidine>

You'll need the latest code from CVS; you could try (the highly
experimental) Bio::DB::EUtilities to get the raw flatfile XML data, then
pass everything through Bio::ClusterIO. Currently there isn't tempfile,
file, or filehandle support for the EUtilities but I plan on adding this
soon. You could also pipe STDOUT from one SNP retrieval script into
STDIN for the ClusterIO.

BTW, the EFetch object below accepts an array reference of primary IDs
if you want to use them instead, so you don't need to run an ESearch
query first. To do this you'll need to set the database parameter
(-db => 'snp'); the database from the ESearch query is passed to EFetch
via the Cookie object.
Chris use Bio::DB::EUtilities; use Bio::ClusterIO; # save XML to tempfile for read/write open my $XMLDATA, '+>', 'tempfile.xml'; # ESearch for term, place data in search history my $esearch= Bio::DB::EUtilities->new(-eutil => 'esearch', -term => 'dihydroorotase', -db => 'snp', -usehistory => 'y'); $esearch->get_response; print STDERR "Count: ", $esearch->count,"\n"; # efetch is default EUtility my $efetch = Bio::DB::EUtilities->new(-cookie => $esearch->next_cookie, -rettype => 'flt'); # SNP flatfile print $XMLDATA $efetch->get_response->content; seek ($XMLDATA, 0, 0); # don't forget to rewind... my $cio = Bio::ClusterIO->new(-format => 'dbsnp', -fh => $XMLDATA); # $snp is a Bio::Variation::snp object, see perldoc for methods while (my $snp = $cio->next_cluster) { print "ID : ",$snp->id,"\n"; } close $XMLDATA; > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of vrramnar at student.cs.uwaterloo.ca > Sent: Thursday, July 20, 2006 6:18 PM > To: bioperl-l at lists.open-bio.org > Subject: [Bioperl-l] SNP reference file download > > > Hello All, > > I was wondering if anyone knew how to download an entire SNP reference > file from > NCBI?? Or even downloading the sequence data for a particular SNP. > > I know how to do this via Bio::DB::GenBank, Bio::DB::SwissP, etc.. when > referring > to NM_##### but when I try to access rs###### files I am unsure of what > Bio::DB > to point to, if there is one. > > For example, if I had the accession number: rs4986950 How could I retrieve > NCBI's > entire reference file for this SNP record OR just the SNP sequence > relating to > this accession number. 
>
> Any help on this subject would greatly be appreciated,
>
> Rohan
>
>
> ----------------------------------------
> This mail sent through www.mywaterloo.ca
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

From cjfields at uiuc.edu Sun Jul 23 19:09:48 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Sun, 23 Jul 2006 14:09:48 -0500
Subject: [Bioperl-l] obo_parser.t test warnings
Message-ID:

Hilmar, Sohel,

Didn't know who to notify, so sorry in advance about cross-posting
this to the list. I was running through cleaning up some bugs and
found that obo_parser.t is throwing a ton of warnings:

bayou-75:~/Chris/Bioperl/bioperl-live natashacapell$ perl -I. -w t/obo_parser.t
1..40
"my" variable $val masks earlier declaration in same scope at Bio/OntologyIO/obo.pm line 592.
"my" variable $qh masks earlier declaration in same scope at Bio/OntologyIO/obo.pm line 592.
Use of uninitialized value in string eq at Bio/OntologyIO/obo.pm line 239, line 13.
...

Good news: all tests pass! Cheers!

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign

From cjfields at uiuc.edu Sun Jul 23 20:53:32 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Sun, 23 Jul 2006 15:53:32 -0500
Subject: [Bioperl-l] Bio::Species/Bio::Taxonomy changes
Message-ID: <7A7BF302-D84E-427B-ABC0-3EB4D19B33E2@uiuc.edu>

Sendu, Hilmar, et al,

I was looking through SeqIO::genbank and thought I would bring up a
couple of things to think about re: GenBank Taxonomy information.
This is how NCBI defines the names used for SOURCE and ORGANISM
according to the latest GenBank release notes:

SOURCE - Common name of the organism or the name most frequently used
in the literature. Mandatory keyword in all annotated entries/one or
more records/includes one subkeyword.
ORGANISM - Formal scientific name of the organism (first line) and
taxonomic classification levels (second and subsequent lines).
Mandatory subkeyword in all annotated entries/two or more records.

According to their sample file page
(http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html), the SOURCE
is this:

Free-format information including an abbreviated form of the organism
name, sometimes followed by a molecule type. (See section 3.4.10 of
the GenBank release notes for more info.)

The SOURCE can also include the organelle and also may include
additional information, such as an abbreviated name and a common name
in parentheses.

...
SOURCE      Saccharomyces cerevisiae (baker's yeast)
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes;
            Saccharomycetales; Saccharomycetaceae; Saccharomyces.
...

Setting scientific_name() isn't a problem; according to the above
definition, it is the full name on the ORGANISM line. The lineage (or
classification() array) is also straightforward.

The common_name(), though, as used by Bio::SeqIO::genbank, is the
entire SOURCE line (not just the abbreviated name, but the name and
everything else). No additional parsing is performed on it.
write_seq() also seems to do the wrong thing when rebuilding the
SOURCE line as well, as the method writes the subspecies to the line.

I plan on using Bio::SeqIO::genbank as a guinea pig of sorts to try
using Bio::Taxonomy::Node objects instead of Bio::Species, then get
the parsing for these lines corrected and simplified. Essentially,
the way NCBI describes it, the main name on the line is actually the
free-form abbreviated name, the name in parentheses is the common
name (optionally present), and the organelle precedes all of these if
present.
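
A minimal sketch of that reading of the SOURCE line, in plain Perl
with no BioPerl dependency. The function name parse_source_line and
the list of organelle prefixes are assumptions made up for
illustration, not existing BioPerl code. Because the field is
documented as free-format, a pattern like this can only be expected
to cover the common cases.

```perl
use strict;
use warnings;

# Assumed, non-exhaustive list of organelle words that may lead the line.
my @ORGANELLES = ('mitochondrion', 'chloroplast', 'plastid', 'kinetoplast');

# Hypothetical helper: split a GenBank SOURCE line into an optional
# organelle prefix, the free-form abbreviated name, and an optional
# common name given in trailing parentheses.
sub parse_source_line {
    my ($line) = @_;
    my %source;

    # Drop the SOURCE keyword and any trailing whitespace.
    (my $text = $line) =~ s/^\s*SOURCE\s+//;
    $text =~ s/\s+$//;

    # The organelle, if present, precedes everything else.
    for my $organelle (@ORGANELLES) {
        if ($text =~ s/^\Q$organelle\E\s+//i) {
            $source{organelle} = $organelle;
            last;
        }
    }

    # A trailing "(...)" is taken to be the common name.
    if ($text =~ s/\s*\(([^()]+)\)$//) {
        $source{common_name} = $1;
    }

    # Whatever is left is treated as the abbreviated name.
    $source{abbreviated_name} = $text;
    return \%source;
}

my $parsed = parse_source_line(
    "SOURCE      Saccharomyces cerevisiae (baker's yeast)");
print "$_: $parsed->{$_}\n" for sort keys %$parsed;
```

For the yeast record above this yields an abbreviated name of
"Saccharomyces cerevisiae" and a common name of "baker's yeast", with
no organelle set.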
I want to try getting common_name() to match the common name found
for taxonomy (baker's yeast) rather than have it be a simple
container, add an abbreviated_name() method for the name container
for the SOURCE line, and have the organelle() method actually be used
if an organelle is present (it doesn't seem to be set at the moment
in SeqIO::genbank).

Right now, I have NO idea how EMBL, DDBJ, other formats deal with
organism info; I would think that the main three
(GenBank/EMBL-SwissProt/DDBJ) handle them similarly...(Famous Last
Words)

I also propose (I'll probably get yelled at here) NOT actively
supporting additional parsing of species, subspecies, etc directly
from a file w/o a DB lookup. As in, leave species, subspecies, genus
parsing from the flatfile as is (no longer support it) or remove it
completely and leave them unset. I haven't looked, but I have a
strong feeling that the species parsing in Bio::SeqIO is different
from format to format. It really seems like more trouble than it's
worth to maintain this, especially as there is perfectly valid
Taxonomy database information available either locally using a
flatfile or via Entrez. If people want to have reliable
$species->species or $species->genus for taxonomy information, they
will need to have the db_handle() set for the Bio::Taxonomy::Node
object and have a Node-based method to reset species, genus, etc to
the tax database information (maybe reset_taxon or something along
those lines).

Okay, rambled on enough. Any thoughts?

Christopher Fields
Postdoctoral Researcher
Lab of Dr.
Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From hlapp at gmx.net Sun Jul 23 23:40:45 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Sun, 23 Jul 2006 19:40:45 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BF86AF.8080408@sendu.me.uk> References: <44BBBB69.6000906@sendu.me.uk> <44BF86AF.8080408@sendu.me.uk> Message-ID: On Jul 20, 2006, at 9:35 AM, Sendu Bala wrote: > I'll describe all the changes I've now made and if no-one complains > I'll > commit. (I've also made these notes into bug 2047 for easier reference > in the future.) > > Bio::DB::Taxonomy::flatfile > --------------------------- > [...] > > BEHAVIOUR-CHANGE: flatfile used to store within the nodes it makes the > division as a three letter code, like 'PRI'. However, for consistency > with entrez and the scientific_name() of the node the division is > supposed to correspond to, it is now stored as the full name, like > 'Primates'. What about adding a method division_code() which would return the 3- letter abbreviation? The abbreviation may be needed by flat-file writers, so it may be handy to have in some cases. > > The names->id solution also stores the artificially uniqued names like > 'Craniata ', allowing you for the first time to retrieve the > correct id. Previously the search would have simply failed completely. > > The names->id solution now handles nodes with scientific names of 'xyz > (class)', allowing you to retrieve the id with both get_taxonids > ('xyz') > and get_taxonids('xyz (class)'). Previously only the latter would > work. Should angle brackets be allowed too? > > NOTE: the previous 2 changes (and the issues with entrez, see below) > make flatfile better at searching the taxonomy database than entrez > module or the website, both in terms of speed and completeness of > results. 
> > BEHAVIOUR-CHANGE: The scientific name field isn't touched in any way, > always being sent directly to Bio::Taxonomy::Node->new(-name => > $untouched) Maybe there should also be a -names parameter which accepts a hash reference with keys being the kind of name (scientific, common, etc) and the values being array references with the set of names of that kind? > or the $node->classification() array. Bio::Taxonomy::Node shouldn't have this attribute. It is legacy brought over from a flawed (because flat) object model in Bio::Species. > [...] > > Bio::DB::Taxonomy::entrez > ------------------------- > > # Bug-fixes > Special characters like ", ( and ) in the input query string to > get_taxonid() result in the failure or inaccuracy of the search. These > characters are now removed prior to submission, allowing for correct > search results. > API-CHANGE: entrez has always been able to return multiple ids that > match a single input name, so I've renamed get_taxonid() to > get_taxonids() and it returns an array of ids in list context. It > returns one of the ids in scalar context. For backward compatibility, > *get_taxonid = \&get_taxonids. Sounds good to me. > NOTE: entrez modules (and website) cannot cope with '' > in the > query, failing searches like 'Craniata '. For this > reason, if > get_taxonids() is given a query with '' it will immediately > return undefined, saving a pointless website access. If there is a 'next-best-thing' that is still semantically compatible with the API documentation, I would do that. In this case, if there is a in the query the entrez module should strip it and automatically use the rest for searching. If indeed multiple IDs match there should be a warning to inform the user that entrez cannot use the notation to limit the query results. In fact, you might as well provide an option to enable an automatic check for the correct branch for each ID if multiple ones are returned. 
I.e., if this option is enabled, the module would automatically query the parent nodes to see if is in the lineage, and if not will remove the respective ID from the result set. The reason you may want to make it optional is because it potentially costs time. (but in reality I'm not sure why a client will not want to enable the option - so maybe this should even be default) > If you want the id > of 'Craniata ' you must search for 'Craniata', then get the > node for each returned id to see which one has a parent node with a > scientific_name() or common_names() case-insensitive matching to > 'chordata'. Yep, see above. The more burden you can shield from the user the better. > [...] > Bio::Taxonomy::Node > ------------------- > [...] > classification() has a proper solution to finding the classification > when the array wasn't manually set. > > # Improvements > BEHAVIOUR-CHANGE: node_name() used to be an alias to name > ('common'). Now > it is an alias to name('scientific'). > NOTE: node_name is what is set when ->new(-name => $name) is set, so > flatfile and entrez and user-created nodes now implicitly associate > the > name of the node they create with its scientific name. I'm not even sure node_name() should just be deprecated. The methods falsely suggests that there is only a single and definitive name for the taxon node. In NCBI reality, this is only true for the scientific name of the node. In real reality, many nodes have multiple scientific names - taxonomy isn't static and therefore the scientific naming of nodes isn't either. > [...] > Thanks for the work, all other changes sound great. Thanks also to Chris for assisting! Looks like this is in much better shape now than before. 
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at gmx.net Sun Jul 23 23:44:23 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Sun, 23 Jul 2006 19:44:23 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44BD147A.9020103@sendu.me.uk> References: <003201c6aa81$01db9a30$15327e82@pyrimidine> <44BD147A.9020103@sendu.me.uk> Message-ID: <6701FDC3-7630-48EB-9B21-5BB5C566424E@gmx.net> On Jul 18, 2006, at 1:03 PM, Sendu Bala wrote: > > [regarding changes to Bio::Taxonomy::Node] > > Actually, I'm really strongly leaning toward getting rid of the > following methods and new() options (and giving up entirely on being > able to keep 'sapiens' somewhere): > > -organelle, organelle() > -division, division() > -sub_species, sub_species() > -variant, variant() > species(), validate_species_name() > genus() > binomial() > > As far as I can see none of these methods have any place in a generic > Node class. I agree. Some of them are a special case for genbank files (organelle () etc), and the rest is legacy from Bio::Species. -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at gmx.net Mon Jul 24 00:48:22 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Sun, 23 Jul 2006 20:48:22 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <9777D0A4-5340-4367-879D-F9125393B4FD@uiuc.edu> References: <001001c6ac2f$02c8c3f0$15327e82@pyrimidine> <44C00805.7090403@sendu.me.uk> <9777D0A4-5340-4367-879D-F9125393B4FD@uiuc.edu> Message-ID: <21C7763A-B8FF-4E91-BAEC-E5F79E462DD3@gmx.net> On Jul 21, 2006, at 12:51 AM, Chris Fields wrote: > my $db = Bio::DB::Taxonomy->new(-source => 'entrez'); > > # normally not needed as this is set by default internally, but as a > demo here... 
> $species->db_handle($db); > > # reset the appropriate data (genus, species, etc) based on Entrez > tax data > $species->reset_data(); # this method, BTW, doesn't exist yet but > should be easy to implement Don't call this reset_data() as it may be misleading (usually reset() means to revert into a native or original state). Instead, you would use fetch_from_db() or something. However, it seems redundant to me to begin with. If we ignore for a second that the return value in the following isn't exactly compatible, why would you not just call $species = $db->get_Taxonomy_Node(-taxonid => $species->ncbi_taxid); So I think more than anything else, this should be made to work, and you would have a more seamless interface. > Short and sweet summary: > > Sendu volunteered making changes to Bio::Taxonomy::Node and related > modules; > we disagreed on exactly what changes should be made. Sendu wanted a > stripped-down version of Bio::Taxonomy::Node; I wanted one which would > support similar methods as in Bio::Species. Bio::Species should be considered legacy; I think it is flawed as an object model because it imposes a flat view on something which in reality is only a node in a tree and not flat at all. The only real need for the flat view came from the desire to write sequence files - for all other purposes the classification() etc attributes are useless anyway. I.e., binomial() and common_name() (corresponding to scientific_name () and names('common')) are the only real useful attributes, the rest is baggage for writing sequence files. The baggage should not be passed on to a better model ... Instead, there should be a separate module (in essence a Bio::Species factory) which can translate a Bio::Taxonomy::Node into a Bio::Species object - if the rank is 'species' or below. 
Alternatively, you could have a Bio::Taxonomy::SpeciesNode object which implements both APIs and can be initialized with either a Bio::Taxonomy::Node instance, or the combination of a Bio::Species and a db handle. At any rate, I think Bio::Taxonomy::Node should be stripped of legacy methods that are only there to achieve Bio::Species compatibility. > > I suggested have a common interface module, one for Node and > another for > Species; both implement the same interface methods (NodeI maybe), > so you > could use either a bare-bones Node or a full-fledged Species > object. I then > suggested this new version of Species could replace Bio::Species. > We could > worry about which one to use for Bio::DB::Taxonomy* later. I'm not following here... How would this look like? What would the API (s) be? > > We both agreed. Everybody's happy. Happiness is great, so maybe you shouldn't bother about me not following... > I still plan on switching Bio::DB::Taxonomy::entrez to use > Bio::DB::EUtilities at some point Wouldn't that rather be Bio::DB::Taxonomy::eutil? > I may > add a method for retrieving tax data based on protein/nucleotide > sequence > primary ID and relevant sequence database, so you could directly > retrieve > the relevant TaxID w/o parsing sequences directly for them. This > would > mainly be useful if you gather GIs from a BLAST search, for instance. > > Anyway, I could add this in then base class Bio::DB::Taxonomy > directly so > one could used the retrieved TaxIDs for flat-file or entrez > searches; this > requires, of course, access to the remote Entrez database (it would > use > ELink). Would that be of interest? If you add the API methods for this to the base class (which in this case is close in concept to an interface, much like Bio/SeqIO.pm), then make clear that not every database will allow you to implement this. 
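
One way to make that explicit in code, sketched here under
assumptions: the base class declares the method but fails loudly
unless a backend overrides it, and a capability check lets callers
ask first. All package and method names below (My::Taxonomy::Base,
My::Taxonomy::Remote, My::Taxonomy::Flatfile,
get_taxonids_for_sequence_ids, can_map_sequence_ids) are invented for
illustration and are not existing BioPerl APIs; the lookup data is a
placeholder.

```perl
use strict;
use warnings;

package My::Taxonomy::Base;

sub new {
    my ($class, %args) = @_;
    return bless {%args}, $class;
}

# Optional capability. Backends that cannot map sequence IDs to taxon
# IDs inherit this default, which dies with a clear message.
sub get_taxonids_for_sequence_ids {
    my ($self) = @_;
    die ref($self)
        . " does not support looking up taxon IDs from sequence IDs\n";
}

# Lets callers test for the capability instead of trapping the error.
sub can_map_sequence_ids { return 0 }

package My::Taxonomy::Remote;
our @ISA = ('My::Taxonomy::Base');

sub can_map_sequence_ids { return 1 }

# Placeholder implementation: a real backend would query a remote
# service here rather than an in-memory hash.
sub get_taxonids_for_sequence_ids {
    my ($self, @seq_ids) = @_;
    my %lookup = %{ $self->{lookup} || {} };
    return map { $lookup{$_} } grep { exists $lookup{$_} } @seq_ids;
}

package My::Taxonomy::Flatfile;
our @ISA = ('My::Taxonomy::Base');    # keeps the "not supported" default

package main;

my @backends = (
    My::Taxonomy::Remote->new(lookup => { 'seq1' => 'taxA' }),
    My::Taxonomy::Flatfile->new,
);

for my $db (@backends) {
    if ($db->can_map_sequence_ids) {
        my @ids = $db->get_taxonids_for_sequence_ids('seq1');
        print ref($db), ": ", join(',', @ids), "\n";
    }
    else {
        print ref($db), ": not supported\n";
    }
}
```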
>
>            |------Node
> NodeI----|
>            |------Species
>
> Another option would be to have Bio::Taxonomy::Node itself stripped
> down, then have another class (Bio::Taxonomy::Species) inherit
> methods from it and also implement additional methods (genus(),
> species(), etc).

I think this would be the way to go. I.e.,

         |------Node
NodeI----|
         |-|
           |----SpeciesNode
Species----|

This way the NodeI interface and its direct implementors are kept
free of legacy.

	-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From cjfields at uiuc.edu Mon Jul 24 00:43:45 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Sun, 23 Jul 2006 19:43:45 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <6701FDC3-7630-48EB-9B21-5BB5C566424E@gmx.net>
References: <003201c6aa81$01db9a30$15327e82@pyrimidine> <44BD147A.9020103@sendu.me.uk> <6701FDC3-7630-48EB-9B21-5BB5C566424E@gmx.net>
Message-ID: <5F6027E0-A504-4019-8DAB-C50DF9EB6E18@uiuc.edu>

As an aside, the 'source' seqfeature in a GenBank file contains some
of the following information as tags; that's where the NCBI tax ID is
taken from in Bio::SeqIO::genbank:

FEATURES             Location/Qualifiers
     source          1..814
                     /organism="Porterinema fluviatile"
                     /organelle="plastid:chloroplast"
                     /mol_type="genomic DNA"
                     /strain="SAG 124.79"
                     /db_xref="taxon:246123"
                     /country="Germany"
...

So, variant(), organelle(), and ncbi_taxid() could all be set from
the same point in Bio::SeqIO::genbank.

I suggested an option to Sendu, but I'd like to hear your thoughts on
this since this will possibly affect bioperl-db. We could have two
Node-like Taxonomy objects using a common interface class
(Bio::Taxonomy::NodeI): Bio::Taxonomy::Node (stripped down version),
and Bio::Taxonomy::Species (the sequence-based NodeI-implementing
object, which would retain the other Bio::Species-like methods).
Bio::Taxonomy::Species would act sort of as an 'entry point' for Bio::Taxonomy from sequences; moving up or down the tax node hierarchy gets Tax::Node objects, unless you are specifically at a 'species'-ranked node (though this could be just a Tax::Node as well). BTW, I have managed to get Bio::SeqIO::genbank switched over to Bio::Taxonomy::Node (er... Bio::Taxonomy::Species); all tests pass. I was quite surprised how easy it was. It shouldn't be too hard to get a NodeI/Node/Species class hierarchy set up if everybody thinks it's worth it. Then we could deprecate Bio::Species! Chris On Jul 23, 2006, at 6:44 PM, Hilmar Lapp wrote: > > On Jul 18, 2006, at 1:03 PM, Sendu Bala wrote: > >> >> [regarding changes to Bio::Taxonomy::Node] >> >> Actually, I'm really strongly leaning toward getting rid of the >> following methods and new() options (and giving up entirely on being >> able to keep 'sapiens' somewhere): >> >> -organelle, organelle() >> -division, division() >> -sub_species, sub_species() >> -variant, variant() >> species(), validate_species_name() >> genus() >> binomial() >> >> As far as I can see none of these methods have any place in a generic >> Node class. > > I agree. Some of them are a special case for genbank files (organelle > () etc), and the rest is legacy from Bio::Species. > > -hilmar > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. 
Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign

From hlapp at gmx.net Mon Jul 24 00:58:32 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sun, 23 Jul 2006 20:58:32 -0400
Subject: [Bioperl-l] Bio::Species/Bio::Taxonomy changes
In-Reply-To: <7A7BF302-D84E-427B-ABC0-3EB4D19B33E2@uiuc.edu>
References: <7A7BF302-D84E-427B-ABC0-3EB4D19B33E2@uiuc.edu>
Message-ID:

On Jul 23, 2006, at 4:53 PM, Chris Fields wrote:

> I also propose (I'll probably get yelled at here) NOT actively
> supporting additional parsing of species, subspecies, etc directly
> from a file w/o a DB lookup. As in, leave species, subspecies, genus
> parsing from the flatfile as is (no longer support it) or remove it
> completely and leave them unset.

Note that most (as in: most used, not most taxa) cases are actually
straightforward. I don't think removing what's there is desirable;
everyone just needs to understand that it will recognize only a
limited number of syntactical variations, and beyond that, if you
want correct taxon attributes you will need a database (be it
flatfile, eutil, whatever) lookup.

> If people want to
> have reliable $species->species or $species->genus for taxonomy
> information, they will need to have the db_handle() set for the
> Bio::Taxonomy::Node object and have a Node-based method to reset
> species, genus, etc to the tax database information (maybe
> reset_taxon or something along those lines).

That's what I've been saying all along.
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From cjfields at uiuc.edu Mon Jul 24 03:30:07 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Sun, 23 Jul 2006 22:30:07 -0500 Subject: [Bioperl-l] Bio::Species/Bio::Taxonomy changes In-Reply-To: References: <7A7BF302-D84E-427B-ABC0-3EB4D19B33E2@uiuc.edu> Message-ID: <28D3470B-DA8F-4C41-96C7-F0D0DE89BAEE@uiuc.edu> On Jul 23, 2006, at 7:58 PM, Hilmar Lapp wrote: > > On Jul 23, 2006, at 4:53 PM, Chris Fields wrote: > >> I also propose (I'll probably get yelled at here) NOT actively >> supporting additional parsing of species, subspecies, etc directly >> from a file w/o a DB lookup. As in, leave species, subspecies, genus >> parsing from the flatfile as is (no longer support it) or remove it >> completely and leave them unset. > > Note that most (as in: most used, not most taxa) cases are actually > straightforward. I don't think removing what's there is desirable, > just everyone needs to understand that it will recognize only a > limited number of syntactical variations, and beyond that if you > want correct taxon attributes you will a database (be it flatfile, > eutil, whatever) lookup. Aha! We seem to agree on that... >> If people want to >> have reliable $species->species or $species-genus for taxonomy >> information, they will need to have the db_handle() set for the >> Bio::Taxonomy::Node object and have an Node-based method to reset >> species, genus, etc to the tax database information (maybe >> reset_taxon or something along those lines). > > That's what I've saying all along. 
> > -hilmar > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== I thought you had mentioned something about this a few months back on EMBL format issues with organism data. Anyway, I don't think it was from anybody disagreeing with you as much as it was one of the project priorities that sort of got lost in the shuffle. I'm sure Sendu will like having a bit of freedom with Bio::Taxonomy::Node. Anyway, I'll do what I can within reason; I have to leave next weekend for a 5-day conference. Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From bix at sendu.me.uk Mon Jul 24 08:21:55 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Mon, 24 Jul 2006 09:21:55 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <21C7763A-B8FF-4E91-BAEC-E5F79E462DD3@gmx.net> References: <001001c6ac2f$02c8c3f0$15327e82@pyrimidine> <44C00805.7090403@sendu.me.uk> <9777D0A4-5340-4367-879D-F9125393B4FD@uiuc.edu> <21C7763A-B8FF-4E91-BAEC-E5F79E462DD3@gmx.net> Message-ID: <44C48323.5060704@sendu.me.uk> Hilmar Lapp wrote: > On Jul 21, 2006, at 12:51 AM, Chris Fields wrote: > >> my $db = Bio::DB::Taxonomy->new(-source => 'entrez'); >> >> # normally not needed as this is set by default internally, but as a >> demo here... >> $species->db_handle($db); >> >> # reset the appropriate data (genus, species, etc) based on Entrez >> tax data >> $species->reset_data(); # this method, BTW, doesn't exist yet but >> should be easy to implement > > Don't call this reset_data() as it may be misleading (usually reset() > means to revert into a native or original state). Instead, you would > use fetch_from_db() or something. > > However, it seems redundant to me to begin with. 
If we ignore for a
> second that the return value in the following isn't exactly
> compatible, why would you not just call
>
> $species = $db->get_Taxonomy_Node(-taxonid => $species->ncbi_taxid);

If Bio::Species was a Bio::Taxonomy, and we had FactoryI implementing
classes or similar, we would say:

$species = $factory->fetch(-taxon_id => $species->ncbi_taxid);

> Instead, there should be a separate module (in essence a Bio::Species
> factory) which can translate a Bio::Taxonomy::Node into a
> Bio::Species object - if the rank is 'species' or below.

I don't think a 'translation' module is necessary. Bio::Species can
just be a Bio::Taxonomy.

> At any rate, I think Bio::Taxonomy::Node should be stripped of legacy
> methods that are only there to achieve Bio::Species compatibility.

Yes :)

> I think this would be the way to go. I.e.,
>
>
>          |------Node
> NodeI----|
>          |-|
>            |----SpeciesNode
> Species----|

Actually, if we're changing the name of the module that Species
interacts with, any existing code needs to be re-written. So why not
just do it properly and have Bio::Species interact with Bio::Taxonomy?

                  |----Bio::Taxonomy
Bio::TaxonomyI----|
                  |----Bio::Species

Or

Bio::TaxonomyI----|----Bio::Taxonomy----|----Bio::Species

Leaving Node completely free to be just a node. This way we don't
have a crufty SpeciesNode there simply for the sake of Bio::Species.
Bio::Species itself provides all the legacy stuff it needs for itself,
while interacting with Nodes via TaxonomyI methods in the 'correct'
way only.
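
A rough sketch of the second layout, with a species class that is a
taxonomy and derives its flat, legacy-style accessors from plain
nodes. This is only one possible reading of the proposal: the
packages My::Node, My::Taxonomy and My::Species and all their methods
are hypothetical stand-ins, not the real Bio::* classes, and the
lineage is abbreviated.

```perl
use strict;
use warnings;

# A node knows only its own name and rank.
package My::Node;
sub new  { my ($class, %args) = @_; return bless {%args}, $class }
sub name { return $_[0]{name} }
sub rank { return $_[0]{rank} }

# A taxonomy is an ordered lineage of nodes, root first.
package My::Taxonomy;
sub new {
    my ($class, @nodes) = @_;
    return bless { nodes => [@nodes] }, $class;
}
sub nodes { return @{ $_[0]{nodes} } }
sub node_with_rank {
    my ($self, $rank) = @_;
    my ($node) = grep { $_->rank eq $rank } $self->nodes;
    return $node;
}

# The flat accessors live only here and are computed from the nodes,
# so the node class itself carries no species-specific baggage.
package My::Species;
our @ISA = ('My::Taxonomy');
sub genus {
    my $node = $_[0]->node_with_rank('genus');
    return $node ? $node->name : undef;
}
sub binomial {
    my $node = $_[0]->node_with_rank('species');
    return $node ? $node->name : undef;
}
sub classification {
    # Most specific name first, root last.
    my @names = map { $_->name } $_[0]->nodes;
    return reverse @names;
}

package main;

my $species = My::Species->new(
    My::Node->new(name => 'Eukaryota',     rank => 'superkingdom'),
    My::Node->new(name => 'Saccharomyces', rank => 'genus'),
    My::Node->new(name => 'Saccharomyces cerevisiae', rank => 'species'),
);
print $species->binomial, "\n";
print join('; ', $species->classification), "\n";
```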
From bix at sendu.me.uk Mon Jul 24 07:58:57 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Mon, 24 Jul 2006 08:58:57 +0100 Subject: [Bioperl-l] Bio::Species/Bio::Taxonomy changes In-Reply-To: <7A7BF302-D84E-427B-ABC0-3EB4D19B33E2@uiuc.edu> References: <7A7BF302-D84E-427B-ABC0-3EB4D19B33E2@uiuc.edu> Message-ID: <44C47DC1.8020503@sendu.me.uk> Chris Fields wrote: > Sendu, Hilmar, et al, > > I was looking through SeqIO::genbank and though I would bring up a > couple of things to think about re: GenBank Taxonomy information. [...] > SOURCE - Common name of the organism or the name most frequently used > in the literature. Mandatory keyword in all annotated entries/one or > more records/includes one subkeyword. [...] > Free-format information including an abbreviated form of the organism > name, sometimes followed by a molecule type. (See section 3.4.10 of > the GenBank release notes for more info.) > > The SOURCE can also include the organelle and also may include > additional information, such as an abbreviated name and a common name > in parentheses. More specifically: (from 3.4.10 ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt) The SOURCE field consists of two parts. The first part is found after the SOURCE keyword and contains free-format information including an abbreviated form of the organism name followed by a molecule type; multiple lines are allowed, but the last line must end with a period. The second part consists of information found after the ORGANISM subkeyword. The formal scientific name for the source organism (genus and species, where appropriate) is found on the same line as ORGANISM. The records following the ORGANISM line list the taxonomic classification levels, separated by semicolons and ending with a period. > The common_name (), though as used by Bio::SeqIO::genbank, is the > entire SOURCE line (not just the abbreviated name, but the name and > everything else). No additional parsing is performed on it. 
> write_seq() also seems to do the wrong thing when rebuilding the
> SOURCE line as well as the method writes the subspecies to the line.
>
> I plan on using Bio::SeqIO::genbank as a guinea pig of sorts to try
> using Bio::Taxonomy::Node objects instead of Bio::Species, then get
> the parsing for these lines corrected and simplified. Essentially,
> the way NCBI describes it, the main name on the line is actually the
> free-form abbreviated name, the name in parentheses is the common
> name (optionally present), and the organelle precedes all of these if
> present. I want to try getting common_name() to match the common
> name found for taxonomy (baker's yeast) rather than have it be a
> simple container, add an abbreviated_name() method for the name
> container for the SOURCE line, and have the organelle() method
> actually be used if an organelle is present (it doesn't seem to be
> set at the moment in SeqIO::genbank).

This is not how I read the specification. Everything on the same
line as 'SOURCE' is free-format text and therefore cannot be parsed.
For the purposes of writing out it must be stored as-is, but it serves
no other useful purpose.

The file also provides the scientific name, which can be used to do an
accurate database lookup, which in turn gives you access to the common
names, like "baker's yeast".

On a side note, why would we care about 'organelle' when we're dealing
with taxonomy? Why does the NCBI taxonomy db have a slot for organelle?

From bix at sendu.me.uk Mon Jul 24 08:45:38 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Mon, 24 Jul 2006 09:45:38 +0100
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To:
References: <44BBBB69.6000906@sendu.me.uk> <44BF86AF.8080408@sendu.me.uk>
Message-ID: <44C488B2.5070806@sendu.me.uk>

Hilmar Lapp wrote:
> On Jul 20, 2006, at 9:35 AM, Sendu Bala wrote:
>
>> Bio::DB::Taxonomy::flatfile
>> ---------------------------
>> [...]
>>
>> BEHAVIOUR-CHANGE: flatfile used to store within the nodes it makes the
>> division as a three letter code, like 'PRI'. However, for consistency
>> with entrez and the scientific_name() of the node the division is
>> supposed to correspond to, it is now stored as the full name, like
>> 'Primates'.
>
> What about adding a method division_code() which would return the
> 3-letter abbreviation?
>
> The abbreviation may be needed by flat-file writers, so it may be
> handy to have in some cases.

As far as I know you can't get the 3-letter version via entrez, so no
other module can really expect to be able to get it, not knowing which
database (flatfile.pm or entrez.pm) the taxonomic information is coming
from. But of course it would be somewhat harmless to add division_code()
anyway. It might be better done as a -code => 1 option to division()?

>> The names->id solution also stores the artificially uniqued names like
>> 'Craniata <chordata>', allowing you for the first time to retrieve the
>> correct id. Previously the search would have simply failed completely.
>>
>> The names->id solution now handles nodes with scientific names of 'xyz
>> (class)', allowing you to retrieve the id with both
>> get_taxonids('xyz') and get_taxonids('xyz (class)'). Previously only
>> the latter would work.
>
> Should angle brackets be allowed too?

Allowed in what sense? You can indeed search for both
get_taxonids('Craniata <chordata>') [returns a single id] and
get_taxonids('Craniata') [returns multiple ids, one of which is the
previous answer].

> Maybe there should also be a -names parameter which accepts a hash
> reference with keys being the kind of name (scientific, common, etc)
> and the values being array references with the set of names of that
> kind?

Not sure what you mean. name() has that data structure, though you're
not supposed to set its hash ref directly.

>> or the $node->classification() array.
>
> Bio::Taxonomy::Node shouldn't have this attribute.
It is legacy > brought over from a flawed (because flat) object model in Bio::Species. Yes, I agree. >> NOTE: entrez modules (and website) cannot cope with '' >> in the >> query, failing searches like 'Craniata '. For this >> reason, if >> get_taxonids() is given a query with '' it will immediately >> return undefined, saving a pointless website access. > > If there is a 'next-best-thing' that is still semantically compatible > with the API documentation, I would do that. > > In this case, if there is a in the query the entrez > module should strip it and automatically use the rest for searching. > If indeed multiple IDs match there should be a warning to inform the > user that entrez cannot use the notation to limit the > query results. I wouldn't like this. I actually had it working this way initially, but decided that if someone entered 'xyz ' they really didn't want multiple ids, expected to get multiple ids with just 'xyz' and don't want their query made something else and then be warned about it. > In fact, you might as well provide an option to enable an automatic > check for the correct branch for each ID if multiple ones are > returned. I.e., if this option is enabled, the module would > automatically query the parent nodes to see if is in the > lineage, and if not will remove the respective ID from the result > set. The reason you may want to make it optional is because it > potentially costs time. (but in reality I'm not sure why a client > will not want to enable the option - so maybe this should even be > default) I can certainly add that, it seems like a good idea. I don't, however, see any scope for an option at all. What would the option be called? -don't_give_me_the_answer_I_actually_want_to_save_time ? Pointless, imho. If the user queries 'xyz ' with that option, they're just going to have to do for themselves manually what the method would have done for them without that option, in order to get the correct answer. 
It'll be slower that way, if anything. So the option would actually be called -don't_give_me_the_answer_I_actually_want_so_I_can_get_it_myself_a_little_slower (!). >> Bio::Taxonomy::Node >> ------------------- >> [...] >> classification() has a proper solution to finding the classification >> when the array wasn't manually set. >> >> # Improvements >> BEHAVIOUR-CHANGE: node_name() used to be an alias to name >> ('common'). Now >> it is an alias to name('scientific'). >> NOTE: node_name is what is set when ->new(-name => $name) is set, so >> flatfile and entrez and user-created nodes now implicitly associate >> the >> name of the node they create with its scientific name. > > I'm not even sure node_name() should just be deprecated. The methods > falsely suggests that there is only a single and definitive name for > the taxon node. > > In NCBI reality, this is only true for the scientific name of the > node. In real reality, many nodes have multiple scientific names - > taxonomy isn't static and therefore the scientific naming of nodes > isn't either. For the programmer not using any database but just making up his own nodes, I think he needs a node_name() because he may not be thinking about anything fancy or realistic. He just want to give his node a single name that he invents. node_name() seems like the ideal method name to me. From jaynelvallance at hotmail.com Mon Jul 24 09:45:50 2006 From: jaynelvallance at hotmail.com (Jayne Vallance) Date: Mon, 24 Jul 2006 09:45:50 +0000 Subject: [Bioperl-l] SearchIO - Stop throwing away data Message-ID: Hi I developing someone elses work. I wondered whether anyone could identify the mistake that the previous coder made? I am not very familiar with SearchIO yet. They are trying to extract filenames from an output report. 
This is their code:

# store the query name of the mito db blast hits into an array
my $searchio = new Bio::SearchIO( -file => $blast_mito_output );
# array to store the mitochondrial BLAST database hits
my @mito_hits;
# name of query for BLAST hit
my $query_name;

while ( my $result = $searchio->next_result() ) {
    # get the hits and their associated name
    # do not want to include these in the clustering step
    while( my $hit = $result->next_hit ) {
        # store the names of these hits into an array
        # these filenames will not be copied over
        $query_name = $result->query_name();
        #print "\nQuery $query_name\n";
        push(@mito_hits, $query_name);
    }
}

I think they have based it on the code at
http://www.bioperl.org/wiki/HOWTO:SearchIO#Authors

use Bio::SearchIO;
use Bio::SearchIO::FastHitEventBuilder;
my $searchio = new Bio::SearchIO(-format => $format, -file => $file);

$searchio->attach_EventHandler(Bio::SearchIO::FastHitEventBuilder->new);
while( my $r = $searchio->next_result ) {
    while( my $h = $r->next_hit ) {
        # Hits will NOT have HSPs
        print $h->significance,"\n";
    }
}

which "throws away data you don't want"???

I am finding that our code is finding the last file name in the output
report, rather than each and every one. I suspect it is overwriting (or
throwing away the data).

How do I need to change the code to make sure *every* file name goes into
@mito_hits?

Thank you

Jayne

_________________________________________________________________
The new MSN Search Toolbar now includes Desktop search!
http://join.msn.com/toolbar/overview From simon.andrews at bbsrc.ac.uk Mon Jul 24 11:14:08 2006 From: simon.andrews at bbsrc.ac.uk (simon andrews (BI)) Date: Mon, 24 Jul 2006 12:14:08 +0100 Subject: [Bioperl-l] SearchIO - Stop throwing away data In-Reply-To: Message-ID: > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org > [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of > Jayne Vallance > Sent: 24 July 2006 10:46 > To: bioperl-l at lists.open-bio.org > Subject: [Bioperl-l] SearchIO - Stop throwing away data > > Hi > > I developing someone > elses work. I wondered whether anyone could identify the > mistake that the previous coder made? > I am not very familiar with SearchIO yet. > > They are trying to extract filenames from an output report. I'm not sure what you mean by filenames here. The value which is being collected in your code snippet is the name of the original query sequence. > This is their code: > while ( my $result = $searchio->next_result() ) { > # get the hits and their associated name > # do not want to include these in the clustering step > while( my $hit = $result->next_hit ) { > # store the names of these hits into an array > # these filenames will not be copied over > $query_name = $result->query_name(); > #print "\nQuery $query_name\n"; > push(@mito_hits, $query_name); OK, this bit is odd. You're collecting the name of the query sequence but you're doing it as you're looping through the hits. Since all the hits come from the same result you're just going to get the same query name put into your array multiple times (once for each hit). This almost certainly isn't what you want. If you just want the name of the query sequence you can miss out the inner loop (the $result->next_hit() loop). 
If you actually want to collect the names of the sequences which were hit then you need to collect $hit->name() rather than $result->query_name(); > } > } > > I think they have based it on the code at > http://www.bioperl.org/wiki/HOWTO:SearchIO#Authors > $searchio->attach_EventHandler(Bio::SearchIO::FastHitEventBuil der->new); > while( my $r = $searchio->next_result ) { while( my $h = > $r->next_hit ) { > # Hits will NOT have HSPs > print $h->significance,"\n"; > } > > which "throws away data you don't want"??? Indeed, but probably not in the way you're thinking. The data it throws away is the details of each individual HSP (mostly the alinment data). You're not using hsp data in your code so it will have no effect (other than making it a bit quicker). It doesn't throw away whole hits or anything like that. > I am finding that our code is finding the last file name in > the ouput report, rather than each and every one. I suspect > it is overwriting (or throwing away the data). I suspect then that you should be collecting the hit names rather than the query names? Simon. From hlapp at gmx.net Mon Jul 24 12:20:00 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Mon, 24 Jul 2006 08:20:00 -0400 Subject: [Bioperl-l] Bio::Species/Bio::Taxonomy changes In-Reply-To: <44C47DC1.8020503@sendu.me.uk> References: <7A7BF302-D84E-427B-ABC0-3EB4D19B33E2@uiuc.edu> <44C47DC1.8020503@sendu.me.uk> Message-ID: <27E70EEF-28F6-4B54-9CCB-6090E338CEFB@gmx.net> On Jul 24, 2006, at 3:58 AM, Sendu Bala wrote: > On a side note, why would we care about 'organelle' when we're dealing > with taxonomy? Why does the NCBI taxonomy db have a slot for > organelle? Because some sequences are of the organelle DNA, and Genbank needs a way to express this. Highly artificial, but still can't be ignored. 
-- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at gmx.net Mon Jul 24 12:27:28 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Mon, 24 Jul 2006 08:27:28 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C488B2.5070806@sendu.me.uk> References: <44BBBB69.6000906@sendu.me.uk> <44BF86AF.8080408@sendu.me.uk> <44C488B2.5070806@sendu.me.uk> Message-ID: <11A2B917-C633-4806-A6F4-920F02F0BF6E@gmx.net> :-) I think we're largely in agreement. As for node_name() I fully understand the motivation, but it needs to be understood that the attribute's value will be based on a largely arbitrary choice unless it is set directly by the user. -hilmar On Jul 24, 2006, at 4:45 AM, Sendu Bala wrote: > Hilmar Lapp wrote: >> On Jul 20, 2006, at 9:35 AM, Sendu Bala wrote: >> >>> Bio::DB::Taxonomy::flatfile >>> --------------------------- >>> [...] >>> >>> BEHAVIOUR-CHANGE: flatfile used to store within the nodes it >>> makes the >>> division as a three letter code, like 'PRI'. However, for >>> consistency >>> with entrez and the scientific_name() of the node the division is >>> supposed to correspond to, it is now stored as the full name, like >>> 'Primates'. >> >> What about adding a method division_code() which would return the 3- >> letter abbreviation? >> >> The abbreviation may be needed by flat-file writers, so it may be >> handy to have in some cases. > > As far as I know you can't get the 3-letter version via entrez, so no > other module can really expect to be able to get it, not knowing which > database (flatfile.pm or entez.pm) the taxonomic information is > coming from. > > But of course it would be somewhat harmless to add division_code() > anyway. It might be better done as a -code => 1 option to division()? 
> > >>> The names->id solution also stores the artificially uniqued names >>> like >>> 'Craniata ', allowing you for the first time to >>> retrieve the >>> correct id. Previously the search would have simply failed >>> completely. >>> >>> The names->id solution now handles nodes with scientific names of >>> 'xyz >>> (class)', allowing you to retrieve the id with both get_taxonids >>> ('xyz') >>> and get_taxonids('xyz (class)'). Previously only the latter would >>> work. >> >> Should angle brackets be allowed too? > > Allowed in what sense? You can indeed search for both > get_taxonids('Craniata ') [returns a single id] and > get_taxonids('Craniata') [returns multipe ids, one of which is the > previous answer]. > > >> Maybe there should also be a -names parameter which accepts a hash >> reference with keys being the kind of name (scientific, common, etc) >> and the values being array references with the set of names of that >> kind? > > Not sure what you mean. name() has that data structure, though you're > not supposed to set its hash ref directly. > > >>> or the $node->classification() array. >> >> Bio::Taxonomy::Node shouldn't have this attribute. It is legacy >> brought over from a flawed (because flat) object model in >> Bio::Species. > > Yes, I agree. > > >>> NOTE: entrez modules (and website) cannot cope with '' >>> in the >>> query, failing searches like 'Craniata '. For this >>> reason, if >>> get_taxonids() is given a query with '' it will >>> immediately >>> return undefined, saving a pointless website access. >> >> If there is a 'next-best-thing' that is still semantically compatible >> with the API documentation, I would do that. >> >> In this case, if there is a in the query the entrez >> module should strip it and automatically use the rest for searching. >> If indeed multiple IDs match there should be a warning to inform the >> user that entrez cannot use the notation to limit the >> query results. > > I wouldn't like this. 
I actually had it working this way initially, > but > decided that if someone entered 'xyz ' they really didn't > want multiple ids, expected to get multiple ids with just 'xyz' and > don't want their query made something else and then be warned about > it. > > >> In fact, you might as well provide an option to enable an automatic >> check for the correct branch for each ID if multiple ones are >> returned. I.e., if this option is enabled, the module would >> automatically query the parent nodes to see if is in the >> lineage, and if not will remove the respective ID from the result >> set. The reason you may want to make it optional is because it >> potentially costs time. (but in reality I'm not sure why a client >> will not want to enable the option - so maybe this should even be >> default) > > I can certainly add that, it seems like a good idea. I don't, however, > see any scope for an option at all. What would the option be called? > -don't_give_me_the_answer_I_actually_want_to_save_time ? Pointless, > imho. If the user queries 'xyz ' with that option, they're > just going to have to do for themselves manually what the method would > have done for them without that option, in order to get the correct > answer. It'll be slower that way, if anything. So the option would > actually be called > - > don't_give_me_the_answer_I_actually_want_so_I_can_get_it_myself_a_litt > le_slower > (!). > > >>> Bio::Taxonomy::Node >>> ------------------- >>> [...] >>> classification() has a proper solution to finding the classification >>> when the array wasn't manually set. >>> >>> # Improvements >>> BEHAVIOUR-CHANGE: node_name() used to be an alias to name >>> ('common'). Now >>> it is an alias to name('scientific'). >>> NOTE: node_name is what is set when ->new(-name => $name) is set, so >>> flatfile and entrez and user-created nodes now implicitly associate >>> the >>> name of the node they create with its scientific name. 
>> >> I'm not even sure node_name() should just be deprecated. The methods >> falsely suggests that there is only a single and definitive name for >> the taxon node. >> >> In NCBI reality, this is only true for the scientific name of the >> node. In real reality, many nodes have multiple scientific names - >> taxonomy isn't static and therefore the scientific naming of nodes >> isn't either. > > For the programmer not using any database but just making up his own > nodes, I think he needs a node_name() because he may not be thinking > about anything fancy or realistic. He just want to give his node a > single name that he invents. node_name() seems like the ideal method > name to me. > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at gmx.net Mon Jul 24 12:31:44 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Mon, 24 Jul 2006 08:31:44 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C48323.5060704@sendu.me.uk> References: <001001c6ac2f$02c8c3f0$15327e82@pyrimidine> <44C00805.7090403@sendu.me.uk> <9777D0A4-5340-4367-879D-F9125393B4FD@uiuc.edu> <21C7763A-B8FF-4E91-BAEC-E5F79E462DD3@gmx.net> <44C48323.5060704@sendu.me.uk> Message-ID: <8B9F12C5-C520-48CF-825F-DD24975B43BB@gmx.net> Sounds good to me, except there is no Bio::TaxonomyI yet, and also Bio::Species shouldn't fully depend on an internet connection or flat file to do anything meaningful. I.e., it should take advantage of a lookup database if there is one, but in the absence of that one should also be able to statically set attribute values to whatever one thinks can be gleaned from a parsed text or whatever. 
-hilmar On Jul 24, 2006, at 4:21 AM, Sendu Bala wrote: >> I think this would be the way to go. I.e., >> >> >> |------Node >> NodeI----| >> |-| >> |----SpeciesNode >> Species----| > > Actually, if we're changing the name of the module that Species > interacts with, any existing code needs to be re-written. So why not > just do it properly and have Bio::Species interact with Bio::Taxonomy? > > |----Bio::Taxonomy > Bio::TaxonomyI----| > |----Bio::Species > > Or > > Bio::TaxonomyI----|----Bio::Taxonomy----|----Bio::Species > > Leaving Node completely free to be just a node. This way we don't > have a > crufty SpeciesNode there simply for the sake of Bio::Species. > Bio::Species itself provides all the legacy stuff it needs for itself, > while interacting with Nodes via TaxonomyI methods in the 'correct' > way > only. > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From bix at sendu.me.uk Mon Jul 24 12:34:45 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Mon, 24 Jul 2006 13:34:45 +0100 Subject: [Bioperl-l] Bio::Species/Bio::Taxonomy changes In-Reply-To: <27E70EEF-28F6-4B54-9CCB-6090E338CEFB@gmx.net> References: <7A7BF302-D84E-427B-ABC0-3EB4D19B33E2@uiuc.edu> <44C47DC1.8020503@sendu.me.uk> <27E70EEF-28F6-4B54-9CCB-6090E338CEFB@gmx.net> Message-ID: <44C4BE65.8080304@sendu.me.uk> Hilmar Lapp wrote: > > On Jul 24, 2006, at 3:58 AM, Sendu Bala wrote: > >> On a side note, why would we care about 'organelle' when we're dealing >> with taxonomy? Why does the NCBI taxonomy db have a slot for organelle? > > Because some sequences are of the organelle DNA, and Genbank needs a way > to express this. Highly artificial, but still can't be ignored. Ok, but why is it stored as part of the taxonomy? Why isn't it stored in its own field? And does /bioperl/ have to store it as part of the taxonomy? 
Maybe the file parser could have its own organelle() method and leave all taxonomic classes without such a method. Or it could stay as is, I don't know. Do different organelles in the same species get unique taxonomy ids? From hlapp at gmx.net Mon Jul 24 12:46:51 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Mon, 24 Jul 2006 08:46:51 -0400 Subject: [Bioperl-l] Bio::Species/Bio::Taxonomy changes In-Reply-To: <44C4BE65.8080304@sendu.me.uk> References: <7A7BF302-D84E-427B-ABC0-3EB4D19B33E2@uiuc.edu> <44C47DC1.8020503@sendu.me.uk> <27E70EEF-28F6-4B54-9CCB-6090E338CEFB@gmx.net> <44C4BE65.8080304@sendu.me.uk> Message-ID: <2C99E56B-84D2-4C51-BBF1-76BAF81205AB@gmx.net> On Jul 24, 2006, at 8:34 AM, Sendu Bala wrote: > Hilmar Lapp wrote: >> >> On Jul 24, 2006, at 3:58 AM, Sendu Bala wrote: >> >>> On a side note, why would we care about 'organelle' when we're >>> dealing >>> with taxonomy? Why does the NCBI taxonomy db have a slot for >>> organelle? >> Because some sequences are of the organelle DNA, and Genbank needs >> a way >> to express this. Highly artificial, but still can't be ignored. > > Ok, but why is it stored as part of the taxonomy? Why isn't it > stored in > its own field? And does /bioperl/ have to store it as part of the > taxonomy? No, but clients need to be able to obtain it. Organelles have their own genome. If we talk about the human genome, for instance, most commonly we refer to the nuclear genome only. > Maybe the file parser could have its own organelle() method > and leave all taxonomic classes without such a method. Or it could > stay > as is, I don't know. Like I said above, at the end of the day there needs to be a way to qualify a sequence by the genome it is part of. > > Do different organelles in the same species get unique taxonomy ids? I would have to confirm, but I believe so. As I said, from a genome/ sequence-centric viewpoint, the organelle and nuclear genomes are two different things. 
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From simon.andrews at bbsrc.ac.uk Mon Jul 24 13:34:10 2006 From: simon.andrews at bbsrc.ac.uk (simon andrews (BI)) Date: Mon, 24 Jul 2006 14:34:10 +0100 Subject: [Bioperl-l] New EMBL format parsing/writing Message-ID: I few weeks ago I saw a couple of messages on this list mentioning the new ID/SV line format used in the latest EMBL release. I'm in the process of moving our database server over to the new format and was looking to update SeqIO::embl.pm. I'm sure someone said they'd made a patch to fix up parsing of the new format, but I can't find it either in CVS or bugzilla. Rather than do this again myself can someone point me to an updated SeqIO::embl.pm please? If there isn't one then I'll look into making the patch myself. Since this is such a major change are there any plans to put out a new release with this fix included? I'm sure this will start to bite more people as the new format becomes more widely adopted. Cheers Simon. -- Simon Andrews PhD Bioinformatics Group The Babraham Institute simon.andrews at bbsrc.ac.uk +44 (0) 1223 496463 From cjfields at uiuc.edu Mon Jul 24 13:44:37 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 24 Jul 2006 08:44:37 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <8B9F12C5-C520-48CF-825F-DD24975B43BB@gmx.net> References: <001001c6ac2f$02c8c3f0$15327e82@pyrimidine> <44C00805.7090403@sendu.me.uk> <9777D0A4-5340-4367-879D-F9125393B4FD@uiuc.edu> <21C7763A-B8FF-4E91-BAEC-E5F79E462DD3@gmx.net> <44C48323.5060704@sendu.me.uk> <8B9F12C5-C520-48CF-825F-DD24975B43BB@gmx.net> Message-ID: <5CE136DF-2443-4AC0-AD12-B333D894A75B@uiuc.edu> Hence the reason to have it be a hybrid of Bio::Species and Tax::Node. 
Bio::SeqIO::genbank works very happily with the current Bio::Taxonomy::Node now; if we intend to remove most of the method we need to have a similar DB-aware module to house the flatfile data (like Bio::Species) yet be capable of working with Bio::Taxonomy (like Tax::Node). As for organelle(), that could be made into something else (Bio::Annotation::SimpleValue or similar) but as it's always been included with the tax data, that's where it has been. The TaxID in the 'source' seqfeature doesn't refer to the organelle but the organism. Chris On Jul 24, 2006, at 7:31 AM, Hilmar Lapp wrote: > Sounds good to me, except there is no Bio::TaxonomyI yet, and also > Bio::Species shouldn't fully depend on an internet connection or flat > file to do anything meaningful. > > I.e., it should take advantage of a lookup database if there is one, > but in the absence of that one should also be able to statically set > attribute values to whatever one thinks can be gleaned from a parsed > text or whatever. > > -hilmar > > On Jul 24, 2006, at 4:21 AM, Sendu Bala wrote: > >>> I think this would be the way to go. I.e., >>> >>> >>> |------Node >>> NodeI----| >>> |-| >>> |----SpeciesNode >>> Species----| >> >> Actually, if we're changing the name of the module that Species >> interacts with, any existing code needs to be re-written. So why not >> just do it properly and have Bio::Species interact with >> Bio::Taxonomy? >> >> |----Bio::Taxonomy >> Bio::TaxonomyI----| >> |----Bio::Species >> >> Or >> >> Bio::TaxonomyI----|----Bio::Taxonomy----|----Bio::Species >> >> Leaving Node completely free to be just a node. This way we don't >> have a >> crufty SpeciesNode there simply for the sake of Bio::Species. >> Bio::Species itself provides all the legacy stuff it needs for >> itself, >> while interacting with Nodes via TaxonomyI methods in the 'correct' >> way >> only. 
>> > > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From bix at sendu.me.uk Mon Jul 24 13:49:42 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Mon, 24 Jul 2006 14:49:42 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <8B9F12C5-C520-48CF-825F-DD24975B43BB@gmx.net> References: <001001c6ac2f$02c8c3f0$15327e82@pyrimidine> <44C00805.7090403@sendu.me.uk> <9777D0A4-5340-4367-879D-F9125393B4FD@uiuc.edu> <21C7763A-B8FF-4E91-BAEC-E5F79E462DD3@gmx.net> <44C48323.5060704@sendu.me.uk> <8B9F12C5-C520-48CF-825F-DD24975B43BB@gmx.net> Message-ID: <44C4CFF6.40609@sendu.me.uk> Hilmar Lapp wrote: > Sounds good to me, except there is no Bio::TaxonomyI yet, Indeed, I propose making one. > Bio::Species shouldn't fully depend on an internet connection or flat > file to do anything meaningful. > > I.e., it should take advantage of a lookup database if there is one, but > in the absence of that one should also be able to statically set > attribute values to whatever one thinks can be gleaned from a parsed > text or whatever. Yes, which is why Bio::Taxonomy is appropriate here. Assuming that Bio::Species isa Bio::TaxonomyI: ... SOURCE Saccharomyces cerevisiae (baker's yeast) ORGANISM Saccharomyces cerevisiae Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes; Saccharomycetales; Saccharomycetaceae; Saccharomyces. ... 
## the fully-manual way
my $species = new Bio::Species;
my $node = new Bio::Taxonomy::Node(-name => 'Saccharomyces cerevisiae',
                                   -rank => 'species',
                                   -object_id => 1,
                                   -parent_id => 2);
my $n2 = new Bio::Taxonomy::Node(-name => 'Saccharomyces',
                                 -object_id => 2,
                                 -parent_id => 3);
# (no assumption that 'Saccharomyces' is the genus, so rank() undefined)
my $n3 = [etc]
$species->add_node($node);
$species->add_node($n2);
[etc]

## Using a factory without db access
# assume that Bio::Taxonomy::GenbankFactory implements
# some modified Bio::Taxonomy::FactoryI
my $factory = Bio::Taxonomy::GenbankFactory->new();
my $species = $factory->generate(-classification => ['Saccharomyces cerevisiae',
                                                     'Saccharomyces',
                                                     'Saccharomycetaceae' ...]);
# the generate() method above just does the fully-manual way for you

## Using a factory with db access
# assume that Bio::Taxonomy::EntrezFactory implements some
# modified Bio::Taxonomy::FactoryI and uses Bio::DB::Taxonomy::entrez
# to get the nodes
my $factory = Bio::Taxonomy::EntrezFactory->new();
my $species = $factory->fetch(-scientific_name => 'Saccharomyces cerevisiae');
# (would probably want to come up with a more generic name for the
# fetch() and generate() methods, so that all Factories use the
# same method name)

It's very clean and flexible this way. Ultimately you always make your
Bio::Species the same way - you add nodes to it. You can make those nodes
yourself or use a factory.

We also solve Chris' earlier quandary:
[ in a world where Bio::Taxonomy::Node and Bio::Taxonomy::SpeciesNode
exist, and given that Bio::DB::Taxonomy* currently directly make Node
objects ]
> The only problem I can foresee is which class to use with
> Bio::DB::Taxonomy*? I guess one could settle on one class by default and
> have the option to use another Bio::Taxonomy::NodeI-implementing class if
> you wanted more data/methods available...
The way to do it is to have the Bio::DB::Taxonomy* modules return only the information that a Bio::Taxonomy::FactoryI would need to make a NodeI. The specific Factory that you use could generate whatever type of Node you wanted. But actually I propose there is only one Node and the specific Factory that you use determines the kind of Bio::TaxonomyI made; GenbankFactory might make a Bio::Species, while EntrezFactory might make a Bio::Taxonomy. Bio::Species differs from Bio::Taxonomy only so it contains all the legacy methods names that Bio::Species currently has, for backward compatibility. Setting $species->classification() would delete all nodes of self, use a GenbankFactory to make a new Bio::Species, then pull out all its Nodes and add them to self. Unless anyone can think of a better way of doing things, I'll explore the above ideas and start writing code. To summarise: major changes to Bio::DB::Taxonomy* (make them factory slaves), implementation of some Bio::Taxonomy::FactoryIs, tweak Bio::Taxonomy::FactoryI and make Bio::TaxonomyI, make Bio::Species a Bio::TaxonomyI. Oh, Bio::Taxonomy might need some changes as well. It has a classify() method does something with a Bio::Species, which would be all wrong in the new way of doing things. 
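
The classification() behaviour proposed above can be illustrated with a rough, self-contained sketch. None of these classes exist in BioPerl: every Sketch::* package is a hypothetical stand-in for the proposed Bio::Species, Bio::Taxonomy::Node and GenbankFactory, and the numeric ids are invented local placeholders rather than NCBI taxon ids. It only shows the shape of the idea (a species is a container of nodes, and setting the classification rebuilds those nodes through a factory), under the assumption that the list runs from most specific to most general name.

```perl
use strict;
use warnings;

# All Sketch::* packages are hypothetical stand-ins, not BioPerl classes.

package Sketch::Node;
sub new {
    my ($class, %args) = @_;
    return bless { name      => $args{-name},
                   id        => $args{-object_id},
                   parent_id => $args{-parent_id} }, $class;
}
sub name      { $_[0]{name} }
sub object_id { $_[0]{id} }
sub parent_id { $_[0]{parent_id} }

package Sketch::Species;
sub new       { return bless { nodes => [] }, shift }
sub add_node  { my ($self, $node) = @_; push @{ $self->{nodes} }, $node; return $self }
sub each_node { return @{ $_[0]{nodes} } }

# Getter: returns the node names in the order they were added.
# Setter: discards the existing nodes and rebuilds them via the factory.
sub classification {
    my ($self, @names) = @_;
    if (@names) {
        $self->{nodes} = [];
        my $fresh = Sketch::GenbankFactory->new->generate(-classification => \@names);
        $self->add_node($_) for $fresh->each_node;
    }
    return map { $_->name } $self->each_node;
}

package Sketch::GenbankFactory;
sub new { return bless {}, shift }

# Does "the fully-manual way" for the caller: one node per name,
# chained by made-up local ids, with no database lookup at all.
sub generate {
    my ($self, %args) = @_;
    my @names   = @{ $args{-classification} };
    my $species = Sketch::Species->new;
    for my $i (0 .. $#names) {
        $species->add_node( Sketch::Node->new(
            -name      => $names[$i],
            -object_id => $i + 1,
            -parent_id => $i < $#names ? $i + 2 : undef,
        ) );
    }
    return $species;
}

package main;

my $species = Sketch::GenbankFactory->new->generate(
    -classification => [ 'Saccharomyces cerevisiae', 'Saccharomyces', 'Saccharomycetaceae' ] );
print join( '; ', $species->classification ), "\n";
```

A database-backed factory would differ only in where the nodes come from; the species object itself would still just receive nodes through add_node().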
From bix at sendu.me.uk Mon Jul 24 13:53:23 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Mon, 24 Jul 2006 14:53:23 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <5CE136DF-2443-4AC0-AD12-B333D894A75B@uiuc.edu> References: <001001c6ac2f$02c8c3f0$15327e82@pyrimidine> <44C00805.7090403@sendu.me.uk> <9777D0A4-5340-4367-879D-F9125393B4FD@uiuc.edu> <21C7763A-B8FF-4E91-BAEC-E5F79E462DD3@gmx.net> <44C48323.5060704@sendu.me.uk> <8B9F12C5-C520-48CF-825F-DD24975B43BB@gmx.net> <5CE136DF-2443-4AC0-AD12-B333D894A75B@uiuc.edu> Message-ID: <44C4D0D3.1020506@sendu.me.uk> Chris Fields wrote: > Bio::SeqIO::genbank works very happily with the current > Bio::Taxonomy::Node now; if we intend to remove most of the method we > need to have a similar DB-aware module to house the flatfile data (like > Bio::Species) yet be capable of working with Bio::Taxonomy (like Tax::Node). Can you give code examples of what Bio::SeqIO::genbank is doing and what makes it 'happy'? What are the requirements? Would it be as happy working with a Bio::Taxonomy object? From aramsey at vecna.com Mon Jul 24 14:23:46 2006 From: aramsey at vecna.com (Al Ramsey) Date: Mon, 24 Jul 2006 10:23:46 -0400 Subject: [Bioperl-l] Making BioPerl Faster Message-ID: <44C4D7F2.6020107@vecna.com> I'm interested into following up with a suggestion from the bioperl.org site about making it faster (http://www.bioperl.org/wiki/Why_BioPerl_is_slow). In particular, I wanted to look a little more into how the object instantiations might be more efficient. Is anyone else looking into this actively now? I want to ask if anyone had any additional insights that weren't previously published before I started. Thank you, Al Ramsey -- Alvin Ramsey, PhD. Vecna Technologies, Inc. 
5205 Leesburg Pike Falls Church, VA 22041 aramsey at vecna.com t: 703.998.5333 f: 703.998.5816 From s-merchant at northwestern.edu Mon Jul 24 15:09:49 2006 From: s-merchant at northwestern.edu (Sohel Merchant) Date: Mon, 24 Jul 2006 10:09:49 -0500 Subject: [Bioperl-l] obo_parser.t test warnings In-Reply-To: Message-ID: <004301c6af33$3564a8e0$c2987ca5@pc13> Hey Chris, I usually run perl with all warnings disabled. So I never saw these. I will put a fix to them sometime this week. Thanks, Sohel. _____ From: Chris Fields [mailto:cjfields at uiuc.edu] Sent: Sunday, July 23, 2006 2:10 PM To: bioperl-l List; Hilmar Lapp; s-merchant at northwestern.edu Subject: obo_parser.t test warnings Hilmar, Sohel, Didn't know who to notify, so sorry in advance about cross-posting this to the list. I was running through cleaning up some bugs and found that obo_parser.t is throwing a ton of warnings: bayou-75:~/Chris/Bioperl/bioperl-live natashacapell$ perl -I. -w t/obo_parser.t 1..40 "my" variable $val masks earlier declaration in same scope at Bio/OntologyIO/obo.pm line 592. "my" variable $qh masks earlier declaration in same scope at Bio/OntologyIO/obo.pm line 592. Use of uninitialized value in string eq at Bio/OntologyIO/obo.pm line 239, line 13. ... Good news: all tests pass! Cheers! Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From prabubio at gmail.com Mon Jul 24 15:39:43 2006 From: prabubio at gmail.com (Prabu R) Date: Mon, 24 Jul 2006 21:09:43 +0530 Subject: [Bioperl-l] Remote Blast Execution Message-ID: Dear All! I am trying to run Remote Blast using Bio::Tools::Run::RemoteBlast. I am not able to get the blast result. Upto my knowledge, the Bio::SearchIO::blast hash object does not returns any result. Secondly, I tried 'remote_blast.pl ' a program from CPAN bioperl 1.5release. 
Command:
perl bp_remote_blast.pl -p blastn -d est_mouse -e 1e-5 -i /home/prabucn/Blast/mm_test1.fa

Error Message:

retrieving blasts..

-------------------- WARNING ---------------------
MSG: Possible error (1) while parsing BLAST report!
---------------------------------------------------

Please help.

Thanks,
R. Prabu.

Please look into my test program.
----------------------------------------------------------------------------------------------
use Bio::Tools::Run::RemoteBlast;
use strict;
use Bio::SeqIO;
use Bio::SearchIO;

my $prog = 'blastn';
my $db = 'est';
my $e_val= '1e-10';

my @params = ( '-prog' => $prog,
               '-data' => $db,
               '-expect' => $e_val,
               '-readmethod' => 'SearchIO' );

my $factory = Bio::Tools::Run::RemoteBlast->new(@params) || die "Cant do";

my $v = 1;

my $str = Bio::SeqIO->new(-file=>'mm_test2.txt' , '-format' => 'fasta' );

while (my $input = $str->next_seq()){
    my $r = $factory->submit_blast($input);

    print STDERR "waiting..." if( $v > 0 );
    while ( my @rids = $factory->each_rid ) {
        foreach my $rid ( @rids ) {
            my $rc = $factory->retrieve_blast($rid);

            if( !ref($rc) ) {
                if( $rc < 0 ) {
                    $factory->remove_rid($rid);
                }
                print STDERR "." if ( $v > 0 );
                sleep 5;
            } else {
                print "$rc\n";
                my $result = $rc->next_result();
                my $filename = $result->query_name()."\.out";
                $factory->save_output($filename);
                $factory->remove_rid($rid);
                print "\nQuery Name: ", $result->query_name(), "\n";
                while ( my $hit = $result->next_hit ) {
                    next unless ( $v > 0);
                    print "\thit name is ", $hit->name, "\n";
                    while( my $hsp = $hit->next_hsp ) {
                        print "\t\tscore is ", $hsp->score, "\n";
                    }
                }
            }
        }
    }
}
----------------------------------------------------------------------------------------------

-- "Every noble work is at first impossible."
- Thomas Carlyle From cjfields at uiuc.edu Mon Jul 24 15:48:45 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 24 Jul 2006 10:48:45 -0500 Subject: [Bioperl-l] SearchIO - Stop throwing away data In-Reply-To: Message-ID: <001701c6af38$a81c1580$15327e82@pyrimidine> > Hi > > I developing someone > elses work. I wondered whether anyone could identify the > mistake that the previous coder made? > I am not very familiar with SearchIO yet. > > They are trying to extract filenames from an output report. > This is their code: > > # store the query name of the mito db blast hits into an array > my $searchio = new Bio::SearchIO( -file => $blast_mito_output ); > # array to store the mitochondrial BLAST database hits > my @mito_hits; > # name of query for BLAST hit > my $query_name; > Just as a gripe here: you should always designate the '-format' here to be 'blast' for BLAST text output. my $searchio = new Bio::SearchIO(-file => $blast_mito_output, -format => 'blast' ); The default is still text, so the above works, but that very well may change in the future. Each BLAST report is a Result. Each Result contains one or more hits; each hit contains one or more HSPs. SearchIO only parses the information contained in the BLAST report (i.e. no filenames). From here, it looks like you want Hit information, though. The code below copies the query_name from the BlastResult object, $result (i.e. the name of your query sequence, the one you submitted for BLAST'ing against a database). You need the BlastHit data from $hit. 
Change :

    $query_name = $result->query_name();
    #print "\nQuery $query_name\n";
    push(@mito_hits, $query_name);

To :

    $hit_name = $hit->description();
    #print "\nHit $hit_name\n";
    push(@mito_hits, $hit_name);

or, for the hit accession, use

    $hit_name = $hit->accession();

For all accessions in the description (there may be multiples if sequences are identical), use an array and

    @hit_name = $hit->get_all_accessions();

You can use a different EventHandler if you want to speed things up:

    my $searchio = new Bio::SearchIO(-format => $format, -file => $file);
    $searchio->attach_EventHandler(Bio::SearchIO::FastHitEventBuilder->new);

But to have this work you need to update to the latest CVS version of bioperl; this was a recent bug that was fixed.

Chris

> while ( my $result = $searchio->next_result() ) {
>     # get the hits and their associated name
>     # do not want to include these in the clustering step
>     while( my $hit = $result->next_hit ) {
>         # store the names of these hits into an array
>         # these filenames will not be copied over
>         $query_name = $result->query_name();
>         #print "\nQuery $query_name\n";
>         push(@mito_hits, $query_name);
>     }
> }
>
> I think they have based it on the code at
> http://www.bioperl.org/wiki/HOWTO:SearchIO#Authors
>
> use Bio::SearchIO;
> use Bio::SearchIO::FastHitEventBuilder;
> my $searchio = new Bio::SearchIO(-format => $format, -file => $file);
>
> $searchio->attach_EventHandler(Bio::SearchIO::FastHitEventBuilder->new);
> while( my $r = $searchio->next_result ) {
>     while( my $h = $r->next_hit ) {
>         # Hits will NOT have HSPs
>         print $h->significance,"\n";
>     }
>
> which "throws away data you don't want"???
>
> I am finding that our code is finding the last file name in the ouput
> report,
> rather than each and every one. I suspect it is overwriting (or throwing
> away the data).
>
> How do I need to change the code to make sure *every* file name goes
> into @mito_hits?
> > Thankyou > > Jayne > > _________________________________________________________________ > The new MSN Search Toolbar now includes Desktop search! > http://join.msn.com/toolbar/overview > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From dwaner at scitegic.com Mon Jul 24 16:03:21 2006 From: dwaner at scitegic.com (dwaner at scitegic.com) Date: Mon, 24 Jul 2006 09:03:21 -0700 Subject: [Bioperl-l] New EMBL format parsing/writing Message-ID: Simon, I have already updated SeqIO::embl.pm to support release 87. All I have left to do is generate the patch and update the /t test. I will try to get this submitted to bugzilla today (24 July). - David From cjfields at uiuc.edu Mon Jul 24 16:04:40 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 24 Jul 2006 11:04:40 -0500 Subject: [Bioperl-l] Making BioPerl Faster In-Reply-To: <44C4D7F2.6020107@vecna.com> Message-ID: <001901c6af3a$df146ea0$15327e82@pyrimidine> Give it a look, sure! Not sure if this the only problem though when it comes to speed; I think it's more complicated than that. I think that (at least on WinXP) the Perl version used is also partially to blame. It's possible that something modified between v 5.6 and 5.8 slowed everything down considerably. I always wondered if it had something to do with Unicode support in perl 5.8 ... There is a report on Bugzilla about a dramatic slowdown on sequence parsing between v. 1.4 and v. 1.5 (including the latest, v 1.5.1) http://bugzilla.open-bio.org/show_bug.cgi?id=1875 This is unresolved at this time but may be unrelated to the possible perl versioning issue above. I've a feeling you may find regexes and redundant methods calls also add quite a bit of overhead. I've seen several places where accessors are called over and over w/o assigning to a local variable. Or places where a tr/// would work much faster than a s///. 
There was an instance of the latter in SeqIO which sped up parsing about 2-3x faster on WinXP. If you want to look at the impact of object instantiation on speed, check out Bio::SearchIO (parsing of BLAST/FASTA/HMMER reports). Lots of method calls, object creation, etc. Chris From cjfields at uiuc.edu Mon Jul 24 16:06:03 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 24 Jul 2006 11:06:03 -0500 Subject: [Bioperl-l] Remote Blast Execution In-Reply-To: Message-ID: <001a01c6af3b$10187f50$15327e82@pyrimidine> You need to update to the latest code (bioperl-live) from CVS. BLAST parsing using RemoteBlast is broken in all the latest releases.
Chris From cjfields at uiuc.edu Mon Jul 24 16:21:39 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 24 Jul 2006 11:21:39 -0500 Subject: [Bioperl-l] New EMBL format parsing/writing In-Reply-To: Message-ID: <001c01c6af3d$3df2dc70$15327e82@pyrimidine> The only proposed EMBL changes I can remember were for Tax data (organism lines). It shouldn't be hard to change the way these are parsed. We could leave parsing of SV for older files and run a check on the ID line format to accommodate old and new sequences, though I have no problem with only supporting the latest formats. Continual support for old deprecated sequence formats leads to lots of cruft over time; SwissPort parsing has the same issue. You would be surprised how many people out there never bother to update their sequences and use old data... I believe you are referring to this (from the latest EMBL release notes): ... 2 CHANGES IN THIS RELEASE 2.1 Changes to the Feature Table Document: Chapter 3.5 "Location" The use of range (.) descriptor within location spans is no longer legal. 2.2 ID line changes ID line structure underwent the following changes * All tokens are separated by a semicolon.
* The entry name is not displayed, in its place there is the primary accession number.
* The sequence version is indicated.
* The topology is a separate token and is indicated for both circular and linear molecules.
* Both the data class and taxonomic divisions will be displayed.

This is an example of the new ID line:

ID   CD789012; SV 4; linear; genomic DNA; HTG; MAM; 500 BP.
     (1)       (2)   (3)     (4)          (5)  (6)  (7)

The tokens represent:

1. Primary accession number.
2. 'SV' + sequence version number.
3. Topology: 'circular' or 'linear'.
4. Molecule type.
5. Data class (ANN, CON, PAT, EST, GSS, HTC, HTG, MGA, WGS, TPA, STS, STD, "normal" entries will have STD for standard).
6. Taxonomic division (HUM, MUS, ROD, PRO, MAM, VRT, FUN, PLN, ENV, INV, SYN, UNC, VRL, PHG).
7. Sequence length + 'BP.'.

The entry name is no longer displayed in the ID line. A mapping file (entryname to accession number) ftp://ftp.ebi.ac.uk/pub/databases/embl/misc/entryname_to_acc.mapping is provided for those entries where the entryname is not the same as the accession number.

The SV line has been dropped as sequence version information is now displayed in the ID line.

In order to facilitate the changeover to the new ID line structure, two small utilities have been released: 'new2oldID.pl' and 'old2newID.pl'. They can be used to convert EMBL flat files from the old to the new format and vice-versa. The converters can be found at ftp://ftp.ebi.ac.uk/pub/databases/embl/tools

A new version of the Syncron tools (for maintaining synchronised copies of EMBL database updates) that became the working version with EMBL release 87 can be found in the same directory. In this version the tools were adjusted to cope with the new format of the ID line in EMBL entries and some related changes.

...
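[Editor's sketch] The seven-token layout quoted above lends itself to a simple check on the ID line, which is one way the "old and new" question could be handled. The following is only a minimal sketch under stated assumptions: parse_new_embl_id is a hypothetical helper, not the code in Bio::SeqIO::embl or in the pending patch, and the old-style sample line is an illustrative example of the pre-release-87 layout.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Bio::SeqIO::embl: split a new-style
# (release 87) ID line into the seven tokens listed in the release notes.
# Returns a hash reference, or nothing if the line is not in the new format.
sub parse_new_embl_id {
    my ($line) = @_;
    return unless $line =~ /^ID\s+(.+?)\s*$/;

    # All tokens are separated by semicolons in the new format.
    my @tokens = split /\s*;\s*/, $1;
    return unless @tokens == 7;

    my ($acc, $sv, $topology, $molecule, $class, $division, $len) = @tokens;
    return unless $sv =~ /^SV\s+(\d+)$/;
    my $version = $1;
    return unless $len =~ /^(\d+)\s+BP\.$/;
    my $length = $1;

    return {
        accession  => $acc,
        version    => $version,
        topology   => $topology,
        molecule   => $molecule,
        data_class => $class,
        division   => $division,
        length     => $length,
    };
}

# The example line from the release notes.
my $new = parse_new_embl_id(
    'ID   CD789012; SV 4; linear; genomic DNA; HTG; MAM; 500 BP.');
print "$new->{accession}.$new->{version} ($new->{topology}, $new->{length} bp)\n";

# An old-style line has fewer tokens and no "SV n" token, so it is rejected
# and a caller could fall back to the old parsing path.
my $old = parse_new_embl_id('ID   TRBG361    standard; RNA; PLN; 1859 BP.');
print "old-style ID line, use legacy parsing\n" unless $old;
```

Counting tokens and requiring the "SV n" token keeps the format test in one place, so a parser that wants to support both layouts only needs to branch on whether this helper returns a value.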
Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of simon andrews (BI) > Sent: Monday, July 24, 2006 8:34 AM > To: bioperl-l at lists.open-bio.org > Subject: [Bioperl-l] New EMBL format parsing/writing > > I few weeks ago I saw a couple of messages on this list mentioning the > new ID/SV line format used in the latest EMBL release. I'm in the > process of moving our database server over to the new format and was > looking to update SeqIO::embl.pm. > > I'm sure someone said they'd made a patch to fix up parsing of the new > format, but I can't find it either in CVS or bugzilla. > > Rather than do this again myself can someone point me to an updated > SeqIO::embl.pm please? If there isn't one then I'll look into making > the patch myself. > > Since this is such a major change are there any plans to put out a new > release with this fix included? I'm sure this will start to bite more > people as the new format becomes more widely adopted. > > > Cheers > > Simon. > > -- > Simon Andrews PhD > Bioinformatics Group > The Babraham Institute > > simon.andrews at bbsrc.ac.uk > +44 (0) 1223 496463 > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Mon Jul 24 16:37:32 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 24 Jul 2006 11:37:32 -0500 Subject: [Bioperl-l] New EMBL format parsing/writing In-Reply-To: Message-ID: <002001c6af3f$76214490$15327e82@pyrimidine> Great work! Does it support old and new EMBL or only the newest? I don't have a problem with dumping old format support, but if we do we need to note this in POD and elsewhere (wiki, perhaps). 
Chris From cjfields at uiuc.edu Mon Jul 24 18:40:03 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 24 Jul 2006 13:40:03 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C4D0D3.1020506@sendu.me.uk> Message-ID: <002f01c6af50$97242250$15327e82@pyrimidine> I have to do a little catching up on things here; lots of conversation this morning! According to NCBI, the SOURCE line can hold organelle data, an abbreviated version of the scientific name, and the GenBank common name in parentheses. No other information is present. The ORGANISM lines contains the scientific name (NCBI definition) and the lineage, generally only ranked node but not always. I believe it was Nadeem Faruque who indicated that there is some way that NCBI marks the ranks which determines whether or not they appear in the lineage. Here's what Bio::SeqIO::genbank does to get data into and out of GenBank files: ------------------------------------------------------ Bio::SeqIO::genbank in methods next_seq() and _read_GenBank_Species(): 1) Bio::Species acts as a container object 2) The SOURCE data is dumped entirely into common_name() (ughhhh). There is some additional work done as well before instantiating a Bio::Species ; if it is considered an unknown organism there is no Bio::Species object returned.
We should get rid of that bit; every GenBank SOURCE has a TaxID and therefore has a node, including plasmids and unknowns. There will be no genus/species or anything else set for that group.
3) The ORGANISM name is divided up into genus(), species(), and subspecies(), based on the classification array (again, ughhh).
4) The classification array is split into an array and dumped into classification()
5) No parsing of potential organelle information occurs. None. Zero. Squat.
6) TaxID is grabbed from the 'source' seqfeature and assigned via ncbi_taxid(). We could use this to also grab the organelle, etc.
------------------------------------------------------
Bio::SeqIO::genbank in method write_seq():
1) SOURCE line : use the common_name data for output, but tag on the subspecies information (?!?!?!).
2) ORGANISM lines : the name is rebuilt from the organelle() (which should be on the SOURCE line) and genus and species, which come from the classification array (?!?!?!). The classification array is rebuilt from classification()
------------------------------------------------------
Much of this may be cruft from changes in the official GenBank format that we neglected to update. However, I think there's WAY too much hand-wringing about trying to get everything into genus() species() etc without anything more than the (very scant) information in the flatfile, esp. when using the classification array as a basis.

The only places where reliable tax information is present in the flatfile are:
1) SOURCE line (organelle, common name, abbreviated name)
2) ORGANISM lines (scientific name, classification array)
3) 'source' seqfeature (strain/variant (!), organelle, TaxID, etc found here).

We should assign those accordingly; we could even use the 'source' seqfeature to grab strain, organelle, etc. just like we now do for the TaxID. Beyond that we're really just guessing the ranks and the genus-species names.
Makes no sense, especially when that is easily available in Bio::Taxonomy using entrez/flatfile.

We could have Bio::Taxonomy::Species act as a container for IO purposes, ONLY using the methods in the 'reliable information' list above in Bio::SeqIO::genbank and other SeqIO RichSeqs. Then hold the additional data with warnings attached if a lookup hasn't been run, or not set them at all. Or, use Hilmar's suggestion and force the user to use the db handle and ncbi_taxid() to grab a new Bio::Taxonomy::Node/Species object (based on the rank) which has the correct information.

As for the other container get/sets: species(), genus() etc. These methods should be present, but only for species or below (hence Bio::Taxonomy::Species). In a way Bio::Taxonomy::Species is not entirely correct, as many times the sequence in the file is from an organism at the genus level (unassigned species) or subspecies/strain levels, or is unranked (environmental samples, for instance). All of these seem to have TaxIDs though. Don't think it really matters...

We could convert Bio::Species into an abstract interface class (Bio::SpeciesI), moving the implemented methods over to Bio::Taxonomy::Species, and have Bio::Taxonomy::Species implement Bio::Taxonomy::NodeI or Bio::TaxonomyI as well. Bio::Taxonomy::Species could be checked with

    $obj->isa('Bio::TaxonomyI') && $obj->isa('Bio::SpeciesI')

Or, modifying Hilmar's suggestion:

            |-----Tax::Node
NodeI/TaxI -|
            |-----Tax::Species
                     |
SpeciesI      -------|

So Species doesn't 'contaminate' Node.
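[Editor's sketch] The hierarchy in the diagram can be tried out with plain Perl inheritance. This is a minimal sketch only: the Sketch::* packages are hypothetical stand-ins, not existing BioPerl classes, and binomial() is an invented placeholder for the legacy Bio::Species-style accessors.

```perl
use strict;
use warnings;

# Toy stand-ins; these Sketch::* packages are hypothetical, not BioPerl classes.
package Sketch::NodeI;        # interface: anything that is a taxonomy node
sub rank { die "rank() not implemented" }

package Sketch::SpeciesI;     # interface: legacy species-style accessors
sub binomial { die "binomial() not implemented" }

package Sketch::Node;         # plain node: implements NodeI only
our @ISA = ('Sketch::NodeI');
sub new  { my ($class, %args) = @_; return bless {%args}, $class }
sub rank { $_[0]{rank} }

package Sketch::Species;      # species node: a Node that also implements SpeciesI
our @ISA = ('Sketch::Node', 'Sketch::SpeciesI');
sub binomial { $_[0]{name} }

package main;

my @objects = (
    Sketch::Node->new(name => 'Saccharomyces', rank => 'genus'),
    Sketch::Species->new(name => 'Saccharomyces cerevisiae', rank => 'species'),
);

# The same kind of isa() checks suggested in the message above.
for my $obj (@objects) {
    printf "%-16s NodeI=%d SpeciesI=%d\n", ref($obj),
        ($obj->isa('Sketch::NodeI')    ? 1 : 0),
        ($obj->isa('Sketch::SpeciesI') ? 1 : 0);
}
```

With this layout a container that only needs nodes can test for the node interface alone, while code that wants the species-style accessors can additionally test for the species interface, so the plain node class never has to carry them.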
This will allow you to proceed with doing what you want to Bio::Taxonomy::Node; both Node and Species could be checked simultaneously though they need to be changed at some point to implement the same base class, so you could check using : if ($obj->isa('Bio::Taxonomy::NodeI')) { As for getting Bio::SeqIO::genbank to play well with Bio::Taxonomy::Species, all I did was 'clone' the Bio::Taxonomy::Node module into Bio::Taxonomy::Species, removed the warnings in species() and other methods for the time being, and changed the method call for classification() in Bio::SeqIO::genbank to send an array instead of an array_ref. Then I modified the parsing to retain the scientific_name and abbreviated_name (though the latter should go into common_names()). Passed all but one test, where common_name was called and returned the entire SOURCE line (not correct!). Pretty simple, really... BTW, I checked EMBL format, and it is very similar in format to the way GenBank is with the interesting addition of the OG line (for organelle). Chris From cjfields at uiuc.edu Mon Jul 24 19:24:23 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 24 Jul 2006 14:24:23 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C4CFF6.40609@sendu.me.uk> Message-ID: <003c01c6af56$c5fd2df0$15327e82@pyrimidine> > Hilmar Lapp wrote: > > Sounds good to me, except there is no Bio::TaxonomyI yet, > > Indeed, I propose making one. So, Node would implement this, correct? Naming it Bio::TaxonomyI makes me think that Bio::Taxonomy implements TaxonomyI, not that Bio::Taxonomy::Node implements it. ... > Yes, which is why Bio::Taxonomy is appropriate here. Assuming that > Bio::Species isa Bio::TaxonomyI: > > ... > SOURCE Saccharomyces cerevisiae (baker's yeast) > ORGANISM Saccharomyces cerevisiae > Eukaryota; Fungi; Ascomycota; Saccharomycotina; > Saccharomycetes; > Saccharomycetales; Saccharomycetaceae; Saccharomyces. > > ... > > ## the fully-manual way > my $species = new Bio::Species; > my $node = new Bio::Taxonomy::Node(-name => 'Saccharomyces cerevisiae', > -rank => 'species', -object_id => 1, > -parent_id => 2); > my $n2 = new Bio::Taxonomy::Node(-name => 'Saccharomyces', > -object_id => 2, -parent_id => 3); > # (no assumption that 'Saccharomyces' is the genus, so rank() undefined) > my $n3 = [etc] > $species->add_node($node); > $species->add_node($n2); > [etc] Hrmm... why would you add multiple nodes to a species object? A Species is-a Node, not a full Bio::Taxonomy. Taxonomy has-a Node (hence the add_node() method). So, you should be able to add a NodeI-implementing object to a Taxonomy object (either a Node or a Species). Not sure I agree with what you propose here; doesn't seem right... ...
> We also solve Chris' earlier quandary: > > [ in a world where Bio::Taxonomy::Node and Bio::Taxonomy::SpeciesNode > exist, and given that Bio::DB::Taxonomy* currently directly make Node > objects ] > > The only problem I can foresee is which class to use with > > Bio::DB::Taxonomy*? I guess one could settle on one class by default > and > > have the option to use another Bio::Taxonomy::NodeI-implementing class > if > > you wanted more data/methods available... > > The way to do it is to have the Bio::DB::Taxonomy* modules return only > the information that a Bio::Taxonomy::FactoryI would need to make a > NodeI. The specific Factory that you use could generate whatever type of > Node you wanted. Yes, using an object factory here makes a lot of sense, returning the correct object type based on the rank. ... > Bio::Species differs from Bio::Taxonomy only so it contains all the > legacy methods names that Bio::Species currently has, for backward > compatibility. Setting $species->classification() would delete all nodes > of self, use a GenbankFactory to make a new Bio::Species, then pull out > all its Nodes and add them to self. The idea is to replace Bio::Species with something that works well, so having it implement a Node-like interface works since it is-a Node. Having it implement a Taxonomy-like interface, though, doesn't make a lot of sense as a species is-not-a Taxonomy. It should act just like a fancier node object. Using a factory in Bio::DB::Taxonomy should solve any issues about what object type is returned, since that could simply be made based on the rank itself (species rank or below == Bio::Taxonomy::Species, genus and above == Bio::Taxonomy::Node). > Unless anyone can think of a better way of doing things, I'll explore > the above ideas and start writing code. 
To summarise: major changes to > Bio::DB::Taxonomy* (make them factory slaves), implementation of some > Bio::Taxonomy::FactoryIs, tweak Bio::Taxonomy::FactoryI and make > Bio::TaxonomyI, make Bio::Species a Bio::TaxonomyI. Nope. Don't agree. Sorry. I can't see why you would force a Species to be a Taxonomy when it isn't. The object hierarchy doesn't make sense to me. I would just have a simple interface for Node (NodeI), and either convert Bio::Species to an abstract interface or place its methods in Bio::Taxonomy::Species/SpeciesNode. I like the interface idea as Bio::Taxonomy::Node is-a NodeI only, while Bio::Taxonomy::Species is-a NodeI and SpeciesI; these checks can be run using the UNIVERSAL object method 'isa' when using a Factory. I'll repeat: a Node and a Species is-not-a Taxonomy. A Taxonomy object has-a Node or Species or combinations thereof ; all would be NodeI-implementing. That's the reason that add_node() is there, which could be modified to allow only objects that isa->('Bio::Taxonomy::NodeI') (i.e. a Node or a Species). > Oh, Bio::Taxonomy might need some changes as well. It has a classify() > method does something with a Bio::Species, which would be all wrong in > the new way of doing things. We'll have to make eventual changes to anything referencing Bio::Species to get them to work correctly. Getting the object hierarchy finalized and worked out is priority one. Getting Bio::SeqIO modules switched over to Bio::Taxonomy::Species (pretty commonly used) and making sure that Bio::DB::Taxonomy returns the correct objects from the factory is a close second. Any small issues that pop up along the way can be taken care of when they reveal themselves. 
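[Editor's sketch] The rank-based factory idea in this message (species rank or below gives a species-type node, anything else a plain node) could look roughly like the following. It is one possible reading of the proposal, not settled design: the Demo::* packages and the list of "species or below" ranks are assumptions made for illustration, and nothing here is the Bio::DB::Taxonomy API.

```perl
use strict;
use warnings;

# Hypothetical classes for illustration; not BioPerl modules.
package Demo::Node;
sub new  { my ($class, %args) = @_; return bless {%args}, $class }
sub rank { $_[0]{rank} }
sub name { $_[0]{name} }

package Demo::SpeciesNode;
our @ISA = ('Demo::Node');

package Demo::NodeFactory;

# Assumed set of ranks treated as "species or below".
my %SPECIES_OR_BELOW = map { $_ => 1 } qw(species subspecies varietas forma);

# Pick the node class from the rank, then build the object.
sub create {
    my ($class, %args) = @_;
    my $rank = defined $args{rank} ? lc $args{rank} : '';
    my $node_class = $SPECIES_OR_BELOW{$rank} ? 'Demo::SpeciesNode' : 'Demo::Node';
    return $node_class->new(%args);
}

package main;

for my $rec ([ 'Saccharomyces cerevisiae', 'species' ],
             [ 'Saccharomyces',            'genus'   ],
             [ 'environmental samples',    'no rank' ]) {
    my $node = Demo::NodeFactory->create(name => $rec->[0], rank => $rec->[1]);
    printf "%-26s %-8s => %s\n", $node->name, $node->rank, ref($node);
}
```

Note that an unranked entry falls through to the plain node class in this sketch, so the strain-level and unranked cases raised earlier in the thread would need an explicit rule of their own.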
Chris From cjfields at uiuc.edu Mon Jul 24 19:34:55 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 24 Jul 2006 14:34:55 -0500 Subject: [Bioperl-l] Bio::Species/Bio::Taxonomy changes In-Reply-To: <2C99E56B-84D2-4C51-BBF1-76BAF81205AB@gmx.net> Message-ID: <003d01c6af58$3dc4ac40$15327e82@pyrimidine> > > Maybe the file parser could have its own organelle() method > > and leave all taxonomic classes without such a method. Or it could > > stay > > as is, I don't know. > > Like I said above, at the end of the day there needs to be a way to > qualify a sequence by the genome it is part of. Agreed. I think Sendu's right in one regard, it doesn't seem to have anything to do with the taxonomy itself. See below... There should be a way of containing this somehow, maybe using a Bio::Annotation::SimpleValue object or having a get/set somehow. > > Do different organelles in the same species get unique taxonomy ids? > > I would have to confirm, but I believe so. As I said, from a genome/ > sequence-centric viewpoint, the organelle and nuclear genomes are two > different things. Looks like the organelle sequence data uses the organism TaxID. I couldn't find organelle-specific taxon information using the TaxBrowser for mitochondrion, chloroplast, or plastid. source 1..426 /organism="Reticulitermes tibialis" /organelle="mitochondrion" /mol_type="genomic DNA" /db_xref="taxon:186107" /haplotype="T9" TaxID refers to the organism ("Reticulitermes tibialis"), not the mitochondrion. source 1..814 /organism="Porterinema fluviatile" /organelle="plastid:chloroplast" /mol_type="genomic DNA" /strain="SAG 124.79" /db_xref="taxon:246123" /country="Germany" TaxID refers to the organism ("Porterinema fluviatile"), not the chloroplast. 
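[Editor's sketch] Since the organelle and the TaxID both sit in the 'source' feature qualifiers shown above, they can be pulled out together. The following is a minimal text-level sketch, assuming the feature is available as raw flatfile text; parse_source_qualifiers is a hypothetical helper and does not reflect how Bio::SeqIO::genbank handles features.

```perl
use strict;
use warnings;

# Hypothetical helper: collect /qualifier="value" pairs from the text of a
# GenBank 'source' feature and pull out the organism, organelle and TaxID.
sub parse_source_qualifiers {
    my ($text) = @_;

    # A qualifier can occur more than once (e.g. db_xref), so keep lists.
    my %qual;
    while ($text =~ m{/(\w+)="([^"]*)"}g) {
        push @{ $qual{$1} }, $2;
    }

    # The TaxID is the db_xref value of the form taxon:NNN.
    my ($taxid) = map { /^taxon:(\d+)$/ ? $1 : () } @{ $qual{db_xref} || [] };

    return {
        organism  => ($qual{organism}  ? $qual{organism}[0]  : undef),
        organelle => ($qual{organelle} ? $qual{organelle}[0] : undef),
        taxid     => $taxid,
    };
}

# The second example feature from the message above.
my $feature = <<'END';
     source          1..814
                     /organism="Porterinema fluviatile"
                     /organelle="plastid:chloroplast"
                     /mol_type="genomic DNA"
                     /strain="SAG 124.79"
                     /db_xref="taxon:246123"
                     /country="Germany"
END

my $src = parse_source_qualifiers($feature);
printf "organism=%s organelle=%s taxid=%s\n",
    $src->{organism},
    (defined $src->{organelle} ? $src->{organelle} : 'none'),
    $src->{taxid};
```

Keeping the organelle as a separate value next to the TaxID matches the observation above that the TaxID identifies the organism, not the organelle genome.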
Chris From bix at sendu.me.uk Mon Jul 24 19:45:09 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Mon, 24 Jul 2006 20:45:09 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <003c01c6af56$c5fd2df0$15327e82@pyrimidine> References: <003c01c6af56$c5fd2df0$15327e82@pyrimidine> Message-ID: <44C52345.5060903@sendu.me.uk> Chris Fields wrote: >> Hilmar Lapp wrote: >>> Sounds good to me, except there is no Bio::TaxonomyI yet, >> Indeed, I propose making one. > > So, Node would implement this, correct? Naming it Bio::TaxonomyI makes me > think that Bio::Taxonomy implements TaxonomyI, not that Bio::Taxonomy::Node > implements it. No no, I guess the whole rest of you reply was confused by this one point. Bio::TaxonomyI would be the interface for Bio::Taxonomy. Definitely not a Node. >> Yes, which is why Bio::Taxonomy is appropriate here. Assuming that >> Bio::Species isa Bio::TaxonomyI: >> >> ... >> SOURCE Saccharomyces cerevisiae (baker's yeast) >> ORGANISM Saccharomyces cerevisiae >> Eukaryota; Fungi; Ascomycota; Saccharomycotina; >> Saccharomycetes; >> Saccharomycetales; Saccharomycetaceae; Saccharomyces. >> >> ... >> >> ## the fully-manual way >> my $species = new Bio::Species; >> my $node = new Bio::Taxonomy::Node(-name => 'Saccharomyces cerevisiae', >> -rank => 'species', -object_id => 1, >> -parent_id => 2); >> my $n2 = new Bio::Taxonomy::Node(-name => 'Saccharomyces', >> -object_id => 2, -parent_id => 3); >> # (no assumption that 'Saccharomyces' is the genus, so rank() undefined) >> my $n3 = [etc] >> $species->add_node($node); >> $species->add_node($n2); >> [etc] > > > Hrmm... why would you add multiple nodes to a species object? A Species > is-a Node, not a full Bio::Taxonomy. In my proposal, a Bio::Species certainly is a full Bio::Taxonomy. >> Bio::Species differs from Bio::Taxonomy only so it contains all the >> legacy methods names that Bio::Species currently has, for backward >> compatibility. 
Setting $species->classification() would delete all nodes >> of self, use a GenbankFactory to make a new Bio::Species, then pull out >> all its Nodes and add them to self. > > The idea is to replace Bio::Species with something that works well, so > having it implement a Node-like interface works since it is-a Node. Having > it implement a Taxonomy-like interface, though, doesn't make a lot of sense > as a species is-not-a Taxonomy. Right. So this is why we've been 'butting heads'. Up till now I had no idea why you were so adamant about keeping things the old Bio::Taxonomy::Node way. Bio::Species very definitely has never been, nor do we want it to become, a single node of a taxonomy. It has always been a complete taxonomy. You can tell that by the fact it has a classification, and you could ask what its genus is. This is why I'm proposing that Bio::Species become a Bio::Taxonomy. Because that's the correct object model for the kinds of things Bio::Species wants to do. > Using a factory in Bio::DB::Taxonomy should solve any issues about what > object type is returned, since that could simply be made based on the rank > itself (species rank or below == Bio::Taxonomy::Species, genus and above == > Bio::Taxonomy::Node). Frankly, that idea makes me ill. A Node, at the fundamental level, is just a very simple object that needs to associated a taxonomic rank with a scientific name. If you start making different objects for different ranks, you've departed from any semblance of meaning in the object model. > Nope. Don't agree. Sorry. I can't see why you would force a Species to be > a Taxonomy when it isn't. The object hierarchy doesn't make sense to me. Does it make sense now? > I'll repeat: a Node and a Species is-not-a Taxonomy. I'll repeat: A Node is a Node and a Bio::Species is a Taxonomy ;) > A Taxonomy object has-a Node or Species or combinations thereof ; No, a Taxonomy contains Nodes. One of those Nodes might have a rank() of 'species'. 
A Bio::Species contains Nodes. One of those Nodes definitely has a rank() of 'species'. It /must/ have other nodes, because the job of Bio::Species has in the past and will in the future be to store all the other taxonomic levels in a Genbank file. For the same reason Bio::Species can't be a Node itself, because you can't store other Nodes inside a Node.

From cjfields at uiuc.edu Mon Jul 24 19:49:06 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Mon, 24 Jul 2006 14:49:06 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <11A2B917-C633-4806-A6F4-920F02F0BF6E@gmx.net>
Message-ID: <003e01c6af5a$390cdea0$15327e82@pyrimidine>

Yes, 'largely' the key word. I don't really agree with Sendu's hierarchy scheme (making Species implement Taxonomy and not Node doesn't make sense), but, besides that, everything else seems fine. I like the following setup (which is similar to what you proposed, I believe), which I already posted.

            |-----Tax::Node
NodeI-------|
            |-----Tax::SpeciesNode
                |
SpeciesI -------|

Taxonomy::Node is-a NodeI
Taxonomy::SpeciesNode is-a NodeI and-a SpeciesI
Bio::Taxonomy 'has-a' NodeI-implementing module
SeqIO has-a SpeciesI-implementing module

Bio::DB::Taxonomy uses a factory to return NodeI-implementing modules; specifically, a SpeciesNode for species ranks or below, and a Node for anything else.

It would be nice to get this hammered out soon. I think we can actually start work on the Bio::Taxonomy::Node/SpeciesNode split; the interface classes would be easy to add. I could work on getting SeqIO to work with Bio::Taxonomy::SpeciesNode when I can (sometime in the next few weeks). Like I mentioned before, I got Bio::SeqIO::genbank already using it but haven't committed it to CVS until we sorted out the class hierarchy and interface-implementation issues.

I won't be able to add too much more to this for a few weeks, unfortunately. I need to prepare for a conference as well as finish up a ton of bench research. I'll try keeping up though...
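
The rank-based factory described above could look roughly like the following sketch. Everything in it is an assumption made for illustration: node_class_for() and the Sketch::* class names are hypothetical, and the list of ranks counted as "species or below" is just one plausible reading of that phrase. It is not the Bio::DB::Taxonomy API.

```perl
use strict;
use warnings;

# Ranks treated as "species or below". This list is an assumption;
# a real implementation would have to agree on it.
my %species_or_below = map { $_ => 1 } qw(species subspecies varietas forma);

# Hypothetical factory rule: pick a class name from the rank alone.
# A missing rank falls through to the plain node class.
sub node_class_for {
    my ($rank) = @_;
    return $species_or_below{ $rank // '' }
        ? 'Sketch::SpeciesNode'
        : 'Sketch::Node';
}

for my $rank ('species', 'genus', undef) {
    printf "%-8s -> %s\n", $rank // '(none)', node_class_for($rank);
}
```

Written out this way the objection raised elsewhere in the thread is easy to see: the class of a node becomes a function of its rank, and a node whose rank is unknown silently gets the generic class.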
Chris > :-) I think we're largely in agreement. As for node_name() I fully > understand the motivation, but it needs to be understood that the > attribute's value will be based on a largely arbitrary choice unless > it is set directly by the user. > > -hilmar > > On Jul 24, 2006, at 4:45 AM, Sendu Bala wrote: > > > Hilmar Lapp wrote: > >> On Jul 20, 2006, at 9:35 AM, Sendu Bala wrote: > >> > >>> Bio::DB::Taxonomy::flatfile > >>> --------------------------- > >>> [...] > >>> > >>> BEHAVIOUR-CHANGE: flatfile used to store within the nodes it > >>> makes the > >>> division as a three letter code, like 'PRI'. However, for > >>> consistency > >>> with entrez and the scientific_name() of the node the division is > >>> supposed to correspond to, it is now stored as the full name, like > >>> 'Primates'. > >> > >> What about adding a method division_code() which would return the 3- > >> letter abbreviation? > >> > >> The abbreviation may be needed by flat-file writers, so it may be > >> handy to have in some cases. > > > > As far as I know you can't get the 3-letter version via entrez, so no > > other module can really expect to be able to get it, not knowing which > > database (flatfile.pm or entez.pm) the taxonomic information is > > coming from. > > > > But of course it would be somewhat harmless to add division_code() > > anyway. It might be better done as a -code => 1 option to division()? > > > > > >>> The names->id solution also stores the artificially uniqued names > >>> like > >>> 'Craniata ', allowing you for the first time to > >>> retrieve the > >>> correct id. Previously the search would have simply failed > >>> completely. > >>> > >>> The names->id solution now handles nodes with scientific names of > >>> 'xyz > >>> (class)', allowing you to retrieve the id with both get_taxonids > >>> ('xyz') > >>> and get_taxonids('xyz (class)'). Previously only the latter would > >>> work. > >> > >> Should angle brackets be allowed too? > > > > Allowed in what sense? 
You can indeed search for both > > get_taxonids('Craniata ') [returns a single id] and > > get_taxonids('Craniata') [returns multipe ids, one of which is the > > previous answer]. > > > > > >> Maybe there should also be a -names parameter which accepts a hash > >> reference with keys being the kind of name (scientific, common, etc) > >> and the values being array references with the set of names of that > >> kind? > > > > Not sure what you mean. name() has that data structure, though you're > > not supposed to set its hash ref directly. > > > > > >>> or the $node->classification() array. > >> > >> Bio::Taxonomy::Node shouldn't have this attribute. It is legacy > >> brought over from a flawed (because flat) object model in > >> Bio::Species. > > > > Yes, I agree. > > > > > >>> NOTE: entrez modules (and website) cannot cope with '' > >>> in the > >>> query, failing searches like 'Craniata '. For this > >>> reason, if > >>> get_taxonids() is given a query with '' it will > >>> immediately > >>> return undefined, saving a pointless website access. > >> > >> If there is a 'next-best-thing' that is still semantically compatible > >> with the API documentation, I would do that. > >> > >> In this case, if there is a in the query the entrez > >> module should strip it and automatically use the rest for searching. > >> If indeed multiple IDs match there should be a warning to inform the > >> user that entrez cannot use the notation to limit the > >> query results. > > > > I wouldn't like this. I actually had it working this way initially, > > but > > decided that if someone entered 'xyz ' they really didn't > > want multiple ids, expected to get multiple ids with just 'xyz' and > > don't want their query made something else and then be warned about > > it. > > > > > >> In fact, you might as well provide an option to enable an automatic > >> check for the correct branch for each ID if multiple ones are > >> returned. 
I.e., if this option is enabled, the module would > >> automatically query the parent nodes to see if is in the > >> lineage, and if not will remove the respective ID from the result > >> set. The reason you may want to make it optional is because it > >> potentially costs time. (but in reality I'm not sure why a client > >> will not want to enable the option - so maybe this should even be > >> default) > > > > I can certainly add that, it seems like a good idea. I don't, however, > > see any scope for an option at all. What would the option be called? > > -don't_give_me_the_answer_I_actually_want_to_save_time ? Pointless, > > imho. If the user queries 'xyz ' with that option, they're > > just going to have to do for themselves manually what the method would > > have done for them without that option, in order to get the correct > > answer. It'll be slower that way, if anything. So the option would > > actually be called > > - > > don't_give_me_the_answer_I_actually_want_so_I_can_get_it_myself_a_litt > > le_slower > > (!). > > > > > >>> Bio::Taxonomy::Node > >>> ------------------- > >>> [...] > >>> classification() has a proper solution to finding the classification > >>> when the array wasn't manually set. > >>> > >>> # Improvements > >>> BEHAVIOUR-CHANGE: node_name() used to be an alias to name > >>> ('common'). Now > >>> it is an alias to name('scientific'). > >>> NOTE: node_name is what is set when ->new(-name => $name) is set, so > >>> flatfile and entrez and user-created nodes now implicitly associate > >>> the > >>> name of the node they create with its scientific name. > >> > >> I'm not even sure node_name() should just be deprecated. The methods > >> falsely suggests that there is only a single and definitive name for > >> the taxon node. > >> > >> In NCBI reality, this is only true for the scientific name of the > >> node. 
In real reality, many nodes have multiple scientific names - > >> taxonomy isn't static and therefore the scientific naming of nodes > >> isn't either. > > > > For the programmer not using any database but just making up his own > > nodes, I think he needs a node_name() because he may not be thinking > > about anything fancy or realistic. He just want to give his node a > > single name that he invents. node_name() seems like the ideal method > > name to me. > > > > > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From hlapp at gmx.net Mon Jul 24 19:56:02 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Mon, 24 Jul 2006 15:56:02 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <003c01c6af56$c5fd2df0$15327e82@pyrimidine> References: <003c01c6af56$c5fd2df0$15327e82@pyrimidine> Message-ID: <88700A84-B426-4BC7-88F2-D5E793870ADF@gmx.net> On Jul 24, 2006, at 3:24 PM, Chris Fields wrote: > >> Hilmar Lapp wrote: >>> Sounds good to me, except there is no Bio::TaxonomyI yet, >> >> Indeed, I propose making one. > > So, Node would implement this, correct? No - > Naming it Bio::TaxonomyI makes me > think that Bio::Taxonomy implements TaxonomyI, not that > Bio::Taxonomy::Node > implements it. I'd suppose so. >> Yes, which is why Bio::Taxonomy is appropriate here. Assuming that >> Bio::Species isa Bio::TaxonomyI: >> >> ... 
>> SOURCE Saccharomyces cerevisiae (baker's yeast) >> ORGANISM Saccharomyces cerevisiae >> Eukaryota; Fungi; Ascomycota; Saccharomycotina; >> Saccharomycetes; >> Saccharomycetales; Saccharomycetaceae; Saccharomyces. >> >> ... >> >> ## the fully-manual way >> my $species = new Bio::Species; >> my $node = new Bio::Taxonomy::Node(-name => 'Saccharomyces >> cerevisiae', >> -rank => 'species', -object_id >> => 1, >> -parent_id => 2); >> my $n2 = new Bio::Taxonomy::Node(-name => 'Saccharomyces', >> -object_id => 2, -parent_id => 3); >> # (no assumption that 'Saccharomyces' is the genus, so rank() >> undefined) >> my $n3 = [etc] >> $species->add_node($node); >> $species->add_node($n2); >> [etc] > > > Hrmm... why would you add multiple nodes to a species object? A > Species > is-a Node, not a full Bio::Taxonomy. No. See above: Bio::Species is-a Bio::Taxonomy. > Taxonomy has-a Node (hence the > add_node() method). So, you should be able to add a NodeI- > implementing > object to a Taxonomy object (either a Node or a Species). Let's keep Bio::Species and Taxonomy::Node separate. They look like representing something similar but once you look at the Bio::Species API (and a Genbank record) you realize they do not. Bio::Species is more like an entire lineage and the species node all flattened out into one. I'm not sure Bio::Species would need to implement a Bio::TaxonomyI interface; it may as well just use an implementation of it internally. I'm not sure how Sendu wants to design this, but for sure Bio::Taxonomy::Node should not be a Bio::Species, and the reverse should rather be avoided too. >> [..] >> The way to do it is to have the Bio::DB::Taxonomy* modules return >> only >> the information that a Bio::Taxonomy::FactoryI would need to make a >> NodeI. The specific Factory that you use could generate whatever >> type of >> Node you wanted. > > Yes, using an object factory here makes a lot of sense, returning the > correct object type based on the rank. 
Well, I don't think you'd want to create instances of different node classes depending on the rank of the node. However, a particular factory implementation may of course be free to do exactly that. > ... >> Bio::Species differs from Bio::Taxonomy only so it contains all the >> legacy methods names that Bio::Species currently has, for backward >> compatibility. Setting $species->classification() would delete all >> nodes >> of self, use a GenbankFactory to make a new Bio::Species, then >> pull out >> all its Nodes and add them to self. > > The idea is to replace Bio::Species with something that works well, so > having it implement a Node-like interface works since it is-a > Node. Having > it implement a Taxonomy-like interface, though, doesn't make a lot > of sense > as a species is-not-a Taxonomy. It should act just like a fancier > node > object. No, I'd really recommend against muddling up a taxonomy node model with the Bio::Species legacy model. Bio::Species is not a node at all. You may argue it's not a taxonomy either. This is just one more reason for containing the Bio::Species contagious disease of conflating disjoint concepts into one. > > Using a factory in Bio::DB::Taxonomy should solve any issues about > what > object type is returned, since that could simply be made based on > the rank > itself (species rank or below == Bio::Taxonomy::Species, genus and > above == > Bio::Taxonomy::Node). Bio::Taxonomy::Species was an invention of mine and - if created - should not be used for anything else other than representing a taxonomy node as a Bio::Species object iff necessary (i.e., if the client really wants a Bio::Species object). I'd actually like to see what Sendu would come up with. It sounds at the very minimum like an excellent start. 
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at gmx.net Mon Jul 24 19:59:10 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Mon, 24 Jul 2006 15:59:10 -0400 Subject: [Bioperl-l] Bio::Species/Bio::Taxonomy changes In-Reply-To: <003d01c6af58$3dc4ac40$15327e82@pyrimidine> References: <003d01c6af58$3dc4ac40$15327e82@pyrimidine> Message-ID: <3C520B8C-8755-4A7E-80CF-8B94FEAB867E@gmx.net> On Jul 24, 2006, at 3:34 PM, Chris Fields wrote: > Looks like the organelle sequence data uses the organism TaxID. Then you might as well store it as annotation. Really the only thing that matters is that the flat file writers can get from an expected location. In fact storing as annotation is better e.g. for Biosql since right now the taxonomy model is the NCBI model and so organelle will not be stored (and hence neither be round-tripped). -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From cjfields at uiuc.edu Mon Jul 24 20:10:20 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 24 Jul 2006 15:10:20 -0500 Subject: [Bioperl-l] Bio::Species/Bio::Taxonomy changes In-Reply-To: <3C520B8C-8755-4A7E-80CF-8B94FEAB867E@gmx.net> Message-ID: <000001c6af5d$3094b830$15327e82@pyrimidine> Sounds good. Will be easy to change this over. Chris > -----Original Message----- > From: Hilmar Lapp [mailto:hlapp at gmx.net] > Sent: Monday, July 24, 2006 2:59 PM > To: Chris Fields > Cc: 'Sendu Bala'; bioperl-l at lists.open-bio.org > Subject: Re: [Bioperl-l] Bio::Species/Bio::Taxonomy changes > > > On Jul 24, 2006, at 3:34 PM, Chris Fields wrote: > > > Looks like the organelle sequence data uses the organism TaxID. > > Then you might as well store it as annotation. 
Really the only thing > that matters is that the flat file writers can get from an expected > location. > > In fact storing as annotation is better e.g. for Biosql since right > now the taxonomy model is the NCBI model and so organelle will not be > stored (and hence neither be round-tripped). > > -hilmar > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > From hlapp at gmx.net Mon Jul 24 20:12:39 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Mon, 24 Jul 2006 16:12:39 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <003e01c6af5a$390cdea0$15327e82@pyrimidine> References: <003e01c6af5a$390cdea0$15327e82@pyrimidine> Message-ID: <5FB07071-42D7-4F43-B2A1-3AF5F1FC5193@gmx.net> On Jul 24, 2006, at 3:49 PM, Chris Fields wrote: > Yes, 'largely' the key word. I don't really agree with Sendu's > hierarchy > scheme (making Species implement Taxonomy and not Node doesn't make > sense), > but, besides that, everything else seems fine. I like the > following setup > (which is similar to what you proposed, I believe), which I already > posted. > > |-----Tax::Node > NodeI-------| > |-----Tax::SpeciesNode > | > SpeciesI -------| > > Taxonomy::Node is-a NodeI > Taxonomy::SpeciesNode is-a NodeI and-a SpeciesI I don't even think we would need SpeciesI - why would a species- ranked taxonomy node be so different from any other node such that it would need its own interface. Chris - just one suggestion: take a step back and imagine a Bioperl in which Bio::Species had never existed. Instead, only taxonomy nodes existed, and code that can effectively deal with them, including filtering by rank. In this picture, what would you make to want to introduce SpeciesI and Bio::Species? Frankly, I don't see anything. 
I.e., the only reason is backward compatibility (which is a valid reason), but let's not glorify Bio::Species by adding ill-conceived interfaces.

>
> Bio::DB::Taxonomy uses a factory to return NodeI-implementing modules;
> specifically, a SpeciesNode for species ranks or below, and a Node for
> anything else.

Like I said before, SpeciesNode or whatever it's called would draw its right of existence solely from backward compatibility - don't use it for anything else. And if you can achieve backward compatibility by other means, don't even create a SpeciesNode.

My $0.02 ...

	-hilmar
--
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================

From cjfields at uiuc.edu Mon Jul 24 21:34:29 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Mon, 24 Jul 2006 16:34:29 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <5FB07071-42D7-4F43-B2A1-3AF5F1FC5193@gmx.net>
Message-ID: <000101c6af68$f27521a0$15327e82@pyrimidine>

> I don't even think we would need SpeciesI - why would a species-
> ranked taxonomy node be so different from any other node such that it
> would need its own interface.
>
> Chris - just one suggestion: take a step back and imagine a Bioperl
> in which Bio::Species had never existed. Instead, only taxonomy nodes
> existed, and code that can effectively deal with them, including
> filtering by rank. In this picture, what would you make to want to
> introduce SpeciesI and Bio::Species?

Argh!!! Just when I thought I could pull away...

Okay. I thought it would be nice to have a class that could accomplish two things:

1) Act as a container for GenBank taxonomy information; Bio::Taxonomy::Node, as written by Jason, was meant to be a replacement for Bio::Species.

2) Also act as a bridge, so you had the option to retrieve the Species object from a sequence object and have it act like a Node (be db-aware out-of-the-box, so to speak).
Also, I'm trying to follow the original idea as proposed by Jason (this is from perldoc Bio::Taxonomy::Node): DESCRIPTION This is the next generation (for Bioperl) of representing Taxonomy information. Previously all information was managed by a single object called Bio::Species. This new implementation allows representation of the intermediate nodes not just the species nodes and can relate their connections. Which, to me, indicated that this would eventually replace Bio::Species (so, in effect, must at least contain the relevant data for sequence objects w/o being completely reliant on DB, yet still be DB-aware). Everything about Bio::Species on the wiki also leads me to believe that this was the original intent for Bio::Taxonomy::Node. http://www.bioperl.org/wiki/Module:Bio::Species http://www.bioperl.org/wiki/Project_priority_list#Taxonomy_.2F_Species_data And all the original methods (genus(), species(), etc.) also seem to indicate this. That's really it. I could give a toss about getting taxonomy information directly from Bio::Species. And you're right: in hindsight Bio::Species is flawed. However, it seemed from the beginning of this discussion with Sendu and the proposed changes, that Bio::Species should stick around in some capacity but should also be involved with Bio::Taxonomy (contrary to Jason's idea above). Now I'm hearing something completely different (Sendu still argues that it should be involved). I had originally wanted to start delegating everything over to Taxonomy::Node about a month ago, when I found that it was remarkably easy to do so. However, when Sendu proposed making changes to remove methods in Bio::Taxonomy::Node and make sweeping changes to Taxonomy which would prevent an easy transition over to Node, I felt that it would be harder to effectively have it take over for Bio::Species when parsing SeqIO objects (all the calls to genus/species/subspecies etc methods would have to be removed from all the classes which use Bio::Species). 
Hence Bio::Taxonomy::Species as a compromise. Now it turns out no one wants to have either Bio::Species (your 'contagion' references clues me in there) or Bio::Taxonomy::Species. If we think it would be better to completely toss all this out the window and use only a bare-bones Node, then I'm fine with that. But if we go that route we should just get rid of the Bio::Species 'disease' completely and have things be much simpler. Simple is good! I think Node can still act as a viable container class for the tax data from a GenBank file (it's original purpose) as long as it has the very basic methods for doing so. That would require: scientific_name() - ORGANISM line data common_names() - which could hold common names (in parentheses on the SOURCE line) and the abbreviated name (from the SOURCE line) ncbi_taxid() - from the 'source' seqfeature (already there). The lineage information and organelle information could be stored in Node or in SimpleValue objects. My vote is for the latter as there's no need for a classification() container for Node, which you have repeatedly pointed out. > Frankly, I don't see anything. I.e., the only reason is backward > compatibility (which is a valid reason), but let's not glorify > Bio::Species by adding ill-conceived interfaces. I think we should just get rid of Bio::Species completely. We would need to go in and rework species parsing in the SeqIO modules that use Bio::Species, but that would only make things simpler, not more complex. Get rid of trying to figure out what is a genus or species based on the GenBank information only, and have the bridge between the sequences be stored in a Taxonomy::Node object (which should contain the NCBI TaxID, so then it can use the associated DB object to traverse up and down other nodes). The interface idea was a proposed compromise i.e. my 'bridge' between GenBank taxonomy hell and Bio::Taxonomy bliss, and intended to follow what I thought was Jason's original intent for Bio::Taxonomy::Node. 
Nothing more.

> > Bio::DB::Taxonomy uses a factory to return NodeI-implementing modules;
> > specifically, a SpeciesNode for species ranks or below, and a Node for
> > anything else.
>
> Like I said before, SpeciesNode or whatever it's called would draw
> its right of existence solely from backward compatibility - don't use
> it for anything else. And if you can achieve backward compatibility
> by other means, don't even create a SpeciesNode.

Agreed. But, if there is such venom towards Bio::Species, why not put it out of its misery as well? Seems like it has outlived its usefulness.

Chris

From cjfields at uiuc.edu Mon Jul 24 21:53:46 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Mon, 24 Jul 2006 16:53:46 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44C52345.5060903@sendu.me.uk>
Message-ID: <000201c6af6b$a4534580$15327e82@pyrimidine>

> > I'll repeat: a Node and a Species is-not-a Taxonomy.
>
> I'll repeat: A Node is a Node and a Bio::Species is a Taxonomy ;)

Nope. I think this is incorrect. Here's why. Let's look at the reasons Bio::Taxonomy was started, shall we?

>From perldoc Bio::Taxonomy:

DESCRIPTION
    Bio::Taxonomy object represents any rank-level in taxonomy system,
    rather than Bio::Species which is able to represent only species-level.
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

>From perldoc Bio::Taxonomy::Node

DESCRIPTION
    This is the next generation (for Bioperl) of representing Taxonomy
    information. Previously all information was managed by a single object
    called Bio::Species. This new implementation allows representation of
    the intermediate nodes not just the species nodes and can relate their
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    connections.

Bioperl wiki:

http://www.bioperl.org/wiki/Project_priority_list#Taxonomy_.2F_Species_data
http://www.bioperl.org/wiki/Module:Bio::Species

Both talk about delegating or replacing Bio::Species with Bio::Taxonomy::Node.
Everyone of those indicates what the original idea for Bio::Taxonomy::Node was (eventual replacement for Bio::Species). Even the original methods for Bio::Taxonomy::Node are the same. So, according to this alone, Bio::Species would eventually be replaced by Bio::Taxonomy::Node. I wanted an easier transition to Node from Bio::Species (hell, just a few changes and using Bio::Taxonomy::Node worked fine!) , but your proposals made sense. I saw having a Species-based Tax object as a nice compromise, but Hilmar has made a few good points: would we have a Bio::Species object around knowing what we know now? When Bio::Species was originally designed, it was probably before the NCBI Tax database existed. I think it has outlasted its current use. I have posted a response to Hilmar. I think we should just get rid of Bio::Species altogether and have a Taxonomy::Node contain the basic data (scientific_name(), common_names(), etc). And remove any SeqIO parsing of genus/species to simplify everything. All this extra parsing and hand-wringing over trying to get species/genus information from a GenBank file just mucks up ORGANISM and SOURCE line parsing anyway. Simplify it. Simple is good. Radical? Yes, but I agree with him that Bio::Species has outlasted it's use. As for organelle and lineage information, they could be placed in SimpleValue objects. If anyone wants to grab tax information, they can use the Node object to get it but they'll need a local flatfile database or network connection to do so. This also means there is no need for a Bio::DB::Taxonomy factory: just return Node objects directly. Each format (flatfile and entrez) currently works this way anyway, correct? Simplifies that. Simple is better. 
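
A bare-bones node of the kind proposed above might be sketched as follows, in plain Perl without BioPerl. Sketch::Node is a hypothetical stand-in and not Bio::Taxonomy::Node; only the three accessor names come from this thread. The SOURCE, ORGANISM and lineage text are taken from the yeast record quoted earlier in the thread, and the taxid 4932 (NCBI's id for Saccharomyces cerevisiae) is supplied by hand here, where a parser would read it from the source feature's db_xref. The lineage is kept as plain annotation outside the node, with no attempt to guess which name is the genus.

```perl
use strict;
use warnings;

package Sketch::Node;    # hypothetical stand-in, not Bio::Taxonomy::Node

sub new {
    my ($class, %args) = @_;
    my $self = {
        scientific_name => $args{scientific_name},
        common_names    => $args{common_names} || [],
        ncbi_taxid      => $args{ncbi_taxid},
    };
    return bless $self, $class;
}

sub scientific_name { $_[0]{scientific_name} }
sub common_names    { @{ $_[0]{common_names} } }
sub ncbi_taxid      { $_[0]{ncbi_taxid} }

package main;

my $source_line   = q{Saccharomyces cerevisiae (baker's yeast)};
my $organism_line = q{Saccharomyces cerevisiae};
my $lineage_text  = 'Eukaryota; Fungi; Ascomycota; Saccharomycotina; '
                  . 'Saccharomycetes; Saccharomycetales; Saccharomycetaceae; '
                  . 'Saccharomyces.';

# Common name: whatever sits in trailing parentheses on the SOURCE line.
my @common = $source_line =~ /\(([^()]+)\)\s*$/ ? ($1) : ();

my $node = Sketch::Node->new(
    scientific_name => $organism_line,
    common_names    => \@common,
    ncbi_taxid      => 4932,    # entered by hand for this sketch
);

# Lineage stored as simple annotation, split on ';' after dropping
# the trailing full stop. No genus/species guessing is attempted.
(my $trimmed = $lineage_text) =~ s/\.\s*$//;
my %annotation = (lineage => [ split /;\s*/, $trimmed ]);

printf "%s (%s), taxid %d, %d lineage names\n",
    $node->scientific_name,
    join(', ', $node->common_names),
    $node->ncbi_taxid,
    scalar @{ $annotation{lineage} };
```

The parser in this sketch never has to decide what is a genus and what is a species, which is the simplification being argued for.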
Of course, we couldn't get rid of Bio::Species until all the following were shifted over to Node somehow: ; > Instances: 2 BP Module : Bio::Cluster::SequenceFamily Instances: 4 BP Module : Bio::Cluster::UniGene Instances: 1 BP Module : Bio::Cluster::UniGeneI Instances: 1 BP Module : Bio::DB::FileCache Instances: 3 BP Module : Bio::DB::GFF::Segment Instances: 1 BP Module : Bio::DB::Taxonomy::flatfile Instances: 2 BP Module : Bio::Graph::IO::psi_xml Instances: 1 BP Module : Bio::Map::CytoMap Instances: 1 BP Module : Bio::Map::LinkageMap Instances: 3 BP Module : Bio::Map::MapI Instances: 3 BP Module : Bio::Map::SimpleMap Instances: 3 BP Module : Bio::Matrix::PSM::InstanceSite Instances: 6 BP Module : Bio::Phenotype::Correlate Instances: 1 BP Module : Bio::Phenotype::OMIM::OMIMentry Instances: 3 BP Module : Bio::Phenotype::OMIM::OMIMparser Instances: 5 BP Module : Bio::Phenotype::Phenotype Instances: 2 BP Module : Bio::Phenotype::PhenotypeI Instances: 4 BP Module : Bio::Seq Instances: 3 BP Module : Bio::SeqI Instances: 2 BP Module : Bio::SeqIO::agave Instances: 4 BP Module : Bio::SeqIO::bsml Instances: 2 BP Module : Bio::SeqIO::bsml_sax Instances: 1 BP Module : Bio::SeqIO::chadoxml Instances: 1 BP Module : Bio::SeqIO::chaos Instances: 4 BP Module : Bio::SeqIO::embl Instances: 2 BP Module : Bio::SeqIO::entrezgene Instances: 3 BP Module : Bio::SeqIO::game::seqHandler Instances: 4 BP Module : Bio::SeqIO::genbank Instances: 2 BP Module : Bio::SeqIO::kegg Instances: 2 BP Module : Bio::SeqIO::locuslink Instances: 4 BP Module : Bio::SeqIO::swiss Instances: 2 BP Module : Bio::SeqIO::table Instances: 2 BP Module : Bio::SeqIO::tigr Instances: 2 BP Module : Bio::SeqIO::tigrxml Instances: 7 BP Module : Bio::SeqIO::tinyseq Instances: 4 BP Module : Bio::Taxonomy Instances: 1 BP Module : Bio::Taxonomy::Node Instances: 6 BP Module : Bio::Taxonomy::Taxon Instances: 9 BP Module : Bio::Taxonomy::Tree Instances: 5 BP Module : Bio::Tools::Analysis::Protein::ELM Chris From bix at 
sendu.me.uk Mon Jul 24 22:15:31 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Mon, 24 Jul 2006 23:15:31 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <000101c6af68$f27521a0$15327e82@pyrimidine> References: <000101c6af68$f27521a0$15327e82@pyrimidine> Message-ID: <44C54683.70707@sendu.me.uk> Chris Fields wrote: > > Also, I'm trying to follow the original idea as proposed by Jason (this is > from perldoc Bio::Taxonomy::Node): > > Which, to me, indicated that this would eventually replace Bio::Species Well, we don't really know that Jason didn't later change his mind, but in any case it doesn't make sense (anymore, given that we have Bio::Taxonomy). In a direct reply to me you point out specific passages in the current docs that explain why you have thought we should delegate or replace Bio::Species with Bio::Taxonomy::Node. With respect, the old plans are not something we are forced to blindly follow. We decide for ourselves if they make sense, we decide for ourselves if there is a better way of doing it, and then we do it the best way. So if you ignore what those old bits of documentation say, just pretend you never ever read them, would my proposals make sense or not? Since those old proposals were never implemented we have no reason to try and stick with them if there is a better proposal. And for the record, '...Bio::Species which is able to represent only species-level' can (correctly) be interpreted as 'Bio::Species is only supposed to be used for representing a taxonomy that includes the species-level'. You can't interpret it literally because Bio::Species is used for levels below species, and also represents all the levels above species-level as well. Either Jason got it wrong when he wrote that, or you have misinterpreted it. Likewise, let's play the interpretation game again: 'Previously all information was managed by a single object called Bio::Species. 
[the Bio::Taxonomy::Node] implementation allows representation of the intermediate nodes not just the species nodes'. Note the apposition of 'single object' vs implication of multiple Node objects to do the same job. I imagine at the time Jason wrote that there was no Bio::Taxonomy, no holder for multiple Nodes. > I had originally wanted to start delegating everything over to > Taxonomy::Node about a month ago, when I found that it was remarkably easy > to do so. However, when Sendu proposed making changes to remove methods in > Bio::Taxonomy::Node and make sweeping changes to Taxonomy which would > prevent an easy transition over to Node, But an equally easy transition to Bio::Taxonomy instead. I don't know why you would care about the name of the class we switch to. My concern is that when the switch is made it makes sense. > If we think it would be better to completely toss all this out the window > and use only a bare-bones Node, then I'm fine with that. But if we go that > route we should just get rid of the Bio::Species 'disease' completely and > have things be much simpler. Simple is good! > > I think Node can still act as a viable container class for the tax data from > a GenBank file (it's original purpose) as long as it has the very basic > methods for doing so. That would require: > > scientific_name() - ORGANISM line data > common_names() - which could hold common names (in parentheses on the SOURCE > line) and the abbreviated name (from the SOURCE line) > ncbi_taxid() - from the 'source' seqfeature (already there). > > The lineage information and organelle information could be stored in Node or > in SimpleValue objects. My vote is for the latter as there's no need for a > classification() container for Node, which you have repeatedly pointed out. No, this is the whole point. 
The lineage information can NOT be stored in a Node (unless you abuse Node by having all those crufty methods like genus() and classification()), and why would we store it in SimpleValue objects when we have Bio::Taxonomy?

Bio::Taxonomy is completely perfect for storing the taxonomic information from a GenBank file. That's all you need to worry about. Can we represent the data correctly? Yes. Do we gain all the good things about a pure Bio::Taxonomy? Yes. Can we still do everything we used to be able to do? Yes.

> I think we should just get rid of Bio::Species completely.

There's no need to get rid of Bio::Species. It can be a Bio::Taxonomy with backward-compatible methods. No harm done, all good.

I'll tell you what. This will be easier if I just write the code for my proposals, including whatever changes would be needed in Bio::SeqIO::genbank et al. You'll see how easy and appropriate it is, and hopefully everyone will be happy.

Perhaps you could just hold off doing any similar-but-contradictory work until then.

From hlapp at gmx.net Mon Jul 24 23:47:10 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Mon, 24 Jul 2006 19:47:10 -0400
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44C54683.70707@sendu.me.uk>
References: <000101c6af68$f27521a0$15327e82@pyrimidine> <44C54683.70707@sendu.me.uk>
Message-ID: <86C97545-8DDF-4FE3-9CE9-1799D8BD2EB5@gmx.net>

On Jul 24, 2006, at 6:15 PM, Sendu Bala wrote:

> I'll tell you what. This will be easier if I just write the code for my
> proposals, including whatever changes would be needed in
> Bio::SeqIO::genbank et al.

Never get in the way of somebody who threatens to code :-) so I certainly won't. I think you're on the right track.

My suggestion is, if you have a good picture in front of you of what it's going to look like when done, just pretend for a second it is done already and give us some code examples that use the new (to be done) API.
As a start, some of the situations it's currently used in:

- genbank.pm parsing and setting species information for the sequence
- user asking for the scientific name of the species of the sequence (obviously, the call would remain unchanged: $seq->species->binomial(). But what happens behind the scenes?)
- genbank.pm writing the SOURCE information for a sequence

Replace genbank.pm with your rich annotation source parser of choice.

Then maybe some advanced uses:

- from a sequence stream, retain only those of primates
- like above, but only mitochondrial sequences
- for an organism, query entrez for all sequences of strains, varieties, or subspecies sequences for that organism

Add your own if these sound stupid ...

Just an idea.

-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From cjfields at uiuc.edu Tue Jul 25 02:06:16 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Mon, 24 Jul 2006 21:06:16 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <86C97545-8DDF-4FE3-9CE9-1799D8BD2EB5@gmx.net>
References: <000101c6af68$f27521a0$15327e82@pyrimidine> <44C54683.70707@sendu.me.uk> <86C97545-8DDF-4FE3-9CE9-1799D8BD2EB5@gmx.net>
Message-ID: <4678548F-ABEC-4E14-AD7F-D282D2DC2730@uiuc.edu>

>> I'll tell you what. This will be easier if I just write the code for my
>> proposals, including whatever changes would be needed in
>> Bio::SeqIO::genbank et al.
>
> Never get in the way of somebody who threatens to code :-) so I
> certainly won't. I think you're on the right track.

Fine by me. My only request: I don't want every sequence passing through SeqIO having an automatic DB lookup performed on it. SeqIO parsing of GenBank files is slow enough as it is w/o enforcing lookups, even if they are cached. If you want lookups, have it as an option and not as default behavior.
We could have the option for a lookup added pretty easily in genbank.pm _initialize or the main SeqIO constructor as a simple Boolean flag. That might be pretty nice. ... > (). But what happens behind the scene?) > - genbank.pm writing the SOURCE information for a sequence You know, the only really divisive point here is the lineage data and how to store it in _read_GenBank_Species or reproduce it in write_seq (). Again, I don't think we should have a forced lookup for this; it should just be stored as is, either in Node or SimpleValue. Again, I think the latter as everyone seems averse to containing this in Node. > Then maybe some advanced uses: > > - from a sequence stream, retain only those of primates > - like above, but only mitochondrial sequences > - for an organism, query entrez for all sequences of strains, > varieties, or subspecies sequences for that organism For the primate example, would you screen those out via the in-file lineage or using lookups? Something like '$seqout->write_seq($seq) if ($seq->species->organelle eq 'mitochondrion');' for the mitochondria example, which would mean leaving organelle() in Species/Node or whatever is used. The last one, I think, can be done w/o using the sequence directly using NCBI's ELink and the TaxID to cross-reference the nucleotide database. You would probably have to walk through all child nodes, but it's feasible that way. > Add your own if these sound stupid ... > > Just an idea. > > -hilmar > > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. 
Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From cjfields at uiuc.edu Tue Jul 25 02:29:57 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 24 Jul 2006 21:29:57 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C54683.70707@sendu.me.uk> References: <000101c6af68$f27521a0$15327e82@pyrimidine> <44C54683.70707@sendu.me.uk> Message-ID: <6E129BAF-08E3-4BDA-8D1C-F853BE7231C9@uiuc.edu> Look, we're just going back and forth on this stupid little thing, when the only point we really are divided on is what object type we should store certain items in a GenBank file (Bio::Species/ Bio::Tax::Node/Bio::Whatever). In particular, the main sticking point is the lineage. We could go back and forth on what Jason really intended. Personally, I think his past statements are quite clear on what his intent was (he's very clear in the wiki on what Bio::Taxonomy::Node was built to replace, in two separate posts and within the last four months). The reality is he's not here and you're willing to do the job. There is one thing I will make perfectly clear here: there should never, ever be enforced lookups for SeqIO (even using caches), though I have no problem having optional ones. This is something I have stated before and what you propose below steers dangerously in that direction. Where, for instance, do you store the lineage from a GenBank file? Do you want to do a series of Tax lookups to restore that data? I think that the number one complaint for sequence parsing is speed, which would only get slower with lookups (even cached). What I propose is we make it as simple as possible. 
Remove the unnecessary genus/species/subspecies parsing in genbank.pm, store the scientific name, common names, and lineage in some easily accessible way to make it easier for everyday users to use, have it tied to Bio::Taxonomy in some way (I propose Node, as it contains almost all the methods needed) so that you could get more information by moving up and down nodes, or retrieve more information. I, personally, don't see the point in having Bio:Species around after this discussion as Node seems to do the job adequately. My last word (I will be exiting this discussion and the group for two weeks): This would have been MUCH easier if all three of us could have gone to the local bar for a beer and discussed it. We should just take the time out to videoconference next time. Chris > Chris Fields wrote: >> >> Also, I'm trying to follow the original idea as proposed by Jason >> (this is >> from perldoc Bio::Taxonomy::Node): >> >> Which, to me, indicated that this would eventually replace >> Bio::Species > > Well, we don't really know that Jason didn't later change his mind, > but > in any case it doesn't make sense (anymore, given that we have > Bio::Taxonomy). > > In a direct reply to me you point out specific passages in the current > docs that explain why you have thought we should delegate or replace > Bio::Species with Bio::Taxonomy::Node. With respect, the old plans are > not something we are forced to blindly follow. We decide for ourselves > if they make sense, we decide for ourselves if there is a better > way of > doing it, and then we do it the best way. > > So if you ignore what those old bits of documentation say, just > pretend > you never ever read them, would my proposals make sense or not? Since > those old proposals were never implemented we have no reason to try > and > stick with them if there is a better proposal. 
> > And for the record, '...Bio::Species which is able to represent only > species-level' can (correctly) be interpreted as 'Bio::Species is only > supposed to be used for representing a taxonomy that includes the > species-level'. You can't interpret it literally because > Bio::Species is > used for levels below species, and also represents all the levels > above > species-level as well. Either Jason got it wrong when he wrote > that, or > you have misinterpreted it. > > Likewise, let's play the interpretation game again: 'Previously all > information was managed by a single object called Bio::Species. [the > Bio::Taxonomy::Node] implementation allows representation of the > intermediate nodes not just the species nodes'. Note the apposition of > 'single object' vs implication of multiple Node objects to do the same > job. I imagine at the time Jason wrote that there was no > Bio::Taxonomy, > no holder for multiple Nodes. > > >> I had originally wanted to start delegating everything over to >> Taxonomy::Node about a month ago, when I found that it was >> remarkably easy >> to do so. However, when Sendu proposed making changes to remove >> methods in >> Bio::Taxonomy::Node and make sweeping changes to Taxonomy which would >> prevent an easy transition over to Node, > > But an equally easy transition to Bio::Taxonomy instead. I don't know > why you would care about the name of the class we switch to. My > concern > is that when the switch is made it makes sense. > > >> If we think it would be better to completely toss all this out the >> window >> and use only a bare-bones Node, then I'm fine with that. But if >> we go that >> route we should just get rid of the Bio::Species 'disease' >> completely and >> have things be much simpler. Simple is good! >> >> I think Node can still act as a viable container class for the tax >> data from >> a GenBank file (it's original purpose) as long as it has the very >> basic >> methods for doing so. 
That would require: >> >> scientific_name() - ORGANISM line data >> common_names() - which could hold common names (in parentheses on >> the SOURCE >> line) and the abbreviated name (from the SOURCE line) >> ncbi_taxid() - from the 'source' seqfeature (already there). >> >> The lineage information and organelle information could be stored >> in Node or >> in SimpleValue objects. My vote is for the latter as there's no >> need for a >> classification() container for Node, which you have repeatedly >> pointed out. > > No, this is the whole point. The lineage information can NOT be stored > in a Node (unless you absuse Node by having all those crufty methods > like genus() and classification()), and why would we store it in > SimpleValue objects when we have Bio::Taxonomy? > > Bio::Taxonomy is completely perfect for storing the taxonomic > information from a GenBank file. That's all you need to worry > about. Can > we represent the data correctly? Yes. Do we gain all the good things > about a pure Bio::Taxonomy? Yes. Can we still do everything we used to > be able to do? Yes. > > >> I think we should just get rid of Bio::Species completely. > > There's no need to get rid of Bio::Species. It can be a Bio::Taxonomy > with backward-compatible methods. No harm done, all good. > > > I'll tell you what. This will be easier if I just write the code > for my > proposals, including whatever changes would be needed in > Bio::SeqIO::genbank et al. You'll see how easy and appropriate it is, > and hopefully everyone will be happy. > > Perhaps you could just hold off doing any similar-but-contradictory > work > until then. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. 
Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From hlapp at gmx.net Tue Jul 25 03:31:41 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Mon, 24 Jul 2006 23:31:41 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <6E129BAF-08E3-4BDA-8D1C-F853BE7231C9@uiuc.edu> References: <000101c6af68$f27521a0$15327e82@pyrimidine> <44C54683.70707@sendu.me.uk> <6E129BAF-08E3-4BDA-8D1C-F853BE7231C9@uiuc.edu> Message-ID: <49AA1351-5741-4E19-AFCE-EEA5532531B6@gmx.net> On Jul 24, 2006, at 10:29 PM, Chris Fields wrote: > [...] > We could go back and forth on what Jason really intended. [...] The > reality is he's not here and you're willing to do the job. Right. And, knowing Jason, I think he'd be perfectly fine with seeing his original idea develop in a possibly different direction, provided it will all work nicely in the end. I'm willing to take the beating on me if that doesn't turn out to be true ... > > There is one thing I will make perfectly clear here: there should > never, ever be enforced lookups for SeqIO (even using caches), You certainly don't want taxonomy lookups during the parsing stage, and also not for the client requesting properties of the species that have been parsed with high confidence, i.e., genus and species for a straightforward binomial like 'Homo sapiens'. Writing sequences, IMHO, doesn't have to be as fast. It may be better to emit strict format a bit slower rather than sloppy format a bit faster. Upon parsing, one idea could be for the flat file parser to set a dirty bit in the parsed out species if the parsed text didn't follow strict binomial conventions, hence the parser may have made a mistake and if a client requests the information it is better to lookup the correct values from a taxonomy database. I.e., you could try with a strict regex first that would imply a high-confidence result. If that fails you don't give up but mark the result as untrustworthy. > [...] 
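The strict-regex-first idea described above could be sketched roughly as follows. This is only an illustration in plain Perl, not BioPerl code: parse_organism, the 'trusted' key and the pattern itself are hypothetical names and assumptions, and a real parser would need a more careful notion of what counts as a clean binomial.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): try a strict "Genus species"
# pattern first; anything else is still parsed, but flagged as untrusted
# so that a client could choose to confirm it against a taxonomy database.
sub parse_organism {
    my ($organism) = @_;

    # Strict case: capitalised genus, single lower-case epithet, nothing else.
    if ($organism =~ /^([A-Z][a-z]+) ([a-z]+)$/) {
        return { genus => $1, species => $2, trusted => 1 };
    }

    # Loose fallback: keep a best guess, but set the "dirty bit".
    my ($genus, @rest) = split ' ', $organism;
    return { genus => $genus, species => join(' ', @rest), trusted => 0 };
}

for my $name ('Homo sapiens', 'Influenza A virus (A/Puerto Rico/8/34(H1N1))') {
    my $result = parse_organism($name);
    printf "%-45s trusted=%d\n", $name, $result->{trusted};
}
```

No lookup happens at parse time in this sketch; the flag only records that a later lookup might be worthwhile, which keeps the default parsing path free of database access.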
> This would have been MUCH easier if all three of us could have gone > to the local bar for a beer and discussed it. We should just take > the time out to videoconference next time. You're not honestly suggesting that a videoconference is better than having beer together? Enjoy your trip, and thanks for hanging in there in the discussion, I appreciate it. -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From cjfields at uiuc.edu Tue Jul 25 05:53:33 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 25 Jul 2006 00:53:33 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <49AA1351-5741-4E19-AFCE-EEA5532531B6@gmx.net> References: <000101c6af68$f27521a0$15327e82@pyrimidine> <44C54683.70707@sendu.me.uk> <6E129BAF-08E3-4BDA-8D1C-F853BE7231C9@uiuc.edu> <49AA1351-5741-4E19-AFCE-EEA5532531B6@gmx.net> Message-ID: <190F1365-1F8B-401F-8FBA-8DE39D851111@uiuc.edu> So do we intend on having everyone who installs bioperl have a local copy of the taxonomy dumpfile? Or perform a remote lookup via Entrez? Seems a bit extreme. I would like the option of not having the lookup run; as I mentioned to Sendu, one of the biggest complaints about bioperl is speed. Additional lookups won't help on that end. Chris On Jul 24, 2006, at 10:31 PM, Hilmar Lapp wrote: > > On Jul 24, 2006, at 10:29 PM, Chris Fields wrote: > >> [...] >> We could go back and forth on what Jason really intended. [...] The >> reality is he's not here and you're willing to do the job. > > Right. And, knowing Jason, I think he'd be perfectly fine with seeing > his original idea develop in a possibly different direction, provided > it will all work nicely in the end. I'm willing to take the beating > on me if that doesn't turn out to be true ... 
> >> >> There is one thing I will make perfectly clear here: there should >> never, ever be enforced lookups for SeqIO (even using caches), > > You certainly don't want taxonomy lookups during the parsing stage, > and also not for the client requesting properties of the species that > have been parsed with high confidence, i.e., genus and species for a > straightforward binomial like 'Homo sapiens'. > > Writing sequences, IMHO, doesn't have to be as fast. It may be better > to emit strict format a bit slower rather than sloppy format a bit > faster. > > Upon parsing, one idea could be for the flat file parser to set a > dirty bit in the parsed out species if the parsed text didn't follow > strict binomial conventions, hence the parser may have made a mistake > and if a client requests the information it is better to lookup the > correct values from a taxonomy database. I.e., you could try with a > strict regex first that would imply a high-confidence result. If that > fails you don't give up but mark the result as untrustworthy. > > >> [...] >> This would have been MUCH easier if all three of us could have gone >> to the local bar for a beer and discussed it. We should just take >> the time out to videoconference next time. > > You're not honestly suggesting that a videoconference is better than > having beer together? > > Enjoy your trip, and thanks for hanging in there in the discussion, I > appreciate it. > > -hilmar > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. 
Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From bix at sendu.me.uk Tue Jul 25 07:05:23 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Tue, 25 Jul 2006 08:05:23 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <6E129BAF-08E3-4BDA-8D1C-F853BE7231C9@uiuc.edu> References: <000101c6af68$f27521a0$15327e82@pyrimidine> <44C54683.70707@sendu.me.uk> <6E129BAF-08E3-4BDA-8D1C-F853BE7231C9@uiuc.edu> Message-ID: <44C5C2B3.1020304@sendu.me.uk> Chris Fields wrote: > > There is one thing I will make perfectly clear here: there should > never, ever be enforced lookups for SeqIO (even using caches), though > I have no problem having optional ones. This is something I have > stated before and what you propose below steers dangerously in that > direction. Where, for instance, do you store the lineage from a > GenBank file? Do you want to do a series of Tax lookups to restore > that data? I think that the number one complaint for sequence > parsing is speed, which would only get slower with lookups (even > cached). I already gave a code example of exactly how Bio::Taxonomy is perfect for storing the lineage data in a GenBank file with or without a database lookup. I think perhaps at the time you first read this you basically ignored it because you had trouble with the idea of adding nodes to a species. If you have been glossing over my argument, it may be instructive to go over what I've been saying with a clear eye. 
Anyway, here it is again, and remember in this example, Bio::Species isa Bio::Taxonomy:

    ## the fully-manual way
    my $species = new Bio::Species;
    my $node = new Bio::Taxonomy::Node(-name => 'Saccharomyces cerevisiae',
                                       -rank => 'species', -object_id => 1,
                                       -parent_id => 2);
    my $n2 = new Bio::Taxonomy::Node(-name => 'Saccharomyces',
                                     -object_id => 2, -parent_id => 3);
    # (no assumption that 'Saccharomyces' is the genus, so rank() undefined)
    my $n3 = [etc]
    $species->add_node($node);
    $species->add_node($n2);
    [etc]

    ## Using a factory without db access
    # assume that Bio::Taxonomy::GenbankFactory implements
    # some modified Bio::Taxonomy::FactoryI
    my $factory = Bio::Taxonomy::GenbankFactory->new();
    my $species = $factory->generate(-classification =>
        ['Saccharomyces cerevisiae', 'Saccharomyces',
         'Saccharomycetaceae' ...]);
    # the generate() method above just does the fully-manual way for you

    ## Using a factory with db access
    # assume that Bio::Taxonomy::EntrezFactory implements some
    # modified Bio::Taxonomy::FactoryI and uses Bio::DB::Taxonomy::entrez
    # to get the nodes
    my $factory = Bio::Taxonomy::EntrezFactory->new();
    my $species = $factory->fetch(-scientific_name => 'Saccharomyces cerevisiae');

So now do you see how we're able to do the Genbank no-db way and the db-using way with the same object model? We're able to do it the same, sane way because a Node is just a node; you can make them yourself manually, or retrieve them from a database. Once you stick them in a Taxonomy you can then (potentially) ask all the questions of the data that you can with existing Bio::Species. No cruft is required anywhere at all. All the Taxonomy classes can be 'pure', while only Bio::Species has to have backward-compatibility methods.
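To make the "factory without db access" case above a little more concrete, here is a minimal sketch of what such a generate() step might do internally. It is plain Perl with hashes standing in for Bio::Taxonomy::Node objects, so it runs without BioPerl; generate_nodes and the hash keys are hypothetical, and it assumes (as in the example above) that the classification list runs from most specific to most general and that only the first entry can safely be called a species.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the proposed generate(): turn a classification
# list into linked node records with locally assigned ids, no database needed.
sub generate_nodes {
    my (@classification) = @_;
    my @nodes;
    for my $i (0 .. $#classification) {
        push @nodes, {
            name      => $classification[$i],
            object_id => $i + 1,
            # each entry's parent is the next entry; the last one has none
            parent_id => $i < $#classification ? $i + 2 : undef,
            # assumption: only the first entry gets a rank, the rest stay
            # undefined because the flat file does not name their ranks
            rank      => $i == 0 ? 'species' : undef,
        };
    }
    return @nodes;
}

my @nodes = generate_nodes('Saccharomyces cerevisiae', 'Saccharomyces',
                           'Saccharomycetaceae');
for my $n (@nodes) {
    printf "%d -> %s  %s (%s)\n",
        $n->{object_id},
        defined $n->{parent_id} ? $n->{parent_id} : 'none',
        $n->{name},
        defined $n->{rank} ? $n->{rank} : 'rank unknown';
}
```

The ids here are only meaningful within one parsed record; a database-backed factory would presumably supply real taxon ids in their place.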
From bernd.web at gmail.com Tue Jul 25 10:47:50 2006
From: bernd.web at gmail.com (Bernd Web)
Date: Tue, 25 Jul 2006 12:47:50 +0200
Subject: [Bioperl-l] Structure::IO
Message-ID: <716af09c0607250347w24e0d8dbj674758049bc0a4e2@mail.gmail.com>

Hi,

Does someone have experience with Bio::Structure::IO? The example III.9.1 from the bptutorial.pl covers most of it, but what is, e.g., the chain() method of Bio::Structure::Entry doing? The POD states:

    Title   : chain
    Usage   : @chains = $structure->chain($chain);
    Function: Connects a (or a list of) Chain objects to a Bio::Structure::Entry.
    Returns : list of Bio::Structure::Residue objects
    Args    : One Residue or a reference to an array of Residue objects

But in, e.g.,

    my $stream = Bio::Structure::IO->new(-file => $filename,
                                         -format => 'pdb');
    while ( my $struc = $stream->next_structure() ) {
        for my $chain ($struc->get_chains) {
            my $chainid = $chain->id;
            my @chains = $struc->chain($chain);
        }
    }

I get Bio::Structure::Chain=HASH(0x9f1ab50). What is the function of the chain method and how should it be used?

Best regards,
bernd

From bernd.web at gmail.com Tue Jul 25 11:44:28 2006
From: bernd.web at gmail.com (Bernd Web)
Date: Tue, 25 Jul 2006 13:44:28 +0200
Subject: [Bioperl-l] SeqUtils
Message-ID: <716af09c0607250444y3e005fb1t4e20094fd8db993d@mail.gmail.com>

Hi,

With Bio::SeqUtils it may be nice to support 3-letter codes with capitals only, too. Now

    my $string = Bio::SeqUtils->seq3in($seqobj, 'METGLYTER');

will give in $string->seq: XXX. Possibly the capitals in MetGlyTer are used to find the amino acid codes? If not, maybe it's easy to implement case-insensitive, or all-capitals, AA codes in SeqUtils?

In addition, about the POD: maybe it's better not to use $string, since Bio::SeqUtils->seq3in does not return a string but a Bio::PrimarySeq object.
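Until something like that is supported, one caller-side workaround could be to normalise the case of the codes before handing them to seq3in. The sketch below is plain Perl; normalize_three_letter is a hypothetical helper and not part of Bio::SeqUtils, and it assumes the input is made up of exactly three characters per residue.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Bio::SeqUtils: rewrite every three-letter
# group as one capital followed by two lower-case letters.
sub normalize_three_letter {
    my ($codes) = @_;
    die "length of '$codes' is not a multiple of 3\n" if length($codes) % 3;
    # split into groups of three characters, then fix the case of each group
    return join '', map { ucfirst(lc($_)) } unpack('(A3)*', $codes);
}

print normalize_three_letter('METGLYTER'), "\n";   # prints MetGlyTer
```

The normalised string could then be passed on as in the call above, e.g. Bio::SeqUtils->seq3in($seqobj, normalize_three_letter('METGLYTER')).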
Regards, Bernd From cjfields at uiuc.edu Tue Jul 25 12:28:01 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 25 Jul 2006 07:28:01 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C5C2B3.1020304@sendu.me.uk> References: <000101c6af68$f27521a0$15327e82@pyrimidine> <44C54683.70707@sendu.me.uk> <6E129BAF-08E3-4BDA-8D1C-F853BE7231C9@uiuc.edu> <44C5C2B3.1020304@sendu.me.uk> Message-ID: <05718B29-6432-4ACA-A10F-2B384CB68191@uiuc.edu> Look, you explaining this to me, as you see it, does not convince me that its the correct or right way to do it. Okay? Can we agree on that? I do not think that Species and Taxonomy are the same thing. A species should not hold more than one node. A species, by definition, is a rank in Taxonomy, and is a node, not a full Taxonomy, so Bio::Species should be a Node, not a Taxonomy. I don't see how I can be any clearer... The fact that it may work is beyond the point. That's like putting duct tape on a leak to me. Why not just simplify Bio::Species into a Node? Or make it into a Node and get rid of it altogether. You are going to do what you want to do, regardless of what I say. Seems to be par for the course here. I'm REALLY tired of arguing the point. Okay? Just drop it. I have other priorities in life besides goddamned bioperl right now... Chris On Jul 25, 2006, at 2:05 AM, Sendu Bala wrote: > Chris Fields wrote: >> >> There is one thing I will make perfectly clear here: there should >> never, ever be enforced lookups for SeqIO (even using caches), though >> I have no problem having optional ones. This is something I have >> stated before and what you propose below steers dangerously in that >> direction. Where, for instance, do you store the lineage from a >> GenBank file? Do you want to do a series of Tax lookups to restore >> that data? I think that the number one complaint for sequence >> parsing is speed, which would only get slower with lookups (even >> cached). 
> > I already gave a code example of exactly how Bio::Taxonomy is perfect > for storing the lineage data in a GenBank file with or without a > database lookup. I think perhaps at the time you first read this you > basically ignored it because you had trouble with the idea of adding > nodes to a species. If you have been glossing over my argument, it may > be instructive to go over what I've been saying with a clear eye. > Anyway, here it is again, and remember in this example, > Bio::Species isa > Bio::Taxonomy: > > > ## the fully-manual way > my $species = new Bio::Species; > my $node = new Bio::Taxonomy::Node(-name => 'Saccharomyces > cerevisiae', > -rank => 'species', -object_id > => 1, > -parent_id => 2); > my $n2 = new Bio::Taxonomy::Node(-name => 'Saccharomyces', > -object_id => 2, -parent_id => 3); > # (no assumption that 'Saccharomyces' is the genus, so rank() > undefined) > my $n3 = [etc] > $species->add_node($node); > $species->add_node($n2); > [etc] > > ## Using a factory without db access > # assume that Bio::Taxonomy::GenbankFactory implements > # some modified Bio::Taxonomy::FactoryI > my $factory = Bio::Taxonomy::GenbankFactory->new(); > my $species = $factory->generate(-classification => ['Saccharomyces > cerevisiae', 'Saccharomyces', > 'Saccharomycetaceae' ...]); > # the generate() method above just does the fully-manual way for you > > ## Using a factory with db access > # assume that Bio::Taxonomy::EntrezFactory implements some > # modified Bio::Taxonomy::FactoryI and uses Bio::DB::Taxonomy::entrez > # to get the nodes > my $factory = Bio::Taxonomy::EntrezFactory->new(); > my $species = $factory->fetch(-scientifc_name => 'Saccharomyces > cerevisiae'); > > > So now do you see how we're able to do the Genbank no-db way and the > db-using way with the same object model? We're able to do it the same, > sane way because a Node is just a node; you can make them yourself > manually, or retrieve them from a database. 
Once you stick them in a > Taxonomy you can then (potentially) ask all the questions of the data > that you can with existing Bio::Species. No cruft is required anywhere > at all. All the Taxonomy classes can be 'pure', while only > Bio::Species > has to have backward-compatibility methods. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From bix at sendu.me.uk Tue Jul 25 12:52:03 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Tue, 25 Jul 2006 13:52:03 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <05718B29-6432-4ACA-A10F-2B384CB68191@uiuc.edu> References: <000101c6af68$f27521a0$15327e82@pyrimidine> <44C54683.70707@sendu.me.uk> <6E129BAF-08E3-4BDA-8D1C-F853BE7231C9@uiuc.edu> <44C5C2B3.1020304@sendu.me.uk> <05718B29-6432-4ACA-A10F-2B384CB68191@uiuc.edu> Message-ID: <44C613F3.7070903@sendu.me.uk> Chris Fields wrote: > A species should not hold more than one node. A species, by > definition, is a rank in Taxonomy, and is a node, not a full > Taxonomy, so Bio::Species should be a Node, not a Taxonomy. I don't > see how I can be any clearer... Right, we have differing viewpoints because you're concerned with what Bio::Species /should/ be, based on the name of the file and perhaps its original intent, whilst I am treating it as what it actually /is/, which is an object that is used to contain information about multiple taxonomic nodes. > The fact that it may work is beyond the point. That's like putting > duct tape on a leak to me. Why not just simplify Bio::Species into a > Node? Or make it into a Node and get rid of it altogether. Bio::Species, again ignore the name, is just a thing that lets us store and retrieve a certain set of data. If we simplified it into a pure Node, it could no longer do that job. 
If we just get rid of it all together it can no longer do its job. By making it a Bio::Taxonomy it can continue to do its job without having to have Node objects with cruft. It would also gain the useful methods of Bio::Taxonomy at the same time. I really don't mean to upset you, and I apologise for having done so. I've been presenting what I thought was a logical argument in favour of Bio::Species as Bio::Taxonomy, and waiting to see if anyone would come up with a logical argument why that would be inappropriate, or why something else would be better. I'm not saying you're wrong and I'm certainly listening and would change my choice based on what you have to say. I don't think it's fair to say that disregarding what you have to say is 'par for the course' - I already /have/ regarded what you had to say in this thread and ended up doing scientific_name() as purely what we get from the database. From hlapp at gmx.net Tue Jul 25 13:47:47 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Tue, 25 Jul 2006 09:47:47 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C5C2B3.1020304@sendu.me.uk> References: <000101c6af68$f27521a0$15327e82@pyrimidine> <44C54683.70707@sendu.me.uk> <6E129BAF-08E3-4BDA-8D1C-F853BE7231C9@uiuc.edu> <44C5C2B3.1020304@sendu.me.uk> Message-ID: On Jul 25, 2006, at 3:05 AM, Sendu Bala wrote: > [...] > ## the fully-manual way > my $species = new Bio::Species; > my $node = new Bio::Taxonomy::Node(-name => 'Saccharomyces > cerevisiae', > -rank => 'species', -object_id > => 1, > -parent_id => 2); If this is meant as an example for the use cases I enumerated, then you wouldn't have the parent_id from a Genbank file. However, you didn't have that before either, so no problem. 
> my $n2 = new Bio::Taxonomy::Node(-name => 'Saccharomyces', > -object_id => 2, -parent_id => 3); > # (no assumption that 'Saccharomyces' is the genus, so rank() > undefined) I think in a confident parse you want to assign 'genus' if there's little doubt, for example 'Saccharomyces cerevisiae'. Not sure whether there are weird viri whose names look innocuous but in reality the name doesn't follow binomial convention. > my $n3 = [etc] > $species->add_node($node); > $species->add_node($n2); I know why you are doing this, but seeing this people will hit a mental snag. You should listen to Chris' refusal to see the sense in this as an indication that many people down the road won't see the sense either. So instead, make the logical model in your design more obvious, which I think ultimately will help maintainability as well. For example: my $taxonomy = Bio::Taxonomy->new(); my $node = new Bio::Taxonomy::Node(-name => 'Saccharomyces cerevisiae', -rank => 'species', -object_id => 1, -parent_id => 2); my $n2 = new Bio::Taxonomy::Node(-name => 'Saccharomyces', -object_id => 2, -parent_id => 3); $taxonomy->add_node($node); $taxonomy->add_node($n2); my $species = Bio::Species->new(-lineage => $taxonomy); print $species->binomial(); print $species->genus(); # this may trigger a lookup if a taxonomy db handle has been set, e.g.: # $taxonomy->db_handle(Bio::DB::Taxonomy->new(-source => 'entrez')); print $species->classification(); > [etc] > > ## Using a factory without db access > # assume that Bio::Taxonomy::GenbankFactory implements > # some modified Bio::Taxonomy::FactoryI > my $factory = Bio::Taxonomy::GenbankFactory->new(); > my $species = $factory->generate(-classification => ['Saccharomyces > cerevisiae', 'Saccharomyces', > 'Saccharomycetaceae' ...]); > # the generate() method above just does the fully-manual way for you Except the method name would be create_object(), the parameter would be a hash ref, and the return value would be a Bio::TaxonomyI compliant object: my 
$taxonomy = $factory->create_object({-classification => ['Saccharomyces cerevisiae', 'Saccharomyces', 'Saccharomycetaceae' ...]}); my $species = Bio::Species->new(-lineage => $taxonomy); > > ## Using a factory with db access > # assume that Bio::Taxonomy::EntrezFactory implements some > # modified Bio::Taxonomy::FactoryI and uses Bio::DB::Taxonomy::entrez > # to get the nodes > my $factory = Bio::Taxonomy::EntrezFactory->new(); The logic where to do a lookup on should not be duplicated here. It only belongs under Bio::DB::Taxonomy::*. > my $species = $factory->fetch(-scientifc_name => 'Saccharomyces > cerevisiae'); Likewise, use the methods defined in Bio::DB::Taxonomy, and again, the return type is Bio::Taxonomy, which you would pass to Bio::Species->new(). -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at gmx.net Tue Jul 25 13:54:14 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Tue, 25 Jul 2006 09:54:14 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <190F1365-1F8B-401F-8FBA-8DE39D851111@uiuc.edu> References: <000101c6af68$f27521a0$15327e82@pyrimidine> <44C54683.70707@sendu.me.uk> <6E129BAF-08E3-4BDA-8D1C-F853BE7231C9@uiuc.edu> <49AA1351-5741-4E19-AFCE-EEA5532531B6@gmx.net> <190F1365-1F8B-401F-8FBA-8DE39D851111@uiuc.edu> Message-ID: <793AFD5C-D220-493F-BE11-B9023DC9F569@gmx.net> We intend on having everyone who wants correct taxonomy parsing results for the entire kingdom of life to define his/her authoritative taxonomy database, be it local or not, be it HTTP or SQL queried. If you don't care about the correctness of the taxonomy parse, or if the taxonomy information in the flat file is trivially parseable because it conforms to standard binomial convention, then whatever is to be put in place needs to work fine regardless of whether a taxonomy database is defined or not. 
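One way to read this requirement is sketched below in plain Perl. It is an illustration only, not BioPerl code: parse_organism(), genus_of() and the $lookup code ref are made-up names, and the code ref merely stands in for whatever taxonomy database a user might configure. A strict binomial is answered from the parse alone; anything else is flagged as untrusted and is only resolved if a lookup has been supplied.

```perl
use strict;
use warnings;

# Parse an organism string. A strict binomial ("Genus species") is treated
# as high confidence; anything else is flagged so that a caller can decide
# whether to consult a taxonomy database.
sub parse_organism {
    my ($name) = @_;
    if ($name =~ /^([A-Z][a-z]+) ([a-z]+)$/) {
        return { genus => $1, species => $2, trusted => 1 };
    }
    return { raw => $name, trusted => 0 };
}

# $lookup is an optional code ref standing in for a database query
# (hypothetical; no real BioPerl API is used here).
sub genus_of {
    my ($name, $lookup) = @_;
    my $parsed = parse_organism($name);
    return $parsed->{genus} if $parsed->{trusted};
    return $lookup->($name) if $lookup;   # only consulted when configured
    return;                               # no database: answer is unknown
}

print genus_of('Homo sapiens'), "\n";

my $fake_db = sub { return 'Escherichia' };   # made-up lookup for the demo
print genus_of('Escherichia coli K-12', $fake_db), "\n";

print defined(genus_of('Escherichia coli K-12')) ? "defined\n" : "undef\n";
```

The point of the design is that nothing breaks when no database is defined: the caller simply gets an undefined answer for names that could not be parsed with confidence.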
-hilmar On Jul 25, 2006, at 1:53 AM, Chris Fields wrote: > So do we intend on having everyone who installs bioperl have a local > copy of the taxonomy dumpfile? Or perform a remote lookup via > Entrez? Seems a bit extreme. > > I would like the option of not having the lookup run; as I mentioned > to Sendu, one of the biggest complaints about bioperl is speed. > Additional lookups won't help on that end. > > Chris > > On Jul 24, 2006, at 10:31 PM, Hilmar Lapp wrote: > >> >> On Jul 24, 2006, at 10:29 PM, Chris Fields wrote: >> >>> [...] >>> We could go back and forth on what Jason really intended. [...] The >>> reality is he's not here and you're willing to do the job. >> >> Right. And, knowing Jason, I think he'd be perfectly fine with seeing >> his original idea develop in a possibly different direction, provided >> it will all work nicely in the end. I'm willing to take the beating >> on me if that doesn't turn out to be true ... >> >>> >>> There is one thing I will make perfectly clear here: there should >>> never, ever be enforced lookups for SeqIO (even using caches), >> >> You certainly don't want taxonomy lookups during the parsing stage, >> and also not for the client requesting properties of the species that >> have been parsed with high confidence, i.e., genus and species for a >> straightforward binomial like 'Homo sapiens'. >> >> Writing sequences, IMHO, doesn't have to be as fast. It may be better >> to emit strict format a bit slower rather than sloppy format a bit >> faster. >> >> Upon parsing, one idea could be for the flat file parser to set a >> dirty bit in the parsed out species if the parsed text didn't follow >> strict binomial conventions, hence the parser may have made a mistake >> and if a client requests the information it is better to lookup the >> correct values from a taxonomy database. I.e., you could try with a >> strict regex first that would imply a high-confidence result. 
If that >> fails you don't give up but mark the result as untrustworthy. >> >> >>> [...] >>> This would have been MUCH easier if all three of us could have gone >>> to the local bar for a beer and discussed it. We should just take >>> the time out to videoconference next time. >> >> You're not honestly suggesting that a videoconference is better than >> having beer together? >> >> Enjoy your trip, and thanks for hanging in there in the discussion, I >> appreciate it. >> >> -hilmar >> -- >> =========================================================== >> : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : >> =========================================================== >> >> >> >> >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > > Christopher Fields > Postdoctoral Researcher > Lab of Dr. Robert Switzer > Dept of Biochemistry > University of Illinois Urbana-Champaign > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From cjfields at uiuc.edu Tue Jul 25 14:58:29 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 25 Jul 2006 09:58:29 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <793AFD5C-D220-493F-BE11-B9023DC9F569@gmx.net> Message-ID: <002601c6affa$ca4433f0$15327e82@pyrimidine> Agreed. I fully support the addition of an optional lookup; it gives much more flexibility SeqIO re: your previous examples of screening sequence streams for sequences that are primate, mitochondrial, etc. The key word I want to emphasize is 'optional', not 'enforced'. I appreciate what Sendu is trying to do; I really do. 
I think carrying over an object named 'Bio::Species' into Taxonomy is too confusing (your 'contagion' analogy, as it were). The 'species' concept (biologically speaking here, not talking about the Bioperl class) is a taxonomic rank (i.e. part of a taxonomy). I'm trying to take a biologist's point of view here. What is a 'species'? Or, if we were to stick strictly with using NCBI definitions, what is a 'species'? The NCBI definition of 'species' is simply a rank in a lineage, so it is (in Bioperl terms) a Node. If we were to follow that line of reasoning, why also have a Species object represent a Taxonomy as well? It's way too confusing. Sendu's repeatedly stating "a Species is a Taxonomy" makes some sense in a BioPerl world only, as we're speaking about a class that has been around for a long time, one that acted as a container of sorts for sequence data. And I understand what he intends to do. Conceptually speaking here, though, the way it is laid out, a Bio::Species object can hold a Node that represents a 'species' rank, as well as a 'genus' Node, and a 'family' node, and on and on. That's not a 'species', that's a taxonomy. So just call it a Taxonomy. The object itself (Bio::Species) never truly represented a 'species' anyway, biologically speaking, every time it held sequence data. It could be a subspecies, strain, plasmid, unknown, or an unclassified rank ('no rank') or environmental sample. It really held a fancier representation of a node, as based on the TaxID. My final point is, saying "a species is a taxonomy" to the rest of the biological world doesn't make sense. Maybe it makes sense to you and I and Sendu, in our little Bioperl world. But to the thousands of users out there who don't completely grok the Bioperl class structure, it's just confusing. If I were to get an object back that was labeled Bio::Species, as a biologist I would expect it to be part of a taxonomy, not the actual Taxonomy itself. 
So, why not cut to the chase: if we are to fundamentally change the concept of what Bio::Species is by making it a Taxonomy/TaxonomyI or whatever, why not just use a Taxonomy object altogether and not bother with Bio::Species at all? Deprecate it. BTW, I'll be in Connecticut for five days at UConn. So I hope to escape the heat for a bit. Thanks for listening to my side of things. Chris > -----Original Message----- > From: Hilmar Lapp [mailto:hlapp at gmx.net] > Sent: Tuesday, July 25, 2006 8:54 AM > To: Chris Fields > Cc: Sendu Bala; bioperl-l at bioperl.org > Subject: Re: [Bioperl-l] Bio::*Taxonomy* changes > > We intend on having everyone who wants correct taxonomy parsing > results for the entire kingdom of life to define his/her > authoritative taxonomy database, be it local or not, be it HTTP or > SQL queried. > > If you don't care about the correctness of the taxonomy parse, or if > the taxonomy information in the flat file is trivially parseable > because it conforms to standard binomial convention, then whatever is > to be put in place needs to work fine regardless of whether a > taxonomy database is defined or not. > > -hilmar > > On Jul 25, 2006, at 1:53 AM, Chris Fields wrote: > > > So do we intend on having everyone who installs bioperl have a local > > copy of the taxonomy dumpfile? Or perform a remote lookup via > > Entrez? Seems a bit extreme. > > > > I would like the option of not having the lookup run; as I mentioned > > to Sendu, one of the biggest complaints about bioperl is speed. > > Additional lookups won't help on that end. > > > > Chris > > > > On Jul 24, 2006, at 10:31 PM, Hilmar Lapp wrote: > > > >> > >> On Jul 24, 2006, at 10:29 PM, Chris Fields wrote: > >> > >>> [...] > >>> We could go back and forth on what Jason really intended. [...] The > >>> reality is he's not here and you're willing to do the job. > >> > >> Right. 
And, knowing Jason, I think he'd be perfectly fine with seeing > >> his original idea develop in a possibly different direction, provided > >> it will all work nicely in the end. I'm willing to take the beating > >> on me if that doesn't turn out to be true ... > >> > >>> > >>> There is one thing I will make perfectly clear here: there should > >>> never, ever be enforced lookups for SeqIO (even using caches), > >> > >> You certainly don't want taxonomy lookups during the parsing stage, > >> and also not for the client requesting properties of the species that > >> have been parsed with high confidence, i.e., genus and species for a > >> straightforward binomial like 'Homo sapiens'. > >> > >> Writing sequences, IMHO, doesn't have to be as fast. It may be better > >> to emit strict format a bit slower rather than sloppy format a bit > >> faster. > >> > >> Upon parsing, one idea could be for the flat file parser to set a > >> dirty bit in the parsed out species if the parsed text didn't follow > >> strict binomial conventions, hence the parser may have made a mistake > >> and if a client requests the information it is better to lookup the > >> correct values from a taxonomy database. I.e., you could try with a > >> strict regex first that would imply a high-confidence result. If that > >> fails you don't give up but mark the result as untrustworthy. > >> > >> > >>> [...] > >>> This would have been MUCH easier if all three of us could have gone > >>> to the local bar for a beer and discussed it. We should just take > >>> the time out to videoconference next time. > >> > >> You're not honestly suggesting that a videoconference is better than > >> having beer together? > >> > >> Enjoy your trip, and thanks for hanging in there in the discussion, I > >> appreciate it. 
> >> > >> -hilmar > >> -- > >> =========================================================== > >> : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > >> =========================================================== > >> > >> > >> > >> > >> > >> _______________________________________________ > >> Bioperl-l mailing list > >> Bioperl-l at lists.open-bio.org > >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > Christopher Fields > > Postdoctoral Researcher > > Lab of Dr. Robert Switzer > > Dept of Biochemistry > > University of Illinois Urbana-Champaign > > > > > > > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > From cjfields at uiuc.edu Tue Jul 25 15:36:40 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 25 Jul 2006 10:36:40 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: Message-ID: <003301c6b000$203cc560$15327e82@pyrimidine> > On Jul 25, 2006, at 3:05 AM, Sendu Bala wrote: > > > [...] > > ## the fully-manual way > > my $species = new Bio::Species; > > my $node = new Bio::Taxonomy::Node(-name => 'Saccharomyces > > cerevisiae', > > -rank => 'species', -object_id > > => 1, > > -parent_id => 2); > > If this is meant as an example for the use cases I enumerated, then > you wouldn't have the parent_id from a Genbank file. However, you > didn't have that before either, so no problem. > > > my $n2 = new Bio::Taxonomy::Node(-name => 'Saccharomyces', > > -object_id => 2, -parent_id => 3); > > # (no assumption that 'Saccharomyces' is the genus, so rank() > > undefined) > > I think in a confident parse you want to assign 'genus' if there's > little doubt, for example 'Saccharomyces cerevisiae'. 
Not sure > whether there are weird viri whose names look innocuous but in > reality the name doesn't follow binomial convention. > > > my $n3 = [etc] > > $species->add_node($node); > > $species->add_node($n2); > > I know why you are doing this, but seeing this people will hit a > mental snag. You should listen to Chris' refusal to see the sense in > this as an indication that many people down the road won't see the > sense either.

Thanks for pointing that out. I think there is only a small, fundamental difference in our views here. I'm trying to view this as an outsider would, a biologist not familiar with the Bioperl class structure. I understand what Sendu's trying to accomplish but it's really confusing to someone not familiar with what Bio::Species is.

Hilmar, you had pointed out several times that Bio::Species and Bio::Taxonomy shouldn't directly intermingle. My original thought for genbank.pm _read_GenBank_Species() was this, copied and pasted from my local genbank.pm. It's sort of extreme, but it passes tests just fine.

    sub _read_GenBank_Species {
        my( $self,$buffer) = @_;

        $_ = $$buffer;

        my @organelles = qw(plastid chloroplast mitochondrion);
        my( $source_data, $common_name, @class, $ns_name, $organelle,
            $source_flag, $sci_name, $abbr );
        while (defined($_) || defined($_ = $self->_readline())) {
            # de-HTMLify (links that may be encountered here don't contain
            # escaped '>', so a simple-minded approach suffices)
            s/<[^>]+>//g;
            if ( /^SOURCE\s+(.*)/o ) {
                $source_data = $1;
                $source_data =~ s/\.$//; # remove trailing dot
                # does it have a GenBank common name in parentheses?
                # (list assignment keeps the captured text, not the match flag)
                ($common_name) = $source_data =~ m{\((.*)\)}xms;
                # organelle? If we find additional odd ones,
                # add to @organelles
                # (match each organelle name against the SOURCE text and
                # keep the first name found)
                ($organelle) = grep { $source_data =~ /\Q$_\E/ } @organelles;
                $source_flag = 1;
            } elsif ( /^\s{2}ORGANISM\s+(.*)/o ) {
                $sci_name = $1;
                $source_flag = 0;
            } elsif ($source_flag) { # no ORGANISM
                $common_name .= $source_data;
                $common_name =~ s/\n//g;
                $common_name =~ s/\s+/ /g;
                $source_flag = 0;
            } elsif ( /^\s+(.+)/o ) { # lineage information
                my $line = $1;
                # only split on ';' or '.' so that classification
                # that is 2 words will still get matched, use
                # map() to remove trailing/leading spaces
                push(@class, map { s/^\s+//; s/\s+$//; $_; }
                     split /[;\.]+/, $line) if ( $line =~ /(;|\.)/ );
            } else { # reach end of GenBank tax info
                last;
            }
            $_ = undef; # Empty $_ to trigger read of next line
        }

        $$buffer = $_;

        @class = reverse @class;

        my $make = Bio::Taxonomy::Node->new();
        $make->common_name( $common_name ) if $common_name;
        $make->scientific_name($sci_name) if $sci_name;
        # could use SimpleValue objs here instead
        $make->classification( @class ) if @class;
        $make->organelle($organelle) if $organelle;
        return $make;
    }

    # back in next_seq...grab the TaxID from 'source'
    # seqfeature
    # could check organelle() here as well

    # add taxon_id from source if available
    if($species && ($feat->primary_tag eq 'source')
       && $feat->has_tag('db_xref')
       && (! $species->ncbi_taxid())) {
        foreach my $tagval ($feat->get_tag_values('db_xref')) {
            if(index($tagval,"taxon:") == 0) {
                $species->ncbi_taxid(substr($tagval,6));
                last;
            }
        }
    }

In other words, remove the extra parsing of genus() species() subspecies etc. All GenBank sequences have a node represented in NCBI's tax database (I checked it out). Even plasmids, unknowns, environmental samples.

Chris

> So instead, make the logical model in your design more obvious, which > I think ultimately will help maintainability as well.
For example: > > my $taxonomy = Bio::Taxonomy->new(); > my $node = new Bio::Taxonomy::Node(-name => 'Saccharomyces cerevisiae', > -rank => 'species', -object_id > => 1, > -parent_id => 2); > my $n2 = new Bio::Taxonomy::Node(-name => 'Saccharomyces', > -object_id => 2, -parent_id => 3); > $taxonomy->add_node($node); > $taxonomy->add_node($n2); > > my $species = Bio::Species->new(-lineage => $taxonomy); > print $species->binomial(); > print $species->genus(); > # this may trigger a lookup if a taxonomy db handle has been set, e.g.: > # $taxonomy->db_handle(Bio::DB::Taxonomy->new(-source => 'entrez')); > print $species->classification(); > > > > [etc] > > > > ## Using a factory without db access > > # assume that Bio::Taxonomy::GenbankFactory implements > > # some modified Bio::Taxonomy::FactoryI > > my $factory = Bio::Taxonomy::GenbankFactory->new(); > > my $species = $factory->generate(-classification => ['Saccharomyces > > cerevisiae', 'Saccharomyces', > > 'Saccharomycetaceae' ...]); > > # the generate() method above just does the fully-manual way for you > > Except the method name would be create_object(), the parameter would > be a hash ref, and the return value would be a Bio::TaxonomyI > compliant object: > > my $taxonomy = $factory->create_object({-classification => > ['Saccharomyces > cerevisiae', 'Saccharomyces', > 'Saccharomycetaceae' ...]}); > my $species = Bio::Species->new(-lineage => $taxonomy); > > > > > > ## Using a factory with db access > > # assume that Bio::Taxonomy::EntrezFactory implements some > > # modified Bio::Taxonomy::FactoryI and uses Bio::DB::Taxonomy::entrez > > # to get the nodes > > my $factory = Bio::Taxonomy::EntrezFactory->new(); > > The logic where to do a lookup on should not be duplicated here. It > only belongs under Bio::DB::Taxonomy::*. 
> > > my $species = $factory->fetch(-scientifc_name => 'Saccharomyces > > cerevisiae'); > > Likewise, use the methods defined in Bio::DB::Taxonomy, and again, > the return type is Bio::Taxonomy, which you would pass to > Bio::Species->new(). > > -hilmar > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From bix at sendu.me.uk Tue Jul 25 17:49:04 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Tue, 25 Jul 2006 18:49:04 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <003301c6b000$203cc560$15327e82@pyrimidine> References: <003301c6b000$203cc560$15327e82@pyrimidine> Message-ID: <44C65990.4080500@sendu.me.uk> Chris Fields wrote: > If I were to get an object back that was labeled Bio::Species, as a > biologist I would expect it to be part of a taxonomy, not the actual > Taxonomy itself. I think this is the most important sentence in the discussion. Ok, so it's clear to me that a better solution is needed than my Bio::Taxonomy-related proposal. Sorry for being so slow on the uptake. I also needed to start trying to code my Taxonomy proposal to see some issues with it. [... in another email...] > I'm trying to view this as an outsider would, > a biologist not familiar with the Bioperl class structure. Ok, let's come up with a proposal that makes sense to the biologist and better matches Jason's original idea. ---- long post follows; there's a summary at the end As a biologist when I consider a species I have the following primary questions. 
Let's see how we would answer them using a) Bio::Species and
genbank.pm as they are now, b) Bio::Species if it was a 'pure'
Bio::Taxonomy::Node with no cruft (or if we just dropped Bio::Species
and used Node directly), and Chris' updated genbank.pm. Let's say we got
our species information from a genbank file where the scientific name
and tax id are available to be parsed out.

# What is the species' name?
a) Not guaranteed to be correct.
b) Correct thanks to recent changes to Node, just use scientific_name()


# What is the lineage of this species?
a) I can get a classification array with classification(). It's a bit
rubbish though, I can't tell what any of the array elements are supposed
to be.
b) A pure Node wouldn't store the lineage on itself. There are two
obvious solutions: 1) add cruft to Node by giving it a classification()
method - works as well/bad as a). 2) call get_Lineage_Nodes(), which has
the benefit of telling me what rank each ancestor was, if that
information had been in the file (more likely, if Node was generated
from database). Problem: get_Lineage_Nodes() only works if it can
  $self->db_handle->get_Taxonomy_Node(-taxonid => $self->parent_id);
which obviously doesn't work if the nodes in our lineage didn't come
from a database, but from the parsing of a genbank flat file. As we
parse the genbank file we can certainly make nodes for each word in the
list:
  inside genbank.pm... @class = reverse @class;
  my @nodes; my $fake_id = 1;
  foreach my $sci_name (@class) {
    push(@nodes, new Bio::Taxonomy::Node(-name => $sci_name,
         -object_id => $fake_id++, -parent_id => $fake_id));
  }
But how do we keep these nodes and make them returnable later by
get_Lineage_Nodes? Perhaps:
  my $taxonomy = new Bio::Taxonomy;
  foreach my $node (@nodes) {
    $taxonomy->add_node($node);
  }
  ...
  my $make = Bio::Taxonomy::Node->new();
  ...
  $make->db_handle($taxonomy);
Bio::Taxonomy would have to implement get_Taxonomy_Node (it has get_node
which only accepts a rank).
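The node-per-classification-entry idea above can be sketched without BioPerl at all. The following is only an illustration under stated assumptions: LineageStore, add_classification(), get_node_by_id() and lineage_of() are hypothetical names, and plain hashes stand in for Bio::Taxonomy::Node objects. It shows that fake sequential ids are enough to walk from a node up through its ancestors when the lineage came from a flat file rather than a database.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a flat-file-backed taxonomy "database".
# Plain hashes are used instead of Bio::Taxonomy::Node so that the
# sketch runs without BioPerl installed.
package LineageStore;

sub new { return bless { nodes => {} }, shift }

# Takes a classification list ordered from most specific to most general
# (e.g. species, genus, family) and stores one node per name, using
# made-up sequential ids. Returns the id of the most specific node.
sub add_classification {
    my ($self, @class) = @_;
    my $id = 1;
    for my $i (0 .. $#class) {
        $self->{nodes}{$id} = {
            name      => $class[$i],
            object_id => $id,
            # the last (most general) entry has no parent
            parent_id => $i < $#class ? $id + 1 : undef,
        };
        $id++;
    }
    return 1;
}

# Lookup by id, playing the role get_Taxonomy_Node() plays in the text.
sub get_node_by_id {
    my ($self, $id) = @_;
    return defined $id ? $self->{nodes}{$id} : undef;
}

# Walk parent_id links upward and return the ancestors of the given node.
sub lineage_of {
    my ($self, $id) = @_;
    my @ancestors;
    my $node = $self->get_node_by_id($id);
    while ($node and defined $node->{parent_id}) {
        $node = $self->get_node_by_id($node->{parent_id});
        push @ancestors, $node if $node;
    }
    return @ancestors;
}

package main;

my $store   = LineageStore->new;
my $leaf_id = $store->add_classification(
    'Saccharomyces cerevisiae', 'Saccharomyces', 'Saccharomycetaceae');
print join(' > ', map { $_->{name} } $store->lineage_of($leaf_id)), "\n";
```

Note the limitation: the ids restart at 1 on every call, so this sketch holds exactly one lineage. Keeping ids unique and consistent across several lineages is the harder part and is not attempted here.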
Of course this is ugly, storing a Taxonomy
in our database handle. We could have a new Bio::DB::Taxonomy:: class
instead, that treated a classification array like a database? It could
have the added bonus of building up an entire database internally as
more input arrays are given to it, able to therefore give each node a
unique but consistent id. It would break if one time you gave it qw(Homo
Primates) and another time qw(Homo Hominidae Primates), however. Ideas?


# What if I don't want the whole lineage, just to know what a specific
rank like genus is for my species?
a) use genus(), but not guaranteed to be correct.
b) two solutions: 1) add cruft to Node by adding a genus() method: as
good/bad as a). 2) use get_Lineage_Nodes() or get_Parent_Node() until
you find a node with your rank() of interest. Same problems as for
lineage question, but also it would be nicer to have a
get_node('rank_name') style method. But such a method belongs in
something like Bio::Taxonomy, not Node. At the very least a method like
genus() would be implemented using pure Node methods like
get_Parent_Node(), returning undefined if no parent had a rank() of
'genus', never guessing it.


# Is this species the same as another species?
a) Not guaranteed to be correct. (no unique id so forced to compare names)
b) Correct answer by using object_id() method, along with Chris' change
to genbank.pm.


# What is the most recent common ancestor of this species and another?
a) Can't be answered.
b) Use get_LCA_Node(), but same issues as the lineage question, since
get_LCA_Node requires a working get_Lineage_Nodes(). It also requires
correct (unique) ids for all nodes in all lineages to give the
guaranteed correct answer. But at least you /might/ get the correct
answer even using only the data in genbank files and no db lookup.


---- summary:

It seems like the main problem with Node right now is that it has
classification() and things like genus().
I propose pure Node method
solutions to answer the questions classification() and genus() were
implemented to answer, but in a better, cruft-free way.

Bio::DB::Taxonomy::genbank anyone?

Then if you started with a Species/Node generated by a genbank parse,
and wanted certain questions answered correctly, you only have to set a
different db_handle(). The Node only stores the static and hopefully
correct information about itself, whilst all other questions go via
db_handle, so you can dynamically swap back and forth between databases
depending on if you need speed or accuracy.

From cjfields at uiuc.edu Tue Jul 25 18:24:12 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Tue, 25 Jul 2006 13:24:12 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44C65990.4080500@sendu.me.uk>
Message-ID: <000001c6b017$873176a0$15327e82@pyrimidine>

Sendu, you'll have to make the changes how you see fit. You see my point now, which is great.

From my perspective, all the object type (used to contain taxonomy file information) needs to contain is the scientific name and common names like the SOURCE line abbreviated name and the actual GenBank common name, if present. All the other cruft (i.e. genus/species/subspecies) can be excised, and the proper taxonomic information, if wanted, could be accessed via the object and its TaxID. Organelle and lineage information need to be retained (for the non-taxonomists) and could be stored in that object, bumped to SimpleValue objects, or just set (alternatively, since the data is small) using a get/set value within the sequence object itself. This would be the bare-bones approach, which Node can fulfill.

I also like Hilmar's proposal about including optional lookups, which greatly increases the flexibility when screening sequences. This will likely require a more complicated object structure (i.e. taxonomy with nodes). You suggested a Taxonomy-like object which would work; but don't force Bio::Species into the mix.
Why not just use a simple Bio::Taxonomy object for that (Hilmar's point)? When one asks for $species->species, they'll get a Node or Taxonomy, whichever is used (that's up to you). The Node represents a more bare-bones variation, while the Taxonomy object scheme would be more fully realized. Either way will work for me. Just don't call it 'species'. ;>

Once this is all done, will we really have a need for Bio::Species? That's my other point. The only real use for it was as a container object for sequence data. That job is now done via a Taxonomy/Node object. The only real use it would have is as a container for taxonomic information for species ranks or below. I think Node/Taxonomy can handle even that though, so now it's also redundant. If a class is not useful and is redundant, maybe it should be deprecated.

Anyway, I can't get involved anymore at this point; I'm too busy with getting ready for the Kadner Institute next week. Good luck!

Chris

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of Sendu Bala
> Sent: Tuesday, July 25, 2006 12:49 PM
> To: bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] Bio::*Taxonomy* changes
>
> Chris Fields wrote:
> > If I were to get an object back that was labeled Bio::Species, as a
> > biologist I would expect it to be part of a taxonomy, not the actual
> > Taxonomy itself.
>
> I think this is the most important sentence in the discussion. Ok, so
> it's clear to me that a better solution is needed than my
> Bio::Taxonomy-related proposal. Sorry for being so slow on the uptake. I
> also needed to start trying to code my Taxonomy proposal to see some
> issues with it.
>
>
> [... in another email...]
> > I'm trying to view this as an outsider would,
> > a biologist not familiar with the Bioperl class structure.
>
> Ok, let's come up with a proposal that makes sense to the biologist and
> better matches Jason's original idea.
> > ---- long post follows; there's a summary at the end > > As a biologist when I consider a species I have the following primary > questions. Let's see how we would answer them using a) Bio::Species and > genbank.pm as they are now, b) Bio::Species if it was a 'pure' > Bio::Taxonomy::Node with no cruft (or if we just dropped Bio::Species > and used Node directly), and Chris' updated genbank.pm. Let's say we got > our species information from a genbank file where the scientific name > and tax id are available to be parsed out. > > # What is the species' name? > a) Not guaranteed to be correct. > b) Correct thanks to recent changes to Node, just use scientific_name() > > > # What is the lineage of this species? > a) I can get a classification array with classification(). It's a bit > rubbish though, I can't tell what any of the array elements are supposed > to be. > b) A pure Node wouldn't store the lineage on itself. There are two > obvious solutions: 1) add cruft to Node by giving it a classification() > method - works as well/bad as a). 2) call get_Lineage_Nodes(), which has > the benefit of telling me what rank each ancestor was, if that > information had been in the file (more likely, if Node was generated > from database). Problem: get_Lineage_Nodes() only works if it can > $self->db_handle->get_Taxonomy_Node(-taxonid => $self->parent_id); > which obviously doesn't work if the nodes in our lineage didn't come > from a database, but from the parsing of a genbank flat file. As we > parse the genbank file we can certainly make nodes for each word in the > list: > inside genbank.pm... @class = reverse @class; > my @nodes; my $fake_id = 1; > foreach my $sci_name (@class) { > push(@nodes, new Bio::Taxonomy::Node(-name => $sci_name, object_id => > $fake_id++, parent_id => $fake_id); > } > But how do we keep these nodes and make them returnable later by > get_Lineage_Nodes? 
Perhaps: > my $taxonomy = new Bio::Taxonomy; > foreach my $node (@nodes) { > $taxonomy->add_node($node); > } > ... > my $make = Bio::Taxonomy::Node->new(); > ... > $make->db_handle($taxonomy); > Bio::Taxonomy would have to implement get_Taxonomy_Node (it has get_node > which only accepts a rank). Of course this is ugly, storing a Taxonomy > in our database handle. We could have a new Bio::DB::Taxonomy:: class > instead, that treated a classification array like a database? It could > have the added bonus of building up an entire database internally as > more input arrays are given to it, able to therefore give each node a > unique but consistent id. It would break if one time you gave it qw(Homo > Primates) and another time qw(Homo Hominidae Primates), however. Ideas? > > > # What if I don't want the whole lineage, just to know what a specific > rank like genus is for my species? > a) use genus(), but not guaranteed to be correct. > b) two solutions: 1) add cruft to Node by adding a genus() method: as > good/bad as a). 2) use get_Lineage_Nodes() or get_Parent_Node() until > you find a node with your rank() of interest. Same problems as for > lineage question, but also it would be nicer to have a > get_node('rank_name') style method. But such a method belongs in > something like Bio::Taxonomy, not Node. At the very least a method like > genus() would be implemented using pure Node methods like > get_Parent_Node(), returning undefined if no parent had a rank() of > 'genus', never guessing it. > > > # Is this species the same as another species? > a) Not guaranteed to be correct. (no unique id so forced to compare names) > b) Correct answer by using object_id() method, along with Chris' change > to genbank.pm. > > > # What is the most recent common ancestor of this species and another? > a) Can't be answered. > b) Use get_LCA_Node(), but same issues as the lineage question, since > get_LCA_Node requires a working get_Lineage_Nodes(). 
It also requires > correct (unique) ids for all nodes in all lineages to give the > guaranteed correct answer. But at least you /might/ get the correct > answer even using only the data in genbank files and no db lookup. > > > ---- summary: > > It seems like the main problem with Node right now is that it has > classification() and things like genus(). I propose pure Node method > solutions to answer the questions classification() and genus() were > implemented to answer, but in a better, cruft-free way. > > Bio::DB::Taxonomy::genbank anyone? > > Then if you started with a Species/Node generated by a genbank parse, > and wanted certain questions answered correctly, you only have to set a > different db_handle(). The Node only stores the static and hopefully > correct information about itself, whilst all other questions go via > db_handle, so you can dynamically swap back and forth between databases > depending on if you need speed or accuracy. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From hlapp at gmx.net Tue Jul 25 19:18:00 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Tue, 25 Jul 2006 15:18:00 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <000001c6b017$873176a0$15327e82@pyrimidine> References: <000001c6b017$873176a0$15327e82@pyrimidine> Message-ID: <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net> On Jul 25, 2006, at 2:24 PM, Chris Fields wrote: > Once this is all done, will we really have a need for Bio::Species? No, except for backwards compatibility. Phasing it out will go over a couple of releases. E.g., v1.6.x could have deprecation warning in the documentation. v1.7+ would have deprecation warnings in the code written to stderr. Just as an aside, we can't just drastically change the return type of a method. 
Instead, if at all possible, there should be a new method so that the old
can be phased out over time but otherwise not changed. I.e., don't change
$seq->species() to now all of a sudden return a node or taxonomic lineage,
even if initially Bio::Species is returned with some magic under the hood.
Instead, create something like

# return a Bio::Taxonomy::Node:
my $taxon = $seq->taxon();

# alternative approach: return a lineage (taxonomy)
# this would be Bio::TaxonomyI compliant
my $lineage = $seq->lineage();

The former would require the lineage (and organelle for completeness)
information to be either easily (though not necessarily directly)
accessible through the node, or added as annotation.

-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From cjfields at uiuc.edu Tue Jul 25 19:30:40 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Tue, 25 Jul 2006 14:30:40 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net>
Message-ID: <000101c6b020$d09bc7b0$15327e82@pyrimidine>

Sounds good to me. I'm fine with any way that it's worked out, either
Taxonomy or Node-based, as long as there is no Bio::Species-based confusion
re: Taxonomy, and that this eventually leads to getting rid of Bio::Species
altogether.

Have fun, guys!

(hey, probably the shortest response I have written)...

Chris

> -----Original Message-----
> From: Hilmar Lapp [mailto:hlapp at gmx.net]
> Sent: Tuesday, July 25, 2006 2:18 PM
> To: Chris Fields
> Cc: 'Sendu Bala'; bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] Bio::*Taxonomy* changes
>
>
> On Jul 25, 2006, at 2:24 PM, Chris Fields wrote:
>
> > Once this is all done, will we really have a need for Bio::Species?
>
> No, except for backwards compatibility. Phasing it out will go over a
> couple of releases.
E.g., v1.6.x could have deprecation warning in > the documentation. v1.7+ would have deprecation warnings in the code > written to stderr. > > Just as an aside, we can't just drastically change the return type of > a method. Instead, if at all possible, there should be a new method > so that the old can be phased out over time but otherwise not > changed. I.e., don't change $seq->species() to now all of a sudden > return a node or taxonomic lineage, even if initially Bio::Species is > returned with some magic under the hood. Instead, create something like > > # return a Bio::Taxonomy::Node: > my $taxon = $seq->taxon(); > > # alternative approach: return a lineage (taxonomy) > # this would be Bio::TaxonomyI compliant > my $lineage = $seq->lineage(); > > The former would require the lineage (and organelle for completeness) > information to be either easily (though not necessarily directly) > accessible through the node, or added as annotation. > > -hilmar > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > From cjfields at uiuc.edu Wed Jul 26 02:16:36 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 25 Jul 2006 21:16:36 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C65990.4080500@sendu.me.uk> References: <003301c6b000$203cc560$15327e82@pyrimidine> <44C65990.4080500@sendu.me.uk> Message-ID: <1748BD60-EFBE-4872-94A4-7CA3D58BE458@uiuc.edu> One last thing before I shut off bioperl for a week and concentrate on Connecticut; On Jul 25, 2006, at 12:49 PM, Sendu Bala wrote: > Chris Fields wrote: >> If I were to get an object back that was labeled Bio::Species, as a >> biologist I would expect it to be part of a taxonomy, not the actual >> Taxonomy itself. > > I think this is the most important sentence in the discussion. 
Ok, so
> it's clear to me that a better solution is needed than my
> Bio::Taxonomy-related proposal. Sorry for being so slow on the uptake. I
> also needed to start trying to code my Taxonomy proposal to see some
> issues with it.

...

Again, thanks for noticing that.

> ---- summary:
>
> It seems like the main problem with Node right now is that it has
> classification() and things like genus(). I propose pure Node method
> solutions to answer the questions classification() and genus() were
> implemented to answer, but in a better, cruft-free way.
>
> Bio::DB::Taxonomy::genbank anyone?

Ach... You're compromising here; that's not like you.

I think you're making this too complicated by trying too many things at
once. Don't think of sudden dramatic changes in the API. Sneak changes in in
a way that doesn't scare users away, then let them get used to the new way
of grabbing Tax data. Make your point that it's more accurate to do it this
way (you'll have defenders in Hilmar and me, BTW).

Do this (start with genbank.pm):

1) Switch out Bio::Species with Node or Taxonomy; relocate other
   information temporarily (Bio::Species, get/sets in Seq object,
   SimpleValue). Leave Bio::Species in for the time being, but don't
   bother making any additional changes to it.
2) Make sure next_seq() and write_seq() work and pass tests. Add
   additional tests for the Tax/Node object (you could even use the tax
   dump data you recently added for more complicated tests).
3) Add in additional stuff bit by bit until it is where you would like it.
4) Make sure parsing is kosher with the latest release notes. Probably
   should make sure write_seq follows what the release notes state to some
   degree.

And, really, you won't break anything with genbank.pm organelle() parsing.
If you look at the module the organelle isn't even touched in next_seq() or
_read_GenBank_Species(), so it was broken to begin with!

My proposal, though extreme, was to remove genus() etc (which you wanted as
well with Node).
You could leave this cruft for the time being in Bio::Species, which could
still act as a sequence tax info holder object. It just won't be the
>default< Seq tax information object, which would be Bio::Taxonomy or Node.
Hence Hilmar's suggestion to use a $seq->taxon() method to return a
Node/Taxonomy, and a $seq->species() would still return a Bio::Species
object. It's redundant, but only for the time being, and the redundant
information wouldn't have a major memory footprint anyway (not like the
feature table or the full sequence might).

Any information that isn't stored in whatever Tax object you use (i.e.
lineage or organelle) could be stored temporarily in another fashion, such
as a get/set in Seq or SimpleValue object, to make next_seq/write_seq work
(such as $seq->organelle() or $seq->classification(), instead of
$seq->species->organelle and so on).

Hilmar then suggests that, around the 1.6-ish release, we note the changes
made to SeqIO towards Bio::Taxonomy-based objects, and indicate that
Bio::Species via species() and its associated methods will be deprecated
around 1.7 (gives everybody notice on API issues). Then add warnings to
Bio::Species in 1.7 noting the deprecation, then remove from core completely
in 1.8 - 2.0.

One last thing, which is minor really: I remember seeing something about
having Nodes with 'no rank' ignored unless a flag is used. That may be bad
news for some organisms in sequence files where the TaxID is for a 'no rank'
rank, such as environmental samples. May want to think about that here.

I'm hoping the releases will start popping out a bit more periodically than
they have been. There have been volunteers to release periodic updates for
bug fixes etc. If I get a chance I'll try keeping up. Don't count on it
though. The conference is 7am-9pm most days, for five days straight!
Chris > > Then if you started with a Species/Node generated by a genbank parse, > and wanted certain questions answered correctly, you only have to > set a > different db_handle(). The Node only stores the static and hopefully > correct information about itself, whilst all other questions go via > db_handle, so you can dynamically swap back and forth between > databases > depending on if you need speed or accuracy. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From vrramnar at student.cs.uwaterloo.ca Wed Jul 26 02:44:17 2006 From: vrramnar at student.cs.uwaterloo.ca (vrramnar at student.cs.uwaterloo.ca) Date: Tue, 25 Jul 2006 22:44:17 -0400 Subject: [Bioperl-l] SNP reference file download In-Reply-To: <775C8983-D81E-474C-B4F5-C765ABEF4A2C@uiuc.edu> References: <000001c6b01f$bfd54e20$15327e82@pyrimidine> <1153868024.44c6a0f83fce6@www.nexusmail.uwaterloo.ca> <775C8983-D81E-474C-B4F5-C765ABEF4A2C@uiuc.edu> Message-ID: <1153881857.44c6d7016775f@www.nexusmail.uwaterloo.ca> Hey Chris, I believe I updated all those modules already as I downloaded the entire DB.tar from Bioperl live. 
Here is my code: #!/usr/bin/perl -w use Bio::Perl; use Bio::DB::EUtilities; my @ids = qw(rs4986950); # With the "rs" before the number the warning says: "no returned links" # Without the "rs" before the number the warning says: "No databases returned; empty linkset" my $elink = Bio::DB::EUtilities->new( -eutil => 'elink', -id => \@ids, -db => 'omim', -dbfrom => 'snp'); $elink->get_response; print "IDs: ", join q(,), $elink->get_ids; Which gives the following error: -------------------- WARNING --------------------- MSG: No databases returned; empty linkset --------------------------------------------------- ------------- EXCEPTION ------------- MSG: Must use database to access IDs STACK Bio::DB::EUtilities::ElinkData::get_LinkIds_by_db /Library/Perl/5.8.6/Bio/ DB/EUtilities/ElinkData.pm:201 STACK Bio::DB::EUtilities::get_ids /Library/Perl/5.8.6/Bio/DB/EUtilities.pm:482 STACK toplevel getOmimNum:13 -------------------------------------- All I really want is the OMIM id number under the section: NCBI Resource Links from the page: http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?rs=1800562 Any idea why this still isn't working?? Rohan Quoting Chris Fields : > Odd, I thought XML::Simple was part of the 5.8 core. Guess I was > wrong. I plan on changing this to a more robust parser soon (likely > XML::SAX or XML::Twig, which will also require a download). > > That warning occurs when if you don't have a link to OMIM present (No > databases returned; empty linkset). The way Elink works is it stores > internal data in a separate object (ELinkData) contained in an > internal cache. The method get_ids() works for all EUtilities to > retrieve IDs, even from ELink objects. The unique problem with ELink > is, since you can search multiple databases. you can retrieve > multiple sets of IDs. > > If you haven't done it, update your EUtilities; the problem is > similar to one I fixed today (I stated something about updating in my > last post). 
Also, update the main Bio::DB::EUtilities and > Bio::GenericWebDBI as well (the last is the base class from which > EUtilities is based). The 'Count:1' was a debugging statement I > forgot to remove a while ago which I changed in CVS yesterday. It's > possible that commit had other changes which I forgot about. > > Sorry about that, but it is still experimental (emphasis on the > 'mental'). > > Chris > > On Jul 25, 2006, at 5:53 PM, vrramnar at student.cs.uwaterloo.ca wrote: > > > > > Hey Chris, > > > > Ignore the last email, I fixed that problem and downloaded/ > > installed the > > required XML modules. > > > > However, I am now getting this error message: > > > > -------------------- WARNING --------------------- > > MSG: No databases returned; empty linkset > > --------------------------------------------------- > > Count: 1 > > > > ------------- EXCEPTION ------------- > > MSG: Must use database to access IDs > > STACK Bio::DB::EUtilities::ElinkData::get_LinkIds_by_db /Library/ > > Perl/5.8.6/Bio/ > > DB/EUtilities/ElinkData.pm:201 > > STACK Bio::DB::EUtilities::get_ids /Library/Perl/5.8.6/Bio/DB/ > > EUtilities.pm:483 > > STACK toplevel getOmimNum:15 > > > > -------------------------------------- > > > > What does this mean?? > > > > Rohan > > > > Quoting Chris Fields : > > > >> Okay, had to fix an odd bug from ELink due to the way NCBI returns > >> data. > >> > >> You'll need to update the EUtilities modules in bioperl from CVS > >> to make > >> sure this works. 
> >> > >> This is how it's done: > ---------------------------------------- This mail sent through www.mywaterloo.ca From cjfields at uiuc.edu Wed Jul 26 05:01:41 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Wed, 26 Jul 2006 00:01:41 -0500 Subject: [Bioperl-l] SNP reference file download In-Reply-To: <1153881857.44c6d7016775f@www.nexusmail.uwaterloo.ca> References: <000001c6b01f$bfd54e20$15327e82@pyrimidine> <1153868024.44c6a0f83fce6@www.nexusmail.uwaterloo.ca> <775C8983-D81E-474C-B4F5-C765ABEF4A2C@uiuc.edu> <1153881857.44c6d7016775f@www.nexusmail.uwaterloo.ca> Message-ID: The below ID doesn't have any OMIM linked data, hence the warning. The problem is that NCBI, when it doesn't find a link, doesn't send something constructive to tell you that. It sends the original ID encoded in XML, but no actual DB's or ID data links. That's what the warning means. I'll make the original warning a bit more direct: No databases returned; no IDs found. The thrown error is from a logic problem; I have fixed it and committed to CVS. Here's the web page: no OMIM data there either... http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?rs=4986950 Try changing your ID list to this: my @ids = qw(4986950 1800562); You should get back only one ID (only one has an OMIM number). By the way, the SNP data ID is only the digits (don't include the 'rs' designation). Chris On Jul 25, 2006, at 9:44 PM, vrramnar at student.cs.uwaterloo.ca wrote: > > Hey Chris, > > I believe I updated all those modules already as I downloaded the > entire DB.tar > from Bioperl live. 
Here is my code: > > #!/usr/bin/perl -w > > use Bio::Perl; > use Bio::DB::EUtilities; > > my @ids = qw(rs4986950); > # With the "rs" before the number the warning says: "no returned > links" > # Without the "rs" before the number the warning says: "No > databases returned; > empty linkset" > > > my $elink = Bio::DB::EUtilities->new( -eutil => 'elink', > -id => \@ids, > -db => 'omim', > -dbfrom => 'snp'); > $elink->get_response; > print "IDs: ", join q(,), $elink->get_ids; > > Which gives the following error: > > -------------------- WARNING --------------------- > MSG: No databases returned; empty linkset > --------------------------------------------------- > > ------------- EXCEPTION ------------- > MSG: Must use database to access IDs > STACK Bio::DB::EUtilities::ElinkData::get_LinkIds_by_db /Library/ > Perl/5.8.6/Bio/ > DB/EUtilities/ElinkData.pm:201 > STACK Bio::DB::EUtilities::get_ids /Library/Perl/5.8.6/Bio/DB/ > EUtilities.pm:482 > STACK toplevel getOmimNum:13 > > -------------------------------------- > > All I really want is the OMIM id number under the section: NCBI > Resource Links > from the page: > http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?rs=1800562 > > Any idea why this still isn't working?? > > Rohan > > > Quoting Chris Fields : > >> Odd, I thought XML::Simple was part of the 5.8 core. Guess I was >> wrong. I plan on changing this to a more robust parser soon (likely >> XML::SAX or XML::Twig, which will also require a download). >> >> That warning occurs when if you don't have a link to OMIM present (No >> databases returned; empty linkset). The way Elink works is it stores >> internal data in a separate object (ELinkData) contained in an >> internal cache. The method get_ids() works for all EUtilities to >> retrieve IDs, even from ELink objects. The unique problem with ELink >> is, since you can search multiple databases. you can retrieve >> multiple sets of IDs. 
>> >> If you haven't done it, update your EUtilities; the problem is >> similar to one I fixed today (I stated something about updating in my >> last post). Also, update the main Bio::DB::EUtilities and >> Bio::GenericWebDBI as well (the last is the base class from which >> EUtilities is based). The 'Count:1' was a debugging statement I >> forgot to remove a while ago which I changed in CVS yesterday. It's >> possible that commit had other changes which I forgot about. >> >> Sorry about that, but it is still experimental (emphasis on the >> 'mental'). >> >> Chris >> >> On Jul 25, 2006, at 5:53 PM, vrramnar at student.cs.uwaterloo.ca wrote: >> >>> >>> Hey Chris, >>> >>> Ignore the last email, I fixed that problem and downloaded/ >>> installed the >>> required XML modules. >>> >>> However, I am now getting this error message: >>> >>> -------------------- WARNING --------------------- >>> MSG: No databases returned; empty linkset >>> --------------------------------------------------- >>> Count: 1 >>> >>> ------------- EXCEPTION ------------- >>> MSG: Must use database to access IDs >>> STACK Bio::DB::EUtilities::ElinkData::get_LinkIds_by_db /Library/ >>> Perl/5.8.6/Bio/ >>> DB/EUtilities/ElinkData.pm:201 >>> STACK Bio::DB::EUtilities::get_ids /Library/Perl/5.8.6/Bio/DB/ >>> EUtilities.pm:483 >>> STACK toplevel getOmimNum:15 >>> >>> -------------------------------------- >>> >>> What does this mean?? >>> >>> Rohan >>> >>> Quoting Chris Fields : >>> >>>> Okay, had to fix an odd bug from ELink due to the way NCBI returns >>>> data. >>>> >>>> You'll need to update the EUtilities modules in bioperl from CVS >>>> to make >>>> sure this works. >>>> >>>> This is how it's done: >> > > > > > ---------------------------------------- > This mail sent through www.mywaterloo.ca Christopher Fields Postdoctoral Researcher Lab of Dr. 
Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From bix at sendu.me.uk Wed Jul 26 09:19:29 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Wed, 26 Jul 2006 10:19:29 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <1748BD60-EFBE-4872-94A4-7CA3D58BE458@uiuc.edu> References: <003301c6b000$203cc560$15327e82@pyrimidine> <44C65990.4080500@sendu.me.uk> <1748BD60-EFBE-4872-94A4-7CA3D58BE458@uiuc.edu> Message-ID: <44C733A1.9070201@sendu.me.uk> Chris Fields wrote: > >> It seems like the main problem with Node right now is that it has >> classification() and things like genus(). I propose pure Node method >> solutions to answer the questions classification() and genus() were >> implemented to answer, but in a better, cruft-free way. >> >> Bio::DB::Taxonomy::genbank anyone? > > Ach... You're compromising here; No, I don't think so. Let me explain... (another very long email, but with the same conclusion as above) > 1) Switch out Bio::Species with Node or Taxonomy; relocate other > information temporarily (Bio::Species, get/sets in Seq object, > SimpleValue). Leave Bio::Species in for the time being, but don't > bother making any additional changes to it. [...] > Hence Hilmar's suggestion to use a $seq->taxon() method to return a > Node/Taxonomy, and a $seq->species() would still return a > Bio::Species object. It's redundant, As I see it, the problem to be solved is this: a) A node should just be a node, holding only information about itself (but this can include information on who its parent is, and methods relating to getting its parents/children as new objects - but the data of its parents/children must never be stored on itself). b) Bio::Species isn't very good at its job; you can't ask reasonable taxonomic questions of it and get correct answers. c) We need to transition Bio::Species to something better - something that lets us do the same job as Bio::Species, but do it better. 
An important aspect of 'better' is that we can switch from the taxonomic information in a genbank file or similar to the information in a taxonomic database if we want certain taxonomic questions answered correctly. But also, we should be able to answer all questions with a good chance of a correct answer even without database access/installation. There are a variety of possible solutions. How can we decide which is best? What would a good solution be? The 'something better' we transition Bio::Species to will become the preferred (or at least de facto standard) way of dealing with taxonomic information in bioperl. This taxonomic module (or set of modules) must be able to model taxonomic information anywhere it is found - databases or genbank files or anything else. If it can't, it would be fundamentally flawed. d) We can immediately discount any solution that involves storing some taxonomic information outside of the tax module. If we find ourselves putting lineage data in a genbank file in SimpleValue objects or similar, we can be pretty sure we've used a poor solution to the problem. That would be a compromise. e) If the thing we transition Bio::Species to can't do everything Bio::Species did (doing it in a different and better way is fine of course), it's not suitable for transitioning to (this is why Node needed all the cruft added to it before it was a suitable candidate). If it /can/ do everything Bio::Species did, there would be no harm immediately making Bio::Species inherit from the new tax module, reimplementing Bio::Species as necessary but making no API change. So any solution that would /require/ $seq->taxon() and $seq->species() wouldn't be a good one, and would be a compromise. 
But we do want to get rid of Bio::Species eventually, so I'm not saying we
shouldn't have a $seq->taxon() or similar, only that either method would
give you the same type of object with the same methods available
($seq->taxon->isa('tax module') && ($seq->species->isa('Bio::Species') &&
$seq->species->isa('tax module'))).

I see 2 possible solutions to the problem. What should 'tax module' be?:

1) Bio::Taxonomy or other similar class that is a container of multiple
nodes. Naively this makes logical sense since one of the jobs
Bio::Species has is to store a lineage, and a lineage is best
represented as a set of Nodes. So let's have a single object with all
our Nodes in it. Problems:

Bio::Taxonomy itself, as currently written, is fundamentally flawed. It
requires that you know the ranks and order of ranks of all your input
nodes before you input them. It requires that all ranks have unique
names. It doesn't handle ranks of 'no rank'. You can't have more than
one lineage in an instance because you can't have two nodes with the
same rank. If you don't know the ranks of your nodes (i.e. genbank) there
is no way to maintain the order of your lineage because there is no
modelling of parent/child.
I had planned to re-write it such that the rank-centric implementation
was removed and we had a parent/child implementation instead. But then
there is nothing to stop you adding nodes that are disconnected from the
others, creating a broken mess.

Bio::Taxonomy::Tree might have been a little more suitable because it
implements Bio::Tree::TreeI, but sadly it is also rank-centric and
actually requires input of both Bio::Species and Bio::Taxonomy objects
to its most useful methods.

More important than issues with current implementations of
node-container classes, such classes are unable to let us solve problem
c) in a good way, and also leave us potentially storing in memory Node
objects representing the same taxonomic node multiple times in different
instances of the node-container.
For problem c) if we were to switch from genbank nodes to database the solution is to delete all the nodes in the container and then get them all again from the database. What if you didn't even have a lineage-related question? You've just retrieved 10s of nodes from the database for no reason (and then store them), when all you wanted was accurate information on the node you were interested in. All in all, it's pretty horrible. Unsuitable implementations plus excess database retrieval plus massive waste of memory with duplicated nodes does not equal a good solution. 2) Bio::Taxonomy::Node. First, solve problem a) by getting rid of methods binomial(), species(), genus(), sub_species(), variant(), organelle(), classification() and show_all(). Except for organelle() which doesn't belong in taxonomy, all of these Bio::Species 'questions' can still be answered by Node - just not in a single method call. I outlined how to answer them in the previous post. For backward compatibility make Bio::Species a Node and implement the suggested way of answering the questions the proper 'Node' way under those methods. Problems: Well, those questions can't actually be answered by Node if the starting point was genbank data or manually created Nodes. The solution is clean and simple: Bio::DB::Taxonomy::genbank or perhaps better named Bio::DB::Taxonomy::list (because it makes a taxonomy database from an ordered list of names - I don't see anything inherently wrong or ugly with that). Then everything magically just works. We get all the power to ask all our questions that Node has already when working with the ncbi database, but we get it when working with genbank data. We suffer none of the problems of a node-container class. We can easily switch databases on the fly. What's not to like? 
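
A minimal sketch of the list-backed database idea from the message above, assuming nothing about how BioPerl would actually implement it. ListTaxonomyDB and all of its methods are hypothetical names invented for illustration; they are not existing BioPerl classes. The sketch shows how an ordered list of names can be turned into parent/child nodes with consistent ids, so that lineage and common-ancestor questions can be answered without a remote database.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for the proposed Bio::DB::Taxonomy::list.
# Nothing here is a real BioPerl class or method.
package ListTaxonomyDB;

sub new {
    my ($class) = @_;
    return bless { next_id => 1, id_for => {}, node_for => {} }, $class;
}

# Register one lineage, ordered root first (reverse a GenBank-style
# classification array before passing it in). Returns the leaf's id.
sub add_lineage {
    my ($self, @names) = @_;
    my $parent_id;    # stays undef for the root
    for my $name (@names) {
        # A node is identified by its name plus its parent, so repeated
        # input of the same lineage always yields the same ids.
        my $key = join "\0", (defined $parent_id ? $parent_id : ''), $name;
        my $id  = $self->{id_for}{$key};
        if (!defined $id) {
            $id = $self->{next_id}++;
            $self->{id_for}{$key} = $id;
            $self->{node_for}{$id} =
                { id => $id, name => $name, parent_id => $parent_id };
        }
        $parent_id = $id;
    }
    return $parent_id;
}

sub get_node {
    my ($self, $id) = @_;
    return $self->{node_for}{$id};
}

# Ancestors of a node, root first, excluding the node itself.
sub get_lineage_nodes {
    my ($self, $id) = @_;
    my @lineage;
    my $node = $self->get_node($id);
    while ($node && defined $node->{parent_id}) {
        $node = $self->get_node($node->{parent_id});
        unshift @lineage, $node;
    }
    return @lineage;
}

# Deepest node shared by both root-to-leaf paths, or undef if none.
sub get_lca_node {
    my ($self, $id_a, $id_b) = @_;
    my @path_a = ($self->get_lineage_nodes($id_a), $self->get_node($id_a));
    my @path_b = ($self->get_lineage_nodes($id_b), $self->get_node($id_b));
    my $lca;
    while (@path_a && @path_b && $path_a[0]{id} == $path_b[0]{id}) {
        $lca = shift @path_a;
        shift @path_b;
    }
    return $lca;
}

package main;

my $db    = ListTaxonomyDB->new;
my $human = $db->add_lineage(
    qw(Eukaryota Metazoa Primates Hominidae Homo), 'Homo sapiens');
my $chimp = $db->add_lineage(
    qw(Eukaryota Metazoa Primates Hominidae Pan), 'Pan troglodytes');

my $human_lineage = join ' > ', map { $_->{name} } $db->get_lineage_nodes($human);
my $lca           = $db->get_lca_node($human, $chimp);

print "$human_lineage\n";
print "LCA: $lca->{name}\n";
```

Because a node is keyed on its name plus its parent, this sketch has the same weakness raised earlier in the thread: feeding it one lineage with Hominidae and another without it would give Homo two different ids.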
From bix at sendu.me.uk Wed Jul 26 10:00:01 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Wed, 26 Jul 2006 11:00:01 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net> References: <000001c6b017$873176a0$15327e82@pyrimidine> <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net> Message-ID: <44C73D21.3010301@sendu.me.uk> Hilmar Lapp wrote: > Instead, create something like > > # return a Bio::Taxonomy::Node: > my $taxon = $seq->taxon(); Yes, but $seq->species() would also > # alternative approach: return a lineage (taxonomy) > # this would be Bio::TaxonomyI compliant > my $lineage = $seq->lineage(); I've since come to the conclusion that anything Taxonomy-ish would be inappropriate - see recent post. > The former would require the lineage (and organelle for completeness) > information to be either easily (though not necessarily directly) > accessible through the node, or added as annotation. That specifically is the main problem with Node as it is now. You shouldn't store information about the lineage (essentially information about other nodes) on the node object itself. Storing it as annotation on the Node or elsewhere is terrible: you lose all the power of Node and can no longer ask any lineage-related questions. There is no need for this split in functionality - when you don't have database access and just some genbank files, you can't answer any taxonomic questions involving lineage, vs. when you do have database access suddenly you can start doing useful things. My proposed solution is that bioperl's taxonomy model always lets you answer the same questions regardless of your source for taxonomic information - see recent post. 
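
The point above, that a node should keep only its own data and route every lineage question through a swappable database handle, could be sketched roughly as follows. HashTaxDB, SketchNode and name_at_rank() are hypothetical names made up for this illustration, not BioPerl API, and the ids and ranks are example data only. The rank lookup walks up the parents and returns undef rather than guessing when no ancestor carries the requested rank.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical in-memory "database": id => { name, rank, parent_id }.
package HashTaxDB;

sub new {
    my ($class, %nodes) = @_;
    return bless { nodes => {%nodes} }, $class;
}

sub get_node_data {
    my ($self, $id) = @_;
    return $self->{nodes}{$id};
}

# Hypothetical node: stores only its own id and a database handle.
package SketchNode;

sub new {
    my ($class, %args) = @_;
    return bless { id => $args{id}, db_handle => $args{db_handle} }, $class;
}

# Combined getter/setter, so the data source can be swapped on the fly.
sub db_handle {
    my $self = shift;
    $self->{db_handle} = shift if @_;
    return $self->{db_handle};
}

# Walk from this node up through its parents and return the name of the
# first one with the requested rank. Returns undef if none is found.
sub name_at_rank {
    my ($self, $rank) = @_;
    my $db = $self->db_handle or return undef;
    my $data = $db->get_node_data($self->{id});
    my %seen;    # guards against circular parent links
    while ($data) {
        return $data->{name}
            if defined $data->{rank} && $data->{rank} eq $rank;
        my $parent_id = $data->{parent_id};
        last if !defined $parent_id || $seen{$parent_id}++;
        $data = $db->get_node_data($parent_id);
    }
    return undef;
}

package main;

# Data as it might look after parsing a flat file: the species id is
# known, but the ancestor has a made-up id and no rank.
my $flatfile_db = HashTaxDB->new(
    9606  => { name => 'Homo sapiens', rank => 'species', parent_id => 'fake1' },
    fake1 => { name => 'Homo',         rank => undef,     parent_id => undef },
);

# Data as it might look in a curated taxonomy database.
my $curated_db = HashTaxDB->new(
    9606 => { name => 'Homo sapiens', rank => 'species', parent_id => 9605 },
    9605 => { name => 'Homo',         rank => 'genus',   parent_id => undef },
);

my $node = SketchNode->new(id => 9606, db_handle => $flatfile_db);
my $genus_from_flatfile = $node->name_at_rank('genus');

$node->db_handle($curated_db);
my $genus_from_curated = $node->name_at_rank('genus');

print defined $genus_from_flatfile ? $genus_from_flatfile : '(unknown)', "\n";
print "$genus_from_curated\n";
```

The same node object answers the same question differently depending on the handle, which is the speed-versus-accuracy trade-off described in the thread.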
From cjfields at uiuc.edu Wed Jul 26 12:16:29 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Wed, 26 Jul 2006 07:16:29 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C733A1.9070201@sendu.me.uk> References: <003301c6b000$203cc560$15327e82@pyrimidine> <44C65990.4080500@sendu.me.uk> <1748BD60-EFBE-4872-94A4-7CA3D58BE458@uiuc.edu> <44C733A1.9070201@sendu.me.uk> Message-ID: <0155C616-8E07-49C8-B4A1-2B77B4DE09F3@uiuc.edu> > ... > > I see 2 possible solutions to the problem. What should 'tax module' > be?: > > 1) Bio::Taxonomy or other similar class that is a container of > multiple > nodes. Naively this makes logical sense since one of the jobs > Bio::Species has is to store a lineage, and a lineage is best > represented as a set of Nodes. So let's have a single object with all > our Nodes in it. Problems: > > Bio::Taxonomy itself, as currently written, is fundamentally > flawed. It > requires that you know the ranks and order of ranks of all your input > nodes before you input them. It requires that all ranks have unique > names. It doesn't handle ranks of 'no rank'. You can't have more than > one lineage in an instance because you can't have two nodes with the > same rank. If you don't know the ranks of your nodes (ie. genbank) > there > is no way to maintain the order of your lineage because there is no > modelling of parent/child. > I had planned to re-write it such that the rank-centric implementation > was removed and we had parent/child implementation instead. But then > there is nothing to stop you adding nodes that are disconnected > from the > others, creating a broken mess. > > Bio::Taxonomy::Tree might have been a little more suitable because it > implements Bio::Tree::TreeI, but sadly it is also rank-centric and > actually requires input of both Bio::Species and Bio::Taxonomy objects > to its most useful methods. 
> > More important than issues with current implementations of > node-container classes, such classes are unable to let us solve > problem > c) in a good way, and also leave us potentially storing in memory Node > objects representing the same taxonomic node multiple times in > different > instances of the node-container. For problem c) if we were to switch > from genbank nodes to database the solution is to delete all the nodes > in the container and then get them all again from the database. > What if > you didn't even have a lineage-related question? You've just retrieved > 10s of nodes from the database for no reason (and then store them), > when > all you wanted was accurate information on the node you were > interested in. > > All in all, it's pretty horrible. Unsuitable implementations plus > excess > database retrieval plus massive waste of memory with duplicated nodes > does not equal a good solution. > > > 2) Bio::Taxonomy::Node. First, solve problem a) by getting rid of > methods binomial(), species(), genus(), sub_species(), > variant(), organelle(), classification() and show_all(). Except for > organelle() which doesn't belong in taxonomy, all of these > Bio::Species > 'questions' can still be answered by Node - just not in a single > method > call. I outlined how to answer them in the previous post. For backward > compatibility make Bio::Species a Node and implement the suggested way > of answering the questions the proper 'Node' way under those methods. > Problems: > > Well, those questions can't actually be answered by Node if the > starting > point was genbank data or manually created Nodes. The solution is > clean > and simple: Bio::DB::Taxonomy::genbank or perhaps better named > Bio::DB::Taxonomy::list (because it makes a taxonomy database from an > ordered list of names - I don't see anything inherently wrong or ugly > with that). Then everything magically just works. 
We get all the power > to ask all our questions that Node has already when working with the > ncbi database, but we get it when working with genbank data. We suffer > none of the problems of a node-container class. We can easily switch > databases on the fly. That 'broken mess' (referring to Bio::Taxonomy) is up to the user. You could make it more stringent (i.e. only allow connected nodes, starting with a single initiating node then build from there), though I don't think that's necessary as most people would probably use some sort of factory to generate a taxonomy (a warning might be appropriate). You would have to watch out for potential circular structures. Have it do what you want. I believe the original intent of Taxonomy was to allow building a full-fledged taxonomic structure, so it should stay that way. Sendu, you have to realize this is up to how you want to implement it. We're giving you the freedom to do what you want to Bio::Taxonomy. Of course, if we think you're off we'll reel you back in, but you seem to be on the right track. Realize that the only contentious issue here is that horrible lineage line in the GenBank file. We should have a way to rebuild it as it was from the original file (i.e. not rebuild it from scratch with DB lookups by default). However, you should also have the option to rebuild it from lookups (i.e. correctly), which you could do with a Taxonomy. Note this Bio::Taxonomy method: classify Title : classify Usage : @obj[][0-1] = taxonomy->classify($species); Function: return a ranked classification Returns : @obj of taxa and ranks as word pairs separated by "@" Args : Bio::Species object As Bio::Species will be deprecated, you can use that method in a dual, sneaky way: 1) directly store the lineage information, 2) return the real one (DB lookups) if needed (i,e, if some flag is set, for instance). And, if a Bio::Species argument is used, do what the docs state (catch it early on with an if block and return within it). 
Bio::Species, as used within genbank.pm, doesn't use Bio::Taxonomy in any way. I don't know if you even need to retain its original purpose here; you might be able to get away with changing the fundamental way this method works altogether. That's up to you. my 2c Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From bix at sendu.me.uk Wed Jul 26 12:49:05 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Wed, 26 Jul 2006 13:49:05 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <0155C616-8E07-49C8-B4A1-2B77B4DE09F3@uiuc.edu> References: <003301c6b000$203cc560$15327e82@pyrimidine> <44C65990.4080500@sendu.me.uk> <1748BD60-EFBE-4872-94A4-7CA3D58BE458@uiuc.edu> <44C733A1.9070201@sendu.me.uk> <0155C616-8E07-49C8-B4A1-2B77B4DE09F3@uiuc.edu> Message-ID: <44C764C1.9010804@sendu.me.uk> Chris Fields wrote: > We're giving you the freedom to do what you want to Bio::Taxonomy. I don't want to do anything with Bio::Taxonomy any more. I've already shown that it isn't suitable for the job. Regardless of how it is implemented, the entire idea of a class that contains Nodes isn't appropriate, for reasons already stated. > Realize that the only contentious issue here is > that horrible lineage line in the GenBank file. We should have a way to > rebuild it as it was from the original file (i.e. not rebuild it from > scratch with DB lookups by default). However, you should also have the > option to rebuild it from lookups (i.e. correctly), which you could do > with a Taxonomy. And I've already shown how rebuilding with a Taxonomy is very far from ideal, while switching db_handle on a Node would be perfect. Why are you now advocating Taxonomy when there is no reason to? 
> Note this Bio::Taxonomy method: > > classify > > Title : classify > Usage : @obj[][0-1] = taxonomy->classify($species); > Function: return a ranked classification > Returns : @obj of taxa and ranks as word pairs separated by "@" > Args : Bio::Species object Note that all this method does is let you combine a list of rank names with the classification array in a Bio::Species, spitting out some weird data structure. It is only of interest to Bio::Taxonomy::Tree. We're in the situation where we don't know the rank names corresponding to the classification array in a Bio::Species generated by genbank et al. So classify() is of zero value. > As Bio::Species will be deprecated, you can use that method in a dual, > sneaky way: 1) directly store the lineage information, No. Lineage information must be in the form of Nodes or you can't answer lineage-related taxonomic questions. > 2) return the real one (DB lookups) if needed Messy. Doing it with Node would be far superior. Again, Node works all the time, while Taxonomy would work badly or not at all some of the time. Rather than suggest ways of using Taxonomy, tell me what is wrong with my current Node plan. From cjfields at uiuc.edu Wed Jul 26 15:15:28 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Wed, 26 Jul 2006 10:15:28 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C764C1.9010804@sendu.me.uk> Message-ID: <002801c6b0c6$59279fa0$15327e82@pyrimidine> I advocate anything but Bio::Species that allows you the option to use lookups for correct taxonomic information and not guesswork (current Bio::Species). So, you could pretty much replace Species immediately with a DB-aware container object with simple get/sets. As of now, that would be that Node or Taxonomy. I have done this already, just haven't committed it yet. And, when I mentioned having freedom to do what you want with Bio::Taxonomy, that includes all of it (including Node, Tree, etc). 
We just want it to be reasonable and not 'duct tape' for the various Bio::Species mistakes of the past. I don't think the problem here is really that complicated (still, the only thing is the lineage stuff in a sequence file, right?). > > As Bio::Species will be deprecated, you can use that method in a dual, > > sneaky way: 1) directly store the lineage information, > > No. Lineage information must be in the form of Nodes or you can't answer > lineage-related taxonomic questions. You must have a way to store the 'horrible lineage information' data, as is, for those users who do not care about taxonomy and just want to convert seq streams. You shouldn't burden the everyday user with something that is pretty specialized, this being finding correct taxonomic information based on DB lookups for a particular reason (screening sequences, as Hilmar pointed out, was one possibility). I don't care how, but store lineage information as it appears in the file (scalar string) or in a simple data structure (array, maybe?) capable of retaining the information in some way. There are many many ways of doing this which I have previously pointed out; take your pick. Hilmar, in a previous post, told me to take a step back and contemplate a world w/o Bio::Species, where you would design a system capable of dealing with sequence file taxonomic data in a way that allows you to get correct tax information when needed via NCBI Taxonomy data, yet not sacrifice speed if you're just interested in converting sequences via SeqIO. Would you design a Bio::Species class, then? Would you attempt to spend time parsing out species and genus information, when the correct data is sitting on the NCBI server or in a local flatfile? No. You would retain the minimal data necessary in an object for reading and writing data, but have the >option< available to run a lookup. Therefore, Bio::Taxonomy::Node was born. A little prematurely, yes. Probably needed to bake a bit more... 
Anyway, we must eventually sever our reliance on Bio::Species in order to deprecate it, so the lineage information must be contained, as it appears in the file, somewhere else. And my point with the classify() Bio::Taxonomy method is not to use it as is; you could sneak in your own data if needed. It was an example of a possible way of containing the lineage data, but not meant to be an absolute way. It's up to you how you want to implement it. I think the classes that are currently in place are more than capable of handling the job. Hence my statement before that you are trying to get too many things going right out the starting gate. Start simply by replacing Bio::Species, then worry about other issues. If you think that a specialized class would work, fine, but IMHO I don't think it's absolutely necessary. I had proposed such a class before (more like a Bio::Species-like Tax object) but was shut down, and rightly so; it's unnecessarily complicated and 'contaminates' Bio::Taxonomy with extra unnecessary methods (classification(), genus(), and so on). My last proposal was to eventually strip out the unreliable taxonomic parsing in the various SeqIO modules and replace it with something simple, which seemed to be a consensus among us all. This has to do with Hilmar's post-apocalyptic vision of a Bio::Species-free world. That will eventually happen, and Bioperl will eventually switch over completely to Bio::Taxonomy::Whatever. And Bio::Species can join BPLite and other deprecated modules in the BioPerl Boot Hill. But, for now that can't happen. We all strive for the best information possible. However, you can't sacrifice the needs of other users, a majority whom probably care squat about taxonomy, with your (our) own needs. As I have repeatedly stated, simple is good. We can't just usurp the API for our own wishes w/o warning, so the change has to be gradual and Bio::Species must stick around for the time being. 
And we must make it optional to have DB lookups or the villagers will be storming the castle.

Listen, Sendu. If you can wait a couple of weeks for further discussion then we can slog on with this. But right now I just don't have any more time for this, sorry. You can have the last word and I'll respond when I get back.

Chris

From morissardj at gmail.com Wed Jul 26 14:59:54 2006
From: morissardj at gmail.com (Morissard jérome)
Date: Wed, 26 Jul 2006 14:59:54 +0000 (UTC)
Subject: [Bioperl-l] Accessing TRANSFAC matrices
References: <20060717145000.0aig7ymcmurk4wsk@webmail.embl.de> <44BEA9FB.1070009@utk.edu>
Message-ID:

Hi that may help you ?
http://morissardjerome.free.fr/Data/files/matrices.zip From hlapp at gmx.net Wed Jul 26 15:36:32 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Wed, 26 Jul 2006 11:36:32 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C73D21.3010301@sendu.me.uk> References: <000001c6b017$873176a0$15327e82@pyrimidine> <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net> <44C73D21.3010301@sendu.me.uk> Message-ID: On Jul 26, 2006, at 6:00 AM, Sendu Bala wrote: > Hilmar Lapp wrote: >> Instead, create something like >> >> # return a Bio::Taxonomy::Node: >> my $taxon = $seq->taxon(); > > Yes, but $seq->species() would also $seq->species() would return a Bio::Species object which may not be more than a thin shell anymore around an implementation that delegates almost everything to a lineage object (Bio::Taxonomy). $seq->taxon() in contrast need not return such a backwards-compatible construct. > >> # alternative approach: return a lineage (taxonomy) >> # this would be Bio::TaxonomyI compliant >> my $lineage = $seq->lineage(); > > I've since come to the conclusion that anything Taxonomy-ish would be > inappropriate - see recent post. Not sure which one you mean, and please don't reference really long emails, you're asking a lot of other people to organize your thoughts for them. At any rate, my point is that if you only name it appropriately you can avoid misconceptions about what is being returned. The fact that it's confusing to return a taxonomy from a method called species() doesn't mean it's equally bad to return a lineage (which is a limited taxonomy) from a method called lineage(). > [...] > > My proposed solution is that bioperl's taxonomy model always lets you > answer the same questions regardless of your source for taxonomic > information - see recent post. See above ... And I'd rather see some code or API examples than extensive elaborations. 
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at gmx.net Wed Jul 26 15:38:50 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Wed, 26 Jul 2006 11:38:50 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C733A1.9070201@sendu.me.uk> References: <003301c6b000$203cc560$15327e82@pyrimidine> <44C65990.4080500@sendu.me.uk> <1748BD60-EFBE-4872-94A4-7CA3D58BE458@uiuc.edu> <44C733A1.9070201@sendu.me.uk> Message-ID: <3E7DE5A0-C094-4894-BAB9-03F62EA98A3D@gmx.net> On Jul 26, 2006, at 5:19 AM, Sendu Bala wrote: > Chris Fields wrote: >> >>> It seems like the main problem with Node right now is that it has >>> classification() and things like genus(). I propose pure Node method >>> solutions to answer the questions classification() and genus() were >>> implemented to answer, but in a better, cruft-free way. >>> >>> Bio::DB::Taxonomy::genbank anyone? >> >> Ach... You're compromising here; > > No, I don't think so. Let me explain... > (another very long email, but with the same conclusion as above) Sorry, can you summarize this in a few sentences? If you do want feedback from me you really need to be more concise. -hilmar > > >> 1) Switch out Bio::Species with Node or Taxonomy; relocate other >> information temporarily (Bio::Species, get/sets in Seq object, >> SimpleValue). Leave Bio::Species in for the time being, but don't >> bother making any additional changes to it. > [...] >> Hence Hilmar's suggestion to use a $seq->taxon() method to return a >> Node/Taxonomy, and a $seq->species() would still return a >> Bio::Species object. 
It's redundant, > > As I see it, the problem to be solved is this: > > a) A node should just be a node, holding only information about itself > (but this can include information on who its parent is, and methods > relating to getting its parents/children as new objects - but the data > of its parents/children must never be stored on itself). > > b) Bio::Species isn't very good at its job; you can't ask reasonable > taxonomic questions of it and get correct answers. > > c) We need to transition Bio::Species to something better - something > that lets us do the same job as Bio::Species, but do it better. An > important aspect of 'better' is that we can switch from the taxonomic > information in a genbank file or similar to the information in a > taxonomic database if we want certain taxonomic questions answered > correctly. But also, we should be able to answer all questions with a > good chance of a correct answer even without database access/ > installation. > > There are a variety of possible solutions. How can we decide which is > best? What would a good solution be? > > The 'something better' we transition Bio::Species to will become the > preferred (or at least de facto standard) way of dealing with > taxonomic > information in bioperl. This taxonomic module (or set of modules) must > be able to model taxonomic information anywhere it is found - > databases > or genbank files or anything else. If it can't, it would be > fundamentally flawed. > > d) We can immediately discount any solution that involves storing some > taxonomic information outside of the tax module. If we find ourselves > putting lineage data in a genbank file in SimpleValue objects or > similar, we can be pretty sure we've used a poor solution to the > problem. That would be a compromise. 
> > e) If the thing we transition Bio::Species to can't do everything > Bio::Species did (doing it in a different and better way is fine of > course), it's not suitable for transitioning to (this is why Node > needed > all the cruft added to it before it was a suitable candidate). If it > /can/ do everything Bio::Species did, there would be no harm > immediately > making Bio::Species inherit from the new tax module, reimplementing > Bio::Species as necessary but making no API change. So any solution > that > would /require/ $seq->taxon() and $seq->species() wouldn't be a good > one, and would be a compromise. But we do want to get rid of > Bio::Species eventually, so I'm not saying we shouldn't have a > $seq->taxon() or similar, only that either method would give you the > same type of object with the same methods available > ($seq->taxon->isa('tax module') && ($seq->species->isa('Bio::Species') > && $seq->species->isa('tax module')). > > > I see 2 possible solutions to the problem. What should 'tax module' > be?: > > 1) Bio::Taxonomy or other similar class that is a container of > multiple > nodes. Naively this makes logical sense since one of the jobs > Bio::Species has is to store a lineage, and a lineage is best > represented as a set of Nodes. So let's have a single object with all > our Nodes in it. Problems: > > Bio::Taxonomy itself, as currently written, is fundamentally > flawed. It > requires that you know the ranks and order of ranks of all your input > nodes before you input them. It requires that all ranks have unique > names. It doesn't handle ranks of 'no rank'. You can't have more than > one lineage in an instance because you can't have two nodes with the > same rank. If you don't know the ranks of your nodes (ie. genbank) > there > is no way to maintain the order of your lineage because there is no > modelling of parent/child. 
> I had planned to re-write it such that the rank-centric implementation > was removed and we had parent/child implementation instead. But then > there is nothing to stop you adding nodes that are disconnected > from the > others, creating a broken mess. > > Bio::Taxonomy::Tree might have been a little more suitable because it > implements Bio::Tree::TreeI, but sadly it is also rank-centric and > actually requires input of both Bio::Species and Bio::Taxonomy objects > to its most useful methods. > > More important than issues with current implementations of > node-container classes, such classes are unable to let us solve > problem > c) in a good way, and also leave us potentially storing in memory Node > objects representing the same taxonomic node multiple times in > different > instances of the node-container. For problem c) if we were to switch > from genbank nodes to database the solution is to delete all the nodes > in the container and then get them all again from the database. > What if > you didn't even have a lineage-related question? You've just retrieved > 10s of nodes from the database for no reason (and then store them), > when > all you wanted was accurate information on the node you were > interested in. > > All in all, it's pretty horrible. Unsuitable implementations plus > excess > database retrieval plus massive waste of memory with duplicated nodes > does not equal a good solution. > > > 2) Bio::Taxonomy::Node. First, solve problem a) by getting rid of > methods binomial(), species(), genus(), sub_species(), > variant(), organelle(), classification() and show_all(). Except for > organelle() which doesn't belong in taxonomy, all of these > Bio::Species > 'questions' can still be answered by Node - just not in a single > method > call. I outlined how to answer them in the previous post. For backward > compatibility make Bio::Species a Node and implement the suggested way > of answering the questions the proper 'Node' way under those methods. 
> Problems: > > Well, those questions can't actually be answered by Node if the > starting > point was genbank data or manually created Nodes. The solution is > clean > and simple: Bio::DB::Taxonomy::genbank or perhaps better named > Bio::DB::Taxonomy::list (because it makes a taxonomy database from an > ordered list of names - I don't see anything inherently wrong or ugly > with that). Then everything magically just works. We get all the power > to ask all our questions that Node has already when working with the > ncbi database, but we get it when working with genbank data. We suffer > none of the problems of a node-container class. We can easily switch > databases on the fly. > > What's not to like? > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From jay at jays.net Wed Jul 26 15:32:53 2006 From: jay at jays.net (Jay Hannah) Date: Wed, 26 Jul 2006 08:32:53 -0700 Subject: [Bioperl-l] Anyone else at OSCON right now? Message-ID: <44C78B25.80503@jays.net> Any other BioPerl'ers here in Portland for OSCON? I'd love to chat about your life w/ BioPerl. I'm here until Saturday morning. j http://oscon.kwiki.org/index.cgi?JayHannah From adamnkraut at gmail.com Wed Jul 26 14:32:42 2006 From: adamnkraut at gmail.com (Adam Kraut) Date: Wed, 26 Jul 2006 10:32:42 -0400 Subject: [Bioperl-l] Structure::IO In-Reply-To: <716af09c0607250347w24e0d8dbj674758049bc0a4e2@mail.gmail.com> References: <716af09c0607250347w24e0d8dbj674758049bc0a4e2@mail.gmail.com> Message-ID: <134ede0b0607260732u79f0dea2if8f4ea98a5e03524@mail.gmail.com> Hi bernd, Can you better explain what it is you want to do with pdb files? 
From your example it looks like you want to do something with each chain, but it is unclear what you want to do here: my @chains = $struc->chain($chain); With that said, I was never able to use Bio::Structure in the way that I wanted. I now use the MMTSB Perl libraries instead: http://mmtsb.scripps.edu/cgi-bin/tooldoc?perlpackages Specifically the Molecule module may be useful here. Regards, Adam On 7/25/06, Bernd Web wrote: > > Hi, > > Does someone have experience with Bio::Structure::IO? > The example III.9.1 from the bptutorial.pl covers most, but what is e.g. > the > chain() method of Bio::Structure::Entry doing? The POD states: > > Title : chain > Usage : @chains = $structure->chain($chain); > Function: Connects a (or a list of) Chain objects to a > Bio::Structure::Entry. > Returns : list of Bio::Structure::Residue objects > Args : One Residue or a reference to an array of Residue objects > > But in e.g > my $stream = Bio::Structure::IO->new(-file => $filename, > -format => 'pdb'); > while ( my $struc = $stream->next_structure() ) { > for my $chain ($struc->get_chains) { > my $chainid = $chain->id; > my @chains = $struc->chain($chain); > } > } > > I get Bio::Structure::Chain=HASH(0x9f1ab50). > > What is the function of the chain method and how to use it? > > Best regards, > bernd > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- Adam N. Kraut National Resource for Biomedical Supercomputing http://www.nrbsc.org/sb/ From bix at sendu.me.uk Wed Jul 26 16:11:25 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Wed, 26 Jul 2006 17:11:25 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <002801c6b0c6$59279fa0$15327e82@pyrimidine> References: <002801c6b0c6$59279fa0$15327e82@pyrimidine> Message-ID: <44C7942D.6050603@sendu.me.uk> Chris Fields wrote: >> No. 
Lineage information must be in the form of Nodes or you can't answer
>> lineage-related taxonomic questions.
>
> You must have a way to store the 'horrible lineage information' data, as is,
> for those users who do not care about taxonomy and just want to convert seq
> streams. You shouldn't burden the everyday user with something that is
> pretty specialized, this being finding correct taxonomic information based
> on DB lookups for a particular reason (screening sequences, as Hilmar
> pointed out, was one possibility).

I am certainly not requiring that anyone find 'correct taxonomic information'. The whole reason I am backing my current proposal is that it works equally well with or without access to NCBI's taxonomy database. Your proposals work poorly without access to such.

> I don't care how, but store lineage information as it appears in the file
> (scalar string) or in a simple data structure (array, maybe?) capable of
> retaining the information in some way. There are many many ways of doing
> this which I have previously pointed out; take your pick.

I've taken my pick.

To set:
my $db = new Bio::DB::Taxonomy(-source => 'list', -lineage => \@lineage);
$node->db_handle($db);

To get:
@lineage = map { $_->scientific_name } $node->get_Lineage_Nodes;

That is as simple as it is going to get in a world where we have 'pure' Nodes or any other kind of pure taxonomic class. If you want to hide the taxonomic complexity from end-users who want to make and store their own lineage of their species without having to know the details of how bioperl's taxonomy modules are supposed to work, tell them to use Bio::Species:

To set:
$species->classification(@lineage);

To get:
@lineage = $species->classification;

Of course in this example I propose that behind the scenes Bio::Species is a Bio::Taxonomy::Node and just implements classification() the pure Node way, given above.
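A minimal sketch of how a list-backed source of this kind might behave, written in plain Perl with no BioPerl dependency. The ListTaxonomy package and its add_lineage(), lineage() and lca() methods are hypothetical names used only for illustration; they are not the proposed Bio::DB::Taxonomy API. The sketch also assumes that names are unique, which a real taxonomy does not guarantee.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for a 'list' taxonomy source; NOT a BioPerl class.
package ListTaxonomy;

sub new {
    my ($class) = @_;
    return bless { parent_of => {} }, $class;
}

# Register one lineage, ordered from the most specific name to the root.
# Only parent links are stored, so each name holds no data about other names.
sub add_lineage {
    my ($self, @names) = @_;
    return $self unless @names;
    for my $i (0 .. $#names - 1) {
        $self->{parent_of}{ $names[$i] } = $names[ $i + 1 ];
    }
    # The last name is a root unless an earlier lineage gave it a parent.
    $self->{parent_of}{ $names[-1] } = undef
        unless exists $self->{parent_of}{ $names[-1] };
    return $self;
}

# Follow parent links: returns the name itself, then each ancestor in turn.
sub lineage {
    my ($self, $name) = @_;
    my (@lineage, %seen);
    while (defined $name && !$seen{$name}++) {    # %seen guards against cycles
        push @lineage, $name;
        $name = $self->{parent_of}{$name};
    }
    return @lineage;
}

# Most recent common ancestor: first name in the second lineage that
# also occurs in the first lineage.
sub lca {
    my ($self, $first, $second) = @_;
    my %in_first = map { $_ => 1 } $self->lineage($first);
    for my $name ($self->lineage($second)) {
        return $name if $in_first{$name};
    }
    return;
}

package main;

my $db = ListTaxonomy->new;
$db->add_lineage('Homo sapiens', 'Homo', 'Mammalia', 'Eukaryota');
$db->add_lineage('Mus musculus', 'Mus',  'Mammalia', 'Eukaryota');

print join(', ', $db->lineage('Homo sapiens')), "\n";  # Homo sapiens, Homo, Mammalia, Eukaryota
print $db->lca('Homo sapiens', 'Mus musculus'), "\n";  # Mammalia
```

Because the lineage is held as parent links rather than as an array on one object, the same lineage() and lca() calls would work unchanged if the parent links came from a real database instead of an ordered list.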
Let me make my requirement very clear: the solution must allow you to find the most recent common ancestor of two solution-objects without access to the NCBI taxonomy database, using exactly the same method call you would use if you /did/ have access to the NCBI taxonomy database. The method in question shouldn't need any special-case code depending on the presence or absence of NCBI taxonomy database. That's the litmus test. I'll tend to reject any solution that fails. From bix at sendu.me.uk Wed Jul 26 16:25:41 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Wed, 26 Jul 2006 17:25:41 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <3E7DE5A0-C094-4894-BAB9-03F62EA98A3D@gmx.net> References: <003301c6b000$203cc560$15327e82@pyrimidine> <44C65990.4080500@sendu.me.uk> <1748BD60-EFBE-4872-94A4-7CA3D58BE458@uiuc.edu> <44C733A1.9070201@sendu.me.uk> <3E7DE5A0-C094-4894-BAB9-03F62EA98A3D@gmx.net> Message-ID: <44C79785.6050705@sendu.me.uk> Hilmar Lapp wrote: > > On Jul 26, 2006, at 5:19 AM, Sendu Bala wrote: > >>>> It seems like the main problem with Node right now is that it has >>>> classification() and things like genus(). I propose pure Node method >>>> solutions to answer the questions classification() and genus() were >>>> implemented to answer, but in a better, cruft-free way. >>>> >>>> Bio::DB::Taxonomy::genbank anyone? > > Sorry, can you summarize this in a few sentences? If you do want > feedback from me you really need to be more concise. A bad solution-module stores any kind of taxonomic information outside of the solution-module or in an inconsistent form. By 'inconsistent' I mean, sometimes you store the name of a taxonomic rank with $node->node_name, other times you store it in an array or scalar held directly on the solution-module or elsewhere. Bio::Taxonomy specifically is not usable. 
Generally speaking, classes that are containers of multiple nodes are also inappropriate, because they result in excess database retrieval and excess storage of duplicated information amongst instances of such classes.

Bio::Taxonomy::Node combined with Bio::DB::Taxonomy::list would probably be ideal.

From cjfields at uiuc.edu Wed Jul 26 16:49:40 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Wed, 26 Jul 2006 11:49:40 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To:
Message-ID: <000001c6b0d3$7d936ec0$15327e82@pyrimidine>

Hilmar, apologies ahead of time for not being too concise! It's my last hurrah on this thread. No, really!

...
> > Yes, but $seq->species() would also
>
> $seq->species() would return a Bio::Species object which may not be
> more than a thin shell anymore around an implementation that
> delegates almost everything to a lineage object (Bio::Taxonomy).
>
> $seq->taxon() in contrast need not return such a backwards-compatible
> construct.

In genbank.pm _read_GenBank_Species (initial implementation, to switch out Bio::Species with Taxonomy/Node object):

1) Assign data to both Bio::Species (as currently implemented) and Bio::Taxonomy::Node (new way).
2) Assign organelle to Bio::Species and the Seq object get/set organelle().
3) Assign lineage information to Bio::Species and as an array to the Seq object get/set lineage().

Replace the get/set above with your method of choice, just no Bio::Species.

In genbank.pm write_seq()

1) If DB_lookup flag is defined, use $seq->taxon() to build lineage
2) If not, use $seq->lineage().

The fine details (how do you build the lineage?!?) can be worked out along the way. The wonders of CVS!

The Taxonomy class used here could be returned using Hilmar's $seq->taxon() and Bio::Species can be returned via $seq->species(). Makes perfect sense! Separated! Nothing complicated about it. Nice and clean. And Bio::Species can eventually be shown the exit door. Elvis has left the building...
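The write_seq() choice described above could look roughly like the following sketch. It is plain Perl, not genbank.pm code: lineage_for_output(), the db_lookup option and the taxon_lookup callback are hypothetical names, the sequence is a plain hash standing in for a Seq object, and the lineages are made-up sample data. The fallback to the stored lineage when the lookup returns nothing is an added assumption.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, not part of genbank.pm: pick the lineage to write out.
# $seq is a plain hash standing in for a sequence object. It holds the
# lineage exactly as read from the file, plus an optional callback that
# stands in for a taxonomy database lookup.
sub lineage_for_output {
    my ($seq, %opt) = @_;

    # Only consult the lookup when the caller explicitly asks for it.
    if ($opt{db_lookup} && ref($seq->{taxon_lookup}) eq 'CODE') {
        my @looked_up = $seq->{taxon_lookup}->();
        return @looked_up if @looked_up;
        # Assumption: an empty lookup result falls through to the stored lineage.
    }

    # Default: return the lineage as it appeared in the input file.
    return @{ $seq->{lineage} || [] };
}

my $seq = {
    lineage      => [ 'Eukaryota', 'Mammalia', 'Homo' ],    # sample data only
    taxon_lookup => sub { return ('Eukaryota', 'Metazoa', 'Mammalia', 'Homo') },
};

print join('; ', lineage_for_output($seq)), "\n";                  # Eukaryota; Mammalia; Homo
print join('; ', lineage_for_output($seq, db_lookup => 1)), "\n";  # Eukaryota; Metazoa; Mammalia; Homo
```

With this shape, users who only convert sequence streams never trigger a lookup, while those who need it opt in with a single flag.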
Organelle-specific sequence TaxIDs, as they refer to the organism and not the organelle, could be placed elsewhere, preferably somewhere more accessible such as $seq->organelle(). And lineage, similarly, could be placed in $seq->lineage(), which would store it as a raw string or as an array. There are many other ways I had pointed out (SimpleValue, Node, etc); I don't care, as long as we eventually sever the Bio::Species tumor from SeqIO. ... > ...And I'd rather see some code or API examples than > extensive elaborations. > > -hilmar Hilmar's right; working code does speaks louder than words. The energy spent in writing up full expositions is better spent elsewhere, hence: I need to get back to work! Wish I could contribute more. Chris From bix at sendu.me.uk Wed Jul 26 17:13:43 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Wed, 26 Jul 2006 18:13:43 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: References: <000001c6b017$873176a0$15327e82@pyrimidine> <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net> <44C73D21.3010301@sendu.me.uk> Message-ID: <44C7A2C7.2070100@sendu.me.uk> Hilmar Lapp wrote: > On Jul 26, 2006, at 6:00 AM, Sendu Bala wrote: > >> Hilmar Lapp wrote: >>> Instead, create something like >>> >>> # return a Bio::Taxonomy::Node: >>> my $taxon = $seq->taxon(); >> Yes, but $seq->species() would also > > $seq->species() would return a Bio::Species object which may not be > more than a thin shell anymore around an implementation that > delegates almost everything to a lineage object (Bio::Taxonomy). I actually forgot to finish that sentence. I was going to suggest Bio::Species isa Bio::Taxonomy::Node and would indeed delegate most of its implementation to Node. >>> # alternative approach: return a lineage (taxonomy) >>> # this would be Bio::TaxonomyI compliant >>> my $lineage = $seq->lineage(); >> I've since come to the conclusion that anything Taxonomy-ish would be >> inappropriate - see recent post. 
>
> The fact that it's confusing to return a taxonomy from a method called species()
> doesn't mean it's equally bad to return a lineage (which is a limited
> taxonomy) from a method called lineage().

You wouldn't need to though. If you want a lineage you could ask your node for its lineage. There's no point in having a whole other class that contains a node and all its ancestor nodes, when to get the ancestors of a node all you have to do is $node->get_Lineage_Nodes().

>> My proposed solution is that bioperl's taxonomy model always lets you
>> answer the same questions regardless of your source for taxonomic
>> information - see recent post.
>
> See above ... And I'd rather see some code or API examples

The fine details of the following may be slightly off, but it's just to provide an example. I'll use Test.pm syntax.

my @human = ('Homo sapiens', 'Homo', 'Mammalia', 'Eukaryota');
my @mouse = ('Mus musculus', 'Mus', 'Mammalia', 'Eukaryota');

Old way with Node
-----------------
my $h_node = new Bio::Taxonomy::Node(-classification => \@human);
my $m_node = new Bio::Taxonomy::Node(-classification => \@mouse);

@human = map { $_->scientific_name } $h_node->get_Lineage_Nodes;
ok scalar(@human), 0; # failure to work as expected

@human = $h_node->classification;
ok join(", ", @human), "Homo sapiens, Homo, Mammalia, Eukaryota";

my $lca = $h_node->get_LCA_Node($m_node);
ok $lca, undef; # failure to do anything useful because our lineage data
                # is in an array, not in nodes

# try again with entrez - must make brand new objects
my $db = new Bio::DB::Taxonomy(-source => 'entrez');
$h_node = $db->get_Taxonomy_Node(-name => 'Homo sapiens');
$m_node = $db->get_Taxonomy_Node(-name => 'Mus musculus');

@human = map { $_->scientific_name } $h_node->get_Lineage_Nodes;
ok join(", ", @human) eq "Homo sapiens, Homo, Homo/Pan/Gorilla group, Hominidae, ..."; # now it works!

$lca = $h_node->get_LCA_Node($m_node);
ok $lca->scientific_name, 'Mammalia'; # and now this works!
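
[Editorial sketch, not part of the original message.] The "old way" example above fails because the lineage sits in a plain array that get_LCA_Node() cannot walk. As a rough illustration of what a list-backed taxonomy source would have to compute, the plain-Perl sketch below finds the most recent common ancestor of two leaf-first lineage lists with no database at all. It does not use BioPerl; the function name lca_from_lineages is hypothetical, and it assumes both lists end at the same root and spell shared ranks identically.

```perl
use strict;
use warnings;

# Hypothetical helper: lineages are array refs listed leaf-first, root-last,
# in the same order as the @human / @mouse lists in the message.
sub lca_from_lineages {
    my ($lineage_x, $lineage_y) = @_;

    # Walk both lineages from the root downwards.
    my @root_first_x = reverse @$lineage_x;
    my @root_first_y = reverse @$lineage_y;

    my $lca;
    while (@root_first_x && @root_first_y
           && $root_first_x[0] eq $root_first_y[0]) {
        $lca = shift @root_first_x;    # deepest shared name so far
        shift @root_first_y;
    }
    return $lca;    # undef when not even the root is shared
}

my @human_lineage = ('Homo sapiens', 'Homo', 'Mammalia', 'Eukaryota');
my @mouse_lineage = ('Mus musculus', 'Mus',  'Mammalia', 'Eukaryota');

print lca_from_lineages(\@human_lineage, \@mouse_lineage), "\n";  # prints "Mammalia"
```

A real implementation would compare node identifiers rather than names, since the same name can occur at different places in a taxonomy.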
Old way with Bio::Species
-------------------------
# forget about it, Species has nothing like a get_LCA_Node()

Proposed way with Node
----------------------
my $db = new Bio::DB::Taxonomy(-source => 'list', -lineage => @human);
my $h_node = $db->get_Taxonomy_Node(-name => 'Homo sapiens');
$db->add_lineage(@mouse); # or make a new db
my $m_node = $db->get_Taxonomy_Node(-name => 'Mus musculus');

@human = map { $_->scientific_name } $h_node->get_Lineage_Nodes;
ok join(", ", @human), "Homo sapiens, Homo, Mammalia, Eukaryota"; # works as expected

my $lca = $h_node->get_LCA_Node($m_node);
ok $lca->scientific_name, 'Mammalia'; # works first time

# try again with entrez - just change the db_handle
$h_node->db_handle(new Bio::DB::Taxonomy(-source => 'entrez'));

@human = map { $_->scientific_name } $h_node->get_Lineage_Nodes;
ok join(", ", @human) eq "Homo sapiens, Homo, Homo/Pan/Gorilla group, Hominidae, ...";

$lca = $h_node->get_LCA_Node($m_node);
ok $lca->scientific_name, 'Mammalia';

Proposed way with Bio::Species
------------------------------
# (Bio::Species isa Bio::Taxonomy::Node, implements its methods like
# above)
my $h_species = new Bio::Species(-classification => \@human);
my $m_species = new Bio::Species(-classification => \@mouse);

@human = map { $_->scientific_name } $h_species->get_Lineage_Nodes;
ok join(", ", @human), "Homo sapiens, Homo, Mammalia, Eukaryota";

@human = $h_species->classification;
ok join(", ", @human), "Homo sapiens, Homo, Mammalia, Eukaryota";

my $lca = $h_species->get_LCA_Node($m_species);
ok $lca->scientific_name, 'Mammalia';

# trying again with entrez behaves as per proposed Node, above

From angshu96 at gmail.com Wed Jul 26 17:15:35 2006
From: angshu96 at gmail.com (Angshu Kar)
Date: Wed, 26 Jul 2006 12:15:35 -0500
Subject: [Bioperl-l] WUBLASTP parsing problem
Message-ID:

Hi,

Does WU-BLASTP have something to do with the length of the sequence names (or the sequence names)?
What is happening here is that I use FASTA-format proteins to build the blast report (I do a distributed blastp). But when I parse the report (using bioperl), the query column remains empty for some results, as in:

        328857  6.6e-135
        325331  6.3e-114
        325329  1.0e-113
        325332  1.7e-113
        325330  2.7e-113
        .
        .
        .

while for some it's perfect, as in:

267750  280003  7.5e-301
267750  348279  7.5e-301
267750  345867  2.0e-300
267750  251915  2.0e-300
267750  346539  6.7e-300
.
.
.

Some of my sequences are as:

IMGA|AC159872_38.1 hypothetical protein AC159872.12 35121-35051 H EGN_Mt050401 20060209 TIGR 1671.m00013
mrsciilhnmivederdtyaqrwtefeqpggngsstpqpystelrdpdvhhklqtdlvkh
iwikfgmyrd*

And part of the blastp result (the one where I'm facing the issue) is as:

                                                                     Smallest
                                                                       Sum
                                                              High  Probability
Sequences producing High-scoring Segment Pairs:              Score  P(N)      N

gi|33333045|gb|AAQ11687.1| MADS box protein [Triticum aes...  1318  6.6e-135  1
gi|47681327|gb|AAT37484.1| MADS5 protein [Dendrocalamus l...  1120  6.3e-114  1
gi|47681331|gb|AAT37486.1| MADS7 protein [Dendrocalamus l...  1118  1.0e-113  1
gi|47681325|gb|AAT37483.1| MADS4 protein [Dendrocalamus l...  1116  1.7e-113  1
gi|47681329|gb|AAT37485.1| MADS6 protein [Dendrocalamus l...  1114  2.7e-113  1
gi|47681323|gb|AAT37482.1| MADS3 protein [Dendrocalamus l...  1114  2.7e-113  1
11674.m04224|LOC_Os08g41950|protein K-box region, putative     976  1.1e-98   1
gi|28630961|gb|AAO45877.1| MADS5 [Lolium perenne]              967  1.0e-97   1
gi|44888605|gb|AAS48129.1| AGAMOUS LIKE9-like protein [Ho...   964  2.1e-97   1
11674.m04223|LOC_Os08g41950|protein K-box region, putative     899  1.6e-90   1
gi|34979580|gb|AAQ83834.1| MADS box protein [Asparagus of...   875  5.8e-88   1

Could you please let me know if I'm missing something? Does the gi have anything to do with this?

Thanking you,
Angshu

From cain.cshl at gmail.com Wed Jul 26 16:19:26 2006
From: cain.cshl at gmail.com (Scott Cain)
Date: Wed, 26 Jul 2006 12:19:26 -0400
Subject: [Bioperl-l] Installing staden io_lib on windows?
Message-ID: <1153930767.2632.5.camel@localhost.localdomain>

Hi all,

I'm wondering if anyone has tried to install Staden's io_lib on Windows, and if so, how did it go? I am not much of a Windows person, but I've tried to make it under cygwin only to get this message:

make all-recursive
make[1]: Entering directory `/home/scott/io_lib-1.9.2'
Making all in read
make[2]: Entering directory `/home/scott/io_lib-1.9.2/read'
if gcc -DHAVE_CONFIG_H -I. -I. -I.. -I.. -I../include -I../read -I../alf -I../abi -I../ctf -I../ztr -I../plain -I../scf -I../sff -I../exp_file -I../utils -I/usr/local/include -g -O2 -MT Read.o -MD -MP -MF ".deps/Read.Tpo" -c -o Read.o Read.c; \
then mv -f ".deps/Read.Tpo" ".deps/Read.Po"; else rm -f ".deps/Read.Tpo"; exit 1; fi
In file included from Read.h:43,
                 from Read.c:40:
../utils/os.h:346:2: #error Must define SP_BIG_ENDIAN or SP_LITTLE_ENDIAN in Makefile
make[2]: *** [Read.o] Error 1
make[2]: Leaving directory `/home/scott/io_lib-1.9.2/read'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/home/scott/io_lib-1.9.2'
make: *** [all] Error 2

I'm guessing there is a flag I can pass to the configure script to get the endian-ness right, but I don't know (and I don't know if this is just the beginning of a long, fruitless road :-)

I would like to use Bio::SCF (from CPAN) in conjunction with the trace glyph in Bio::Graphics to view traces in GBrowse.

Thanks for any advice,
Scott

--
------------------------------------------------------------------------
Scott Cain, Ph. D.                                   cain.cshl at gmail.com
GMOD Coordinator (http://www.gmod.org/)                     216-392-3087
Cold Spring Harbor Laboratory
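
[Editorial sketch, not part of the original message.] The #error in the build log above says that exactly one of two preprocessor symbols has to be defined. A minimal sketch, assuming a working perl in the same cygwin environment: it reads the byte order perl itself was built with and prints the matching symbol. Passing that symbol through CPPFLAGS to configure is an untested assumption about io_lib's build, not a documented option, and endian_define is a hypothetical helper name.

```perl
use strict;
use warnings;
use Config;

# Map perl's byteorder string to the symbol named in the #error message.
# Per the Config documentation, byteorder is 1234 or 12345678 on
# little-endian builds and 4321 or 87654321 on big-endian builds.
sub endian_define {
    my ($byteorder) = @_;
    return 'SP_LITTLE_ENDIAN' if $byteorder =~ /^1234/;
    return 'SP_BIG_ENDIAN'    if $byteorder =~ /^(?:4321|8765)/;
    return;    # unknown or mixed byte order
}

my $define = endian_define($Config{byteorder});
if (defined $define) {
    # Assumption: the build honours CPPFLAGS; otherwise add -D... to the Makefile.
    print "byteorder $Config{byteorder}: try CPPFLAGS=-D$define ./configure\n";
}
else {
    print "unrecognised byteorder $Config{byteorder}\n";
}
```

On an x86 Windows machine this is expected to suggest SP_LITTLE_ENDIAN, but whether the rest of io_lib then builds under cygwin is a separate question.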
From morissardj at gmail.com Wed Jul 26 20:49:58 2006
From: morissardj at gmail.com (leverdeterre)
Date: Wed, 26 Jul 2006 13:49:58 -0700 (PDT)
Subject: [Bioperl-l] Accessing TRANSFAC matrices
In-Reply-To:
References: <44BEA9FB.1070009@utk.edu>
Message-ID: <5510746.post@talk.nabble.com>

I'm happy to help. I have made a page which may interest you:
http://morissardjerome.free.fr/Data/index.html

There is more information about the 397 matrix file (in the first 3 lines), and I have made all the logo files.

++
--
View this message in context: http://www.nabble.com/Re%3A-Accessing-TRANSFAC-matrices-tf1969467.html#a5510746
Sent from the Perl - Bioperl-L forum at Nabble.com.

From morissardj at gmail.com Wed Jul 26 21:15:19 2006
From: morissardj at gmail.com (leverdeterre)
Date: Wed, 26 Jul 2006 14:15:19 -0700 (PDT)
Subject: [Bioperl-l] Blast Output Parsing
In-Reply-To:
References:
Message-ID: <5511136.post@talk.nabble.com>

And without Bioperl, I think this may help you:
http://morissardjerome.free.fr/perl/blastparser.html

--
View this message in context: http://www.nabble.com/Blast-Output-Parsing-tf1974691.html#a5511136
Sent from the Perl - Bioperl-L forum at Nabble.com.

From osborne1 at optonline.net Wed Jul 26 21:00:50 2006
From: osborne1 at optonline.net (Brian Osborne)
Date: Wed, 26 Jul 2006 17:00:50 -0400
Subject: [Bioperl-l] SeqUtils
In-Reply-To: <716af09c0607250444y3e005fb1t4e20094fd8db993d@mail.gmail.com>
Message-ID:

Bernd,

That's easily done, changed both POD and code.

Brian O.

On 7/25/06 7:44 AM, "Bernd Web" wrote:

> Hi,
>
> With Bio::SeqUtils it may be nice to support 3 letter codes with
> capitals only, too.
> Now
>
> my $string = Bio::SeqUtils->seq3in($seqobj, 'METGLYTER');
>
> will give in $string->seq: XXX.
>
> Possibly the capitals in MetGlyTer are used to find the amino acids codes?
> If not maybe it's easy to implement case-insensitive, or all-capitals > for AA codes in SeqUtils? > > In addition about the POD: maybe it's better not use use $string since > Bio::SeqUtils->seq3in does not return a string but a Bio::PrimarySeq > object. > > Regards, > Bernd > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From osborne1 at optonline.net Wed Jul 26 21:24:34 2006 From: osborne1 at optonline.net (Brian Osborne) Date: Wed, 26 Jul 2006 17:24:34 -0400 Subject: [Bioperl-l] Structure::IO In-Reply-To: <716af09c0607250347w24e0d8dbj674758049bc0a4e2@mail.gmail.com> Message-ID: Bernd, I'm not following your question. The POD in the latest Bio::Structure::Entry shows: =head2 chain() Title : chain Usage : @chains = $structure->chain($chain); Function: Connects a Chain or a list of Chain objects to a Bio::Structure::Entry. Returns : List of Bio::Structure::Chain objects Args : A Chain or a reference to an array of Chain objects =cut Which is not what you've copied and pasted. What version of Bioperl do you use? Brian O. On 7/25/06 6:47 AM, "Bernd Web" wrote: > Hi, > > Does someone have experience with Bio::Structure::IO? > The example III.9.1 from the bptutorial.pl covers most, but what is e.g. the > chain() method of Bio::Structure::Entry doing? The POD states: > > Title : chain > Usage : @chains = $structure->chain($chain); > Function: Connects a (or a list of) Chain objects to a Bio::Structure::Entry. > Returns : list of Bio::Structure::Residue objects > Args : One Residue or a reference to an array of Residue objects > > But in e.g > my $stream = Bio::Structure::IO->new(-file => $filename, > -format => 'pdb'); > while ( my $struc = $stream->next_structure() ) { > for my $chain ($struc->get_chains) { > my $chainid = $chain->id; > my @chains = $struc->chain($chain); > } > } > > I get Bio::Structure::Chain=HASH(0x9f1ab50). 
> > What is the function of the chain method and how to use it? > > Best regards, > bernd > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From hlapp at gmx.net Thu Jul 27 05:06:52 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 27 Jul 2006 01:06:52 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C7A2C7.2070100@sendu.me.uk> References: <000001c6b017$873176a0$15327e82@pyrimidine> <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net> <44C73D21.3010301@sendu.me.uk> <44C7A2C7.2070100@sendu.me.uk> Message-ID: <86D32350-0412-40AC-8C34-51DCD3064711@gmx.net> I think this looks like a great solution. You could also name Bio::DB::Taxonomy::list as Bio::DB::Taxonomy::inmemory because it really isn't much else than an in-memory database (of limited content if you populate it from flat-file sequence annotation). The only reservation I have is that you'd have methods on Node that don't really operate on the node instance but rather operate on the taxonomy (database) behind the scenes. That's what I would have used Bio::Taxonomy for, not so much as a container than as a class with (conceptually) 'static' methods corresponding to those that are now in Node, like get_Lineage_Nodes(). They would optionally accept a db_handle too, or use a default one set as an attribute. However, leaving/having these methods on Node really isn't such a big deal and I'm sure would even be preferred by many people for the sake of simplicity. So overall I think you should just go ahead. -hilmar On Jul 26, 2006, at 1:13 PM, Sendu Bala wrote: > > The fine details of the following may be slightly off, but it's > just to > provide an example. I'll use Test.pm syntax. > > my @human = qw('Homo sapiens' Homo Mammalia Eukaryota); > my @mouse = qw('Mus musculus' Mus Mammalia Eukaryota); > > > [...] 
> Proposed way with Node > ---------------------- > > my $db = new Bio::DB::Taxonomy(-source => 'list', -lineage => @human); > my $h_node = $db->get_Taxonomy_Node(-name => 'Homo sapiens'); > $db->add_lineage(@mouse); # or make a new db > my $m_node = $db->get_Taxonomy_Node(-name => 'Mus musculus'); > > @human = map { $_->scientific_name } $h_node->get_Lineage_Nodes; > ok join(", ", @human), "Homo sapiens, Homo, Mammalia, Eukaryota"; > # works as expected > > my $lca = $h_node->get_LCA_Node($m_node); > ok $lca->scientific_name, 'Mammalia'; # works first time > > # try again with entrez - just change the db_handle > $h_node->db_handle(new Bio:DB::Taxonomy(-source => 'entrez'); > > @human = map { $_->scientific_name } $h_node->get_Lineage_Nodes; > ok join(", ", @human) eq "Homo sapiens, Homo, Homo/Pan/Gorilla group, > Hominidae, ..."; > > $lca = $h_node->get_LCA_Node($m_node); > ok $lca->scientific_name, 'Mammalia'; > > [...] -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From bix at sendu.me.uk Thu Jul 27 07:07:22 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 27 Jul 2006 08:07:22 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <86D32350-0412-40AC-8C34-51DCD3064711@gmx.net> References: <000001c6b017$873176a0$15327e82@pyrimidine> <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net> <44C73D21.3010301@sendu.me.uk> <44C7A2C7.2070100@sendu.me.uk> <86D32350-0412-40AC-8C34-51DCD3064711@gmx.net> Message-ID: <44C8662A.3080904@sendu.me.uk> Hilmar Lapp wrote: > The only reservation I have is that you'd have methods on Node that > don't really operate on the node instance but rather operate on the > taxonomy (database) behind the scenes. 
> That's what I would have used
> Bio::Taxonomy for, not so much as a container than as a class with
> (conceptually) 'static' methods corresponding to those that are now
> in Node, like get_Lineage_Nodes().

Yes, I had the same reservation. But it somehow seemed reasonable for me to ask a node for its lineage, though I draw the line at having a method like get_node('rank_name'). That's the only thing Bio::Taxonomy would have been good for, so it's a trade-off between some nice methods and the problems inherent in a node-container class.

Though, perhaps we almost have the best of both worlds, since the database is effectively a container without the problems:

$node->db_handle->get_Taxonomy_Node(-rank => 'rank_name',
                                    -lineage_of => $node);

?

> So overall I think you should just go ahead.

Great, will do.

From maximilianh at gmail.com Thu Jul 27 08:56:44 2006
From: maximilianh at gmail.com (Maximilian Haeussler)
Date: Thu, 27 Jul 2006 10:56:44 +0200
Subject: [Bioperl-l] TRANSFAC matrices, open acces
Message-ID: <76f031ae0607270156lde64ff9m590257d98899fe64@mail.gmail.com>

Actually, the fact that the TRANSFAC matrices belong to a company is quite inconvenient for biologists and bioinformatics analysts working in this field.

There are some projects in which volunteers annotate cis-sequences at regular intervals and put the data into the public domain; one of them is the ORegAnno database, http://www.oreganno.org/. Its first annotation jamboree will be held in Gent at the end of this year. If you're interested in cis-sequences, want to meet others who are, and are willing to contribute some annotation effort, don't hesitate to come to Gent: it's conveniently placed in the middle of Europe and registration costs almost nothing.
http://www.dmbr.ugent.be/bioit/contents/regcreative/

One day, hopefully, journals will oblige authors to put their sequences in a common format into GenBank, but as long as regulation is not seen as an important part of genome annotation, a lot of manual annotation will have to be done.

cheers
max

> On 26/07/06, leverdeterre wrote:
> >
> > i'm happy for helping you
> > i'have done a page whitch can interrest you
> > http://morissardjerome.free.fr/Data/index.html
> >
> > there is more information about the 397 matrix file ( in the 3 first line)
> > and i'm done all the logo file .
> >
> > ++
> > --
> > View this message in context: http://www.nabble.com/Re%3A-Accessing-TRANSFAC-matrices-tf1969467.html#a5510746
> > Sent from the Perl - Bioperl-L forum at Nabble.com.
>

--
Maximilian Haeussler,
CNRS/INRA Gif-sur-Yvette, France
tel: +33 6 12 82 76 16
skype: maximilianhaeussler

From morissardj at gmail.com Thu Jul 27 09:10:19 2006
From: morissardj at gmail.com (leverdeterre)
Date: Thu, 27 Jul 2006 02:10:19 -0700 (PDT)
Subject: [Bioperl-l] Accessing TRANSFAC matrices
In-Reply-To: <5510746.post@talk.nabble.com>
References: <44BEA9FB.1070009@utk.edu> <5510746.post@talk.nabble.com>
Message-ID: <5517747.post@talk.nabble.com>

Sorry, I have removed all this data because it is the property of TRANSFAC:
http://www.gene-regulation.com/pub/databases/transfac/doc/misc.html

The TRANSFAC® database is free for users from non-profit organizations only. Users from commercial enterprises have to license the TRANSFAC® database and accompanying programs.

--
View this message in context: http://www.nabble.com/Re%3A-Accessing-TRANSFAC-matrices-tf1969467.html#a5517747
Sent from the Perl - Bioperl-L forum at Nabble.com.
From maximilianh at gmail.com Thu Jul 27 08:44:47 2006 From: maximilianh at gmail.com (Maximilian Haeussler) Date: Thu, 27 Jul 2006 10:44:47 +0200 Subject: [Bioperl-l] Accessing TRANSFAC matrices In-Reply-To: <5510746.post@talk.nabble.com> References: <44BEA9FB.1070009@utk.edu> <5510746.post@talk.nabble.com> Message-ID: <76f031ae0607270144of6ff9cbtbd9f3045bbc4e6e1@mail.gmail.com> I'm pretty sure that you are not allowed to distribute these matrices: http://www.gene-regulation.com/pub/databases/transfac/doc/misc.html [well...but if you don't care and biobase doesn't complain... actually anyone can scrape the matrices from the website with wget.] max On 26/07/06, leverdeterre wrote: > > i'm happy for helping you > i'have done a page whitch can interrest you > http://morissardjerome.free.fr/Data/index.html > > there is more information about the 397 matrix file ( in the 3 first line) > and i'm done all the logo file . > > ++ > -- > View this message in context: http://www.nabble.com/Re%3A-Accessing-TRANSFAC-matrices-tf1969467.html#a5510746 > Sent from the Perl - Bioperl-L forum at Nabble.com. > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > From bix at sendu.me.uk Thu Jul 27 09:55:01 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 27 Jul 2006 10:55:01 +0100 Subject: [Bioperl-l] TRANSFAC matrices, open acces In-Reply-To: <76f031ae0607270156lde64ff9m590257d98899fe64@mail.gmail.com> References: <76f031ae0607270156lde64ff9m590257d98899fe64@mail.gmail.com> Message-ID: <44C88D75.7040102@sendu.me.uk> Maximilian Haeussler wrote: > Actually, the fact that the transfac matrices are belonging to a > company is quite inconvenient for biologists and bioinformatics > analyses working in this field. The public version is adequate though. It would certainly be useful to have Bioperl access to transfac and other regulation databases. 
I'll probably write some suitable modules if no one beats me to it. From sdavis2 at mail.nih.gov Thu Jul 27 11:43:09 2006 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Thu, 27 Jul 2006 07:43:09 -0400 Subject: [Bioperl-l] TRANSFAC matrices, open acces In-Reply-To: <44C88D75.7040102@sendu.me.uk> Message-ID: On 7/27/06 5:55 AM, "Sendu Bala" wrote: > Maximilian Haeussler wrote: >> Actually, the fact that the transfac matrices are belonging to a >> company is quite inconvenient for biologists and bioinformatics >> analyses working in this field. > > The public version is adequate though. It would certainly be useful to > have Bioperl access to transfac and other regulation databases. I'll > probably write some suitable modules if no one beats me to it. I haven't used it in a while, but the TFBS family of modules are, if I recall correctly, bioperl-compatible, though not part of bioperl. In any case, for those who aren't aware, it might be worth a quick look: http://forkhead.cgb.ki.se/TFBS/ Sean From bix at sendu.me.uk Thu Jul 27 12:01:03 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 27 Jul 2006 13:01:03 +0100 Subject: [Bioperl-l] TRANSFAC matrices, open acces In-Reply-To: References: Message-ID: <44C8AAFF.6060100@sendu.me.uk> Sean Davis wrote: > > On 7/27/06 5:55 AM, "Sendu Bala" wrote: > >> Maximilian Haeussler wrote: >>> Actually, the fact that the transfac matrices are belonging to a >>> company is quite inconvenient for biologists and bioinformatics >>> analyses working in this field. > >> The public version is adequate though. It would certainly be useful to >> have Bioperl access to transfac and other regulation databases. I'll >> probably write some suitable modules if no one beats me to it. > > I haven't used it in a while, but the TFBS family of modules are, if I > recall correctly, bioperl-compatible, though not part of bioperl. In any > case, for those who aren't aware, it might be worth a quick look: Yes. 
It only has online access to Transfac though, and the inheritance and returned objects are TFBS specific so you miss out on whatever goodness there may be in the rest of bioperl. Still, recommended to use if you want programmatic access to Transfac matrices right now. From bernd.web at gmail.com Thu Jul 27 10:14:13 2006 From: bernd.web at gmail.com (Bernd Web) Date: Thu, 27 Jul 2006 12:14:13 +0200 Subject: [Bioperl-l] Structure::IO In-Reply-To: References: <716af09c0607250347w24e0d8dbj674758049bc0a4e2@mail.gmail.com> Message-ID: <716af09c0607270314u4e2b1eb8y6c1b87f5b3abd8e1@mail.gmail.com> Hi Thanks for your notes. The text I pasted comes from http://doc.bioperl.org/releases/bioperl-1.5.1/ but indeed Entry.pm (v1.25 2006/07/04) shows a different POD. I am trying to get annotation out of PDB. ID is not a problem, but I would like to have the HEADER and possibly comment fields to a (FastA) description line, but how? Bio::Structure::Entry v.1.25 does not list the annotation method in the POD anymore (due to a missing empty line before =head). $struc->annotation still exists; I can get the keys but not the values with $struc->annotation($struc->seqres) (Can't locate object method "get_Annotations" via package "Bio::PrimarySeq"). (Example script attached). The POD states: annotation: $obj->annotation($seq_obj). So I thought of a PrimarySeq object to pass to annotation. The PrimarySeq object ($struc->seqres) does not contain a description: $struc->seqres->desc is uninitialized. Is it possible to get annotation from header/comments fields with Bio::Structure? Best regards, Bernd On 7/26/06, Brian Osborne wrote: > Bernd, > > I'm not following your question. The POD in the latest Bio::Structure::Entry > shows: > > =head2 chain() > > Title : chain > Usage : @chains = $structure->chain($chain); > Function: Connects a Chain or a list of Chain objects to a > Bio::Structure::Entry. 
> Returns : List of Bio::Structure::Chain objects > Args : A Chain or a reference to an array of Chain objects > > =cut > > Which is not what you've copied and pasted. What version of Bioperl do you > use? > > Brian O. > > > > On 7/25/06 6:47 AM, "Bernd Web" wrote: > > > Hi, > > > > Does someone have experience with Bio::Structure::IO? > > The example III.9.1 from the bptutorial.pl covers most, but what is e.g. the > > chain() method of Bio::Structure::Entry doing? The POD states: > > > > Title : chain > > Usage : @chains = $structure->chain($chain); > > Function: Connects a (or a list of) Chain objects to a Bio::Structure::Entry. > > Returns : list of Bio::Structure::Residue objects > > Args : One Residue or a reference to an array of Residue objects > > > > But in e.g > > my $stream = Bio::Structure::IO->new(-file => $filename, > > -format => 'pdb'); > > while ( my $struc = $stream->next_structure() ) { > > for my $chain ($struc->get_chains) { > > my $chainid = $chain->id; > > my @chains = $struc->chain($chain); > > } > > } > > > > I get Bio::Structure::Chain=HASH(0x9f1ab50). > > > > What is the function of the chain method and how to use it? 
> >
> > Best regards,
> > bernd
>
-------------- next part --------------
#!/usr/bin/perl -w
use warnings;
use strict;
use Bio::Structure::IO;

my $filename = $ARGV[0];
my $stream = Bio::Structure::IO->new( -file => $filename, -format => 'pdb');
while ( my $struc = $stream->next_structure() ) {
    print "SEQRES DESC: ", $struc->seqres->desc, "\n";
    print join(" ", keys %{$struc->annotation($struc->seqres)}), "\n";
    print join(" ", keys %{$struc->annotation()}), "\n";
    print join(" ", values %{$struc->annotation()}), "\n"; #(partly) works
    print join(" ", values %{$struc->annotation($struc->seqres)}), "\n"; #does not work
}

From bix at sendu.me.uk Thu Jul 27 13:31:54 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Thu, 27 Jul 2006 14:31:54 +0100
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <86D32350-0412-40AC-8C34-51DCD3064711@gmx.net>
References: <000001c6b017$873176a0$15327e82@pyrimidine> <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net> <44C73D21.3010301@sendu.me.uk> <44C7A2C7.2070100@sendu.me.uk> <86D32350-0412-40AC-8C34-51DCD3064711@gmx.net>
Message-ID: <44C8C04A.8070504@sendu.me.uk>

Hilmar Lapp wrote:
>
> So overall I think you should just go ahead.

One last suggestion for discussion:

It may be appropriate to rename Bio::Taxonomy::Node to clarify that Node has no particular reliance on or association with Bio::Taxonomy or the other modules in Bio/Taxonomy/.

How about calling it Bio::Taxon?

It is more obvious what to expect from something called 'Bio::Taxon' when you know that it is the new 'Bio::Species': like Bio::Species but for any taxon. It also makes the class 'top-level', which I think most people are happier using; it seems like things in sub-directories are more for advanced users.
From hlapp at gmx.net Thu Jul 27 13:44:25 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Thu, 27 Jul 2006 09:44:25 -0400
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44C8C04A.8070504@sendu.me.uk>
References: <000001c6b017$873176a0$15327e82@pyrimidine> <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net> <44C73D21.3010301@sendu.me.uk> <44C7A2C7.2070100@sendu.me.uk> <86D32350-0412-40AC-8C34-51DCD3064711@gmx.net> <44C8C04A.8070504@sendu.me.uk>
Message-ID: <077A86AB-174D-426E-B65F-0954A72A6019@gmx.net>

I don't think the top-level or sub-directory matters at all and I don't want anybody to get used to the notion that that may imply anything (except possibly better thought-out structure for the sub-directory level). For instance RichSeq is what all rich annotation sequence format parsers return, yet it is in a sub-directory.

I don't have any real objection to Bio::Taxon though if that's what you'd like to name it - although, what will happen to the Bio::Taxonomy hierarchy then? Phased out?

-hilmar

On Jul 27, 2006, at 9:31 AM, Sendu Bala wrote:

> Hilmar Lapp wrote:
>>
>> So overall I think you should just go ahead.
>
> One last suggestion for discussion:
>
> It may be appropriate to rename Bio::Taxonomy::Node to clarify that
> Node has no particular reliance on or association with Bio::Taxonomy
> or the other modules in Bio/Taxonomy/.
>
> How about calling it Bio::Taxon?
>
> It is more obvious what to expect from something called 'Bio::Taxon'
> when you know that it is the new 'Bio::Species': like Bio::Species but
> for any taxon. It also makes the class 'top-level' which I think most
> people are happier using; seems like things in sub-directories are more
> for advanced users.
> _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From cjfields at uiuc.edu Thu Jul 27 13:48:32 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 27 Jul 2006 08:48:32 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C8662A.3080904@sendu.me.uk> Message-ID: <002a01c6b183$59779880$15327e82@pyrimidine> Sounds good to me; agree with Hilmar's suggestion of 'in_memory' as well, but it's your choice. Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Sendu Bala > Sent: Thursday, July 27, 2006 2:07 AM > To: bioperl-l at bioperl.org > Subject: Re: [Bioperl-l] Bio::*Taxonomy* changes > > Hilmar Lapp wrote: > > The only reservation I have is that you'd have methods on Node that > > don't really operate on the node instance but rather operate on the > > taxonomy (database) behind the scenes. That's what I would have used > > Bio::Taxonomy for, not so much as a container than as a class with > > (conceptually) 'static' methods corresponding to those that are now > > in Node, like get_Lineage_Nodes(). > > Yes, I had the same reservation. But it somehow seemed reasonable for me > to ask a node for its lineage, though I draw the line at having a method > like get_node('rank_name'). That's the only thing Bio::Taxonomy would > have been good for, so it's a trade off between some nice methods and > the problems inherent in a node-container class. > > Though, perhaps we almost have the best of both worlds, since the > database is effectively a container without the problems: > $node->db_handle->get_Taxonomy_Node(-rank 'rank_name', > -lineage_of => $node); ? 
>
> >
> > So overall I think you should just go ahead.
>
> Great, will do.
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

From osborne1 at optonline.net  Thu Jul 27 13:44:33 2006
From: osborne1 at optonline.net (Brian Osborne)
Date: Thu, 27 Jul 2006 09:44:33 -0400
Subject: [Bioperl-l] Structure::IO
In-Reply-To: <716af09c0607270314u4e2b1eb8y6c1b87f5b3abd8e1@mail.gmail.com>
Message-ID:

Bernd,

I'll need to take a closer look at the POD but from your description
it seems it's wrong, or certainly incomplete. To get the HEADER line
you'll do something like:

  my $stream = Bio::Structure::IO->new(-file   => $filename,
                                       -format => 'pdb');
  my $struc   = $stream->next_structure();
  my $anncoll = $struc->annotation;
  my @headers = $anncoll->get_Annotations('header');

This implies that all these top-level annotations are associated with
the entry, not with the chains. I don't use Bio::Structure so don't
assume this is true for all annotations.

There are 2 ways to explore this further. One is to look at
t/StructIO.t or other tests; useful examples are frequently found in
the tests. The other is to run your script in the debugger:

  >perl -d pdb.pl 1CAM.pdb

By examining the variables your script creates using the "x" command
you get to see exactly where strings are stored and what the names of
the keys are; this is how I found the HEADER line. Type "h" for the
debugger's Help.

Brian O.


On 7/27/06 6:14 AM, "Bernd Web" wrote:

> Hi
>
> Thanks for your notes. The text I pasted comes from
> http://doc.bioperl.org/releases/bioperl-1.5.1/ but indeed Entry.pm
> (v1.25 2006/07/04) shows a different POD.
>
> I am trying to get annotation out of PDB. ID is not a problem, but I
> would like to have the HEADER and possibly comment fields to a (FastA)
> description line, but how?
> > Bio::Structure::Entry v.1.25 does not list the annotation method in > the POD anymore (due to a missing empty line before =head). > $struc->annotation still exists; I can get the keys but not the values > with $struc->annotation($struc->seqres) (Can't locate object method > "get_Annotations" via package "Bio::PrimarySeq"). > (Example script attached). > > The POD states: annotation: $obj->annotation($seq_obj). So I thought > of a PrimarySeq object to pass to annotation. > > The PrimarySeq object ($struc->seqres) does not contain a description: > $struc->seqres->desc is uninitialized. > > Is it possible to get annotation from header/comments fields with > Bio::Structure? > > Best regards, > Bernd > > > On 7/26/06, Brian Osborne wrote: >> Bernd, >> >> I'm not following your question. The POD in the latest Bio::Structure::Entry >> shows: >> >> =head2 chain() >> >> Title : chain >> Usage : @chains = $structure->chain($chain); >> Function: Connects a Chain or a list of Chain objects to a >> Bio::Structure::Entry. >> Returns : List of Bio::Structure::Chain objects >> Args : A Chain or a reference to an array of Chain objects >> >> =cut >> >> Which is not what you've copied and pasted. What version of Bioperl do you >> use? >> >> Brian O. >> >> >> >> On 7/25/06 6:47 AM, "Bernd Web" wrote: >> >>> Hi, >>> >>> Does someone have experience with Bio::Structure::IO? >>> The example III.9.1 from the bptutorial.pl covers most, but what is e.g. the >>> chain() method of Bio::Structure::Entry doing? The POD states: >>> >>> Title : chain >>> Usage : @chains = $structure->chain($chain); >>> Function: Connects a (or a list of) Chain objects to a >>> Bio::Structure::Entry. 
>>> Returns : list of Bio::Structure::Residue objects
>>> Args    : One Residue or a reference to an array of Residue objects
>>>
>>> But in e.g
>>> my $stream = Bio::Structure::IO->new(-file => $filename,
>>>                                      -format => 'pdb');
>>> while ( my $struc = $stream->next_structure() ) {
>>>    for my $chain ($struc->get_chains) {
>>>       my $chainid = $chain->id;
>>>       my @chains = $struc->chain($chain);
>>>    }
>>> }
>>>
>>> I get Bio::Structure::Chain=HASH(0x9f1ab50).
>>>
>>> What is the function of the chain method and how to use it?
>>>
>>> Best regards,
>>> bernd
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>>
>>

From aaron.j.mackey at gsk.com  Thu Jul 27 12:54:05 2006
From: aaron.j.mackey at gsk.com (aaron.j.mackey at gsk.com)
Date: Thu, 27 Jul 2006 08:54:05 -0400
Subject: [Bioperl-l] Installing staden io_lib on windows?
In-Reply-To: <1153930767.2632.5.camel@localhost.localdomain>
Message-ID:

Hi Scott,

> In file included from Read.h:43,
>                  from Read.c:40:
> ../utils/os.h:346:2: #error Must define SP_BIG_ENDIAN or
> SP_LITTLE_ENDIAN in Makefile

os.h has a bunch of #ifdef statements that check for platforms, and
there isn't one for cygwin (but there is for MinGW)

Try running configure with "--CFLAGS=-DSP_LITTLE_ENDIAN" or somesuch

Also take a look at the MinGW section of os.h to see if there are
others you will likely need (e.g. NOPIPE, NOLOCKF, etc)

Alternatively, you may want to just edit the original os.h to
duplicate the MinGW section with the appropriate compiler constant for
CYGWIN (__CYGWIN__ I'm guessing, but don't really know for sure).
Good luck, -Aaron From bix at sendu.me.uk Thu Jul 27 14:06:23 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 27 Jul 2006 15:06:23 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <077A86AB-174D-426E-B65F-0954A72A6019@gmx.net> References: <000001c6b017$873176a0$15327e82@pyrimidine> <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net> <44C73D21.3010301@sendu.me.uk> <44C7A2C7.2070100@sendu.me.uk> <86D32350-0412-40AC-8C34-51DCD3064711@gmx.net> <44C8C04A.8070504@sendu.me.uk> <077A86AB-174D-426E-B65F-0954A72A6019@gmx.net> Message-ID: <44C8C85F.2010104@sendu.me.uk> Hilmar Lapp wrote: > I don't think the top-level or sub-directory matters at all and I don't > want anybody to get used to the notion that that may imply anything > (except possibly better thought-out structure for the sub-directory > level). For instance RichSeq is what all rich annotation sequence format > parsers return, yet it is in a sub-directory. Well, I'm not aware that I've ever used a RichSeq ;). But your point is taken. > I don't any real objection to Bio::Taxon though if that's what you'd > like to name it - although, what will happen to the Bio::Taxonomy > hierarchy then? Phased out? At the moment it seems to me that the Bio::Taxonomy modules (excluding Node) aren't really usable. Jason wrote a comment in t/TaxonTree.t which tests Taxon and Tree: ## I am pretty sure this module is going the way of the dodo bird so ## I am not sure how much work to put into fixing the tests/module FactoryI is strange (it isn't intended to work like any other Bioperl factory) and there are no implementers of it, while Taxonomy.pm itself would be redundant after my Node changes and has implementation issues, though it may make more sense to some people. My vote is phase out. What is the actual process involved in renaming a module in Bioperl? 
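
One way a rename with a phase-out period could look is sketched below. This is only an illustration under stated assumptions: the old class is kept as a thin subclass of the new one and warns when it is constructed. The package names My::Taxon and My::Taxonomy::Node are hypothetical stand-ins so the sketch runs without BioPerl installed; they are not the real Bio::Taxon or Bio::Taxonomy::Node, and the core Carp module stands in for whatever warning mechanism the real modules would use.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the new class (not the real Bio::Taxon).
package My::Taxon;

sub new {
    my ($class, %args) = @_;
    return bless { name => $args{-name}, rank => $args{-rank} }, $class;
}

sub name { return $_[0]->{name} }
sub rank { return $_[0]->{rank} }

# Hypothetical stand-in for the old class being phased out.
# It keeps no code of its own: it warns, then defers to the new class.
package My::Taxonomy::Node;
use Carp qw(carp);
our @ISA = ('My::Taxon');

sub new {
    my ($class, @args) = @_;
    carp "$class is deprecated; use My::Taxon instead";
    return $class->SUPER::new(@args);
}

package main;

# Old scripts keep working, but now see a warning on STDERR.
my $node = My::Taxonomy::Node->new(-name => 'Homo sapiens',
                                   -rank => 'species');
print $node->name, ' (', $node->rank, ")\n";
```

Because the old class isa the new one, existing scripts get an object with the same methods, and the warning gives them notice before the old module is eventually removed.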
From hlapp at gmx.net Thu Jul 27 14:29:09 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 27 Jul 2006 10:29:09 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C8C85F.2010104@sendu.me.uk> References: <000001c6b017$873176a0$15327e82@pyrimidine> <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net> <44C73D21.3010301@sendu.me.uk> <44C7A2C7.2070100@sendu.me.uk> <86D32350-0412-40AC-8C34-51DCD3064711@gmx.net> <44C8C04A.8070504@sendu.me.uk> <077A86AB-174D-426E-B65F-0954A72A6019@gmx.net> <44C8C85F.2010104@sendu.me.uk> Message-ID: How do you mean 'process'? You create a new module, and then you deprecate the ones you're phasing out. If possible you rewrite the implementation to use the new module. Not sure this answers your question? -hilmar On Jul 27, 2006, at 10:06 AM, Sendu Bala wrote: > Hilmar Lapp wrote: >> I don't think the top-level or sub-directory matters at all and I >> don't >> want anybody to get used to the notion that that may imply anything >> (except possibly better thought-out structure for the sub-directory >> level). For instance RichSeq is what all rich annotation sequence >> format >> parsers return, yet it is in a sub-directory. > > Well, I'm not aware that I've ever used a RichSeq ;). But your > point is > taken. > > >> I don't any real objection to Bio::Taxon though if that's what you'd >> like to name it - although, what will happen to the Bio::Taxonomy >> hierarchy then? Phased out? > > At the moment it seems to me that the Bio::Taxonomy modules (excluding > Node) aren't really usable. 
Jason wrote a comment in t/TaxonTree.t > which > tests Taxon and Tree: > > ## I am pretty sure this module is going the way of the dodo bird so > ## I am not sure how much work to put into fixing the tests/module > > FactoryI is strange (it isn't intended to work like any other Bioperl > factory) and there are no implementers of it, while Taxonomy.pm itself > would be redundant after my Node changes and has implementation > issues, > though it may make more sense to some people. > > My vote is phase out. > > > What is the actual process involved in renaming a module in Bioperl? > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From cjfields at uiuc.edu Thu Jul 27 14:29:39 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 27 Jul 2006 09:29:39 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <077A86AB-174D-426E-B65F-0954A72A6019@gmx.net> Message-ID: <003101c6b189$17f5d2e0$15327e82@pyrimidine> I'll respond to both here: > Sendu Bala wrote: > > One last suggestion for discussion: > > It may be appropriate is to rename Bio::Taxonomy::Node to clarify that > Node has no particular reliance on or association with Bio::Taxonomy or > the other modules in Bio/Taxonomy/. > > How about calling it Bio::Taxon? > > It is more obvious what to expect from something called 'Bio::Taxon' > when you know that it is the new 'Bio::Species': like Bio::Species but > for any taxon. It also makes the class 'top-level' which I think most > people are happier using; seems like things in sub-directories are more > for advanced users. Hilmar explains the namespace issue with Bioperl more concisely below. 
You should still be able to use a Node in a Taxonomy, but then again you should also be able to use a Taxon in a Taxonomy as well (by definition, a Taxon is part of a Taxonomy as it is a taxonomic unit). The whole "looking at this from a biologist's perspective" thing again... http://en.wikipedia.org/wiki/Taxon BTW, what exactly is Bio::Taxonomy::Taxon used for? Looks like it is used more for building taxonomic trees that anything, so shouldn't it be moved to Bio::Tree:Taxon (that name isn't used)? Then you could use Bio::Taxonomy::Taxon for your purposes. See, the only concern I have with using the name Bio::Taxon is people confusing it with Bio::Taxonomy itself or with Bio::Taxonomy::Taxon. Though I agree that the name makes sense for what you want. > Hilmar Lapp wrote: > > I don't think the top-level or sub-directory matters at all and I > don't want anybody to get used to the notion that that may imply > anything (except possibly better thought-out structure for the sub- > directory level). For instance RichSeq is what all rich annotation > sequence format parsers return, yet it is in a sub-directory. > > I don't any real objection to Bio::Taxon though if that's what you'd > like to name it - although, what will happen to the Bio::Taxonomy > hierarchy then? Phased out? > > -hilmar I'm not sure how many people out there use Bio::Taxonomy. I think they use the tree-building modules in Bio::Tree more than anything. And there haven't been any panicked users protesting at the gates yet about the many posts for Bio::Taxonomy changes (well, except me, and 'I got better'). 
Chris

From cjfields at uiuc.edu  Thu Jul 27 14:54:06 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 27 Jul 2006 09:54:06 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <44C8C85F.2010104@sendu.me.uk>
Message-ID: <003201c6b18c$829330e0$15327e82@pyrimidine>

> > I don't any real objection to Bio::Taxon though if that's what you'd
> > like to name it - although, what will happen to the Bio::Taxonomy
> > hierarchy then? Phased out?
>
> At the moment it seems to me that the Bio::Taxonomy modules (excluding
> Node) aren't really usable. Jason wrote a comment in t/TaxonTree.t which
> tests Taxon and Tree:
>
> ## I am pretty sure this module is going the way of the dodo bird so
> ## I am not sure how much work to put into fixing the tests/module
>
> FactoryI is strange (it isn't intended to work like any other Bioperl
> factory) and there are no implementers of it, while Taxonomy.pm itself
> would be redundant after my Node changes and has implementation issues,
> though it may make more sense to some people.
>
> My vote is phase out.
>
>
> What is the actual process involved in renaming a module in Bioperl?

This is how many times the phrase "Bio::Taxonomy" is used in Bioperl in
directory Bio (which should catch any namespace usage like Node, etc.):

Instances: 2    BP Module : Bio::DB::Taxonomy
Instances: 4    BP Module : Bio::DB::Taxonomy::entrez
Instances: 7    BP Module : Bio::DB::Taxonomy::flatfile
Instances: 1    BP Module : Bio::Expression::Platform
Instances: 1    BP Module : Bio::SeqIO::genbank
Instances: 22   BP Module : Bio::Taxonomy
Instances: 8    BP Module : Bio::Taxonomy::FactoryI
Instances: 17   BP Module : Bio::Taxonomy::Node
Instances: 20   BP Module : Bio::Taxonomy::Taxon
Instances: 39   BP Module : Bio::Taxonomy::Tree

Hmm, not much. Almost all hits are within Bio::DB::Taxonomy or
Bio::Taxonomy. The SeqIO::genbank was my change BTW; just haven't tossed
the code yet.
Therefore, the only one left that would be affected (outside of Bio::Taxonomy and Bio::DB::Taxonomy) is Allen Day's Bio::Expression::Platform class, which uses Bio::DB::Taxonomy::entrez to grab Nodes; that would just be changed over to whatever class you plan on using. And that class hasn't been documented at all outside the methods. Furthermore, judging by the mail list archives the Bio::Taxonomy modules had very little usage outside of Node. Jason mentioned on an old post that he could never get Bio::Taxonomy::Taxon/Tree to work and that Dan Kortschak had moved (Dan's last post was in 2003). Hence the test file comments. And you make a good point with Bio::Taxonomy::FactoryI. I agree, if the modules haven't served a useful purpose they should be phased out. Chris From cjfields at uiuc.edu Thu Jul 27 15:15:25 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 27 Jul 2006 10:15:25 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: Message-ID: <003301c6b18f$7d114000$15327e82@pyrimidine> Wow, we're doing a little bioperl spring cleaning here! I agree with Hilmar: create a new module (Bio::Taxon), which claims the namespace, and deprecate the old ones. How 'broken', exactly, is Bio::Taxonomy? The idea behind it seems just (container for Nodes) but maybe it should just be reconfigured, and all the classes in directory Bio/Taxonomy deprecated. Or should we start from scratch completely? Don't know if it has been attempted but it would be nice to have a way for building taxonomic trees from Node/Taxon information using a Taxonomy-like container object. I like the way NCBI does something along these lines with BLAST output now. BTW, thanks guys for a rousing discussion! 
Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Hilmar Lapp > Sent: Thursday, July 27, 2006 9:29 AM > To: Sendu Bala > Cc: bioperl-l at lists.open-bio.org > Subject: Re: [Bioperl-l] Bio::*Taxonomy* changes > > How do you mean 'process'? You create a new module, and then you > deprecate the ones you're phasing out. If possible you rewrite the > implementation to use the new module. > > Not sure this answers your question? > > -hilmar > > On Jul 27, 2006, at 10:06 AM, Sendu Bala wrote: > > > Hilmar Lapp wrote: > >> I don't think the top-level or sub-directory matters at all and I > >> don't > >> want anybody to get used to the notion that that may imply anything > >> (except possibly better thought-out structure for the sub-directory > >> level). For instance RichSeq is what all rich annotation sequence > >> format > >> parsers return, yet it is in a sub-directory. > > > > Well, I'm not aware that I've ever used a RichSeq ;). But your > > point is > > taken. > > > > > >> I don't any real objection to Bio::Taxon though if that's what you'd > >> like to name it - although, what will happen to the Bio::Taxonomy > >> hierarchy then? Phased out? > > > > At the moment it seems to me that the Bio::Taxonomy modules (excluding > > Node) aren't really usable. Jason wrote a comment in t/TaxonTree.t > > which > > tests Taxon and Tree: > > > > ## I am pretty sure this module is going the way of the dodo bird so > > ## I am not sure how much work to put into fixing the tests/module > > > > FactoryI is strange (it isn't intended to work like any other Bioperl > > factory) and there are no implementers of it, while Taxonomy.pm itself > > would be redundant after my Node changes and has implementation > > issues, > > though it may make more sense to some people. > > > > My vote is phase out. > > > > > > What is the actual process involved in renaming a module in Bioperl? 
> > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From hlapp at gmx.net Thu Jul 27 15:29:04 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 27 Jul 2006 11:29:04 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <003101c6b189$17f5d2e0$15327e82@pyrimidine> References: <003101c6b189$17f5d2e0$15327e82@pyrimidine> Message-ID: On Jul 27, 2006, at 10:29 AM, Chris Fields wrote: > See, the only concern I have with using the name Bio::Taxon is people > confusing it with Bio::Taxonomy itself or with > Bio::Taxonomy::Taxon. Though > I agree that the name makes sense for what you want. I don't think Bio::Taxonomy is used a lot in earnest if at all, so it you even test the waters by deprecating them right away by putting warning statements there and see whether anybody complains about the cluttered terminal screens. If this goes into snapshot releases and release candidates leading up to 1.6 then they may be phased out right away. Unless anybody on the list has strong objections? Anybody using Bio::Taxonomy? 
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From skirov at utk.edu Thu Jul 27 13:57:19 2006 From: skirov at utk.edu (skirov) Date: Thu, 27 Jul 2006 09:57:19 -0400 Subject: [Bioperl-l] TRANSFAC matrices, open acces Message-ID: <44E2E794@webmail.utk.edu> Just a quick note- Bio::Matrix::PSM::IO can parse matrix.dat if you can get it- and as far as I can tell this is not easy- you have to contact the company to get access and it is not clear what their conditions are. This is the reason I have decided not to maintain the transfac parser. Stefan >===== Original Message From Sendu Bala ===== >Sean Davis wrote: >> >> On 7/27/06 5:55 AM, "Sendu Bala" wrote: >> >>> Maximilian Haeussler wrote: >>> Actually, the fact that the transfac matrices are belonging to a >>> company is quite inconvenient for biologists and bioinformatics >>> analyses working in this field. > > >>> The public version is adequate though. It would certainly be useful to >>> have Bioperl access to transfac and other regulation databases. I'll >>> probably write some suitable modules if no one beats me to it. >> >> I haven't used it in a while, but the TFBS family of modules are, if I >> recall correctly, bioperl-compatible, though not part of bioperl. In any >> case, for those who aren't aware, it might be worth a quick look: > >Yes. It only has online access to Transfac though, and the inheritance >and returned objects are TFBS specific so you miss out on whatever >goodness there may be in the rest of bioperl. > >Still, recommended to use if you want programmatic access to Transfac >matrices right now. 
>_______________________________________________ >Bioperl-l mailing list >Bioperl-l at lists.open-bio.org >http://lists.open-bio.org/mailman/listinfo/bioperl-l From bix at sendu.me.uk Thu Jul 27 16:30:38 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 27 Jul 2006 17:30:38 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: References: <000001c6b017$873176a0$15327e82@pyrimidine> <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net> <44C73D21.3010301@sendu.me.uk> <44C7A2C7.2070100@sendu.me.uk> <86D32350-0412-40AC-8C34-51DCD3064711@gmx.net> <44C8C04A.8070504@sendu.me.uk> <077A86AB-174D-426E-B65F-0954A72A6019@gmx.net> <44C8C85F.2010104@sendu.me.uk> Message-ID: <44C8EA2E.8030000@sendu.me.uk> Hilmar Lapp wrote: > How do you mean 'process'? You create a new module, and then you > deprecate the ones you're phasing out. If possible you rewrite the > implementation to use the new module. > > Not sure this answers your question? I guess. I was thinking of just making Bio::Taxonomy::Node isa Bio::Taxon and then simply removing all the code from Node, leaving just some perldoc that said it had been renamed? Or should there be some methods that issue a warning and then call SUPER? From hlapp at gmx.net Thu Jul 27 16:38:30 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 27 Jul 2006 12:38:30 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C8EA2E.8030000@sendu.me.uk> References: <000001c6b017$873176a0$15327e82@pyrimidine> <9B1D3E4C-41D4-4F3D-A212-A57A1CC6E21C@gmx.net> <44C73D21.3010301@sendu.me.uk> <44C7A2C7.2070100@sendu.me.uk> <86D32350-0412-40AC-8C34-51DCD3064711@gmx.net> <44C8C04A.8070504@sendu.me.uk> <077A86AB-174D-426E-B65F-0954A72A6019@gmx.net> <44C8C85F.2010104@sendu.me.uk> <44C8EA2E.8030000@sendu.me.uk> Message-ID: <881BB312-DF1B-43F6-A38D-8B543738244F@gmx.net> That's what I said could be possible here on much shorter notice that we'd do usually due to the low usage. 
Eventually deprecated modules should also be physically removed, so you want to prepare for that. (removing a module breaks scripts that used it; issuing a warning alerts to this being forthcoming.) -hilmar On Jul 27, 2006, at 12:30 PM, Sendu Bala wrote: > Hilmar Lapp wrote: >> How do you mean 'process'? You create a new module, and then you >> deprecate the ones you're phasing out. If possible you rewrite the >> implementation to use the new module. >> >> Not sure this answers your question? > > I guess. I was thinking of just making Bio::Taxonomy::Node isa > Bio::Taxon and then simply removing all the code from Node, leaving > just > some perldoc that said it had been renamed? > > Or should there be some methods that issue a warning and then call > SUPER? > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From sanges at biogem.it Thu Jul 27 16:37:05 2006 From: sanges at biogem.it (Remo Sanges) Date: Thu, 27 Jul 2006 18:37:05 +0200 Subject: [Bioperl-l] TRANSFAC matrices, open acces In-Reply-To: <44E2E794@webmail.utk.edu> References: <44E2E794@webmail.utk.edu> Message-ID: <44C8EBB1.5070709@biogem.it> Here is also my 2c on TFBS: skirov wrote: >Just a quick note- Bio::Matrix::PSM::IO can parse matrix.dat if you can get >it- and as far as I can tell this is not easy- you have to contact the company >to get access and it is not clear what their conditions are. This is the >reason I have decided not to maintain the transfac parser. 
>Stefan > > >>===== Original Message From Sendu Bala ===== >>Sean Davis wrote: >> >> >>>On 7/27/06 5:55 AM, "Sendu Bala" wrote: >>> >>> >>> >>>>Maximilian Haeussler wrote: >>>>Actually, the fact that the transfac matrices are belonging to a >>>>company is quite inconvenient for biologists and bioinformatics >>>>analyses working in this field. >>>> >>>> >>>>The public version is adequate though. It would certainly be useful to >>>>have Bioperl access to transfac and other regulation databases. I'll >>>>probably write some suitable modules if no one beats me to it. >>>> >>>> >>>I haven't used it in a while, but the TFBS family of modules are, if I >>>recall correctly, bioperl-compatible, though not part of bioperl. In any >>>case, for those who aren't aware, it might be worth a quick look: >>> >>> >>Yes. It only has online access to Transfac though >> TFBS::DB::LocalTRANSFAC - can parse local transfac matrices (matrix.dat) >>, and the inheritance >>and returned objects are TFBS specific so you miss out on whatever >>goodness there may be in the rest of bioperl. >> >> >> In TFBS there are modules which inherithed from Bio::SeqFeature::Generic and Bio::Root::Root. See for example TFBS::Site. So probably it is not so bad.... Here is the link cutted from the Sean's e-mail: http://forkhead.cgb.ki.se/TFBS/ HTH Remo From osborne1 at optonline.net Thu Jul 27 16:49:26 2006 From: osborne1 at optonline.net (Brian Osborne) Date: Thu, 27 Jul 2006 12:49:26 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <881BB312-DF1B-43F6-A38D-8B543738244F@gmx.net> Message-ID: Sendu, And add the module or modules names to the DEPRECATED file. Brian O. On 7/27/06 12:38 PM, "Hilmar Lapp" wrote: > Eventually deprecated modules should also be physically removed From MEC at stowers-institute.org Thu Jul 27 17:28:03 2006 From: MEC at stowers-institute.org (Cook, Malcolm) Date: Thu, 27 Jul 2006 12:28:03 -0500 Subject: [Bioperl-l] TRANSFAC matrices, open acces Message-ID: re: >Yes. 
It only has online access to Transfac though,

not quite true. It does support access to local transfac data files if
you have them.

--Malcolm

From cjfields at uiuc.edu  Thu Jul 27 17:45:28 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 27 Jul 2006 12:45:28 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes
In-Reply-To: <881BB312-DF1B-43F6-A38D-8B543738244F@gmx.net>
Message-ID: <000301c6b1a4$73ef3fd0$15327e82@pyrimidine>

Makes sense to me. From my previous post the only bioperl class that
used it was Bio::Expression::Platform, and that only for grabbing Node
objects from Bio::DB::Taxonomy::entrez (so, change it to use whatever
object Bio::DB::Taxonomy returns). I couldn't find anything else in the
core outside of the Bio::DB::Taxonomy and Bio::Taxonomy classes and
tests that use them. There aren't even any scripts or examples.

If you implement Bio::Root::RootI (and pretty much everything does),
you could use warn() or deprecated() for these easily:

...
 Title   : warn
 Usage   : $object->warn("Warning message");
 Function: Places a warning. What happens now is down to the
           verbosity of the object (value of $obj->verbose)
             verbosity 0 or not set => small warning
             verbosity -1 => no warning
             verbosity 1 => warning with stack trace
             verbosity 2 => converts warnings into throw
...
 Title   : deprecated
 Usage   : $obj->deprecated("Method X is deprecated");
 Function: Prints a message about deprecation
           unless verbose is < 0 (which means be quiet)
 Returns : none
 Args    : Message string to print to STDERR
...

Chris

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of Hilmar Lapp
> Sent: Thursday, July 27, 2006 11:39 AM
> To: Sendu Bala
> Cc: bioperl-l at bioperl.org
> Subject: Re: [Bioperl-l] Bio::*Taxonomy* changes
>
> That's what I said could be possible here on much shorter notice that
> we'd do usually due to the low usage.
> > Eventually deprecated modules should also be physically removed, so > you want to prepare for that. (removing a module breaks scripts that > used it; issuing a warning alerts to this being forthcoming.) > > -hilmar > > On Jul 27, 2006, at 12:30 PM, Sendu Bala wrote: > > > Hilmar Lapp wrote: > >> How do you mean 'process'? You create a new module, and then you > >> deprecate the ones you're phasing out. If possible you rewrite the > >> implementation to use the new module. > >> > >> Not sure this answers your question? > > > > I guess. I was thinking of just making Bio::Taxonomy::Node isa > > Bio::Taxon and then simply removing all the code from Node, leaving > > just > > some perldoc that said it had been renamed? > > > > Or should there be some methods that issue a warning and then call > > SUPER? > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From bix at sendu.me.uk Thu Jul 27 19:30:47 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 27 Jul 2006 20:30:47 +0100 Subject: [Bioperl-l] TRANSFAC matrices, open acces In-Reply-To: References: Message-ID: <44C91467.5050001@sendu.me.uk> Cook, Malcolm wrote: > re: > >> Yes. It only has online access to Transfac though, > > not quite true. It does support access to local transfac data files if > you have them. And to local Jaspar files. I wasn't clear, but I meant for the 'only' to modify 'online'. Ie. it doesn't give you access to any other online databases. 
From bix at sendu.me.uk Thu Jul 27 19:55:32 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 27 Jul 2006 20:55:32 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <003101c6b189$17f5d2e0$15327e82@pyrimidine> References: <003101c6b189$17f5d2e0$15327e82@pyrimidine> Message-ID: <44C91A34.1040406@sendu.me.uk> Chris Fields wrote: > BTW, what exactly is Bio::Taxonomy::Taxon used for? Looks like it is used > more for building taxonomic trees that anything, so shouldn't it be moved to > Bio::Tree:Taxon (that name isn't used)? Then you could use > Bio::Taxonomy::Taxon for your purposes. It actually seemed more like a possible replacement for Bio::Taxonomy::Node. Thanks to its Tree::NodeI implementation it has the big advantage over Bio::Taxonomy::Node that you access the lineage without a database. From the programmer's point of view it seemed more natural, being able to create nodes and add descendants. I decided against it because I felt the added complexity wasn't really worth it, and Bio::Taxonomy::Node had some of its own advantages. If this turns out to be the wrong choice, my Bio::Taxon can always be reimplemented to also implement Tree::NodeI in the future. > See, the only concern I have with using the name Bio::Taxon is people > confusing it with Bio::Taxonomy itself or with Bio::Taxonomy::Taxon. Though > I agree that the name makes sense for what you want. I don't think you'd confuse it directly with Bio::Taxonomy, but you could certainly waste some time thinking it was appropriate to stick Bio::Taxon objects in Bio::Taxonomy objects - theoretically it might work but ultimately you'd just be wasting your time. I'll make sure the docs in the Taxonomy modules steer people in the right direction. 
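
The point above about asking a node for its lineage without a database can be sketched as follows. This is a minimal illustration only, not Bio::Taxonomy::Taxon or Bio::Tree::NodeI: the My::TaxonNode package and its methods (add_Descendent, ancestor, lineage) are hypothetical names chosen for this example, and the lineage shown is abbreviated. It assumes each node simply keeps a link to its ancestor, which also means two nodes with the same rank name (here 'no rank') cause no trouble.

```perl
use strict;
use warnings;

# Hypothetical stand-in class; not part of BioPerl.
package My::TaxonNode;

sub new {
    my ($class, %args) = @_;
    return bless {
        name     => $args{-name},
        rank     => $args{-rank},
        ancestor => undef,
    }, $class;
}

sub name     { return $_[0]->{name} }
sub rank     { return $_[0]->{rank} }
sub ancestor { return $_[0]->{ancestor} }

# Attach a child node and return it, so calls can be chained downwards.
sub add_Descendent {
    my ($self, $child) = @_;
    $child->{ancestor} = $self;
    return $child;
}

# Walk the ancestor links up to the root; returns nodes ordered root first.
sub lineage {
    my ($self) = @_;
    my @lineage;
    for (my $node = $self; defined $node; $node = $node->ancestor) {
        unshift @lineage, $node;
    }
    return @lineage;
}

package main;

my $root = My::TaxonNode->new(-name => 'cellular organisms',
                              -rank => 'no rank');
my $species = $root
    ->add_Descendent(My::TaxonNode->new(-name => 'Eukaryota',
                                        -rank => 'superkingdom'))
    ->add_Descendent(My::TaxonNode->new(-name => 'Fungi/Metazoa group',
                                        -rank => 'no rank'))
    ->add_Descendent(My::TaxonNode->new(-name => 'Homo sapiens',
                                        -rank => 'species'));

# No database handle is involved: the lineage comes from the links alone.
my @names = map { $_->name } $species->lineage;
print join(' > ', @names), "\n";
```

The trade-off mentioned in the message still applies: every node has to be created and linked by hand, whereas a database-backed node can look its ancestors up on demand.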
From bix at sendu.me.uk Thu Jul 27 20:18:06 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 27 Jul 2006 21:18:06 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <003301c6b18f$7d114000$15327e82@pyrimidine> References: <003301c6b18f$7d114000$15327e82@pyrimidine> Message-ID: <44C91F7E.2040000@sendu.me.uk> Chris Fields wrote: > How 'broken', exactly, is Bio::Taxonomy? It's certainly usable as-is, but there are some gotchas. # It has an acknowledged weakness of not coping with multiple ranks of the same name (notably 'no rank'). # You can't have 2 nodes with the same rank (so can only build a single lineage, not a whole menagerie). # You must supply a list of all your rank names correctly ordered before you can add any nodes (or trust that the default list is satisfactory - it won't be if you have just a single 'no rank'). # You simply don't need it if you have Bio::Taxonomy::Nodes with db_handle set, or Bio::Taxonomy::Taxons. In my opinion, the burden is just too great for this ever to have been a 'fun' module to use. It was only required so that people could manually create their own Bio::Taxonomy::Nodes and form a lineage without a database. > Don't know if it has been attempted but it would be nice to have a way for > building taxonomic trees from Node/Taxon information using a Taxonomy-like > container object. I like the way NCBI does something along these lines with > BLAST output now. Not really sure what you mean. I don't think you'd require a container object to do any particular task. Can you clarify? From clarsen at vecna.com Thu Jul 27 19:59:50 2006 From: clarsen at vecna.com (Chris Larsen) Date: Thu, 27 Jul 2006 15:59:50 -0400 (EDT) Subject: [Bioperl-l] Working code Message-ID: <7263.70.106.6.26.1154030390.squirrel@mail.vecna.com> Hey gang, You said you wanted to see working code: ------------------------------------------- > ...And I'd rather see some code or API examples than > extensive elaborations. 
> > -hilmar Hilmar's right; working code does speak louder than words. -Chris -------------------------------------------- So here's some: http://www.biohealthbase.org/GSearch/ We've just released the v2 of Bioinformatic Resource Center's website "Biohealthbase". Earlier I pointed out BHB v1 to the list; then we had implemented GBrowse on top of GUS 3. There was some data processing using BioPerl packages to generate well-formatted data for the Oracle instance. But new micro-organisms are added now, so we have Francisella, Mycobacterium, Microsporidia, Giardia, and Influenza. They are under GUS 3.5. We also now have some web-capable BLASTing under there (Please no spam!) And multiple sequence alignments and dendrograms are to come, using MUSCLE instead of ClustalW. Currently, a Bioperl I/O module accepts the output from BLAST and writes up some HTML, then our web app on another server displays the URL content. But we will improve on this model in v3 for MSA et al. since the requirements are different for multiple vs single alignments. Thanks again for the open source! Chris ---------------------------- Christopher Larsen, Ph.D. Senior Scientist Vecna Technologies, Inc. 5004 Lehigh Rd College Park, MD 20740-3821 e: clarsen at vecna.com ph: (240) 737-1625 f: (301) 699-3180 From skirov at utk.edu Thu Jul 27 13:56:45 2006 From: skirov at utk.edu (skirov) Date: Thu, 27 Jul 2006 09:56:45 -0400 Subject: [Bioperl-l] TRANSFAC matrices, open acces Message-ID: <44E2E5B9@webmail.utk.edu> Just a quick note- Bio::Matrix::PSM::IO can parse matrix.dat if you can get it- and as far as I can tell this is not easy- you have to contact the company to get access and it is not clear what their conditions are. 
Stefan >===== Original Message From Sendu Bala ===== >Sean Davis wrote: >> >> On 7/27/06 5:55 AM, "Sendu Bala" wrote: >> >>> Maximilian Haeussler wrote: >>> Actually, the fact that the transfac matrices are belonging to a >>> company is quite inconvenient for biologists and bioinformatics >>> analyses working in this field. > > >>> The public version is adequate though. It would certainly be useful to >>> have Bioperl access to transfac and other regulation databases. I'll >>> probably write some suitable modules if no one beats me to it. >> >> I haven't used it in a while, but the TFBS family of modules are, if I >> recall correctly, bioperl-compatible, though not part of bioperl. In any >> case, for those who aren't aware, it might be worth a quick look: > >Yes. It only has online access to Transfac though, and the inheritance >and returned objects are TFBS specific so you miss out on whatever >goodness there may be in the rest of bioperl. > >Still, recommended to use if you want programmatic access to Transfac >matrices right now. >_______________________________________________ >Bioperl-l mailing list >Bioperl-l at lists.open-bio.org >http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Fri Jul 28 01:19:51 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 27 Jul 2006 20:19:51 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes In-Reply-To: <44C91F7E.2040000@sendu.me.uk> References: <003301c6b18f$7d114000$15327e82@pyrimidine> <44C91F7E.2040000@sendu.me.uk> Message-ID: <3DAB9065-3633-4D50-B97E-41F2BB58C6EB@uiuc.edu> ... >> Don't know if it has been attempted but it would be nice to have a >> way for >> building taxonomic trees from Node/Taxon information using a >> Taxonomy-like >> container object. I like the way NCBI does something along these >> lines with >> BLAST output now. > > Not really sure what you mean. I don't think you'd require a container > object to do any particular task. Can you clarify? 
Let's say you start with a list of sequence IDs from a BLAST report and wanted to find the taxonomic relationship between the BLAST hits. NCBI does something similar to this in their last few BLAST output revisions from the CGI interface; they have a link which contains the organisms ranked taxonomically in various ways. There is probably a Bioperl-specific way of doing this but I haven't spent the effort yet working out how. No big deal, really. I have PLENTY else to work on. Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From R.Birnie at leeds.ac.uk Fri Jul 28 09:39:34 2006 From: R.Birnie at leeds.ac.uk (Richard Birnie) Date: Fri, 28 Jul 2006 10:39:34 +0100 Subject: [Bioperl-l] whole genome annotation Message-ID: Hello all, I'm just trying to familiarise myself with BioPerl and I'm a little overwhelmed by the sheer volume of information available on the wiki. I'm hoping someone can point in the right direction through the labyrinth. This may become a little longwinded but I'll try and get all the annoying newbie questions out of the way in one go. Let me try and explain what I'm aiming for. I have some CGH data downloaded from the Progenetix database (http://www.progenetix.de/~pgscripts/progenetix/Aboutprogenetix.html), this data is simplified to record simply gain/loss/amplification of whole chromosome bands at 862 band resolution to facilitate the combination of data from multiple different studies. What I'd like to be able to do is download a copy of the human genome sequence with annotation describing the locations of chromosome bands and preferably of known genes. I then want to be able to manipulate the genome data based on the CGH data to mimic deletions. 
The ultimate goal of this is to be able to feed the manipulated genome data into a program (metashark) that predicts the structure of metabolic networks based on genome annotation compared to a reference genome, in this case a complete 'normal' human genome, and see what effect that has on the metabolic pathways. I appreciate that is a bit vague but that's sort of my problem, I'm not a bioinformatician really so I'm not sure of the details of what I want. I just happen to have a question to answer and bioperl seems the way to go (for this project and more generally). I've started looking at the HOWTOs and read the main bioperl tutorial. I also looked at the CGL comparative genomics library but I haven't penetrated far into that yet. I'm ok with basic perl although not much object oriented stuff. I don't really have much experience with handling sequence data on a whole genome scale either, a few genbank files for my favourite genes is fine but I need some guidance to work on this scale. What I'm looking for is someone to give me a start. I'd greatly appreciate it if someone could spell out the general steps for downloading a complete copy of the human genome and its annotations (if this is even a feasible approach) and how to put it all together. Not actual code, just the general concept for each step and which tools from the bioperl set would be most appropriate for each step so that I can focus on what I need to read about, even a little pseudo-code if I'm lucky. If I can get the genome data downloaded and set up properly I'll work out how to apply the CGH data to it myself. If example code for what I'm trying to describe is included somewhere, great, could someone point me to where? Thanks for your patience. 
best regards, Richard Dr Richard Birnie Scientific Officer Section of Pathology and Tumour Biology Welcome Brenner Building, LIMM St James University Hospital Beckett St, Leeds, LS9 7TF Tel:0113 3438624 e-mail: r.birnie at leeds.ac.uk From sdavis2 at mail.nih.gov Fri Jul 28 11:59:17 2006 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Fri, 28 Jul 2006 07:59:17 -0400 Subject: [Bioperl-l] whole genome annotation In-Reply-To: References: Message-ID: <44C9FC15.3040503@mail.nih.gov> 
Hi, Richard. Bioperl is good for many things, but for simply grabbing all the locations of human genes in the genome and chromosome band locations, I wouldn't use bioperl. It sounds to me like you are interested in getting the genes associated with each chromosomal band? If so, just download the cytoband.txt and refFlat.txt files from the UCSC genome browser site. cytoband.txt contains the base pair locations for each of the cytobands. refFlat.txt contains the base pair locations of "refseq" genes. It is then simply a matter of finding overlapping regions (genes with cytobands) to determine which genes are in which cytobands. Since the files are tab-delimited text, they are very easy to work with (in perl, excel, python, ...). 
Don't get me wrong--I really appreciate the power of bioperl, but in this case, your task lends itself to a simpler (and MUCH) faster approach. Sean From R.Birnie at leeds.ac.uk Fri Jul 28 12:21:46 2006 From: R.Birnie at leeds.ac.uk (Richard Birnie) Date: Fri, 28 Jul 2006 13:21:46 +0100 Subject: [Bioperl-l] whole genome annotation References: <44C9FC15.3040503@mail.nih.gov> Message-ID: 
Thanks for the response Sean, getting the genes associated with each band is certainly part of what I need and your suggestion will help with that. I did look at the UCSC site but as you know there is such a volume of info on there I didn't really know which files I needed. However my main goal requires slightly more. What I want to be able to do is take the chromosomal band annotation info and compare that against the CGH data I have. From this I'd like to then be able say "OK band 8q13.1 (or whatever) is deleted, so make a copy of the genome with the actual sequence associated with that band removed." Then I could feed both sequences into metashark which predicts the structure of metabolic pathways based on genome annotation and see what effect deleting that region of DNA has on the structure of the metabolic network. Knowing which genes are involved is necessary for identifying what are the important components within the region. Are there tools in Bioperl for making this comparison? It can probably be reduced to a straight comparison of data structures so I may just use regular perl for this part unless there is anything designed for purpose. The thing I was struggling with was how to store and manipulate genomic sequence data in such quantities. Since this morning I've had a better look at the CGL library and associated datastore module, I think I can do it using these tools but I'm having a few dependency issues getting it installed right now. So I'll go back to wrestling with that. 
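[Editorial sketch] The gene-to-band overlap step that Sean describes could look roughly like the following. This is a minimal sketch in plain Python rather than BioPerl, chosen only because the inputs are tab-delimited text. The column positions are assumptions about the UCSC files (cytoband.txt: chrom, start, end, band name, stain; refFlat.txt: gene name, accession, chrom, strand, txStart, txEnd, ...) and should be checked against the files actually downloaded. The sample rows, gene names, and function names are invented for illustration.

```python
import csv
import io
from collections import defaultdict


def read_bands(handle):
    """Return {chrom: [(start, end, band_name), ...]} from cytoband-style rows."""
    bands = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 4:
            continue  # skip blank or malformed lines
        bands[row[0]].append((int(row[1]), int(row[2]), row[3]))
    return bands


def read_genes(handle):
    """Yield (gene_name, chrom, tx_start, tx_end) from refFlat-style rows."""
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 6:
            continue  # skip blank or malformed lines
        yield row[0], row[2], int(row[4]), int(row[5])


def genes_by_band(bands, genes):
    """Map a band label such as '8q13.1' to the sorted genes overlapping it."""
    result = defaultdict(set)
    for gene, chrom, g_start, g_end in genes:
        for b_start, b_end, name in bands.get(chrom, []):
            # Two half-open intervals overlap when each starts before the other ends.
            if g_start < b_end and b_start < g_end:
                short = chrom[3:] if chrom.startswith("chr") else chrom
                result[short + name].add(gene)
    return {label: sorted(names) for label, names in result.items()}


# Invented sample data (assumed column layout, not real coordinates).
BANDS_TXT = "chr8\t0\t1000\tp11\tgneg\nchr8\t1000\t2000\tq13.1\tgpos50\n"
GENES_TXT = (
    "GENEA\tNM_A\tchr8\t+\t100\t400\n"
    "GENEB\tNM_B\tchr8\t-\t900\t1200\n"
    "GENEC\tNM_C\tchr8\t+\t1500\t1800\n"
)

OVERLAPS = genes_by_band(
    read_bands(io.StringIO(BANDS_TXT)),
    read_genes(io.StringIO(GENES_TXT)),
)
for band, names in sorted(OVERLAPS.items()):
    print(band, ", ".join(names))
```

A gene that straddles a band boundary (GENEB in the sample) is reported under both bands. The nested loop is fine for a few hundred bands per chromosome; a sorted sweep or interval index would only matter for much larger inputs.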
regards, Richard From valiente at lsi.upc.edu Fri Jul 28 12:10:19 2006 From: valiente at lsi.upc.edu (Gabriel Valiente) Date: Fri, 28 Jul 2006 15:10:19 +0300 Subject: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields) In-Reply-To: References: Message-ID: >>> At the moment it seems to me that the Bio::Taxonomy modules >>> (excluding >>> Node) aren't really usable. I've been using Bio::Taxonomy and Bio::DB::Taxonomy a lot, they are very useful modules. Whatever Bio::Taxon or Bio::Taxonomy::Taxon turns out to be, please do keep the Bio::DB::Taxonomy functionality. BTW, does anybody know how to include branch lengths in Bio::DB::Taxonomy? Thanks a lot, Gabriel From y.itan at ucl.ac.uk Fri Jul 28 12:07:32 2006 From: y.itan at ucl.ac.uk (Yuval Itan) Date: Fri, 28 Jul 2006 13:07:32 +0100 Subject: [Bioperl-l] Getting sequences by base pair locations Message-ID: <05a8f85bdc36dc7e370d0935b6394f9d@ucl.ac.uk> Hello all, I was BLATing a few hundred human genes against the chimp genome, and kept the best chimp hits for every human gene. I have the base pair start and end location for every chimp hit, and I need to get the sequence for each of these chimp hits. Here is an example for a few chimp hits bp locations: Start End 142854 144504 154479 155198 153066 167370 163146 163559 I have one chimp genome file (about 3GB) including all chromosomes, but I could also get one file per chromosome if that would make things easier. Does anyone have a script or a link for an interface that can do the job? Thank you very much. 
From hlapp at gmx.net Fri Jul 28 12:59:43 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Fri, 28 Jul 2006 08:59:43 -0400 Subject: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields) In-Reply-To: References: Message-ID: <233D3060-5CF7-4DF7-8EF6-6762CF45B94D@gmx.net> If I understand Sendu's proposal correctly then the existing methods in Bio::DB::Taxonomy will remain largely unchanged (methods may be added though). Can you describe briefly what you use Bio::Taxonomy for, e.g., which methods you use primarily and the context? -hilmar On Jul 28, 2006, at 8:10 AM, Gabriel Valiente wrote: >>>> At the moment it seems to me that the Bio::Taxonomy modules >>>> (excluding >>>> Node) aren't really usable. > > I've been using Bio::Taxonomy and Bio::DB::Taxonomy a lot, they are > very useful modules. Whatever Bio::Taxon or Bio::Taxonomy::Taxon > turns out to be, please do keep the Bio::DB::Taxonomy functionality. > > BTW, does anybody know how to include branch lengths in > Bio::DB::Taxonomy? > > Thanks a lot, > > Gabriel > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From bix at sendu.me.uk Fri Jul 28 13:01:44 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Fri, 28 Jul 2006 14:01:44 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields) In-Reply-To: References: Message-ID: <44CA0AB8.7040205@sendu.me.uk> Gabriel Valiente wrote: >>>> At the moment it seems to me that the Bio::Taxonomy modules >>>> (excluding >>>> Node) aren't really usable. > > I've been using Bio::Taxonomy Can I ask how you've been using it? > and Bio::DB::Taxonomy a lot, they are > very useful modules. 
Whatever Bio::Taxon or Bio::Taxonomy::Taxon > turns out to be, please do keep the Bio::DB::Taxonomy functionality. Bio::DB::Taxonomy is staying virtually unaltered. > BTW, does anybody know how to include branch lengths in > Bio::DB::Taxonomy? At the moment, you don't 'include' anything at all in the DB modules yourself, since they are read-only. They give you Nodes which you can alter afterwards. I plan to add something like a 'distance to parent' in Node (Bio::Taxon) so you can work out branch lengths; you can't do that yet. From bix at sendu.me.uk Fri Jul 28 13:13:44 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Fri, 28 Jul 2006 14:13:44 +0100 Subject: [Bioperl-l] Getting sequences by base pair locations In-Reply-To: <05a8f85bdc36dc7e370d0935b6394f9d@ucl.ac.uk> References: <05a8f85bdc36dc7e370d0935b6394f9d@ucl.ac.uk> Message-ID: <44CA0D88.3000404@sendu.me.uk> Yuval Itan wrote: > Hello all, > > I was BLATing a few hundred human genes against the chimp genome, and > kept the best chimp hits for every human gene. > I have the base pair start and end location for every chimp hit, and I > need to get the sequence for each of these chimp hits. Here is an > example for a few chimp hits bp locations: > > Start End > 142854 144504 > 154479 155198 > 153066 167370 > 163146 163559 > > I have one chimp genome file (about 3GB) including all chromosomes, but > I could also get one file per chromosome if that would make things > easier. Does anyone have a script or a link for an interface that can do > the job? If your genome file is in some standard format, use SeqIO. http://www.bioperl.org/wiki/HOWTO:SeqIO And then get the sequence corresponding to the correct chromosome and get the desired chunk with subseq(); http://www.bioperl.org/wiki/HOWTO:Beginners#The_Sequence_Object You'd also have to make sure that the data used during the blat is exactly the same data you have in your big file. 
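[Editorial sketch] The read-then-slice idea above (read each chromosome record, then cut out the hit with subseq()) can be illustrated as follows. This is a stdlib-only Python sketch, not the BioPerl code itself; it assumes the genome is in FASTA format, that each hit is recorded together with the chromosome it falls on (the example coordinates in the question do not show one), and that coordinates are 1-based and inclusive, which is the convention BioPerl's subseq() uses. BLAT's PSL output uses 0-based, half-open coordinates, so it is worth checking which convention a given hit table follows. All function and variable names here are hypothetical.

```python
import io


def read_fasta(handle):
    """Yield (record_id, sequence) pairs from a FASTA-formatted handle."""
    record_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if record_id is not None:
                yield record_id, "".join(chunks)
            header = line[1:].split()
            record_id, chunks = (header[0] if header else ""), []
        elif line:
            chunks.append(line)
    if record_id is not None:
        yield record_id, "".join(chunks)


def subseq(sequence, start, end):
    """Return bases start..end, counting from 1 and including both ends."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError(f"coordinates {start}..{end} fall outside the sequence")
    return sequence[start - 1:end]


def fetch_hits(handle, hits):
    """hits maps record id -> [(start, end), ...]; returns (id, start, end, seq) tuples."""
    extracted = []
    for record_id, sequence in read_fasta(handle):
        for start, end in hits.get(record_id, []):
            extracted.append((record_id, start, end, subseq(sequence, start, end)))
    return extracted


# Tiny invented genome and hit table, for illustration only.
GENOME = ">chrA demo\nACGTAC\nGTTTGA\n>chrB\nCCCCGGGG\n"
HITS = {"chrA": [(3, 8)], "chrB": [(1, 4), (5, 8)]}

EXTRACTED = fetch_hits(io.StringIO(GENOME), HITS)
for record_id, start, end, seq in EXTRACTED:
    print(f">{record_id}:{start}-{end}\n{seq}")
```

This holds one chromosome in memory at a time, which is workable for a per-chromosome pass over a 3GB file but slow for repeated lookups; an indexed approach is the usual answer when random access is needed.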
From sdavis2 at mail.nih.gov Fri Jul 28 13:28:02 2006 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Fri, 28 Jul 2006 09:28:02 -0400 Subject: [Bioperl-l] whole genome annotation In-Reply-To: References: <44C9FC15.3040503@mail.nih.gov> Message-ID: <44CA10E2.8010205@mail.nih.gov> Ahh. I see. Metashark actually searches the remaining sequence in the human genome? 
If that is the case, then you need the start and end positions of the chromosomal bands, which you can download from the UCSC genome browser. Follow the links to download and then to the genome of your choice and finally get the cytoband.txt file. The other piece of the puzzle is the Bio::DB::Fasta module. It allows extremely fast access to a set of fasta files, which it first indexes. Here is the documentation for it: http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/DB/Fasta.html You could imagine making a hash keyed by chromosome band, with each entry holding the start and end of that band. For each CGH experiment, find those regions that are deleted. Exclude those when looping through all the chromosome bands, pulling the sequence using Bio::DB::Fasta for each band and writing that to a file for metashark. Sean From sdavis2 at mail.nih.gov Fri Jul 28 13:30:52 2006 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Fri, 28 Jul 2006 09:30:52 -0400 Subject: [Bioperl-l] Getting sequences by base pair locations In-Reply-To: <05a8f85bdc36dc7e370d0935b6394f9d@ucl.ac.uk> References: <05a8f85bdc36dc7e370d0935b6394f9d@ucl.ac.uk> Message-ID: <44CA118C.7010401@mail.nih.gov> Yuval Itan wrote: > Hello all, > > I was BLATing a few hundred human genes against the chimp genome, and > kept the best chimp hits for every human gene. > I have the base pair start and end location for every chimp hit, and I > need to get the sequence for each of these chimp hits. Here is an > example for a few chimp hits bp locations: > > Start End > 142854 144504 > 154479 155198 > 153066 167370 > 163146 163559 > > I have one chimp genome file (about 3GB) including all chromosomes, but > I could also get one file per chromosome if that would make things > easier. Does anyone have a script or a link for an interface that can do > the job? 
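[Editorial sketch] The band-by-band loop Sean outlines in his reply to Richard (a table of band coordinates, skip the bands a CGH experiment reports as deleted, write out the rest) might be sketched like this. It is plain Python with toy in-memory sequences; a real run would pull each band from an indexed FASTA file (Bio::DB::Fasta in BioPerl) instead of holding chromosomes in a dict. The function names, the 0-based half-open coordinates, and the assumption that the bands tile each chromosome without gaps are all choices made for this illustration.

```python
import io


def genome_without_bands(chrom_seqs, bands, deleted):
    """
    Return {chrom: sequence} with the deleted bands left out.

    chrom_seqs: {chrom: full sequence}
    bands:      {chrom: [(start, end, band_name), ...]}, 0-based half-open
    deleted:    set of (chrom, band_name) pairs reported as lost
    """
    result = {}
    for chrom, sequence in chrom_seqs.items():
        kept = []
        for start, end, name in sorted(bands.get(chrom, [])):
            if (chrom, name) in deleted:
                continue  # this band is lost in the experiment, so skip its bases
            kept.append(sequence[start:end])
        result[chrom] = "".join(kept)
    return result


def write_fasta(handle, seqs, width=60):
    """Write {name: sequence} as FASTA, wrapping sequence lines at `width`."""
    for name, sequence in seqs.items():
        handle.write(f">{name}\n")
        for i in range(0, len(sequence), width):
            handle.write(sequence[i:i + width] + "\n")


# Toy chromosome with four invented 4-base "bands".
CHROMS = {"chr8": "AAAACCCCGGGGTTTT"}
BANDS = {"chr8": [(0, 4, "p11"), (4, 8, "q11"), (8, 12, "q13.1"), (12, 16, "q21")]}
DELETED = {("chr8", "q13.1")}

TRIMMED = genome_without_bands(CHROMS, BANDS, DELETED)
OUT = io.StringIO()
write_fasta(OUT, TRIMMED, width=8)
print(OUT.getvalue(), end="")
```

Any stretch of a chromosome not covered by a band entry is silently dropped by this sketch, so the band table should be checked for completeness before relying on the output.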
See this module: http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/DB/Fasta.html Sean From osborne1 at optonline.net Fri Jul 28 13:35:02 2006 From: osborne1 at optonline.net (Brian Osborne) Date: Fri, 28 Jul 2006 09:35:02 -0400 Subject: [Bioperl-l] whole genome annotation In-Reply-To: Message-ID: Richard, A good starting point is a FAQ page we created that describes various ways of extracting genomic sequence: http://www.bioperl.org/wiki/Getting_Genomic_Sequences Check that out, and Sean's suggestion, and write back to bioperl-l if you have questions. One thing that this page doesn't really address is the special challenge that comes with working with very large sequences, this is something you might have to consider as well. You also asked about downloading the human genome and its annotations. There's also more than one way to do this as well. You'd have access to this data if you used the ENSEMBL API but you can get the Genbank files at ftp://ftp.ncbi.nih.gov/genomes/. Having said that I should add that one of the advantages of the ENSEMBL API approach is that you don't have to download the entire genome. Don't know what machine you're working on but, again, trying to manipulate very large sequences may tax your computer as well as your patience. Brian O. 
From sdavis2 at mail.nih.gov Fri Jul 28 13:41:45 2006 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Fri, 28 Jul 2006 09:41:45 -0400 Subject: [Bioperl-l] Getting sequences by base pair locations In-Reply-To: <44CA0D88.3000404@sendu.me.uk> References: <05a8f85bdc36dc7e370d0935b6394f9d@ucl.ac.uk> <44CA0D88.3000404@sendu.me.uk> Message-ID: <44CA1419.3030100@mail.nih.gov> Sendu Bala wrote: > Yuval Itan wrote: > >>Hello all, >> >>I was BLATing a few hundred human genes against the chimp genome, and >>kept the best chimp hits for every human gene. >>I have the base pair start and end location for every chimp hit, and I >>need to get the sequence for each of these chimp hits. Here is an >>example for a few chimp hits bp locations: >> >>Start End* >>*142854 144504 >>154479 155198 >>153066 167370 >>163146 163559 >> >>I have one chimp genome file (about 3GB) including all chromosomes, but >>I could also get one file per chromosome if that would make things >>easier. Does anyone have a script or a link for an interface that can do >>the job? 
> > > If your genome file is in some standard format, use SeqIO. > http://www.bioperl.org/wiki/HOWTO:SeqIO > > And then get the sequence corresponding to the correct chromosome and > get the desired chunk with subseq(); > http://www.bioperl.org/wiki/HOWTO:Beginners#The_Sequence_Object My guess is that Yuval will need random access to the sequences. With seqIO, this is possible with a relatively large amount of memory, but Bio::DB::Fasta might be the better bet. Alternatively, make a custom track (see the documentation for doing so at the UCSC genome browser site), upload it, and then getting the DNA is trivial with just a couple of mouseclicks. This method also has the advantage of being able to do things like viewing the data in genome coordinates and allows the possibility of doing interections with known chimp genes so you could find hits that don't overlap known chimp genes, for example. Sean From valiente at lsi.upc.edu Fri Jul 28 13:53:10 2006 From: valiente at lsi.upc.edu (Gabriel Valiente) Date: Fri, 28 Jul 2006 16:53:10 +0300 Subject: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields) In-Reply-To: <001301c6b24b$da38ba80$15327e82@pyrimidine> References: <001301c6b24b$da38ba80$15327e82@pyrimidine> Message-ID: <1D2EF357-E8DE-4F15-824A-9C9359389685@lsi.upc.edu> > Would be nice to know how you use Bio::Taxonomy. You are the first > here who > seems to have a use for it. I'm using it to obtain a reference taxonomy for a set of organisms, against which to assess a phylogeny obtained by the usual protocol (fetch rRNA sequences, align them, obtain a distance matrix, cluster). 
Roughly:

use Bio::DB::Taxonomy;

my $nodesfile = "nodes.dmp";
my $namesfile = "names.dmp";

my $db = new Bio::DB::Taxonomy(-source    => 'flatfile',
                               -directory => "./db/",
                               -nodesfile => $nodesfile,
                               -namesfile => $namesfile);

my @species = (...);

for my $ncbi_name (@species) {
    my $ncbi_id = $db->get_taxonid($ncbi_name);
    my $node    = $db->get_Taxonomy_Node(-taxonid => $ncbi_id);
    my @lineage = get_lineage_nodes($node);
    # ...
}

Here, get_lineage_nodes could be added as a method to Bio::Taxonomy::Node or equivalent:

sub get_lineage_nodes {
    my $node = shift;
    my @lineage;
    while ($node->node_name ne "root") {
        $node = $node->get_Parent_Node;
        unshift @lineage, $node;
    }
    return @lineage;
}

I've also written a method to merge the full lineages of a set of Bio::Taxonomy::Node objects into a Bio::Tree::Tree object. I'd be glad to contribute it as well, but I'm not sure where it would fit.

> As for branch lengths, I think you're confusing 'taxonomy' (classification
> of organisms based on just about anything) with 'phylogeny' (evolutionary
> relatedness). Note in the Wikipedia article below the use of the term
> 'phylogenetic taxonomy', which is the classification of organisms based on
> evolutionary relationships.
>
> http://en.wikipedia.org/wiki/Taxonomy
>
> http://en.wikipedia.org/wiki/Phylogeny
>
> NCBI has a disclaimer about the Taxonomy database that is related to this:
>
> http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=howcite
>
> There are HOWTOs on tree manipulation, population genetics, and PAML on the
> wiki which might be a good start for Bioperl phylogenetic methods:
>
> http://www.bioperl.org/wiki/HOWTO:Trees
>
> http://www.bioperl.org/wiki/HOWTO:PAML
>
> http://www.bioperl.org/wiki/HOWTO:PopGen

Thanks a lot. Let me check it and get back to the discussion later on.
Gabriel > Chris > >> -----Original Message----- >> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- >> bounces at lists.open-bio.org] On Behalf Of Gabriel Valiente >> Sent: Friday, July 28, 2006 7:10 AM >> To: bioperl-l at lists.open-bio.org >> Subject: Re: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields) >> >>>>> At the moment it seems to me that the Bio::Taxonomy modules >>>>> (excluding >>>>> Node) aren't really usable. >> >> I've been using Bio::Taxonomy and Bio::DB::Taxonomy a lot, they are >> very useful modules. Whatever Bio::Taxon or Bio::Taxonomy::Taxon >> turns out to be, please do keep the Bio::DB::Taxonomy functionality. >> >> BTW, does anybody know how to include branch lengths in >> Bio::DB::Taxonomy? >> >> Thanks a lot, >> >> Gabriel >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l From R.Birnie at leeds.ac.uk Fri Jul 28 13:56:15 2006 From: R.Birnie at leeds.ac.uk (Richard Birnie) Date: Fri, 28 Jul 2006 14:56:15 +0100 Subject: [Bioperl-l] whole genome annotation References: Message-ID: Thanks folks, That should be enough to get me going. At least I can see the wood for the trees now. Richard Dr Richard Birnie Scientific Officer Section of Pathology and Tumour Biology Welcome Brenner Building, LIMM St James University Hospital Beckett St, Leeds, LS9 7TF Tel:0113 3438624 e-mail: r.birnie at leeds.ac.uk -----Original Message----- From: Brian Osborne [mailto:osborne1 at optonline.net] Sent: Fri 7/28/2006 14:35 To: Richard Birnie; bioperl-l at lists.open-bio.org Subject: Re: [Bioperl-l] whole genome annotation Richard, A good starting point is a FAQ page we created that describes various ways of extracting genomic sequence: http://www.bioperl.org/wiki/Getting_Genomic_Sequences Check that out, and Sean's suggestion, and write back to bioperl-l if you have questions. 
One thing that this page doesn't really address is the special challenge that comes with working with very large sequences, this is something you might have to consider as well. You also asked about downloading the human genome and its annotations. There's also more than one way to do this as well. You'd have access to this data if you used the ENSEMBL API but you can get the Genbank files at ftp://ftp.ncbi.nih.gov/genomes/. Having said that I should add that one of the advantages of the ENSEMBL API approach is that you don't have to download the entire genome. Don't know what machine you're working on but, again, trying to manipulate very large sequences may tax your computer as well as your patience. Brian O. On 7/28/06 5:39 AM, "Richard Birnie" wrote: > Hello all, > > I'm just trying to familiarise myself with BioPerl and I'm a little > overwhelmed by the sheer volume of information available on the wiki. I'm > hoping someone can point in the right direction through the labyrinth. This > may become a little longwinded but I'll try and get all the annoying newbie > questions out of the way in one go. > > Let me try and explain what I'm aiming for. I have some CGH data downloaded > from the Progenetix database > (http://www.progenetix.de/~pgscripts/progenetix/Aboutprogenetix.html), this > data is simplified to record simply gain/loss/amplification of whole > chromosome bands at 862 band resolution to facilitate the combination of data > from multiple different studies. > > What I'd like to be able to do is download a copy of the human genome sequence > with annotation describing the locations of chromosome bands and preferably of > known genes. I then want to be able to manipulate the genome data based on the > CGH data to mimic deletions. 
The ultimate goal of this is to be able to feed > the manipulated genome data into a program (metashark) that predicts the > structure of metabolic networks based on genome annotation compared to a > reference genome, in this case a complete 'normal' human genome and see what > effect that has on the metabolic pathways. > > I appreciate that is a bit vague but thats sort of my problem, I'm not a > bioinformatician really so I'm not sue of the details of what I want. I just > happen to have an question to answer and bioperl seems the way to go (for this > project and more generally). I've started looking at the HOWTOs and read the > main bioperl tutorial. I also looked at the CGL comparative genomics library > but I haven't penetrated far into that yet. I'm ok with basic perl although > not much object oriented stuff. I don't really have much experience with > handling sequence data on a whole genome scale either, a few genbank files for > my favourite genes is fine but I need some guidance to work on this scale. > > What I'm looking for is someone to give me a start. I'd greatly appreciate it > if someone could spell out the general steps for downloading a complete copy > of the human genome and its annotations (if this is even a feasible approach) > and how to put it all together. Not actual code just the general concept for > each step and which tools from the bioperl set would be most appropriate for > each step so that I can focus what I need to read about, even a little > pseudo-code if I'm lucky. If I can get the genome data downloaded and setup > properly I'll work out how to apply the CGH data to it myself. > > If example code for what I'm trying to describe is included somewhere, great > could someone point to where. > > Thanks for your patience. 
> best regards, > Richard > > > > Dr Richard Birnie > Scientific Officer > Section of Pathology and Tumour Biology > Welcome Brenner Building, LIMM > St James University Hospital > Beckett St, Leeds, LS9 7TF > Tel:0113 3438624 > e-mail: r.birnie at leeds.ac.uk > > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Fri Jul 28 13:43:47 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Fri, 28 Jul 2006 08:43:47 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields) In-Reply-To: Message-ID: <001301c6b24b$da38ba80$15327e82@pyrimidine> Now I get personal email? Yikes! Sendu has indicated that Bio::DB::Taxonomy will stay essentially unchanged. If anything changes, it >may< be the class used to hold the Node information. Would be nice to know how you use Bio::Taxonomy. You are the first here who seems to have a use for it. As for branch lengths, I think you're confusing 'taxonomy' (classification of organisms based on just about anything) with 'phylogeny' (evolutionary relatedness). Note in the Wikipedia article below the use of the term 'phylogenetic taxonomy', which is the classification of organisms based on evolutionary relationships. 
http://en.wikipedia.org/wiki/Taxonomy http://en.wikipedia.org/wiki/Phylogeny NCBI has a disclaimer about the Taxonomy database that is related to this: http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=howcite There are HOWTOs on tree manipulation, population genetics, and PAML on the wiki which might be a good start for Bioperl phylogenetic methods: http://www.bioperl.org/wiki/HOWTO:Trees http://www.bioperl.org/wiki/HOWTO:PAML http://www.bioperl.org/wiki/HOWTO:PopGen Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Gabriel Valiente > Sent: Friday, July 28, 2006 7:10 AM > To: bioperl-l at lists.open-bio.org > Subject: Re: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields) > > >>> At the moment it seems to me that the Bio::Taxonomy modules > >>> (excluding > >>> Node) aren't really usable. > > I've been using Bio::Taxonomy and Bio::DB::Taxonomy a lot, they are > very useful modules. Whatever Bio::Taxon or Bio::Taxonomy::Taxon > turns out to be, please do keep the Bio::DB::Taxonomy functionality. > > BTW, does anybody know how to include branch lengths in > Bio::DB::Taxonomy? > > Thanks a lot, > > Gabriel > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Fri Jul 28 14:15:38 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Fri, 28 Jul 2006 09:15:38 -0500 Subject: [Bioperl-l] Getting sequences by base pair locations In-Reply-To: <44CA118C.7010401@mail.nih.gov> Message-ID: <001401c6b250$4e3c2490$15327e82@pyrimidine> Yuval, You can also do this remotely if the file you want is in GenBank (and you don't want to store the data locally). The nice thing about using this is that any seqfeatures in the GenBank file within the region requested are also returned.
Note that if data is stored in a RefSeq file you'll need to add the parameter '-no_redirect => 1,' to the Bio::DB::GenBank object. I would NOT recommend this for huge numbers of sequences (>2000) as you would be spamming NCBI with thousands of repeated requests; if you did have a relatively large number you could run this overnight, which is what I do. Bio::DB::Fasta would be better if you have tons of hits.

Use this in a loop to grab the sequences one at a time based on the start, stop positions, (and strand, if you need it):

    # Below is from Bio::DB::GenBank POD, with some modifications
    my $factory = Bio::DB::GenBank->new(
        -seq_start => $start,
        -seq_stop  => $end,
        -strand    => $strand    # 1=plus, 2=minus
    );
    my $seq_obj;
    eval {
        $seq_obj = $factory->get_Seq_by_acc($sf->seq_id);
    };
    if( $@ ) {
        print STDERR "Unable to retrieve from $start to $end.\n";
        print STDERR "Error!\n$@";
        print STDERR "Attempting to move on...\n";
        next;
    }
    print STDERR "Got sequence: ",$seq_obj->description,"\n";
    print STDERR "\tLength: ",$seq_obj->length,"\n";
    my $sf_len = $sf->length;

The eval{} block is needed to make sure retrieval worked via network connections and to not end based on a network error (the object throws an error which eval catches, logs it to STDERR, thus allowing you to continue on).

Chris

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Sean Davis
> Sent: Friday, July 28, 2006 8:31 AM
> To: Yuval Itan
> Cc: bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] Getting sequences by base pair locations
>
> Yuval Itan wrote:
> > Hello all,
> >
> > I was BLATing a few hundred human genes against the chimp genome, and
> > kept the best chimp hits for every human gene.
> > I have the base pair start and end location for every chimp hit, and I
> > need to get the sequence for each of these chimp hits.
Here is an > > example for a few chimp hits bp locations: > > > > Start End* > > *142854 144504 > > 154479 155198 > > 153066 167370 > > 163146 163559 > > > > I have one chimp genome file (about 3GB) including all chromosomes, but > > I could also get one file per chromosome if that would make things > > easier. Does anyone have a script or a link for an interface that can do > > the job? > > See this module: > > http://doc.bioperl.org/releases/bioperl-current/bioperl- > live/Bio/DB/Fasta.html > > Sean > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Fri Jul 28 14:35:21 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Fri, 28 Jul 2006 09:35:21 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields) In-Reply-To: <1D2EF357-E8DE-4F15-824A-9C9359389685@lsi.upc.edu> Message-ID: <001501c6b253$0fed08a0$15327e82@pyrimidine> > use Bio::DB::Taxonomy; > I've also written a method to merge the full lineages of a set of > Bio::Taxonomy::Node object into a Bio::Tree::Tree object. I'd be glad > to contribute it as well, but I'm not sure where it would fit. Ah, that would be great (I had mentioned something along these lines to do with BLAST reports). But does this actually use Bio::Taxonomy directly? Taxonomy::Node does not inherit methods from Bio::Taxonomy AFAIK. So, anything that Sendu does may not dramatically impact your code. Sendu? You might need to address some of this to Sendu. Big changes are afoot for Bio::Taxonomy and Bio::Taxonomy::Node. He's heading that up. Chris > ... > Thanks a lot. Let me check it and get back to the discussion later on. > > Gabriel > > > Chris > > ... 
From cjfields at uiuc.edu Fri Jul 28 14:37:09 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Fri, 28 Jul 2006 09:37:09 -0500 Subject: [Bioperl-l] Getting sequences by base pair locations In-Reply-To: <44CA1419.3030100@mail.nih.gov> Message-ID: <001601c6b253$4ec57170$15327e82@pyrimidine> ... > > If your genome file is in some standard format, use SeqIO. > > http://www.bioperl.org/wiki/HOWTO:SeqIO > > > > And then get the sequence corresponding to the correct chromosome and > > get the desired chunk with subseq(); > > http://www.bioperl.org/wiki/HOWTO:Beginners#The_Sequence_Object > > My guess is that Yuval will need random access to the sequences. With > seqIO, this is possible with a relatively large amount of memory, but > Bio::DB::Fasta might be the better bet. Agreed. This is one of the bioperl 'speed' issue areas: http://www.bioperl.org/wiki/Project_priority_list Bio::DB::Fasta returns a specialized PrimarySeq object which gets around the current speed issues with SeqIO. > Alternatively, make a custom track (see the documentation for doing so > at the UCSC genome browser site), upload it, and then getting the DNA is > trivial with just a couple of mouseclicks. This method also has the > advantage of being able to do things like viewing the data in genome > coordinates and allows the possibility of doing interections with known > chimp genes so you could find hits that don't overlap known chimp genes, > for example. > > Sean Would be nice to have a more automated and direct way of doing something along these lines within bioperl (with the obvious caveat of not spamming the server). You can currently retrieve chunks of sequence based on start, stop, strand from GenBank. Ah, one can dream... 
Chris From bix at sendu.me.uk Fri Jul 28 14:38:20 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Fri, 28 Jul 2006 15:38:20 +0100 Subject: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields) In-Reply-To: <1D2EF357-E8DE-4F15-824A-9C9359389685@lsi.upc.edu> References: <001301c6b24b$da38ba80$15327e82@pyrimidine> <1D2EF357-E8DE-4F15-824A-9C9359389685@lsi.upc.edu> Message-ID: <44CA215C.2070607@sendu.me.uk> Gabriel Valiente wrote: >> Would be nice to know how you use Bio::Taxonomy. You are the first >> here who >> seems to have a use for it. > > I'm using it to obtain a reference taxonomy for a set of organisms, > against which to assess a phylogeny obtained by the usual protocol > (fetch rRNA sequences, align them, obtain a distance matrix, > cluster). Roughly: > > use Bio::DB::Taxonomy; Ah, we were specifically wondering if you had used Bio/Taxonomy.pm, not Taxonomy modules in general. Again, DB::Taxonomy usage will be unaffected. > Here, get_lineage_nodes could be added as a method to > Bio::Taxonomy::Node or equivalent: > > sub get_lineage_nodes{ > my $node = shift; > my @lineage; > while ($node->node_name ne "root") { > $node = $node->get_Parent_Node; > unshift @lineage, $node; > } > return @lineage; > } I think you must have an older version of bioperl. Bio::Taxonomy::Node has a method get_Lineage_Nodes() which more or less does exactly that. > I've also written a method to merge the full lineages of a set of > Bio::Taxonomy::Node object into a Bio::Tree::Tree object. I'd be glad > to contribute it as well, but I'm not sure where it would fit. 
Post it and I'll see if it will fit anywhere :) From cuiw at ncbi.nlm.nih.gov Fri Jul 28 13:46:50 2006 From: cuiw at ncbi.nlm.nih.gov (Cui, Wenwu (NIH/NLM/NCBI) [C]) Date: Fri, 28 Jul 2006 09:46:50 -0400 Subject: [Bioperl-l] Getting sequences by base pair locations In-Reply-To: <05a8f85bdc36dc7e370d0935b6394f9d@ucl.ac.uk> Message-ID: <18C407FD4FFB424292D769FBD68C1987C7C254@NIHCESMLBX8.nih.gov> Maybe the easiest way is to use LWP to get the webpage. Here is an example for CHIMP1A:10:12345678:12348888: http://www.ensembl.org/Pan_troglodytes/exportview?format=fasta&l=10%3A12 345678-12348888&action=export&_format=Text&output=txt&submit=Continue+%3 E%3E Wenwu Cui ________________________________ From: Yuval Itan [mailto:y.itan at ucl.ac.uk] Sent: Friday, July 28, 2006 8:08 AM To: bioperl-l at lists.open-bio.org Subject: [Bioperl-l] Getting sequences by base pair locations Hello all, I was BLATing a few hundred human genes against the chimp genome, and kept the best chimp hits for every human gene. I have the base pair start and end location for every chimp hit, and I need to get the sequence for each of these chimp hits. Here is an example for a few chimp hits bp locations: Start End 142854 144504 154479 155198 153066 167370 163146 163559 I have one chimp genome file (about 3GB) including all chromosomes, but I could also get one file per chromosome if that would make things easier. Does anyone have a script or a link for an interface that can do the job? Thank you very much. 
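Whichever source the sequence comes from in this thread (SeqIO plus subseq(), Bio::DB::Fasta, or a web export), the extraction step comes down to a 1-based, inclusive slice of a chromosome string, which is the convention subseq() uses. The following is a minimal plain-Perl sketch of that arithmetic only. It assumes the hit coordinates are also 1-based and inclusive (BLAT's PSL files are documented as using 0-based starts, so check your own file first), and the names extract_region and revcomp, the toy sequence, and the hit list are all illustrative; none of them are part of BioPerl.

```perl
use strict;
use warnings;

# Return the bases from $start to $end (1-based, inclusive) of $seq.
# Assumption: coordinates follow the same convention as subseq().
sub extract_region {
    my ($seq, $start, $end) = @_;
    die "bad coordinates $start..$end\n"
        if $start < 1 || $end < $start || $end > length($seq);
    return substr($seq, $start - 1, $end - $start + 1);
}

# Reverse-complement a DNA string (for hits on the minus strand).
sub revcomp {
    my ($seq) = @_;
    my $rc = reverse $seq;            # scalar context: reverses the string
    $rc =~ tr/ACGTacgt/TGCAtgca/;     # complement each base
    return $rc;
}

my $chrom = 'ACGTTGCAAGGC';                  # toy "chromosome"
my @hits  = ([2, 5, '+'], [7, 10, '-']);     # hypothetical start, end, strand

for my $hit (@hits) {
    my ($start, $end, $strand) = @$hit;
    my $dna = extract_region($chrom, $start, $end);
    $dna = revcomp($dna) if $strand eq '-';
    print ">hit_${start}_${end}_${strand}\n$dna\n";
}
```

For a 3GB genome file the string would not be held in memory like this; an indexed lookup such as Bio::DB::Fasta does the same slice against the file on disk.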
From valiente at lsi.upc.edu Fri Jul 28 14:49:28 2006 From: valiente at lsi.upc.edu (Gabriel Valiente) Date: Fri, 28 Jul 2006 17:49:28 +0300 Subject: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields) In-Reply-To: <001501c6b253$0fed08a0$15327e82@pyrimidine> References: <001501c6b253$0fed08a0$15327e82@pyrimidine> Message-ID: <5563CD94-DC99-46A3-A56A-485D4A4D3031@lsi.upc.edu> >> use Bio::DB::Taxonomy; > > > >> I've also written a method to merge the full lineages of a set of >> Bio::Taxonomy::Node object into a Bio::Tree::Tree object. I'd be glad >> to contribute it as well, but I'm not sure where it would fit. > > Ah, that would be great (I had mentioned something along these > lines to do > with BLAST reports). But does this actually use Bio::Taxonomy > directly? > Taxonomy::Node does not inherit methods from Bio::Taxonomy AFAIK. So, > anything that Sendu does may not dramatically impact your code. > Sendu? It is a general algorithm I devised that takes a set of paths and builds up a tree. The input paths are full lineages coming from Bio::DB::Taxonomy, the output tree is a Bio::Tree::Tree. This is why I said I don't know exactly where it would belong, it looks to me more like a standalone script than a Bio::Taxonomy or Bio::Tree method. Gabriel > You might need to address some of this to Sendu. Big changes are > afoot for > Bio::Taxonomy and Bio::Taxonomy::Node. He's heading that up. > > Chris > >> ... >> Thanks a lot. Let me check it and get back to the discussion later >> on. >> >> Gabriel >> >>> Chris >>> > ... 
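The idea described above, taking a set of root-to-species paths and building one tree from them, can be illustrated without any BioPerl objects. What follows is a minimal sketch and not Gabriel's script: it assumes each lineage is just an array of names ordered from the root down, stores the tree as nested hashes, and prints Newick with internal node labels. The sub names merge_path and to_newick and the sample lineages are illustrative assumptions.

```perl
use strict;
use warnings;

# Merge one root-to-leaf path into a tree stored as nested hashes:
# each key is a node name, each value is the hash of that node's children.
sub merge_path {
    my ($tree, @path) = @_;
    my $node = $tree;
    for my $name (@path) {
        $node->{$name} ||= {};     # create the child only if it is missing
        $node = $node->{$name};    # descend one level
    }
    return $tree;
}

# Serialise the nested hashes as Newick (without the trailing ';'),
# visiting children in sorted order so the output is reproducible.
sub to_newick {
    my ($children) = @_;
    my @parts;
    for my $name (sort keys %$children) {
        my $sub = to_newick($children->{$name});
        push @parts, ($sub ne '' ? "($sub)" : '') . $name;
    }
    return join ',', @parts;
}

# Hypothetical, shortened lineages; real ones would come from the taxonomy.
my @lineages = (
    [qw(root Hominidae Pongo)],
    [qw(root Hominidae Homininae Gorilla)],
    [qw(root Hominidae Homininae Pan)],
    [qw(root Hominidae Homininae Homo)],
);

my %tree;
merge_path(\%tree, @$_) for @lineages;
print to_newick(\%tree), ";\n";
```

Shared prefixes collapse automatically because a child hash is created only the first time a name is seen under its parent. Names containing spaces (such as binomial species names) would need quoting before being written as Newick.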
From sdavis2 at mail.nih.gov Fri Jul 28 15:21:09 2006 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Fri, 28 Jul 2006 11:21:09 -0400 Subject: [Bioperl-l] Getting sequences by base pair locations In-Reply-To: <001601c6b253$4ec57170$15327e82@pyrimidine> References: <001601c6b253$4ec57170$15327e82@pyrimidine> Message-ID: <44CA2B65.8070906@mail.nih.gov> Chris Fields wrote: > Would be nice to have a more automated and direct way of doing something > along these lines within bioperl (with the obvious caveat of not spamming > the server). You can currently retrieve chunks of sequence based on start, > stop, strand from GenBank. The ENSembl API has some features that can be useful for these types of things. I, personally, have a mirror of the UCSC mysql database (very easy to do with just rsync and mysql) and try to turn questions like these into SQL queries. That, combined with Bio::DB::Fasta, can make a useful automated pipeline for getting arbitrary sequences associated with genomic locations meeting specific criteria. It is much faster than anything one can do over the web and doesn't have access limitations. Sean From cjfields at uiuc.edu Fri Jul 28 15:27:17 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Fri, 28 Jul 2006 10:27:17 -0500 Subject: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields) In-Reply-To: <5563CD94-DC99-46A3-A56A-485D4A4D3031@lsi.upc.edu> Message-ID: <000001c6b25a$4f9392b0$15327e82@pyrimidine> > It is a general algorithm I devised that takes a set of paths and > builds up a tree. The input paths are full lineages coming from > Bio::DB::Taxonomy, the output tree is a Bio::Tree::Tree. This is why > I said I don't know exactly where it would belong, it looks to me > more like a standalone script than a Bio::Taxonomy or Bio::Tree method. > > Gabriel Agreed. 
You could submit the script as an example here if it is short, or via Bugzilla as an enhancement request: http://bugzilla.open-bio.org/ It could be added to the scripts\ or examples\ directory in bioperl-core. Chris From valiente at lsi.upc.edu Fri Jul 28 16:35:20 2006 From: valiente at lsi.upc.edu (Gabriel Valiente) Date: Fri, 28 Jul 2006 19:35:20 +0300 Subject: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields) In-Reply-To: <000001c6b25a$4f9392b0$15327e82@pyrimidine> References: <000001c6b25a$4f9392b0$15327e82@pyrimidine> Message-ID: <3DB992C6-DF16-42B9-8C36-F3B5C8CCBDE7@lsi.upc.edu> >> It is a general algorithm I devised that takes a set of paths and >> builds up a tree. The input paths are full lineages coming from >> Bio::DB::Taxonomy, the output tree is a Bio::Tree::Tree. This is why >> I said I don't know exactly where it would belong, it looks to me >> more like a standalone script than a Bio::Taxonomy or Bio::Tree >> method. >> >> Gabriel > > Agreed. You could submit the script as an example here if it is > short, or > via Bugzilla as an enhancement request: > > http://bugzilla.open-bio.org/ > > It could be added to the scripts\ or examples\ directory in bioperl- > core. Here it is. Please check it and include for instance as taxonomy2tree.PLS in the scripts/tree or scripts/taxonomy directory. Disclaimer: I'm also publishing part of this code in a conference paper. The script is already fully functional but anyway, I have a couple of improvements in mind. The minor one is provision for cmdline input. How would you like to input an array of names? 
The other one is to remove internal node labels and contract elementary paths, for instance reducing the tree: (((((((((((((((((((((((((((("Pongo pygmaeus")Pongo,(("Gorilla gorilla")Gorilla,("Pan troglodytes")Pan,("Homo sapiens")Homo)"Homo/ Pan/Gorilla group")Hominidae)Hominoidea)Catarrhini)Simiiformes) Primates)Euarchontoglires)Eutheria)Theria)Mammalia)Amniota)Tetrapoda) Sarcopterygii)Euteleostomi)Teleostomi)"Gnathostomata ") Vertebrata)"Craniata ")Chordata)Deuterostomia)Coelomata) Bilateria)Eumetazoa)Metazoa)"Fungi/Metazoa group")Eukaryota)"cellular organisms")root; to the tree: ("Pongo pygmaeus",("Gorilla gorilla","Pan troglodytes","Homo sapiens")); It is certainly easy to remove all internal node labels. On the other hand, I've been working on contraction of elementary paths for quite a while, but always got stuck with internals of the Bio::Tree methods to remove nodes. Thus, so far the only working code I have consists of removing elementary branches while making a deep copy of the tree, which certainly is not quite elegant... 
Thanks a lot,

Gabriel

#!/usr/bin/perl -w

# Author:  Gabriel Valiente
# Purpose: Bio::DB::Taxonomy -> Bio::Tree::Tree

use strict;
use Bio::DB::Taxonomy;
use Bio::Tree::Node;
use Bio::Tree::Tree;
use Bio::TreeIO;

my $nodesfile = "nodes.dmp";
my $namesfile = "names.dmp";

my $db = new Bio::DB::Taxonomy(-source    => 'flatfile',
                               -directory => "./db/",
                               -nodesfile => $nodesfile,
                               -namesfile => $namesfile);

# the input to the script is an array of species names

my @species = ('Orangutan', 'Gorilla', 'Chimpanzee', 'Human');

my $root = new Bio::Tree::Node(-id => "root");
my $tree = new Bio::Tree::Tree(-root => $root);

# the full lineages of the species are merged into a tree

for my $name (@species) {
    my $ncbi_id = $db->get_taxonid($name);
    if ($ncbi_id) {
        my $node = $db->get_Taxonomy_Node(-taxonid => $ncbi_id);
        my @lineage = get_lineage_nodes($node);
        shift @lineage; # discard root
        push @lineage, $node;
        merge_path($root, \@lineage);
    } else {
        warn "no NCBI Taxonomy node for species ",$name,"\n";
    }
}

# the tree is output in Newick format

my $output = new Bio::TreeIO(-format => 'newick');
$output->write_tree($tree);

# the actual merging of full lineages is performed by a recursive method

sub merge_path {
    my $root = shift;
    my $path = shift;
    my @path = @{$path};
    if (@path) {
        my $top = shift @path;
        my @children = grep { $_->id eq $top->node_name } $root->each_Descendent;
        if (@children) { # $root has a $child with id eq $top name
            my $child = shift @children;
            merge_path($child,\@path);
        } else { # add $top and @path below $root
            my $node = $root;
            unshift @path, $top;
            while (@path) {
                my $top = shift @path;
                my $name = $top->node_name;
                my $child = new Bio::Tree::Node(-id => "$name");
                $node->add_Descendent($child);
                $node = $child;
            }
        }
    }
}

# the full lineage of a species is recovered by traversing the taxonomy

sub get_lineage_nodes {
    my $node = shift;
    my @lineage;
    while ($node->node_name ne "root") {
        $node = $node->get_Parent_Node;
        unshift @lineage, $node;
    }
    return @lineage;
}

=head1 NAME

taxonomy2tree - builds a taxonomic tree based on
the full lineages of a set of species names

=head1 DESCRIPTION

This script requires that the bioperl-run pkg also be installed.

Providing the nodes.dmp and names.dmp files from the NCBI Taxonomy
dump (see Bio::DB::Taxonomy::flatfile for more info) is only necessary
the first time the script is run. This will create the local indexes and
may take quite a long time. However, once created, these indexes will
allow fast access for species to taxon id OR taxon id to species name
lookups.

=cut

From MEC at stowers-institute.org Fri Jul 28 16:44:43 2006
From: MEC at stowers-institute.org (Cook, Malcolm)
Date: Fri, 28 Jul 2006 11:44:43 -0500
Subject: [Bioperl-l] Getting sequences by base pair locations
Message-ID:

There are many options.

But it looks like you only have start and end coordinates! How do you know which chromosome/contig the hit was on?

Assuming you have this, if you did the blat with a local copy of the blat program and the genome, then in addition to the blat command, you have the twoBitToFa command, which can extract the hits from the blat index (see http://genome.ucsc.edu/goldenPath/help/blatSpec.html )

Or did you do the blat at UCSC?

Malcolm Cook
Database Applications Manager, Bioinformatics
Stowers Institute for Medical Research

oh - I replied similarly in the Bio BB forum, but it is held for moderation so am replying here as well

________________________________

From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Yuval Itan
Sent: Friday, July 28, 2006 7:08 AM
To: bioperl-l at lists.open-bio.org
Subject: [Bioperl-l] Getting sequences by base pair locations

Hello all,

I was BLATing a few hundred human genes against the chimp genome, and kept the best chimp hits for every human gene. I have the base pair start and end location for every chimp hit, and I need to get the sequence for each of these chimp hits.
Here is an example of the bp locations for a few chimp hits:

Start     End
142854    144504
154479    155198
153066    167370
163146    163559

I have one chimp genome file (about 3GB) including all chromosomes, but I could also get one file per chromosome if that would make things easier.

Does anyone have a script or a link for an interface that can do the job?

Thank you very much.

From osborne1 at optonline.net Fri Jul 28 17:25:12 2006
From: osborne1 at optonline.net (Brian Osborne)
Date: Fri, 28 Jul 2006 13:25:12 -0400
Subject: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields)
In-Reply-To: <3DB992C6-DF16-42B9-8C36-F3B5C8CCBDE7@lsi.upc.edu>
Message-ID:

Gabriel,

It looks like most of the Bioperl scripts use Getopt::Long. Its documentation says, in part:

  Options can take multiple values at once, for example

    --coordinates 52.2 16.4 --rgbcolor 255 255 149

  This can be accomplished by adding a repeat specifier to the option
  specification. Repeat specifiers are very similar to the {...} repeat
  specifiers that can be used with regular expression patterns. For
  example, the above command line would be handled as follows:

    GetOptions('coordinates=f{2}' => \@coor, 'rgbcolor=i{3}' => \@color);

So the arguments are space-delimited on the command line. Is the problem that the names can be binomial?

Brian O.

On 7/28/06 12:35 PM, "Gabriel Valiente" wrote:

> The minor one is provision for cmdline input.
> How would you like to input an array of names?

From golharam at umdnj.edu Fri Jul 28 18:03:39 2006
From: golharam at umdnj.edu (Ryan Golhar)
Date: Fri, 28 Jul 2006 14:03:39 -0400
Subject: [Bioperl-l] Bio::Align::DNAStatistics module has errors?
Message-ID: <01a701c6b270$28232130$2f01a8c0@GOLHARMOBILE1>

This is from the description:

This object contains routines for calculating various statistics and distances for DNA alignments. The routines are not well tested and do contain errors at this point. Work is underway to correct them, but do not expect this code to give you the right answer currently!
Use dnadist/distmat in the PHYLIP or EMBOSS packages to calculate the distances.

Any idea what the errors are and what is/is not usable?

From lzhtom at hotmail.com Sat Jul 29 02:00:23 2006
From: lzhtom at hotmail.com (zhihua li)
Date: Sat, 29 Jul 2006 02:00:23 +0000
Subject: [Bioperl-l] how to get annotations (especially ensembl IDs) for a list of genes?
Message-ID:

Hi all,

I have a list of like 300 genes (actually their refseq IDs). Now I wanna get more information (annotations) for each of the genes. Specifically, I want a mapping of the refseq IDs to Ensembl gene IDs.

I know how to do it through a web page. But I'm wondering if I can also do it via bioperl, by using some modules or packages. Can anyone help me out here?

Thanks a lot!

From jason.stajich at duke.edu Sat Jul 29 05:18:50 2006
From: jason.stajich at duke.edu (Jason Stajich)
Date: Fri, 28 Jul 2006 22:18:50 -0700
Subject: [Bioperl-l] Bio::Align::DNAStatistics module has errors?
Message-ID:

I think that msg was CYA by me at some point - I am pretty sure I made tests based on numbers from PHYLIP and EMBOSS but was hoping for someone else to help. At this point I have no reliable time to really work on it, but I hope someone who is interested in it will give it a whirl.

There may be some boundary cases that don't work where seqs are too short or have a zero number of a particular nt, but in general the nums should jive.

I am not sure about all the NG Ks and Ka as I didn't write those, but I believe Richard vetted them pretty well first.

There are a couple of methods not implemented too - am always hopeful other people will see this as a great starting point and roll up their sleeves to join in...

-jason
--
Jason Stajich
Duke University
http://www.duke.edu/~jes12

From bix at sendu.me.uk Sat Jul 29 07:25:38 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Sat, 29 Jul 2006 08:25:38 +0100
Subject: [Bioperl-l] how to get annotations (especially ensembl IDs) for a list of genes?
In-Reply-To: References: Message-ID: <44CB0D72.20104@sendu.me.uk>

zhihua li wrote:
> Hi all,
>
> I have a list of like 300 genes (actually their refseq IDs). Now I
> wanna get more information (annotations) for each of the genes.
> Specifically, I want a mapping of the refseq IDs to Ensembl gene IDs.
>
> I know how to do it through a web page. But I'm wondering if I can also
> do it via bioperl

One possible way is to use the Ensembl perl API:
http://www.ensembl.org/info/software/core/core_tutorial.html

You'd get a gene or transcript adaptor and use fetch_all_by_external_name() iirc. I'm aware that not every entrez id can be mapped that way, but perhaps most if not all refseqs will work.

From bix at sendu.me.uk Sat Jul 29 07:54:52 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Sat, 29 Jul 2006 08:54:52 +0100
Subject: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields)
In-Reply-To: <001301c6b24b$da38ba80$15327e82@pyrimidine>
References: <001301c6b24b$da38ba80$15327e82@pyrimidine>
Message-ID: <44CB144C.6050507@sendu.me.uk>

Chris Fields wrote:
>
> As for branch lengths, I think you're confusing 'taxonomy' (classification
> of organisms based on just about anything) with 'phylogeny' (evolutionary
> relatedness). Note in the Wikipedia article below the use of the term
> 'phylogenetic taxonomy', which is the classification of organisms based on
> evolutionary relationships.
>
> http://en.wikipedia.org/wiki/Taxonomy
>
> http://en.wikipedia.org/wiki/Phylogeny

Indeed. The two can be considered closely intertwined - if you were making a phylogeny you might hang it on a taxonomy. At any rate, you need to know a bunch of evolutionarily related species names before you start work, and Bio::Taxonomy::Node has been as good a place as any to get that.
> There are HOWTOs on tree manipulation, population genetics, and PAML on the
> wiki which might be a good start for Bioperl phylogenetic methods:
>
> http://www.bioperl.org/wiki/HOWTO:Trees

Which is why the Trees HOWTO talks about taxa, and a number of the Taxonomy modules have phylogenetic methods like get_lca. (And why there is Bio::Taxonomy::Taxon and Tree.)

I suppose this is another reason to make Bio::Taxonomy::Node (ne Bio::Taxon) implement Bio::Tree::NodeI.

(For these reasons I don't think Gabriel's method is best appropriate as a script - it's something you might do all the time, as a matter of course. If Bio::Taxon was a Bio::Tree::NodeI you would just do

  my $tree = new Bio::Tree::Tree(-root => $bio_taxon);

and blamo, instant phylogenetic taxonomy.)

From cjfields at uiuc.edu Sat Jul 29 11:49:29 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Sat, 29 Jul 2006 06:49:29 -0500
Subject: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields)
In-Reply-To: <44CB144C.6050507@sendu.me.uk>
References: <001301c6b24b$da38ba80$15327e82@pyrimidine> <44CB144C.6050507@sendu.me.uk>
Message-ID: <5C765CAF-4976-45D1-BDA4-1E432081FF4D@uiuc.edu>

On Jul 29, 2006, at 2:54 AM, Sendu Bala wrote:

> Chris Fields wrote:
>>
>> As for branch lengths, I think you're confusing 'taxonomy' (classification
>> of organisms based on just about anything) with 'phylogeny' (evolutionary
>> relatedness). Note in the Wikipedia article below the use of the term
>> 'phylogenetic taxonomy', which is the classification of organisms based on
>> evolutionary relationships.
>>
>> http://en.wikipedia.org/wiki/Taxonomy
>>
>> http://en.wikipedia.org/wiki/Phylogeny
>
> Indeed. The two can be considered closely intertwined - if you were
> making a phylogeny you might hang it on a taxonomy. At any rate, you
> need to know a bunch of evolutionarily related species names before you
> start work, and Bio::Taxonomy::Node has been as good a place as any to
> get that.
Intertwined, yes, but not exactly the same. Hence the NCBI disclaimer I mentioned: How to reference the NCBI taxonomy database The NCBI taxonomy database is not a primary source for taxonomic or phylogenetic information. Furthermore, the database does not follow a single taxonomic treatise but rather attempts to incorporate phylogenetic and taxonomic knowledge from a variety of sources, including the published literature, web-based databases, and the advice of sequence submitters and outside taxonomy experts. Consequently, the NCBI taxonomy database is not a phylogenetic or taxonomic authority and should not be cited as such. >> There are HOWTOs on tree manipulation, population genetics, and >> PAML on the >> wiki which might be a good start for Bioperl phylogenetic methods: >> >> http://www.bioperl.org/wiki/HOWTO:Trees > > Which is why the Trees HOWTO talks about taxa, and a number of the > Taxonomy modules have phylogenetic methods like get_lca. (And why > there > is Bio::Taxonomy::Taxon and Tree.) Are we still thinking about deprecating those? I have seen very little mention of those modules from the mail list archives, and Jason mentioned that Bio::Taxonomy::Taxon hasn't been modified in a long time. > I suppose this is another reason to make Bio::Taxonomy::Node (ne > Bio::Taxon) implement Bio::Tree::NodeI. > > (for these reasons I don't think Gabriel's method isn't best > appropriate > as a script - it's something you might do all the time, as a matter of > course. If Bio::Taxon wasa Bio::Tree::NodeI you would just do my > $tree = > new Bio::Tree::Tree(-root => $bio_taxon); and blamo, instant > phylogenetic taxonomy) Brian already deposited the script (see bioperl-guts). You could use it for the methods, of course noting Gabriel's contribution. Sounds like a good plan to me ; > Christopher Fields Postdoctoral Researcher Lab of Dr. 
Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign

From nabil at broad.mit.edu Sun Jul 30 04:28:00 2006
From: nabil at broad.mit.edu (Nabil Hafez)
Date: Sun, 30 Jul 2006 00:28:00 -0400
Subject: [Bioperl-l] Strangeness in parsing blast file
Message-ID: <44CC3550.5070105@broad.mit.edu>

Hi,

I am having a somewhat similar problem to what was posted in http://bioperl.org/pipermail/bioperl-l/2006-May/021416.html. However, I have read through all of that thread and I don't believe what I am experiencing is the exact same problem. I also realize that Bioperl version 1.5.1 fixes a problem with blast parsing.

My problem: my blast results file parses fine and everything works swimmingly if I pass the blast output file by name, such as

  $blast_result = 'test.blastout';

however, when I do

  $blast_result = &do_blast($sample_fasta);

even though in both cases $blast_result evaluates to "test.blastout", the parsing doesn't work. More specifically, it gets an undefined value for $result in

  while( my $result = $report_obj->next_result ) {

Sorry for the long email - any help would be appreciated,

Thanks - Nabil

The code... non-relevant parts trimmed for size constraints... debugging from working and non-working versions after the code.

use strict;
use Bio::SearchIO;
use Getopt::Std;
use List::Util qw(shuffle);
use Benchmark;

my ($inputfile, $samplefile, $blastfile, $blast_result, $blast_report, $blast_verbose); #files generated

#------------------#
# Subroutine Calls #
#------------------#
my $test = &create_sample_file($inputfile); #inputfile being a fasta file containing nucleotide sequence
$blast_result = &do_blast($test);
##$blast_result = 'test.blastout'; #when this is uncommented and replaces the previous two lines, with test.blastout being normal blast output - the script works fine.
&parse_blast($blast_result);

#######################
# create_sample_file
#
# Input: Original Fasta File
#
# Output: Fasta file containing randomly sampled reads
#
#
sub create_sample_file {
    my $in = shift;
    my $linecount = 0;
    my @lines;
    $samplefile = $in . "_sample";

    #Determine total # of reads in input fasta
    $totalreads = `$grep -c '>' $inputfile`;
    $totalreads =~ s/\s+//;
    chomp $totalreads;

    if ($totalreads > 1000) { #sample if more than 1000 reads
        $sample_reads = sprintf("%.0f", $totalreads * ($per_to_sample/100)); #number of reads to sample
    }
    else { #otherwise use all reads
        $sample_reads = $totalreads;
    }

    $/ = '>'; #define fasta record input separator
    open (IN, "<$in") or die "Cannot open $in $!\n";
    open (OUT, ">$samplefile") or die "Cannot open $samplefile $!\n";

    while (<IN>) { #read lines into an array
        chomp;
        push (@lines, $_);
    }

    @lines = shuffle(@lines); #shuffle array

    foreach (@lines) {
        print OUT ">$_" if $linecount <= $sample_reads; #output to file sampled number of reads
        $linecount++;
    }
    close IN;
    close OUT;
    return $samplefile;
}#end create_sample_file

#######################
# do_blast
#
# Input: Fasta File containing SCREENED sampled reads
#
# Output: Blast File
#
#
sub do_blast {
    my $bf = shift;
    my $blastoutput = $bf . ".blastout";
    print "Blasting against $db ...\n";
    `blast/blast-2.2.10/bin/megablast -d /prodinfo/proddata_ntblastdb/nt -e 1e-50 -p 99 -D 2 -i test -o test.blastout`;
    return $blastoutput;
}#end do_blast

#######################
# parse_blast
#
# Input: Blast file
#
# Output: Creates hash containing best hit for each read
#
#
sub parse_blast {
    my $blastoutfile = shift;

    if (!
        -e $blastoutfile) {
        die "$blastoutfile does not exist $!\n";
    }

    print "Parsing blast hits ...\n";

    my $report_obj = new Bio::SearchIO(-verbose => 1,
                                       -format  => 'blast',
                                       -file    => $blastoutfile);

    die "no valid $report_obj" unless defined $report_obj;

    while( my $result = $report_obj->next_result ) {
        die "no valid $result" unless defined $result;
        while( my $hit = $result->next_hit ) {
            while( my $hsp = $hit->next_hsp ) {
                my $name = $result->query_name;
                my $hitDesc = $hit->description;
                my $length = $hsp->length('total');
                my $per_id = sprintf("%.2f", $hsp->percent_identity);
                my $eval = $hsp->evalue;
                next if (defined $blast_results{$name} && $blast_results{$name}->[0] > $length); #only keep best hit for any read
                $blast_results{$name} = [$length, $per_id, $eval, $hitDesc]; #store in hash of arrays
            }
        }
    }
} #end parse_blast

Debug of NON-working blast-parse:

main::(454/scripts/fasta_blasta_mb.pl:151):
151:    my $sample_fasta = &create_sample_file($inputfile);
  DB<2> n
main::(454/scripts/fasta_blasta_mb.pl:152):
152:    $blast_result = &do_blast($sample_fasta);
  DB<2> x $sample_fasta
0  'G782.2005-08-16-16-48.fasta_sample'
  DB<3> n
Blasting against NT ...
main::(454/scripts/fasta_blasta_mb.pl:154):
154:    &parse_blast($blast_result);
  DB<3> s
main::parse_blast(454/scripts/fasta_blasta_mb.pl:293):
293:    my $blastoutfile = shift;
  DB<3> s
main::parse_blast(454/scripts/fasta_blasta_mb.pl:295):
295:    if (! -e $blastoutfile) {
  DB<3> x $blastoutfile
0  'G782.2005-08-16-16-48.fasta_sample.blastout'
  DB<4> s
main::parse_blast(454/scripts/fasta_blasta_mb.pl:299):
299:    print "Parsing blast hits ...\n";
  DB<4> s
Parsing blast hits ...
main::parse_blast(454/scripts/fasta_blasta_mb.pl:302): 302: my $report_obj = new Bio::SearchIO(-verbose => 1, 303: -format => 'blast', 304: -file => $blastoutfile);#or die "Could not open blast report $!"; DB<4> s Bio::SearchIO::new(/util/lib/perl5/site_perl/5.8.0/Bio/SearchIO.pm:129): 129: my($caller, at args) = @_; DB<4> r scalar context return from Bio::SearchIO::new: '_file' => 'G782.2005-08-16-16-48.fasta_sample.blastout' '_filehandle' => GLOB(0x8cef40c) -> *Symbol::GEN1 FileHandle({*Symbol::GEN1}) => fileno(3) '_flush_on_write' => 1 '_handler' => Bio::SearchIO::IteratedSearchResultEventBuilder=HASH(0x9505724) '_factories' => HASH(0x95054c0) 'hit' => Bio::Factory::ObjectFactory=HASH(0x95017b8) '_loaded_types' => HASH(0x9506c0c) 'Bio::Search::Hit::BlastHit' => 1 '_root_verbose' => 0 'interface' => 'Bio::Search::Hit::HitI' 'type' => 'Bio::Search::Hit::BlastHit' 'hsp' => Bio::Factory::ObjectFactory=HASH(0x9500e10) '_loaded_types' => HASH(0x9506c18) 'Bio::Search::HSP::GenericHSP' => 1 '_root_verbose' => 0 'interface' => 'Bio::Search::HSP::HSPI' 'type' => 'Bio::Search::HSP::GenericHSP' 'iteration' => Bio::Factory::ObjectFactory=HASH(0x9506c60) '_loaded_types' => HASH(0x9506af8) 'Bio::Search::Iteration::GenericIteration' => 1 '_root_verbose' => 0 'interface' => 'Bio::Search::Iteration::IterationI' 'type' => 'Bio::Search::Iteration::GenericIteration' 'result' => Bio::Factory::ObjectFactory=HASH(0x9504c80) '_loaded_types' => HASH(0x9501f74) 'Bio::Search::Result::BlastResult' => 1 '_root_verbose' => 0 'interface' => 'Bio::Search::Result::ResultI' 'type' => 'Bio::Search::Result::BlastResult' '_inclusion_threshold' => 0.001 '_root_verbose' => 1 '_handler_cache' => Bio::SearchIO::IteratedSearchResultEventBuilder=HASH(0x9505724) -> REUSED_ADDRESS '_notfirsttime' => 0 '_reporttype' => '' '_root_cleanup_methods' => ARRAY(0x8cde434) 0 CODE(0x82a9aec) -> &Bio::Root::IO::_io_cleanup in /util/lib/perl5/site_perl/5.8.0/Bio/Root/IO.pm:544-570 1 CODE(0x82a9aec) -> REUSED_ADDRESS 
'_root_verbose' => 1 main::parse_blast(454/scripts/fasta_blasta_mb.pl:307): 307: die "no valid $report_obj" unless defined $report_obj; DB<4> s main::parse_blast(454/scripts/fasta_blasta_mb.pl:310): 310: while( my $result = $report_obj->next_result ) { DB<4> s Bio::SearchIO::blast::next_result(/util/lib/perl5/site_perl/5.8.0/Bio/SearchIO/blast.pm:389): 389: my ($self) = @_; DB<4> r scalar context return from Bio::SearchIO::blast::next_result: undef Bio::SearchIO::DESTROY(/util/lib/perl5/site_perl/5.8.0/Bio/SearchIO.pm:438): 438: my $self = shift; DB<4> r scalar context return from Bio::SearchIO::DESTROY: '' Bio::Root::Root::DESTROY(/util/lib/perl5/site_perl/5.8.0/Bio/Root/Root.pm:404): 404: my $self = shift; DB<4> r scalar context return from Bio::Root::Root::DESTROY: undef Bio::Root::Root::DESTROY(/util/lib/perl5/site_perl/5.8.0/Bio/Root/Root.pm:404): 404: my $self = shift; DB<4> r scalar context return from Bio::Root::Root::DESTROY: undef Bio::Root::Root::DESTROY(/util/lib/perl5/site_perl/5.8.0/Bio/Root/Root.pm:404): 404: my $self = shift; DB<4> r scalar context return from Bio::Root::Root::DESTROY: undef Bio::Root::Root::DESTROY(/util/lib/perl5/site_perl/5.8.0/Bio/Root/Root.pm:404): 404: my $self = shift; DB<4> r scalar context return from Bio::Root::Root::DESTROY: undef Bio::Root::Root::DESTROY(/util/lib/perl5/site_perl/5.8.0/Bio/Root/Root.pm:404): 404: my $self = shift; DB<4> r scalar context return from Bio::Root::Root::DESTROY: undef main::(454/scripts/fasta_blasta_mb.pl:155): 155: &output_results(); DB<4> x $result 0 undef Debug of WORKING blast-parse: Parsing blast hits ... 
main::parse_blast(454/scripts/fasta_blasta_mb.pl:302): 302: my $report_obj = new Bio::SearchIO(-verbose => 1, 303: -format => 'blast', 304: -file => $blastoutfile);#or die "Could not open blast report $!"; DB<3> s Bio::SearchIO::new(/util/lib/perl5/site_perl/5.8.0/Bio/SearchIO.pm:129): 129: my($caller, at args) = @_; DB<3> r scalar context return from Bio::SearchIO::new: '_file' => 'G782.2005-08-16-16-48.fasta_sample.blastout' '_filehandle' => GLOB(0x8763100) -> *Symbol::GEN1 FileHandle({*Symbol::GEN1}) => fileno(3) '_flush_on_write' => 1 '_handler' => Bio::SearchIO::IteratedSearchResultEventBuilder=HASH(0x8ab3be4) '_factories' => HASH(0x8ab1594) 'hit' => Bio::Factory::ObjectFactory=HASH(0x8a7b7c0) '_loaded_types' => HASH(0x8abee10) 'Bio::Search::Hit::BlastHit' => 1 '_root_verbose' => 0 'interface' => 'Bio::Search::Hit::HitI' 'type' => 'Bio::Search::Hit::BlastHit' 'hsp' => Bio::Factory::ObjectFactory=HASH(0x8a87200) '_loaded_types' => HASH(0x8abee1c) 'Bio::Search::HSP::GenericHSP' => 1 '_root_verbose' => 0 'interface' => 'Bio::Search::HSP::HSPI' 'type' => 'Bio::Search::HSP::GenericHSP' 'iteration' => Bio::Factory::ObjectFactory=HASH(0x8abee64) '_loaded_types' => HASH(0x8abecfc) 'Bio::Search::Iteration::GenericIteration' => 1 '_root_verbose' => 0 'interface' => 'Bio::Search::Iteration::IterationI' 'type' => 'Bio::Search::Iteration::GenericIteration' 'result' => Bio::Factory::ObjectFactory=HASH(0x8a81a84) '_loaded_types' => HASH(0x8a96ce8) 'Bio::Search::Result::BlastResult' => 1 '_root_verbose' => 0 'interface' => 'Bio::Search::Result::ResultI' 'type' => 'Bio::Search::Result::BlastResult' '_inclusion_threshold' => 0.001 '_root_verbose' => 1 '_handler_cache' => Bio::SearchIO::IteratedSearchResultEventBuilder=HASH(0x8ab3be4) -> REUSED_ADDRESS '_notfirsttime' => 0 '_reporttype' => '' '_root_cleanup_methods' => ARRAY(0x8762efc) 0 CODE(0x82a9aec) -> &Bio::Root::IO::_io_cleanup in /util/lib/perl5/site_perl/5.8.0/Bio/Root/IO.pm:544-570 1 CODE(0x82a9aec) -> REUSED_ADDRESS 
'_root_verbose' => 1 main::parse_blast(454/scripts/fasta_blasta_mb.pl:307): 307: die "no valid $report_obj" unless defined $report_obj; DB<3> s main::parse_blast(454/scripts/fasta_blasta_mb.pl:310): 310: while( my $result = $report_obj->next_result ) { DB<3> s Bio::SearchIO::blast::next_result(/util/lib/perl5/site_perl/5.8.0/Bio/SearchIO/blast.pm:389): 389: my ($self) = @_; DB<3> r blast.pm: unrecognized line Reference: Zheng Zhang, Scott Schwartz, Lukas Wagner, and Webb Miller (2000), blast.pm: unrecognized line "A greedy algorithm for aligning DNA sequences", blast.pm: unrecognized line J Comput Biol 2000; 7(1-2):203-14. blast.pm: unrecognized line Score E Got NCBI HSP score=354, evalue 0.0 scalar context return from Bio::SearchIO::blast::next_result: '_algorithm' => 'MEGABLAST' '_algorithm_version' => '2.2.10 [Oct-19-2004]' '_dbentries' => 4249067 '_dbletters' => 17735149364 '_dbname' => 'All GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS,GSS,environmental samples or phase 0, 1 or 2 HTGS sequences) ' '_hitindex' => 0 '_hits' => ARRAY(0x8b2acd0) empty array '_inclusion_threshold' => 0.001 '_iteration_count' => 1 '_iteration_index' => 0 '_iterations' => ARRAY(0x8b2ac4c) 0 Bio::Search::Iteration::GenericIteration=HASH(0x8b1cacc) '_newhits_below_threshold' => ARRAY(0x8b1ca84) 0 Bio::Search::Hit::BlastHit=HASH(0x8b1cf64) '_accession' => 'AE004091' '_algorithm' => 'MEGABLAST' '_description' => 'Pseudomonas aeruginosa PAO1, complete genome' '_hsps' => ARRAY(0x8b1ceb0) 0 Bio::Search::HSP::GenericHSP=HASH(0x8b2098c) '_algorithm' => 'MEGABLAST' '_frac_conserved' => HASH(0x8b266a0) 'hit' => 0.991803278688525 'query' => 0.991803278688525 'total' => 0.991803278688525 '_frac_identical' => HASH(0x8b2658c) 'hit' => 0.991803278688525 'query' => 0.991803278688525 'total' => 0.991803278688525 '_gaps' => HASH(0x8b24d94) 'hit' => 0 'query' => 0 'total' => 0 '_gsf_tag_hash' => HASH(0x8b20998) empty hash '_hit_string' => 
'cctgacctccgctcaactgcgcaaatacgccagcgccggtcggccgttccccgaagggcgcctgctggccgcctcctgccacgacgcggaggaactggccctggctgcctcgatgggagtggagttcgtcaccctttcgccggtacagccgaccgagagccatcccggcgagccggcgctgggttgggacaaggccgccgaactgatcgccggcttcaaccagccggtctacctgctgggtggcctcggtccgcagcaagccgagcaggcttgggagcatggagcccagggcgtggcgggtatccgtgcgttctggccgggcggcctttgacggtggaatgaagaaaaaaggaggcttcggcctcc' '_homology_string' => '|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ||||||||||||||||||||||||||||||||||||||||| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||' etc...... From torsten.seemann at infotech.monash.edu.au Sun Jul 30 05:41:30 2006 From: torsten.seemann at infotech.monash.edu.au (Torsten Seemann) Date: Sun, 30 Jul 2006 15:41:30 +1000 Subject: [Bioperl-l] Strangeness in parsing blast file In-Reply-To: <44CC3550.5070105@broad.mit.edu> References: <44CC3550.5070105@broad.mit.edu> Message-ID: <44CC468A.40700@infotech.monash.edu.au> > sub do_blast { > my $bf = shift; > my $blastoutput = $bf . ".blastout"; > print "Blasting against $db ...\n"; > `blast/blast-2.2.10/bin/megablast -d /prodinfo/proddata_ntblastdb/nt > -e 1e-50 -p 99 -D 2 -i test -o test.blastout`; > return $blastoutput; > }#end do_blast Should "-o test.blastoutput" be "-o $blastoutput" ? Otherwise you are returning the name of a non-existent file, which naturally Bio::SearchIO will not be able to find a blast result in. Alternatively use Bio::Tools::Run::StandaloneBlast to invoke megablast rather than back-ticks - that way you avoid any intermediate file and get a Bio::SearchIO object back directly. 
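The point about returning the name of a file that was never written can be guarded against in code. The following is only a minimal sketch, not the poster's script: run_and_check is a hypothetical helper name, and the megablast path and flags in the usage comment are simply the ones quoted above. It runs the command with system() in list form, checks the exit status, and refuses to return an output file name unless that file exists and is non-empty.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): run an external
# command given as a list, then confirm that the expected output file
# exists and is non-empty before handing its name to a parser.
sub run_and_check {
    my ($outfile, @cmd) = @_;

    # system() in list form bypasses the shell and, unlike backticks in
    # void context, does not capture output that is then discarded.
    my $status = system(@cmd);
    die "Could not run '$cmd[0]': $!\n" if $status == -1;
    die "Command '@cmd' failed with status $status\n" if $status != 0;

    # -s is true only if the file exists and has a size greater than zero.
    die "Expected output file '$outfile' is missing or empty\n"
        unless -s $outfile;

    return $outfile;
}

# Usage sketch with the command line quoted above (assumes $bf holds the
# name of the input fasta file):
#
#   my $blastoutput = $bf . ".blastout";
#   run_and_check($blastoutput,
#                 'blast/blast-2.2.10/bin/megablast',
#                 '-d', '/prodinfo/proddata_ntblastdb/nt',
#                 '-e', '1e-50', '-p', 99, '-D', 2,
#                 '-i', $bf, '-o', $blastoutput);
```

Because the output name is interpolated once and passed both to the program and to the check, the name returned and the name written cannot drift apart.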
--Torsten

From nabil at broad.mit.edu Sun Jul 30 14:11:03 2006
From: nabil at broad.mit.edu (Nabil Hafez)
Date: Sun, 30 Jul 2006 10:11:03 -0400
Subject: [Bioperl-l] Strangeness in parsing blast file
In-Reply-To: <44CC468A.40700@infotech.monash.edu.au>
References: <44CC3550.5070105@broad.mit.edu> <44CC468A.40700@infotech.monash.edu.au>
Message-ID: <44CCBDF7.2010601@broad.mit.edu>

I had modified the variables a bit to try and make them more readable than what is in my code; in my code, -o $blastoutput is what it is. Like I said, the blast portion works absolutely fine - i.e. the do_blast subroutine is fully functional.

Here's a cut and paste from my actual code:

my $MBLAST = "/prodinfo/prod3pty/blast/blast-2.2.10/bin/megablast";
my $blastdb = "/prodinfo/proddata_ntblastdb/nt";
my $e_val = "1e-50";        #Default e-value Getopt_long
my $percent_id = "99";      #Default percentage identity
my $per_to_sample ="10";    #Default for percentage of reads to sample

sub do_blast {
    my $bf = shift;
    my $blastoutput = $bf . ".blastout";
    print "Blasting against $db ...\n";
    `$MBLAST -d $blastdb -e $e_val -p $percent_id -D 2 -i $bf -o $blastoutput`;
    return $blastoutput;
}

I will try your suggestion to use the Bio::Tools::Run::StandaloneBlast; is megablast supported by this module?

Thanks
Nabil

Torsten Seemann wrote:

>
>> sub do_blast {
>>     my $bf = shift;
>>     my $blastoutput = $bf . ".blastout";
>>     print "Blasting against $db ...\n";
>>     `blast/blast-2.2.10/bin/megablast -d
>> /prodinfo/proddata_ntblastdb/nt -e 1e-50 -p 99 -D 2 -i test -o
>> test.blastout`;
>
> > return $blastoutput;
> > }#end do_blast
>
> Should "-o test.blastoutput" be "-o $blastoutput" ?
>
> Otherwise you are returning the name of a non-existent file, which
> naturally Bio::SearchIO will not be able to find a blast result in.
>
> Alternatively use Bio::Tools::Run::StandaloneBlast to invoke megablast
> rather than back-ticks - that way you avoid any intermediate file and
> get a Bio::SearchIO object back directly.
> > --Torsten > From bix at sendu.me.uk Sun Jul 30 16:20:54 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Sun, 30 Jul 2006 17:20:54 +0100 Subject: [Bioperl-l] Strangeness in parsing blast file In-Reply-To: <44CCBDF7.2010601@broad.mit.edu> References: <44CC3550.5070105@broad.mit.edu> <44CC468A.40700@infotech.monash.edu.au> <44CCBDF7.2010601@broad.mit.edu> Message-ID: <44CCDC66.2030604@sendu.me.uk> Nabil Hafez wrote: > I had modified the variables a bit to try and make them more readable > than what is in my code, in my code -o $blastoutput is > what it is, like I said, the blast portion works absolutely fine - i.e. > the do_blast sub routine is fully functional. How do you know? > `$MBLAST -d $blastdb -e $e_val -p $percent_id -D 2 -i $bf -o > $blastoutput`; Does this command definitely produce exactly the same file as the one you use to show that parse_blast() does sometimes work (when you avoid using do_blast())? Btw, http://perldoc.perl.org/perlfaq8.html#What's-wrong-with-using-backticks-in-a-void-context%3f > I will try your suggestion to use the Bio::Tools::Run::StandaloneBlast, > is megablast supported by this module? No, it doesn't. You could cheat and call _runblast() directly (give it an executable string and a string of args to megablast), and provide -outfile to new(). From nabil at broad.mit.edu Mon Jul 31 00:13:16 2006 From: nabil at broad.mit.edu (Nabil Hafez) Date: Sun, 30 Jul 2006 20:13:16 -0400 Subject: [Bioperl-l] Strangeness in parsing blast file In-Reply-To: <44CCDC66.2030604@sendu.me.uk> References: <44CC3550.5070105@broad.mit.edu> <44CC468A.40700@infotech.monash.edu.au> <44CCBDF7.2010601@broad.mit.edu> <44CCDC66.2030604@sendu.me.uk> Message-ID: <44CD4B1C.5070907@broad.mit.edu> Sendu Bala wrote: >Nabil Hafez wrote: > > >>I had modified the variables a bit to try and make them more readable >>than what is in my code, in my code -o $blastoutput is >>what it is, like I said, the blast portion works absolutely fine - i.e. 
>>the do_blast sub routine is fully functional.
>>
>
>How do you know?
>

Because it creates a file containing all of the blast output. This works every time - a file is created with the blast output.

>> `$MBLAST -d $blastdb -e $e_val -p $percent_id -D 2 -i $bf -o
>>$blastoutput`;
>>
>
>Does this command definitely produce exactly the same file as the one
>you use to show that parse_blast() does sometimes work (when you avoid
>using do_blast())?
>

Yes - the exact same file. I produce the file with do_blast(), and when it fails to parse, the script ends but there is a blast output file created in my directory. If I re-run the script, just feeding in the name of the file that was created, it parses it just fine. So basically the parsing works whenever I feed it a blast output file by name, but it can't seem to parse the same file when it has just been created and then passed to the parse_blast() subroutine.

>Btw,
>http://perldoc.perl.org/perlfaq8.html#What's-wrong-with-using-backticks-in-a-void-context%3f
>

Good to know. Thanks.

>>I will try your suggestion to use the Bio::Tools::Run::StandaloneBlast,
>>is megablast supported by this module?
>>
>
>No, it doesn't. You could cheat and call _runblast() directly (give it
>an executable string and a string of args to megablast), and provide
>-outfile to new().
>

I still don't think the blast is a problem, since I get perfect blast output every time.
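Since the file on disk is the same in both runs, one thing that may be worth ruling out is state left behind by the earlier subroutine rather than the blast step itself. The posted create_sample_file() assigns $/ = '>' without localizing it, so the input record separator stays changed for the rest of the program, including any later line-by-line reads made while parsing the report. When the two subroutine calls are replaced by a hard-coded file name, that assignment never runs. This is an assumption about the cause, not a confirmed diagnosis. The sketch below uses a hypothetical helper, read_fasta_records, standing in for the reading loop, and shows how local confines the change to one subroutine.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the reading loop in create_sample_file():
# split FASTA text into records on '>' without leaking the changed
# input record separator to the rest of the program.
sub read_fasta_records {
    my ($fasta_text) = @_;

    # Read from an in-memory filehandle so the sketch needs no files.
    open(my $in, '<', \$fasta_text) or die "Cannot open in-memory file: $!\n";

    # 'local' restores the previous value of $/ when this sub returns.
    local $/ = '>';

    my @records;
    while (my $record = <$in>) {
        chomp $record;              # strips the trailing '>' separator
        next unless length $record; # skip the empty chunk before the first '>'
        push @records, $record;
    }
    close $in;
    return @records;
}

my @records = read_fasta_records(">read1\nACGT\n>read2\nGGCC\n");
print "records read: ", scalar(@records), "\n";
print "separator is still a newline\n" if $/ eq "\n";
```

With a plain $/ = '>' in place of the local, the final check would no longer hold, and every subsequent readline in the program would return '>'-delimited chunks instead of lines.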
From cjfields at uiuc.edu Mon Jul 31 02:52:16 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Sun, 30 Jul 2006 21:52:16 -0500
Subject: [Bioperl-l] Strangeness in parsing blast file
In-Reply-To: <44CD4B1C.5070907@broad.mit.edu>
References: <44CC3550.5070105@broad.mit.edu> <44CC468A.40700@infotech.monash.edu.au> <44CCBDF7.2010601@broad.mit.edu> <44CCDC66.2030604@sendu.me.uk> <44CD4B1C.5070907@broad.mit.edu>
Message-ID: <81C49D1F-0468-4B63-8D7A-09E1C48573F0@uiuc.edu>

As an aside, BLAST 2.2.13 or later cannot be parsed using Bioperl 1.5.1. You have to update to the latest bioperl-live (from CVS).

Chris

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign

From bix at sendu.me.uk Mon Jul 31 08:29:28 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Mon, 31 Jul 2006 09:29:28 +0100
Subject: [Bioperl-l] Bio::*Taxonomy* changes (Chris Fields)
In-Reply-To: <5C765CAF-4976-45D1-BDA4-1E432081FF4D@uiuc.edu>
References: <001301c6b24b$da38ba80$15327e82@pyrimidine> <44CB144C.6050507@sendu.me.uk> <5C765CAF-4976-45D1-BDA4-1E432081FF4D@uiuc.edu>
Message-ID: <44CDBF68.2040803@sendu.me.uk>

Chris Fields wrote:
> On Jul 29, 2006, at 2:54 AM, Sendu Bala wrote:
>
>>> http://www.bioperl.org/wiki/HOWTO:Trees
>> Which is why the Trees HOWTO talks about taxa, and a number of the
>> Taxonomy modules have phylogenetic methods like get_lca. (And why there
>> is Bio::Taxonomy::Taxon and Tree.)
>
> Are we still thinking about deprecating those?
> I have seen very little mention of those modules from the mail list
> archives, and Jason mentioned that Bio::Taxonomy::Taxon hasn't been
> modified in a long time.

Yes, they would both be redundant and nonsensical with the planned changes to Bio::Species.

From Xianjun.Dong at bccs.uib.no Mon Jul 31 11:55:59 2006
From: Xianjun.Dong at bccs.uib.no (Xianjun Dong)
Date: Mon, 31 Jul 2006 13:55:59 +0200
Subject: [Bioperl-l] PAML + Codeml problem..
In-Reply-To: 4A98ACB8EC146149872BAC9A132A582C277AC4@icex5.ic.ac.uk
Message-ID: <1154346960.6517.19.camel@lauvtre.ii.uib.no>

Hi,

I have a problem during running the Codeml Wiki-HOWTO code:

Here is the error message:

////////////////////////////////////////////////////////////////
[xianjund at lauvtre kaks]$ perl paml.pl test.fa

-------------------- WARNING ---------------------
MSG: There was an error - see error_string for the program output
STACK Bio::Tools::Run::Phylo::PAML::Codeml::run /Home/extern/xianjund/src/bioperl/bioperl-run/Bio/Tools/Run/Phylo/PAML/Codeml.pm:581
STACK toplevel paml.pl:61

------------- EXCEPTION: Bio::Root::NotImplemented -------------
MSG: Unknown format of PAML output
STACK: Error::throw
STACK: Bio::Root::Root::throw /usr/lib/perl5/site_perl/5.8.5/Bio/Root/Root.pm:328
STACK: Bio::Tools::Phylo::PAML::_parse_summary /usr/lib/perl5/site_perl/5.8.5/Bio/Tools/Phylo/PAML.pm:359
STACK: Bio::Tools::Phylo::PAML::next_result /usr/lib/perl5/site_perl/5.8.5/Bio/Tools/Phylo/PAML.pm:224
STACK: paml.pl:62
----------------------------------------------------------------
////////////////////////////////////////////////////////////////

My test sequence is:

>ENST00000361390
ATACCCATGGCCAACCTCCTACTCCTCATTGTACCCATTCTAATCGCAATGGCATTCCTAATGCTTACCGAACGAAAAATTCTAGGCTATATACAACTACGCAAAGGCCCCAACGTTGTAGGCCCCTACGGGCTACTACAACCCTTCGCTGACGCCATAAAACTCTTCACCAAAGAGCCCCTAAAACCCGCCACATCTACCATCACCCTCTACATCACCGCCCCGACCTTAGCTCTCACCATCGCTCTTCTACTATGAACCCCCCTCCCCATACCCAACCCCCTGGTCAACCTCAACCTAGGCCTCCTATTTATTCTAGCCACCTCTAGCCTAGCCGTTTACTCAATCCTCTGATCAGGGTGAGCATCAAACTCAAACTACGCCCTGATCGGCGCACTGCGAGCAGTAGCCCAAACAATCTCATATGAAGTCACCCTAGCCATCATTCTACTATCAACATTACTAATAAGTGGCTCCTTTAACCTCTCCACCCTTATCACAACACAAGAACACCTCTGATTACTCCTGCCATCATGACCCTTGGCCATAATATGATTTATCTCCACACTAGCAGAGACCAACCGAACCCCCTTCGACCTTGCCGAAGGGGAGTCCGAACTAGTCTCAGGCTTCAACATCGAATACGCCGCAGGCCCCTTCGCCCTATTCTTCATAGCCGAATACACAAACATTATTATAATAAACACCCTCACCACTACAATCTTCCTAGGAACAACATATGACGCACTCTCCCCTGAACTCTACACAACATATTTTGTCACCAAGACCCTACTTCTAACCTCCCTGTTCTTATGAATTCGAACAGCATACCCCCGATTCCGCTACGACCAACTCATACACCTCCTATGAAAAAACTTCCTACCACTCACCCTAGCATTACTTATATGATATGTCTCCATACCCATTACAATCTCCAGCATTCCCCCTCAAACCTAA >ENSMUST00000082392 GTGTTCTTTATTAATATCCTAACACTCCTCGTCCCCATTCTAATCGCCATAGCCTTCCTAACATTAGTAGAACGCAAAATCTTAGGGTACATACAACTACGAAAAGGCCCTAACATTGTTGGTCCATACGGCATTTTACAACCATTTGCAGACGCCATAAAATTATTTATAAAAGAACCAATACGCCCTTTAACAACCTCTATATCCTTATTTATTATTGCACCTACCCTATCACTCACACTAGCATTAAGTCTATGAGTTCCCCTACCAATACCACACCCATTAATTAATTTAAACCTAGGGATTTTATTTATTTTAGCAACATCTAGCCTATCAGTTTACTCCATTCTATGATCAGGATGAGCCTCAAACTCCAAATACTCACTATTCGGAGCTTTACGAGCCGTAGCCCAAACAATTTCATATGAAGTAACCATAGCTATTATCCTTTTATCAGTTCTATTAATAAATGGATCCTACTCTCTACAAACACTTATTACAACCCAAGAACACATATGATTACTTCTGCCAGCCTGACCCATAGCCATAATATGATTTATCTCAACCCTAGCAGAAACAAACCGGGCCCCCTTCGACCTGACAGAAGGAGAATCAGAATTAGTATCAGGGTTTAACGTAGAATACGCAGCCGGCCCATTCGCGTTATTCTTTATAGCAGAGTACACTAACATTATTCTAATAAACGCCCTAACAACTATTATCTTCCTAGGACCCCTATACTATATCAATTTACCAGAACTCTACTCAACTAACTTCATAATAGAAGCTCTACTACTATCATCAACATTCCTATGGATCCGAGCATCTTATCCACGCTTCCGTTACGATCAACTTATACATCTTCTATGAAAAAACTTTCTACCCCTAACACTAGCATTATGTATGTGACATATTTCTTTACCAATTTTTACAGCGGGAGTACCACCATACATATAG Sure, I checked it. There is some stop codon in it. 
If I replace it with non-stop codon, it works. For example, >ENST00000361390 ATACCCATGGCCAACCTCCTACTCCTCATTGTACCCATTCcaaTCGCAATGGCATTCCcaaTGCTTACCGAACGAAAAATTCcaaGCTATATACAACTACGCAAAGGCCCCAACGTTGcaaGCCCCTACGGGCTACTACAACCCTTCGCcaaCGCCAcaaAACTCTTCACCAAAGAGCCCCcaaAACCCGCCACATCTACCATCACCCTCTACATCACCGCCCCGACCTcaaCTCTCACCATCGCTCTTCTACTAcaaACCCCCCTCCCCATACCCAACCCCCTGGTCAACCTCAACCcaaGCCTCCTATTTATTCcaaCCACCTCcaaCCcaaCCGTTTACTCAATCCTCcaaTCAGGGcaaGCATCAAACTCAAACTACGCCCcaaTCGGCGCACTGCGAGCAGcaaCCCAAACAATCTCATAcaaAGTCACCCcaaCCATCATTCTACTATCAACATTACcaacaaGTGGCTCCTTcaaCCTCTCCACCCTTATCACAACACAAGAACACCTCcaaTTACTCCTGCCATCAcaaCCCTTGGCCAcaaTAcaaTTTATCTCCACACcaaCAGAGACCAACCGAACCCCCTTCGACCTTGCCGAAGGGGAGTCCGAACcaaTCTCAGGCTTCAACATCGAATACGCCGCAGGCCCCTTCGCCCTATTCTTCAcaaCCGAATACACAAACATTATTAcaacaaACACCCTCACCACTACAATCTTCCcaaGAACAACATAcaaCGCACTCTCCCCcaaACTCTACACAACATATTTTGTCACCAAGACCCTACTTCcaaCCTCCCTGTTCTTAcaaATTCGAACAGCATACCCCCGATTCCGCTACGACCAACTCATACACCTCCTAcaaAAAAACTTCCTACCACTCACCCcaaCATTACTTATAcaaTATGTCTCCATACCCATTACAATCTCCAGCATTCCCCCTCAAACCcaa >ENSMUST00000082392 
GTGTTCTTTATcaaTATCCcaaCACTCCTCGTCCCCATTCcaaTCGCCAcaaCCTTCCcaaCATcaacaaAACGCAAAATCTcaaGGTACATACAACTACGAAAAGGCCCcaaCATTGTTGGTCCATACGGCATTTTACAACCATTTGCAGACGCCAcaaAATTATTTAcaaAAGAACCAATACGCCCTTcaaCAACCTCTATATCCTTATTTATTATTGCACCTACCCTATCACTCACACcaaCATcaaGTCTAcaaGTTCCCCTACCAATACCACACCCATcaaTcaaTTcaaACCcaaGGATTTTATTTATTTcaaCAACATCcaaCCTATCAGTTTACTCCATTCTAcaaTCAGGAcaaGCCTCAAACTCCAAATACTCACTATTCGGAGCTTTACGAGCCGcaaCCCAAACAATTTCATAcaaAGcaaCCAcaaCTATTATCCTTTTATCAGTTCTATcaacaaATGGATCCTACTCTCTACAAACACTTATTACAACCCAAGAACACATAcaaTTACTTCTGCCAGCCcaaCCCAcaaCCAcaaTAcaaTTTATCTCAACCCcaaCAGAAACAAACCGGGCCCCCTTCGACCcaaCAGAAGGAGAATCAGAATcaaTATCAGGGTTcaaCGcaaAATACGCAGCCGGCCCATTCGCGTTATTCTTTAcaaCAGAGTACACcaaCATTATTCcaacaaACGCCCcaaCAACTATTATCTTCCcaaGACCCCTATACTATATCAATTTACCAGAACTCTACTCAACcaaCTTCAcaacaaAAGCTCTACTACTATCATCAACATTCCTATGGATCCGAGCATCTTATCCACGCTTCCGTTACGATCAACTTATACATCTTCTAcaaAAAAACTTTCTACCCCcaaCACcaaCATTATGTATGcaaCATATTTCTTTACCAATTTTTACAGCGGGAGTACCACCATACATAcaa But my question is: it does not occur in the codon position (say, the third codon's position is not a times of 3). Why it effect the result? And also there is code to filter out the stop codon in the sample code (as the following shown) /////////////////////////////// if( $pseq =~ /\*/ && $pseq !~ /\*$/ ) { warn("provided a CDS sequence with a stop codon, PAML will choke!"); exit(0); } # Tcoffee can't handle '*' even if it is trailing $pseq =~ s/\*//g; ///////////////////////////// So, when translate back from aa_aln to dna_aln, there should be no stop codon included. SO, why it can not pass? Thanks for answer! 
P.S: attach my code here:

/////////////////////////////////////////////////////////
#!/usr/bin/perl -w
use strict;
use Bio::Tools::Run::Phylo::PAML::Codeml;
use Bio::Tools::Run::Alignment::Clustalw;

# for projecting alignments from protein to R/DNA space
use Bio::Align::Utilities qw(aa_to_dna_aln);

# for input of the sequence data
use Bio::SeqIO;
use Bio::AlignIO;

my $aln_factory = Bio::Tools::Run::Alignment::Clustalw->new('quiet'=>1);
my $seqdata = shift || 'test.fa';
my $seqio = new Bio::SeqIO(-file => $seqdata, -format => 'fasta');
my %seqs;
my @prots;

# process each sequence
while ( my $seq = $seqio->next_seq ) {
    $seqs{$seq->display_id} = $seq;
    # translate them into protein
    my $protein = $seq->translate();
    my $pseq = $protein->seq();
    if( $pseq =~ /\*/ && $pseq !~ /\*$/ ) {
        warn("provided a CDS sequence with a stop codon, PAML will choke!");
        exit(0);
    }
    # Tcoffee can't handle '*' even if it is trailing
    $pseq =~ s/\*//g;
    $protein->seq($pseq);
    push @prots, $protein;
}

if( @prots < 2 ) {
    warn("Need at least 2 CDS sequences to proceed");
    exit(0);
}

# open(OUT, ">align_output.txt") || die("cannot open output align_output for writing");

# Align the sequences with clustalw
my $aa_aln = $aln_factory->align(\@prots);

# project the protein alignment back to CDS coordinates
my $dna_aln = aa_to_dna_aln($aa_aln, \%seqs);
my @each = $dna_aln->each_seq();

my $kaks_factory = Bio::Tools::Run::Phylo::PAML::Codeml->new
    ( -params => { 'runmode' => -2,
                   'seqtype' => 1,
                 },
      -save_tempfiles => 1,
      -verbose => 1);

# set the alignment object
$kaks_factory->alignment($dna_aln);

# run the KaKs analysis
my ($rc,$parser) = $kaks_factory->run();
my $result = $parser->next_result;
my $MLmatrix = $result->get_MLmatrix();
my @otus = $result->get_seqs();

# this gives us a mapping from the PAML order of sequences back to
# the input order (since names get truncated)
my @pos = map {
    my $c = 1;
    foreach my $s ( @each ) {
        last if( $s->display_id eq $_->display_id );
        $c++;
    }
    $c;
} @otus;

print join("\t", qw(SEQ1 SEQ2 Ka Ks Ka/Ks PROT_PERCENTID CDNA_PERCENTID)),"\n";
for( my $i = 0; $i < (scalar @otus -1) ; $i++) {
    for( my $j = $i+1; $j < (scalar @otus); $j++ ) {
        my $sub_aa_aln = $aa_aln->select_noncont($pos[$i],$pos[$j]);
        my $sub_dna_aln = $dna_aln->select_noncont($pos[$i],$pos[$j]);
        print join("\t",
                   $otus[$i]->display_id,
                   $otus[$j]->display_id,
                   $MLmatrix->[$i]->[$j]->{'dN'},
                   $MLmatrix->[$i]->[$j]->{'dS'},
                   $MLmatrix->[$i]->[$j]->{'omega'},
                   sprintf("%.2f",$sub_aa_aln->percentage_identity),
                   sprintf("%.2f",$sub_dna_aln->percentage_identity),
                  ), "\n";
    }
}

--
Xianjun Dong
PhD Student
Computational Biology Unit
Bergen Center for Computational Science
University of Bergen
Høyteknologisenteret, Thormøhlensgate 55
N-5008 Bergen, Norway.
Webpage: http://www.ii.uib.no/~xianjund/
MSN: sterding at hotmail.com
Phone No: +47 - 55584354 (office) +47 - 47361688 (mobile)
Fax No: +47 - 55584295

From golharam at umdnj.edu Mon Jul 31 15:20:33 2006
From: golharam at umdnj.edu (Ryan Golhar)
Date: Mon, 31 Jul 2006 11:20:33 -0400
Subject: [Bioperl-l] PAML + Codeml problem..
In-Reply-To: <1154346960.6517.19.camel@lauvtre.ii.uib.no>
Message-ID: <027201c6b4b4$ddc201f0$2f01a8c0@GOLHARMOBILE1>

Hi Xianjun,

I just did some work on this module including the example.

>> it does not occur in the codon position
>>(say, the third codon's position is not a times of 3).
>>Why it effect the result?

If I'm interpreting your question correctly, the stop codons in your sequence occur in-frame. This is why it is choking.

>>So, when translate back from aa_aln to dna_aln, there should be no stop codon included. SO, why it can not pass?

The Ka and Ks statistics are not calculated based on the protein sequence; they are calculated based on the DNA sequence. The protein sequence is used to provide an alignment for the codons of the DNA sequence. Checking the protein sequence for * is an easier way to identify in-frame stop codons than scanning the DNA sequence.
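To make the in-frame distinction concrete, here is a small plain-Perl sketch (no BioPerl calls). The function name in_frame_stops and the toy sequences are made up for illustration, and it assumes the standard genetic code's stop codons (TAA, TAG, TGA); other codon tables use a different set.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the 0-based start offsets of stop codons that fall in the given
# reading frame (0, 1 or 2). Only whole codons are examined.
# Assumes the standard genetic code; hypothetical helper for illustration.
sub in_frame_stops {
    my ($dna, $frame) = @_;
    $frame ||= 0;
    my %is_stop = map { $_ => 1 } qw(TAA TAG TGA);
    my @hits;
    for (my $i = $frame; $i + 3 <= length $dna; $i += 3) {
        push @hits, $i if $is_stop{ uc substr($dna, $i, 3) };
    }
    return @hits;
}

# "TAA" occurs at offset 4 here, but it straddles two codons (CTA ATG),
# so in frame 0 it is not a stop codon.
my @none = in_frame_stops('ATGCTAATGGGG');
print "in-frame stops found: ", scalar(@none), "\n";

# Here TAA (offset 3) and TGA (offset 9) sit on codon boundaries.
my @two = in_frame_stops('ATGTAAGGGTGA');
print "in-frame stops at offsets: @two\n";
```

Replacing every TAA/TAG/TGA substring regardless of position, as in the edited test sequences, also removes the in-frame ones, which would be consistent with the run succeeding afterwards.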
The two checks for stop codons you mentioned are to check for stop codons within the sequence without worry for the last amino acid. The second part removes the * at the end of the sequence (not the middle).

If you want to remove the in-frame stop codons, you can, but do so before translating it to protein sequences.

Ryan

From nabil at broad.mit.edu Mon Jul 31 18:57:48 2006
From: nabil at broad.mit.edu (Nabil Hafez)
Date: Mon, 31 Jul 2006 14:57:48 -0400
Subject: [Bioperl-l] Strangeness in parsing blast file
In-Reply-To: <44CC3550.5070105@broad.mit.edu>
References: <44CC3550.5070105@broad.mit.edu>
Message-ID: <44CE52AC.4080108@broad.mit.edu>

I have figured out the problem - not a problem with Bioperl. In my create_sample_file() subroutine I defined

    $/ = '>'; #define fasta record input seperator

when it should have been this

    local $/ = "\n>";

the use of local made a big difference. Thanks to all for your help.

Nabil Hafez

Nabil Hafez wrote:
> Hi,
> I am having a somewhat similar problem to what was posted in
> http://bioperl.org/pipermail/bioperl-l/2006-May/021416.html
> however, I have read through all of that thread and I don't believe what
> I am experiencing is the exact same problem. I also realize that the
> Bioperl version 1.5.1 fixes a problem with blast parsing.
>
> My problem:
> My blastresults file parses fine and everything works swimmingly if
> I pass the blast output file by name such as
> $blast_result = 'test.blastout';
>
> however when I do
> $blast_result = &do_blast($sample_fasta);
>
> even though in both cases $blast_result evaluate to "test.blastout", the
> parsing doesn't work, more specifically
> it gets an undefined variable for $result in
> while( my $result = $report_obj->next_result ) {
>
> Sorry for the long email - any help would be appreciated,
> Thanks - Nabil
>
>
> The code...non relevant parts trimmed for size constraints....debugging
> from working and non-working versions after the code.
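The effect of the `local $/` fix reported in this message can be reproduced in isolation. The sketch below is not the script from the thread: the sub names and the in-memory "files" are made up for illustration. It shows that assigning to the global $/ changes how every later line-by-line read in the program behaves, which is presumably why a parser that reads its report line by line saw no usable lines, while localizing $/ confines the change to the sub.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy data standing in for a FASTA file and a line-oriented report.
my $fasta  = ">r1\nACGT\n>r2\nGGCC\n";
my $report = "line one\nline two\nline three\n";

# Read FASTA records with a custom record separator. Because $/ is
# localized, the old value is restored when the sub returns.
sub read_records_local {
    my ($text) = @_;
    local $/ = "\n>";
    open my $fh, '<', \$text or die "cannot open in-memory file: $!";
    my @recs = <$fh>;
    close $fh;
    return scalar @recs;
}

# Same read, but assigning to the global $/ (the pattern that caused
# the problem): the new separator stays in effect afterwards.
sub read_records_global {
    my ($text) = @_;
    $/ = '>';
    open my $fh, '<', \$text or die "cannot open in-memory file: $!";
    my @recs = <$fh>;
    close $fh;
    return scalar @recs;
}

# A later, unrelated reader that expects newline-terminated lines.
sub count_lines {
    my ($text) = @_;
    open my $fh, '<', \$text or die "cannot open in-memory file: $!";
    my @lines = <$fh>;
    close $fh;
    return scalar @lines;
}

read_records_local($fasta);
print "lines seen after local \$/:  ", count_lines($report), "\n";

read_records_global($fasta);
print "lines seen after global \$/: ", count_lines($report), "\n";

$/ = "\n";    # put the default back for anything that follows
```

After the localized read the report is still seen as three lines; after the global assignment the whole report comes back as a single record, because it contains no '>' to split on.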
>
> use strict;
> use Bio::SearchIO;
> use Getopt::Std;
> use List::Util qw(shuffle);
> use Benchmark;
>
> my ($inputfile, $samplefile, $blastfile, $blast_result, $blast_report,
>     $blast_verbose); #files generated
>
>
> #------------------#
> # Subroutine Calls #
> #------------------#
>
> my $test = &create_sample_file($inputfile); #inputfile being a fasta file containing nucleotide sequence
> $blast_result = &do_blast($test);
> ##$blast_result = 'test.blastout'; #when this is uncommented and replaces the previous two lines, with test.blastout being normal blast output - the script works fine.
> &parse_blast($blast_result);
>
>
> #######################
> # create_sample_file
> #
> # Input: Original Fasta File
> #
> # Output: Fasta file containing randomly sampled reads
> #
> #
> sub create_sample_file {
>     my $in = shift;
>     my $linecount = 0;
>     my @lines;
>
>     $samplefile = $in . "_sample";
>
>     #Determine total # of reads in input fasta
>     $totalreads = `$grep -c '>' $inputfile`;
>     $totalreads =~ s/\s+//;
>     chomp $totalreads;
>
>     if ($totalreads > 1000) { #sample if more than 1000 reads
>         $sample_reads = sprintf("%.0f", $totalreads * ($per_to_sample/100)); #number of reads to sample
>     }
>     else { #otherwise use all reads
>         $sample_reads = $totalreads;
>     }
>
>     $/ = '>'; #define fasta record input separator
>
>     open (IN, "<$in") or die "Cannot open $in $!\n";
>     open (OUT, ">$samplefile") or die "Cannot open $samplefile $!\n";
>
>
>     while (<IN>) { #read lines into an array
>         chomp;
>         push (@lines, $_);
>     }
>
>     @lines = shuffle(@lines); #shuffle array
>     foreach (@lines) {
>         print OUT ">$_" if $linecount <= $sample_reads; #output to file sampled number of reads
>         $linecount++;
>     }
>
>     close IN;
>     close OUT;
>
>     return $samplefile;
>
> }#end create_sample_file
>
>
> #######################
> # do_blast
> #
> # Input: Fasta File containing SCREENED sampled reads
> #
> # Output: Blast File
> #
> #
>
> sub do_blast {
>     my $bf = shift;
>     my $blastoutput = $bf .
".blastout"; > > print "Blasting against $db ...\n"; > > `blast/blast-2.2.10/bin/megablast -d /prodinfo/proddata_ntblastdb/nt > -e 1e-50 -p 99 -D 2 -i test -o test.blastout`; > > return $blastoutput; > > }#end do_blast > > > > ####################### > # parse_blast > # > # Input: Blast file > # > # Output: Creates hash containing best hit for each read > # > # > > sub parse_blast { > my $blastoutfile = shift; > > if (! -e $blastoutfile) { > die "$blastoutfile does not exist $!\n"; > } > > print "Parsing blast hits ...\n"; > > > my $report_obj = new Bio::SearchIO(-verbose => 1, > -format => 'blast', > -file => $blastoutfile); > > > die "no valid $report_obj" unless defined $report_obj; > > > while( my $result = $report_obj->next_result ) { > die "no valid $result" unless defined $result; > while( my $hit = $result->next_hit ) { > while( my $hsp = $hit->next_hsp ) { > my $name = $result->query_name; > my $hitDesc = $hit->description; > my $length = $hsp->length('total'); > my $per_id = sprintf("%.2f", $hsp->percent_identity); > my $eval = $hsp->evalue; > next if (defined $blast_results{$name} && > $blast_results{$name}->[0] > $length); #only keep best hit for any read > $blast_results{$name} = [$length, $per_id, $eval, $hitDesc]; > #store in hash of arrays > } > } > } > > } #end parse_blast > > > > > > Debug of NON-working blast-parse: > > main::(454/scripts/fasta_blasta_mb.pl:151): > 151: my $sample_fasta = &create_sample_file($inputfile); > DB<2> n > main::(454/scripts/fasta_blasta_mb.pl:152): > 152: $blast_result = &do_blast($sample_fasta); > DB<2> x $sample_fasta > 0 'G782.2005-08-16-16-48.fasta_sample' > DB<3> n > Blasting against NT ... > main::(454/scripts/fasta_blasta_mb.pl:154): > 154: &parse_blast($blast_result); > DB<3> s > main::parse_blast(454/scripts/fasta_blasta_mb.pl:293): > 293: my $blastoutfile = shift; > DB<3> s > main::parse_blast(454/scripts/fasta_blasta_mb.pl:295): > 295: if (! 
-e $blastoutfile) { > DB<3> x $blastoutfile > 0 'G782.2005-08-16-16-48.fasta_sample.blastout' > DB<4> s > main::parse_blast(454/scripts/fasta_blasta_mb.pl:299): > 299: print "Parsing blast hits ...\n"; > DB<4> s > Parsing blast hits ... > main::parse_blast(454/scripts/fasta_blasta_mb.pl:302): > 302: my $report_obj = new Bio::SearchIO(-verbose => 1, > 303: -format => 'blast', > 304: -file => > $blastoutfile);#or die "Could not open blast report $!"; > DB<4> s > Bio::SearchIO::new(/util/lib/perl5/site_perl/5.8.0/Bio/SearchIO.pm:129): > 129: my($caller, at args) = @_; > DB<4> r > scalar context return from Bio::SearchIO::new: '_file' => > 'G782.2005-08-16-16-48.fasta_sample.blastout' > '_filehandle' => GLOB(0x8cef40c) > -> *Symbol::GEN1 > FileHandle({*Symbol::GEN1}) => fileno(3) > '_flush_on_write' => 1 > '_handler' => > Bio::SearchIO::IteratedSearchResultEventBuilder=HASH(0x9505724) > '_factories' => HASH(0x95054c0) > 'hit' => Bio::Factory::ObjectFactory=HASH(0x95017b8) > '_loaded_types' => HASH(0x9506c0c) > 'Bio::Search::Hit::BlastHit' => 1 > '_root_verbose' => 0 > 'interface' => 'Bio::Search::Hit::HitI' > 'type' => 'Bio::Search::Hit::BlastHit' > 'hsp' => Bio::Factory::ObjectFactory=HASH(0x9500e10) > '_loaded_types' => HASH(0x9506c18) > 'Bio::Search::HSP::GenericHSP' => 1 > '_root_verbose' => 0 > 'interface' => 'Bio::Search::HSP::HSPI' > 'type' => 'Bio::Search::HSP::GenericHSP' > 'iteration' => Bio::Factory::ObjectFactory=HASH(0x9506c60) > '_loaded_types' => HASH(0x9506af8) > 'Bio::Search::Iteration::GenericIteration' => 1 > '_root_verbose' => 0 > 'interface' => 'Bio::Search::Iteration::IterationI' > 'type' => 'Bio::Search::Iteration::GenericIteration' > 'result' => Bio::Factory::ObjectFactory=HASH(0x9504c80) > '_loaded_types' => HASH(0x9501f74) > 'Bio::Search::Result::BlastResult' => 1 > '_root_verbose' => 0 > 'interface' => 'Bio::Search::Result::ResultI' > 'type' => 'Bio::Search::Result::BlastResult' > '_inclusion_threshold' => 0.001 > '_root_verbose' => 1 > 
'_handler_cache' => > Bio::SearchIO::IteratedSearchResultEventBuilder=HASH(0x9505724) > -> REUSED_ADDRESS > '_notfirsttime' => 0 > '_reporttype' => '' > '_root_cleanup_methods' => ARRAY(0x8cde434) > 0 CODE(0x82a9aec) > -> &Bio::Root::IO::_io_cleanup in > /util/lib/perl5/site_perl/5.8.0/Bio/Root/IO.pm:544-570 > 1 CODE(0x82a9aec) > -> REUSED_ADDRESS > '_root_verbose' => 1 > main::parse_blast(454/scripts/fasta_blasta_mb.pl:307): > 307: die "no valid $report_obj" unless defined $report_obj; > DB<4> s > main::parse_blast(454/scripts/fasta_blasta_mb.pl:310): > 310: while( my $result = $report_obj->next_result ) { > DB<4> s > Bio::SearchIO::blast::next_result(/util/lib/perl5/site_perl/5.8.0/Bio/SearchIO/blast.pm:389): > 389: my ($self) = @_; > DB<4> r > scalar context return from Bio::SearchIO::blast::next_result: undef > Bio::SearchIO::DESTROY(/util/lib/perl5/site_perl/5.8.0/Bio/SearchIO.pm:438): > 438: my $self = shift; > DB<4> r > scalar context return from Bio::SearchIO::DESTROY: '' > Bio::Root::Root::DESTROY(/util/lib/perl5/site_perl/5.8.0/Bio/Root/Root.pm:404): > 404: my $self = shift; > DB<4> r > scalar context return from Bio::Root::Root::DESTROY: undef > Bio::Root::Root::DESTROY(/util/lib/perl5/site_perl/5.8.0/Bio/Root/Root.pm:404): > 404: my $self = shift; > DB<4> r > scalar context return from Bio::Root::Root::DESTROY: undef > Bio::Root::Root::DESTROY(/util/lib/perl5/site_perl/5.8.0/Bio/Root/Root.pm:404): > 404: my $self = shift; > DB<4> r > scalar context return from Bio::Root::Root::DESTROY: undef > Bio::Root::Root::DESTROY(/util/lib/perl5/site_perl/5.8.0/Bio/Root/Root.pm:404): > 404: my $self = shift; > DB<4> r > scalar context return from Bio::Root::Root::DESTROY: undef > Bio::Root::Root::DESTROY(/util/lib/perl5/site_perl/5.8.0/Bio/Root/Root.pm:404): > 404: my $self = shift; > DB<4> r > scalar context return from Bio::Root::Root::DESTROY: undef > main::(454/scripts/fasta_blasta_mb.pl:155): > 155: &output_results(); > DB<4> x $result > 0 undef > > > > Debug 
of WORKING blast-parse: > Parsing blast hits ... > main::parse_blast(454/scripts/fasta_blasta_mb.pl:302): > 302: my $report_obj = new Bio::SearchIO(-verbose => 1, > 303: -format => 'blast', > 304: -file => > $blastoutfile);#or die "Could not open blast report $!"; > DB<3> s > Bio::SearchIO::new(/util/lib/perl5/site_perl/5.8.0/Bio/SearchIO.pm:129): > 129: my($caller, at args) = @_; > DB<3> r > scalar context return from Bio::SearchIO::new: '_file' => > 'G782.2005-08-16-16-48.fasta_sample.blastout' > '_filehandle' => GLOB(0x8763100) > -> *Symbol::GEN1 > FileHandle({*Symbol::GEN1}) => fileno(3) > '_flush_on_write' => 1 > '_handler' => > Bio::SearchIO::IteratedSearchResultEventBuilder=HASH(0x8ab3be4) > '_factories' => HASH(0x8ab1594) > 'hit' => Bio::Factory::ObjectFactory=HASH(0x8a7b7c0) > '_loaded_types' => HASH(0x8abee10) > 'Bio::Search::Hit::BlastHit' => 1 > '_root_verbose' => 0 > 'interface' => 'Bio::Search::Hit::HitI' > 'type' => 'Bio::Search::Hit::BlastHit' > 'hsp' => Bio::Factory::ObjectFactory=HASH(0x8a87200) > '_loaded_types' => HASH(0x8abee1c) > 'Bio::Search::HSP::GenericHSP' => 1 > '_root_verbose' => 0 > 'interface' => 'Bio::Search::HSP::HSPI' > 'type' => 'Bio::Search::HSP::GenericHSP' > 'iteration' => Bio::Factory::ObjectFactory=HASH(0x8abee64) > '_loaded_types' => HASH(0x8abecfc) > 'Bio::Search::Iteration::GenericIteration' => 1 > '_root_verbose' => 0 > 'interface' => 'Bio::Search::Iteration::IterationI' > 'type' => 'Bio::Search::Iteration::GenericIteration' > 'result' => Bio::Factory::ObjectFactory=HASH(0x8a81a84) > '_loaded_types' => HASH(0x8a96ce8) > 'Bio::Search::Result::BlastResult' => 1 > '_root_verbose' => 0 > 'interface' => 'Bio::Search::Result::ResultI' > 'type' => 'Bio::Search::Result::BlastResult' > '_inclusion_threshold' => 0.001 > '_root_verbose' => 1 > '_handler_cache' => > Bio::SearchIO::IteratedSearchResultEventBuilder=HASH(0x8ab3be4) > -> REUSED_ADDRESS > '_notfirsttime' => 0 > '_reporttype' => '' > '_root_cleanup_methods' => 
ARRAY(0x8762efc) > 0 CODE(0x82a9aec) > -> &Bio::Root::IO::_io_cleanup in > /util/lib/perl5/site_perl/5.8.0/Bio/Root/IO.pm:544-570 > 1 CODE(0x82a9aec) > -> REUSED_ADDRESS > '_root_verbose' => 1 > main::parse_blast(454/scripts/fasta_blasta_mb.pl:307): > 307: die "no valid $report_obj" unless defined $report_obj; > DB<3> s > main::parse_blast(454/scripts/fasta_blasta_mb.pl:310): > 310: while( my $result = $report_obj->next_result ) { > DB<3> s > Bio::SearchIO::blast::next_result(/util/lib/perl5/site_perl/5.8.0/Bio/SearchIO/blast.pm:389): > 389: my ($self) = @_; > DB<3> r > blast.pm: unrecognized line Reference: Zheng Zhang, Scott Schwartz, > Lukas Wagner, and Webb Miller (2000), > blast.pm: unrecognized line "A greedy algorithm for aligning DNA > sequences", > blast.pm: unrecognized line J Comput Biol 2000; 7(1-2):203-14. > blast.pm: unrecognized > line > Score E > Got NCBI HSP score=354, evalue 0.0 > scalar context return from Bio::SearchIO::blast::next_result: > '_algorithm' => 'MEGABLAST' > '_algorithm_version' => '2.2.10 [Oct-19-2004]' > '_dbentries' => 4249067 > '_dbletters' => 17735149364 > '_dbname' => 'All GenBank+EMBL+DDBJ+PDB sequences (but no EST, > STS,GSS,environmental samples or phase 0, 1 or 2 HTGS sequences) ' > '_hitindex' => 0 > '_hits' => ARRAY(0x8b2acd0) > empty array > '_inclusion_threshold' => 0.001 > '_iteration_count' => 1 > '_iteration_index' => 0 > '_iterations' => ARRAY(0x8b2ac4c) > 0 Bio::Search::Iteration::GenericIteration=HASH(0x8b1cacc) > '_newhits_below_threshold' => ARRAY(0x8b1ca84) > 0 Bio::Search::Hit::BlastHit=HASH(0x8b1cf64) > '_accession' => 'AE004091' > '_algorithm' => 'MEGABLAST' > '_description' => 'Pseudomonas aeruginosa PAO1, complete genome' > '_hsps' => ARRAY(0x8b1ceb0) > 0 Bio::Search::HSP::GenericHSP=HASH(0x8b2098c) > '_algorithm' => 'MEGABLAST' > '_frac_conserved' => HASH(0x8b266a0) > 'hit' => 0.991803278688525 > 'query' => 0.991803278688525 > 'total' => 0.991803278688525 > '_frac_identical' => HASH(0x8b2658c) > 'hit' => 
0.991803278688525 > 'query' => 0.991803278688525 > 'total' => 0.991803278688525 > '_gaps' => HASH(0x8b24d94) > 'hit' => 0 > 'query' => 0 > 'total' => 0 > '_gsf_tag_hash' => HASH(0x8b20998) > empty hash > '_hit_string' => > 'cctgacctccgctcaactgcgcaaatacgccagcgccggtcggccgttccccgaagggcgcctgctggccgcctcctgccacgacgcggaggaactggccctggctgcctcgatgggagtggagttcgtcaccctttcgccggtacagccgaccgagagccatcccggcgagccggcgctgggttgggacaaggccgccgaactgatcgccggcttcaaccagccggtctacctgctgggtggcctcggtccgcagcaagccgagcaggcttgggagcatggagcccagggcgtggcgggtatccgtgcgttctggccgggcggcctttgacggtggaatgaagaaaaaaggaggcttcggcctcc' > '_homology_string' => > '|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| > ||||||||||||||||||||||||||||||||||||||||| > ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| > |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||' > etc...... > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l